Pretrain, finetune ANY AI model of ANY size on 1 or 10,000+ GPUs with zero code changes.
Add non-existing resume_from_checkpoint acceptance for auto-resubmit (#4402)
* Add empty resume_from_checkpoint acceptance #4366
* Fix general error catch with focused file check
* Add fsspec HTTP extras
  Add fsspec's HTTPFileSystem support through http extras. pl has supported remote http files (e.g. #2925), so this commit does not add new functionality.
* Fix potential too much logging in DDP
* Add PR changelog
* Add well-written argument explanation
  Co-authored-by: Adrian Wälchli <[email protected]>
* Fix DDP-compatible restore logging
  Notify from where the states are restored. This feature was temporarily removed as a result of PR review. After further review, it was re-added with DDP compatibility.
* Fix utility import paths
* Refactor load step commentaries
* Refactor hpc ckpt suffix acquisition
* Refactor restore/hpc_load match
* Refactor hpc load trial
* Refactor checkpoint dir check
* Refactor unneeded function nest
* Refactor nested If
* Refactor duplicated cache clear
* Refactor attempt flow with if/elif
* Fix pep8
* Refactor hook commentary
  Co-authored-by: chaton <[email protected]>
* Fix pep8
* Refactor hpc load checkpoint path acquisition
* Fix pep8
* Fix typo
  Co-authored-by: Adrian Wälchli <[email protected]>
* Fix typo
  Co-authored-by: Adrian Wälchli <[email protected]>
* Fix doc
  Co-authored-by: Adrian Wälchli <[email protected]>
* Refactor None Union type with Optional
* Fix build-doc CI failure debugged in #5329
* Fix fsspec import during build-doc #5329
* Fix test epoch
  Co-authored-by: Adrian Wälchli <[email protected]>
* Fix test with latest test models
* .

Co-authored-by: Adrian Wälchli <[email protected]>
Co-authored-by: chaton <[email protected]>
Co-authored-by: Jirka Borovec <[email protected]>
Co-authored-by: Sean Naren <[email protected]>
Co-authored-by: Roger Shieh <[email protected]>

(cherry picked from commit b0051e8c036fa3312ad4d37aa7141bea64ac6148)
tarepan committed
bb366232e753138f6c9c1553a7ec525df17f8665
Parent: cc607d5
Committed by Jirka Borovec <[email protected]>
on 1/6/2021, 11:55:38 AM
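
The behavior named in the commit title (accepting a `resume_from_checkpoint` path that does not exist yet, so an auto-resubmitted job can start from scratch on its first run) can be illustrated with a minimal sketch. This is not the code from the commit: the function name `resolve_resume_path` is hypothetical, and the existence check uses `os.path.isfile` from the standard library as an assumption, whereas the commit message refers to fsspec for file access.

```python
import os
import warnings
from typing import Optional


def resolve_resume_path(resume_from_checkpoint: Optional[str]) -> Optional[str]:
    """Return the checkpoint path to restore from, or None to start fresh.

    Hypothetical helper: a missing file produces a warning instead of an
    error, so the same job script works for the first run and for resubmits.
    """
    if resume_from_checkpoint is None:
        # No checkpoint was requested at all.
        return None
    if not os.path.isfile(resume_from_checkpoint):
        # Focused file check: only the "file is absent" case is tolerated.
        warnings.warn(
            f"Checkpoint not found at {resume_from_checkpoint!r}; "
            "starting training from scratch."
        )
        return None
    return resume_from_checkpoint
```

With a helper like this, a job script can always pass the same checkpoint path: the first submission sees no file and trains from scratch, and later resubmissions pick up the checkpoint once it has been written.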