Pretrain, finetune ANY AI model of ANY size on 1 or 10,000+ GPUs with zero code changes.

Add non-existing resume_from_checkpoint acceptance for auto-resubmit (#4402)

* Add empty resume_from_checkpoint acceptance #4366

* Fix general error catch with focused file check

* Add fsspec HTTP extras

Add support for fsspec's HTTPFileSystem through the http extras.
pl already supports remote HTTP files (e.g. #2925),
so this commit does not add new functionality.

* Fix potential too much logging in DDP

* Add PR changelog

* Add well-written argument explanation

Co-authored-by: Adrian Wälchli <[email protected]>

* Fix DDP-compatible restore logging

Log where the states are restored from.
This feature was temporarily removed as a result of PR review.
After further review, it was added back with DDP compatibility.

* Fix utility import paths

* Refactor load step commentaries

* Refactor hpc ckpt suffix acquisition

* Refactor restore/hpc_load match

* Refactor hpc load trial

* Refactor checkpoint dir check

* Refactor unneeded function nest

* Refactor nested If

* Refactor duplicated cache clear

* Refactor attempt flow with if/elif

* Fix pep8

* Refactor hook commentary

Co-authored-by: chaton <[email protected]>

* Fix pep8

* Refactor hpc load checkpoint path acquisition

* Fix pep8

* Fix typo

Co-authored-by: Adrian Wälchli <[email protected]>

* Fix typo

Co-authored-by: Adrian Wälchli <[email protected]>

* Fix doc

Co-authored-by: Adrian Wälchli <[email protected]>

* Refactor None Union type with Optional

* Fix build-doc CI failure debugged in #5329

* Fix fsspec import during build-doc #5329

* Fix test epoch

Co-authored-by: Adrian Wälchli <[email protected]>

* Fix test with latest test models

* .

Co-authored-by: Adrian Wälchli <[email protected]>
Co-authored-by: chaton <[email protected]>
Co-authored-by: Jirka Borovec <[email protected]>
Co-authored-by: Sean Naren <[email protected]>
Co-authored-by: Roger Shieh <[email protected]>

(cherry picked from commit b0051e8c036fa3312ad4d37aa7141bea64ac6148)
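
The behavior described in the title (accepting a resume_from_checkpoint path that does not exist yet, so an auto-resubmitted job can start from scratch on its first run) together with the "focused file check" can be sketched roughly as below. This is an illustration only, not the code from this commit: `resolve_resume_path` is a hypothetical helper name, and the sketch uses the standard library's `os.path.isfile` for local paths, whereas the commit refers to fsspec for file access.

```python
import os
import tempfile
from typing import Optional


def resolve_resume_path(resume_from_checkpoint: Optional[str]) -> Optional[str]:
    """Return the checkpoint path to restore from, or None to start fresh.

    Hypothetical helper for illustration; not part of the library's API.
    """
    if resume_from_checkpoint is None:
        return None
    # Focused existence check instead of a broad try/except around loading:
    # a missing file means "first submission", not an error.
    if not os.path.isfile(resume_from_checkpoint):
        return None
    return resume_from_checkpoint


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        ckpt = os.path.join(tmp, "last.ckpt")
        # No file yet: the job starts from scratch.
        print(resolve_resume_path(ckpt))
        # Simulate a checkpoint written by a previous run.
        open(ckpt, "wb").close()
        # The file exists now: the resubmitted job resumes from it.
        print(resolve_resume_path(ckpt) == ckpt)
```

The point of checking for the file explicitly is that other failures, such as a corrupt checkpoint, still surface as errors instead of being silently treated as "nothing to resume".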
tarepan committed
bb366232e753138f6c9c1553a7ec525df17f8665
Parent: cc607d5
Committed by Jirka Borovec <[email protected]> on 1/6/2021, 11:55:38 AM