fix(callbacks): Defer step/time-triggered ModelCheckpoint saves until validation metrics are available (#21106)
* fix(callbacks): defer step/time-triggered ModelCheckpoint saves until validation metrics are available
Root cause:
- With `every_n_train_steps` (or `train_time_interval`), checkpoints could save at train batch end before validation ran, so the monitored val metric was missing/stale and `best_model_score` was incorrect. (Refs #20919)
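The timing problem can be reproduced with a toy simulation. This is a sketch only, not Lightning code: `run_epochs`, its parameters, and the plain `callback_metrics` dict are hypothetical stand-ins, and it assumes validation runs once after the last train batch of each epoch.

```python
# Toy simulation of the save-before-validation race (hypothetical names, not Lightning APIs).
def run_epochs(val_losses, save_every_n_steps, steps_per_epoch, defer_until_validation):
    """Return the monitored score seen by each step-triggered checkpoint save."""
    callback_metrics = {}  # stands in for trainer.callback_metrics
    recorded = []          # score observed at each save
    pending = False        # True when a save was deferred to validation end
    step = 0
    for val_loss in val_losses:
        for _ in range(steps_per_epoch):
            step += 1
            if step % save_every_n_steps == 0:
                if defer_until_validation:
                    pending = True  # wait for fresh validation metrics
                else:
                    # Saving here reads a missing or stale value.
                    recorded.append(callback_metrics.get("val_loss"))
        # Validation runs after the last train batch of the epoch.
        callback_metrics["val_loss"] = val_loss
        if pending:
            recorded.append(callback_metrics["val_loss"])
            pending = False
    return recorded


premature = run_epochs([0.9, 0.5], save_every_n_steps=4, steps_per_epoch=4,
                       defer_until_validation=False)
deferred = run_epochs([0.9, 0.5], save_every_n_steps=4, steps_per_epoch=4,
                      defer_until_validation=True)
print(premature)  # first save sees no metric, second sees the previous epoch's value
print(deferred)   # each save sees the metric from its own epoch
```

Without deferral the first save has no `val_loss` at all and the second records the previous epoch's value, which is how `best_model_score` ends up wrong.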
Fix:
- In `src/lightning/pytorch/callbacks/model_checkpoint.py:ModelCheckpoint.on_train_batch_end`:
  - Defer saves when the monitored key is missing from `trainer.callback_metrics`.
  - If on the last train batch and not saving at train-epoch-end, defer only when validation will run next:
    - `trainer.enable_validation` is True
    - `trainer.num_val_batches` > 0
    - `trainer.check_val_every_n_epoch` schedule matches the upcoming epoch
- Perform deferred saves in `on_validation_end`, ensuring fresh validation metrics are used.
- Allow zero `timedelta` for `train_time_interval` and broadcast the time-trigger decision across ranks.
- Do not defer when monitoring a train metric or when no validation is scheduled.
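The deferral rules above can be sketched as a standalone decision function. This is an illustration under stated assumptions, not the code in `model_checkpoint.py`: `TrainerState`, `validation_runs_next`, `should_defer_save`, and the `monitor_is_train_metric` flag are hypothetical, `num_val_batches` is simplified to a single integer, and the schedule check assumes validation runs when `(current_epoch + 1)` is a multiple of `check_val_every_n_epoch`.

```python
from dataclasses import dataclass, field


@dataclass
class TrainerState:
    """Hypothetical stand-in for the Trainer attributes named in the commit message."""
    callback_metrics: dict = field(default_factory=dict)
    enable_validation: bool = True
    num_val_batches: int = 1          # simplified to one integer for this sketch
    check_val_every_n_epoch: int = 1
    current_epoch: int = 0            # zero-based
    is_last_batch: bool = False


def validation_runs_next(state: TrainerState) -> bool:
    """True when validation is enabled, non-empty, and scheduled for this epoch (assumed rule)."""
    scheduled = (state.current_epoch + 1) % state.check_val_every_n_epoch == 0
    return state.enable_validation and state.num_val_batches > 0 and scheduled


def should_defer_save(state: TrainerState, monitor, monitor_is_train_metric: bool,
                      save_on_train_epoch_end: bool) -> bool:
    """Decide whether a step/time-triggered save should wait for validation end."""
    # Nothing to wait for when no metric is monitored or the metric comes from training.
    if monitor is None or monitor_is_train_metric:
        return False
    # Never defer when validation is disabled or has no batches.
    if not (state.enable_validation and state.num_val_batches > 0):
        return False
    # The monitored key has not been logged yet: wait for validation.
    if monitor not in state.callback_metrics:
        return True
    # On the last train batch, defer only if validation is about to run.
    if state.is_last_batch and not save_on_train_epoch_end:
        return validation_runs_next(state)
    return False


missing_key = should_defer_save(TrainerState(), "val_loss", False, False)
train_metric = should_defer_save(TrainerState(), "train_loss", True, False)
last_batch = should_defer_save(
    TrainerState(callback_metrics={"val_loss": 0.9}, is_last_batch=True),
    "val_loss", False, False)
off_schedule = should_defer_save(
    TrainerState(callback_metrics={"val_loss": 0.9}, is_last_batch=True,
                 check_val_every_n_epoch=2),
    "val_loss", False, False)
no_validation = should_defer_save(TrainerState(num_val_batches=0), "val_loss", False, False)
```

A deferred save would then be performed from the validation-end hook, once the fresh metric is in `callback_metrics`.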
Tests:
- Repro (previously failing, now passing):
  - `tests/tests_pytorch/callbacks/test_model_checkpoint_step_interval_val_metric.py`
- Additional validations:
  - `tests/tests_pytorch/callbacks/test_model_checkpoint_additional_cases.py`
  - `tests/tests_pytorch/callbacks/test_model_checkpoint_edge_cases.py`
Outcome:
- `best_model_score` matches the validation metric after the epoch.
- Step/time-interval checkpointing behaves correctly without premature or skipped saves.
* test: disable logger in model checkpoint tests to avoid side effects
* chlog
---------
Co-authored-by: Jirka B <[email protected]>
(cherry picked from commit b1cc925d941c0a829a804c8286fa06d7e4dcc2ce)
littlebullGit committed
ecf95235e52335a774f18a3207882692e54a4e64
Parent: 729a146
Committed by Luca Antiga <[email protected]>
on 9/5/2025, 1:14:02 PM