Pretrain, finetune ANY AI model of ANY size on 1 or 10,000+ GPUs with zero code changes.

fix(callbacks): Defer step/time-triggered ModelCheckpoint saves until validation metrics are available (#21106)

* fix(callbacks): defer step/time-triggered ModelCheckpoint saves until validation metrics are available

Root cause:
- With `every_n_train_steps` (or `train_time_interval`), a checkpoint could be saved at train batch end before validation had run, so the monitored validation metric was missing or stale and `best_model_score` was incorrect. (Refs #20919)
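
A toy timeline may make the root cause easier to see. This is a sketch in plain Python, not Lightning code: `run_epochs`, its parameters, and the `callback_metrics` dict are hypothetical stand-ins, and it only covers the case where the step trigger lands on the last train batch of an epoch.

```python
def run_epochs(val_losses, steps_per_epoch, save_every_n_steps, defer_until_validation):
    """Return (global_step, monitored value at save time) for each checkpoint save."""
    callback_metrics = {}  # hypothetical stand-in for trainer.callback_metrics
    saves = []
    global_step = 0
    for val_loss in val_losses:  # one entry per epoch
        pending = False
        for _ in range(steps_per_epoch):
            global_step += 1
            if global_step % save_every_n_steps == 0:
                if defer_until_validation:
                    # Remember the trigger and wait for fresh validation metrics.
                    pending = True
                else:
                    # Saving now reads whatever is in the dict: missing or stale.
                    saves.append((global_step, callback_metrics.get("val_loss")))
        # Validation runs only after the train batches of the epoch.
        callback_metrics["val_loss"] = val_loss
        if pending:
            saves.append((global_step, callback_metrics["val_loss"]))
    return saves


eager = run_epochs([0.9, 0.5], steps_per_epoch=2, save_every_n_steps=2,
                   defer_until_validation=False)
deferred = run_epochs([0.9, 0.5], steps_per_epoch=2, save_every_n_steps=2,
                      defer_until_validation=True)
print(eager)     # [(2, None), (4, 0.9)]
print(deferred)  # [(2, 0.9), (4, 0.5)]
```

In the eager run the first save sees no metric at all and the second sees the previous epoch's value; in the deferred run each save sees the metric from the validation pass that follows it.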

Fix:
- In `ModelCheckpoint.on_train_batch_end` (src/lightning/pytorch/callbacks/model_checkpoint.py):
  - Defer saves when the monitored key is missing from `trainer.callback_metrics`
  - If on the last train batch and not saving at train-epoch-end, defer only when validation will run next:
    - `trainer.enable_validation` is True
    - `trainer.num_val_batches` > 0
    - `trainer.check_val_every_n_epoch` schedule matches the upcoming epoch
- Perform deferred saves in `on_validation_end`, ensuring fresh validation metrics are used.
- Allow zero `timedelta` for `train_time_interval` and broadcast the time-trigger decision across ranks.
- Do not defer when monitoring a train metric or when no validation is scheduled.
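
The deferral rules above can be read as a single predicate. The sketch below is one possible reading, not the code in `model_checkpoint.py`: `TrainerState`, `should_defer`, and the `monitor_is_train_metric` flag are hypothetical, `num_val_batches` is simplified to a single integer, and `check_val_every_n_epoch` is assumed to be a positive integer.

```python
from dataclasses import dataclass, field


@dataclass
class TrainerState:
    """Hypothetical stand-in for the Trainer attributes named in the list above."""
    callback_metrics: dict = field(default_factory=dict)
    enable_validation: bool = True
    num_val_batches: int = 1          # simplified to one integer
    check_val_every_n_epoch: int = 1  # assumed to be a positive integer
    current_epoch: int = 0            # zero-based
    is_last_batch: bool = False


def should_defer(state, monitor, monitor_is_train_metric, save_on_train_epoch_end):
    """Return True if a step/time-triggered save should wait for validation."""
    # Nothing to wait for when no validation metric is being monitored.
    if monitor is None or monitor_is_train_metric:
        return False
    # Validation must actually be scheduled for the upcoming epoch boundary.
    validation_runs_next = (
        state.enable_validation
        and state.num_val_batches > 0
        and (state.current_epoch + 1) % state.check_val_every_n_epoch == 0
    )
    if not validation_runs_next:
        return False
    # Monitored key not logged yet: wait for validation to produce it.
    if monitor not in state.callback_metrics:
        return True
    # Key present but about to be refreshed: defer only on the last train batch.
    return state.is_last_batch and not save_on_train_epoch_end


state = TrainerState(callback_metrics={"val_loss": 0.9}, is_last_batch=True)
print(should_defer(state, "val_loss", False, False))  # True
```

Under this reading, a deferred save would then be carried out in `on_validation_end`, once the monitored key has been refreshed.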

Tests:
- Repro (previously failing, now passing):
  - [tests/tests_pytorch/callbacks/test_model_checkpoint_step_interval_val_metric.py]
- Additional validations:
  - [tests/tests_pytorch/callbacks/test_model_checkpoint_additional_cases.py]
  - [tests/tests_pytorch/callbacks/test_model_checkpoint_edge_cases.py]

Outcome:
- `best_model_score` matches the validation metric after the epoch.
- Step/time-interval checkpointing behaves correctly without premature or skipped saves.

* test: disable logger in model checkpoint tests to avoid side effects

* chlog

---------

Co-authored-by: Jirka B <[email protected]>
(cherry picked from commit b1cc925d941c0a829a804c8286fa06d7e4dcc2ce)
littlebullGit committed
ecf95235e52335a774f18a3207882692e54a4e64
Parent: 729a146
Committed by Luca Antiga <[email protected]> on 9/5/2025, 1:14:02 PM