Conversation

drivanov (Contributor) commented Oct 3, 2025

The test_train test previously asserted that the "Sanity Checking" message appeared in stdout. This was brittle: in multi-GPU/DistributedDataParallel runs, only rank 0 prints the message, so the test failed on the other ranks.

This PR updates the test to:

  • Remove the fragile stdout assertion.
  • Assert on trainer state (not trainer.sanity_checking, trainer.current_epoch >= 0).
  • Use LoggerCallback to verify that both training and validation ran.

This makes the test deterministic and robust across single-GPU, multi-GPU, and CI environments.
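
A minimal sketch of the reworked test, assuming a generic PyTorch Lightning setup: the model and datamodule arguments are placeholders for the real fixtures, and the RanTracker callback below is an illustrative stand-in for the LoggerCallback-based check rather than its actual API.

import pytorch_lightning as pl


class RanTracker(pl.Callback):
    """Records whether training and (non-sanity) validation actually ran."""

    def __init__(self):
        self.train_ran = False
        self.val_ran = False

    def on_train_epoch_end(self, trainer, pl_module):
        self.train_ran = True

    def on_validation_epoch_end(self, trainer, pl_module):
        if not trainer.sanity_checking:  # ignore the sanity-check pass
            self.val_ran = True


def run_test_train(model, datamodule):
    tracker = RanTracker()
    trainer = pl.Trainer(max_epochs=1, callbacks=[tracker])
    trainer.fit(model, datamodule=datamodule)

    # State-based assertions instead of parsing rank-0-only stdout:
    assert not trainer.sanity_checking
    assert trainer.current_epoch >= 0
    assert tracker.train_ran and tracker.val_ran

The same state-based checks hold on every rank, which is what makes this approach robust under DistributedDataParallel.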

PassingLog.TXT

torch.distributed.destroy_process_group()


ggg = 0
Contributor

What's this for? Can we remove it? Otherwise LGTM.

drivanov (Contributor Author)

Done

codecov bot commented Oct 3, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 85.09%. Comparing base (c211214) to head (38cfc51).
⚠️ Report is 119 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master   #10478      +/-   ##
==========================================
- Coverage   86.11%   85.09%   -1.02%     
==========================================
  Files         496      510      +14     
  Lines       33655    35962    +2307     
==========================================
+ Hits        28981    30602    +1621     
- Misses       4674     5360     +686     

☔ View full report in Codecov by Sentry.

puririshi98 (Contributor) left a comment

LGTM, @akihironitta to merge

out, err = capfd.readouterr()
assert 'Sanity Checking' in out
assert 'Epoch 0:' in out
assert not trainer.sanity_checking
Member

If I understand this diff correctly, this assert not trainer.sanity_checking tests nothing. Previously, the test checked that the code ran sanity checking, but with this PR it no longer does: sanity_checking is only True while the trainer is running the few validation steps at the start of fit.
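
A small sketch illustrating that lifecycle, with model and dm as placeholder objects: the callback records the value of the flag each time a validation run starts.

import pytorch_lightning as pl


class SanityFlagProbe(pl.Callback):
    """Records trainer.sanity_checking each time a validation run starts."""

    def __init__(self):
        self.flag_values = []

    def on_validation_start(self, trainer, pl_module):
        self.flag_values.append(trainer.sanity_checking)


probe = SanityFlagProbe()
trainer = pl.Trainer(max_epochs=1, num_sanity_val_steps=2, callbacks=[probe])
# trainer.fit(model, datamodule=dm)   # model/dm are placeholders
# Expected: probe.flag_values == [True, False], i.e. True during the pre-fit
# sanity pass and False during the real validation run, while
# trainer.sanity_checking is False again by the time fit() returns.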

Member

Can we just remove this assert? Testing whether it performed sanity checking doesn't make much sense as a test anyway.

drivanov (Contributor Author) commented Oct 3, 2025

This line was suggested by ChatGPT. The following explanation is adapted from its reasoning:

     assert not trainer.sanity_checking

could fail, i.e., trainer.sanity_checking could still be True after trainer.fit() finishes.

What would cause it

If training never moved past the sanity check

  • Example: an error in the training loop caused Lightning to stop during sanity checking.
  • Then trainer.fit() would return early, and trainer.sanity_checking could remain True.

If sanity checking was the only phase run

  • If max_steps=0 or max_epochs=0, training effectively does nothing after the sanity check.
  • Depending on version, trainer.sanity_checking might stay True.

Misconfigured validation loop

  • If the validation loader is empty or raises errors, Lightning can end up in a state where sanity checking finishes improperly.
  • Edge cases in earlier Lightning versions sometimes left the flag not reset.

Bug in Lightning

  • If the sanity_checking flag isn’t toggled back (it’s set True at the start of sanity check, False afterward).
  • Rare, but regressions could happen across versions.

Normal expectation

  • After a successful trainer.fit(), training always moves past the sanity check, so trainer.sanity_checking must be False.
  • Therefore, this assertion is a robust guarantee that training ran at least one epoch/step beyond the pre-training sanity check.

So in short:

It would fail only if trainer.fit() exited abnormally (an error, a misconfiguration, or zero training steps) or if Lightning had a bug in resetting the flag.

As we all know, ChatGPT isn't always accurate. Unfortunately, in this case, I am unable to assess the accuracy of its claims. Please advise.
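
One low-effort way to probe these claims empirically, rather than trusting them, is sketched below; the model and datamodule are placeholders, and the outcome with max_epochs=0 is deliberately left open as a question rather than asserted.

import pytorch_lightning as pl


def probe_flag(model, datamodule, **trainer_kwargs):
    """Run fit() and report whether sanity_checking is still set afterwards."""
    trainer = pl.Trainer(num_sanity_val_steps=2, **trainer_kwargs)
    trainer.fit(model, datamodule=datamodule)
    return trainer.sanity_checking


# flag_after_noop = probe_flag(model, dm, max_epochs=0)  # does 0 epochs leave it True?
# flag_after_fit = probe_flag(model, dm, max_epochs=1)   # expected: False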
