Fix test_train to not rely on 'Sanity Checking' stdout in multi-GPU runs #10478
base: master
Conversation
for more information, see https://pre-commit.ci
test/graphgym/test_graphgym.py (Outdated)
torch.distributed.destroy_process_group()
ggg = 0
What's this for? Can we remove it? Otherwise LGTM.
Done
Codecov Report: ✅ All modified and coverable lines are covered by tests.

@@           Coverage Diff            @@
##           master   #10478      +/-  ##
==========================================
- Coverage   86.11%   85.09%    -1.02%
==========================================
  Files         496      510       +14
  Lines       33655    35962     +2307
==========================================
+ Hits        28981    30602     +1621
- Misses       4674     5360      +686
for more information, see https://pre-commit.ci
…ometric into sanity_check
LGTM, @akihironitta to merge
out, err = capfd.readouterr()
assert 'Sanity Checking' in out
assert 'Epoch 0:' in out
assert not trainer.sanity_checking
If I understand this diff correctly, this `assert not trainer.sanity_checking` tests nothing. Previously, it tested that the code ran sanity checking, but with this PR, it doesn't. `sanity_checking` is only `True` when the trainer is running a few validation steps at the start of fit.
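For context, here is a minimal sketch (not part of this PR; the model, dataset, and callback below are made up for illustration) of how the flag behaves in a plain Lightning run: it is True only during the pre-fit sanity pass, and False during real validation epochs and after fit() returns.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl


class FlagRecorder(pl.Callback):
    """Records trainer.sanity_checking each time a validation phase starts."""
    def __init__(self):
        self.flags = []

    def on_validation_start(self, trainer, pl_module):
        self.flags.append(trainer.sanity_checking)


class TinyModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(4, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.mse_loss(self.layer(x), y)

    def validation_step(self, batch, batch_idx):
        x, y = batch
        self.log("val_loss", torch.nn.functional.mse_loss(self.layer(x), y))

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)


def loader():
    return DataLoader(TensorDataset(torch.randn(8, 4), torch.randn(8, 1)), batch_size=4)


recorder = FlagRecorder()
trainer = pl.Trainer(max_epochs=1, num_sanity_val_steps=2, callbacks=[recorder],
                     logger=False, enable_checkpointing=False, enable_progress_bar=False)
trainer.fit(TinyModel(), loader(), loader())

# Expected: the first recorded value is True (the pre-fit sanity pass), the
# remaining values are False, and the flag is False again once fit() returns.
print(recorder.flags, trainer.sanity_checking)
```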
Can we just remove this assert? Testing whether it performed sanity checking doesn't make much sense as a test anyway.
This line was suggested by ChatGPT. The following explanation is adapted from its reasoning:

`assert not trainer.sanity_checking` could fail (i.e., `trainer.sanity_checking == True` after `trainer.fit()` finishes).

What would cause it:

- Training never moved past the sanity check. Example: an error in the training loop caused Lightning to stop during sanity checking. Then `trainer.fit()` would return early, and `trainer.sanity_checking` could remain True.
- Sanity checking was the only phase run. If `max_steps=0` or `max_epochs=0`, training effectively does nothing after the sanity check. Depending on the version, `trainer.sanity_checking` might stay True.
- Misconfigured validation loop. If the validation loader is empty or raises errors, Lightning can get stuck in a state where it ends sanity checking improperly. Edge cases in earlier Lightning versions sometimes left the flag not reset.
- A bug in Lightning. The `sanity_checking` flag is set True at the start of the sanity check and False afterward; if it isn't toggled back, the assertion fails. Rare, but regressions could happen across versions.

Normal expectation: after a successful `trainer.fit()`, training always goes past the sanity check, so `trainer.sanity_checking` must be False. Therefore, this assertion is a robust guarantee that training ran at least one epoch/step beyond the pre-validation check.

⚡ So in short: it would fail only if `trainer.fit()` exited abnormally (error, misconfiguration, or zero training steps) or Lightning had a bug in resetting the flag.
As we all know, ChatGPT isn't always accurate. Unfortunately, in this case, I am unable to assess the accuracy of its claims. Please advise.
The test `test_train` previously asserted on the presence of the "Sanity Checking" message in stdout. This was brittle because in multi-GPU/DistributedDataParallel runs, only rank 0 prints this message, so tests running on other ranks failed.

This PR updates the test to assert on trainer state instead (e.g., `not trainer.sanity_checking`, `current_epoch >= 0`).

This makes the test deterministic and robust across single-GPU, multi-GPU, and CI environments.
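For reference, a rough sketch (not the actual diff; the helper name `assert_trained` is invented here) of what rank-agnostic, state-based assertions can look like, assuming a pytorch_lightning version where these trainer attributes are available:

```python
import pytorch_lightning as pl


def assert_trained(trainer: pl.Trainer) -> None:
    """Post-fit checks that read trainer state instead of captured stdout.

    Unlike the 'Sanity Checking' / 'Epoch 0:' log lines, which only rank 0
    prints under DDP, these attributes are set on every rank's trainer.
    """
    assert not trainer.sanity_checking  # the sanity-check phase has ended
    assert trainer.current_epoch >= 0   # training entered (at least) epoch 0
    assert trainer.global_step > 0      # at least one optimizer step actually ran
```

The `global_step` check is an extra guard beyond what the description lists; it would also catch the `max_epochs=0` edge case discussed above.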
Attachment: PassingLog.TXT