Fix test_train to not rely on 'Sanity Checking' stdout in multi-GPU runs #10478
base: master
Conversation
for more information, see https://pre-commit.ci
test/graphgym/test_graphgym.py (Outdated)
torch.distributed.destroy_process_group()
ggg = 0
What's this for? Can we remove it? Otherwise LGTM.
Done
Codecov Report: ✅ All modified and coverable lines are covered by tests.

@@           Coverage Diff            @@
##           master   #10478      +/-  ##
==========================================
- Coverage   86.11%   85.09%    -1.02%
==========================================
  Files         496      510       +14
  Lines       33655    35962     +2307
==========================================
+ Hits        28981    30602     +1621
- Misses       4674     5360      +686
for more information, see https://pre-commit.ci
…ometric into sanity_check
LGTM, @akihironitta to merge
out, err = capfd.readouterr()
assert 'Sanity Checking' in out
assert 'Epoch 0:' in out
assert not trainer.sanity_checking
If I understand this diff correctly, this `assert not trainer.sanity_checking` tests nothing. Previously, it tested that the code ran sanity checking, but with this PR, it doesn't. `sanity_checking` is only `True` when the trainer is running a few validation steps at the start of fit.
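For context, here is a minimal sketch (not part of this PR; the model, dataset, and callback below are made up for illustration) of how the flag behaves in a plain Lightning run: it is True only during the pre-fit sanity pass, and False during real validation epochs and after fit() returns.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl


class FlagRecorder(pl.Callback):
    """Records trainer.sanity_checking each time a validation phase starts."""
    def __init__(self):
        self.flags = []

    def on_validation_start(self, trainer, pl_module):
        self.flags.append(trainer.sanity_checking)


class TinyModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(4, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.mse_loss(self.layer(x), y)

    def validation_step(self, batch, batch_idx):
        x, y = batch
        self.log("val_loss", torch.nn.functional.mse_loss(self.layer(x), y))

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)


def loader():
    return DataLoader(TensorDataset(torch.randn(8, 4), torch.randn(8, 1)), batch_size=4)


recorder = FlagRecorder()
trainer = pl.Trainer(max_epochs=1, num_sanity_val_steps=2, callbacks=[recorder],
                     logger=False, enable_checkpointing=False, enable_progress_bar=False)
trainer.fit(TinyModel(), loader(), loader())

# Expected: the first recorded value is True (the pre-fit sanity pass), the
# remaining values are False, and the flag is False again once fit() returns.
print(recorder.flags, trainer.sanity_checking)
```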
Can we just remove this assert? Testing whether it performed sanity checking doesn't make much sense as a test anyway.
This line was suggested by ChatGPT. The following explanation is adapted from its reasoning:

`assert not trainer.sanity_checking` could fail (i.e., `trainer.sanity_checking == True` after `trainer.fit()` finishes).

What would cause it:

- Training never moved past the sanity check. Example: an error in the training loop caused Lightning to stop during sanity checking. Then `trainer.fit()` would return early, and `trainer.sanity_checking` could remain True.
- Sanity checking was the only phase run. If `max_steps=0` or `max_epochs=0`, training effectively does nothing after the sanity check. Depending on the version, `trainer.sanity_checking` might stay True.
- Misconfigured validation loop. If the validation loader is empty or raises errors, Lightning can get stuck in a state where it ends sanity checking improperly. Edge cases in earlier Lightning versions sometimes left the flag not reset.
- A bug in Lightning. The `sanity_checking` flag is set True at the start of the sanity check and False afterward; if it isn't toggled back, the assertion fails. Rare, but regressions could happen across versions.

Normal expectation: after a successful `trainer.fit()`, training always goes past the sanity check, so `trainer.sanity_checking` must be False. Therefore, this assertion is a robust guarantee that training ran at least one epoch/step beyond the pre-validation check.

⚡ So in short: it would fail only if `trainer.fit()` exited abnormally (error, misconfiguration, or zero training steps) or Lightning had a bug in resetting the flag.
As we all know, ChatGPT isn't always accurate. Unfortunately, in this case, I am unable to assess the accuracy of its claims. Please advise.
The test `test_train` previously asserted on the presence of the "Sanity Checking" message in stdout. This was brittle because in multi-GPU/DistributedDataParallel runs, only rank 0 prints this message, so tests running on other ranks failed.

This PR updates the test to assert on trainer state instead (e.g., `not trainer.sanity_checking`, `current_epoch >= 0`).

This makes the test deterministic and robust across single-GPU, multi-GPU, and CI environments.
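For reference, a rough sketch (not the actual diff; the helper name `assert_trained` is invented here) of what rank-agnostic, state-based assertions can look like, assuming a pytorch_lightning version where these trainer attributes are available:

```python
import pytorch_lightning as pl


def assert_trained(trainer: pl.Trainer) -> None:
    """Post-fit checks that read trainer state instead of captured stdout.

    Unlike the 'Sanity Checking' / 'Epoch 0:' log lines, which only rank 0
    prints under DDP, these attributes are set on every rank's trainer.
    """
    assert not trainer.sanity_checking  # the sanity-check phase has ended
    assert trainer.current_epoch >= 0   # training entered (at least) epoch 0
    assert trainer.global_step > 0      # at least one optimizer step actually ran
```

The `global_step` check is an extra guard beyond what the description lists; it would also catch the `max_epochs=0` edge case discussed above.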
Attachment: PassingLog.TXT