
Conversation

@amorehead

  • Adds an optional start_method argument to _cli.py, to allow one to use fork instead of spawn, for instance (see the sketch below)
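For context, here is a minimal sketch of the idea, assuming a stdlib-style local launcher; launch_local and worker_fn are hypothetical names, not fairchem's actual _cli.py API:

```python
# Illustrative only: a local multi-process launcher with a configurable start method.
# `launch_local` and `worker_fn` are made-up names, not fairchem's API.
import multiprocessing as mp


def launch_local(worker_fn, world_size: int, start_method: str = "spawn") -> None:
    # "fork" reuses the parent's memory (fast, inherits imports/state) but is unsafe
    # once CUDA or background threads exist; "spawn" starts clean interpreters.
    ctx = mp.get_context(start_method)
    procs = [ctx.Process(target=worker_fn, args=(rank, world_size)) for rank in range(world_size)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```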

@meta-cla meta-cla bot added the cla signed label Sep 9, 2025
@rayg1234 rayg1234 added enhancement New feature or request patch Patch version release labels Sep 10, 2025
@amorehead
Author

I've also opened a PR for torchtnt that will make multi-GPU (local) training work when scheduler.start_method=fork! This allows one to train on a non-SLURM cluster with multiple GPUs and multiple dataloader workers (i.e., num_dataloader_workers>0).
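For reference, a minimal sketch (not fairchem's or torchtnt's code; the dataset here is a dummy) of how the dataloader workers' start method can be chosen independently of how the training ranks are launched:

```python
# Minimal sketch: DataLoader workers get their own start method via
# multiprocessing_context, independent of how the trainer processes were started.
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(128, 8), torch.randint(0, 2, (128,)))
loader = DataLoader(
    dataset,
    batch_size=16,
    num_workers=4,                   # i.e., num_dataloader_workers > 0
    multiprocessing_context="fork",  # or "spawn"; "fork" is risky after CUDA init
    persistent_workers=True,
)
```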

@amorehead
Author

amorehead commented Sep 11, 2025

I have noticed that even with this fix, at random intervals, a dataloader worker will raise the following error and crash a training job: malloc(): invalid next size (unsorted). This error would have been raised as a segmentation fault in PyTorch 2.6.0, but with PyTorch 2.8.0, it appears as a malloc error. This error can be avoided entirely by using dataloader_workers=0 (but then training is incredibly slow), and I have verified it is unrelated to corrupted files being present in the training dataset, as I have downloaded and extracted a training set twice to rule this out.

Even odder, I'm now getting segmentation faults when using cluster.mode=SLURM with dataloader_workers>0.
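(One optional debugging aid, an assumption on my part rather than something established in this thread: enabling faulthandler in each dataloader worker makes native crashes like these dump a Python traceback, which helps attribute them to a specific extension module.)

```python
# Debugging sketch (not part of this PR): enable faulthandler in every dataloader
# worker so SIGSEGV/SIGABRT crashes print a Python traceback before the worker dies.
import faulthandler


def enable_faulthandler(worker_id: int) -> None:
    faulthandler.enable()


# Hypothetical usage: DataLoader(dataset, num_workers=4, worker_init_fn=enable_faulthandler)
```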

rayg1234 previously approved these changes Sep 12, 2025
Contributor

@rayg1234 rayg1234 left a comment

Thanks for adding this option!

@rayg1234
Contributor

> I have noticed that even with this fix, at random intervals, a dataloader worker will raise the following error and crash a training job: malloc(): invalid next size (unsorted). This error would have been raised as a segmentation fault in PyTorch 2.6.0, but with PyTorch 2.8.0, it appears as a malloc error. This error can be avoided entirely by using dataloader_workers=0 (but then training is incredibly slow), and I have verified it is unrelated to corrupted files being present in the training dataset, as I have downloaded and extracted a training set twice to rule this out.

BTW, we are planning to do the torch 2.8 upgrade soon. I don't have a good guess where this is coming from, but I will keep a lookout. cc @misko

@amorehead
Author

amorehead commented Sep 14, 2025

@rayg1234, I just discovered that I can fix the segfault error (with cluster.mode=SLURM) I mentioned before by pip installing fairchem==2.4.0 (versus the latest commit on main), which installs lmdb==1.7.0 rather than lmdb==1.7.3. I think the newer lmdb version (pulled in by the latest fairchem commit on main) was causing side effects downstream in the fairchem codebase. Just a heads up.

@misko
Contributor

misko commented Sep 24, 2025

@amorehead Hi Alex! Sorry for being so late to this party! Thank you very much for identifying this issue and giving us a heads up. I just encountered it, and knowing to try a previous version of lmdb saved me a bunch of time, thank you! The only way I'm able to get training running is with lmdb==1.6.2. I am investigating why this is the case now :(

The PyTorch 2.8 branch is passing tests, and I am launching training runs today to verify it works as expected. Hopefully I will have some more details on the malloc error soon!

@misko misko mentioned this pull request Sep 24, 2025
@misko
Contributor

misko commented Sep 25, 2025

> I have noticed that even with this fix, at random intervals, a dataloader worker will raise the following error and crash a training job: malloc(): invalid next size (unsorted). This error would have been raised as a segmentation fault in PyTorch 2.6.0, but with PyTorch 2.8.0, it appears as a malloc error. This error can be avoided entirely by using dataloader_workers=0 (but then training is incredibly slow), and I have verified it is unrelated to corrupted files being present in the training dataset, as I have downloaded and extracted a training set twice to rule this out.
>
> Even odder, I'm now getting segmentation faults when using cluster.mode=SLURM with dataloader_workers>0.

I created a merge request against ase-db-backends to resolve the segmentation fault; for now, we have reverted to py-lmdb 1.6.2:

https://gitlab.com/ase/ase-db-backends/-/merge_requests/6
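(A quick, purely illustrative sanity check that an environment actually picked up the reverted pin; this assumes lmdb.__version__ reports the installed py-lmdb version.)

```python
# Illustrative check that the reverted py-lmdb pin is in effect.
import lmdb

assert lmdb.__version__ == "1.6.2", f"unexpected lmdb version: {lmdb.__version__}"
```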

@amorehead
Author

Thanks for the update, @misko!
