
Conversation

@amorehead

  • Adds an optional start_method argument to _cli.py, to allow one to use fork instead of spawn, for instance (see the sketch below)
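For context, here is a minimal sketch of the idea, assuming a stdlib-style local launcher; launch_local and worker_fn are hypothetical names, not fairchem's actual _cli.py API:

```python
# Illustrative only: a local multi-process launcher with a configurable start method.
# `launch_local` and `worker_fn` are made-up names, not fairchem's API.
import multiprocessing as mp


def launch_local(worker_fn, world_size: int, start_method: str = "spawn") -> None:
    # "fork" reuses the parent's memory (fast, inherits imports/state) but is unsafe
    # once CUDA or background threads exist; "spawn" starts clean interpreters.
    ctx = mp.get_context(start_method)
    procs = [ctx.Process(target=worker_fn, args=(rank, world_size)) for rank in range(world_size)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```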

@meta-cla meta-cla bot added the cla signed label Sep 9, 2025
@rayg1234 rayg1234 added enhancement New feature or request patch Patch version release labels Sep 10, 2025
@amorehead
Author

I've also opened a PR for torchtnt that will make multi-GPU (local) training work when scheduler.start_method=fork! This allows one to train on a non-SLURM cluster with multiple GPUs and multiple dataloader workers (i.e., num_dataloader_workers>0).
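For reference, a minimal sketch (not fairchem's or torchtnt's code; the dataset here is a dummy) of how the dataloader workers' start method can be chosen independently of how the training ranks are launched:

```python
# Minimal sketch: DataLoader workers get their own start method via
# multiprocessing_context, independent of how the trainer processes were started.
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(128, 8), torch.randint(0, 2, (128,)))
loader = DataLoader(
    dataset,
    batch_size=16,
    num_workers=4,                   # i.e., num_dataloader_workers > 0
    multiprocessing_context="fork",  # or "spawn"; "fork" is risky after CUDA init
    persistent_workers=True,
)
```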

@amorehead
Author

amorehead commented Sep 11, 2025

I have noticed that even with this fix, at random intervals, a dataloader worker will raise the following error and crash a training job: malloc(): invalid next size (unsorted). This error would have been raised as a segmentation fault in PyTorch 2.6.0, but with PyTorch 2.8.0, it appears as a malloc error. This error can be avoided entirely by using dataloader_workers=0 (but then training is incredibly slow), and I have verified it is unrelated to corrupted files being present in the training dataset, as I have downloaded and extracted a training set twice to rule this out.

Even odder, I'm now getting segmentation faults when using cluster.mode=SLURM with dataloader_workers>0.
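(One optional debugging aid, an assumption on my part rather than something established in this thread: enabling faulthandler in each dataloader worker makes native crashes like these dump a Python traceback, which helps attribute them to a specific extension module.)

```python
# Debugging sketch (not part of this PR): enable faulthandler in every dataloader
# worker so SIGSEGV/SIGABRT crashes print a Python traceback before the worker dies.
import faulthandler


def enable_faulthandler(worker_id: int) -> None:
    faulthandler.enable()


# Hypothetical usage: DataLoader(dataset, num_workers=4, worker_init_fn=enable_faulthandler)
```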

rayg1234 previously approved these changes Sep 12, 2025
Contributor

@rayg1234 rayg1234 left a comment

Thanks for adding this option!

@rayg1234
Contributor

> I have noticed that even with this fix, at random intervals, a dataloader worker will raise the following error and crash a training job: malloc(): invalid next size (unsorted). This error would have been raised as a segmentation fault in PyTorch 2.6.0, but with PyTorch 2.8.0, it appears as a malloc error. This error can be avoided entirely by using dataloader_workers=0 (but then training is incredibly slow), and I have verified it is unrelated to corrupted files being present in the training dataset, as I have downloaded and extracted a training set twice to rule this out.

BTW, we are planning to do the torch 2.8 upgrade soon. I don't have a good guess where this is coming from, but I will keep a lookout. cc @misko

@amorehead
Author

amorehead commented Sep 14, 2025

@rayg1234, I just discovered that I can fix the segfault error (with cluster.mode=SLURM) I mentioned before by pip installing fairchem==2.4.0 (versus the latest commit on main), which installs lmdb==1.7.0 rather than lmdb==1.7.3. I think the newer lmdb version (pulled in by the latest fairchem commit on main) was causing side effects downstream in the fairchem codebase. Just a heads up.

@misko
Contributor

misko commented Sep 24, 2025

@amorehead Hi Alex! Sorry for being so late to this party! Thank you very much for identifying this issue and giving us a heads up. I just encountered it, and knowing to try a previous version of lmdb saved me a bunch of time, thank you! The only way I'm able to get training running is with lmdb==1.6.2. I am investigating why this is the case now :(

The PyTorch 2.8 branch is passing tests, and I am launching training runs today to verify it works as expected. Hopefully I will have some more details on the malloc error soon!

@misko misko mentioned this pull request Sep 24, 2025
@misko
Contributor

misko commented Sep 25, 2025

> I have noticed that even with this fix, at random intervals, a dataloader worker will raise the following error and crash a training job: malloc(): invalid next size (unsorted). This error would have been raised as a segmentation fault in PyTorch 2.6.0, but with PyTorch 2.8.0, it appears as a malloc error. This error can be avoided entirely by using dataloader_workers=0 (but then training is incredibly slow), and I have verified it is unrelated to corrupted files being present in the training dataset, as I have downloaded and extracted a training set twice to rule this out.
>
> Even odder, I'm now getting segmentation faults when using cluster.mode=SLURM with dataloader_workers>0.

I created a merge request against ase-db-backends to resolve the segmentation fault; for now, we have reverted to py-lmdb 1.6.2:

https://gitlab.com/ase/ase-db-backends/-/merge_requests/6
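(A quick, purely illustrative sanity check that an environment actually picked up the reverted pin; this assumes lmdb.__version__ reports the installed py-lmdb version.)

```python
# Illustrative check that the reverted py-lmdb pin is in effect.
import lmdb

assert lmdb.__version__ == "1.6.2", f"unexpected lmdb version: {lmdb.__version__}"
```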

@amorehead
Author

Thanks for the update, @misko!
