Add optional start_method argument to _cli.py
#1476
base: main
I've also opened a PR for torchtnt that will make multi-GPU (local) training work when |
I have noticed that even with this fix, at random intervals, a dataloader worker will raise the following error and crash a training job: Even odder is now I'm getting segmentation faults when using |
Thanks for adding this option!
btw, we are planning to do the torch 2.8 upgrade soon. I don't have a good guess where this is coming from, but will keep a lookout. cc @misko
@rayg1234, I just discovered that I can fix the segfault error (with |
@amorehead Hi Alex! Sorry for being so late to this party! Thank you very much for identifying this issue and giving us a heads up. I just encountered it, and knowing to try a previous version of lmdb saved me a bunch of time, thank you! The only way I'm able to get training running is with lmdb==1.6.2. I am now investigating why this is :( The PyTorch 2.8 branch is passing tests, and I am launching training runs today to verify it works as expected. Hopefully I will have some more details on the malloc error soon!
I created a PR with ase-db-backends to resolve the segmentation fault; for now we have reverted to py-lmdb-1.6.2.
Thanks for the update, @misko!
Adds an optional start_method argument to _cli.py, to allow one to use fork over spawn, for instance.
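For readers unfamiliar with multiprocessing start methods, here is a minimal sketch of how an optional start-method option could be threaded from a command line into Python's standard multiprocessing module. It is not the code from this PR: the `--start-method` flag name and the `parse_args` and `make_context` helpers are hypothetical, chosen only for illustration.

```python
import argparse
import multiprocessing as mp


def parse_args(argv=None):
    """Parse a hypothetical --start-method flag (name assumed for illustration)."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--start-method",
        choices=mp.get_all_start_methods(),  # methods supported on this platform
        default=None,  # None means "use the platform default"
    )
    return parser.parse_args(argv)


def make_context(start_method=None):
    """Return a multiprocessing context bound to the requested start method.

    Using a context object avoids changing the global start method, so other
    code in the same process is unaffected by the choice made here.
    """
    return mp.get_context(start_method)


# Example: request "spawn" explicitly, which is available on every platform.
args = parse_args(["--start-method", "spawn"])
ctx = make_context(args.start_method)
```

Worker processes created through `ctx` (for example with `ctx.Process` or `ctx.Pool`) then use the chosen method. Note that `fork` is only offered on POSIX systems, which is why the sketch restricts the choices to `mp.get_all_start_methods()` rather than hard-coding them.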