-
Notifications
You must be signed in to change notification settings - Fork 2.3k
[trainer] fix: Add data.seed
to config
#3815
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request introduces a data.seed
configuration option to ensure reproducible data shuffling, which is a good improvement for experiment consistency. The changes are applied across various configuration files and documentation. However, there are a couple of important issues. First, two auto-generated configuration files have been modified manually, which violates the repository's process. These should be updated by running the generation script after changing the source configs. Second, the implementation does not handle the case where data.seed
is set to null
, which will lead to a runtime crash. The documentation should be updated to warn users about this, and ideally, the code should be made more robust.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request introduces a data.seed
configuration option to control the seed for data shuffling, which is a useful feature for reproducibility. The implementation correctly handles cases where the seed is provided as an integer or is explicitly set to null
(or omitted), allowing for both deterministic and non-deterministic shuffling. However, there is a critical issue in the documentation that needs to be addressed.
Also support not setting the seed value (null) Signed-off-by: Hollow Man <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request successfully introduces the data.seed
configuration option to enable reproducible data shuffling, which is a valuable addition for experiment consistency. The changes are applied consistently across various configuration files, documentation, and trainer implementations. My review includes a few suggestions to enhance robustness by adding type validation for the new seed
parameter, which will help prevent runtime errors from invalid configurations.
What does this PR do?
Resolves #2234
Also support not setting the seed value (null)
Checklist Before Starting
[{modules}] {type}: {description}
(This will be checked by the CI){modules}
includefsdp
,megatron
,sglang
,vllm
,rollout
,trainer
,ci
,training_utils
,recipe
,hardware
,deployment
,ray
,worker
,single_controller
,misc
,perf
,model
,algo
,env
,tool
,ckpt
,doc
,data
,
like[megatron, fsdp, doc]
{type}
is infeat
,fix
,refactor
,chore
,test
[BREAKING]
to the beginning of the title.[BREAKING][fsdp, megatron] feat: dynamic batching
Test
API and Usage Example
# Add code snippet or script demonstrating how to use this
Design & Code Changes
Checklist Before Submitting
Important
Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.
pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always
ci-request
channel in theverl
Slack workspace. (If not accessible, please try the Feishu group (飞书群).)