[data, trainer] feat: add support for limiting samples from dataset #3812
Conversation
Force-pushed from 8bbb2e9 to f011a90
Code Review
This pull request introduces a useful feature for limiting dataset samples. The implementation is straightforward, but the sampling logic is duplicated across four different dataset classes (MultiTurnSFTDataset, RLHFDataset, RMDataset, and SFTDataset). For better long-term maintainability, this duplicated code could be refactored into a shared utility function. My review includes specific comments to improve the reproducibility of the random sampling by adding seeding.
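For reference, a seeded sampler along the lines suggested could look like this minimal sketch (the helper name and signature are hypothetical, not from the PR):

```python
import random

def sample_indices(num_total: int, max_samples: int, seed: int = 42) -> list[int]:
    # A dedicated, fixed-seed Random instance keeps the selected subset
    # identical across runs without disturbing the global random state.
    if max_samples < 0 or max_samples >= num_total:
        return list(range(num_total))
    rng = random.Random(seed)
    return sorted(rng.sample(range(num_total), max_samples))
```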
Force-pushed from fe0124f to f261865
Code Review
This pull request adds support for limiting the number of samples from datasets by introducing train_max_samples and val_max_samples configuration options. The changes are applied across various dataset classes, trainer scripts, and configuration files. While the overall approach is sound and includes new tests, I've found several critical issues in the implementation within the dataset classes that will lead to runtime errors. Specifically, self.config is used without being defined in SFTDataset and MultiTurnSFTDataset, and self.seed is used without being defined in RMDataset. Additionally, a new test for SFTDataset is calling the constructor incorrectly. These issues need to be addressed to ensure the new feature works as intended.
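Bugs of this class are usually fixed by assigning the attributes in `__init__` before anything reads them; a hedged illustration (the attribute names follow the review text, the constructor arguments are invented):

```python
class SFTDataset:
    def __init__(self, parquet_files, tokenizer, config):
        # Assign these up front; referencing self.config or self.seed
        # without these lines raises AttributeError at runtime.
        self.config = config
        self.seed = config.get("seed", 42)
        # ... sampling logic may now safely read self.config / self.seed ...
```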
Force-pushed from 125168d to 4d1b201
Code Review
This pull request introduces a feature to limit the number of samples from datasets, which is particularly useful for large datasets and computationally expensive preprocessing steps. The implementation adds train_max_samples and val_max_samples configuration options and applies this limit across various dataset classes, including RLHFDataset, SFTDataset, MultiTurnSFTDataset, and RMDataset. The changes are consistently applied throughout the codebase, with corresponding updates to documentation, example configurations, and tests.
One area for improvement is the duplication of the sampling logic across four different dataset classes. While the current implementation is functional, this redundancy could pose a maintainability challenge in the future. I have provided a suggestion to refactor this logic into a centralized utility function to enhance code quality and reduce the risk of inconsistencies.
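One possible shape for such a shared helper, sketched under the assumption that each dataset class holds a pandas dataframe (the helper name is hypothetical):

```python
import pandas as pd

def maybe_limit_samples(dataframe: pd.DataFrame, max_samples: int | None, seed: int = 42) -> pd.DataFrame:
    # No-op when no limit is configured or the dataset is already small enough.
    if max_samples is None or max_samples <= 0 or max_samples >= len(dataframe):
        return dataframe
    # random_state pins the subset so repeated runs see the same rows.
    return dataframe.sample(n=max_samples, random_state=seed).reset_index(drop=True)
```

Each of the four dataset classes could then call this one helper right after loading its data, eliminating the duplicated blocks.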
Force-pushed from 06da5e3 to ea60cfb
Code Review
This pull request adds support for limiting the number of samples from datasets, which is a useful feature for handling large datasets efficiently. The changes are applied consistently across several dataset classes and their usages. I've found one high-severity issue related to deterministic shuffling in RMDataset that should be addressed. Otherwise, the implementation looks good and the new tests are a great addition.
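If the underlying problem is an unseeded shuffle, the usual fix is a generator with a fixed seed; a minimal sketch, not the PR's actual code:

```python
import numpy as np

def deterministic_permutation(n: int, seed: int) -> np.ndarray:
    # default_rng with a fixed seed returns the same permutation on every
    # run, making any subset taken from its prefix reproducible.
    return np.random.default_rng(seed).permutation(n)
```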
Force-pushed from ea60cfb to 5f7c105
Code Review
This pull request introduces a useful feature to limit the number of samples from a dataset, which can be particularly helpful for large datasets where pre-processing steps are expensive. The implementation correctly adds train_max_samples and val_max_samples to the configuration and plumbs this through to the various dataset classes (RLHFDataset, SFTDataset, MultiTurnSFTDataset, RMDataset). The changes are consistent across different recipes and trainers, and new tests have been added to verify the functionality.
My main feedback is regarding the code duplication of the sampling logic across the four modified dataset classes. Extracting this logic into a single, shared utility function would significantly improve maintainability and reduce the risk of future bugs. I have provided specific comments on the relevant files with a suggestion for refactoring.
@HollowMan6 Please rebase onto main and fix the CI failure.
Force-pushed from 5f7c105 to b0cf1b6
@wuxibin89 The previous CI failures are fixed after the rebase. I think the two checks that are currently failing are caused by transient network issues. Would you mind re-triggering them? (It seems I can't re-trigger the failed CI checks from my side.)
What does this PR do?
e.g.: For RLHFDataset, `filter_overlong_prompts` can be very expensive, so it is useful to be able to limit the sample size before this step when the dataset is very large. Support is also added for the other kinds of datasets for unification. See the sketch below for the intended ordering.
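To make the motivation concrete, the order of operations could look like this sketch (all names besides `filter_overlong_prompts` are assumptions; `filter_fn` stands in for that expensive pass):

```python
import pandas as pd

def load_limited_dataset(parquet_files, max_samples, seed, filter_fn):
    # Read everything first, then truncate cheaply *before* the expensive
    # filtering pass, so filter_fn only sees the retained rows.
    dataframe = pd.concat([pd.read_parquet(f) for f in parquet_files], ignore_index=True)
    if max_samples and max_samples < len(dataframe):
        dataframe = dataframe.sample(n=max_samples, random_state=seed).reset_index(drop=True)
    return filter_fn(dataframe)
```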
Title format: `[{modules}] {type}: {description}` (This will be checked by the CI)
- `{modules}` include `fsdp`, `megatron`, `sglang`, `vllm`, `rollout`, `trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`, `ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`, `env`, `tool`, `ckpt`, `doc`, `data`, like `[megatron, fsdp, doc]`
- `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test`
- If this PR breaks any API, add `[BREAKING]` to the beginning of the title, e.g. `[BREAKING][fsdp, megatron] feat: dynamic batching`
Test
API and Usage Example
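As this section was left unfilled in the PR body, here is a hedged example of setting the new options; only the option names `train_max_samples` and `val_max_samples` come from this PR, and their exact location in verl's config tree is an assumption:

```python
from omegaconf import OmegaConf

# Hypothetical data config fragment illustrating the new options.
data_cfg = OmegaConf.create({
    "train_max_samples": 1000,  # keep at most 1000 training samples
    "val_max_samples": 128,     # keep at most 128 validation samples
})
assert data_cfg.train_max_samples == 1000
```

On the command line this would typically be a Hydra-style override such as `data.train_max_samples=1000`, though the exact key path is likewise an assumption.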
Design & Code Changes
Checklist Before Submitting
Important
Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.
- Run `pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always`
- Send a message in the `ci-request` channel in the `verl` Slack workspace. (If not accessible, please try the Feishu group (飞书群).)