Conversation


@HollowMan6 HollowMan6 commented Oct 18, 2025

What does this PR do?

Add concise overview of what this PR aims to achieve or accomplish. Reference related GitHub issues and PRs that help with the review.

e.g.: For RLHFDataset, `filter_overlong_prompts` can be very expensive, so it helps to support limiting the sample size before that step when the dataset is very large.

Also add support for other kinds of datasets for unification.
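The ordering described above can be sketched as follows. This is a minimal illustration of the sample-then-filter idea, not verl's actual code; the function name, parameters, and the length check standing in for `filter_overlong_prompts` are all assumptions:

```python
import random

def load_dataset(prompts, max_samples=None, max_prompt_length=2048, seed=42):
    """Illustrative sketch: sub-sample BEFORE expensive filtering,
    so the filter only touches at most max_samples items."""
    if max_samples is not None and 0 < max_samples < len(prompts):
        # cheap: seeded random sub-sample before any expensive preprocessing
        prompts = random.Random(seed).sample(prompts, max_samples)
    # expensive step (stand-in for RLHFDataset.filter_overlong_prompts)
    # now runs on at most max_samples items
    return [p for p in prompts if len(p) <= max_prompt_length]
```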

  • Search for similar PRs. Paste at least one query link here: ...
  • Format the PR title as [{modules}] {type}: {description} (This will be checked by the CI)
    • {modules} include fsdp, megatron, sglang, vllm, rollout, trainer, ci, training_utils, recipe, hardware, deployment, ray, worker, single_controller, misc, perf, model, algo, env, tool, ckpt, doc, data
    • If this PR involves multiple modules, separate them with , like [megatron, fsdp, doc]
    • {type} is in feat, fix, refactor, chore, test
    • If this PR breaks any API (CLI arguments, config, function signature, etc.), add [BREAKING] to the beginning of the title.
    • Example: [BREAKING][fsdp, megatron] feat: dynamic batching

Test

For changes that cannot be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluation results, etc.

API and Usage Example

Demonstrate how the API changes if any, and provide usage example(s) if possible.

# Add code snippet or script demonstrating how to use this
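A hypothetical usage sketch is below. The option names mirror the ones this PR adds (`train_max_samples`, with `-1` meaning no limit), but `DummyDataset` and its config dict are illustrative stand-ins, not verl's actual dataset API:

```python
import random

class DummyDataset:
    """Toy dataset that honors a train_max_samples limit (illustrative only)."""

    def __init__(self, samples, config):
        max_samples = config.get("train_max_samples", -1)
        if 0 < max_samples < len(samples):
            # seeded so the same subset is drawn on every run
            samples = random.Random(config.get("seed", 1)).sample(samples, max_samples)
        self.samples = samples

    def __len__(self):
        return len(self.samples)

ds = DummyDataset(list(range(10_000)), {"train_max_samples": 512, "seed": 1})
print(len(ds))  # -> 512
```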

Design & Code Changes

Demonstrate the high-level design if this PR is complex, and list the specific changes.

Checklist Before Submitting

Important

Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a useful feature for limiting dataset samples. The implementation is straightforward, but the sampling logic is duplicated across four different dataset classes (MultiTurnSFTDataset, RLHFDataset, RMDataset, and SFTDataset). For better long-term maintainability, this duplicated code could be refactored into a shared utility function. My review includes specific comments to improve the reproducibility of the random sampling by adding seeding.
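The suggested refactor could look something like the sketch below: one shared, seeded helper that all four dataset classes call. The helper name is hypothetical, and real verl datasets wrap richer objects than plain sequences:

```python
import random
from typing import Optional, Sequence

def limit_samples(samples: Sequence, max_samples: Optional[int], seed: int = 42) -> list:
    """Reproducibly sub-sample to at most max_samples items.

    Centralizing this avoids duplicating the logic across dataset classes,
    and seeding the RNG makes the selection deterministic across runs,
    which addresses the reproducibility concern raised in review.
    """
    samples = list(samples)
    if max_samples is None or max_samples <= 0 or max_samples >= len(samples):
        return samples
    return random.Random(seed).sample(samples, max_samples)
```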


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request adds support for limiting the number of samples from datasets by introducing train_max_samples and val_max_samples configuration options. The changes are applied across various dataset classes, trainer scripts, and configuration files. While the overall approach is sound and includes new tests, I've found several critical issues in the implementation within the dataset classes that will lead to runtime errors. Specifically, self.config is used without being defined in SFTDataset and MultiTurnSFTDataset, and self.seed is used without being defined in RMDataset. Additionally, a new test for SFTDataset is calling the constructor incorrectly. These issues need to be addressed to ensure the new feature works as intended.

@HollowMan6 HollowMan6 force-pushed the max_samples branch 2 times, most recently from 125168d to 4d1b201 Compare October 18, 2025 14:36

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a feature to limit the number of samples from datasets, which is particularly useful for large datasets and computationally expensive preprocessing steps. The implementation adds train_max_samples and val_max_samples configuration options and applies this limit across various dataset classes, including RLHFDataset, SFTDataset, MultiTurnSFTDataset, and RMDataset. The changes are consistently applied throughout the codebase, with corresponding updates to documentation, example configurations, and tests.

One area for improvement is the duplication of the sampling logic across four different dataset classes. While the current implementation is functional, this redundancy could pose a maintainability challenge in the future. I have provided a suggestion to refactor this logic into a centralized utility function to enhance code quality and reduce the risk of inconsistencies.

@HollowMan6 HollowMan6 force-pushed the max_samples branch 2 times, most recently from 06da5e3 to ea60cfb Compare October 18, 2025 17:22

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request adds support for limiting the number of samples from datasets, which is a useful feature for handling large datasets efficiently. The changes are applied consistently across several dataset classes and their usages. I've found one high-severity issue related to deterministic shuffling in RMDataset that should be addressed. Otherwise, the implementation looks good and the new tests are a great addition.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a useful feature to limit the number of samples from a dataset, which can be particularly helpful for large datasets where pre-processing steps are expensive. The implementation correctly adds train_max_samples and val_max_samples to the configuration and plumbs this through to the various dataset classes (RLHFDataset, SFTDataset, MultiTurnSFTDataset, RMDataset). The changes are consistent across different recipes and trainers, and new tests have been added to verify the functionality.

My main feedback is regarding the code duplication of the sampling logic across the four modified dataset classes. Extracting this logic into a single, shared utility function would significantly improve maintainability and reduce the risk of future bugs. I have provided specific comments on the relevant files with a suggestion for refactoring.

@wuxibin89

@HollowMan6 Please rebase onto main and fix the CI failures.

@HollowMan6

HollowMan6 commented Oct 20, 2025

@wuxibin89 The previous CI failures were fixed after rebasing. I believe the 2 checks that are currently failing are caused by transient network issues. Would you mind retriggering them? (It seems I can't retrigger failed CI runs on my side.)

@wuxibin89 wuxibin89 merged commit 60d8a62 into volcengine:main Oct 20, 2025
80 of 84 checks passed
@HollowMan6 HollowMan6 deleted the max_samples branch October 20, 2025 12:29