Conversation


@HollowMan6 HollowMan6 commented Oct 18, 2025

What does this PR do?

Add concise overview of what this PR aims to achieve or accomplish. Reference related GitHub issues and PRs that help with the review.

e.g.: For RLHFDataset, `filter_overlong_prompts` can be very expensive, so it helps to support limiting the sample size before that step when the dataset is very large.

Also add support for other kinds of datasets for unification.
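The ordering described above can be sketched as follows. This is a minimal illustration of the sample-then-filter idea, not verl's actual code; the function name, parameters, and the length check standing in for `filter_overlong_prompts` are all assumptions:

```python
import random

def load_dataset(prompts, max_samples=None, max_prompt_length=2048, seed=42):
    """Illustrative sketch: sub-sample BEFORE expensive filtering,
    so the filter only touches at most max_samples items."""
    if max_samples is not None and 0 < max_samples < len(prompts):
        # cheap: seeded random sub-sample before any expensive preprocessing
        prompts = random.Random(seed).sample(prompts, max_samples)
    # expensive step (stand-in for RLHFDataset.filter_overlong_prompts)
    # now runs on at most max_samples items
    return [p for p in prompts if len(p) <= max_prompt_length]
```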

  • Search for similar PRs. Paste at least one query link here: ...
  • Format the PR title as [{modules}] {type}: {description} (This will be checked by the CI)
    • {modules} include fsdp, megatron, sglang, vllm, rollout, trainer, ci, training_utils, recipe, hardware, deployment, ray, worker, single_controller, misc, perf, model, algo, env, tool, ckpt, doc, data
    • If this PR involves multiple modules, separate them with , like [megatron, fsdp, doc]
    • {type} is in feat, fix, refactor, chore, test
    • If this PR breaks any API (CLI arguments, config, function signature, etc.), add [BREAKING] to the beginning of the title.
    • Example: [BREAKING][fsdp, megatron] feat: dynamic batching

Test

For changes that cannot be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluation results, etc.

API and Usage Example

Demonstrate how the API changes if any, and provide usage example(s) if possible.

# Add code snippet or script demonstrating how to use this
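A hypothetical usage sketch is below. The option names mirror the ones this PR adds (`train_max_samples`, with `-1` meaning no limit), but `DummyDataset` and its config dict are illustrative stand-ins, not verl's actual dataset API:

```python
import random

class DummyDataset:
    """Toy dataset that honors a train_max_samples limit (illustrative only)."""

    def __init__(self, samples, config):
        max_samples = config.get("train_max_samples", -1)
        if 0 < max_samples < len(samples):
            # seeded so the same subset is drawn on every run
            samples = random.Random(config.get("seed", 1)).sample(samples, max_samples)
        self.samples = samples

    def __len__(self):
        return len(self.samples)

ds = DummyDataset(list(range(10_000)), {"train_max_samples": 512, "seed": 1})
print(len(ds))  # -> 512
```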

Design & Code Changes

Demonstrate the high-level design if this PR is complex, and list the specific changes.

Checklist Before Submitting

Important

Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a useful feature for limiting dataset samples. The implementation is straightforward, but the sampling logic is duplicated across four different dataset classes (MultiTurnSFTDataset, RLHFDataset, RMDataset, and SFTDataset). For better long-term maintainability, this duplicated code could be refactored into a shared utility function. My review includes specific comments to improve the reproducibility of the random sampling by adding seeding.
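The suggested refactor could look something like the sketch below: one shared, seeded helper that all four dataset classes call. The helper name is hypothetical, and real verl datasets wrap richer objects than plain sequences:

```python
import random
from typing import Optional, Sequence

def limit_samples(samples: Sequence, max_samples: Optional[int], seed: int = 42) -> list:
    """Reproducibly sub-sample to at most max_samples items.

    Centralizing this avoids duplicating the logic across dataset classes,
    and seeding the RNG makes the selection deterministic across runs,
    which addresses the reproducibility concern raised in review.
    """
    samples = list(samples)
    if max_samples is None or max_samples <= 0 or max_samples >= len(samples):
        return samples
    return random.Random(seed).sample(samples, max_samples)
```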


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request adds support for limiting the number of samples from datasets by introducing train_max_samples and val_max_samples configuration options. The changes are applied across various dataset classes, trainer scripts, and configuration files. While the overall approach is sound and includes new tests, I've found several critical issues in the implementation within the dataset classes that will lead to runtime errors. Specifically, self.config is used without being defined in SFTDataset and MultiTurnSFTDataset, and self.seed is used without being defined in RMDataset. Additionally, a new test for SFTDataset is calling the constructor incorrectly. These issues need to be addressed to ensure the new feature works as intended.

@HollowMan6 HollowMan6 force-pushed the max_samples branch 2 times, most recently from 125168d to 4d1b201 Compare October 18, 2025 14:36

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a feature to limit the number of samples from datasets, which is particularly useful for large datasets and computationally expensive preprocessing steps. The implementation adds train_max_samples and val_max_samples configuration options and applies this limit across various dataset classes, including RLHFDataset, SFTDataset, MultiTurnSFTDataset, and RMDataset. The changes are consistently applied throughout the codebase, with corresponding updates to documentation, example configurations, and tests.

One area for improvement is the duplication of the sampling logic across four different dataset classes. While the current implementation is functional, this redundancy could pose a maintainability challenge in the future. I have provided a suggestion to refactor this logic into a centralized utility function to enhance code quality and reduce the risk of inconsistencies.

@HollowMan6 HollowMan6 force-pushed the max_samples branch 2 times, most recently from 06da5e3 to ea60cfb Compare October 18, 2025 17:22

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request adds support for limiting the number of samples from datasets, which is a useful feature for handling large datasets efficiently. The changes are applied consistently across several dataset classes and their usages. I've found one high-severity issue related to deterministic shuffling in RMDataset that should be addressed. Otherwise, the implementation looks good and the new tests are a great addition.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a useful feature to limit the number of samples from a dataset, which can be particularly helpful for large datasets where pre-processing steps are expensive. The implementation correctly adds train_max_samples and val_max_samples to the configuration and plumbs this through to the various dataset classes (RLHFDataset, SFTDataset, MultiTurnSFTDataset, RMDataset). The changes are consistent across different recipes and trainers, and new tests have been added to verify the functionality.

My main feedback is regarding the code duplication of the sampling logic across the four modified dataset classes. Extracting this logic into a single, shared utility function would significantly improve maintainability and reduce the risk of future bugs. I have provided specific comments on the relevant files with a suggestion for refactoring.

@wuxibin89

@HollowMan6 Please rebase onto main and fix the CI failures.

@HollowMan6

HollowMan6 commented Oct 20, 2025

@wuxibin89 The previous CI failures were fixed after rebasing. I believe the 2 checks that are currently failing are caused by transient network issues. Would you mind retriggering them? (It seems I can't retrigger failed CI runs on my side.)

@wuxibin89 wuxibin89 merged commit 60d8a62 into volcengine:main Oct 20, 2025
80 of 84 checks passed
@HollowMan6 HollowMan6 deleted the max_samples branch October 20, 2025 12:29