@szrlee szrlee commented Oct 18, 2025

Summary

Fixes #3787 by removing torch.quantile()-based percentile metrics (rollout_is_p25, rollout_is_p50, rollout_is_p75) that caused RuntimeError: quantile() input tensor is too large when using large batch sizes or response lengths.

Problem

When using configurations with large tensor sizes (e.g., max_response_length: 32k, rollout.n: 16, train_batch_size: 16), the torch.quantile() function fails with a runtime error due to PyTorch's internal tensor size limitations (~2^24 to 2^27 elements depending on version, GPU memory, and dtype).

The error occurred in verl/trainer/ppo/mismatch_helper.py:

metrics["rollout_is_p25"] = torch.quantile(flat_weights, 0.25)
metrics["rollout_is_p50"] = torch.quantile(flat_weights, 0.50)
metrics["rollout_is_p75"] = torch.quantile(flat_weights, 0.75)

Solution

Removed the three quantile-based percentile metrics from the Rollout IS framework. The remaining metrics (rollout_is_mean, rollout_is_std, rollout_is_min, rollout_is_max, rollout_is_eff_sample_size, etc.) provide sufficient monitoring capabilities for importance sampling health without triggering tensor size limitations.
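As a rough illustration of why the remaining metrics avoid the size limit, here is a minimal sketch that computes them using only sums, minima, and maxima. It is not the code in mismatch_helper.py: the helper name summarize_is_weights, the plain-list input (used instead of a torch tensor so the sketch runs without dependencies), the population standard deviation, and the normalised effective-sample-size formula are all assumptions made for illustration.

```python
import math
from typing import Dict, Sequence


def summarize_is_weights(weights: Sequence[float]) -> Dict[str, float]:
    """Summarise importance-sampling weights without any quantile call.

    Hypothetical helper: the name, the plain-list input, and the normalised
    ESS formula are assumptions, not code taken from mismatch_helper.py.
    """
    if not weights:
        raise ValueError("weights must be non-empty")
    n = len(weights)
    total = math.fsum(weights)
    mean = total / n
    # Population variance (divides by n, not n - 1); an assumption here.
    var = math.fsum((w - mean) ** 2 for w in weights) / n
    sum_sq = math.fsum(w * w for w in weights)
    # Normalised effective sample size: (sum w)^2 / (n * sum w^2).
    # Equals 1.0 for uniform weights and 1/n when one weight dominates.
    ess = (total * total) / (n * sum_sq) if sum_sq > 0 else 0.0
    return {
        "rollout_is_mean": mean,
        "rollout_is_std": math.sqrt(var),
        "rollout_is_min": min(weights),
        "rollout_is_max": max(weights),
        "rollout_is_eff_sample_size": ess,
    }
```

Every quantity here is a single-pass reduction, so none of them needs the sort that a quantile computation relies on.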

Changes

  • Modified: verl/trainer/ppo/mismatch_helper.py
    • Removed rollout_is_p25, rollout_is_p50, rollout_is_p75 metric calculations
    • All other rollout IS and mismatch metrics remain functional

Testing

Verified that:

  • Rollout IS framework continues to function correctly without percentile metrics
  • No runtime errors with large tensor configurations
  • All other metrics (mean, std, min, max, ESS, veto fraction, etc.) are computed correctly
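A check along these lines could express the first and third points as a regression guard. This is a hypothetical sketch, not the test shipped in the PR: the helper name find_removed_percentile_keys and the example metrics dictionary are assumptions, and only the three key names listed in the Summary are checked.

```python
from typing import Dict, List

# Metric keys named in this PR as removed.
REMOVED_PERCENTILE_KEYS = ("rollout_is_p25", "rollout_is_p50", "rollout_is_p75")


def find_removed_percentile_keys(metrics: Dict[str, float]) -> List[str]:
    """Return any removed percentile keys that still appear in `metrics`.

    Hypothetical helper for illustration; an empty list means the metrics
    dictionary no longer reports quantile-based values.
    """
    return [key for key in REMOVED_PERCENTILE_KEYS if key in metrics]


# Example metrics dictionary with made-up values.
example_metrics = {
    "rollout_is_mean": 1.02,
    "rollout_is_std": 0.31,
    "rollout_is_min": 0.12,
    "rollout_is_max": 4.70,
}
leftover = find_removed_percentile_keys(example_metrics)
```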

Resolves #3787

Remove p25, p50, p75, p95, and p99 percentile metrics from rollout
importance sampling. These metrics used torch.quantile() which can be
computationally expensive. The remaining distribution metrics (mean,
std, min, max, eff_sample_size) provide sufficient monitoring coverage.

Changes:
- Remove quantile computation from compute_is_metrics()
- Update test expectations to remove percentile metrics
- Remove percentile metrics from documentation and examples

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request effectively resolves the torch.quantile tensor size limit error by removing the percentile-based metrics. The core code change in mismatch_helper.py is correct, and the associated updates to tests and most of the documentation are consistent with this removal. I've identified one high-severity issue in the documentation that needs to be addressed to ensure the provided examples are runnable.

…uide

Remove references to torch.quantile-based percentile metrics (p25, p50, p75, p95, p99) from the plotting function and metrics history example to align with the codebase changes that removed these metrics.
- Update all references to use rollout_is naming consistently
@szrlee szrlee force-pushed the yingru/rollout-is-fix-metrics branch from e924de2 to 2eb77b8 Compare October 19, 2025 08:58
@szrlee szrlee changed the title Fix: Remove torch.quantile-based percentile metrics to resolve tensor size limit error [data] fix: remove torch.quantile-based percentile metrics to resolve tensor size limit error Oct 20, 2025
@szrlee szrlee changed the title [data] fix: remove torch.quantile-based percentile metrics to resolve tensor size limit error fix: remove torch.quantile-based percentile metrics to resolve tensor size limit error Oct 20, 2025
@wuxibin89 wuxibin89 changed the title fix: remove torch.quantile-based percentile metrics to resolve tensor size limit error [algo] fix: remove torch.quantile-based percentile metrics to resolve tensor size limit error Oct 20, 2025
@wuxibin89 wuxibin89 merged commit 4f1c489 into volcengine:main Oct 20, 2025
87 of 97 checks passed

Development

Successfully merging this pull request may close these issues.

Bug when calculating the rollout_is metrics