[CPU] Fix torch version in x86 CPU backend and refine default configurations #19258
Conversation
Hello @bigPYJ1151, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
Summary of Changes
Hello! Gemini or gemini-code-assist here, providing a summary of this pull request. This PR primarily focuses on improving the x86 CPU backend by pinning PyTorch to 2.6.0 to avoid a performance regression found in PyTorch 2.7.0. It also aims to accelerate the CPU CI pipeline and refines some default configurations specific to the CPU backend. Additionally, it updates how quantization configurations are handled in some VLM models to avoid unnecessary imports.
Highlights
- PyTorch Version Pinning: The PyTorch version for the x86 CPU backend is explicitly pinned to 2.6.0 in the requirement files (`requirements/cpu.txt`, `requirements/cpu-build.txt`, `requirements/cpu-test.in`). This avoids a performance regression in random number generation observed in PyTorch 2.7.0 on x86 platforms, which significantly impacts vLLM's random sampling performance.
- CPU CI Acceleration: New requirement files (`cpu-build.txt`, `cpu-test.in`) are introduced specifically for CPU builds and tests. The CI script (`run-cpu-test.sh`) is modified to set a `VLLM_CPU_CI_ENV` variable, which is then used in `vllm/platforms/cpu.py` to switch the V1 compilation backend to 'eager' mode during CI runs, reducing compilation time and accelerating tests (a rough sketch follows this list).
- Refined CPU Configurations: Default values for `max_num_batched_tokens` and `max_num_seqs` are refined specifically for the CPU backend in `vllm/engine/arg_utils.py`, and the logic for setting the default `max_num_seqs` is updated to be context-aware. The forced `enforce_eager=True` setting for CPU in `vllm/platforms/cpu.py` is also removed.
- Quantization Config Handling: In several VLM model files (`minicpmo.py`, `ovis.py`, `qwen2_5_vl.py`, `qwen2_vl.py`), the check for GPTQ/Marlin quantization configurations is changed from `isinstance()` to checking the quantization config's name via `get_name()`. This avoids needing to import the specific GPTQ/Marlin config classes in these files.
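To illustrate the CI switch described in the "CPU CI Acceleration" item, the backend selection in `vllm/platforms/cpu.py` can be thought of as roughly the following; the helper name and the exact way the value feeds into the compilation config are assumptions for illustration, not the PR's literal code.

```python
import os


def choose_cpu_compile_backend() -> str:
    """Pick the V1 compilation backend for the CPU platform.

    Sketch only: when the CI script exports VLLM_CPU_CI_ENV=1, fall back to
    "eager" to skip Inductor compilation and keep test runs fast; otherwise
    keep "inductor" for best runtime performance.
    """
    if os.environ.get("VLLM_CPU_CI_ENV", "0") == "1":
        return "eager"
    return "inductor"
```

In the CI script this pairs with `VLLM_CPU_CI_ENV=1` on the docker run commands and `VLLM_CPU_CI_ENV=0` before the serving benchmark step, as listed in the changelog below.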
Changelog
- `.buildkite/scripts/hardware_ci/run-cpu-test.sh`
  - Added the `VLLM_CPU_CI_ENV=1` environment variable to the docker run commands for CI (lines 27, 28).
  - Added `VLLM_CPU_CI_ENV=0` before the `benchmark_serving.py` command (line 72).
- `docker/Dockerfile.cpu`
  - Updated the Dockerfile to use `requirements/cpu-build.txt` for build dependencies (line 69).
  - Updated the Dockerfile to use `requirements/cpu-test.in` for test dependencies (line 112).
- `requirements/cpu-build.txt`
  - Added a new file specifying build dependencies for the CPU backend.
  - Pinned `torch` to version `2.6.0+cpu` for x86_64 (line 9).
  - Pinned `intel_extension_for_pytorch` to version `2.6.0` for x86_64 (implied by the build requirements).
- `requirements/cpu-test.in`
  - Added a new file specifying test dependencies for the CPU backend.
  - Pinned `torch` to version `2.6.0` (line 27).
  - Pinned `transformers` to version `4.52.4` (line 39) and `tokenizers` to version `0.21.1` (line 40).
- `requirements/cpu.txt`
  - Pinned `torch` to version `2.6.0+cpu` for x86_64 (line 11).
  - Pinned `intel_extension_for_pytorch` to version `2.6.0` for x86_64 (line 29).
- `tests/models/multimodal/generation/test_common.py`
  - Added explicit `dtype="bfloat16"` for the `minicpmo` test configuration when running on CPU (line 110).
- `vllm/engine/arg_utils.py` (see the sketch after this changelog)
  - Changed `default_max_num_seqs` to be a dictionary keyed by `UsageContext` (lines 1551, 1558).
  - Added CPU-specific default values for `max_num_batched_tokens` and `max_num_seqs` (lines 1582-1590).
  - Updated the logic to use the context-aware default for `max_num_seqs` (lines 1612-1614).
- `vllm/model_executor/models/minicpmo.py`
  - Removed the imports for `GPTQConfig` and `GPTQMarlinConfig` (lines 38-40).
  - Changed the quantization config check from `isinstance()` to `quant_config.get_name() in (...)` (lines 520-521).
- `vllm/model_executor/models/ovis.py`
  - Removed the imports for `GPTQConfig` and `GPTQMarlinConfig` (lines 34-36).
  - Changed the quantization config check from `isinstance()` to `quant_config.get_name() in (...)` (lines 434-435).
- `vllm/model_executor/models/qwen2_5_vl.py`
  - Removed the imports for `GPTQConfig` and `GPTQMarlinConfig` (lines 52-54).
  - Changed the quantization config check from `isinstance()` to `quant_config.get_name() in (...)` (lines 862-863).
- `vllm/model_executor/models/qwen2_vl.py`
  - Removed the imports for `GPTQConfig` and `GPTQMarlinConfig` (lines 51-53).
  - Changed the quantization config check from `isinstance()` to `quant_config.get_name() in (...)` (lines 1111-1112).
- `vllm/platforms/cpu.py`
  - Removed the logic that forced `model_config.enforce_eager = True` for CPU (lines 94-95).
  - Adjusted the V1 compilation backend selection to use 'eager' in CI when `VLLM_CPU_CI_ENV` is set, and 'inductor' otherwise (lines 178-181).
- `vllm/v1/worker/cpu_model_runner.py`
  - Removed the unused import `contextmanager` (line 2).
  - Removed the `_set_global_compilation_settings` context manager function (lines 73-86).
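To make the `vllm/engine/arg_utils.py` entries more concrete, a dictionary of per-`UsageContext` defaults might be consulted as sketched below. The enum members and the numeric values are illustrative assumptions (a local stand-in enum is defined here), not the values chosen in the PR, which also adds CPU-specific overrides.

```python
from enum import Enum
from typing import Optional


class UsageContext(Enum):
    # Stand-in for vLLM's UsageContext; member names are assumptions here.
    LLM_CLASS = "LLM_CLASS"
    OPENAI_API_SERVER = "OPENAI_API_SERVER"


# Hypothetical per-context defaults; the PR additionally adds CPU-specific
# values for max_num_batched_tokens and max_num_seqs.
DEFAULT_MAX_NUM_SEQS = {
    UsageContext.LLM_CLASS: 256,
    UsageContext.OPENAI_API_SERVER: 128,
}


def resolve_max_num_seqs(usage_context: UsageContext,
                         user_value: Optional[int] = None) -> int:
    """Return the user-provided value if set, else the context-aware default."""
    if user_value is not None:
        return user_value
    return DEFAULT_MAX_NUM_SEQS.get(usage_context, 128)
```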
Code Review
This pull request effectively addresses the torch performance regression on the x86 CPU backend by pinning to version 2.6.0. The introduction of `VLLM_CPU_CI_ENV` for CI-specific configurations and the new CPU-specific requirement files are positive changes. The refactoring in VLM model files to check the quantization method by name instead of by type improves modularity.
There are a couple of areas related to the V1 CPU backend compilation configuration in `vllm/platforms/cpu.py` and `vllm/v1/worker/cpu_model_runner.py` where further clarification on the intent and impact of the changes would be helpful. Please see the detailed comments on those files.
The PR description's "Test Result" section is currently marked as TODO; please consider updating this.
Summary of Findings
- V1 CPU Backend Compilation Configuration: Changes to the V1 CPU backend compilation settings, specifically the removal of the global inductor configurations for `freezing` and `force_disable_caches` in `cpu_model_runner.py` and the removal of `custom_ops += ["none"]` in `platforms/cpu.py`, warrant clarification to ensure there are no unintended performance or behavior regressions in non-CI CPU V1 scenarios.
- Torch Version Pinning: The core fix of pinning `torch` to `2.6.0+cpu` and `intel_extension_for_pytorch` to `2.6.0` for the x86 CPU backend is correctly implemented and addresses the reported performance regression.
- CI Optimization: The use of `VLLM_CPU_CI_ENV` to switch to an "eager" backend for CI tests is a good strategy to accelerate CI execution.
- Quantization Check Refinement: Refactoring VLM models to check the quantization method by name (`quant_config.get_name()`) instead of `isinstance` checks against specific GPTQ/Marlin config classes is a good improvement for modularity and reduces direct dependencies (a minimal before/after sketch follows this list).
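As a minimal before/after sketch of that quantization check refactor (the import paths in the comments and the exact set of method names are assumptions; in the PR the check is inlined at the call sites in the touched model files):

```python
# Before: each VLM file had to import the concrete config classes, e.g.
#   from vllm.model_executor.layers.quantization.gptq import GPTQConfig
#   from vllm.model_executor.layers.quantization.gptq_marlin import GPTQMarlinConfig
#   is_gptq_like = isinstance(quant_config, (GPTQConfig, GPTQMarlinConfig))

# After: compare the quantization method's registered name instead, so the
# GPTQ/Marlin config classes no longer need to be imported in the model files.
def is_gptq_like(quant_config) -> bool:
    """Sketch of a name-based check; "gptq"/"gptq_marlin" are assumed names."""
    return (quant_config is not None
            and quant_config.get_name() in ("gptq", "gptq_marlin"))
```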
Merge Readiness
The pull request makes significant improvements and fixes a key performance issue. However, before merging, it would be beneficial to get clarification on the V1 CPU backend compilation changes mentioned in the review comments to ensure they align with the intended behavior and performance expectations. I am unable to approve this pull request myself; please ensure it is reviewed and approved by other maintainers after addressing the feedback. I recommend addressing the medium severity comments before merging.
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
This pull request has merge conflicts that must be resolved before it can be merged.
Commit messages (each signed off by jiang.li / jiang1.li <[email protected]>):
- avoid import marlin globally
- llava test uses bf16
- refine compile config
- opt cpu default batchsize
- format
- fix llava embedding
- format
- fix import
- Revert "avoid import marlin globally" (reverts commit d0ebbd265a443d90b99c2342abd88faf42aa9481)
- fix ipex quant
- list packages
- refine test deps
- update compile config
Essential Elements of an Effective PR Description Checklist
Purpose
Test Plan
offline inference
Test Result
```
VLLM_CPU_KVCACHE_SPACE=40 \
VLLM_CPU_OMP_THREADS_BIND="128-158|160-190|192-222|224-254" \
python3 benchmark_throughput.py --backend=vllm \
    --dataset=./ShareGPT_V3_unfiltered_cleaned_split.json -tp=4 \
    --model=meta-llama/Meta-Llama-3-8B-Instruct --num-prompts=1000 \
    --dtype=bfloat16 --trust-remote-code

Throughput: 1.32 requests/s, 544.45 total tokens/s, 261.13 output tokens/s
```

This PR:

```
VLLM_CPU_KVCACHE_SPACE=40 \
VLLM_CPU_OMP_THREADS_BIND="128-158|160-190|192-222|224-254" \
python3 benchmark_throughput.py --backend=vllm \
    --dataset=./ShareGPT_V3_unfiltered_cleaned_split.json -tp=4 \
    --model=meta-llama/Meta-Llama-3-8B-Instruct --num-prompts=1000 \
    --dtype=bfloat16 --trust-remote-code

Throughput: 2.97 requests/s, 1228.82 total tokens/s, 589.37 output tokens/s
```