[CPU] Fix torch version in x86 CPU backend and refine default configurations #19258

Open · bigPYJ1151 wants to merge 1 commit into main from downgrade_torch

Conversation

Contributor

@bigPYJ1151 bigPYJ1151 commented Jun 6, 2025

Essential Elements of an Effective PR Description Checklist

  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results

Purpose

  • Fix the torch version used in the x86 CPU backend to avoid a random sampling performance regression.

Test Plan

offline inference

Test Result

  • main
VLLM_CPU_KVCACHE_SPACE=40 \
VLLM_CPU_OMP_THREADS_BIND="128-158|160-190|192-222|224-254" \
python3 benchmark_throughput.py --backend=vllm --dataset=./ShareGPT_V3_unfiltered_cleaned_split.json -tp=4 --model=meta-llama/Meta-Llama-3-8B-Instruct --num-prompts=1000 --dtype=bfloat16 --trust-remote-code

Throughput: 1.32 requests/s, 544.45 total tokens/s, 261.13 output tokens/s
  • PR
VLLM_CPU_KVCACHE_SPACE=40 \
VLLM_CPU_OMP_THREADS_BIND="128-158|160-190|192-222|224-254" \
python3 benchmark_throughput.py --backend=vllm --dataset=./ShareGPT_V3_unfiltered_cleaned_split.json -tp=4 --model=meta-llama/Meta-Llama-3-8B-Instruct --num-prompts=1000 --dtype=bfloat16 --trust-remote-code

Throughput: 2.97 requests/s, 1228.82 total tokens/s, 589.37 output tokens/s

This PR:

  • Pin the torch version to 2.6.0 for the x86 CPU backend, because a random number generator performance regression in torch 2.7.0 significantly affects vLLM's random sampling performance; a rough microbenchmark sketch of the affected sampling path follows this list.
  • Add new requirements files for building and testing the CPU backend.
  • Change the llava tests' dtype to BF16, as torch 2.6 FP16 GEMM performs slowly on this model.
  • Refine the V1 CPU backend compile config and the default batch-size config.
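
To put the torch pin in context, here is a rough, self-contained microbenchmark of the kind of CPU sampling path described above; the shapes, iteration count, and use of torch.multinomial are illustrative assumptions, not the benchmark behind the numbers in this PR.

import time
import torch

# Rough microbenchmark of CPU random sampling, the path the PR description says
# regressed in torch 2.7.0; shapes and iteration count are arbitrary choices.
probs = torch.softmax(torch.randn(64, 32000), dim=-1)
start = time.perf_counter()
for _ in range(100):
    torch.multinomial(probs, num_samples=1)
elapsed = time.perf_counter() - start
print(f"torch {torch.__version__}: {elapsed:.3f}s for 100 multinomial calls")

Running this under torch 2.6.0+cpu and 2.7.0+cpu should make any sampling-path regression visible, independent of vLLM.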

@bigPYJ1151 bigPYJ1151 marked this pull request as ready for review June 6, 2025 07:23
@mergify mergify bot added labels ci/build, multi-modality (Related to multi-modality (#4194)), v1 on Jun 6, 2025
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Hello @bigPYJ1151, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

Summary of Changes

Hello! Gemini or gemini-code-assist here, providing a summary of this pull request. This PR primarily focuses on improving the x86 CPU backend by addressing a performance regression found in PyTorch 2.7.0, pinning the version to 2.6.0. It also aims to accelerate the CPU CI pipeline and refines some default configurations specific to the CPU backend. Additionally, it updates how quantization configurations are handled in some VLM models to avoid unnecessary imports.

Highlights

  • PyTorch Version Pinning: The PyTorch version for the x86 CPU backend is explicitly pinned to 2.6.0 in the requirement files (requirements/cpu.txt, requirements/cpu-build.txt, requirements/cpu-test.in). This is done to avoid a performance regression in random number generation observed in PyTorch 2.7.0 on x86 platforms, which significantly impacts vLLM's random sampling performance.
  • CPU CI Acceleration: New requirements files (cpu-build.txt, cpu-test.in) are introduced specifically for CPU builds and tests. The CI script (run-cpu-test.sh) is modified to set a VLLM_CPU_CI_ENV variable. This variable is then used in vllm/platforms/cpu.py to potentially switch the V1 compilation backend to 'eager' mode during CI runs, aiming to reduce compilation time and accelerate tests; a minimal Python sketch of this selection follows this list.
  • Refined CPU Configurations: Default values for max_num_batched_tokens and max_num_seqs are refined specifically for the CPU backend in vllm/engine/arg_utils.py. The logic for setting default max_num_seqs is updated to be context-aware. The forced enforce_eager=True setting for CPU in vllm/platforms/cpu.py is also removed.
  • Quantization Config Handling: In several VLM model files (minicpmo.py, ovis.py, qwen2_5_vl.py, qwen2_vl.py), the check for GPTQ/Marlin quantization configurations is changed from using isinstance() to checking the quantization config's name via get_name(). This avoids needing to import the specific GPTQ/Marlin config classes in these files.
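
To make the CI-acceleration highlight concrete, the backend selection can be sketched as below; the helper name and default value are assumptions for illustration, not the actual code in vllm/platforms/cpu.py.

import os


def choose_v1_compile_backend() -> str:
    # Illustrative helper: when the CI flag VLLM_CPU_CI_ENV is set to "1",
    # fall back to the "eager" backend to skip inductor compilation and shorten
    # test runs; otherwise keep "inductor" for regular CPU deployments.
    if os.environ.get("VLLM_CPU_CI_ENV", "0") == "1":
        return "eager"
    return "inductor"

Keeping the switch behind a single environment variable means the CI scripts only need to export VLLM_CPU_CI_ENV=1 (and reset it to 0 for the serving benchmark step), as the changelog below describes.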

Changelog

  • .buildkite/scripts/hardware_ci/run-cpu-test.sh
    • Added VLLM_CPU_CI_ENV=1 environment variable to docker run commands for CI (lines 27, 28).
    • Added VLLM_CPU_CI_ENV=0 before the benchmark_serving.py command (line 72).
  • docker/Dockerfile.cpu
    • Updated Dockerfile to use requirements/cpu-build.txt for build dependencies (line 69).
    • Updated Dockerfile to use requirements/cpu-test.in for test dependencies (line 112).
  • requirements/cpu-build.txt
    • Added new file specifying build dependencies for the CPU backend.
    • Pinned torch to version 2.6.0+cpu for x86_64 (line 9).
    • Pinned intel_extension_for_pytorch to version 2.6.0 for x86_64 (implied by build requirements).
  • requirements/cpu-test.in
    • Added new file specifying test dependencies for the CPU backend.
    • Pinned torch to version 2.6.0 (line 27).
    • Pinned transformers to version 4.52.4 (line 39) and tokenizers to version 0.21.1 (line 40).
  • requirements/cpu.txt
    • Pinned torch to version 2.6.0+cpu for x86_64 (line 11).
    • Pinned intel_extension_for_pytorch to version 2.6.0 for x86_64 (line 29).
  • tests/models/multimodal/generation/test_common.py
    • Added explicit dtype="bfloat16" for the minicpmo test configuration when running on CPU (line 110).
  • vllm/engine/arg_utils.py
    • Changed default_max_num_seqs to be a dictionary based on UsageContext (lines 1551, 1558).
    • Added CPU-specific default values for max_num_batched_tokens and max_num_seqs (lines 1582-1590).
    • Updated logic to use the context-aware default for max_num_seqs (lines 1612-1614); a minimal sketch of this pattern follows the changelog.
  • vllm/model_executor/models/minicpmo.py
    • Removed imports for GPTQConfig and GPTQMarlinConfig (lines 38-40).
    • Changed quantization config check from isinstance() to quant_config.get_name() in (...) (lines 520-521).
  • vllm/model_executor/models/ovis.py
    • Removed imports for GPTQConfig and GPTQMarlinConfig (lines 34-36).
    • Changed quantization config check from isinstance() to quant_config.get_name() in (...) (lines 434-435).
  • vllm/model_executor/models/qwen2_5_vl.py
    • Removed imports for GPTQConfig and GPTQMarlinConfig (lines 52-54).
    • Changed quantization config check from isinstance() to quant_config.get_name() in (...) (lines 862-863).
  • vllm/model_executor/models/qwen2_vl.py
    • Removed imports for GPTQConfig and GPTQMarlinConfig (lines 51-53).
    • Changed quantization config check from isinstance() to quant_config.get_name() in (...) (lines 1111-1112).
  • vllm/platforms/cpu.py
    • Removed the logic that forced model_config.enforce_eager = True for CPU (lines 94-95).
    • Adjusted V1 compilation backend selection to use 'eager' in CI if VLLM_CPU_CI_ENV is set, otherwise use 'inductor' (lines 178-181).
  • vllm/v1/worker/cpu_model_runner.py
    • Removed unused import contextmanager (line 2).
    • Removed the _set_global_compilation_settings context manager function (lines 73-86).
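
As a minimal illustration of the context-aware default described in the vllm/engine/arg_utils.py entries above, the pattern amounts to a small per-context lookup; the enum members and numeric values below are placeholders, not the defaults chosen by this PR.

from enum import Enum
from typing import Optional


class UsageContext(Enum):
    # Placeholder members for illustration; the real enum lives elsewhere in vLLM.
    LLM_CLASS = "llm_class"
    OPENAI_API_SERVER = "openai_api_server"


# Hypothetical per-context defaults; the actual values are chosen in arg_utils.py.
DEFAULT_MAX_NUM_SEQS = {
    UsageContext.LLM_CLASS: 256,
    UsageContext.OPENAI_API_SERVER: 128,
}


def resolve_max_num_seqs(context: UsageContext, user_value: Optional[int] = None) -> int:
    # Prefer an explicit user setting; otherwise fall back to the context-aware default.
    return user_value if user_value is not None else DEFAULT_MAX_NUM_SEQS[context]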

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request effectively addresses the torch performance regression on the x86 CPU backend by pinning to version 2.6.0. The introduction of VLLM_CPU_CI_ENV for CI-specific configurations and the new CPU-specific requirement files are positive changes. The refactoring in VLM model files to check quantization method by name instead of type improves modularity.

There are a couple of areas related to the V1 CPU backend compilation configuration in vllm/platforms/cpu.py and vllm/v1/worker/cpu_model_runner.py where further clarification on the intent and impact of the changes would be helpful. Please see the detailed comments on those files.

The PR description's "Test Result" section is currently marked as TODO; please consider updating this.

Summary of Findings

  • V1 CPU Backend Compilation Configuration: Changes to the V1 CPU backend compilation settings, specifically the removal of global inductor configurations for freezing and force_disable_caches in cpu_model_runner.py, and the removal of custom_ops += ["none"] in platforms/cpu.py, warrant clarification to ensure no unintended performance or behavior regressions in non-CI CPU V1 scenarios.
  • Torch Version Pinning: The core fix of pinning torch to 2.6.0+cpu and intel_extension_for_pytorch to 2.6.0 for the x86 CPU backend is correctly implemented and addresses the reported performance regression.
  • CI Optimization: The use of VLLM_CPU_CI_ENV to switch to an "eager" backend for CI tests is a good strategy to accelerate CI execution.
  • Quantization Check Refinement: Refactoring VLM models to check quantization method by name (quant_config.get_name()) instead of isinstance checks against specific GPTQ/Marlin config classes is a good improvement for modularity and reduces direct dependencies; a short illustrative sketch follows this list.
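
As a sketch of the quantization-check refinement, the comparison can be written against the method name so the GPTQ/Marlin config classes never need to be imported in the model files; the helper below and the exact name strings are assumptions, not the PR's code.

def uses_gptq_or_marlin(quant_config) -> bool:
    # Name-based check in place of isinstance() against GPTQConfig/GPTQMarlinConfig,
    # avoiding those imports; the compared strings are assumed method names.
    return quant_config is not None and quant_config.get_name() in ("gptq", "gptq_marlin")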

Merge Readiness

The pull request makes significant improvements and fixes a key performance issue. However, before merging, it would be beneficial to get clarification on the V1 CPU backend compilation changes mentioned in the review comments to ensure they align with the intended behavior and performance expectations. I am unable to approve this pull request myself; please ensure it is reviewed and approved by other maintainers after addressing the feedback. I recommend addressing the medium severity comments before merging.


github-actions bot commented Jun 6, 2025

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@bigPYJ1151 bigPYJ1151 force-pushed the downgrade_torch branch 2 times, most recently from 919332b to 3ba343e on June 10, 2025 05:42

mergify bot commented Jun 10, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @bigPYJ1151.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Jun 10, 2025
@bigPYJ1151 bigPYJ1151 force-pushed the downgrade_torch branch 6 times, most recently from 0dc8e7c to 4d35c56 on June 12, 2025 03:36
@mergify mergify bot removed the needs-rebase label Jun 12, 2025
@bigPYJ1151 bigPYJ1151 force-pushed the downgrade_torch branch 2 times, most recently from 018a6e7 to 17b3485 on June 12, 2025 06:16
Signed-off-by: jiang.li <[email protected]>

avoid import marlin globally

Signed-off-by: jiang.li <[email protected]>

llava test uses bf16

Signed-off-by: jiang.li <[email protected]>

refine compile config

Signed-off-by: jiang.li <[email protected]>

opt cpu default batchsize

Signed-off-by: jiang.li <[email protected]>

format

Signed-off-by: jiang.li <[email protected]>

fix llava embedding

Signed-off-by: jiang.li <[email protected]>

format

Signed-off-by: jiang.li <[email protected]>

fix import

Signed-off-by: jiang.li <[email protected]>

Revert "avoid import marlin globally"

This reverts commit d0ebbd265a443d90b99c2342abd88faf42aa9481.

Signed-off-by: jiang.li <[email protected]>

fix ipex quant

Signed-off-by: jiang.li <[email protected]>

list packages

Signed-off-by: jiang.li <[email protected]>

refine test deps

Signed-off-by: jiang1.li <[email protected]>

update compile config

Signed-off-by: jiang1.li <[email protected]>