[CPU] Fix torch version in x86 CPU backend and refine default configurations #19258
Conversation
Hello @bigPYJ1151, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
Summary of Changes
Hello! Gemini or gemini-code-assist here, providing a summary of this pull request. This PR primarily focuses on improving the x86 CPU backend by pinning PyTorch to 2.6.0 to avoid a performance regression found in PyTorch 2.7.0. It also aims to accelerate the CPU CI pipeline and refines some default configurations specific to the CPU backend. Additionally, it updates how quantization configurations are handled in some VLM models to avoid unnecessary imports.
Highlights
- PyTorch Version Pinning: The PyTorch version for the x86 CPU backend is explicitly pinned to 2.6.0 in the requirement files (`requirements/cpu.txt`, `requirements/cpu-build.txt`, `requirements/cpu-test.in`). This avoids a performance regression in random number generation observed in PyTorch 2.7.0 on x86 platforms, which significantly impacts vLLM's random sampling performance.
- CPU CI Acceleration: New requirement files (`cpu-build.txt`, `cpu-test.in`) are introduced specifically for CPU builds and tests. The CI script (`run-cpu-test.sh`) is modified to set a `VLLM_CPU_CI_ENV` variable, which is then used in `vllm/platforms/cpu.py` to switch the V1 compilation backend to 'eager' mode during CI runs, reducing compilation time and accelerating tests (a rough sketch follows this list).
- Refined CPU Configurations: Default values for `max_num_batched_tokens` and `max_num_seqs` are refined specifically for the CPU backend in `vllm/engine/arg_utils.py`, and the logic for setting the default `max_num_seqs` is updated to be context-aware. The forced `enforce_eager=True` setting for CPU in `vllm/platforms/cpu.py` is also removed.
- Quantization Config Handling: In several VLM model files (`minicpmo.py`, `ovis.py`, `qwen2_5_vl.py`, `qwen2_vl.py`), the check for GPTQ/Marlin quantization configurations is changed from `isinstance()` to checking the quantization config's name via `get_name()`. This avoids needing to import the specific GPTQ/Marlin config classes in these files.
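To illustrate the CI switch described in the "CPU CI Acceleration" item, the backend selection in `vllm/platforms/cpu.py` can be thought of as roughly the following; the helper name and the exact way the value feeds into the compilation config are assumptions for illustration, not the PR's literal code.

```python
import os


def choose_cpu_compile_backend() -> str:
    """Pick the V1 compilation backend for the CPU platform.

    Sketch only: when the CI script exports VLLM_CPU_CI_ENV=1, fall back to
    "eager" to skip Inductor compilation and keep test runs fast; otherwise
    keep "inductor" for best runtime performance.
    """
    if os.environ.get("VLLM_CPU_CI_ENV", "0") == "1":
        return "eager"
    return "inductor"
```

In the CI script this pairs with `VLLM_CPU_CI_ENV=1` on the docker run commands and `VLLM_CPU_CI_ENV=0` before the serving benchmark step, as listed in the changelog below.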
Changelog
- `.buildkite/scripts/hardware_ci/run-cpu-test.sh`
  - Added the `VLLM_CPU_CI_ENV=1` environment variable to the docker run commands for CI (lines 27, 28).
  - Added `VLLM_CPU_CI_ENV=0` before the `benchmark_serving.py` command (line 72).
- `docker/Dockerfile.cpu`
  - Updated the Dockerfile to use `requirements/cpu-build.txt` for build dependencies (line 69).
  - Updated the Dockerfile to use `requirements/cpu-test.in` for test dependencies (line 112).
- `requirements/cpu-build.txt`
  - Added a new file specifying build dependencies for the CPU backend.
  - Pinned `torch` to version `2.6.0+cpu` for x86_64 (line 9).
  - Pinned `intel_extension_for_pytorch` to version `2.6.0` for x86_64 (implied by the build requirements).
- `requirements/cpu-test.in`
  - Added a new file specifying test dependencies for the CPU backend.
  - Pinned `torch` to version `2.6.0` (line 27).
  - Pinned `transformers` to version `4.52.4` (line 39) and `tokenizers` to version `0.21.1` (line 40).
- `requirements/cpu.txt`
  - Pinned `torch` to version `2.6.0+cpu` for x86_64 (line 11).
  - Pinned `intel_extension_for_pytorch` to version `2.6.0` for x86_64 (line 29).
- `tests/models/multimodal/generation/test_common.py`
  - Added explicit `dtype="bfloat16"` for the `minicpmo` test configuration when running on CPU (line 110).
- `vllm/engine/arg_utils.py` (see the sketch after this changelog)
  - Changed `default_max_num_seqs` to be a dictionary keyed by `UsageContext` (lines 1551, 1558).
  - Added CPU-specific default values for `max_num_batched_tokens` and `max_num_seqs` (lines 1582-1590).
  - Updated the logic to use the context-aware default for `max_num_seqs` (lines 1612-1614).
- `vllm/model_executor/models/minicpmo.py`
  - Removed the imports for `GPTQConfig` and `GPTQMarlinConfig` (lines 38-40).
  - Changed the quantization config check from `isinstance()` to `quant_config.get_name() in (...)` (lines 520-521).
- `vllm/model_executor/models/ovis.py`
  - Removed the imports for `GPTQConfig` and `GPTQMarlinConfig` (lines 34-36).
  - Changed the quantization config check from `isinstance()` to `quant_config.get_name() in (...)` (lines 434-435).
- `vllm/model_executor/models/qwen2_5_vl.py`
  - Removed the imports for `GPTQConfig` and `GPTQMarlinConfig` (lines 52-54).
  - Changed the quantization config check from `isinstance()` to `quant_config.get_name() in (...)` (lines 862-863).
- `vllm/model_executor/models/qwen2_vl.py`
  - Removed the imports for `GPTQConfig` and `GPTQMarlinConfig` (lines 51-53).
  - Changed the quantization config check from `isinstance()` to `quant_config.get_name() in (...)` (lines 1111-1112).
- `vllm/platforms/cpu.py`
  - Removed the logic that forced `model_config.enforce_eager = True` for CPU (lines 94-95).
  - Adjusted the V1 compilation backend selection to use 'eager' in CI when `VLLM_CPU_CI_ENV` is set, and 'inductor' otherwise (lines 178-181).
- `vllm/v1/worker/cpu_model_runner.py`
  - Removed the unused import `contextmanager` (line 2).
  - Removed the `_set_global_compilation_settings` context manager function (lines 73-86).
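To make the `vllm/engine/arg_utils.py` entries more concrete, a dictionary of per-`UsageContext` defaults might be consulted as sketched below. The enum members and the numeric values are illustrative assumptions (a local stand-in enum is defined here), not the values chosen in the PR, which also adds CPU-specific overrides.

```python
from enum import Enum
from typing import Optional


class UsageContext(Enum):
    # Stand-in for vLLM's UsageContext; member names are assumptions here.
    LLM_CLASS = "LLM_CLASS"
    OPENAI_API_SERVER = "OPENAI_API_SERVER"


# Hypothetical per-context defaults; the PR additionally adds CPU-specific
# values for max_num_batched_tokens and max_num_seqs.
DEFAULT_MAX_NUM_SEQS = {
    UsageContext.LLM_CLASS: 256,
    UsageContext.OPENAI_API_SERVER: 128,
}


def resolve_max_num_seqs(usage_context: UsageContext,
                         user_value: Optional[int] = None) -> int:
    """Return the user-provided value if set, else the context-aware default."""
    if user_value is not None:
        return user_value
    return DEFAULT_MAX_NUM_SEQS.get(usage_context, 128)
```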
Code Review
This pull request effectively addresses the torch performance regression on the x86 CPU backend by pinning to version 2.6.0. The introduction of `VLLM_CPU_CI_ENV` for CI-specific configurations and the new CPU-specific requirement files are positive changes. The refactoring in VLM model files to check the quantization method by name instead of by type improves modularity.
There are a couple of areas related to the V1 CPU backend compilation configuration in `vllm/platforms/cpu.py` and `vllm/v1/worker/cpu_model_runner.py` where further clarification on the intent and impact of the changes would be helpful. Please see the detailed comments on those files.
The PR description's "Test Result" section is currently marked as TODO; please consider updating this.
Summary of Findings
- V1 CPU Backend Compilation Configuration: Changes to the V1 CPU backend compilation settings, specifically the removal of the global inductor configurations for `freezing` and `force_disable_caches` in `cpu_model_runner.py` and the removal of `custom_ops += ["none"]` in `platforms/cpu.py`, warrant clarification to ensure there are no unintended performance or behavior regressions in non-CI CPU V1 scenarios.
- Torch Version Pinning: The core fix of pinning `torch` to `2.6.0+cpu` and `intel_extension_for_pytorch` to `2.6.0` for the x86 CPU backend is correctly implemented and addresses the reported performance regression.
- CI Optimization: The use of `VLLM_CPU_CI_ENV` to switch to an "eager" backend for CI tests is a good strategy to accelerate CI execution.
- Quantization Check Refinement: Refactoring VLM models to check the quantization method by name (`quant_config.get_name()`) instead of `isinstance` checks against specific GPTQ/Marlin config classes is a good improvement for modularity and reduces direct dependencies (a minimal before/after sketch follows this list).
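As a minimal before/after sketch of that quantization check refactor (the import paths in the comments and the exact set of method names are assumptions; in the PR the check is inlined at the call sites in the touched model files):

```python
# Before: each VLM file had to import the concrete config classes, e.g.
#   from vllm.model_executor.layers.quantization.gptq import GPTQConfig
#   from vllm.model_executor.layers.quantization.gptq_marlin import GPTQMarlinConfig
#   is_gptq_like = isinstance(quant_config, (GPTQConfig, GPTQMarlinConfig))

# After: compare the quantization method's registered name instead, so the
# GPTQ/Marlin config classes no longer need to be imported in the model files.
def is_gptq_like(quant_config) -> bool:
    """Sketch of a name-based check; "gptq"/"gptq_marlin" are assumed names."""
    return (quant_config is not None
            and quant_config.get_name() in ("gptq", "gptq_marlin"))
```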
Merge Readiness
The pull request makes significant improvements and fixes a key performance issue. However, before merging, it would be beneficial to get clarification on the V1 CPU backend compilation changes mentioned in the review comments to ensure they align with the intended behavior and performance expectations. I am unable to approve this pull request myself; please ensure it is reviewed and approved by other maintainers after addressing the feedback. I recommend addressing the medium severity comments before merging.
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
This pull request has merge conflicts that must be resolved before it can be merged.
Commit messages (each signed off by jiang.li / jiang1.li <[email protected]>):
- avoid import marlin globally
- llava test uses bf16
- refine compile config
- opt cpu default batchsize
- format
- fix llava embedding
- format
- fix import
- Revert "avoid import marlin globally" (reverts commit d0ebbd265a443d90b99c2342abd88faf42aa9481)
- fix ipex quant
- list packages
- refine test deps
- update compile config
Essential Elements of an Effective PR Description Checklist
Purpose
Test Plan
offline inference
Test Result
```
VLLM_CPU_KVCACHE_SPACE=40 \
VLLM_CPU_OMP_THREADS_BIND="128-158|160-190|192-222|224-254" \
python3 benchmark_throughput.py --backend=vllm \
    --dataset=./ShareGPT_V3_unfiltered_cleaned_split.json -tp=4 \
    --model=meta-llama/Meta-Llama-3-8B-Instruct --num-prompts=1000 \
    --dtype=bfloat16 --trust-remote-code

Throughput: 1.32 requests/s, 544.45 total tokens/s, 261.13 output tokens/s
```

This PR:

```
VLLM_CPU_KVCACHE_SPACE=40 \
VLLM_CPU_OMP_THREADS_BIND="128-158|160-190|192-222|224-254" \
python3 benchmark_throughput.py --backend=vllm \
    --dataset=./ShareGPT_V3_unfiltered_cleaned_split.json -tp=4 \
    --model=meta-llama/Meta-Llama-3-8B-Instruct --num-prompts=1000 \
    --dtype=bfloat16 --trust-remote-code

Throughput: 2.97 requests/s, 1228.82 total tokens/s, 589.37 output tokens/s
```