[Tests] V1 EAGLE Tests for Acceptance Rate #19104
Conversation
Commits signed off by Benjamin Chislett <[email protected]>
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a small, essential subset of CI tests runs automatically to quickly catch errors. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge. 🚀
Hello @benchislett, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
Summary of Changes
This PR, authored by @benchislett, introduces new tests for the V1 EAGLE speculative decoding feature, specifically focusing on measuring and asserting the acceptance rate of drafted tokens. It includes a helper function to calculate various acceptance rate metrics, new prompt fixtures tailored for testing acceptance rates, and new test functions to evaluate both the ngram and EAGLE/EAGLE3 methods based on these metrics. Additionally, it refines the model loading logic for EAGLE models to handle `embed_tokens` sharing more robustly, particularly in the context of pipeline parallelism, and includes minor adjustments to existing correctness test assertions and memory profiling utilities.
Highlights
- New Acceptance Rate Tests: Adds new end-to-end tests (`test_ngram_acceptance_rate`, `test_eagle_acceptance_rate`) in `tests/v1/e2e/test_spec_decode.py` to measure and assert the acceptance rate of speculative decoding for both the ngram and EAGLE/EAGLE3 methods.
- Acceptance Metrics Helper: Introduces a new helper function `get_spec_acceptance_metrics` in `tests/v1/e2e/test_spec_decode.py` to parse speculative decoding metrics and calculate key acceptance statistics such as the number of drafts, the number of accepted tokens, the acceptance rate per position, and the mean acceptance length (see the sketch after this list).
- New Test Prompts: Adds new pytest fixtures (`test_ngram_acceptance_rate_prompts`, `test_draft_acceptance_rate_prompts`) in `tests/v1/e2e/test_spec_decode.py` to provide specific prompts suitable for testing speculative decoding acceptance rates.
- Refined EAGLE Model Loading: Updates the `EagleProposer` model loading logic in `vllm/v1/spec_decode/eagle.py` and the EAGLE/EAGLE3 model definitions (`llama_eagle.py`, `llama_eagle3.py`) to correctly handle sharing of `embed_tokens` with the target model only when pipeline parallelism is not used and the embedding shapes match. This ensures correctness when PP > 1 or when the draft and target models have different embedding sizes.
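To make the helper's role concrete, here is a minimal sketch of what `get_spec_acceptance_metrics` could look like. It is not the PR's actual code: the metric names and the `Counter`/`Vector` classes follow vLLM's v1 metrics reader, but the exact statistics returned are an assumption based on the summary above.

```python
import numpy as np

from vllm.v1.metrics.reader import Counter, Metric, Vector


def get_spec_acceptance_metrics(metrics: list[Metric]) -> dict:
    """Aggregate speculative-decoding acceptance stats from engine metrics."""
    num_drafts = 0
    num_accepted = 0
    accepted_per_pos: list[int] = []
    for metric in metrics:
        if metric.name == "vllm:spec_decode_num_drafts":
            assert isinstance(metric, Counter)
            num_drafts += metric.value
        elif metric.name == "vllm:spec_decode_num_accepted_tokens":
            assert isinstance(metric, Counter)
            num_accepted += metric.value
        elif metric.name == "vllm:spec_decode_num_accepted_tokens_per_pos":
            assert isinstance(metric, Vector)
            accepted_per_pos = metric.values
    return {
        "num_drafts": num_drafts,
        "num_accepted_tokens": num_accepted,
        # Fraction of drafts whose token at each draft position was accepted.
        "acceptance_rate_per_pos":
            np.array(accepted_per_pos) / max(num_drafts, 1),
        # Mean tokens emitted per draft step, counting the bonus token.
        "mean_acceptance_length": 1 + num_accepted / max(num_drafts, 1),
    }
```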
Changelog
- `benchmarks/kernels/bench_fp8_gemm.py`
  - Changes the import of `triton` from the global namespace to `vllm.triton_utils` (line 11).
- `tests/v1/e2e/test_spec_decode.py`
  - Adds an import for `numpy` (line 7).
  - Adds an import for `Metric` from `vllm.v1.metrics.reader` (line 11).
  - Adds the `get_spec_acceptance_metrics` function to calculate acceptance rate metrics (lines 14-35).
  - Adds the `test_ngram_acceptance_rate_prompts` fixture (lines 72-84).
  - Adds the `test_draft_acceptance_rate_prompts` fixture (lines 88-96).
  - Adjusts the correctness assertion heuristic in `test_ngram_correctness` from 70% to 65% and changes the denominator calculation (line 157).
  - Adjusts the correctness assertion heuristic in `test_eagle_correctness` from 66% to 65% (line 206).
  - Adds the `test_ngram_acceptance_rate` function to test the ngram acceptance rate (lines 210-247); a usage sketch follows this entry.
  - Adds the `test_eagle_acceptance_rate` function to test the EAGLE/EAGLE3 acceptance rate, parameterized by `use_eagle3` (lines 250-294).
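As a usage illustration (again, not the PR's exact test), an acceptance-rate test built on the helper sketched earlier could look like this; the model name, prompt, and thresholds are placeholders:

```python
from vllm import LLM, SamplingParams

# Assumes get_spec_acceptance_metrics from the earlier sketch is in scope.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model
    speculative_config={
        "method": "ngram",
        "num_speculative_tokens": 3,
        "prompt_lookup_max": 3,
    },
    disable_log_stats=False,  # metrics must stay enabled for get_metrics()
)
# A highly repetitive prompt should give the ngram drafter easy matches.
outputs = llm.generate(
    ["Repeat after me: the quick brown fox. The quick brown fox. The"],
    SamplingParams(temperature=0, max_tokens=64),
)
stats = get_spec_acceptance_metrics(llm.get_metrics())
# Placeholder threshold: assert a floor on draft efficiency.
assert stats["mean_acceptance_length"] > 1.5
```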
- `tests/v1/spec_decode/test_eagle.py`
  - Adds an import for `LlamaForCausalLM` (line 11).
  - Parametrizes `test_load_model` with `pp_size` and `use_distinct_embed_tokens` (lines 120-121).
  - Updates the mock setup in `test_load_model` to use `create_autospec` for the target model and to handle different `embed_tokens` shapes for the draft model mock (lines 128-134); see the sketch after this entry.
  - Updates the mock setup for `pp_group.world_size` to use the parameterized `pp_size` (line 153).
  - Adds assertions in `test_load_model` to verify the `embed_tokens` sharing logic based on `pp_size` and `use_distinct_embed_tokens` (lines 185-194).
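For context, `create_autospec` produces a mock that passes `isinstance` checks against the spec'd class, which is presumably why it is used for the target model here. A hypothetical illustration (the attribute wiring is ours, not the PR's):

```python
import torch
from unittest.mock import MagicMock, create_autospec

from vllm.model_executor.models.llama import LlamaForCausalLM

# The autospec'd mock satisfies isinstance(..., LlamaForCausalLM).
target_model = create_autospec(LlamaForCausalLM, instance=True)

# Instance attributes are not part of the class spec, so attach a plain mock
# carrying a concrete embed_tokens weight for shape comparisons.
target_model.model = MagicMock()
target_model.model.embed_tokens.weight = torch.zeros(128256, 4096)
```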
- `vllm/model_executor/models/llama_eagle.py`
  - Removes the conditional creation of `embed_tokens` based on `get_pp_group().world_size` (lines 57-63); the `VocabParallelEmbedding` is now always created. A sketch of the sharing condition follows this entry.
  - Removes the `return` statement from `load_weights` (line 164).
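The sharing condition described in the summary might reduce to something like the following helper. This is a sketch under the stated assumptions (the function and variable names are illustrative), not the code in `vllm/v1/spec_decode/eagle.py`:

```python
import torch.nn as nn

from vllm.distributed.parallel_state import get_pp_group


def maybe_share_embed_tokens(draft_model: nn.Module,
                             target_model: nn.Module) -> None:
    """Share the target's embed_tokens with the draft only when safe."""
    draft_embed = draft_model.model.embed_tokens
    target_embed = target_model.model.embed_tokens
    # Sharing requires that the target's input embeddings live on this PP
    # rank (world_size == 1) and that the vocab/hidden shapes line up.
    if (get_pp_group().world_size == 1
            and draft_embed.weight.shape == target_embed.weight.shape):
        draft_model.model.embed_tokens = target_embed
```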
- `vllm/model_executor/models/llama_eagle3.py`
  - Removes the import of `get_pp_group` (line 12).
  - Removes the conditional creation of `embed_tokens` based on `get_pp_group().world_size` (lines 97-103); the `VocabParallelEmbedding` is now always created.
  - Adds an `includes_embed_tokens` flag in `load_weights` to track whether embed tokens are present in the loaded weights (line 239).
  - Adds logic in `load_weights` to skip loading `embed_tokens` when they are not included in the provided weights (lines 255-256); see the sketch after this entry.
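The skip logic could be factored roughly as below. The real `load_weights` routes through vLLM's weight-loading utilities, so treat this purely as a control-flow sketch with illustrative names:

```python
from typing import Iterable

import torch


def plan_weight_loading(
    weights: Iterable[tuple[str, torch.Tensor]],
) -> tuple[list[tuple[str, torch.Tensor]], list[str]]:
    """Return materialized weights plus name substrings exempt from loading."""
    weights = list(weights)
    # Flag whether the draft checkpoint ships its own embeddings.
    includes_embed_tokens = any(
        "embed_tokens" in name for name, _ in weights)
    # If the draft checkpoint ships no embeddings, keep the module's existing
    # (possibly target-shared) embed_tokens instead of failing the load.
    skip_substrs = [] if includes_embed_tokens else ["embed_tokens"]
    return weights, skip_substrs
```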
- `vllm/platforms/cuda.py`
  - Adds `torch.cuda.empty_cache()` before `torch.cuda.reset_peak_memory_stats` in `get_current_memory_usage` (line 160).
- `vllm/utils.py`
  - Adds `gc.collect()` before calling `current_platform.get_current_memory_usage` in `MemoryProfiler.current_memory_usage` (line 901).
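Taken together, the two memory-profiling tweaks amount to flushing Python and CUDA caches before reading allocator statistics. A standalone illustration (not the PR's exact code):

```python
import gc

import torch


def reset_and_read_peak_memory(device: torch.device) -> float:
    """Peak GPU memory in bytes, measured after flushing stale caches."""
    # Drop unreachable Python objects so their CUDA tensors can be freed...
    gc.collect()
    # ...and return freed blocks from PyTorch's caching allocator to CUDA,
    # so the peak-statistics reset starts from a clean slate.
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats(device)
    return torch.cuda.max_memory_allocated(device)
```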
Using Gemini Code Assist
The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.
Invoking Gemini
You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e., a comment on the pull request page) using either `/gemini <command>` or `@gemini-code-assist <command>`. Below is a summary of the supported commands.
| Feature | Command | Description |
| --- | --- | --- |
| Code Review | `/gemini review` | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | `/gemini summary` | Provides a summary of the current pull request in its current state. |
| Comment | `@gemini-code-assist` | Responds in comments when explicitly tagged, both in issue comments and review comments. |
| Help | `/gemini help` | Displays a list of available commands. |
Customization
To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a `.gemini/` folder in the base of the repository. Detailed instructions can be found here.
Limitations & Feedback
Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.
You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.
Tests for speed and rate,
Speculative tokens await.
Will they be accepted?
Metrics are collected,
Efficiency is the fate.
Footnotes
¹ Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.
Code Review
This pull request introduces valuable end-to-end tests for the EAGLE speculative decoding acceptance rate and includes necessary fixes related to `embed_tokens` sharing in EAGLE models under pipeline parallelism. The new tests are well structured and provide good coverage of the acceptance-rate behavior. The changes to the EAGLE model loading logic appear correct and address potential issues with differing hidden sizes and parallel configurations.
I've reviewed the changes focusing on correctness, efficiency, and maintainability, adhering to common Python practices (such as PEP 8) since no specific style guide was provided. Overall, the code is clean and readable. I've identified one high-severity issue that needs to be addressed before merging, along with some minor points noted in the summary.
Summary of Findings
- `load_weights` return value: The `load_weights` method in `vllm/model_executor/models/llama_eagle.py` no longer returns the set of loaded parameters, which might break callers expecting this return value. (High severity; a sketch of a possible fix follows this list.)
- Magic numbers in tests: The specific acceptance rate thresholds (e.g., 0.90, 0.75, 0.4) used in the new acceptance rate tests are hardcoded. While common in tests, adding comments explaining the rationale behind these specific values would improve clarity for future maintainers. (Low severity)
- Memory profiling accuracy: Added `torch.cuda.empty_cache()` and `gc.collect()` calls in memory-profiling-related functions (`vllm/platforms/cuda.py` and `vllm/utils.py`) to improve the accuracy of memory usage measurements. (Low severity)
- Triton import: Changed the import of `triton` in `benchmarks/kernels/bench_fp8_gemm.py` to use the internal `vllm.triton_utils` module. (Low severity)
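A minimal sketch of the kind of fix the first finding asks for, i.e. restoring the conventional set-of-loaded-parameters return value; the class is an illustrative stub, not vLLM's actual model:

```python
from typing import Iterable

import torch
from torch import nn


class DraftModelStub(nn.Module):
    """Illustrative stand-in for the EAGLE draft model class."""

    def load_weights(
            self, weights: Iterable[tuple[str, torch.Tensor]]) -> set[str]:
        loaded_params: set[str] = set()
        params = dict(self.named_parameters())
        for name, loaded_weight in weights:
            params[name].data.copy_(loaded_weight)
            loaded_params.add(name)
        # Returning the set lets callers verify that every expected
        # parameter was initialized from the checkpoint.
        return loaded_params
```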
Merge Readiness
This pull request introduces important tests and fixes. However, the high-severity issue regarding the `load_weights` return value in `vllm/model_executor/models/llama_eagle.py` needs to be addressed before this can be merged. Once that is resolved, the PR should be in good shape. I am unable to approve this pull request; please have other reviewers review and approve this code before merging.
This pull request has merge conflicts that must be resolved before it can be merged.
First draft of some additional EAGLE tests.
Depends on my other PR (#19033); I haven't taken the time to separate the branches, so this is marked as a draft for now.