
[Tests] V1 EAGLE Tests for Acceptance Rate #19104

Draft · wants to merge 9 commits into main
Conversation

@benchislett (Collaborator) commented Jun 3, 2025

First draft at some additional EAGLE tests.

Depends on my other PR (#19033); I haven't taken the time to separate the branches yet, so this is marked as a draft for now.

github-actions bot commented Jun 3, 2025

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small and essential subset of tests to catch errors quickly. You can run additional CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@gemini-code-assist bot (Contributor) left a comment

Hello @benchislett, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

Summary of Changes

Hello! Gemini or gemini-code-assist here, providing a summary of this pull request.

This PR, authored by @benchislett, introduces new tests for the V1 EAGLE speculative decoding feature, specifically focusing on measuring and asserting the acceptance rate of drafted tokens. It includes a helper function to calculate various acceptance rate metrics, new prompt fixtures tailored for testing acceptance rates, and adds new test functions to evaluate both the ngram and EAGLE/EAGLE3 methods based on these metrics. Additionally, it refines the model loading logic for EAGLE models to handle embed_tokens sharing more robustly, particularly in the context of pipeline parallelism, and includes minor adjustments to existing correctness test assertions and memory profiling utilities.
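
To make the acceptance-rate bookkeeping concrete, here is a minimal sketch of what such a helper could look like. It assumes the vllm:spec_decode_* counter names and the Metric/Vector objects exposed by vllm.v1.metrics.reader; the function name follows the PR's description, but the body is illustrative rather than the PR's exact code.

```python
# Illustrative sketch only: assumes the vllm:spec_decode_* counters and the
# Metric objects returned by vLLM's V1 metrics reader; the helper added in
# this PR may differ in details.
import numpy as np

from vllm.v1.metrics.reader import Metric


def get_spec_acceptance_metrics(metrics: list[Metric]) -> dict:
    num_drafts = 0
    num_draft_tokens = 0
    num_accepted_tokens = 0
    accepted_per_pos: list[int] = []

    for metric in metrics:
        if metric.name == "vllm:spec_decode_num_drafts":
            num_drafts += metric.value
        elif metric.name == "vllm:spec_decode_num_draft_tokens":
            num_draft_tokens += metric.value
        elif metric.name == "vllm:spec_decode_num_accepted_tokens":
            num_accepted_tokens += metric.value
        elif metric.name == "vllm:spec_decode_num_accepted_tokens_per_pos":
            # Vector metric: one counter per speculative position.
            accepted_per_pos = list(metric.values)

    num_drafts = max(num_drafts, 1)  # avoid division by zero in empty runs
    return {
        "num_drafts": num_drafts,
        "num_draft_tokens": num_draft_tokens,
        "num_accepted_tokens": num_accepted_tokens,
        # Fraction of drafts whose token at each position was accepted.
        "acceptance_rate_per_pos": np.array(accepted_per_pos) / num_drafts,
        # Average tokens emitted per draft step, counting the bonus token.
        "mean_acceptance_length": 1 + num_accepted_tokens / num_drafts,
    }
```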

Highlights

  • New Acceptance Rate Tests: Adds new end-to-end tests (test_ngram_acceptance_rate, test_eagle_acceptance_rate) in tests/v1/e2e/test_spec_decode.py to measure and assert the acceptance rate of speculative decoding for both ngram and EAGLE/EAGLE3 methods. A hedged sketch of such a test appears after this list.
  • Acceptance Metrics Helper: Introduces a new helper function get_spec_acceptance_metrics in tests/v1/e2e/test_spec_decode.py to parse speculative decoding metrics and calculate key acceptance statistics like number of drafts, accepted tokens, acceptance rate per position, and mean acceptance length.
  • New Test Prompts: Adds new pytest fixtures (test_ngram_acceptance_rate_prompts, test_draft_acceptance_rate_prompts) in tests/v1/e2e/test_spec_decode.py to provide specific prompts suitable for testing speculative decoding acceptance rates.
  • Refined EAGLE Model Loading: Updates the EagleProposer model loading logic in vllm/v1/spec_decode/eagle.py and the EAGLE/EAGLE3 model definitions (llama_eagle.py, llama_eagle3.py) to correctly handle sharing of embed_tokens with the target model only when pipeline parallelism is not used and the embedding shapes match. This ensures correctness when PP > 1 or when draft/target models have different embedding sizes.

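As a companion to the helper sketched earlier, here is a hedged sketch of the shape an acceptance-rate test can take. The target and draft model names, the speculative_config fields, the sampling settings, the use of LLM.get_metrics(), and the asserted threshold are assumptions for illustration, not the exact values or code in the PR.

```python
# Hedged sketch of an EAGLE acceptance-rate test. Model names, the
# speculative_config contents, and the asserted threshold are assumptions
# for illustration; the PR's tests may use different values.
from vllm import LLM, SamplingParams


def test_eagle_acceptance_rate_sketch():
    llm = LLM(
        model="meta-llama/Llama-3.1-8B-Instruct",
        speculative_config={
            "method": "eagle",
            "model": "yuhuili/EAGLE-LLaMA3.1-Instruct-8B",
            "num_speculative_tokens": 3,
        },
        max_model_len=2048,
    )
    prompts = ["The capital of France is"]
    llm.generate(prompts, SamplingParams(temperature=0, max_tokens=64))

    # Aggregate the spec-decode counters exposed after generation
    # (assuming the LLM.get_metrics() accessor from the V1 engine).
    stats = get_spec_acceptance_metrics(llm.get_metrics())

    # With greedy sampling, a healthy EAGLE head should accept well over
    # one extra token per draft on average (threshold chosen arbitrarily).
    assert stats["mean_acceptance_length"] > 1.5
```
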
Changelog

  • benchmarks/kernels/bench_fp8_gemm.py
    • Changes the import of triton from the global namespace to vllm.triton_utils (line 11).
  • tests/v1/e2e/test_spec_decode.py
    • Adds import for numpy (line 7).
    • Adds import for Metric from vllm.v1.metrics.reader (line 11).
    • Adds get_spec_acceptance_metrics function to calculate acceptance rate metrics (lines 14-35).
    • Adds test_ngram_acceptance_rate_prompts fixture (lines 72-84).
    • Adds test_draft_acceptance_rate_prompts fixture (lines 88-96).
    • Adjusts the correctness assertion heuristic in test_ngram_correctness from 70% to 65% and changes the denominator calculation (line 157).
    • Adjusts the correctness assertion heuristic in test_eagle_correctness from 66% to 65% (line 206).
    • Adds test_ngram_acceptance_rate function to test ngram acceptance rate (lines 210-247).
    • Adds test_eagle_acceptance_rate function to test EAGLE/EAGLE3 acceptance rate, parameterized by use_eagle3 (lines 250-294).
  • tests/v1/spec_decode/test_eagle.py
    • Adds import for LlamaForCausalLM (line 11).
    • Parametrizes test_load_model with pp_size and use_distinct_embed_tokens (lines 120-121).
    • Updates mock setup in test_load_model to use create_autospec for the target model and handle different embed_tokens shapes for the draft model mock (lines 128-134).
    • Updates the mock setup for pp_group.world_size to use the parameterized pp_size (line 153).
    • Adds assertions in test_load_model to verify embed_tokens sharing logic based on pp_size and use_distinct_embed_tokens (lines 185-194).
  • vllm/model_executor/models/llama_eagle.py
    • Removes the conditional creation of embed_tokens based on get_pp_group().world_size (lines 57-63). VocabParallelEmbedding is now always created. A sketch of the resulting sharing check appears after this changelog.
    • Removes the return statement from load_weights (line 164).
  • vllm/model_executor/models/llama_eagle3.py
    • Removes import for get_pp_group (line 12).
    • Removes the conditional creation of embed_tokens based on get_pp_group().world_size (lines 97-103). VocabParallelEmbedding is now always created.
    • Adds a flag includes_embed_tokens in load_weights to track if embed tokens are present in the loaded weights (line 239).
    • Adds logic in load_weights to skip loading embed_tokens if they are not included in the provided weights (lines 255-256).
  • vllm/platforms/cuda.py
    • Adds torch.cuda.empty_cache() before torch.cuda.reset_peak_memory_stats in get_current_memory_usage (line 160).
  • vllm/utils.py
    • Adds gc.collect() before calling current_platform.get_current_memory_usage in MemoryProfiler.current_memory_usage (line 901).
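
The embed_tokens changes above hinge on a simple condition: the draft model may reuse the target model's embedding table only when pipeline parallelism is not in play and the two embedding tensors have the same shape. Below is a minimal sketch of that check; the helper name and attribute paths are assumptions, not the exact code in vllm/v1/spec_decode/eagle.py.

```python
# Minimal sketch of the sharing condition described above; the helper name
# and attribute paths are assumptions, not the PR's exact implementation.
from vllm.distributed import get_pp_group


def maybe_share_embed_tokens(draft_model, target_model) -> None:
    """Point the draft model at the target model's embed_tokens when safe."""
    no_pipeline_parallel = get_pp_group().world_size == 1
    same_shape = (
        draft_model.model.embed_tokens.weight.shape
        == target_model.model.embed_tokens.weight.shape
    )
    if no_pipeline_parallel and same_shape:
        # Safe to share: saves memory and guarantees identical embeddings.
        draft_model.model.embed_tokens = target_model.model.embed_tokens
    # Otherwise the draft model keeps the embed_tokens it loaded itself
    # (VocabParallelEmbedding is now always created by the draft model).
```
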
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e., a comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.

  • Code Review (/gemini review): Performs a code review for the current pull request in its current state.
  • Pull Request Summary (/gemini summary): Provides a summary of the current pull request in its current state.
  • Comment (@gemini-code-assist): Responds in comments when explicitly tagged, both in issue comments and review comments.
  • Help (/gemini help): Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.

You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.


Tests for speed and rate,
Speculative tokens await.
Will they be accepted?
Metrics are collected,
Efficiency is the fate.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

mergify bot added the v1 label on Jun 3, 2025

@gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request introduces valuable end-to-end tests for the EAGLE speculative decoding acceptance rate and includes necessary fixes related to embed_tokens sharing in EAGLE models under pipeline parallelism. The new tests are well-structured and provide good coverage for the acceptance rate aspect. The changes to the EAGLE model loading logic seem correct and address potential issues with different hidden sizes and parallel configurations.

I've reviewed the changes focusing on correctness, efficiency, and maintainability, adhering to common Python practices (like PEP 8) as no specific style guide was provided. Overall, the code is clean and readable. I've identified one high-severity issue that needs to be addressed before merging, along with some minor points noted in the summary.

Summary of Findings

  • load_weights return value: The load_weights method in vllm/model_executor/models/llama_eagle.py no longer returns the set of loaded parameters, which might break callers expecting this return value. (High severity)
  • Magic numbers in tests: The specific acceptance rate thresholds (e.g., 0.90, 0.75, 0.4) used in the new acceptance rate tests are hardcoded. While common in tests, adding comments explaining the rationale behind these specific values would improve clarity for future maintainers. (Low severity)
  • Memory profiling accuracy: Added torch.cuda.empty_cache() and gc.collect() calls in memory-profiling-related functions (vllm/platforms/cuda.py and vllm/utils.py) to improve the accuracy of memory usage measurements; a sketch follows this list. (Low severity)
  • Triton import: Changed the import of triton in benchmarks/kernels/bench_fp8_gemm.py to use the internal vllm.triton_utils module. (Low severity)
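
To illustrate the memory-profiling point, here is a simplified view of the intended measurement order. It is illustrative only and condenses the real get_current_memory_usage / MemoryProfiler code paths in vllm/platforms/cuda.py and vllm/utils.py.

```python
# Illustrative only: a simplified view of why the profiling helpers collect
# garbage and empty the CUDA cache before reading memory statistics; the
# real code lives in vllm/platforms/cuda.py and vllm/utils.py.
import gc

import torch


def current_gpu_memory_usage(device=None) -> float:
    # Free tensors that are only kept alive by dead Python references, so
    # they no longer count toward torch.cuda.memory_allocated().
    gc.collect()
    # Hand cached-but-unused blocks back to the driver so device-level
    # queries are not inflated by the caching allocator.
    torch.cuda.empty_cache()
    # Reset the peak so the next reading reflects current live usage.
    torch.cuda.reset_peak_memory_stats(device)
    return torch.cuda.max_memory_allocated(device)
```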

Merge Readiness

This pull request introduces important tests and fixes. However, the high-severity issue regarding the load_weights return value in vllm/model_executor/models/llama_eagle.py needs to be addressed before this can be merged. Once that is resolved, the PR should be in good shape. I am unable to approve this pull request; please have other reviewers review and approve this code before merging.
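
For context on that finding: load_weights methods on vLLM model classes conventionally return the set of parameter names they loaded, and callers may use that set to verify that nothing was missed. The sketch below is a hedged simplification of that contract, written as a free function over an nn.Module rather than the actual code in llama_eagle.py.

```python
# Simplified sketch of the conventional load_weights contract; written as a
# free function over an nn.Module rather than the actual model method.
from collections.abc import Iterable

import torch
from torch import nn


def load_weights(model: nn.Module,
                 weights: Iterable[tuple[str, torch.Tensor]]) -> set[str]:
    params = dict(model.named_parameters())
    loaded_params: set[str] = set()
    for name, loaded_weight in weights:
        if name not in params:
            continue  # e.g. weights deliberately skipped by the draft model
        params[name].data.copy_(loaded_weight)
        loaded_params.add(name)
    # Callers may compare this set against the model's expected parameters;
    # removing the return statement can silently break that verification.
    return loaded_params
```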

mergify bot added the llama (Related to Llama models) label on Jun 9, 2025

mergify bot commented Jun 13, 2025

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @benchislett.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify bot added the needs-rebase label on Jun 13, 2025
mergify bot added the performance (Performance-related issues) label on Jun 23, 2025