
[Tests] V1 EAGLE Tests for Acceptance Rate #19104

Draft · wants to merge 9 commits into main
Conversation

@benchislett (Collaborator) commented Jun 3, 2025

First draft at some additional EAGLE tests.

Depends on my other PR (#19033); I haven't taken the time to separate the branches yet, so this is marked as a draft for now.

github-actions bot commented Jun 3, 2025

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small and essential subset of tests to catch errors quickly. You can run additional CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@gemini-code-assist bot (Contributor) left a comment

Hello @benchislett, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

Summary of Changes

Hello! Gemini or gemini-code-assist here, providing a summary of this pull request.

This PR, authored by @benchislett, introduces new tests for the V1 EAGLE speculative decoding feature, specifically focusing on measuring and asserting the acceptance rate of drafted tokens. It includes a helper function to calculate various acceptance rate metrics, new prompt fixtures tailored for testing acceptance rates, and adds new test functions to evaluate both the ngram and EAGLE/EAGLE3 methods based on these metrics. Additionally, it refines the model loading logic for EAGLE models to handle embed_tokens sharing more robustly, particularly in the context of pipeline parallelism, and includes minor adjustments to existing correctness test assertions and memory profiling utilities.
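
To make the acceptance-rate bookkeeping concrete, here is a minimal sketch of what such a helper could look like. It assumes the vllm:spec_decode_* counter names and the Metric/Vector objects exposed by vllm.v1.metrics.reader; the function name follows the PR's description, but the body is illustrative rather than the PR's exact code.

```python
# Illustrative sketch only: assumes the vllm:spec_decode_* counters and the
# Metric objects returned by vLLM's V1 metrics reader; the helper added in
# this PR may differ in details.
import numpy as np

from vllm.v1.metrics.reader import Metric


def get_spec_acceptance_metrics(metrics: list[Metric]) -> dict:
    num_drafts = 0
    num_draft_tokens = 0
    num_accepted_tokens = 0
    accepted_per_pos: list[int] = []

    for metric in metrics:
        if metric.name == "vllm:spec_decode_num_drafts":
            num_drafts += metric.value
        elif metric.name == "vllm:spec_decode_num_draft_tokens":
            num_draft_tokens += metric.value
        elif metric.name == "vllm:spec_decode_num_accepted_tokens":
            num_accepted_tokens += metric.value
        elif metric.name == "vllm:spec_decode_num_accepted_tokens_per_pos":
            # Vector metric: one counter per speculative position.
            accepted_per_pos = list(metric.values)

    num_drafts = max(num_drafts, 1)  # avoid division by zero in empty runs
    return {
        "num_drafts": num_drafts,
        "num_draft_tokens": num_draft_tokens,
        "num_accepted_tokens": num_accepted_tokens,
        # Fraction of drafts whose token at each position was accepted.
        "acceptance_rate_per_pos": np.array(accepted_per_pos) / num_drafts,
        # Average tokens emitted per draft step, counting the bonus token.
        "mean_acceptance_length": 1 + num_accepted_tokens / num_drafts,
    }
```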

Highlights

  • New Acceptance Rate Tests: Adds new end-to-end tests (test_ngram_acceptance_rate, test_eagle_acceptance_rate) in tests/v1/e2e/test_spec_decode.py to measure and assert the acceptance rate of speculative decoding for both ngram and EAGLE/EAGLE3 methods. A hedged sketch of such a test appears after this list.
  • Acceptance Metrics Helper: Introduces a new helper function get_spec_acceptance_metrics in tests/v1/e2e/test_spec_decode.py to parse speculative decoding metrics and calculate key acceptance statistics like number of drafts, accepted tokens, acceptance rate per position, and mean acceptance length.
  • New Test Prompts: Adds new pytest fixtures (test_ngram_acceptance_rate_prompts, test_draft_acceptance_rate_prompts) in tests/v1/e2e/test_spec_decode.py to provide specific prompts suitable for testing speculative decoding acceptance rates.
  • Refined EAGLE Model Loading: Updates the EagleProposer model loading logic in vllm/v1/spec_decode/eagle.py and the EAGLE/EAGLE3 model definitions (llama_eagle.py, llama_eagle3.py) to correctly handle sharing of embed_tokens with the target model only when pipeline parallelism is not used and the embedding shapes match. This ensures correctness when PP > 1 or when draft/target models have different embedding sizes.

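As a companion to the helper sketched earlier, here is a hedged sketch of the shape an acceptance-rate test can take. The target and draft model names, the speculative_config fields, the sampling settings, the use of LLM.get_metrics(), and the asserted threshold are assumptions for illustration, not the exact values or code in the PR.

```python
# Hedged sketch of an EAGLE acceptance-rate test. Model names, the
# speculative_config contents, and the asserted threshold are assumptions
# for illustration; the PR's tests may use different values.
from vllm import LLM, SamplingParams


def test_eagle_acceptance_rate_sketch():
    llm = LLM(
        model="meta-llama/Llama-3.1-8B-Instruct",
        speculative_config={
            "method": "eagle",
            "model": "yuhuili/EAGLE-LLaMA3.1-Instruct-8B",
            "num_speculative_tokens": 3,
        },
        max_model_len=2048,
    )
    prompts = ["The capital of France is"]
    llm.generate(prompts, SamplingParams(temperature=0, max_tokens=64))

    # Aggregate the spec-decode counters exposed after generation
    # (assuming the LLM.get_metrics() accessor from the V1 engine).
    stats = get_spec_acceptance_metrics(llm.get_metrics())

    # With greedy sampling, a healthy EAGLE head should accept well over
    # one extra token per draft on average (threshold chosen arbitrarily).
    assert stats["mean_acceptance_length"] > 1.5
```
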
Changelog

  • benchmarks/kernels/bench_fp8_gemm.py
    • Changes the import of triton from the global namespace to vllm.triton_utils (line 11).
  • tests/v1/e2e/test_spec_decode.py
    • Adds import for numpy (line 7).
    • Adds import for Metric from vllm.v1.metrics.reader (line 11).
    • Adds get_spec_acceptance_metrics function to calculate acceptance rate metrics (lines 14-35).
    • Adds test_ngram_acceptance_rate_prompts fixture (lines 72-84).
    • Adds test_draft_acceptance_rate_prompts fixture (lines 88-96).
    • Adjusts the correctness assertion heuristic in test_ngram_correctness from 70% to 65% and changes the denominator calculation (line 157).
    • Adjusts the correctness assertion heuristic in test_eagle_correctness from 66% to 65% (line 206).
    • Adds test_ngram_acceptance_rate function to test ngram acceptance rate (lines 210-247).
    • Adds test_eagle_acceptance_rate function to test EAGLE/EAGLE3 acceptance rate, parameterized by use_eagle3 (lines 250-294).
  • tests/v1/spec_decode/test_eagle.py
    • Adds import for LlamaForCausalLM (line 11).
    • Parametrizes test_load_model with pp_size and use_distinct_embed_tokens (lines 120-121).
    • Updates mock setup in test_load_model to use create_autospec for the target model and handle different embed_tokens shapes for the draft model mock (lines 128-134).
    • Updates the mock setup for pp_group.world_size to use the parameterized pp_size (line 153).
    • Adds assertions in test_load_model to verify embed_tokens sharing logic based on pp_size and use_distinct_embed_tokens (lines 185-194).
  • vllm/model_executor/models/llama_eagle.py
    • Removes the conditional creation of embed_tokens based on get_pp_group().world_size (lines 57-63). VocabParallelEmbedding is now always created. A sketch of the resulting sharing check appears after this changelog.
    • Removes the return statement from load_weights (line 164).
  • vllm/model_executor/models/llama_eagle3.py
    • Removes import for get_pp_group (line 12).
    • Removes the conditional creation of embed_tokens based on get_pp_group().world_size (lines 97-103). VocabParallelEmbedding is now always created.
    • Adds a flag includes_embed_tokens in load_weights to track if embed tokens are present in the loaded weights (line 239).
    • Adds logic in load_weights to skip loading embed_tokens if they are not included in the provided weights (lines 255-256).
  • vllm/platforms/cuda.py
    • Adds torch.cuda.empty_cache() before torch.cuda.reset_peak_memory_stats in get_current_memory_usage (line 160).
  • vllm/utils.py
    • Adds gc.collect() before calling current_platform.get_current_memory_usage in MemoryProfiler.current_memory_usage (line 901).
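
The embed_tokens changes above hinge on a simple condition: the draft model may reuse the target model's embedding table only when pipeline parallelism is not in play and the two embedding tensors have the same shape. Below is a minimal sketch of that check; the helper name and attribute paths are assumptions, not the exact code in vllm/v1/spec_decode/eagle.py.

```python
# Minimal sketch of the sharing condition described above; the helper name
# and attribute paths are assumptions, not the PR's exact implementation.
from vllm.distributed import get_pp_group


def maybe_share_embed_tokens(draft_model, target_model) -> None:
    """Point the draft model at the target model's embed_tokens when safe."""
    no_pipeline_parallel = get_pp_group().world_size == 1
    same_shape = (
        draft_model.model.embed_tokens.weight.shape
        == target_model.model.embed_tokens.weight.shape
    )
    if no_pipeline_parallel and same_shape:
        # Safe to share: saves memory and guarantees identical embeddings.
        draft_model.model.embed_tokens = target_model.model.embed_tokens
    # Otherwise the draft model keeps the embed_tokens it loaded itself
    # (VocabParallelEmbedding is now always created by the draft model).
```
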
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e., a comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.

  • Code Review (/gemini review): Performs a code review for the current pull request in its current state.
  • Pull Request Summary (/gemini summary): Provides a summary of the current pull request in its current state.
  • Comment (@gemini-code-assist): Responds in comments when explicitly tagged, both in issue comments and review comments.
  • Help (/gemini help): Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.

You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.


Tests for speed and rate,
Speculative tokens await.
Will they be accepted?
Metrics are collected,
Efficiency is the fate.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

mergify bot added the v1 label on Jun 3, 2025

@gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request introduces valuable end-to-end tests for the EAGLE speculative decoding acceptance rate and includes necessary fixes related to embed_tokens sharing in EAGLE models under pipeline parallelism. The new tests are well-structured and provide good coverage for the acceptance rate aspect. The changes to the EAGLE model loading logic seem correct and address potential issues with different hidden sizes and parallel configurations.

I've reviewed the changes focusing on correctness, efficiency, and maintainability, adhering to common Python practices (like PEP 8) as no specific style guide was provided. Overall, the code is clean and readable. I've identified one high-severity issue that needs to be addressed before merging, along with some minor points noted in the summary.

Summary of Findings

  • load_weights return value: The load_weights method in vllm/model_executor/models/llama_eagle.py no longer returns the set of loaded parameters, which might break callers expecting this return value. (High severity)
  • Magic numbers in tests: The specific acceptance rate thresholds (e.g., 0.90, 0.75, 0.4) used in the new acceptance rate tests are hardcoded. While common in tests, adding comments explaining the rationale behind these specific values would improve clarity for future maintainers. (Low severity)
  • Memory profiling accuracy: Added torch.cuda.empty_cache() and gc.collect() calls in memory-profiling-related functions (vllm/platforms/cuda.py and vllm/utils.py) to improve the accuracy of memory usage measurements; a sketch follows this list. (Low severity)
  • Triton import: Changed the import of triton in benchmarks/kernels/bench_fp8_gemm.py to use the internal vllm.triton_utils module. (Low severity)
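
To illustrate the memory-profiling point, here is a simplified view of the intended measurement order. It is illustrative only and condenses the real get_current_memory_usage / MemoryProfiler code paths in vllm/platforms/cuda.py and vllm/utils.py.

```python
# Illustrative only: a simplified view of why the profiling helpers collect
# garbage and empty the CUDA cache before reading memory statistics; the
# real code lives in vllm/platforms/cuda.py and vllm/utils.py.
import gc

import torch


def current_gpu_memory_usage(device=None) -> float:
    # Free tensors that are only kept alive by dead Python references, so
    # they no longer count toward torch.cuda.memory_allocated().
    gc.collect()
    # Hand cached-but-unused blocks back to the driver so device-level
    # queries are not inflated by the caching allocator.
    torch.cuda.empty_cache()
    # Reset the peak so the next reading reflects current live usage.
    torch.cuda.reset_peak_memory_stats(device)
    return torch.cuda.max_memory_allocated(device)
```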

Merge Readiness

This pull request introduces important tests and fixes. However, the high-severity issue regarding the load_weights return value in vllm/model_executor/models/llama_eagle.py needs to be addressed before this can be merged. Once that is resolved, the PR should be in good shape. I am unable to approve this pull request; please have other reviewers review and approve this code before merging.
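
For context on that finding: load_weights methods on vLLM model classes conventionally return the set of parameter names they loaded, and callers may use that set to verify that nothing was missed. The sketch below is a hedged simplification of that contract, written as a free function over an nn.Module rather than the actual code in llama_eagle.py.

```python
# Simplified sketch of the conventional load_weights contract; written as a
# free function over an nn.Module rather than the actual model method.
from collections.abc import Iterable

import torch
from torch import nn


def load_weights(model: nn.Module,
                 weights: Iterable[tuple[str, torch.Tensor]]) -> set[str]:
    params = dict(model.named_parameters())
    loaded_params: set[str] = set()
    for name, loaded_weight in weights:
        if name not in params:
            continue  # e.g. weights deliberately skipped by the draft model
        params[name].data.copy_(loaded_weight)
        loaded_params.add(name)
    # Callers may compare this set against the model's expected parameters;
    # removing the return statement can silently break that verification.
    return loaded_params
```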

mergify bot added the llama (Related to Llama models) label on Jun 9, 2025

mergify bot commented Jun 13, 2025

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @benchislett.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify bot added the needs-rebase label on Jun 13, 2025
mergify bot added the performance (Performance-related issues) label on Jun 23, 2025