
[V1] [Spec decode] Llama4 type eagle support in v1 #18369

Open · wants to merge 13 commits into base: main

Conversation

@RonaldBXu (Contributor) commented May 19, 2025

This PR adds support for Llama4-type EAGLE heads to be used for speculative decoding in vLLM v1. This is my first major PR in vLLM, so feedback is appreciated :)
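For context, here is a minimal sketch of how an EAGLE draft head for a Llama 4 target might be wired up through vLLM's speculative decoding config. The model IDs, config keys, and draft-head repo name are assumptions for illustration, not taken from this PR.

```python
# Minimal sketch (not from this PR): launching vLLM v1 with an EAGLE draft head
# for a Llama 4 target model. Model names and speculative_config keys are
# assumptions and may differ from the final interface.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-4-Scout-17B-16E-Instruct",  # assumed target model ID
    speculative_config={
        "method": "eagle",                      # EAGLE-style draft head
        "model": "your-org/llama4-eagle-head",  # hypothetical draft-head repo
        "num_speculative_tokens": 3,            # draft tokens proposed per step
    },
)

outputs = llm.generate(
    ["The capital of France is"],
    SamplingParams(temperature=0.0, max_tokens=32),
)
print(outputs[0].outputs[0].text)
```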

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, they only run fastcheck CI, which runs a small and essential subset of CI tests to quickly catch errors. You can run additional CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run full CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@sarckk (Collaborator) commented May 19, 2025

cc @zixi-qi @morgendave

RonaldBXu closed this May 19, 2025
RonaldBXu reopened this May 19, 2025
RonaldBXu closed this May 19, 2025
RonaldBXu reopened this May 21, 2025
@RonaldBXu (Contributor, Author)

Ready for review now

@WoosukKwon (Collaborator)

@RonaldBXu Looks good to me overall. Could you please add a test? Also, is there any available EAGLE head we can test this on?

@aarnphm (Collaborator) commented Jun 5, 2025

> Could you please add a test? Also, is there any available EAGLE head we can test this on?

I found this from NVIDIA: https://huggingface.co/nvidia/Llama-4-Maverick-17B-128E-Eagle3, but it seems they are using the EAGLE3 architecture.

mergify bot added the llama (Related to Llama models) label Jun 9, 2025
@RonaldBXu (Contributor, Author)

Hi @WoosukKwon, when you say add a test, do you mean an e2e test like in https://github.com/vllm-project/vllm/blob/main/tests/spec_decode/e2e/test_eagle_correctness.py or https://github.com/vllm-project/vllm/blob/main/tests/models/registry.py#L407? I think I'd have to open-source a compatible eagle head first, right?

Could you point me to other tests I could work on while I wait for approval for a compatible eagle head? Thanks!

WoosukKwon added the ready (ONLY add when PR is ready to merge/full CI is needed) label Jun 15, 2025
@WoosukKwon (Collaborator)

Hi @RonaldBXu, the PR looks good to me overall, but we'd like to have a test or at least a way to run the code.

Please refer to https://github.com/vllm-project/vllm/blob/main/tests/v1/spec_decode/test_eagle.py and the test_eagle_correctness test.

> I think I'd have to open-source a compatible eagle head first, right?

Yes. We need an eagle head for Llama 4. Could we use https://huggingface.co/nvidia/Llama-4-Maverick-17B-128E-Eagle3 (which @aarnphm mentioned)?
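For reference, a rough sketch of the kind of e2e check the referenced test_eagle_correctness performs: generate with and without the draft head and compare outputs. The model IDs, config keys, and match threshold below are placeholders, not the PR's actual test.

```python
# Rough sketch in the spirit of the referenced test_eagle_correctness:
# greedy outputs with speculative decoding enabled should largely match a
# greedy baseline. Model IDs and the threshold are placeholders.
from vllm import LLM, SamplingParams

PROMPTS = ["The future of AI is", "San Francisco is"]
SAMPLING = SamplingParams(temperature=0.0, max_tokens=64)

def _generate(llm):
    return [o.outputs[0].text for o in llm.generate(PROMPTS, SAMPLING)]

# Baseline: target model without a draft head.
ref_llm = LLM(model="meta-llama/Llama-4-Scout-17B-16E-Instruct")  # assumed target
ref_texts = _generate(ref_llm)
del ref_llm  # free GPU memory before starting the spec-decode engine

# Same target with an EAGLE draft head attached (hypothetical repo ID).
spec_llm = LLM(
    model="meta-llama/Llama-4-Scout-17B-16E-Instruct",
    speculative_config={
        "method": "eagle",
        "model": "your-org/llama4-eagle-head",
        "num_speculative_tokens": 3,
    },
)
spec_texts = _generate(spec_llm)

# Allow minor numeric drift rather than requiring exact equality.
matches = sum(r == s for r, s in zip(ref_texts, spec_texts))
assert matches >= int(0.7 * len(PROMPTS))
```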

@RonaldBXu (Contributor, Author)

Thanks, I'll look at those tests. I don't think we can use that head since it is EAGLE3, but the good news is I got approval to release a compatible eagle head for my code. I should hopefully have it ready sometime next week!

@RonaldBXu (Contributor, Author)

Hi @WoosukKwon, I added the tests. Just wanted to call out that for Llama4 Maverick, tp=1 was not sufficient (CUDA out of memory error), so I made my test initialize the LLM with tp=8. Although I guess I could change it to Llama4 Scout. Please let me know what you think would be the best option here. Thanks!
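A minimal illustration of the tensor-parallel setting being discussed; aside from the Maverick checkpoint name, the config keys and draft-head ID are assumptions.

```python
# Illustration of the tensor-parallel setting under discussion: Llama 4
# Maverick does not fit on a single GPU here, so the test spins up the
# engine with tensor_parallel_size=8. The draft-head ID is hypothetical.
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-4-Maverick-17B-128E-Instruct",  # assumed target ID
    tensor_parallel_size=8,  # tp=1 ran out of CUDA memory for this model
    speculative_config={
        "method": "eagle",
        "model": "your-org/llama4-eagle-head",  # hypothetical
        "num_speculative_tokens": 3,
    },
)
```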

@aarnphm (Collaborator) commented Jun 23, 2025

Let's make a separate script for testing the Llama 4 head. I don't think we want to run Llama 4 on CI right now.

And can you add an entry instructing users with a tp=8 setup on how to run these tests locally? I can test this as well (I have access to an 8xH100 box atm).

@RonaldBXu (Contributor, Author) commented Jun 23, 2025

Sure, I can put the tests in a separate file and add some instructions. Is there something I should edit to make CI skip my new Llama4-specific file? In here? https://github.com/vllm-project/vllm/blob/main/.buildkite/test-pipeline.yaml

edit: no, I don't need to edit the file, since the pipeline explicitly lists each test file it runs. So if I make a new file, it won't be run in CI.

@aarnphm (Collaborator) commented Jun 23, 2025

Yeah, just include the test file there; noting in the file the instructions to run it with pytest should be good enough.
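One way to carry those instructions is a module docstring in the standalone test file. The file path, model requirements, and test body below are illustrative assumptions, not the PR's actual file.

```python
# Illustrative only: a standalone test module carrying its own run
# instructions, since it is excluded from CI. Path and hardware notes are
# assumptions, not the PR's actual file.
"""Llama 4 EAGLE e2e test (not run in CI).

Requires a node with 8 GPUs (e.g. 8xH100). Run locally with:

    pytest -v -s tests/v1/e2e/test_llama4_eagle.py
"""
import pytest
import torch


@pytest.mark.skipif(torch.cuda.device_count() < 8,
                    reason="needs 8 GPUs for tensor_parallel_size=8")
def test_llama4_eagle_correctness():
    ...  # build the tp=8 engine and compare outputs as sketched above
```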

Labels: llama (Related to Llama models), ready (ONLY add when PR is ready to merge/full CI is needed), speculative-decoding, v1