[V1] [Spec decode] Llama4 type eagle support in v1 #18369
Conversation
Signed-off-by: Ronald Xu <[email protected]>
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs would not trigger a full CI run by default. Instead, it would only run fastcheck CI, a small and essential subset of tests to quickly catch errors. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge. 🚀
Ready for review now
@RonaldBXu Looks good to me overall. Could you please add a test? Also, is there any available EAGLE head we can test this on?
I found this one from NVIDIA: https://huggingface.co/nvidia/Llama-4-Maverick-17B-128E-Eagle3, but it seems to use the EAGLE3 architecture.
Hi @WoosukKwon, when you say add a test, do you mean an e2e test like the one in https://github.com/vllm-project/vllm/blob/main/tests/spec_decode/e2e/test_eagle_correctness.py, or an entry like https://github.com/vllm-project/vllm/blob/main/tests/models/registry.py#L407? I think I'd have to open-source a compatible EAGLE head first, right? Could you point me to other tests I could work on while I wait for approval to release a compatible EAGLE head? Thanks!
Signed-off-by: Ronald Xu <[email protected]>
Signed-off-by: Ronald Xu <[email protected]>
Signed-off-by: Ronald Xu <[email protected]>
Hi @RonaldBXu, the PR looks good to me overall, but we'd like to have a test, or at least a way to run the code. Please refer to https://github.com/vllm-project/vllm/blob/main/tests/v1/spec_decode/test_eagle.py and tests/v1/e2e/test_spec_decode.py (line 109 at commit ee1531b).
Yes, we need an EAGLE head for Llama 4. Could we use https://huggingface.co/nvidia/Llama-4-Maverick-17B-128E-Eagle3 (which @aarnphm mentioned)?
Thanks, I'll look at those tests. I don't think we can use that head since it is EAGLE3, but the good news is I got approval to release a compatible EAGLE head for my code. I should hopefully have it ready sometime next week!
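(For concreteness, once a compatible head is released, enabling it should follow vLLM's usual spec-decode path. A minimal sketch, assuming a hypothetical draft-head repo name — no compatible head was public at the time of this discussion, and exact `speculative_config` keys may differ by vLLM version:)

```python
from vllm import LLM, SamplingParams

# Minimal sketch: EAGLE speculative decoding with a Llama4 target model.
# "your-org/llama4-scout-eagle" is a hypothetical draft-head repo.
llm = LLM(
    model="meta-llama/Llama-4-Scout-17B-16E-Instruct",
    tensor_parallel_size=8,
    speculative_config={
        "method": "eagle",
        "model": "your-org/llama4-scout-eagle",  # hypothetical head
        "num_speculative_tokens": 3,
    },
)
outputs = llm.generate(
    ["The capital of France is"],
    SamplingParams(temperature=0.0, max_tokens=32),
)
print(outputs[0].outputs[0].text)
```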
Signed-off-by: Ronald Xu <[email protected]>
Hi @WoosukKwon, I added the tests. Just wanted to call out that for Llama4 Maverick, tp=1 was not sufficient (CUDA out-of-memory error), so I made my test initialize the LLM with tp=8. Alternatively, I could change it to Llama4 Scout. Please let me know what you think would be the best option here. Thanks!
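(A hedged sketch of how the test could parametrize the two options — the model names and tp sizes here are illustrative assumptions, not necessarily what the PR uses:)

```python
import pytest
from vllm import LLM

# Hypothetical parametrization: Maverick OOMs at tp=1 and needs tp=8,
# while the smaller Scout may fit at a lower tp. Sizes are assumptions.
@pytest.mark.parametrize(("model_name", "tp_size"), [
    ("meta-llama/Llama-4-Maverick-17B-128E-Instruct", 8),
    ("meta-llama/Llama-4-Scout-17B-16E-Instruct", 4),
])
def test_llama4_smoke(model_name: str, tp_size: int):
    llm = LLM(model=model_name, tensor_parallel_size=tp_size)
    out = llm.generate(["Hello"])
    assert out[0].outputs[0].text
```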
Signed-off-by: Ronald Xu <[email protected]>
Signed-off-by: Ronald Xu <[email protected]>
Signed-off-by: Ronald Xu <[email protected]>
Let's make a separate script for testing the Llama 4 head. I don't think we want to run Llama 4 on CI right now. Also, can you add instructions on how to run these tests locally for users that have a tp=8 setup? I can test this as well (I have access to an 8xH100 box atm).
Sure, I can put the tests in a separate file and add some instructions. Is there something I should edit to make CI skip my new Llama4-specific file, e.g. in https://github.com/vllm-project/vllm/blob/main/.buildkite/test-pipeline.yaml? Edit: no, I don't need to edit that file, since the pipeline runs each test file explicitly. So if I make a new file, it won't be run in CI.
Yeah, just include the test file there; a note in the file with the instructions to run it with pytest should be good enough.
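(A sketch of how such a file header could document the local run and guard against accidental runs on small boxes — the file name and GPU-count guard are assumptions, not the PR's actual code:)

```python
# Hypothetical file: tests/v1/e2e/test_spec_decode_llama4.py
"""Llama4 EAGLE speculative-decoding test.

Not run in CI: the Buildkite pipeline only invokes test files listed in
.buildkite/test-pipeline.yaml. Run locally on a tp=8 machine with:

    pytest tests/v1/e2e/test_spec_decode_llama4.py
"""
import pytest
import torch

# Fail fast with a clear skip message instead of a CUDA OOM on smaller boxes.
if torch.cuda.device_count() < 8:
    pytest.skip("Llama4 EAGLE test requires 8 GPUs", allow_module_level=True)
```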
This PR adds the capability for Llama4-type EAGLE heads to be used for speculative decoding in vLLM v1. This is my first major PR in vLLM, so feedback is appreciated :)
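(For context on what an EAGLE head does: the draft head fuses the target model's last hidden state with the embedding of the most recently sampled token before predicting the next draft token. A toy illustration of that fusion step — not vLLM's actual classes or this PR's code:)

```python
import torch
import torch.nn as nn

class ToyEagleHead(nn.Module):
    """Toy sketch of the EAGLE drafting idea (illustrative only)."""

    def __init__(self, hidden_size: int, vocab_size: int):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)
        # Fuse [target hidden state; token embedding] back to hidden_size.
        self.fc = nn.Linear(2 * hidden_size, hidden_size)
        self.lm_head = nn.Linear(hidden_size, vocab_size, bias=False)

    def forward(self, hidden: torch.Tensor, token_ids: torch.Tensor) -> torch.Tensor:
        # A real EAGLE head runs a full decoder layer after the fusion;
        # this sketch keeps only the characteristic concatenation step.
        fused = self.fc(torch.cat([hidden, self.embed(token_ids)], dim=-1))
        return self.lm_head(fused)  # logits for the draft token
```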