LoRA support on llama4 #19819
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a reduced subset of checks runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either: Add 🚀 …
Summary of Changes
Hello @frank-wei, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request introduces support for LoRA (Low-Rank Adaptation) to the llama4 model within the vLLM framework. The primary goal is to enable the use of LoRA adapters with llama4 for fine-tuning and efficient deployment, specifically for the `Llama4ForConditionalGeneration` class, by making it compatible with LoRA-specific operations and module mapping.
Highlights
- LoRA Integration: The `Llama4ForConditionalGeneration` class now implements the `SupportsLoRA` interface, enabling LoRA (Low-Rank Adaptation) functionality for the llama4 model.
- Multimodal Module Mapping: A new `get_mm_mapping` method has been added to `Llama4ForConditionalGeneration`. This method provides module prefixes (`language_model`, `connector`, `tower_model`) essential for correctly applying LoRA adapters in a multimodal context.
- Dependency Updates: Necessary imports for `MultiModelKeys` and `SupportsLoRA` have been added to `vllm/model_executor/models/mllama4.py` to support the new LoRA integration.
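For orientation, here is a minimal sketch of the imports the last highlight refers to; the exact import paths are assumptions based on vLLM's usual module layout, not copied from the diff.

```python
# Assumed import locations for the new dependencies (not taken verbatim from the diff).
from vllm.model_executor.models.interfaces import SupportsLoRA
from vllm.model_executor.models.module_mapping import MultiModelKeys
```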
Code Review
This pull request introduces LoRA support for the mllama4 model by having `Llama4ForConditionalGeneration` implement the `SupportsLoRA` interface and adding a helper method `get_mm_mapping`, likely to aid in applying LoRA to multimodal components.
Key areas for the author to consider:
- Testing: The PR description mentions initial verification but highlights the need for more extensive testing, especially with actual LoRA adapters and on larger models. It would be helpful to detail the internal verification steps and outline a clearer plan for further testing, even if community help is needed for specific adapters.
- Scope of LoRA: Please clarify if LoRA is intended to be applicable to embedding layers (like token embeddings or `lm_head`) for mllama4, or if it's primarily for attention and MLP layers. The current changes rely on default empty configurations for LoRA on embeddings.
- Packed Module Mapping: The `packed_modules_mapping` in `Llama4ForConditionalGeneration` seems specific. It would be good to confirm if this mapping is intentionally different from potentially more comprehensive mappings in other Llama-style models, particularly regarding MLP layers.
Overall, the changes seem to lay the necessary groundwork for LoRA support. Addressing the points above will help ensure robustness and clarity.
```python
class Llama4ForConditionalGeneration(nn.Module, SupportsMultiModal, SupportsPP,
                                     SupportsLoRA):
```
The class `Llama4ForConditionalGeneration` now correctly inherits `SupportsLoRA`. This is a key step for enabling LoRA support. Two points to consider:

1. Embedding LoRA: The `SupportsLoRA` interface provides default empty `embedding_modules` and `embedding_padding_modules`. If this model requires LoRA to be applied to specific embedding layers (e.g., word token embeddings `wte`, or the `lm_head`), these class variables should be overridden in `Llama4ForConditionalGeneration`. For example:

   ```python
   class Llama4ForConditionalGeneration(nn.Module, SupportsMultiModal, SupportsPP,
                                        SupportsLoRA):
       # ... existing packed_modules_mapping ...
       embedding_modules = {
           "embed_tokens": "language_model.model.embed_tokens",  # Example path
       }
       # embedding_padding_modules = [...]  # If needed for specific embeddings
       # ... rest of the class ...
   ```

   If LoRA is not intended for embedding layers in mllama4, then using the defaults from `SupportsLoRA` is acceptable. Please clarify the intended scope.

2. Packed Modules for LoRA: The `packed_modules_mapping` defined on lines 718-720 is `{"qkv_proj": ["q_proj", "k_proj", "v_proj"]}`. For context, the `Llama4ForCausalLM` class (defined in `vllm/model_executor/models/llama4.py`, not this `mllama4.py` file, but potentially related or a source of inspiration) often has a more comprehensive mapping, e.g., `{"qkv_proj": [...], "gate_up_proj": ["gate_proj", "up_proj"]}`.

   Is it intentional that `Llama4ForConditionalGeneration` has this specific mapping, potentially excluding MLP layers like `gate_proj` and `up_proj` from LoRA application? If `Llama4ForConditionalGeneration` shares a similar transformer block structure with other Llama models where MLP LoRA is common, consider aligning these mappings or clarifying the rationale for the difference.
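For reference, a sketch of the more comprehensive mapping the review describes; whether the MLP entries make sense for llama4's MoE/shared-expert layout is exactly the open question above.

```python
# Illustrative only: the broader mapping common in other Llama-style models.
# The PR itself maps only qkv_proj.
packed_modules_mapping = {
    "qkv_proj": ["q_proj", "k_proj", "v_proj"],
    "gate_up_proj": ["gate_proj", "up_proj"],
}
```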
```python
def get_mm_mapping(self) -> MultiModelKeys:
    """
    Get the module prefix in multimodal models
    """
    return MultiModelKeys.from_string_field(
        language_model="language_model",
        connector="multi_modal_projector.",
        tower_model="vision_model.",
    )
```
This new method `get_mm_mapping` provides module prefixes for different components of a multimodal model. This is likely used by the LoRA mechanism (or other systems) to correctly apply or map configurations (like LoRA weights) in a multimodal context.
- The hardcoded prefixes (`"language_model"`, `"multi_modal_projector."`, `"vision_model."`) are specific to the mllama4 architecture. This is generally fine for a model-specific implementation.
- Could you confirm if these prefixes are always static for all Mllama4 variants? If there's a possibility of them changing (e.g., based on `hf_config`), deriving them dynamically might offer more flexibility. However, if they are fixed architectural constants for mllama4, the current approach is fine.
- Consider adding a brief comment within the method or in the docstring explaining the origin or significance of these specific prefix strings if it's not immediately obvious from the model's standard architecture documentation. For example, clarify that `language_model` refers to the text processing backbone, `vision_model` to the image encoder, and `multi_modal_projector` to the connector module.
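One possible shape for that docstring, sketched from the reviewer's suggestion; the wording is illustrative and not part of the PR.

```python
def get_mm_mapping(self) -> MultiModelKeys:
    """Get the module prefixes of this multimodal model.

    - language_model: the text backbone of llama4.
    - multi_modal_projector: the connector that projects vision features
      into the language model's embedding space.
    - vision_model: the image encoder (vision tower).
    """
    return MultiModelKeys.from_string_field(
        language_model="language_model",
        connector="multi_modal_projector.",
        tower_model="vision_model.",
    )
```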
Looks good to me. We just need some adapter to verify.
Maybe this adapter can be used for testing? https://huggingface.co/ImranzamanML/llama4-medqa/blob/main/adapter_config.json
Or we can create a dummy adapter.
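For the dummy-adapter route, here is a rough, untested sketch using peft; the checkpoint id and target modules are assumptions, and MoE modules are deliberately left out given the limitation discussed below.

```python
# Untested sketch: create a small dummy LoRA adapter for smoke-testing.
# The checkpoint id and target_modules are illustrative assumptions.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-4-Scout-17B-16E-Instruct")
config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # no MoE modules
    task_type="CAUSAL_LM",
)
peft_model = get_peft_model(base, config)
peft_model.save_pretrained("dummy-llama4-lora")  # writes adapter_config.json + weights
```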
Considering that our current MoE layer doesn't support LoRA yet, llama4 may not be able to fully support LoRA.
@jeejeelee, we probably need to ask people not to add adapters for the MoE modules.
Tried the adapter from @22quinn; that adapter was trained with module names that are not supported. The shared expert module has linear operators named `gate_up_proj` and `down_proj`, but the adapter has `gate_proj` and `up_proj`, which are not aligned. Log for shared expert:
Error: …
At this stage, I think the functionality is good, but an adapter is needed for further checking. I verified the small model internally, but not the big OSS one. I can follow up if any user has their adapters.
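A quick sanity check along these lines could catch the mismatch before loading; this is a sketch that assumes a standard peft adapter_config.json and that the problematic names are exactly the ones mentioned above.

```python
# Sketch: warn if an adapter targets module names that don't align with
# llama4's shared-expert naming (assumption based on the log above).
import json

MISMATCHED = {"gate_proj", "up_proj"}  # shared expert exposes gate_up_proj/down_proj

with open("adapter_config.json") as f:
    cfg = json.load(f)

bad = MISMATCHED.intersection(cfg.get("target_modules", []))
if bad:
    print(f"Adapter targets modules not aligned with llama4: {sorted(bad)}")
```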
Signed-off-by: Wei Wei <[email protected]>
Essential Elements of an Effective PR Description Checklist
- … `supported_models.md` and `examples` for a new model.
Purpose
llama4 currently does not support LoRA; this PR adds that support.
Test Plan
Verified on an internal llama4 model. Further verification on large models is still needed, but I do not have a LoRA adapter for them; help from users who need this feature would be appreciated.
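For anyone who can help verify, a rough sketch of exercising an adapter through vLLM's offline API (untested; the model id and adapter path are placeholders):

```python
# Untested sketch: run one request through a LoRA adapter on llama4.
# The model id and adapter path are placeholders.
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

llm = LLM(
    model="meta-llama/Llama-4-Scout-17B-16E-Instruct",
    enable_lora=True,
    max_lora_rank=8,
)
out = llm.generate(
    "What are common symptoms of anemia?",
    SamplingParams(max_tokens=64),
    lora_request=LoRARequest("medqa", 1, "/path/to/adapter"),
)
print(out[0].outputs[0].text)
```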
Test Result
(Optional) Documentation Update