
Conversation

@tbqh (Collaborator) commented Oct 14, 2025

Copied the llama4 model code from transformers so that we can tweak it locally, instead of doing type checks and rewrites on the transformers modules.

Needed to copy the entire Llama4ForCausalLM class, and a few other pieces, to keep a number of optimizations that transformers applies.

I prefixed all the new class and function names with "copied_". This is needed because we import the real transformers Llama4TextMoe class for type checking. Alternatively, we could leave the names alone and alias the transformers import for type checking:

    from transformers.models.llama4.modeling_llama4 import Llama4TextMoe as Llama4TextMoe_type
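
For illustration, that alternative might look like the following sketch (the class body is elided; _is_moe is a hypothetical helper, not part of this PR):

    import torch.nn as nn
    from transformers.models.llama4.modeling_llama4 import Llama4TextMoe as Llama4TextMoe_type

    # The local, tweakable copy keeps the upstream class name.
    class Llama4TextMoe(nn.Module):
        ...

    def _is_moe(module: nn.Module) -> bool:
        # Type-check against the real transformers class or the local copy.
        return isinstance(module, (Llama4TextMoe_type, Llama4TextMoe))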

@tbqh tbqh requested a review from jjsjann123 October 14, 2025 19:59
@tbqh tbqh changed the title Copy llama4 model code from transformers [Inference Benchmark] Copy llama4 model code from transformers Oct 14, 2025
@github-actions

Description

  • Copied Llama4 model implementation for local benchmarking

  • Added prefixed versions of model components with 'copied_'

  • Integrated copied model into inference benchmark pipeline

  • Included attention, MoE, MLP, and embedding layers


Changes walkthrough 📝

Relevant files (Enhancement):

benchmarks/python/benchmark_inference.py: Use copied Llama4 model in benchmark (+2/-2)

  • Removed the AutoModelForCausalLM import
  • Added an import for copied_Llama4ForCausalLM
  • Updated model loading to use the copied implementation

benchmarks/python/layers_for_inference_benchmark.py: Add copied Llama4 model architecture (+952/-3)

  • Added multiple copied Llama4 components with a 'copied_' prefix
  • Implemented attention, MoE, MLP, and normalization layers
  • Included rotary embeddings and the decoder layer
  • Added the full causal language model class

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

🧪 No relevant tests

⚡ Recommended focus areas for review

    Missing Import

    The code uses Cache and DynamicCache in the copied_Llama4TextModel class but does not import them from transformers or define them locally. This will result in a NameError when attempting to use these types.

    past_key_value: Optional[Cache] = None,
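
If these are indeed missing, a one-line import mirroring upstream modeling_llama4.py should resolve it (a sketch, assuming the copied file otherwise follows the upstream structure):

    # Cache and DynamicCache live in transformers.cache_utils, the same
    # place upstream modeling_llama4.py imports them from.
    from transformers.cache_utils import Cache, DynamicCache
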
    Undefined Function

    The method _update_causal_mask calls make_flex_block_causal_mask, but this function is not imported or defined in the file. This will lead to a runtime error when the function is invoked.

    chunked_attention_mask = make_flex_block_causal_mask(
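
Upstream transformers guards this import behind a flex-attention availability check; a matching fix could look like this (a sketch, assuming a transformers version that ships these helpers):

    from transformers.utils import is_torch_flex_attn_available

    # Flex attention needs a sufficiently recent torch, so the mask helper
    # is only imported when it is available.
    if is_torch_flex_attn_available():
        from transformers.integrations.flex_attention import make_flex_block_causal_mask
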
    Undefined Logger

    The code references logger.warning_once in multiple places, but there is no import or definition of logger in the file. This will cause a NameError during execution.

    logger.warning_once(
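
The standard transformers logger setup would provide warning_once (a sketch):

    # transformers' logging wrapper returns a logger that supports
    # warning_once, unlike a plain stdlib logger.
    from transformers.utils import logging

    logger = logging.get_logger(__name__)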

    @jjsjann123 jjsjann123 left a comment


    looks like mostly mechanical changes.

How are we validating that the code change in this PR didn't change the model? I suspect the renamed model layers could cause code inside benchmark_inference.py to stop working as intended.
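
One option would be a quick logit-parity check against the upstream implementation (a sketch, not part of this PR: the checkpoint id is a placeholder, and it assumes copied_Llama4ForCausalLM inherits from_pretrained):

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    from layers_for_inference_benchmark import copied_Llama4ForCausalLM

    model_id = "..."  # placeholder: any Llama4 checkpoint both classes can load

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    inputs = tokenizer("The quick brown fox", return_tensors="pt")

    ref = AutoModelForCausalLM.from_pretrained(model_id).eval()
    copied = copied_Llama4ForCausalLM.from_pretrained(model_id).eval()

    with torch.no_grad():
        ref_logits = ref(**inputs).logits
        new_logits = copied(**inputs).logits

    # Identical code paths on identical weights should agree bitwise, or at
    # least to numerical tolerance.
    torch.testing.assert_close(new_logits, ref_logits)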

    NVFP4InferenceLinear,
    nvfuser_f16a_nvfp4weight_scaled_grouped_mm,
    nvfuser_f16a_nvfp4weight_scaled_mm,
    copied_Llama4ForCausalLM,
@jjsjann123 (Collaborator) commented:

nitpick, we could use a better name on this 😜

    )
    from transformers.models.llama4 import Llama4TextConfig
    from transformers.models.llama4.modeling_llama4 import (
        Llama4TextMoe,
@jjsjann123 (Collaborator) commented:

At a high level, one of the intentions of this benchmark is to replace Llama4TextMoe with the custom module Llama4MoE defined in this file.

We don't have to fix it right now, but we can add a TODO to swap it out later. I think that would let us simplify the implementation and avoid some of the copied_ prefixes.
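
A marker like the following could capture that (a hypothetical sketch of the TODO, placed where the copied decoder layer builds its MoE):

    # TODO(llama4-bench): construct the custom Llama4MoE (defined in this
    # file) directly instead of copied_Llama4TextMoe, so the post-load swap
    # in benchmark_inference.py and some of the copied_ prefixes go away.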



    # Ref: https://github.com/huggingface/transformers/blob/ff8b88a9/src/transformers/models/llama4/modeling_llama4.py#L147-L165
    class copied_Llama4TextMoe(nn.Module):
@jjsjann123 (Collaborator) commented:

This, I believe, is the MoE that we want to swap.

Here the new model has copied_Llama4TextMoe. If you look at benchmark_inference.py, we have a function that does the swapping, _replace_llama4_moe; that wouldn't work with the updated model, since it only mechanically swaps Llama4TextMoe.
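
A minimal sketch of making the swap robust to the renamed class (assuming _replace_llama4_moe walks submodules; Llama4MoE.from_hf is a hypothetical weight-conversion helper, not the PR's actual API):

    import torch.nn as nn
    from transformers.models.llama4.modeling_llama4 import Llama4TextMoe

    from layers_for_inference_benchmark import Llama4MoE, copied_Llama4TextMoe

    def _replace_llama4_moe(model: nn.Module) -> None:
        # Match the upstream class and the local copy, so the swap keeps
        # working with the renamed model.
        for parent in model.modules():
            for name, child in parent.named_children():
                if isinstance(child, (Llama4TextMoe, copied_Llama4TextMoe)):
                    setattr(parent, name, Llama4MoE.from_hf(child))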
