
Conversation

@tbqh (Collaborator) commented Oct 14, 2025

Copied the llama4 model code from transformers so that we can tweak it locally, instead of doing type checks and rewrites on the transformers modules.

Needed to copy the entire Llama4ForCausalLM class, and a few other pieces, to keep a number of optimizations that transformers applies.

I prefixed all the new class and function names with "copied_". This is needed because we import the real transformers Llama4TextMoe class for type checking. Alternatively, we could leave the names alone and alias the transformers import for type checking:

    from transformers.models.llama4.modeling_llama4 import Llama4TextMoe as Llama4TextMoe_type
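
For illustration, that alternative might look like the following sketch (the class body is elided; _is_moe is a hypothetical helper, not part of this PR):

    import torch.nn as nn
    from transformers.models.llama4.modeling_llama4 import Llama4TextMoe as Llama4TextMoe_type

    # The local, tweakable copy keeps the upstream class name.
    class Llama4TextMoe(nn.Module):
        ...

    def _is_moe(module: nn.Module) -> bool:
        # Type-check against the real transformers class or the local copy.
        return isinstance(module, (Llama4TextMoe_type, Llama4TextMoe))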

@tbqh tbqh requested a review from jjsjann123 October 14, 2025 19:59
@tbqh tbqh changed the title Copy llama4 model code from transformers [Inference Benchmark] Copy llama4 model code from transformers Oct 14, 2025
@github-actions

Description

  • Copied Llama4 model implementation for local benchmarking

  • Added prefixed versions of model components with 'copied_'

  • Integrated copied model into inference benchmark pipeline

  • Included attention, MoE, MLP, and embedding layers


Changes walkthrough 📝

Relevant files (Enhancement):

benchmarks/python/benchmark_inference.py: Use copied Llama4 model in benchmark (+2/-2)

  • Removed the AutoModelForCausalLM import
  • Added an import for copied_Llama4ForCausalLM
  • Updated model loading to use the copied implementation

benchmarks/python/layers_for_inference_benchmark.py: Add copied Llama4 model architecture (+952/-3)

  • Added multiple copied Llama4 components with a 'copied_' prefix
  • Implemented attention, MoE, MLP, and normalization layers
  • Included rotary embeddings and the decoder layer
  • Added the full causal language model class

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

🧪 No relevant tests

⚡ Recommended focus areas for review

    Missing Import

    The code uses Cache and DynamicCache in the copied_Llama4TextModel class but does not import them from transformers or define them locally. This will result in a NameError when attempting to use these types.

    past_key_value: Optional[Cache] = None,
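
If these are indeed missing, a one-line import mirroring upstream modeling_llama4.py should resolve it (a sketch, assuming the copied file otherwise follows the upstream structure):

    # Cache and DynamicCache live in transformers.cache_utils, the same
    # place upstream modeling_llama4.py imports them from.
    from transformers.cache_utils import Cache, DynamicCache
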
    Undefined Function

    The method _update_causal_mask calls make_flex_block_causal_mask, but this function is not imported or defined in the file. This will lead to a runtime error when the function is invoked.

    chunked_attention_mask = make_flex_block_causal_mask(
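
Upstream transformers guards this import behind a flex-attention availability check; a matching fix could look like this (a sketch, assuming a transformers version that ships these helpers):

    from transformers.utils import is_torch_flex_attn_available

    # Flex attention needs a sufficiently recent torch, so the mask helper
    # is only imported when it is available.
    if is_torch_flex_attn_available():
        from transformers.integrations.flex_attention import make_flex_block_causal_mask
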
    Undefined Logger

    The code references logger.warning_once in multiple places, but there is no import or definition of logger in the file. This will cause a NameError during execution.

    logger.warning_once(
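
The standard transformers logger setup would provide warning_once (a sketch):

    # transformers' logging wrapper returns a logger that supports
    # warning_once, unlike a plain stdlib logger.
    from transformers.utils import logging

    logger = logging.get_logger(__name__)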

    @jjsjann123 jjsjann123 left a comment


    looks like mostly mechanical changes.

How are we validating that the code change in this PR didn't change the model? I suspect the renamed model layers could cause code inside benchmark_inference.py to stop working as intended.
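
One option would be a quick logit-parity check against the upstream implementation (a sketch, not part of this PR: the checkpoint id is a placeholder, and it assumes copied_Llama4ForCausalLM inherits from_pretrained):

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    from layers_for_inference_benchmark import copied_Llama4ForCausalLM

    model_id = "..."  # placeholder: any Llama4 checkpoint both classes can load

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    inputs = tokenizer("The quick brown fox", return_tensors="pt")

    ref = AutoModelForCausalLM.from_pretrained(model_id).eval()
    copied = copied_Llama4ForCausalLM.from_pretrained(model_id).eval()

    with torch.no_grad():
        ref_logits = ref(**inputs).logits
        new_logits = copied(**inputs).logits

    # Identical code paths on identical weights should agree bitwise, or at
    # least to numerical tolerance.
    torch.testing.assert_close(new_logits, ref_logits)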

    NVFP4InferenceLinear,
    nvfuser_f16a_nvfp4weight_scaled_grouped_mm,
    nvfuser_f16a_nvfp4weight_scaled_mm,
    copied_Llama4ForCausalLM,
@jjsjann123 (Collaborator) commented:

nitpick, we could use a better name on this 😜

    )
    from transformers.models.llama4 import Llama4TextConfig
    from transformers.models.llama4.modeling_llama4 import (
        Llama4TextMoe,
@jjsjann123 (Collaborator) commented:

At a high level, one of the intentions of this benchmark is to replace Llama4TextMoe with the custom module Llama4MoE defined in this file.

We don't have to fix it right now, but we can add a TODO to swap it out later. I think that would let us simplify the implementation and avoid some of the copied_ prefixes.
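
A marker like the following could capture that (a hypothetical sketch of the TODO, placed where the copied decoder layer builds its MoE):

    # TODO(llama4-bench): construct the custom Llama4MoE (defined in this
    # file) directly instead of copied_Llama4TextMoe, so the post-load swap
    # in benchmark_inference.py and some of the copied_ prefixes go away.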



    # Ref: https://github.com/huggingface/transformers/blob/ff8b88a9/src/transformers/models/llama4/modeling_llama4.py#L147-L165
    class copied_Llama4TextMoe(nn.Module):
@jjsjann123 (Collaborator) commented:

This, I believe, is the MoE that we want to swap.

Here the new model has copied_Llama4TextMoe. If you look at benchmark_inference.py, we have a function that does the swapping, _replace_llama4_moe; that wouldn't work with the updated model, since it only mechanically swaps Llama4TextMoe.
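
A minimal sketch of making the swap robust to the renamed class (assuming _replace_llama4_moe walks submodules; Llama4MoE.from_hf is a hypothetical weight-conversion helper, not the PR's actual API):

    import torch.nn as nn
    from transformers.models.llama4.modeling_llama4 import Llama4TextMoe

    from layers_for_inference_benchmark import Llama4MoE, copied_Llama4TextMoe

    def _replace_llama4_moe(model: nn.Module) -> None:
        # Match the upstream class and the local copy, so the swap keeps
        # working with the renamed model.
        for parent in model.modules():
            for name, child in parent.named_children():
                if isinstance(child, (Llama4TextMoe, copied_Llama4TextMoe)):
                    setattr(parent, name, Llama4MoE.from_hf(child))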
