Conversation

laxmareddyp (Collaborator) commented on Sep 5, 2025

@divyashreepathihalli @mattdangerw @abheesht17 Could you please take a look and share your feedback on the quality of the code generated through the script?

I estimate that 80-85% of the code matches the expected implementation; the backbone files import successfully and it is possible to instantiate a backbone model. There were still some errors, which might be alleviated with a stronger model.

The converter and weight conversion scripts are still in development. Generating a workable solution is complex because it requires providing the model with a comprehensive understanding of the entire architectural layout to handle the intricate dependencies of the model's layers and weights.

Checklist

  • I have added all the necessary unit tests for my change.
  • I have verified that my change does not break existing code and works with all backends (TensorFlow, JAX, and PyTorch).
  • My PR is based on the latest changes of the main branch (if unsure, rebase the code).
  • I have followed the Keras Hub Model contribution guidelines in making these changes.
  • I have followed the Keras Hub API design guidelines in making these changes.
  • I have signed the Contributor License Agreement.

divyashreepathihalli (Collaborator) commented:

/gemini review

gemini-code-assist (bot) left a comment

Code Review

This pull request introduces a new GPT-OSS model, seemingly generated by a script. While the overall structure aligns with the repository's conventions, there are several critical issues that need to be addressed. The weight conversion scripts in both utils/transformers and tools/checkpoint_conversion are non-functional due to incorrect layer and weight names, likely a result of being copied from another model's implementation without proper adaptation. Additionally, there's a critical code duplication issue with the GptOssCausalLMPreprocessor class being defined in two separate files. Several style guide violations were also found, including the use of type hints in function signatures, incorrect docstring formatting, and missing docstrings or examples in several new components. These issues must be resolved to ensure the model is correct, maintainable, and usable.

laxmareddyp (Collaborator, Author) commented:

/gemini review

gemini-code-assist (bot) left a comment

Code Review

This pull request introduces the GPT_OSS model, including its backbone, causal language model task, preprocessor, and tokenizer. The overall structure aligns well with the project's conventions. However, there are several areas that need attention.

Most critically, the Hugging Face weight conversion script in convert_gpt_oss.py is incorrect and incompatible with the implemented model architecture, particularly for the Mixture-of-Experts (MoE) layers. This will prevent loading pre-trained weights correctly. I've provided a corrected implementation for the convert_weights function.

Additionally, there are multiple instances of incorrect docstring formatting across the new files, which violates the project's style guide. I've pointed these out with suggestions for correction. A minor but important clarification is also needed in a test file comment to improve maintainability.

laxmareddyp (Collaborator, Author) commented:

/gemini review

gemini-code-assist (bot) left a comment

Code Review

This pull request introduces a new GptOss model, including its backbone, causal language model, preprocessor, and associated tests and conversion scripts. The code is script-generated, and this review focuses on its quality, correctness, and adherence to the project's style guide.

The model definition and tests are generally well-structured. However, there are a few issues in the GptOssBackbone implementation regarding initialization and configuration that need to be addressed. The most critical issue lies in the Hugging Face conversion script (convert_gpt_oss.py), which appears to contain significant copy-paste errors from another model's converter. This will prevent correct weight loading and needs a substantial revision. My detailed comments provide specific suggestions to fix these issues.

mattdangerw (Member) commented:

@laxmareddyp does it work? Going by the test output, it looks like it doesn't even produce valid code yet, right?

The overall code style looks ok, so this will come down to how accurate the code is.

I would view scripts like this as a developer tool rather than an automated workflow. So maybe the thing to do here is to try getting this code working end to end. Then you'll have some first-hand experience of how useful this was at saving time. If there are tons of hard-to-find errors in the code, this might be slower than doing it yourself. If it's pretty accurate, maybe it is saving time and worth putting forward as a tool for contributors to use.

laxmareddyp (Collaborator, Author) commented:

/gemini review

gemini-code-assist (bot) left a comment

Code Review

This pull request introduces a new GptOss model, which is a significant contribution. The code is mostly well-structured and follows the project's conventions. However, I've found a few critical issues, primarily related to the tokenizer usage in tests and a bug in the weight conversion logic for the MoE layers. There are also some inconsistencies in the checkpoint conversion script and a minor point of confusion in a test file's comments. Addressing these points will be crucial for the model's correctness and maintainability.

Comment on lines +111 to +142
# Gate projection
loader.port_weight(
    keras_variable=expert.gate_up_proj[
        expert_idx, :, : backbone.intermediate_dim
    ],
    hf_weight_key=f"model.layers.{i}.mlp.experts.{expert_idx}.gate_proj.weight",
    hook_fn=lambda hf_tensor, _: np.transpose(
        hf_tensor, axes=(1, 0)
    ),
)
loader.port_weight(
    keras_variable=expert.gate_up_proj_bias[
        expert_idx, : backbone.intermediate_dim
    ],
    hf_weight_key=f"model.layers.{i}.mlp.experts.{expert_idx}.gate_proj.bias",
)
# Up projection
loader.port_weight(
    keras_variable=expert.gate_up_proj[
        expert_idx, :, backbone.intermediate_dim :
    ],
    hf_weight_key=f"model.layers.{i}.mlp.experts.{expert_idx}.up_proj.weight",
    hook_fn=lambda hf_tensor, _: np.transpose(
        hf_tensor, axes=(1, 0)
    ),
)
loader.port_weight(
    keras_variable=expert.gate_up_proj_bias[
        expert_idx, backbone.intermediate_dim :
    ],
    hf_weight_key=f"model.layers.{i}.mlp.experts.{expert_idx}.up_proj.bias",
)

critical

There is a critical mismatch in how the expert weights for gate_up_proj are loaded versus how they are used. The GptOssExperts layer expects the gate and up projections to be interleaved in the last dimension of the gate_up_proj weight (it splits them using gate = gate_up[..., ::2] and up = gate_up[..., 1::2]). However, this conversion script loads the gate_proj and up_proj weights from the Hugging Face model into contiguous blocks. This will lead to incorrect computations and model behavior. The weights need to be properly interleaved during the conversion process.
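
A minimal sketch of the interleaving, assuming `gate_up_proj` stores the gate values in the even columns and the up values in the odd columns of its last dimension; `hf_gate` and `hf_up` are hypothetical arrays standing in for the already-transposed Hugging Face weights of one expert:

import numpy as np

# Hypothetical shapes for one expert; hf_gate and hf_up stand in for the
# transposed gate_proj / up_proj weights loaded from the checkpoint.
hidden_dim, intermediate_dim = 16, 8
hf_gate = np.random.rand(hidden_dim, intermediate_dim)
hf_up = np.random.rand(hidden_dim, intermediate_dim)

# Interleave so that gate_up[..., ::2] recovers the gate projection and
# gate_up[..., 1::2] recovers the up projection, as GptOssExperts expects.
gate_up = np.empty((hidden_dim, 2 * intermediate_dim), dtype=hf_gate.dtype)
gate_up[:, ::2] = hf_gate
gate_up[:, 1::2] = hf_up

assert np.allclose(gate_up[:, ::2], hf_gate)
assert np.allclose(gate_up[:, 1::2], hf_up)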

Comment on lines 31 to 34
self.tokenizer = GptOssTokenizer(
    proto=os.path.join(
        self.get_test_data_dir(), "gpt_oss_test_vocab.spm"
    )

high

The GptOssTokenizer is being instantiated with a proto argument, which is intended for a SentencePieceTokenizer. However, GptOssTokenizer inherits from BytePairTokenizer and expects vocabulary and merges arguments. This will cause the test to fail. The test setup should be updated to correctly instantiate GptOssTokenizer with a vocabulary and merge rules. Based on convert_gpt_oss.py, this tokenizer is indeed intended to be a BytePairTokenizer.
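
For illustration, a sketch of how the test setup could construct the tokenizer as a byte-pair tokenizer instead; the vocabulary and merge rules below are hypothetical placeholders, not the real test fixtures:

# Hypothetical BPE fixtures; a real test would load or generate a small
# vocabulary and merges list for GPT-OSS rather than a SentencePiece proto.
vocab = {"<|endoftext|>": 0, "a": 1, "b": 2, "ab": 3}
merges = ["a b"]
self.tokenizer = GptOssTokenizer(vocabulary=vocab, merges=merges)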

Comment on lines 33 to 37
GptOssTokenizer(
    # Generated using create_gpt_oss_test_proto.py
    proto=os.path.join(
        self.get_test_data_dir(), "gpt_oss_test_vocab.spm"
    )

high

Similar to other tests in this PR, the GptOssTokenizer is being instantiated with a proto argument. This is incorrect as GptOssTokenizer is a BytePairTokenizer and expects vocabulary and merges arguments, not a SentencePiece proto file. This test needs to be updated to correctly initialize the tokenizer for it to pass.

Comment on lines +45 to +48
PRESET_MAP = {
    "gpt_oss_20b_en": "openai/gpt-oss-20b",
    # "gpt_oss_instruct_8x7b_en": "openai/gpt-oss-20b",
}

high

The PRESET_MAP in this conversion script defines a preset gpt_oss_20b_en which maps to openai/gpt-oss-20b. This is inconsistent with the presets defined in keras_hub/src/models/gpt_oss/gpt_oss_presets.py (e.g., gpt_oss_8_7b_en). To avoid confusion and potential errors, please ensure the preset names and associated Hugging Face model identifiers in this script are aligned with the presets being added in this pull request.

Comment on lines +81 to +82
+ (2 * 16 * 2 * 8) # Experts gate_up_proj weight
+ (2 * 2 * 8) # Experts gate_up_proj bias

medium

The comments for Experts gate_up_proj weight and Experts gate_up_proj bias are a bit confusing. The number of parameters for gate_up_proj weight is num_experts * hidden_dim * 2 * intermediate_dim, and for the bias is num_experts * 2 * intermediate_dim. The comments seem to incorrectly involve top_k in the calculation. For clarity, it would be better to reflect the actual formula in the comments.

Suggested change:
  Before:
    + (2 * 16 * 2 * 8)  # Experts gate_up_proj weight
    + (2 * 2 * 8)  # Experts gate_up_proj bias
  After:
    + (2 * 16 * (2 * 8))  # Experts gate_up_proj weight
    + (2 * (2 * 8))  # Experts gate_up_proj bias
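
For reference, a small sketch of the arithmetic, assuming the test configuration implied by these numbers is num_experts=2, hidden_dim=16, and intermediate_dim=8 (top_k does not enter either count):

num_experts, hidden_dim, intermediate_dim = 2, 16, 8
gate_up_proj_weight_params = num_experts * hidden_dim * (2 * intermediate_dim)  # 512
gate_up_proj_bias_params = num_experts * (2 * intermediate_dim)  # 32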

Comment on lines +192 to +282
def score(
    self,
    token_ids,
    padding_mask=None,
    scoring_mode="logits",
    layer_intercept_fn=None,
    target_ids=None,
):
    """Score a generation represented by the provided token ids.

    Args:
        token_ids: A `<int>[batch_size, num_tokens]` tensor containing
            tokens to score. Typically, this tensor captures the output
            from a call to `GptOssCausalLM.generate()`, i.e., tokens for
            both the input text and the model-generated text.
        padding_mask: A `<bool>[batch_size, num_tokens]` tensor indicating
            the tokens that should be preserved during generation. This is
            an artifact required by the GptOssBackbone and isn't
            influential on the computation of this function. If omitted,
            this function uses `keras.ops.ones()` to create a tensor of
            the appropriate shape.
        scoring_mode: The type of scores to return, either "logits" or
            "loss"; both are per input token.
        layer_intercept_fn: An optional function for augmenting
            activations with additional computation, for example, as part
            of interpretability research. This function will be passed the
            activations as its first parameter and a numeric index
            associated with that backbone layer. This index is *not* an
            index into `self.backbone.layers`. The index -1 accompanies
            the embeddings returned by calling
            `self.backbone.token_embedding()` on `token_ids` in the
            forward direction. All subsequent indexes will be 0-based
            indices for the activations returned by each of the
            transformer layers in the backbone. This function must
            return a `<float>[batch_size, num_tokens, hidden_dims]`
            tensor that can be passed as an input to the next layer in
            the model.
        target_ids: A `<int>[batch_size, num_tokens]` tensor containing
            the predicted tokens against which the loss should be
            computed. If a span of tokens is provided (sequential truthy
            values along axis=1 in the tensor), the loss will be computed
            as the aggregate across those tokens.

    Raises:
        ValueError: If an unsupported scoring_mode is provided, or if
            target_ids are not provided when scoring_mode is "loss".

    Returns:
        The per-token scores as a tensor of size
        `<float>[batch_size, num_tokens, vocab_size]` in "logits" mode, or
        `<float>[batch_size, num_tokens]` in "loss" mode.
    """
    if scoring_mode not in ("logits", "loss"):
        raise ValueError(
            "Unsupported scoring_mode. Must be one of 'logits' or 'loss'."
        )

    if scoring_mode == "loss" and target_ids is None:
        raise ValueError(
            "Cannot compute loss without targets. Please provide target "
            "token ids via the target_ids parameter."
        )

    batch_shape = ops.shape(token_ids)[:2]
    assert len(batch_shape) == 2

    if layer_intercept_fn is None:

        def default_layer_intercept_fn(x, unused_i):
            return x

        layer_intercept_fn = default_layer_intercept_fn

    token_embeddings = self.backbone.token_embedding(token_ids)
    x = layer_intercept_fn(token_embeddings, -1)

    for i, transformer_layer in enumerate(self.backbone.transformer_layers):
        x = transformer_layer(x, decoder_padding_mask=padding_mask)
        x = layer_intercept_fn(x, i)

    x = self.backbone.layer_norm(x)
    logits = self.backbone.token_embedding(x, reverse=True)

    if scoring_mode == "logits":
        return logits

    per_token_loss_fn = keras.losses.SparseCategoricalCrossentropy(
        from_logits=True, reduction="none"
    )
    per_token_loss = per_token_loss_fn(target_ids, logits)
    return per_token_loss

medium

The score method re-implements the forward pass of the backbone. While this is necessary to inject the layer_intercept_fn, it duplicates the logic from the functional model definition in GptOssBackbone. This could lead to maintenance issues if the backbone's forward pass changes and this method isn't updated in sync. Consider refactoring to reduce this duplication if possible, perhaps by making the backbone's forward pass more modular.
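
For context, a hypothetical use of this duplicated forward pass via `layer_intercept_fn`, e.g. to record per-layer activations while scoring (the `causal_lm` and `token_ids` names here are illustrative, not part of the PR):

# Capture each layer's activations: index -1 is the embedding output,
# indices 0..N-1 correspond to the transformer layers.
captured = {}

def intercept(activations, layer_index):
    captured[layer_index] = activations
    return activations

logits = causal_lm.score(
    token_ids=token_ids,
    scoring_mode="logits",
    layer_intercept_fn=intercept,
)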

Comment on lines +401 to +407
output = (decoder_output,)
if self_attention_cache is not None:
    output += (self_attention_cache,)
if self.output_router_logits:
    output += (router_logits,)

return output[0] if len(output) == 1 else output

medium

The return signature of the call method is variable. It returns a single tensor or a tuple depending on the values of self_attention_cache and output_router_logits. This can make the layer's API brittle and hard to use correctly. It would be more robust to always return a tuple, for example (decoder_output, self_attention_cache, router_logits), where the latter two can be None.
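
A minimal sketch of the suggested fixed-arity return, assuming callers are updated to unpack three values:

# Sketch: always return a 3-tuple so the call signature is stable;
# entries that do not apply are None.
router_logits_out = router_logits if self.output_router_logits else None
return decoder_output, self_attention_cache, router_logits_out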
