Conversation

laxmareddyp (Collaborator) commented on Sep 5, 2025

@divyashreepathihalli @mattdangerw @abheesht17 Could you please take a look and share your feedback on the quality of the code generated through the script?

I estimate that 80-85% of the code matches the expected implementation; the backbone files import successfully and it is possible to instantiate a backbone model. There were still some errors, which might be alleviated with a stronger model.

The converter and weight conversion scripts are still in development. Generating a workable solution is complex because it requires providing the model with a comprehensive understanding of the entire architectural layout to handle the intricate dependencies of the model's layers and weights.

Checklist

  • I have added all the necessary unit tests for my change.
  • I have verified that my change does not break existing code and works with all backends (TensorFlow, JAX, and PyTorch).
  • My PR is based on the latest changes of the main branch (if unsure, rebase the code).
  • I have followed the Keras Hub Model contribution guidelines in making these changes.
  • I have followed the Keras Hub API design guidelines in making these changes.
  • I have signed the Contributor License Agreement.

divyashreepathihalli (Collaborator) commented:

/gemini review

gemini-code-assist (bot) left a comment

Code Review

This pull request introduces a new GPT-OSS model, seemingly generated by a script. While the overall structure aligns with the repository's conventions, there are several critical issues that need to be addressed. The weight conversion scripts in both utils/transformers and tools/checkpoint_conversion are non-functional due to incorrect layer and weight names, likely a result of being copied from another model's implementation without proper adaptation. Additionally, there's a critical code duplication issue with the GptOssCausalLMPreprocessor class being defined in two separate files. Several style guide violations were also found, including the use of type hints in function signatures, incorrect docstring formatting, and missing docstrings or examples in several new components. These issues must be resolved to ensure the model is correct, maintainable, and usable.

laxmareddyp (Collaborator, Author) commented:

/gemini review

gemini-code-assist (bot) left a comment

Code Review

This pull request introduces the GPT_OSS model, including its backbone, causal language model task, preprocessor, and tokenizer. The overall structure aligns well with the project's conventions. However, there are several areas that need attention.

Most critically, the Hugging Face weight conversion script in convert_gpt_oss.py is incorrect and incompatible with the implemented model architecture, particularly for the Mixture-of-Experts (MoE) layers. This will prevent loading pre-trained weights correctly. I've provided a corrected implementation for the convert_weights function.

Additionally, there are multiple instances of incorrect docstring formatting across the new files, which violates the project's style guide. I've pointed these out with suggestions for correction. A minor but important clarification is also needed in a test file comment to improve maintainability.

laxmareddyp (Collaborator, Author) commented:

/gemini review

gemini-code-assist (bot) left a comment

Code Review

This pull request introduces a new GptOss model, including its backbone, causal language model, preprocessor, and associated tests and conversion scripts. The code is script-generated, and this review focuses on its quality, correctness, and adherence to the project's style guide.

The model definition and tests are generally well-structured. However, there are a few issues in the GptOssBackbone implementation regarding initialization and configuration that need to be addressed. The most critical issue lies in the Hugging Face conversion script (convert_gpt_oss.py), which appears to contain significant copy-paste errors from another model's converter. This will prevent correct weight loading and needs a substantial revision. My detailed comments provide specific suggestions to fix these issues.

mattdangerw (Member) commented:

@laxmareddyp does it work? Going by the test output, it looks like it doesn't even produce valid code yet, right?

The overall code style looks ok, so this will come down to how accurate the code is.

I would view scripts like this as a developer tool rather than an automated workflow. So maybe the thing to do here is to try getting this code working end to end. Then you'll have some first-hand experience of how useful this was at saving time. If there are tons of hard-to-find errors in the code, this might be slower than doing it yourself. If it's pretty accurate, maybe it is saving time and worth putting forward as a tool for contributors to use.

laxmareddyp (Collaborator, Author) commented:

/gemini review

gemini-code-assist (bot) left a comment

Code Review

This pull request introduces a new GptOss model, which is a significant contribution. The code is mostly well-structured and follows the project's conventions. However, I've found a few critical issues, primarily related to the tokenizer usage in tests and a bug in the weight conversion logic for the MoE layers. There are also some inconsistencies in the checkpoint conversion script and a minor point of confusion in a test file's comments. Addressing these points will be crucial for the model's correctness and maintainability.

Comment on lines +111 to +142
# Gate projection
loader.port_weight(
    keras_variable=expert.gate_up_proj[
        expert_idx, :, : backbone.intermediate_dim
    ],
    hf_weight_key=f"model.layers.{i}.mlp.experts.{expert_idx}.gate_proj.weight",
    hook_fn=lambda hf_tensor, _: np.transpose(
        hf_tensor, axes=(1, 0)
    ),
)
loader.port_weight(
    keras_variable=expert.gate_up_proj_bias[
        expert_idx, : backbone.intermediate_dim
    ],
    hf_weight_key=f"model.layers.{i}.mlp.experts.{expert_idx}.gate_proj.bias",
)
# Up projection
loader.port_weight(
    keras_variable=expert.gate_up_proj[
        expert_idx, :, backbone.intermediate_dim :
    ],
    hf_weight_key=f"model.layers.{i}.mlp.experts.{expert_idx}.up_proj.weight",
    hook_fn=lambda hf_tensor, _: np.transpose(
        hf_tensor, axes=(1, 0)
    ),
)
loader.port_weight(
    keras_variable=expert.gate_up_proj_bias[
        expert_idx, backbone.intermediate_dim :
    ],
    hf_weight_key=f"model.layers.{i}.mlp.experts.{expert_idx}.up_proj.bias",
)

critical

There is a critical mismatch in how the expert weights for gate_up_proj are loaded versus how they are used. The GptOssExperts layer expects the gate and up projections to be interleaved in the last dimension of the gate_up_proj weight (it splits them using gate = gate_up[..., ::2] and up = gate_up[..., 1::2]). However, this conversion script loads the gate_proj and up_proj weights from the Hugging Face model into contiguous blocks. This will lead to incorrect computations and model behavior. The weights need to be properly interleaved during the conversion process.
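
A minimal sketch of the interleaving, assuming `gate_up_proj` stores the gate values in the even columns and the up values in the odd columns of its last dimension; `hf_gate` and `hf_up` are hypothetical arrays standing in for the already-transposed Hugging Face weights of one expert:

import numpy as np

# Hypothetical shapes for one expert; hf_gate and hf_up stand in for the
# transposed gate_proj / up_proj weights loaded from the checkpoint.
hidden_dim, intermediate_dim = 16, 8
hf_gate = np.random.rand(hidden_dim, intermediate_dim)
hf_up = np.random.rand(hidden_dim, intermediate_dim)

# Interleave so that gate_up[..., ::2] recovers the gate projection and
# gate_up[..., 1::2] recovers the up projection, as GptOssExperts expects.
gate_up = np.empty((hidden_dim, 2 * intermediate_dim), dtype=hf_gate.dtype)
gate_up[:, ::2] = hf_gate
gate_up[:, 1::2] = hf_up

assert np.allclose(gate_up[:, ::2], hf_gate)
assert np.allclose(gate_up[:, 1::2], hf_up)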

Comment on lines 31 to 34
self.tokenizer = GptOssTokenizer(
    proto=os.path.join(
        self.get_test_data_dir(), "gpt_oss_test_vocab.spm"
    )

high

The GptOssTokenizer is being instantiated with a proto argument, which is intended for a SentencePieceTokenizer. However, GptOssTokenizer inherits from BytePairTokenizer and expects vocabulary and merges arguments. This will cause the test to fail. The test setup should be updated to correctly instantiate GptOssTokenizer with a vocabulary and merge rules. Based on convert_gpt_oss.py, this tokenizer is indeed intended to be a BytePairTokenizer.
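
For illustration, a sketch of how the test setup could construct the tokenizer as a byte-pair tokenizer instead; the vocabulary and merge rules below are hypothetical placeholders, not the real test fixtures:

# Hypothetical BPE fixtures; a real test would load or generate a small
# vocabulary and merges list for GPT-OSS rather than a SentencePiece proto.
vocab = {"<|endoftext|>": 0, "a": 1, "b": 2, "ab": 3}
merges = ["a b"]
self.tokenizer = GptOssTokenizer(vocabulary=vocab, merges=merges)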

Comment on lines 33 to 37
GptOssTokenizer(
    # Generated using create_gpt_oss_test_proto.py
    proto=os.path.join(
        self.get_test_data_dir(), "gpt_oss_test_vocab.spm"
    )

high

Similar to other tests in this PR, the GptOssTokenizer is being instantiated with a proto argument. This is incorrect as GptOssTokenizer is a BytePairTokenizer and expects vocabulary and merges arguments, not a SentencePiece proto file. This test needs to be updated to correctly initialize the tokenizer for it to pass.

Comment on lines +45 to +48
PRESET_MAP = {
    "gpt_oss_20b_en": "openai/gpt-oss-20b",
    # "gpt_oss_instruct_8x7b_en": "openai/gpt-oss-20b",
}

high

The PRESET_MAP in this conversion script defines a preset gpt_oss_20b_en which maps to openai/gpt-oss-20b. This is inconsistent with the presets defined in keras_hub/src/models/gpt_oss/gpt_oss_presets.py (e.g., gpt_oss_8_7b_en). To avoid confusion and potential errors, please ensure the preset names and associated Hugging Face model identifiers in this script are aligned with the presets being added in this pull request.

Comment on lines +81 to +82
+ (2 * 16 * 2 * 8) # Experts gate_up_proj weight
+ (2 * 2 * 8) # Experts gate_up_proj bias

medium

The comments for Experts gate_up_proj weight and Experts gate_up_proj bias are a bit confusing. The number of parameters for gate_up_proj weight is num_experts * hidden_dim * 2 * intermediate_dim, and for the bias is num_experts * 2 * intermediate_dim. The comments seem to incorrectly involve top_k in the calculation. For clarity, it would be better to reflect the actual formula in the comments.

Suggested change:
  Before:
    + (2 * 16 * 2 * 8)  # Experts gate_up_proj weight
    + (2 * 2 * 8)  # Experts gate_up_proj bias
  After:
    + (2 * 16 * (2 * 8))  # Experts gate_up_proj weight
    + (2 * (2 * 8))  # Experts gate_up_proj bias
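
For reference, a small sketch of the arithmetic, assuming the test configuration implied by these numbers is num_experts=2, hidden_dim=16, and intermediate_dim=8 (top_k does not enter either count):

num_experts, hidden_dim, intermediate_dim = 2, 16, 8
gate_up_proj_weight_params = num_experts * hidden_dim * (2 * intermediate_dim)  # 512
gate_up_proj_bias_params = num_experts * (2 * intermediate_dim)  # 32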

Comment on lines +192 to +282
def score(
    self,
    token_ids,
    padding_mask=None,
    scoring_mode="logits",
    layer_intercept_fn=None,
    target_ids=None,
):
    """Score a generation represented by the provided token ids.

    Args:
        token_ids: A `<int>[batch_size, num_tokens]` tensor containing
            tokens to score. Typically, this tensor captures the output
            from a call to `GptOssCausalLM.generate()`, i.e., tokens for
            both the input text and the model-generated text.
        padding_mask: A `<bool>[batch_size, num_tokens]` tensor indicating
            the tokens that should be preserved during generation. This is
            an artifact required by the GptOssBackbone and isn't
            influential on the computation of this function. If omitted,
            this function uses `keras.ops.ones()` to create a tensor of
            the appropriate shape.
        scoring_mode: The type of scores to return, either "logits" or
            "loss"; both are per input token.
        layer_intercept_fn: An optional function for augmenting
            activations with additional computation, for example, as part
            of interpretability research. This function will be passed the
            activations as its first parameter and a numeric index
            associated with that backbone layer. This index is *not* an
            index into `self.backbone.layers`. The index -1 accompanies
            the embeddings returned by calling
            `self.backbone.token_embedding()` on `token_ids` in the
            forward direction. All subsequent indexes will be 0-based
            indices for the activations returned by each of the
            transformer layers in the backbone. This function must
            return a `<float>[batch_size, num_tokens, hidden_dims]`
            tensor that can be passed as an input to the next layer in
            the model.
        target_ids: A `<int>[batch_size, num_tokens]` tensor containing
            the predicted tokens against which the loss should be
            computed. If a span of tokens is provided (sequential truthy
            values along axis=1 in the tensor), the loss will be computed
            as the aggregate across those tokens.

    Raises:
        ValueError: If an unsupported scoring_mode is provided, or if
            target_ids are not provided when scoring_mode is "loss".

    Returns:
        The per-token scores as a tensor of size
        `<float>[batch_size, num_tokens, vocab_size]` in "logits" mode, or
        `<float>[batch_size, num_tokens]` in "loss" mode.
    """
    if scoring_mode not in ("logits", "loss"):
        raise ValueError(
            "Unsupported scoring_mode. Must be one of 'logits' or 'loss'."
        )

    if scoring_mode == "loss" and target_ids is None:
        raise ValueError(
            "Cannot compute loss without targets. Please provide target "
            "token ids via the target_ids parameter."
        )

    batch_shape = ops.shape(token_ids)[:2]
    assert len(batch_shape) == 2

    if layer_intercept_fn is None:

        def default_layer_intercept_fn(x, unused_i):
            return x

        layer_intercept_fn = default_layer_intercept_fn

    token_embeddings = self.backbone.token_embedding(token_ids)
    x = layer_intercept_fn(token_embeddings, -1)

    for i, transformer_layer in enumerate(self.backbone.transformer_layers):
        x = transformer_layer(x, decoder_padding_mask=padding_mask)
        x = layer_intercept_fn(x, i)

    x = self.backbone.layer_norm(x)
    logits = self.backbone.token_embedding(x, reverse=True)

    if scoring_mode == "logits":
        return logits

    per_token_loss_fn = keras.losses.SparseCategoricalCrossentropy(
        from_logits=True, reduction="none"
    )
    per_token_loss = per_token_loss_fn(target_ids, logits)
    return per_token_loss

medium

The score method re-implements the forward pass of the backbone. While this is necessary to inject the layer_intercept_fn, it duplicates the logic from the functional model definition in GptOssBackbone. This could lead to maintenance issues if the backbone's forward pass changes and this method isn't updated in sync. Consider refactoring to reduce this duplication if possible, perhaps by making the backbone's forward pass more modular.
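
For context, a hypothetical use of this duplicated forward pass via `layer_intercept_fn`, e.g. to record per-layer activations while scoring (the `causal_lm` and `token_ids` names here are illustrative, not part of the PR):

# Capture each layer's activations: index -1 is the embedding output,
# indices 0..N-1 correspond to the transformer layers.
captured = {}

def intercept(activations, layer_index):
    captured[layer_index] = activations
    return activations

logits = causal_lm.score(
    token_ids=token_ids,
    scoring_mode="logits",
    layer_intercept_fn=intercept,
)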

Comment on lines +401 to +407
output = (decoder_output,)
if self_attention_cache is not None:
    output += (self_attention_cache,)
if self.output_router_logits:
    output += (router_logits,)

return output[0] if len(output) == 1 else output

medium

The return signature of the call method is variable. It returns a single tensor or a tuple depending on the values of self_attention_cache and output_router_logits. This can make the layer's API brittle and hard to use correctly. It would be more robust to always return a tuple, for example (decoder_output, self_attention_cache, router_logits), where the latter two can be None.
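
A minimal sketch of the suggested fixed-arity return, assuming callers are updated to unpack three values:

# Sketch: always return a 3-tuple so the call signature is stable;
# entries that do not apply are None.
router_logits_out = router_logits if self.output_router_logits else None
return decoder_output, self_attention_cache, router_logits_out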
