Generated GPT_OSS model files through porter script. #2384
Conversation
/gemini review
Code Review
This pull request introduces a new GPT-OSS model, seemingly generated by a script. While the overall structure aligns with the repository's conventions, there are several critical issues that need to be addressed. The weight conversion scripts in both `utils/transformers` and `tools/checkpoint_conversion` are non-functional due to incorrect layer and weight names, likely a result of being copied from another model's implementation without proper adaptation. Additionally, there is a critical code duplication issue: the `GptOssCausalLMPreprocessor` class is defined in two separate files. Several style guide violations were also found, including the use of type hints in function signatures, incorrect docstring formatting, and missing docstrings or examples in several new components. These issues must be resolved to ensure the model is correct, maintainable, and usable.
/gemini review
Code Review
This pull request introduces the GPT_OSS model, including its backbone, causal language model task, preprocessor, and tokenizer. The overall structure aligns well with the project's conventions. However, there are several areas that need attention.
Most critically, the Hugging Face weight conversion script in `convert_gpt_oss.py` is incorrect and incompatible with the implemented model architecture, particularly for the Mixture-of-Experts (MoE) layers. This will prevent loading pre-trained weights correctly. I've provided a corrected implementation for the `convert_weights` function.
Additionally, there are multiple instances of incorrect docstring formatting across the new files, which violates the project's style guide. I've pointed these out with suggestions for correction. A minor but important clarification is also needed in a test file comment to improve maintainability.
/gemini review
Code Review
This pull request introduces a new GptOss model, including its backbone, causal language model, preprocessor, and associated tests and conversion scripts. The code is script-generated, and this review focuses on its quality, correctness, and adherence to the project's style guide.
The model definition and tests are generally well-structured. However, there are a few issues in the `GptOssBackbone` implementation regarding initialization and configuration that need to be addressed. The most critical issue lies in the Hugging Face conversion script (`convert_gpt_oss.py`), which appears to contain significant copy-paste errors from another model's converter. This will prevent correct weight loading and needs a substantial revision. My detailed comments provide specific suggestions to fix these issues.
@laxmareddyp does it work? Looks like it probably doesn't even output valid code yet, right? Going by the test output. The overall code style looks ok, so this will come down to how accurate the code is. I would view scripts like this as a developer tool rather than an automated workflow. So maybe the thing to do here is to try getting this code working end to end. Then you can have some first-hand experience on how useful this was at saving time. If there's tons of hard-to-find errors in the code, this might be slower than doing it yourself. If it's pretty accurate, maybe this is saving time and worth putting forward as a tool for contributors to use.
/gemini review
Code Review
This pull request introduces a new GptOss model, which is a significant contribution. The code is mostly well-structured and follows the project's conventions. However, I've found a few critical issues, primarily related to the tokenizer usage in tests and a bug in the weight conversion logic for the MoE layers. There are also some inconsistencies in the checkpoint conversion script and a minor point of confusion in a test file's comments. Addressing these points will be crucial for the model's correctness and maintainability.
```python
# Gate projection
loader.port_weight(
    keras_variable=expert.gate_up_proj[
        expert_idx, :, : backbone.intermediate_dim
    ],
    hf_weight_key=f"model.layers.{i}.mlp.experts.{expert_idx}.gate_proj.weight",
    hook_fn=lambda hf_tensor, _: np.transpose(
        hf_tensor, axes=(1, 0)
    ),
)
loader.port_weight(
    keras_variable=expert.gate_up_proj_bias[
        expert_idx, : backbone.intermediate_dim
    ],
    hf_weight_key=f"model.layers.{i}.mlp.experts.{expert_idx}.gate_proj.bias",
)
# Up projection
loader.port_weight(
    keras_variable=expert.gate_up_proj[
        expert_idx, :, backbone.intermediate_dim :
    ],
    hf_weight_key=f"model.layers.{i}.mlp.experts.{expert_idx}.up_proj.weight",
    hook_fn=lambda hf_tensor, _: np.transpose(
        hf_tensor, axes=(1, 0)
    ),
)
loader.port_weight(
    keras_variable=expert.gate_up_proj_bias[
        expert_idx, backbone.intermediate_dim :
    ],
    hf_weight_key=f"model.layers.{i}.mlp.experts.{expert_idx}.up_proj.bias",
)
```
There is a critical mismatch between how the expert weights for `gate_up_proj` are loaded and how they are used. The `GptOssExperts` layer expects the gate and up projections to be interleaved in the last dimension of the `gate_up_proj` weight (it splits them using `gate = gate_up[..., ::2]` and `up = gate_up[..., 1::2]`). However, this conversion script loads the `gate_proj` and `up_proj` weights from the Hugging Face model into contiguous blocks. This will lead to incorrect computations and model behavior. The weights need to be properly interleaved during the conversion process.
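As an illustration only, a minimal sketch of the interleaving, assuming each expert's gate and up weights have already been transposed to `[hidden_dim, intermediate_dim]`; the `fuse_gate_up` name and shapes are assumptions, not the converter's actual API:

```python
import numpy as np

def fuse_gate_up(gate, up):
    # gate, up: [hidden_dim, intermediate_dim] arrays for a single expert.
    # Returns [hidden_dim, 2 * intermediate_dim] with alternating columns,
    # so that fused[..., ::2] recovers gate and fused[..., 1::2] recovers up.
    hidden_dim, intermediate_dim = gate.shape
    fused = np.empty((hidden_dim, 2 * intermediate_dim), dtype=gate.dtype)
    fused[:, 0::2] = gate
    fused[:, 1::2] = up
    return fused
```

A converter following this layout would assign the fused array to `expert.gate_up_proj[expert_idx]` in one step rather than writing the gate and up halves into contiguous slices.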
```python
self.tokenizer = GptOssTokenizer(
    proto=os.path.join(
        self.get_test_data_dir(), "gpt_oss_test_vocab.spm"
    )
```
The `GptOssTokenizer` is being instantiated with a `proto` argument, which is intended for a `SentencePieceTokenizer`. However, `GptOssTokenizer` inherits from `BytePairTokenizer` and expects `vocabulary` and `merges` arguments. This will cause the test to fail. The test setup should be updated to correctly instantiate `GptOssTokenizer` with a vocabulary and merge rules. Based on `convert_gpt_oss.py`, this tokenizer is indeed intended to be a `BytePairTokenizer`.
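For example, a hedged sketch of what a byte-pair test setup could look like; the vocabulary and merge rules below are placeholder fixtures, not the repository's actual test data:

```python
# Tiny placeholder BPE assets for a unit test.
vocab = {"<|endoftext|>": 0, "air": 1, "Ġair": 2, "plane": 3, "Ġat": 4}
merges = ["a i", "ai r", "p l", "pl a", "pla n", "plan e", "Ġ a", "Ġa i", "Ġai r"]
self.tokenizer = GptOssTokenizer(vocabulary=vocab, merges=merges)
```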
```python
GptOssTokenizer(
    # Generated using create_gpt_oss_test_proto.py
    proto=os.path.join(
        self.get_test_data_dir(), "gpt_oss_test_vocab.spm"
    )
```
Similar to other tests in this PR, the `GptOssTokenizer` is being instantiated with a `proto` argument. This is incorrect: `GptOssTokenizer` is a `BytePairTokenizer` and expects `vocabulary` and `merges` arguments, not a SentencePiece proto file. This test needs to be updated to correctly initialize the tokenizer for it to pass.
```python
PRESET_MAP = {
    "gpt_oss_20b_en": "openai/gpt-oss-20b",
    # "gpt_oss_instruct_8x7b_en": "openai/gpt-oss-20b",
}
```
The `PRESET_MAP` in this conversion script defines a preset `gpt_oss_20b_en` which maps to `openai/gpt-oss-20b`. This is inconsistent with the presets defined in `keras_hub/src/models/gpt_oss/gpt_oss_presets.py` (e.g., `gpt_oss_8_7b_en`). To avoid confusion and potential errors, please ensure the preset names and associated Hugging Face model identifiers in this script are aligned with the presets being added in this pull request.
```python
+ (2 * 16 * 2 * 8)  # Experts gate_up_proj weight
+ (2 * 2 * 8)  # Experts gate_up_proj bias
```
The comments for `Experts gate_up_proj weight` and `Experts gate_up_proj bias` are a bit confusing. The number of parameters for the `gate_up_proj` weight is `num_experts * hidden_dim * 2 * intermediate_dim`, and for the bias it is `num_experts * 2 * intermediate_dim`. The comments seem to incorrectly involve `top_k` in the calculation. For clarity, it would be better to reflect the actual formula in the comments.
```suggestion
+ (2 * 16 * (2 * 8))  # Experts gate_up_proj weight
+ (2 * (2 * 8))  # Experts gate_up_proj bias
```
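For concreteness, a worked version of that formula, assuming the test configuration implied by the numbers above (`num_experts=2`, `hidden_dim=16`, `intermediate_dim=8`; these values are inferred from the snippet, not confirmed by the test file):

```python
num_experts, hidden_dim, intermediate_dim = 2, 16, 8
gate_up_proj_weight_params = num_experts * hidden_dim * (2 * intermediate_dim)  # 512
gate_up_proj_bias_params = num_experts * (2 * intermediate_dim)  # 32
```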
```python
def score(
    self,
    token_ids,
    padding_mask=None,
    scoring_mode="logits",
    layer_intercept_fn=None,
    target_ids=None,
):
    """Score a generation represented by the provided token ids.

    Args:
        token_ids: A `<int>[batch_size, num_tokens]` tensor containing
            tokens to score. Typically, this tensor captures the output
            from a call to `GptOssCausalLM.generate()`, i.e., tokens for
            both the input text and the model-generated text.
        padding_mask: A `<bool>[batch_size, num_tokens]` tensor indicating
            the tokens that should be preserved during generation. This is
            an artifact required by the GptOssBackbone and isn't
            influential on the computation of this function. If omitted,
            this function uses `keras.ops.ones()` to create a tensor of
            the appropriate shape.
        scoring_mode: The type of scores to return, either "logits" or
            "loss", both will be per input token.
        layer_intercept_fn: An optional function for augmenting
            activations with additional computation, for example, as part
            of interpretability research. This function will be passed the
            activations as its first parameter and a numeric index
            associated with that backbone layer. _This index is not_ an
            index into `self.backbone.layers`. The index -1 accompanies
            the embeddings returned by calling
            `self.backbone.token_embedding()` on `token_ids` in the
            forward direction. All subsequent indexes will be 0-based
            indices for the activations returned by each of the
            Transformers layers in the backbone. This function must
            return a `<float>[batch_size, num_tokens, hidden_dims]`
            tensor that can be passed as an input to the next layer in
            the model.
        target_ids: A `<bool>[batch_size, num_tokens]` tensor containing
            the predicted tokens against which the loss should be
            computed. If a span of tokens is provided (sequential truthy
            values along axis=1 in the tensor), the loss will be computed
            as the aggregate across those tokens.

    Raises:
        ValueError: If an unsupported scoring_mode is provided, or if the
            target_ids are not provided when using ScoringMode.LOSS.

    Returns:
        The per-token scores as a tensor of size
        `<float>[batch_size, num_tokens, vocab_size]` in "logits" mode, or
        `<float>[batch_size, num_tokens]` in "loss" mode.
    """
    if scoring_mode not in ("logits", "loss"):
        raise ValueError(
            "Unsupported scoring_mode. Must be one of 'logits' or 'loss'."
        )

    if scoring_mode == "loss" and target_ids is None:
        raise ValueError(
            "Cannot compute loss without targets. Please provide target "
            "token ids via the target_ids parameter."
        )

    batch_shape = ops.shape(token_ids)[:2]
    assert len(batch_shape) == 2

    if layer_intercept_fn is None:

        def default_layer_intercept_fn(x, unused_i):
            return x

        layer_intercept_fn = default_layer_intercept_fn

    token_embeddings = self.backbone.token_embedding(token_ids)
    x = layer_intercept_fn(token_embeddings, -1)

    for i, transformer_layer in enumerate(self.backbone.transformer_layers):
        x = transformer_layer(x, decoder_padding_mask=padding_mask)
        x = layer_intercept_fn(x, i)

    x = self.backbone.layer_norm(x)
    logits = self.backbone.token_embedding(x, reverse=True)

    if scoring_mode == "logits":
        return logits

    per_token_loss_fn = keras.losses.SparseCategoricalCrossentropy(
        from_logits=True, reduction="none"
    )
    per_token_loss = per_token_loss_fn(target_ids, logits)
    return per_token_loss
```
The `score` method re-implements the forward pass of the backbone. While this is necessary to inject the `layer_intercept_fn`, it duplicates the logic from the functional model definition in `GptOssBackbone`. This could lead to maintenance issues if the backbone's forward pass changes and this method isn't updated in sync. Consider refactoring to reduce this duplication if possible, perhaps by making the backbone's forward pass more modular.
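One possible direction, sketched under the assumption that the backbone keeps the attribute names used above (`token_embedding`, `transformer_layers`, `layer_norm`); the helper name `run_layers_with_intercept` is hypothetical:

```python
def run_layers_with_intercept(backbone, token_ids, padding_mask, intercept_fn):
    # Shared layer-by-layer forward pass that both the backbone definition
    # and score() could call, keeping the two code paths in sync.
    x = intercept_fn(backbone.token_embedding(token_ids), -1)
    for i, layer in enumerate(backbone.transformer_layers):
        x = intercept_fn(layer(x, decoder_padding_mask=padding_mask), i)
    return backbone.layer_norm(x)
```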
```python
output = (decoder_output,)
if self_attention_cache is not None:
    output += (self_attention_cache,)
if self.output_router_logits:
    output += (router_logits,)

return output[0] if len(output) == 1 else output
```
The return signature of the `call` method is variable: it returns a single tensor or a tuple depending on the values of `self_attention_cache` and `output_router_logits`. This can make the layer's API brittle and hard to use correctly. It would be more robust to always return a tuple, for example `(decoder_output, self_attention_cache, router_logits)`, where the latter two can be `None`.
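A minimal sketch of the fixed-arity alternative for the tail of `call`, reusing the variable names from the snippet above (whether `None` entries are acceptable to downstream callers is an assumption):

```python
# Always return a 3-tuple; entries that don't apply are None.
return (
    decoder_output,
    self_attention_cache,  # None when no cache was passed in
    router_logits if self.output_router_logits else None,
)
```

Callers could then unpack unconditionally, e.g. `decoder_output, cache, router_logits = layer(...)`, and ignore the entries they don't need.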
@divyashreepathihalli @mattdangerw @abheesht17 Could you please check and provide your feedback on the quality of this code generated through the script?
I estimate that 80-85% of the code matches, the backbone files import successfully, and it is possible to instantiate a backbone model. There were still some errors, which might be alleviated with a stronger model.
The converter and weight conversion scripts are still in development. Generating a workable solution is complex because it requires giving the model a comprehensive understanding of the entire architectural layout so it can handle the intricate dependencies between the model's layers and weights.
Checklist