
Conversation

evellasques
Contributor

Issue #: 24

Description of changes:

The recent merging of the gate/up projections in Llama requires the equivalent merging in the HF to NeMo conversion scripts (and the corresponding splitting in the NeMo to HF script); a rough sketch of the merge follows the converter list below.

This change fixes that for the following converters:

  • convert_nemo_checkpoint_to_hf_llama.py
  • convert_hf_checkpoint_to_nemo_llama.py
  • convert_hf_checkpoint_to_nemo_llama_70b.py
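
For reference, a minimal sketch of the merge the HF → NeMo direction needs, assuming a tp=1 checkpoint and a fused tensor that simply stacks gate on top of up along the output dimension (the key names, sizes, and state-dict variables below are illustrative, not the exact ones used by the scripts):

import torch

hidden, ffn = 4096, 11008  # Llama-7B sizes, used only as an example

# Stand-ins for the loaded HF and NeMo state dicts.
hf_state_dict = {
    "model.layers.0.mlp.gate_proj.weight": torch.randn(ffn, hidden),
    "model.layers.0.mlp.up_proj.weight": torch.randn(ffn, hidden),
}
nemo_state_dict = {}

gate = hf_state_dict["model.layers.0.mlp.gate_proj.weight"]
up = hf_state_dict["model.layers.0.mlp.up_proj.weight"]

# The fused NeMo projection holds both gate and up, so the converter
# concatenates them along the output dimension instead of copying two keys.
nemo_state_dict["mlp.dense_h_to_4h.weight"] = torch.cat([gate, up], dim=0)  # [2*ffn, hidden]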

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@amithrm
Contributor

amithrm commented Apr 22, 2024

Thanks @evellasques for the PR. Going over it!

"self_attention.core_attention.rotary_emb.inv_freq": (0, "self_attn.rotary_emb.inv_freq", None, 0),
"mlp.dense_h_to_4h.weight": (1, "mlp.gate_proj.weight", 0, 0),
"mlp.dense_h_to_4h_2.weight": (1, "mlp.up_proj.weight", 0, 0),
"mlp.dense_h_to_4h.weight": (1, "mlp.gate_proj_up_proj.weight", 0, 0),


Why consider "gate" and "up" proj to be fused for the HF checkpoint? Shouldn't you instead split them out of the NeMo checkpoint and then save them as separate "gate" and "up" params for HF?
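
A sketch of the split being suggested here, for a single (tp=1) fused tensor; the 50/50 chunk along dim 0 and the key names are assumptions:

import torch

hidden, ffn = 4096, 11008  # example sizes only

# Hypothetical fused NeMo tensor: gate stacked on top of up.
fused = torch.randn(2 * ffn, hidden)  # stands in for "mlp.dense_h_to_4h.weight"

# Emit the two tensors HF Llama actually expects, rather than a fused
# "gate_proj_up_proj" key that does not exist in the HF model.
gate_proj, up_proj = torch.chunk(fused, 2, dim=0)
hf_model = {
    "mlp.gate_proj.weight": gate_proj,  # [ffn, hidden]
    "mlp.up_proj.weight": up_proj,      # [ffn, hidden]
}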

hf_model[hf_key_q], hf_model[hf_key_k], hf_model[hf_key_v] = torch.split(hf_model[hf_key], size_per_seg, dim=0)
hf_model.pop(hf_key)

if "dense_h_to_4h" in k:


I think this is not accurate. The "gate" and "up" fusion is per TP rank in the NeMo checkpoint, so you can't first concatenate all TP shards and then split into "gate" and "up". Instead, you should split them for each TP rank.
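
In other words, something like the following sketch, where the split happens on each TP shard before the shards are concatenated (tp_shards, the sizes, and the 50/50 chunk per rank are assumptions, not the exact layout used by NeMo):

import torch

hidden, ffn_per_rank, tp = 4096, 2752, 4  # example: 11008 split over 4 ranks

# Stand-in for the per-rank fused tensors loaded from the NeMo checkpoint:
# each TP rank stores its own gate slice stacked on its own up slice.
tp_shards = [torch.randn(2 * ffn_per_rank, hidden) for _ in range(tp)]

gate_parts, up_parts = [], []
for shard in tp_shards:
    # Split gate from up within each rank first...
    gate, up = torch.chunk(shard, 2, dim=0)
    gate_parts.append(gate)
    up_parts.append(up)

# ...and only then concatenate across TP ranks to recover the full HF tensors.
hf_gate_proj = torch.cat(gate_parts, dim=0)  # [ffn, hidden]
hf_up_proj = torch.cat(up_parts, dim=0)      # [ffn, hidden]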

