
Conversation

evellasques
Contributor

  • In a more recent version of Transformers, the inv_freq buffer is no longer persistent (huggingface/transformers@95f96b4).
  • This causes a crash when someone tries to load a (converted) Llama checkpoint in NeMo (e.g. CodeLlama 7b).

Issue #, if available: N/A

Description of changes: Switches off the persistent flag when registering the inv_freq buffer.
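
A minimal sketch of the idea, assuming a standard PyTorch rotary-embedding module (the class name and signature here are illustrative, not the actual NeMo code):

    # Sketch only: register inv_freq as a non-persistent buffer so it is
    # neither written to nor expected in the state_dict, matching the
    # behavior of recent Transformers releases.
    import torch
    import torch.nn as nn

    class RotaryEmbedding(nn.Module):
        def __init__(self, dim: int, base: float = 10000.0):
            super().__init__()
            inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
            # persistent=False keeps inv_freq out of state_dict()
            self.register_buffer("inv_freq", inv_freq, persistent=False)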

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@evellasques
Contributor Author

Hi,

Does anyone have an update on this? Basically, loading any of the recent HF checkpoints that rely on rotary position embeddings (e.g. Mistral, Llama) will result in a crash:

 Missing key(s) in state_dict: "model.language_model.encoder.layers.0.self_attention.core_attention.rotary_emb.inv_freq", "model.language_model.encoder.layers.1.self_attention.core_attention.rotary_emb.inv_freq", "model.language_model.encoder.layers.2.self_attention.core_attention.rotary_emb.inv_freq", "model.language_model.encoder.layers.3.self_attention.core_attention.rotary_emb.inv_freq", "model.language_model.encoder.layers.4.self_attention.core_attention.rotary_emb.inv_freq", "model.language_model.encoder.layers.5.self_attention.core_attention.rotary_emb.inv_freq", "model.language_model.encoder.layers.6.self_attention.core_attention.rotary_emb.inv_freq", "model.language_model.encoder.layers.7.self_attention.core_attention.rotary_emb.inv_freq", "model.language_model.encoder.layers.8.self_attention.core_attention.rotary_emb.inv_freq", "model.language_model.encoder.layers.9.self_attention.core_attention.rotary_emb.inv_freq", "model.language_model.encoder.layers.10.self_attention.core_attention.rotary_emb.inv_freq", "model.language_model.encoder.layers.11.self_attention.core_attention.rotary_emb.inv_freq", "model.language_model.encoder.layers.12.self_attention.core_attention.rotary_emb.inv_freq", "model.language_model.encoder.layers.13.self_attention.core_attention.rotary_emb.inv_freq", "model.language_model.encoder.layers.14.self_attention.core_attention.rotary_emb.inv_freq", "model.language_model.encoder.layers.15.self_attention.core_attention.rotary_emb.inv_freq", "model.language_model.encoder.layers.16.self_attention.core_attention.rotary_emb.inv_freq", "model.language_model.encoder.layers.17.self_attention.core_attention.rotary_emb.inv_freq", "model.language_model.encoder.layers.18.self_attention.core_attention.rotary_emb.inv_freq", "model.language_model.encoder.layers.19.self_attention.core_attention.rotary_emb.inv_freq", "model.language_model.encoder.layers.20.self_attention.core_attention.rotary_emb.inv_freq", "model.language_model.encoder.layers.21.self_attention.core_attention.rotary_emb.inv_freq", "model.language_model.encoder.layers.22.self_attention.core_attention.rotary_emb.inv_freq", "model.language_model.encoder.layers.23.self_attention.core_attention.rotary_emb.inv_freq", "model.language_model.encoder.layers.24.self_attention.core_attention.rotary_emb.inv_freq", "model.language_model.encoder.layers.25.self_attention.core_attention.rotary_emb.inv_freq", "model.language_model.encoder.layers.26.self_attention.core_attention.rotary_emb.inv_freq", "model.language_model.encoder.layers.27.self_attention.core_attention.rotary_emb.inv_freq", "model.language_model.encoder.layers.28.self_attention.core_attention.rotary_emb.inv_freq", "model.language_model.encoder.layers.29.self_attention.core_attention.rotary_emb.inv_freq", "model.language_model.encoder.layers.30.self_attention.core_attention.rotary_emb.inv_freq", "model.language_model.encoder.layers.31.self_attention.core_attention.rotary_emb.inv_freq". 

This happens because HF no longer serializes inv_freq.
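
An illustrative way to see the mismatch (the model ID is just an example): recent Transformers versions register inv_freq as a non-persistent buffer, so the keys are absent from the Hugging Face state_dict, while a strict load in NeMo still expects them.

    # Sketch only: check whether inv_freq keys appear in an HF checkpoint's state_dict.
    from transformers import AutoModelForCausalLM

    model = AutoModelForCausalLM.from_pretrained("codellama/CodeLlama-7b-hf")
    inv_freq_keys = [k for k in model.state_dict() if "inv_freq" in k]
    print(inv_freq_keys)  # [] on recent Transformers; older versions list one key per layer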

@aws-singhada

aws-singhada commented Apr 4, 2024 via email

@aws-kingrj
Contributor

Can you resolve the merge conflicts? Then we can merge it.

@evellasques
Contributor Author

evellasques commented Apr 4, 2024

Can you resolve the merge conflicts? Then we can merge it

Quick question: solving the conflict will involve replacing nemo/nemo/collections/nlp/modules/common/megatron/llama_module.py with nemo/nemo/collections/nlp/modules/common/megatron/falcon_module.py. I opened an issue about a bug in the Llama conversion scripts (basically, they should aggregate the gate_proj and up_proj weights from the HuggingFace Llama checkpoint into a single dense_h_to_4h weight for SwiGLU; a sketch is below). Do you want me to also fix that and add it as part of this PR?
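
Roughly what the aggregation would look like (sketch only; the helper name is made up, the HF key names follow the standard Llama layout, and the exact fused ordering NeMo/Megatron expects, e.g. any per-shard interleaving for tensor parallelism, may differ):

    # Sketch: fuse HF Llama gate_proj and up_proj into one dense_h_to_4h weight for SwiGLU.
    import torch

    def merge_swiglu_weights(hf_state_dict: dict, layer: int) -> torch.Tensor:
        gate = hf_state_dict[f"model.layers.{layer}.mlp.gate_proj.weight"]
        up = hf_state_dict[f"model.layers.{layer}.mlp.up_proj.weight"]
        # Both are [intermediate_size, hidden_size]; concatenating along dim 0
        # yields the fused [2 * intermediate_size, hidden_size] dense_h_to_4h weight.
        return torch.cat([gate, up], dim=0)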

@evellasques
Contributor Author

Can you resolve the merge conflicts? Then we can merge it

I noticed that in a more recent release you handle that during checkpoint loading. I created another PR for the issue with checkpoint conversion (#26), so I'll close this one.

@evellasques
Contributor Author

The main issue was solved by another PR.
