model: Add support for GLM 4.5 family of models (#14921) #14939
Conversation
Force-pushed from 5da3811 to ec5c193
Just a few quick notes from a glance:
Will do a proper review when you are ready. :)
Hey @CISC, no worries on the naming etc., will do.
Force-pushed from c4dbf69 to b4c60e1
FYI when trying to run …
That's because converting FP8 weights isn't supported yet, see #14810
Force-pushed from 1957023 to 4397ccb
I'm close to having convert_hf_to_gguf.py and llama-quantize working (see updated PR): it completes conversion without error and I was then able to quantise to Q4_K_M. gguf-dump worked, but llama-server picked up a tensor mapping issue with token_embd.weight, so I've just put a fix into convert_hf_to_gguf.py. I'm going through the whole conversion-then-quantisation process again. It's getting late here (hi from Melbourne 👋), so I'll come back in ~20 and see if it's finished.
The LLM_TYPE code is wrong: those models aren't dense 12B and 32B models, respectively. You have to add new MoE constants for them (see the Qwen3 and Ernie MoEs as examples).
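For anyone following along, a minimal sketch of what that change might look like, modelled on the existing Qwen3/Ernie MoE entries; the constant names and layer counts below are assumptions, not necessarily what the PR ends up using:

```cpp
// Sketch only: MoE size labels of the form <total>B_A<active>B, as other MoE
// architectures use, instead of the dense 12B/32B types.
// (Names and layer counts are assumptions for illustration.)
enum llm_type {
    // ...
    LLM_TYPE_106B_A12B, // GLM-4.5-Air
    LLM_TYPE_355B_A32B, // GLM-4.5
    // ...
};

// In the GLM4_MOE hparams loading, pick the type from the layer count:
switch (hparams.n_layer) {
    case 47: type = LLM_TYPE_106B_A12B; break; // 46 layers + 1 NextN layer (assumed)
    case 93: type = LLM_TYPE_355B_A32B; break; // 92 layers + 1 NextN layer (assumed)
    default: type = LLM_TYPE_UNKNOWN;
}
```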
Also, you might want to include the nextn tensors instead of throwing them out - MTP support is not there yet, but that way you won't have to reconvert and requantize if/when it arrives.
Thanks @pwilkin, LLM_TYPE updated. I've added the nextn tensors into the conversion, skipping mapping to avoid errors. |
Note that preserving the nextn tensors does result in a larger GGUF (780 tensors -> 1184, and 214GB -> 221GB for the f16).
I can't replicate that error, @Thireus.
Obviously, but they won't get loaded since they're not supported 😄 Also, don't make my mistake: don't convert to f16, use --outtype bf16, or your model will probably have errors in the tensors.
If you add unused tensors to the GGUF you must mark those tensors as unused. Just FYI, all other models with MTP so far have those tensors stripped.
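As a rough illustration of the "keep but mark unused" idea (this assumes a loader-level skip flag like the TENSOR_SKIP one this PR later adds; the tensor enum names, field names, and shapes are placeholders, not the real ones):

```cpp
// Sketch only: request the NextN/MTP tensors with a skip flag so they stay in the
// GGUF but are never allocated on, or copied to, a backend buffer.
// TENSOR_SKIP and the LLM_TENSOR_NEXTN_* names are assumptions for illustration.
if (hparams.num_nextn_predict_layers > 0) {
    const int il = n_layer; // assume the NextN weights sit after the last regular layer
    nextn.eh_proj      = create_tensor(tn(LLM_TENSOR_NEXTN_EH_PROJ,      "weight", il), {2*n_embd, n_embd}, llama_model_loader::TENSOR_SKIP);
    nextn.embed_tokens = create_tensor(tn(LLM_TENSOR_NEXTN_EMBED_TOKENS, "weight", il), {n_embd, n_vocab},  llama_model_loader::TENSOR_SKIP);
    nextn.enorm        = create_tensor(tn(LLM_TENSOR_NEXTN_ENORM,        "weight", il), {n_embd},           llama_model_loader::TENSOR_SKIP);
    nextn.hnorm        = create_tensor(tn(LLM_TENSOR_NEXTN_HNORM,        "weight", il), {n_embd},           llama_model_loader::TENSOR_SKIP);
    // shared_head.head and shared_head.norm would be handled the same way
}
```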
Ah, that'd explain the errors I'm getting. I'll have to come back to this in the morning as it's getting late here. If anyone is keen for this ASAP and has improvements, feel free to either raise a PR against my branch or pull my commits into a PR of your own if you have a better approach, and I'll review in the morning.
I'll just put it out there right now: no one should make GGUFs from this PR public yet, there will be changes! :)
Absolutely, I hope people do not do that - it's very much in draft and I'm learning as I go.
Force-pushed from 9d6ea41 to 7f026fb
@sammcj, 7f026fb#diff-4f653096980bd7d10518aa909cb648452cd3aa380ff93cb9fb642dca48536526 fixed the issue, thanks.
The fix seems to work, still testing -> INFO:hf-to-gguf:Model successfully exported to models/glm-45-air-f16.gguf
I am an AMD and Vulkan user with 88GB of VRAM, already downloading GLM-4.5-Air; I'll report back after a few hours if I have any success with it.
Co-authored-by: Diego Devesa <[email protected]>
Co-authored-by: Sigbjørn Skjæret <[email protected]>
Unfortunately for the context thing... 90k context is coherent for me with the Air model, so it sounds like I can't reproduce it here. I'm going to try the big model too, but I'm expecting it to be the same, and unless it breaks visibly enough that I can troubleshoot it, I likely won't be reporting back on that one. I'll also try checking the older GLM4 issues for something I didn't see myself, e.g. similar in spirit to the f32 fix I suggested in my last comment, but you likely won't hear from me on this if I think I can't help here and don't have any information to contribute :)

I suggest that once this PR is merged, someone who is personally affected opens an issue with a reproducible scenario, and especially notes their backend/architecture (e.g. AMD+Vulkan+Windows); you should report your setup anyway, but here it's extra important. If this is at all similar to the last GLM4-series problems, then it seems to be platform-specific, and maybe it is an AMD/Vulkan/Windows thing (I don't think I ever found out exactly which combo was bad; that was just my best guess from asking people what their platform was).

Edit: Found at least one PR that looks possibly relevant that I did not know happened: #13607. It unfortunately does not seem arch-specific (arch-specific as in GLM4 or GLM4_MOE), so possibly the current gibberish is of a different origin.

Edit2: I'll also test the Vulkan backend on macOS since that's also an option. If you don't see updates from me, assume I could not reproduce (which might mean Vulkan is not the issue, or is not the only required condition for the gibberish to happen).
Hi, sorry, just woke up. I don't use Vulkan; I have a pair of 3090s and I use CUDA on a Fedora 41 VM. If the issue is specific to me, which it seems to be, then I can try re-downloading the HF safetensors and re-converting / re-quanting. Either way, I don't think it should block the PR because it's working for others. I say ship it :) Worst case maybe it's something that's resolved by me downloading an unsloth quant in the future or something.
Thank you very much to @sammcj for undertaking this effort, and of course special thanks to all who jumped in to help along the way. I'm about to have a very unproductive week. @zRzRzRzRzRzRzR Can you please help us implement MTP? 🙏
Is there a way to disable thinking on this model through a parameter?
Yes, the template supports …
For people having the same question as I did: make sure you use …
A big thank you to @CISC for all your hard work on this one! 🙇
I'm a bit confused now as @sammcj posted this on Reddit not long ago:
Is there a working jinja template somewhere?
Thanks!
@CISC I narrowed down the gibberish issue a bit. It requires setting --batch-size 4096 --ubatch-size 4096 and possibly having a long multi-turn chat going. When I removed the batch-size / ubatch-size settings, my 40k and 50k token chats began working again. Setting the sizes up to 2048 / 2048 also worked. Something about 4096 / 4096 combined with over 32k context across multiple turns leads to that gibberish edge case. I also tried a needle-in-a-haystack test with a 35k token prompt and a direction to answer a question from the text as a one-shot, and that worked. So I don't have a reproducible smoking gun, but batch-size / ubatch-size is involved, and for now I'm just scaling them back to make it work.
Ah, ok, so that means it's not a model issue then, that's great! Submit an issue though. :)
Just FYI for anyone wanting to create i-quants: as the final layer will not get imatrix data until MTP is supported, it has to be overridden for lower quants to work, e.g. using …
I am getting over 45 t/s on three 3090s with the unsloth Q4 quant of GLM-4.5-Air; here is the optimized command:
I can confirm it's not warming up. Manually setting …
If I patch `llama_context::graph_max_nodes()`:

```cpp
uint32_t llama_context::graph_max_nodes() const {
    //return std::max<uint32_t>(1024u, 8u*model.n_tensors());
    return std::max<uint32_t>(65536u, 8u*model.n_tensors());
}
```

and then run with … You then need to rerun without … I've got to go out, so no more time to investigate until later.
Actually, no, it's still not warming up properly - it's just a lot quicker because it's got the experts mmapped, I think... Will see if I can figure it out later if nobody else has by then.
* model: Add GLM 4.5 (ggml-org#14921)

  Co-authored-by: Sigbjørn Skjæret <[email protected]>

* Merge in PR suggestions

  Co-authored-by: Sigbjørn Skjæret <[email protected]>

* model: Add GLM 4.5 family of models (ggml-org#14921)

  1. Updated tensor_mapping.py with NextN tensor mappings
     - Added proper tensor mappings for all NextN/MTP tensors in /Users/samm/git/llama.cpp/gguf-py/gguf/tensor_mapping.py
     - Added mappings for: eh_proj, embed_tokens, enorm, hnorm, shared_head.head, shared_head.norm
  2. Added num_nextn_predict_layers configuration
     - Added LLM_KV_NUM_NEXTN_PREDICT_LAYERS constant to llama-arch.h and llama-arch.cpp
     - Added num_nextn_predict_layers field to llama_hparams struct
     - Updated GLM4_MOE parameter loading in llama-model.cpp to read this parameter
     - Modified tensor loading logic to conditionally load NextN tensors based on num_nextn_predict_layers
     - Added GGUF writer support in gguf_writer.py with add_num_nextn_predict_layers() method
     - Updated conversion script to extract and write this parameter from HuggingFace config
  3. Added FIM tokens for GLM4_MOE
     - Added GLM-4.5's FIM tokens to llama-vocab.cpp:
       - <|code_prefix|> for FIM_PRE
       - <|code_suffix|> for FIM_SUF
       - <|code_middle|> for FIM_MID
  4. Removed manual NextN tensor handling
     - Removed the special-case handling in convert_hf_to_gguf.py that manually mapped NextN tensors
     - NextN tensors are now handled automatically through the proper tensor mapping system

* glm 4.5 update tensors names
* model: glm 4.5 apply suggestions from code review

  Co-authored-by: Sigbjørn Skjæret <[email protected]>

* Update src/llama-model.cpp

  Co-authored-by: Sigbjørn Skjæret <[email protected]>

* model: glm 4.5 apply suggestions from code review

  Co-authored-by: Sigbjørn Skjæret <[email protected]>

* model: glm 4.5 apply suggestions from code review
* Apply suggestions from code review
* patch broken chat template
* typings fix
* add TENSOR_SKIP flag

  Co-authored-by: Diego Devesa <[email protected]>

* Update src/llama-model-loader.h

  Co-authored-by: Sigbjørn Skjæret <[email protected]>

---------

Co-authored-by: Sigbjørn Skjæret <[email protected]>
Co-authored-by: Diego Devesa <[email protected]>
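To illustrate item 3 of the commit message above, roughly how the FIM token wiring tends to look; the loop over the vocabulary and the special_fim_* fields already exist in llama-vocab.cpp, so this is only a sketch of adding GLM-4.5's markers to it, and the exact placement is an assumption:

```cpp
// Sketch only: map GLM-4.5's fill-in-the-middle markers onto llama.cpp's FIM token ids.
for (const auto & t : token_to_id) {
    if (special_fim_pre_id == LLAMA_TOKEN_NULL && t.first == "<|code_prefix|>") {
        special_fim_pre_id = t.second; // FIM_PRE
    }
    if (special_fim_suf_id == LLAMA_TOKEN_NULL && t.first == "<|code_suffix|>") {
        special_fim_suf_id = t.second; // FIM_SUF
    }
    if (special_fim_mid_id == LLAMA_TOKEN_NULL && t.first == "<|code_middle|>") {
        special_fim_mid_id = t.second; // FIM_MID
    }
}
```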
I've found it:

```cpp
        // MoE layer with shared experts
        //const int64_t n_expert      = hparams.n_expert;
        //const int64_t n_expert_used = hparams.n_expert_used;

        // Process routed experts using existing MoE infrastructure
        ggml_tensor * routed_out = build_moe_ffn(cur,
                model.layers[il].ffn_gate_inp,
                model.layers[il].ffn_up_exps,
                model.layers[il].ffn_gate_exps,
                model.layers[il].ffn_down_exps,
                model.layers[il].ffn_exp_probs_b,
                n_expert, n_expert_used,
                LLM_FFN_SILU, hparams.expert_weights_norm,
                true, hparams.expert_weights_scale,
                (llama_expert_gating_func_type) hparams.expert_gating_func,
                il);
        cb(routed_out, "ffn_moe_out", il);
```

The local `n_expert` and `n_expert_used` (now commented out above) were shadowing the members set up in `llm_graph_context`, which switch to using all experts during warmup:

```cpp
llm_graph_context::llm_graph_context(const llm_graph_params & params) :
    arch             (params.arch),
    hparams          (params.hparams),
    cparams          (params.cparams),
    ubatch           (params.ubatch),
    n_embd           (hparams.n_embd),
    n_layer          (hparams.n_layer),
    n_rot            (hparams.n_rot),
    n_ctx            (cparams.n_ctx),
    n_head           (hparams.n_head()),
    n_head_kv        (hparams.n_head_kv()),
    n_embd_head_k    (hparams.n_embd_head_k),
    n_embd_k_gqa     (hparams.n_embd_k_gqa()),
    n_embd_head_v    (hparams.n_embd_head_v),
    n_embd_v_gqa     (hparams.n_embd_v_gqa()),
    n_expert         (hparams.n_expert),
    n_expert_used    (cparams.warmup ? hparams.n_expert : hparams.n_expert_used),
    freq_base        (cparams.rope_freq_base),
    freq_scale       (cparams.rope_freq_scale),
    ext_factor       (cparams.yarn_ext_factor),
    attn_factor      (cparams.yarn_attn_factor),
    beta_fast        (cparams.yarn_beta_fast),
    beta_slow        (cparams.yarn_beta_slow),
    norm_eps         (hparams.f_norm_eps),
    norm_rms_eps     (hparams.f_norm_rms_eps),
    n_tokens         (ubatch.n_tokens),
    n_outputs        (params.n_outputs),
    n_ctx_orig       (cparams.n_ctx_orig_yarn),
    pooling_type     (cparams.pooling_type),
    rope_type        (hparams.rope_type),
    sched            (params.sched),
    backend_cpu      (params.backend_cpu),
    cvec             (params.cvec),
    loras            (params.loras),
    mctx             (params.mctx),
    cross            (params.cross),
    cb_func          (params.cb),
    res              (params.res),
    ctx0             (res->get_ctx()),
    gf               (res->get_gf()) {
    res->set_params(params);
}
```
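In other words, the shadowing defeated the trick where the graph context swaps in the full expert count during warmup. A tiny self-contained illustration of the pitfall (not llama.cpp code; the numbers are made up):

```cpp
#include <cstdint>
#include <cstdio>

struct hparams_t { uint32_t n_expert = 128, n_expert_used = 8; };
struct cparams_t { bool warmup = true; };

int main() {
    hparams_t hparams;
    cparams_t cparams;

    // What llm_graph_context does: during warmup, pretend every expert is used so
    // that all expert tensors get touched up front.
    const uint32_t ctx_n_expert_used = cparams.warmup ? hparams.n_expert : hparams.n_expert_used;

    // What the GLM4_MOE graph builder did: re-read hparams locally, shadowing the
    // context value and silently dropping the warmup override.
    const uint32_t local_n_expert_used = hparams.n_expert_used;

    std::printf("context says warm up %u experts, shadowed local uses %u\n",
                (unsigned) ctx_n_expert_used, (unsigned) local_n_expert_used); // 128 vs 8
    return 0;
}
```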
@jukofyork confirmed. This fixes warmup for me. It also restores GLM-4.5 to the performance levels I've come to expect from llama.cpp.

Startup command:

```sh
./build/bin/llama-server \
    --model /data/GLM-4.5-GGUF/q4_k_m/GLM-4.5-Q4_K_M.gguf \
    --alias GLM-4.5-GGUF:q4_k_m \
    --no-webui \
    --numa numactl \
    --threads 32 \
    --ctx-size 131072 \
    --n-gpu-layers 94 \
    -ot "blk\.(3|4|5|6|7|8|9|10|11|12|13|14|15|16|17)\.ffn_.*=CUDA0" \
    -ot exps=CPU \
    -ub 4096 -b 4096 \
    --seed 3407 \
    --temp 0.6 \
    --top-p 1.0 \
    --log-colors \
    --flash-attn \
    --host 0.0.0.0 \
    --jinja \
    --port 11434
```

I had GLM-4.5 write a poem for you:
No problem, and I can confirm it's running as expected for me now too (~6.5 tokens/s generation). I've managed to transplant the vocab into …, so assuming it trains OK we should have a draft model in a day or so. It actually looks to have transplanted very well, as even the untrained draft is getting a high acceptance rate for refactoring tasks:
Add support for the newly released GLM 4.5 family of models.
Core Architecture
Model Loading (src/llama-model.cpp)
Conversion Support (convert_hf_to_gguf.py)
Technical Details
MoE Architecture
Model Variants
The NextN/MTP prediction tensors are preserved during conversion but marked as unused since llama.cpp does not yet support multi-token prediction.
Testing
CI scripts run locally (CPU only) have two failing tests that I believe are unrelated to this change (please tell me if this isn't the case!):
gguf-dump
Disclaimer:
Hopefully resolves #14921