memory : handle kv_unified for hybrid models #15050

Merged
1 commit merged into master on Aug 3, 2025

Conversation

compilade (Collaborator)

Follow-up from #14725, which didn't really fix the underlying problem of not considering cparams.kv_unified.

Since #14959, inference with hybrid models has been broken (except when using -kvu), due to hybrid memory not passing cparams.kv_unified properly.

Reproduction of the problem: attempt to run llama-perplexity with any hybrid model.

$ ./bin/llama-perplexity -f /workspace/wikitext-2-raw/wiki.test.raw -m /workspace/gguf/LFM2-350M-BF16.gguf --chunks 10

On master, this fails with an assertion:

/workspace/llama.cpp/ggml/src/ggml.c:3740: GGML_ASSERT(mask->ne[1] >= a->ne[1]) failed

With this PR, this is no longer a problem. I've tested this with https://huggingface.co/LiquidAI/LFM2-350M.
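
To illustrate the idea behind the fix, here is a minimal, self-contained C++ sketch. The type names (cparams_t, kv_cache_unified_t, memory_hybrid_t) are simplified stand-ins for llama.cpp's llama_cparams, llama_kv_cache_unified and llama_memory_hybrid rather than the actual code; the point is only that the hybrid memory constructs its attention-side KV cache with the context's kv_unified flag instead of ignoring it.

#include <cstdio>

// Stand-in for llama_cparams; only the relevant field is shown.
struct cparams_t {
    bool kv_unified = false; // enabled by the -kvu command-line flag
};

// Stand-in for the unified (attention) KV cache used by the hybrid memory.
struct kv_cache_unified_t {
    bool unified;
    explicit kv_cache_unified_t(bool unified) : unified(unified) {}
};

// Stand-in for the hybrid memory: its attention-side cache must be built
// with the caller's kv_unified setting rather than a hard-coded default.
struct memory_hybrid_t {
    kv_cache_unified_t attn;
    explicit memory_hybrid_t(const cparams_t & cparams)
        : attn(/* unified = */ cparams.kv_unified) {} // the flag is now forwarded
};

int main() {
    cparams_t cparams;            // default: split KV cache (no -kvu)
    memory_hybrid_t mem(cparams); // the hybrid memory now agrees with the context params
    std::printf("attn unified: %d\n", mem.attn.unified);
    return 0;
}

In llama.cpp itself, the equivalent change is to forward cparams.kv_unified when the hybrid memory sets up its unified KV cache, so the default (non -kvu) path no longer trips the assertion above.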

@compilade compilade requested a review from ggerganov August 3, 2025 05:05
@compilade compilade added the bugfix fixes an issue or bug label Aug 3, 2025
@CISC CISC merged commit 11a3811 into master Aug 3, 2025
45 of 47 checks passed
Nexesenex pushed a commit to Nexesenex/croco.cpp that referenced this pull request Aug 5, 2025
Labels: bugfix (fixes an issue or bug)

3 participants