memory : handle kv_unified for hybrid models #15050
Merged
Follow-up from #14725, which didn't really fix the underlying problem of not considering `cparams.kv_unified`.

Since #14959, inference with hybrid models has been broken (except when using `-kvu`), due to hybrid memory not passing `cparams.kv_unified` properly.

Reproduction of the problem: attempt to run `llama-perplexity` with any hybrid model:

```
$ ./bin/llama-perplexity -f /workspace/wikitext-2-raw/wiki.test.raw -m /workspace/gguf/LFM2-350M-BF16.gguf --chunks 10
```
On `master`, this fails with an assertion. With this PR, this is no longer a problem. I've tested this with https://huggingface.co/LiquidAI/LFM2-350M.
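For readers unfamiliar with the code path, the fix boils down to forwarding `cparams.kv_unified` from the context parameters into the KV cache that the hybrid memory constructs for its attention layers, instead of leaving it at a fixed value. The sketch below only illustrates that idea; the type, member, and parameter names are hypothetical stand-ins, not llama.cpp's actual declarations.

```cpp
#include <memory>

// Hypothetical stand-ins for the real llama.cpp types (sketch only).
struct cparams_sketch {
    bool kv_unified = false; // set to true by the -kvu CLI flag
};

struct kv_cache_sketch {
    explicit kv_cache_sketch(bool unified) : unified(unified) {}
    bool unified; // one shared KV stream vs. one stream per sequence
};

// A hybrid model's memory pairs a recurrent-state cache (for the SSM/conv
// layers, omitted here) with a KV cache for the attention layers.
struct memory_hybrid_sketch {
    explicit memory_hybrid_sketch(const cparams_sketch & cparams)
        // The essence of the fix: forward the user's configured flag rather
        // than a hard-coded value, so the cache layout matches what the rest
        // of the context (and its assertions) expects.
        : kv_attn(std::make_unique<kv_cache_sketch>(cparams.kv_unified)) {}

    std::unique_ptr<kv_cache_sketch> kv_attn;
};
```

This is also why `-kvu` worked around the bug: with that flag, the user-configured value happened to agree with what the hybrid memory assumed.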