Conversation

Chenyaaang (Collaborator) commented Nov 20, 2025

Description

Fix a numerical issue with hybrid KV cache allocation. When hybrid KV cache is enabled, each allocation round assigns different block_ids to each KV cache group, which means layers in different groups write to different block_ids. We therefore need to create individual attention metadata for each layer, instead of sharing a single attention metadata object across all layers.
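
As a rough illustration of the fix, here is a minimal sketch of building per-layer attention metadata; `KVCacheGroup`, `AttentionMetadata`, and `build_per_layer_metadata` are hypothetical stand-ins, not the actual tpu-inference API:

```python
# Hypothetical sketch of the fix (all names illustrative, not the actual
# tpu-inference API): build one metadata object per KV cache group and map
# every layer to its own group's metadata, so each layer's attention kernel
# reads/writes the block ids of the group that layer belongs to.
from dataclasses import dataclass

@dataclass
class KVCacheGroup:
    layer_names: list[str]         # layers that share this group's blocks
    block_tables: list[list[int]]  # per-request block ids for this group

@dataclass
class AttentionMetadata:
    block_tables: list[list[int]]
    input_positions: list[int]     # token positions; identical for all layers

def build_per_layer_metadata(
    kv_cache_groups: list[KVCacheGroup],
    input_positions: list[int],
) -> dict[str, AttentionMetadata]:
    """Map each layer name to the metadata of its KV cache group."""
    attn_metadata: dict[str, AttentionMetadata] = {}
    for group in kv_cache_groups:
        # Groups get different block_ids each allocation round, which is why a
        # single metadata object shared by all layers produced wrong results.
        group_meta = AttentionMetadata(group.block_tables, input_positions)
        for layer_name in group.layer_names:
            attn_metadata[layer_name] = group_meta
    return attn_metadata
```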

Tests

  • Unit tests in tpu_worker and tpu_runner pass.
  • Results with and without hybrid KV cache are identical when running offline_inference.py with a Gemma model: python examples/offline_inference.py --model google/gemma-3-27b-it --tensor-parallel-size 8
  • CI: https://buildkite.com/tpu-commons/tpu-inference-ci/builds/5787 (all tasks are green except LoRA, which I believe fails due to an upstream change unrelated to this PR).

Checklist

Before submitting this PR, please make sure:

  • I have performed a self-review of my code.
  • I have added the necessary comments to my code, particularly in hard-to-understand areas.
  • I have made or will make corresponding changes to any relevant documentation.


py4 (Collaborator) left a comment

This PR doesn't have any tests. Please add the following tests:

  1. e2e correctness test: output with and without hybrid allocation is the same (sketched below).
  2. e2e performance test: performance with the hybrid allocator is higher than without it.
  3. Unit tests for the changed Python files and the runner. We need to keep coverage above 70%, and our PRs need to come with enough tests.
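
For the correctness test, a hedged sketch of what it could look like, assuming vLLM's offline LLM API; the hybrid-allocator toggle (the VLLM_DISABLE_HYBRID_KV_CACHE env var below) is invented for illustration, and the real switch in tpu-inference may differ:

```python
# Hedged sketch of the requested e2e correctness test, assuming vLLM's offline
# LLM API. VLLM_DISABLE_HYBRID_KV_CACHE is an invented toggle for the hybrid
# allocator; the real switch in tpu-inference may differ.
import os
from vllm import LLM, SamplingParams

PROMPTS = ["The capital of France is", "To be or not to be,"]

def generate(enable_hybrid: bool) -> list[str]:
    os.environ["VLLM_DISABLE_HYBRID_KV_CACHE"] = "0" if enable_hybrid else "1"
    llm = LLM(model="google/gemma-3-27b-it", tensor_parallel_size=8)
    # Greedy decoding so both runs are deterministic and comparable.
    params = SamplingParams(temperature=0.0, max_tokens=32)
    return [out.outputs[0].text for out in llm.generate(PROMPTS, params)]

def test_hybrid_kv_cache_matches_baseline():
    assert generate(enable_hybrid=True) == generate(enable_hybrid=False)
```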

# TODO(pooyam): I guess we can remove returning sampling_metadata in `_prepare_inputs` after https://github.com/njhill/vllm/commit/b7433ca1a47732394b1bdea4099d98389515954b
(
input_ids,
input_positions,
Collaborator

Why are we returning input_positions here? Shouldn't it be in attn_metadata?

Chenyaaang (Collaborator, Author) commented Nov 20, 2025

"Shouldn't it be in attn_metadata"- Yes, it is in attn_metadata.

But if we use hybrid kv cache, attn_metadata is a dict instead of a single metadata obj, which means we need to get it by attn_metadata[any_layer_name].input_positions. Considering either pass in layer name or input_positions directly, I chose the later way.

Collaborator

I mean in self.model_fn, what's the difference between the input_positions inside attn_metadata and the input_positions you are passing directly? It doesn't seem clean to me that there are now two different fields for input_positions.

Collaborator Author

There's no difference between those two input_positions; it's just that attn_metadata becomes a dict[layer_name, attn_metadata_for_that_layer] instead of a single attn_metadata shared by every layer. So inside vllm_model_wrapper's step_fun, when we need to pass input_positions to the model, we used to get it from attn_metadata.input_positions; now we would have to get it as attn_metadata[layer0_name].input_positions, which requires knowing the layer name, so I chose to pass input_positions directly.
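
For illustration, a minimal self-contained sketch of the two access patterns under discussion; AttentionMetadata and the model_fn call shape are stand-ins based on this thread, not the actual classes:

```python
# Illustrative only: how input_positions is reached once attn_metadata is a
# per-layer dict. AttentionMetadata here is a stand-in, not the real class.
from dataclasses import dataclass

@dataclass
class AttentionMetadata:
    input_positions: list[int]

attn_metadata = {
    "model.layers.0.self_attn": AttentionMetadata(input_positions=[0, 1, 2]),
    "model.layers.1.self_attn": AttentionMetadata(input_positions=[0, 1, 2]),
}

# Reading positions out of the dict requires knowing (or picking) a layer name:
layer0_name = next(iter(attn_metadata))
positions = attn_metadata[layer0_name].input_positions

# This PR instead passes input_positions to the model as its own argument, so
# the wrapper's step function never needs a layer name (hypothetical call):
# outputs = model_fn(input_ids, positions, attn_metadata)
```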

Collaborator

I think it makes the code messy due to having two redundant fields. It won't be clear whether one should read from input_positions or from attn_metadata.input_positions.
Also, isn't input_positions the same for all layers? If so, at each layer we can do something like next(iter(attn_metadata.values())).input_positions. If not, I think it's better to get it using the layer name.
Overall, having two fields for the same thing doesn't look good, I think. wdyt?

py4 (Collaborator) left a comment

Does this also work for the JAX path? If not, can we also make the JAX path work?


Chenyaaang (Collaborator, Author) commented Nov 21, 2025

> Does this also work for the JAX path? If not, can we also make the JAX path work?

It should be backend agnostic, but to enable it on the JAX path we need to modify each individual JAX model. Previously, none of the JAX models needed hybrid KV cache, so it isn't enabled there. The numerical issue was also reported with a vLLM model, not with flax nnx.

Chenyaaang closed this Nov 21, 2025
Chenyaaang reopened this Nov 21, 2025
kyuyeunk (Collaborator) commented

With this PR, I've verified on gpt-oss that the numeric issue has been solved, and that a performance issue which stemmed from the numeric issue has also been resolved.
