[Core] Support Local Chunked Attention for Hybrid KV Cache #19351
base: main
Conversation
Hello @luccafong, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
Summary of Changes
Hello team, gemini-code-assist here to provide a summary of this pull request. Based on the title [wip] support local chunked attention for hybrid kv cache
and the code changes, this PR introduces support for a new KV cache management strategy: chunked local attention. This is intended to work within vLLM's hybrid KV cache framework. Since the description is a placeholder, the intent is primarily derived from the title and the implementation details in the code.
Highlights
- New KV Cache Specification: A new dataclass, ChunkedLocalAttentionSpec, is introduced in vllm/v1/kv_cache_interface.py to define the configuration for chunked local attention KV caches, including the attention_chunk_size (a rough sketch follows after this list).
- New KV Cache Manager: A new manager class, ChunkedLocalAttentionManager, is added in vllm/v1/core/single_type_kv_cache_manager.py. This manager implements the logic for finding cache hits and removing skipped blocks specifically for chunked local attention, handling the non-contiguous nature of the relevant KV cache blocks.
- Hybrid KV Cache Integration: The KV cache utility functions in vllm/v1/core/kv_cache_utils.py are updated to recognize ChunkedLocalAttentionSpec when determining hybrid configurations and unifying specs, ensuring compatibility with existing full attention and sliding window mechanisms.
- Model Runner Updates: The GpuModelRunner in vllm/v1/worker/gpu_model_runner.py is modified to select the ChunkedLocalAttentionSpec for a layer if attention_chunk_size is configured and the layer uses iRoPE (interleaved RoPE).
- New Tests: Dedicated tests for the ChunkedLocalAttentionManager are added in tests/v1/core/test_specialized_manager.py to verify the find_longest_cache_hit and remove_skipped_blocks logic for the new attention type.
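For orientation, here is a rough sketch of what such a spec could look like; apart from attention_chunk_size, the field names are assumptions mirroring vLLM's other attention specs rather than the PR's exact code.

```python
from dataclasses import dataclass

import torch


@dataclass
class ChunkedLocalAttentionSpec:
    """Assumed shape of the new KV cache spec for chunked local attention."""
    block_size: int            # tokens per KV cache block
    num_kv_heads: int
    head_size: int
    dtype: torch.dtype
    attention_chunk_size: int  # size of each local-attention chunk, in tokens
```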
Changelog
- tests/v1/core/test_specialized_manager.py
  - Imported random.
  - Imported ChunkedLocalAttentionManager and ChunkedLocalAttentionSpec.
  - Added get_chunked_local_attention_manager helper function.
  - Added test_chunked_local_attention_possible_cached_prefix to test cache hit finding for chunked local attention.
  - Added test_chunked_local_attention_remove_skipped_blocks to test block removal logic.
  - Renamed test_get_num_blocks_to_allocate to test_sliding_window_get_num_blocks_to_allocate.
  - Added test_chunked_local_attention_get_num_blocks_to_allocate.
- vllm/attention/layer.py
  - Added use_irope attribute to the Attention class, initialized from extra_impl_args.
- vllm/executor/executor_base.py
  - Added a debug print statement showing block_size and num_gpu_blocks during cache initialization.
- vllm/v1/core/kv_cache_utils.py
  - Imported ChunkedLocalAttentionSpec.
  - Added a debug print statement in _get_kv_cache_config_uniform_page_size.
  - Updated is_hybrid and unify_hybrid_kv_cache_specs functions to include ChunkedLocalAttentionSpec in their logic for handling hybrid configurations.
- vllm/v1/core/single_type_kv_cache_manager.py
  - Imported ChunkedLocalAttentionSpec.
  - Added the ChunkedLocalAttentionManager class, inheriting from SingleTypeKVCacheManager.
  - Implemented find_longest_cache_hit and remove_skipped_blocks methods within ChunkedLocalAttentionManager.
  - Added debug print statements within ChunkedLocalAttentionManager methods.
  - Added ChunkedLocalAttentionManager to the spec_manager_map.
- vllm/v1/kv_cache_interface.py
  - Defined the ChunkedLocalAttentionSpec dataclass with attention_chunk_size.
  - Implemented type_id, max_memory_usage_bytes, and merge methods for ChunkedLocalAttentionSpec.
- vllm/v1/worker/gpu_model_runner.py
  - Imported ChunkedLocalAttentionSpec.
  - Modified get_kv_cache_spec to create a ChunkedLocalAttentionSpec for layers that have attention_chunk_size configured and use iRoPE.
Code Review
This pull request lays the groundwork for supporting local chunked attention in the hybrid KV cache. The introduction of ChunkedLocalAttentionManager and ChunkedLocalAttentionSpec, along with initial tests, is a good step forward.
However, as this is marked [wip], there are a few areas to address:
- Pull Request Description: The PR description is currently a template. Please fill it out with the purpose of these changes, a test plan (even if preliminary for WIP), and any expected outcomes or known limitations. This context is crucial for reviewers.
- Debugging Code: Several print statements used for debugging are present in the codebase. These should be removed before merging.
- Clarity and TODOs: There are a few comments and TODOs that need attention or clarification.
Overall, the direction seems good, and the core logic for the new manager and spec is taking shape. Addressing the points below will help improve the clarity and readiness of this PR.
Summary of Findings
- PR Description: The pull request description is currently a template and needs to be filled out with details about the purpose, test plan, and results of these changes. This is especially important for a work-in-progress PR.
- Debugging Code: Several print statements, likely used for debugging, are present in the codebase (e.g., in executor_base.py, kv_cache_utils.py, single_type_kv_cache_manager.py, kv_cache_interface.py). These should be removed before merging.
- Clarity and Completeness: Some comments, like the one in ChunkedLocalAttentionSpec.max_memory_usage_bytes, could be clarified. Additionally, the get_num_common_prefix_blocks method in the new manager (and the existing SlidingWindowManager) has a known simplification related to cascade attention that should be tracked.
- TODOs: A TODO comment exists in kv_cache_utils.py regarding making the hybrid spec unification more generic. This should ideally be tracked with a follow-up issue if it's a larger task.
Merge Readiness
This pull request is a work-in-progress and introduces significant new functionality for chunked local attention. Before it can be considered for merging, the PR description needs to be completed, all debugging print statements must be removed, and the identified points for clarification should be addressed. Given the WIP nature and these outstanding items, I recommend further changes before this PR is merged. I am unable to approve the pull request, and it should be reviewed by others before merging.
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a small subset runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. 🚀
Please don't forget to fill the task description :-)
Force-pushed from e8c8c6e to 50281ac
This pull request has merge conflicts that must be resolved before it can be merged.
Force-pushed from 50281ac to b5157ad
Force-pushed from bd29016 to d8b297c
Force-pushed from d8b297c to 3a04e9a
Signed-off-by: Lucia Fang <[email protected]>
Force-pushed from 3a04e9a to 7913145
Signed-off-by: Lucia Fang <[email protected]>
vllm/v1/core/kv_cache_utils.py (Outdated)
@@ -845,6 +846,7 @@ def _get_kv_cache_config_uniform_page_size(
    # full.0, sw.0, sw.2: share a Tensor with size=available_memory//2
    # full.1, sw.1: share another Tensor with size=available_memory//2
    page_size = get_uniform_page_size(kv_cache_spec)
    # print(f"{page_size=}, {group_size=}")
# print(f"{page_size=}, {group_size=}") |
@@ -715,6 +715,7 @@ def use_cascade_attention(
    num_kv_heads: int,
    use_alibi: bool,
    use_sliding_window: bool,
    use_local_attention: bool,
Just curious: What is this for?
use_local_attention does not support cascade attention, as noted in vllm/vllm/v1/attention/backends/flash_attn.py, lines 693 to 694 in c3fec47:
    assert not use_local_attn, (
        "Cascade attention does not support local attention.")
Do you need this line in this function?
if use_local_attention: return False
I left it out when merging, will add it back.
max_num_blocks = max_length // kv_cache_spec.block_size
if max_length > 0:
    local_attention_start_idx = \
        (max_length - 1) // kv_cache_spec.attention_chunk_size \
Why -1 here?
We need the index of the last token rather than the length here to compute the attention window. E.g. given a max length of 128 and chunk size 64, the context is chunked as [0, 63] and [64, 127]; token 127 should attend within the window [64, 127], whose start index is 64 = (127 // 64) * 64, not 2 * 64 = 128.
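A minimal standalone sketch of this arithmetic (the helper name is hypothetical, not the PR's code):

```python
def chunk_start_for_token(token_idx: int, chunk_size: int) -> int:
    """Start of the local-attention window that contains token_idx."""
    return token_idx // chunk_size * chunk_size


chunk_size = 64
max_length = 128
# Using the last *index* (max_length - 1) gives the intended window start.
assert chunk_start_for_token(max_length - 1, chunk_size) == 64
# Using the *length* would overshoot to the start of the next, empty chunk.
assert max_length // chunk_size * chunk_size == 128
```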
Does the 1024-th token need the KV cache of tokens [0-1023] if attn_chunk_size is 1024? I think most of my questions come from this point.
    [block_pool.null_block] * local_attention_start_block_idx
    for _ in range(len(kv_cache_group_ids)))

for i in range(local_attention_start_block_idx, max_num_blocks):
Can you explain the rule for cache hits? For example, with block_size 1 and chunk_size 2, what is the expected result in the following cases?
- [miss miss] [miss miss] [miss miss]. Should it be 0 or 6?
- [miss miss] [hit miss] [miss miss]. Should it be 3 or 6?
And please add some comments to describe the expected behavior.
Yeah.
For the current token, we check for cache hits starting from the first block that contains the attention window, until we miss.
The number of computed blocks is marked as (previously unattended blocks) + (number of hit blocks), so even with zero hits it returns the previously unattended blocks.
So for your questions:
- It returns 4, since the last window missed.
- Still 4, since the last window missed.
For a case like [miss miss] [miss miss] [hit miss], it returns 5.
I will add more comments to explain.
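A standalone sketch of the counting rule described above, using block_size 1 and chunk_size 2 from the example; the helper is hypothetical, not the PR's implementation:

```python
def longest_local_cache_hit(hits: list[bool], block_size: int,
                            chunk_size: int) -> int:
    """Computed blocks = blocks before the current local-attention window
    (skippable, treated as null) + consecutive hit blocks inside the window."""
    num_tokens = len(hits) * block_size
    if num_tokens == 0:
        return 0
    window_start = (num_tokens - 1) // chunk_size * chunk_size
    first_useful_block = window_start // block_size
    computed = first_useful_block
    for hit in hits[first_useful_block:]:
        if not hit:
            break
        computed += 1
    return computed


# [miss miss] [miss miss] [miss miss] -> 4 (last window missed)
assert longest_local_cache_hit([False] * 6, block_size=1, chunk_size=2) == 4
# [miss miss] [hit miss] [miss miss]  -> still 4 (last window missed)
assert longest_local_cache_hit([False, False, True, False, False, False], 1, 2) == 4
# [miss miss] [miss miss] [hit miss]  -> 5
assert longest_local_cache_hit([False, False, False, False, True, False], 1, 2) == 5
```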
super().__init__(kv_cache_spec, block_pool, **kwargs)
self.attention_chunk_size = kv_cache_spec.attention_chunk_size
self._null_block = block_pool.null_block
assert self.attention_chunk_size % block_size == 0
?
The logic should already cover the case where it's not divisible?
Yes, you are right.
local_attention_start_idx = (
    num_computed_tokens -
    1) // self.attention_chunk_size * self.attention_chunk_size
# 1024 -> 0, 1025 -> 1024
Why 1024 -> 0? Does the attention of the 1024-th token (the first token of the next chunk) need tokens 0-1023?
Here num_computed_tokens = 1024, so the last computed token has index 1023, and its local attention window starts at 0.
Can you update the comment? It does not need [0-1023] for the 1024-th token.
Signed-off-by: Lu Fang <[email protected]>
Signed-off-by: Lu Fang <[email protected]>
Force-pushed from 1f53790 to 320ab71
use_local_attention = isinstance(kv_cache_spec,
                                 ChunkedLocalAttentionSpec)
Suggested change:
use_local_attention = (isinstance(kv_cache_spec, ChunkedLocalAttentionSpec)
                       or (isinstance(kv_cache_spec, FullAttentionSpec)
                           and kv_cache_spec.attention_chunk_size is not None))
# 4 tokens are computed. no token is out of the local attention window.
manager.remove_skipped_blocks("test", 4)
assert_block_id(block_table, original_block_ids)
In this test, since token 4 doesn't need the KV cache of tokens [0-3], why do you need to keep them?
Token 4 (if 1-indexed) needs the KV cache of [0-4].
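A minimal illustration of the removal rule being discussed; the helper and the chunk_size/block_size values are assumptions for the example, not the PR's code:

```python
def first_useful_block(num_computed_tokens: int, block_size: int,
                       chunk_size: int) -> int:
    """Blocks before this index only hold tokens outside the current
    local-attention window and could be replaced with null blocks."""
    if num_computed_tokens == 0:
        return 0
    window_start = (num_computed_tokens - 1) // chunk_size * chunk_size
    return window_start // block_size


# With chunk_size 4 and block_size 2: after 4 computed tokens the window
# still starts at token 0, so no block is skippable yet.
assert first_useful_block(4, block_size=2, chunk_size=4) == 0
# After 5 computed tokens the window starts at token 4 (block 2),
# so blocks 0 and 1 become skippable.
assert first_useful_block(5, block_size=2, chunk_size=4) == 2
```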
Signed-off-by: Lu Fang <[email protected]>
Signed-off-by: Lu Fang <[email protected]>
Force-pushed from 19fbdee to f7b6961
This pull request has merge conflicts that must be resolved before it can be merged.
Signed-off-by: Lu Fang <[email protected]>
Thanks for the great job. I think we have aligned on the expected behavior. Can you write some examples in find_longest_cache_hit and remove_skipped_blocks to help people understand them?
from collections.abc import Sequence
from typing import Any, Dict, List, Optional, Union

import regex as re
why do you need this line?
@@ -385,12 +386,102 @@ def get_num_common_prefix_blocks(self, request_id: str,
    """
    NOTE(Chen): The prefix blocks are null blocks for sliding window layers.
    So it's not correct to count ref_cnt like FullAttentionManager. Return
-   0 here for correctness. Need to support cascade attention + sliding
+   0 here for correctness. Need to support cascade attention + sliding
Can you revert?
break
if use_eagle and computed_blocks[0]:
    for computed in computed_blocks:
        computed.pop()
In eagle, we can't simply pop the last block.
For example, with chunk size 2 and block size 1:
[miss miss] [miss miss] -> cache_hit_length 4
If we remove the 3rd block (0-indexed), cache_hit_length becomes 3, but [miss miss] [miss] is not a valid cache-hit prefix. I think we should return cache_hit_length 2 in this case.
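A hypothetical sketch of the chunk-aligned truncation suggested here (names and behavior are assumptions, not the PR's code):

```python
def eagle_truncate_cache_hit(num_hit_blocks: int, block_size: int,
                             chunk_size: int) -> int:
    """Instead of popping a single block, fall back to the previous chunk
    boundary so the remaining prefix is still a valid local-attention hit
    (assumes chunk_size % block_size == 0)."""
    if num_hit_blocks == 0:
        return 0
    blocks_per_chunk = chunk_size // block_size
    # Drop at least one block, then round down to a chunk boundary.
    return (num_hit_blocks - 1) // blocks_per_chunk * blocks_per_chunk


# chunk size 2, block size 1: a hit length of 4 becomes 2 (not 3),
# matching the expected behavior in the comment above.
assert eagle_truncate_cache_hit(4, block_size=1, chunk_size=2) == 2
```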
# we marked blocks out of window as computed
# with null blocks, and blocks inside window
# based on cache lookup result
Can you change the length of each line to ~80 characters? And should [430-432] be put before [425-429]?
# [ block 0, ..., block x (x_start <= first_attention_token),
#   block x+1, ..., block N (N_end <= max_len), ...]
Why do you need this comment? What is x for?
local_attention_start_idx = (
    num_computed_tokens -
    1) // self.attention_chunk_size * self.attention_chunk_size
# 1024 -> 0, 1025 -> 1024
Can you update the comment?
) // self.attention_chunk_size * self.attention_chunk_size
# 1024 -> 0, 1025 -> 1024
first_useful_block_idx = local_attention_start_idx // self.block_size
# block size = 128, 0 -> block 0, 1024 -> block 8, 372 -> block 2
Suggested change:
# if block size = 128, 0 -> block 0, 1024 -> block 8, 372 -> block 2
# block size = 128, 0 -> block 0, 1024 -> block 8, 372 -> block 2
blocks = self.req_to_blocks[request_id]
removed_blocks: list[KVCacheBlock] = []
blockids = []
Why do you need this blockids?
return cdiv(num_tokens, self.block_size) * self.page_size_bytes

@classmethod
def merge(cls, specs: list[Self]) -> Self:
Remove this function after updating type_id.
Purpose
This PR follows #17996 to add Hybrid KV Cache support for local chunked attention, in order to support models like Llama 4 Maverick and Scout.
Test Plan
unit test:
eval:
Test Result
unit tests:
eval:
mmlu_pro: This PR vs. Baseline (trunk: 7e3e74c)
ruler niah_multikey_2: This PR vs. Baseline