Fix: Correct num_cached_tokens counting logic in BlockManager.allocate #110
File Modified:
`engine/block_manager.py` — adjusted the timing of the `seq.num_cached_tokens` increment in the `allocate` method.

Reason for Modification:
In the original logic, `num_cached_tokens` was incremented whenever `cache_miss` was `False` (hash hit and token match), but this failed to account for stale entries in the `hash_to_block_id` map. Specifically, a block ID can still exist in the hash map even though the block has already been released (it is no longer in `used_block_ids`) because its reference count dropped to 0. In that case, `_allocate_block` is triggered to reinitialize the block; this is essentially a new allocation rather than cache reuse, yet it was incorrectly counted as cached tokens.

Fix Logic:
Moved `seq.num_cached_tokens += self.block_size` into the `if block_id in self.used_block_ids:` branch. This ensures `num_cached_tokens` is only incremented when the block is actually in use (not released) and is genuinely reused, matching its intended semantics of counting truly reused cached tokens.

Impact:
Fixes the statistical error in cache utilization, so `num_cached_tokens` accurately reflects real cache reuse (useful for subsequent performance analysis). Does not affect the core logic of KV-cache allocation/reuse; no functional risk.
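To make the stale-entry scenario concrete, here is a minimal, self-contained sketch of the corrected `allocate` flow. Only the identifiers named in this PR (`num_cached_tokens`, `block_size`, `cache_miss`, `hash_to_block_id`, `used_block_ids`, `_allocate_block`) come from the description; the `Block` and `Seq` classes, the `full_blocks`/`deallocate` helpers, and the use of Python's built-in `hash` as a placeholder block hash are simplifying assumptions, not the engine's actual code.

```python
from collections import deque

class Block:
    """A KV-cache block tracked by reference count and content hash."""
    def __init__(self, block_id):
        self.block_id = block_id
        self.ref_count = 0
        self.hash = -1
        self.token_ids = []

class Seq:
    """Hypothetical stand-in for the engine's sequence object."""
    def __init__(self, token_ids, block_size):
        self.token_ids = token_ids
        self.block_size = block_size
        self.num_cached_tokens = 0
        self.block_table = []

    def full_blocks(self):
        # Yield only completely filled blocks; partial blocks are not hashed.
        for i in range(0, len(self.token_ids) - self.block_size + 1, self.block_size):
            yield self.token_ids[i:i + self.block_size]

class BlockManager:
    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.blocks = [Block(i) for i in range(num_blocks)]
        self.hash_to_block_id = {}
        self.free_block_ids = deque(range(num_blocks))
        self.used_block_ids = set()

    def _allocate_block(self, block_id):
        block = self.blocks[block_id]
        block.ref_count = 1
        self.free_block_ids.remove(block_id)
        self.used_block_ids.add(block_id)
        return block

    def allocate(self, seq):
        h = -1
        for token_ids in seq.full_blocks():
            h = hash((h, tuple(token_ids)))  # placeholder for the real block hash
            block_id = self.hash_to_block_id.get(h, -1)
            cache_miss = block_id == -1 or self.blocks[block_id].token_ids != token_ids
            if cache_miss:
                block_id = self.free_block_ids[0]
                block = self._allocate_block(block_id)
            elif block_id in self.used_block_ids:
                # Block is still live: genuine cache reuse. The fix moves the
                # increment here, so only real reuse is counted.
                seq.num_cached_tokens += self.block_size
                block = self.blocks[block_id]
                block.ref_count += 1
            else:
                # Stale hash entry: the block was already released, so this is
                # a fresh allocation and must NOT count as cached tokens.
                block = self._allocate_block(block_id)
            block.hash, block.token_ids = h, list(token_ids)
            self.hash_to_block_id[h] = block_id
            seq.block_table.append(block_id)

    def deallocate(self, seq):
        for block_id in reversed(seq.block_table):
            block = self.blocks[block_id]
            block.ref_count -= 1
            if block.ref_count == 0:
                # The hash_to_block_id entry is intentionally left behind,
                # which is exactly how a stale entry arises.
                self.used_block_ids.discard(block_id)
                self.free_block_ids.append(block_id)
        seq.block_table.clear()
```

Under this sketch, allocating a second live sequence with identical tokens counts as cache reuse, while re-allocating the same tokens after all referencing sequences have been deallocated hits a stale hash entry and correctly counts zero cached tokens.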