@juncgu-google juncgu-google commented Nov 18, 2025

Description

  1. Remove the original per-process singleton LocalCPUBackend (previously shared by the offload scheduler and the offload worker).

    • The offload scheduler now owns an LRUOffloadManager that is responsible for all offloading decisions.
      • The CPU cache size is set by TPU_OFFLOAD_NUM_CPU_CHUNKS.
      • The hash value of a CPU chunk is inherited from the vLLM request's original block hash (full blocks only).
      • Completed save and load information is collected from kv_connector_output.kv_connector_stats.
    • The offload worker now uses a plain key-value store.
  2. Code refactoring.
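The scheduler-side manager described above can be sketched as follows. The class name matches the PR, but the API (lookup/save, the eviction return value) and the default capacity are illustrative assumptions, not the actual implementation:

```python
# Hypothetical sketch of an LRU offload manager keyed by block hash,
# with capacity expressed in chunks (see the TODO below).
import os
from collections import OrderedDict


class LRUOffloadManager:
    """Tracks which block hashes are resident in the CPU cache, LRU-evicted."""

    def __init__(self, num_chunks=None):
        # Capacity in number of chunks; default env var name is from the PR,
        # the fallback value is made up for illustration.
        self.capacity = num_chunks or int(
            os.environ.get("TPU_OFFLOAD_NUM_CPU_CHUNKS", "1024"))
        self._chunks = OrderedDict()  # block_hash -> chunk metadata

    def lookup(self, block_hash):
        """Return True and refresh recency if the chunk is cached."""
        if block_hash in self._chunks:
            self._chunks.move_to_end(block_hash)
            return True
        return False

    def save(self, block_hash, meta=None):
        """Admit a chunk, evicting the least recently used on overflow.

        Returns the evicted hashes so the worker can free their buffers.
        """
        evicted = []
        if block_hash in self._chunks:
            self._chunks.move_to_end(block_hash)
        else:
            self._chunks[block_hash] = meta
            while len(self._chunks) > self.capacity:
                old_hash, _ = self._chunks.popitem(last=False)
                evicted.append(old_hash)
        return evicted
```

With decisions centralized like this, the worker side only needs a dict-like store mapping block hashes to CPU buffers, with no eviction logic of its own.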

TODOs:

  1. The CPU cache size (number of chunks) and the staging buffer capacity are configured as numbers of chunks or tokens rather than bytes. This limitation stems from the offload connector scheduler lacking block size information.
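To illustrate why a byte-based budget needs information the scheduler does not currently have, here is a rough sizing sketch; all shapes and values below are hypothetical examples, not real configuration:

```python
# Converting a chunk budget into bytes requires the block size and the
# per-token KV footprint, neither of which the connector scheduler sees.
def chunk_bytes(block_size, num_layers, num_kv_heads, head_dim,
                dtype_bytes=2):
    # K and V each store block_size tokens x num_kv_heads x head_dim
    # per layer, hence the leading factor of 2.
    return 2 * block_size * num_layers * num_kv_heads * head_dim * dtype_bytes


# Example: 16-token blocks, 32 layers, 8 KV heads, head_dim 128, bf16
# -> 2 MiB per chunk, so a 1024-chunk budget is 2 GiB of CPU cache.
budget = 1024 * chunk_bytes(16, 32, 8, 128)
```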

Tests

pytest -sv tests/distributed/offload/tpu_offload_cpu_backend_test.py
pytest -sv tests/distributed/offload/tpu_offload_connector_worker_test.py
pytest -sv tests/distributed/offload/tpu_offload_connector_scheduler_test.py
pytest -sv tests/distributed/offload/tpu_offload_utils_test.py
pytest -sv tests/distributed/offload/tpu_offload_manager_test.py
pytest -sv tests/distributed/offload/tpu_offload_accuracy_test.py

Checklist

Before submitting this PR, please make sure:

  • I have performed a self-review of my code.
  • I have necessary comments in my code, particularly in hard-to-understand areas.
  • I have made or will make corresponding changes to any relevant documentation.


@juncgu-google juncgu-google changed the title [TPU Offload][WIP] Separate offload manager and cpu-cache backend, and code structure refactor [TPU Offload] Separate offload manager and cpu-cache backend, and code structure refactor Nov 21, 2025
Signed-off-by: Juncheng Gu <[email protected]>
@dannawang0221

/lgtm

@juncgu-google juncgu-google force-pushed the cpu-offloading/dev-cpu-backend branch from 81d8fc0 to 1be2866 Compare November 22, 2025 05:00
@juncgu-google juncgu-google merged commit 0c9f039 into cpu-offloading/dev Nov 22, 2025
3 checks passed