RPC: How to offload specific tensors / layers? #15020

Mushoz · 2025-08-01T19:37:37Z

I am running big MoE models on a Strix Halo 395. One weakness of the laptop is its relatively poor prompt processing speed. I was wondering if I can use an RPC server on a desktop with a dedicated GPU specifically for the attention layers / KV cache, and keep the rest on the laptop. Would this be possible with the -ot switch somehow?

Answered by rgerganov

rgerganov · 2025-08-02T08:19:13Z

Collaborator

Yes, you can offload specific tensors to RPC with -ot and the RPC device name, e.g. -ot 'blk.1*=RPC[localhost:50052]'. I'd appreciate it if you could share your results with this approach, so we can document it.
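
One thing to watch with the pattern: as far as I know, -ot treats the part before the = as a regular expression searched against tensor names, not as a shell glob, so a pattern like blk.1* can match more tensors than intended. A quick way to sanity-check a pattern before loading a model is to run it over a few names with grep -E. This is only a sketch: the tensor names below are illustrative samples in the usual blk.N.* style (not read from a real model), and grep -E is a stand-in for whatever regex engine llama.cpp uses, so only simple patterns should be compared this way.

```shell
#!/bin/sh
# Sample tensor names (illustrative only, not taken from a real model file).
names='blk.1.attn_q.weight
blk.2.attn_q.weight
blk.10.attn_k.weight
blk.10.ffn_gate_exps.weight
blk.21.attn_v.weight
output.weight'

# count_matches PATTERN: number of sample names the regex matches anywhere.
count_matches() {
    printf '%s\n' "$names" | grep -cE "$1"
}

# Read as a regex, "blk.1*" means "blk", any one character, then zero or
# more "1" characters, so it matches every blk.* tensor in the sample.
loose=$(count_matches 'blk.1*')

# Anchored and escaped: only block 1 and blocks 10-19.
tight=$(count_matches '^blk\.1[0-9]?\.')

# Attention tensors of any block.
attn=$(count_matches '\.attn_')

echo "loose=$loose tight=$tight attn=$attn"
```

On the six sample names this prints loose=5 tight=3 attn=4, i.e. the loose pattern picks up every block rather than only blocks 1 and 10-19.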

4 replies

Mushoz Aug 2, 2025
Author

This doesn't seem to work:

[docker@b75533261419 ~]$ llama-bench --rpc jaap-desktop:10001 -m ~/.cache/llama.cpp/unsloth_dots.llm1.inst-GGUF_UD-Q4_K_XL_dots.llm1.inst-UD-Q4_K_XL-00001-of-00002.gguf -ot '*att*=RPC[jaap-desktop:10001]'
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Radeon 8060S Graphics (RADV GFX1151) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat
error: unrecognized buffer type 'RPC[jaap-desktop:10001]'
Available buffer types:
  CPU
  Vulkan0
error: invalid parameter for argument: -ot

Removing the -ot switch fixes the issue. RPC itself is working, as I do see load and memory usage on the desktop card if I run the above command without the -ot switch.

rgerganov Aug 2, 2025
Collaborator

Does it work with llama-cli? You can also try using an IP address instead of the hostname.

Mushoz Aug 2, 2025
Author

llama-cli does seem to work. I will play around a bit and see if I can notice a difference in performance. Ideally this would also work with llama-bench so I can run objective benchmarks with specific settings. llama-server also seems to accept these settings.
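
For anyone wanting to reproduce this, the two sides of the setup can be sketched as below. It is a dry run that only prints the commands. The IP address and model path are placeholders, the attn pattern is just one guess at how to select attention tensors, and the RPC[host:port] buffer type name is the format given earlier in this thread (it may differ between llama.cpp versions). The -H and -p options are the rpc-server host and port flags as I understand them; binding to 0.0.0.0 should only be done on a trusted network.

```shell
#!/bin/sh
# All values are placeholders; substitute your own.
RPC_HOST="192.168.1.50"   # desktop IP address (hypothetical)
RPC_PORT="10001"          # port the rpc-server listens on
MODEL="model.gguf"        # hypothetical model path
PATTERN='attn'            # regex intended to select attention tensors

ENDPOINT="${RPC_HOST}:${RPC_PORT}"
# Buffer type name in the RPC[host:port] form quoted earlier in the thread.
OT_ARG="${PATTERN}=RPC[${ENDPOINT}]"

# Dry run: print the commands instead of executing them.
echo "desktop: rpc-server -H 0.0.0.0 -p ${RPC_PORT}"
echo "laptop:  llama-cli --rpc ${ENDPOINT} -m ${MODEL} -ot '${OT_ARG}'"
```

Keeping the endpoint in one variable makes sure the --rpc argument and the buffer type name inside -ot refer to exactly the same host:port string.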

Mushoz Aug 5, 2025
Author

While it works, it doesn't perform well at all. During prompt processing (where I want to take advantage of the high compute of the dedicated GPU), I just see the 1 Gbit Ethernet link fully saturated, and performance is very poor. This needs a much faster interconnect to be worth it.
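
A back-of-the-envelope estimate gives a feel for why the link saturates. Every number here is an assumption for illustration (prompt length, hidden size, layer count, fp32 activations), not a measurement from this setup, and the traffic model is deliberately crude: each layer whose attention lives on the remote GPU ships the activations for the whole prompt across the link and back once.

```shell
#!/bin/sh
# Hypothetical values; integer arithmetic only.
TOKENS=1024                  # prompt tokens (assumed)
HIDDEN=4096                  # hidden size (assumed)
LAYERS=60                    # layers with attention on the remote GPU (assumed)
BYTES_PER_VALUE=4            # fp32 activations (assumed)
LINK_BYTES_PER_S=125000000   # 1 Gbit/s = 125 MB/s, ignoring protocol overhead

# One hop = activations for all prompt tokens crossing the link once.
hop_bytes=$((TOKENS * HIDDEN * BYTES_PER_VALUE))
# Each remote layer needs one hop there and one hop back.
total_bytes=$((hop_bytes * 2 * LAYERS))
# Time spent purely on the wire (rounded down).
seconds=$((total_bytes / LINK_BYTES_PER_S))

echo "bytes per hop: $hop_bytes"
echo "total bytes:   $total_bytes"
echo "link time:     ~${seconds}s"
```

With these made-up numbers the transfer alone takes about 16 s for a 1024-token prompt, which caps prompt processing at roughly 64 tokens/s no matter how fast the remote GPU is. A 10x faster link would shrink that proportionally under the same assumptions.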
