I am running big MoE models on a Strix Halo 395. One weakness of the laptop is its relatively poor prompt processing. I was wondering whether I can use an RPC server on a desktop with a dedicated GPU specifically for the attention layers / KV cache, and keep the rest on the laptop. Would this be possible with the …
Answered by rgerganov on Aug 2, 2025
Answer selected by Mushoz
Yes, you can offload specific tensors to RPC with `-ot` and the RPC device name, e.g. `-ot 'blk.1*=RPC[localhost:50052]'`. I'd appreciate it if you shared the results with this approach, so we can document it.
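For anyone scripting this, here is a minimal sketch of how the `-ot` value from the answer above could be assembled. The helper names `rpc_override` and `build_command` and the model path `model.gguf` are hypothetical. It also assumes the client is `llama-server` and that the RPC endpoint is passed with `--rpc`, neither of which is stated in this thread. The pattern `blk.1*` and the endpoint `localhost:50052` are copied from the answer as-is.

```python
import shlex


def rpc_override(pattern: str, host: str, port: int) -> str:
    """Build the value passed to -ot: '<tensor pattern>=RPC[<host>:<port>]'."""
    return f"{pattern}=RPC[{host}:{port}]"


def build_command(model_path: str, pattern: str,
                  host: str = "localhost", port: int = 50052) -> list[str]:
    """Assemble an argv list (hypothetical helper, not part of llama.cpp)."""
    return [
        "llama-server",            # assumption: the binary used as the client
        "-m", model_path,          # assumption: placeholder model path
        "--rpc", f"{host}:{port}", # assumption: RPC endpoint flag
        "-ot", rpc_override(pattern, host, port),
    ]


if __name__ == "__main__":
    # Pattern and endpoint are the ones from the answer above.
    cmd = build_command("model.gguf", "blk.1*")
    # shlex.join quotes the -ot value so the shell does not expand '*' or '[...]'.
    print(shlex.join(cmd))
```

Building the command as a list and quoting it only for display avoids shell globbing surprises with the `*` and `[...]` characters in the override value.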