
[Feature Request] Direct server-server communication ("and then" clause) #226

Open
@justheuristic

Description


Based on conversations with @borzunov and @dbaranchuk.

Premise: currently in rpc_inference, the client sends inputs to a given server, collects that server's outputs, then manually forwards them to the next server. This round-trip through the client is needed for full fault tolerance, in case one of the servers disconnects. A faster option is to send data directly from server 1 to server 2, if we can do it without compromising fault tolerance -- and without insane code complexity.
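For context, the current client-side flow can be sketched roughly like this (`server.forward` and the surrounding names are illustrative, not the actual Petals API):

```python
def run_inference_step(servers, hidden_states):
    """Current fault-tolerant flow (a sketch): the client relays hidden states
    through each server in turn, so it can retry from the failed hop if a
    server disconnects. Every hop costs a full client round-trip."""
    for server in servers:
        # the client collects each server's output and re-sends it itself
        hidden_states = server.forward(hidden_states)
    return hidden_states
```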

Proposed solution: in rpc_inference, whenever the client sends a pb2 request, it can add a metadata key, e.g. "next_peer", containing the peer id of the next server. When a server finishes computing that request, it immediately sends the results to the specified peer_id, marked as "hidden states for session {inference_session_id}" -- assuming that the next peer currently takes part in the same session.
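A minimal sketch of attaching this hint, assuming a plain metadata dict on the request (the `InferenceRequest` type and its fields are hypothetical stand-ins, not the actual pb2 schema):

```python
from dataclasses import dataclass, field

@dataclass
class InferenceRequest:
    """Hypothetical stand-in for the pb2 inference request."""
    session_id: str
    hidden_states: bytes
    metadata: dict = field(default_factory=dict)

def make_request(session_id, hidden_states, next_peer_id=None):
    # The "next_peer" key tells the server where to forward its outputs;
    # when omitted, the server behaves exactly as it does today.
    request = InferenceRequest(session_id, hidden_states)
    if next_peer_id is not None:
        request.metadata["next_peer"] = next_peer_id
    return request
```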

On the receiving end, each server awaits asyncio.wait(request_from_client, request_from_previous_server) and processes whichever arrives first. If the request from the previous server arrives first, the current server begins processing it immediately, but still waits for the client's copy of the data to ensure that the results are valid.
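The race between the two input sources could look roughly like this with asyncio.wait (the queues and function names are illustrative; the real server would read from its RPC streams):

```python
import asyncio

async def await_inputs(client_queue, server_queue):
    """Process whichever input source arrives first (a sketch, not Petals code)."""
    from_client = asyncio.ensure_future(client_queue.get())
    from_server = asyncio.ensure_future(server_queue.get())
    done, _pending = await asyncio.wait(
        {from_client, from_server}, return_when=asyncio.FIRST_COMPLETED
    )
    if from_server in done:
        early = from_server.result()    # start computing on this immediately...
        client_copy = await from_client  # ...but still validate against the client
        return early if early == client_copy else client_copy
    from_server.cancel()  # nothing forwarded yet; proceed with client data only
    return from_client.result()
```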

Sending data to the next server is not guaranteed: the sending server simply fires the request and forgets about it.
Notably, the server still returns hidden states to the client as usual. This extra communication is fine because rpc_inference does not use much network throughput ("mbps"); it is mostly sensitive to latency ("ping").

Notes:

  • the client can request a different next_peer after each inference step; this happens if one of the "next" servers has disconnected from the inference session. Servers should send each hidden_states tensor to the server specified in the current request.next_peer
  • if a server receives a request that doesn't correspond to any active session, it simply ignores the request. This is fine because, if the request was valid, the client will still send the same data later
  • [security] since the previous server can be faulty or malicious, the "next peer" server should check that the data it received from the previous peer equals the data it eventually receives from the client; once we implement full verification, a server can simply sign its next-peer message so it can serve as proof of (benign or malicious) activity
    • if this happens, the inference message may have to be re-sent; we can support this by specifying the current sequence length in the server's response
  • [security] the server-to-server traffic caused by a client is strictly less than the client-to-server traffic, which eliminates potential misuse for DDoS amplification
  • the current best routing strategy would still work decently with this algorithm because it uses a strictly non-optimistic (time >= actual) performance model
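The session-lookup and verification notes above could be sketched like this (the module-level dict and both function names are hypothetical):

```python
import hashlib

active_sessions = {}  # session_id -> per-session state; illustrative only

def on_forwarded_states(session_id, hidden_states):
    """Handle a fire-and-forget message from the previous server."""
    session = active_sessions.get(session_id)
    if session is None:
        return  # unknown session: ignore; a valid client will resend the data
    session["early_states"] = hidden_states

def matches_client_copy(session_id, client_states):
    """Since the previous server may be faulty or malicious, check its
    forwarded states against the client's authoritative copy."""
    early = active_sessions.get(session_id, {}).get("early_states")
    if early is None:
        return True  # nothing was forwarded; just use the client's data
    return hashlib.sha256(early).digest() == hashlib.sha256(client_states).digest()
```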

@dbaranchuk also proposed a clever alternative solution, where each server runs its own fault-tolerant inference session with the subsequent servers. This can be a better solution if we find a way to limit the memory/bandwidth usage on a server.
