|
| 1 | +# Redis Pending Request Queue Beta |
| 2 | + |
| 3 | +Halibut provides a Redis backed pending request queue for multi node setups. This solves the problem where |
| 4 | +a cluster of multiple clients need to send commands to polling services which connect to only one of the |
| 5 | +clients. |
| 6 | + |
| 7 | +For example if we have two clients ClientA and ClientB and the Service connects to B, yet A wants |
| 8 | +to execute an RPC. Currently that won't work as the request will end up in the in memory queue for ClientA |
| 9 | +but it needs to be accessible to ClientB. |
| 10 | + |
| 11 | +The Redis queue solves this, as the request is placed into Redis allowing ClientB to access the request and |
| 12 | +so send it to the Service. |
| 13 | + |
| 14 | +## How to run Redis for this queue. |
| 15 | + |
| 16 | +Redis can be started by running the following command in the root of the directory: |
| 17 | + |
| 18 | +``` |
| 19 | +docker run -v `pwd`/redis-conf:/usr/local/etc/redis -p 6379:6379 --name redis -d redis redis-server /usr/local/etc/redis/redis.conf |
| 20 | +``` |
| 21 | + |
| 22 | +Note that Redis is configured to have no backup, everything must be in memory. The queue makes this assumption to function. |
| 23 | + |
| 24 | +# Design |
| 25 | + |
| 26 | +## Background |
| 27 | +### What is a Pending Request Queue. |
| 28 | + |
| 29 | +Halibut turns an RPC call into a RequestMessage which is placed into the Pending Request Queue. This is done by calling: `ResponseMessage QueueAndWait(RequestMessage)`. Which is a blocking call that queues the RequestMessage and waits for the ResponseMessage before returning. |
| 30 | + |
| 31 | +Polling service, e.g, Tentacle, call into the `Dequeue` method of the queue to get the next `RequestMessage` to processing. It then responds by calling `ApplyResponse(ResponseMessage)`, doing so results in `QueueAndWait()` returning the ResponseMessage. This in turn results in the RPC call completing. |
| 32 | + |
| 33 | +The Redis Pending Request Queue solves the problem where we have multiple clients, that wish to execute RPC calls to a single Polling Service that is connected to exactly one client. For example Client A makes an RPC call, but the service is connected to Client B. The Redis Pending Request Queue is what moves the `RequestMessage` from Client A to Client B to be sent to the service. |
| 34 | + |
| 35 | +### Redis specific details relevant to the queue. |
| 36 | + |
| 37 | +First we need to understand just a little about Redis and how we are using redis: |
| 38 | + - Redis may have data lose. |
| 39 | + - Pub/Sub does not have guaranteed delivery, we can miss publication. |
| 40 | + - Pub/Sub channels are not pets in Redis, they can be created simply by "subscribing" and are "deleted" when there are no subscribers to that channel. |
| 41 | + - Redis is connected to via the network, which can be flaky we will make retries to Redis when we can. |
| 42 | + |
| 43 | +## High Level design. |
| 44 | + |
| 45 | +Setup: |
| 46 | + - Client A is executing the RPC call |
| 47 | + - Client B has the Polling service connected to it. |
| 48 | + |
| 49 | +At a high level steps the Redis Queue goes through to execute an RPC are: |
| 50 | + |
| 51 | + 1. Client B subscribes to the unique "RequestMessage Pulse Channel", as the client service is connected to it. The channel is keyed by the polling client id e.g. "poll://123" |
| 52 | + 2. Client A executes an RPC and so Calls QueueAndWait with a RequestMessage. Each RequestMessage has a unique `GUID`. |
| 53 | + 3. Client A subscribes to the `ResponseMessage channel` keyed by `GUID` to be notified when a response is available. |
| 54 | + 4. Client A serialises the message and places the message into a hash in Redis keyed by the RequestMessage `Guid`. |
| 55 | + 5. Client A Adds the `GUID` to the polling clients unique Redis list (aka queue). The key is the polling client id e.g. "poll://123". |
| 56 | + 6. Client A pulses the polling clients unique "RequestMessage Pulse Channel", to alert to it that it has work to do. |
| 57 | + 7. Client B receives the Pulse message and tries to dequeue a `GUID` from the polling clients unique Redis list (aka queue). |
| 58 | + 8. Client B now has the `GUID` of the request and so atomically gets and deletes the RequestMessage from the Redis Hash using that guid. |
| 59 | + 9. Client B sends the request to the tentacle, waits for the response, and calls `ApplyResponse()` with the ResponseMessage. |
| 60 | + 10. Client B writes the `ResponseMessage` to redis in a hash using the `GUID` as the key. |
| 61 | + 11. Client B Pulses the `ResponseMessage channel` keyed by the RequestMessage `GUID`, that a Response is available. |
| 62 | + 12. Client A receives a pulse on the `ResponseMessage channel` and so knows a Response is available, it reads the response from Redis and returns from the `QueueAndWait()` method. |
| 63 | + |
| 64 | +### Flow Diagram |
| 65 | + |
| 66 | +The following sequence diagram illustrates this high-level flow: |
| 67 | + |
| 68 | + |
| 69 | + |
| 70 | +## Cancellation support. |
| 71 | + |
| 72 | +The Redis PRQ supports cancellation, even for collected requests. This is done by the RequestReceiverNode (ie the node connected to the Service) subscribing to the request cancellation channel and polling for request cancellation. |
| 73 | + |
| 74 | +## Dealing with minor network interruptions to Redis. |
| 75 | + |
| 76 | +All operations to redis are retried for up to 30s, this allows connections to Redis to go down briefly with impacting RPCs even for non idempotent RPCs. |
| 77 | + |
| 78 | +### Pub/Sub and Poll. |
| 79 | + |
| 80 | +Since Pub/Sub does not have guaranteed delivery in Redis, in any place that we do Pub/Sub we must also have a form of polling. For example: |
| 81 | + - When Dequeuing work not only are we subscribed but when `Dequeue()` is called we also check for work on the queue anyway. (Note that Dequeue() returns every 30s if there is no work, and thus we have polling.) |
| 82 | + - When waiting for a Response, we are not only subscribed to the response channel we also poll to see if the Response has been sent back. |
| 83 | + |
| 84 | +## Dealing with nodes that disappear mid request. |
| 85 | + |
| 86 | +Either node could go offline at any time, including during execution of an RPC. For example: |
| 87 | + - The node executing the RPC could go offline, when the node with the Service connected is sending the Request to the Service. |
| 88 | + - The node sending the Request to the Service could go offline. |
| 89 | + |
| 90 | +To handle this case in a way that allows for large file transfers aka request that take a long time, we have a concept of "heart beats". |
| 91 | + |
| 92 | +When executing an RPC both nodes involved will send heart beats to a unique channel keyed by the request ID AND the nodes role in the RPC. For example: |
| 93 | +- The node executing RPC will pulse heart beats to a channel with a key such as `NodeSendingRequest:GUID` |
| 94 | +- The node sending the request to the service will pulse heart beats to a channel with a key such as: `NodeReceivingRequest:GUID` |
| 95 | + |
| 96 | +Now each node can watch for heart beats from the other node, when heart beats stop being sent they can assume it is offline and cancel/abandon the request. |
| 97 | + |
| 98 | +## Dealing with Redis losing its data. |
| 99 | + |
| 100 | +Since redis can lose data at anytime the queue is able to detect data lose and cancel any inflight requests when data lose occurs. |
| 101 | + |
| 102 | +## Message serialisation |
| 103 | + |
| 104 | +Message serialisation is provided by re-using the serialiser halibut uses for transferring requests/responses over the wire. |
| 105 | + |
| 106 | +## Cleanup of old data in Redis. |
| 107 | + |
| 108 | +All values in redis have a TTL applied, so redis will automatically clean up old keys if Halibut does not. |
| 109 | + |
| 110 | +Request message TTL: request pickup timeout + 2 minutes. |
| 111 | +Response TTL: default 20 minutes. |
| 112 | +Pending GUID list TTL: 1 day. |
| 113 | +Heartbeat rates: 15s; timeouts: sender 90s, processor 60s. |
| 114 | + |
| 115 | +### DataStream |
| 116 | + |
| 117 | +DataStreams are not stored in the queue, instead an implementation of `IStoreDataStreamsForDistributedQueues` must be provided. It will be called with the DataStreams that are to be stored, and will be called again with the "husks" of a DataStream that needs to be re-hydrated. DataStreams have unique GUIDs which make it easier to find the data for re-hydration. |
| 118 | + |
| 119 | +Sub classing DataStream is a useful technique for avoiding the storage of DataStream data when it is trivial to read the data from some known places. For example a DataStream might be subclassed to hold the file location on disk that should be read when sending the data for a data stream. The halibut serialiser has been updated to work with sub classes of DataStream, in that it will ignore the sub class and send just the DataStream across the wire. This makes it safe to sub class DataStream for efficient storage and have that work with both listening and polling clients. |
0 commit comments