Skip to content

Commit 73222bd

Browse files
Redis queue reimagined (#665)
* Adds Steve's redis queue * A queue in halibut * The test * . * . * This works with octopus * Cancel works * pick up test * Redis sub retry * Ok this is better * Add simple retries * Show that the reciever can re-connect itself * Add failing test for dequeuer disconnect * If the node processing the request disconnects we can detect that now * . * The sender can breifly disconnect and still recieve the response * well well that was hard * . * . * some clean up * . * Now support cancelling in flight requests when the sender is offline * Fix bug where we kept sending pulses * Now we can detect a redis that drops all it data like its hot * Add support to detect redis loss * . * . * . * Fix tests under load * Improve test * More reliable timeout * Start testing both queues at the same time * . * Queue tests are now shared between in mem and redis queue * Fixes a bug where subscriptions would not be unsubscribed from * ignore some exceptions * Fix big in detecting if redis lost its data * Better logging, dispose extra queues created * Never dispose CTS * Allow retrying when redis loses all of its data * Account for requests being abandoned * Update some TODOs * When the receiever of the requests detects data lose it returns a retryable error * Set TTL, Use token, Add logging, Remove generic from PollAndSub class * Redis Facade unit tests * . * Test we wait for the request to be collected before timing out on heart beats * . * Dont wait forever when an error occurs reading the response * Fix error types so that RPC will be retried * Random delay on error * . * One must dispose CTS less mem leak, but also not while anything is still using it * Fix issue where we dispose the PollAndSubscribeToResponse before we delete the response from redis * It compiles on windows * Ignore most redis tests for now * Start redis if not already started * . * Add redis test attribute * Log setup fixture to standard location * . * . * Support setting redis host * Finally the host is respected * Add back net48 * . * . * . * . * . * . * . * . * . * . * . * Cleanup HalibutRedisTransport * Cleanup * Cleanup * . * . * . * . * Merged in retryable * . * Don't re-send response * Fix typo in Redis queue doc * fmt ExceptionReturnedByHalibutProxyExtensionMethod * format Halibut.Util * Cleanup trailing whitespace * format HalibutRuntimeBuilder * Minor nits in QueueMessageSerializer * Fix nits for RedisPendingRequest * More doco * Fix typo in GetTokenForDataLossDetection * test cleanup * test cleanup * . * CancelOnDispose doco * . * cleanup * cleanup * Rename request to requestId * Consolidate key generation in RedisFacade * test cleanup * Tests for CancelOnDisposeCts * Cleanup nits in HalibutRedisTransport * . * Renamed DataLoss exception * Rename other data loss exception * Re arrange redis queue * clean redis PRQ builder * clean up test * . * . * Nit cleanup of NodeHeartBeatWatcher * Rename data loss detection namespace * . * Comment on QueueMessageSerializer origins * Fix compilation on net48 * Remove redundant overrideCancellationReason calls * . * More data loss renames * Fix nits in RedisPendingRequestQueue * fix formatting in MessageReaderWriterExtensionMethods * . * . * . * . * . * . * Geoff is willing to risk it * . * Don't immediatly poll for a response * Don't immediatly poll for request cancellation * Don't immediatly check the request was collected * add sequence diagram * . --------- Co-authored-by: Rhys Parry <[email protected]>
1 parent 9b27a2f commit 73222bd

File tree

82 files changed

+7727
-41
lines changed

Some content is hidden

Large Commits have some content hidden by default. Use the searchbox below for content that may be hidden.

82 files changed

+7727
-41
lines changed

docs/RedisQueue.md

Lines changed: 119 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,119 @@
1+
# Redis Pending Request Queue Beta
2+
3+
Halibut provides a Redis backed pending request queue for multi node setups. This solves the problem where
4+
a cluster of multiple clients need to send commands to polling services which connect to only one of the
5+
clients.
6+
7+
For example if we have two clients ClientA and ClientB and the Service connects to B, yet A wants
8+
to execute an RPC. Currently that won't work as the request will end up in the in memory queue for ClientA
9+
but it needs to be accessible to ClientB.
10+
11+
The Redis queue solves this, as the request is placed into Redis allowing ClientB to access the request and
12+
so send it to the Service.
13+
14+
## How to run Redis for this queue.
15+
16+
Redis can be started by running the following command in the root of the directory:
17+
18+
```
19+
docker run -v `pwd`/redis-conf:/usr/local/etc/redis -p 6379:6379 --name redis -d redis redis-server /usr/local/etc/redis/redis.conf
20+
```
21+
22+
Note that Redis is configured to have no backup, everything must be in memory. The queue makes this assumption to function.
23+
24+
# Design
25+
26+
## Background
27+
### What is a Pending Request Queue.
28+
29+
Halibut turns an RPC call into a RequestMessage which is placed into the Pending Request Queue. This is done by calling: `ResponseMessage QueueAndWait(RequestMessage)`. Which is a blocking call that queues the RequestMessage and waits for the ResponseMessage before returning.
30+
31+
Polling service, e.g, Tentacle, call into the `Dequeue` method of the queue to get the next `RequestMessage` to processing. It then responds by calling `ApplyResponse(ResponseMessage)`, doing so results in `QueueAndWait()` returning the ResponseMessage. This in turn results in the RPC call completing.
32+
33+
The Redis Pending Request Queue solves the problem where we have multiple clients, that wish to execute RPC calls to a single Polling Service that is connected to exactly one client. For example Client A makes an RPC call, but the service is connected to Client B. The Redis Pending Request Queue is what moves the `RequestMessage` from Client A to Client B to be sent to the service.
34+
35+
### Redis specific details relevant to the queue.
36+
37+
First we need to understand just a little about Redis and how we are using redis:
38+
- Redis may have data lose.
39+
- Pub/Sub does not have guaranteed delivery, we can miss publication.
40+
- Pub/Sub channels are not pets in Redis, they can be created simply by "subscribing" and are "deleted" when there are no subscribers to that channel.
41+
- Redis is connected to via the network, which can be flaky we will make retries to Redis when we can.
42+
43+
## High Level design.
44+
45+
Setup:
46+
- Client A is executing the RPC call
47+
- Client B has the Polling service connected to it.
48+
49+
At a high level steps the Redis Queue goes through to execute an RPC are:
50+
51+
1. Client B subscribes to the unique "RequestMessage Pulse Channel", as the client service is connected to it. The channel is keyed by the polling client id e.g. "poll://123"
52+
2. Client A executes an RPC and so Calls QueueAndWait with a RequestMessage. Each RequestMessage has a unique `GUID`.
53+
3. Client A subscribes to the `ResponseMessage channel` keyed by `GUID` to be notified when a response is available.
54+
4. Client A serialises the message and places the message into a hash in Redis keyed by the RequestMessage `Guid`.
55+
5. Client A Adds the `GUID` to the polling clients unique Redis list (aka queue). The key is the polling client id e.g. "poll://123".
56+
6. Client A pulses the polling clients unique "RequestMessage Pulse Channel", to alert to it that it has work to do.
57+
7. Client B receives the Pulse message and tries to dequeue a `GUID` from the polling clients unique Redis list (aka queue).
58+
8. Client B now has the `GUID` of the request and so atomically gets and deletes the RequestMessage from the Redis Hash using that guid.
59+
9. Client B sends the request to the tentacle, waits for the response, and calls `ApplyResponse()` with the ResponseMessage.
60+
10. Client B writes the `ResponseMessage` to redis in a hash using the `GUID` as the key.
61+
11. Client B Pulses the `ResponseMessage channel` keyed by the RequestMessage `GUID`, that a Response is available.
62+
12. Client A receives a pulse on the `ResponseMessage channel` and so knows a Response is available, it reads the response from Redis and returns from the `QueueAndWait()` method.
63+
64+
### Flow Diagram
65+
66+
The following sequence diagram illustrates this high-level flow:
67+
68+
![Redis Queue Solution Flow](images/redis-queue-flow-diagram.svg)
69+
70+
## Cancellation support.
71+
72+
The Redis PRQ supports cancellation, even for collected requests. This is done by the RequestReceiverNode (ie the node connected to the Service) subscribing to the request cancellation channel and polling for request cancellation.
73+
74+
## Dealing with minor network interruptions to Redis.
75+
76+
All operations to redis are retried for up to 30s, this allows connections to Redis to go down briefly with impacting RPCs even for non idempotent RPCs.
77+
78+
### Pub/Sub and Poll.
79+
80+
Since Pub/Sub does not have guaranteed delivery in Redis, in any place that we do Pub/Sub we must also have a form of polling. For example:
81+
- When Dequeuing work not only are we subscribed but when `Dequeue()` is called we also check for work on the queue anyway. (Note that Dequeue() returns every 30s if there is no work, and thus we have polling.)
82+
- When waiting for a Response, we are not only subscribed to the response channel we also poll to see if the Response has been sent back.
83+
84+
## Dealing with nodes that disappear mid request.
85+
86+
Either node could go offline at any time, including during execution of an RPC. For example:
87+
- The node executing the RPC could go offline, when the node with the Service connected is sending the Request to the Service.
88+
- The node sending the Request to the Service could go offline.
89+
90+
To handle this case in a way that allows for large file transfers aka request that take a long time, we have a concept of "heart beats".
91+
92+
When executing an RPC both nodes involved will send heart beats to a unique channel keyed by the request ID AND the nodes role in the RPC. For example:
93+
- The node executing RPC will pulse heart beats to a channel with a key such as `NodeSendingRequest:GUID`
94+
- The node sending the request to the service will pulse heart beats to a channel with a key such as: `NodeReceivingRequest:GUID`
95+
96+
Now each node can watch for heart beats from the other node, when heart beats stop being sent they can assume it is offline and cancel/abandon the request.
97+
98+
## Dealing with Redis losing its data.
99+
100+
Since redis can lose data at anytime the queue is able to detect data lose and cancel any inflight requests when data lose occurs.
101+
102+
## Message serialisation
103+
104+
Message serialisation is provided by re-using the serialiser halibut uses for transferring requests/responses over the wire.
105+
106+
## Cleanup of old data in Redis.
107+
108+
All values in redis have a TTL applied, so redis will automatically clean up old keys if Halibut does not.
109+
110+
Request message TTL: request pickup timeout + 2 minutes.
111+
Response TTL: default 20 minutes.
112+
Pending GUID list TTL: 1 day.
113+
Heartbeat rates: 15s; timeouts: sender 90s, processor 60s.
114+
115+
### DataStream
116+
117+
DataStreams are not stored in the queue, instead an implementation of `IStoreDataStreamsForDistributedQueues` must be provided. It will be called with the DataStreams that are to be stored, and will be called again with the "husks" of a DataStream that needs to be re-hydrated. DataStreams have unique GUIDs which make it easier to find the data for re-hydration.
118+
119+
Sub classing DataStream is a useful technique for avoiding the storage of DataStream data when it is trivial to read the data from some known places. For example a DataStream might be subclassed to hold the file location on disk that should be read when sending the data for a data stream. The halibut serialiser has been updated to work with sub classes of DataStream, in that it will ignore the sub class and send just the DataStream across the wire. This makes it safe to sub class DataStream for efficient storage and have that work with both listening and polling clients.

docs/images/redis-queue-flow-diagram.svg

Lines changed: 1 addition & 0 deletions
Loading

0 commit comments

Comments
 (0)