I am using nsq as a store-and-forward agent for IoT devices that report metrics to a central HTTPS API.
The small devices are offline for periods of time, and need to send back data opportunistically. The power supply is also intermittent.
You can think of the use case as instrumentation on a passenger bus: it collects metrics between the stops where wifi is available, with the power supply going up and down between those stops.
As with most IoT devices, the flash storage has a very limited write lifetime.
I've identified the following nsqd options relevant to this use-case:
  -mem-queue-size int
        number of messages to keep in memory (per topic/channel) (default 10000)
  -sync-timeout duration
        duration of time per diskqueue fsync (default 2s)
  -max-output-buffer-timeout duration
        maximum client configurable duration of time between flushing to a client (default 1s)
Our first approach was to disable the memory queue entirely with -mem-queue-size 0 and to set -sync-timeout 20s.
However this isn't working out for a couple of reasons:
- The OS is likely to flush writes sooner than 20s, so we are still wearing down the flash storage. We want disk writes to happen only once every 20s, which disabling the memory queue cannot guarantee.
- "in-flight" and "rescheduled" messages seem to remain solely in memory, sometimes for several minutes or longer, even when the memory queue is disabled as above. I get the impression these messages are needlessly lost when the power supply is cut. I may be wrong - is it possible to sync in-flight and deferred messages to disk aggressively (say every 20s)?
My next attempt was to flush the memory queue to the disk queue as below (based on the first part of Channel.flush()), called by a new 20s ticker:
+++ b/nsqd/channel.go
+func (c *Channel) flushToDisk() error { ... }
This way we can enable the memory queue, allowing good performance and avoiding disk wear, while still achieving data safety guarantees.
Two problems with this:
- It now flushes the entire memory queue to disk every 20s. Ideally we would like to avoid committing short-lived messages to disk that happen to be in the memory queue at that moment.
- This still doesn't seem to solve in-flight and deferred messages stuck in memory. We could commit them to disk too as Channel.flush() does but that would break the semantics of operation.
I'd like to get a recommendation from the nsqd developer community on how to proceed. It looks like whatever change is needed won't be more than a few lines of code, but I'm not sure of the best way to meet these requirements.
I'll be glad to submit patches and documentation for IoT use once we hear the best approach from an experienced developer.