Journey to v9 #5961

silentsokolov started this conversation in Show and tell
As promised, here’s a short story about how we upgraded from v8 to v9.
Let’s start with the reason - the one and only: unstable behavior of large clusters (#5819) when nodes go down often.
We have a pretty big cluster (12 indexers and 8 searchers), and to save some cost we run it on spot instances. Because of that, after 1–2 months of uptime, the cluster could suddenly fail to recover, and we had to do a full restart. As you can imagine, that’s pretty painful.
We saw on Discord/GitHub that qw-airmail-20250522-hotfix was one of the most stable versions and, more importantly, that it fixed the exact bug we were hitting. So we decided to upgrade.

For testing, we spun up a much smaller cluster and ran some basic tests. We just could not physically reproduce the same production load - something we later really regretted. But anyway, the tests passed fine :)
First, if you use the official Helm chart, be aware there’s a bug with affinity merging: #133. So just be careful when configuring that.
Second, during the first startup the metastore runs migrations, and there are no logs about it, not even at debug level. On large indexes the migrations take a long time and don’t finish within the startup probe window. Because of that, the metastore pod gets killed before the migrations finish and starts over - and so it goes, in a loop. On one hand it’s obvious something is happening, but on the other there’s zero feedback. We spent more than an hour finding the real cause. In our test environment we didn’t see this because the indexes were much smaller.
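If you run into this, the practical workaround is to give the metastore a much wider startup probe window - ideally through your Helm values if the chart exposes them, or as a one-off patch like the sketch below. Everything named in the sketch (Deployment name, namespace, container name, probe path and port) is an assumption based on typical Quickwit defaults, so adjust it to your release:

```python
# Sketch: widen the metastore's startup probe window so migrations can finish.
# Assumes the Kubernetes Python client ("pip install kubernetes") and that the
# metastore runs as a Deployment called "quickwit-metastore" in the "quickwit"
# namespace with a container named "quickwit" - all of these are assumptions,
# adjust to your Helm release.
from kubernetes import client, config

config.load_kube_config()
apps = client.AppsV1Api()

# failureThreshold * periodSeconds = 30 * 10s = 300s before the pod is killed,
# instead of a default window that is too short for long-running migrations.
patch = {
    "spec": {
        "template": {
            "spec": {
                "containers": [
                    {
                        "name": "quickwit",
                        "startupProbe": {
                            "httpGet": {"path": "/health/livez", "port": 7280},
                            "periodSeconds": 10,
                            "failureThreshold": 30,
                        },
                    }
                ]
            }
        }
    }
}

apps.patch_namespaced_deployment(
    name="quickwit-metastore", namespace="quickwit", body=patch
)
```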
Third, we use the bulk API for log delivery. The new version introduced a new ingestion mode - ingest_v2. Unfortunately, it’s... not great. Seriously. On our test cluster, we could not reach even half the throughput we had with ingest_v1. We tried tons of configuration combinations - nothing helped. Thankfully, it’s possible to disable it. I guess we might be missing something and should dig deeper into the code, but still - that’s the reality.
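For context, our delivery path is nothing fancy - roughly the sketch below: NDJSON action/document pairs pushed to Quickwit’s Elasticsearch-compatible bulk endpoint. The service address and the app-logs index id are placeholders for illustration, not our real setup:

```python
# Sketch: pushing log records to Quickwit via its Elasticsearch-compatible bulk
# endpoint. The host and index id below are placeholders for illustration.
import json
import requests

QUICKWIT_URL = "http://quickwit-indexer:7280"  # placeholder service address
INDEX_ID = "app-logs"                          # placeholder index id


def send_bulk(docs: list[dict]) -> None:
    """Serialize docs as NDJSON action/document pairs and POST them in one request."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"create": {"_index": INDEX_ID}}))
        lines.append(json.dumps(doc))
    body = "\n".join(lines) + "\n"
    resp = requests.post(
        f"{QUICKWIT_URL}/api/v1/_elastic/_bulk",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/x-ndjson"},
        timeout=30,
    )
    resp.raise_for_status()


send_bulk([{"timestamp": "2025-01-01T00:00:00Z", "level": "INFO", "message": "hello"}])
```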
Fourth, and the most painful one - we spent a week fighting this. Basically, at random times indexing would just stop. We called it “stop the world.”

You can see the pattern on the screenshot. First, we found that it happens only when one specific index is active: if we stop ingestion for that index, the problem disappears. The only big difference was that this index used partitioning.

We first thought the issue was related to the number of partitions (not even that big - only 20). We managed to remove partitioning completely - it helped, but only for a while: the gaps between “STW” episodes just got longer.

Then we suspected the merge settings and tweaked them - no luck either.
During all this research we found this bug: #5240, but didn’t pay much attention to it - 1) it was about ingest_v2, and 2) we used to deliver large messages just fine. But as a last resort, we decided to cut off all messages larger than 1 MB.

And guess what - it worked! Everything went back to normal. We didn’t even bother bringing the partitions back - turns out we like it more without them.
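The cutoff itself is trivial - something along these lines on the delivery side, before documents hit the bulk endpoint (the field name and the drop-vs-truncate choice are just an illustration):

```python
# Sketch: drop (or truncate) any document whose serialized size exceeds ~1 MB
# before it is sent to the bulk API. Field names are illustrative.
import json

MAX_DOC_BYTES = 1_000_000  # ~1 MB cutoff


def filter_oversized(docs: list[dict]) -> list[dict]:
    kept = []
    for doc in docs:
        size = len(json.dumps(doc).encode("utf-8"))
        if size <= MAX_DOC_BYTES:
            kept.append(doc)
        else:
            # Alternative: truncate the big field instead of dropping the doc, e.g.
            # doc["message"] = doc["message"][:10_000]; kept.append(doc)
            print(f"dropping oversized log record ({size} bytes)")
    return kept
```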
Quickwit is still the best thing that happened to logs in recent years. We’ve been using it for over a year now, and nothing else gives the same performance/cost ratio.
What’s above isn’t really a complaint - it’s feedback, so others who try Quickwit don’t fall into the same traps. I really hope the dev team keeps pushing it forward - the community deserves a solid tool for serious workloads.
Thank you!
Replies: 1 comment

For the bug in the Helm chart when merging affinity, care to open a PR? I'll gladly review.