Journey to v9 #5961

silentsokolov started this conversation in Show and tell
As promised, here’s a short story about how we upgraded from v8 to v9.
Let’s start with the reason - the one and only: unstable behavior of large clusters (#5819) when nodes go down often.
We have a pretty big cluster (12 indexers and 8 searchers), and to save some cost we run it on spot instances. Because of that, after 1–2 months of uptime, the cluster could suddenly fail to recover, and we had to do a full restart. As you can imagine, that’s pretty painful.
We saw on Discord/GitHub that qw-airmail-20250522-hotfix was one of the most stable versions and, more importantly, that it fixed the exact bug we were hitting. So we decided to upgrade.

For testing, we spun up a much smaller cluster and ran some basic tests. We just could not physically reproduce the same production load - something we later really regretted. But anyway, the tests passed fine :)
First, if you use the official Helm chart, be aware there’s a bug with affinity merging: #133. So just be careful when configuring that.
Second, during the first startup the metastore runs migrations, and there are no logs about it, not even at debug level. On large indexes the migrations take a long time and don’t finish within the startup probe window. Because of that, the metastore pod gets killed before the migrations finish and starts over - and so it goes, in a loop. On one hand it’s obvious something is happening, but on the other there’s zero feedback. We spent more than an hour finding the real cause. In our test environment we didn’t see this because the indexes were much smaller.
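If you run into this, the practical workaround is to give the metastore a much wider startup probe window - ideally through your Helm values if the chart exposes them, or as a one-off patch like the sketch below. Everything named in the sketch (Deployment name, namespace, container name, probe path and port) is an assumption based on typical Quickwit defaults, so adjust it to your release:

```python
# Sketch: widen the metastore's startup probe window so migrations can finish.
# Assumes the Kubernetes Python client ("pip install kubernetes") and that the
# metastore runs as a Deployment called "quickwit-metastore" in the "quickwit"
# namespace with a container named "quickwit" - all of these are assumptions,
# adjust to your Helm release.
from kubernetes import client, config

config.load_kube_config()
apps = client.AppsV1Api()

# failureThreshold * periodSeconds = 30 * 10s = 300s before the pod is killed,
# instead of a default window that is too short for long-running migrations.
patch = {
    "spec": {
        "template": {
            "spec": {
                "containers": [
                    {
                        "name": "quickwit",
                        "startupProbe": {
                            "httpGet": {"path": "/health/livez", "port": 7280},
                            "periodSeconds": 10,
                            "failureThreshold": 30,
                        },
                    }
                ]
            }
        }
    }
}

apps.patch_namespaced_deployment(
    name="quickwit-metastore", namespace="quickwit", body=patch
)
```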
Third, we use the bulk API for log delivery. The new version introduced a new ingestion mode - ingest_v2. Unfortunately, it’s... not great. Seriously. On our test cluster, we could not reach even half the throughput we had with ingest_v1. We tried tons of configuration combinations - nothing helped. Thankfully, it’s possible to disable it. I guess we might be missing something and should dig deeper into the code, but still - that’s the reality.
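For context, our delivery path is nothing fancy - roughly the sketch below: NDJSON action/document pairs pushed to Quickwit’s Elasticsearch-compatible bulk endpoint. The service address and the app-logs index id are placeholders for illustration, not our real setup:

```python
# Sketch: pushing log records to Quickwit via its Elasticsearch-compatible bulk
# endpoint. The host and index id below are placeholders for illustration.
import json
import requests

QUICKWIT_URL = "http://quickwit-indexer:7280"  # placeholder service address
INDEX_ID = "app-logs"                          # placeholder index id


def send_bulk(docs: list[dict]) -> None:
    """Serialize docs as NDJSON action/document pairs and POST them in one request."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"create": {"_index": INDEX_ID}}))
        lines.append(json.dumps(doc))
    body = "\n".join(lines) + "\n"
    resp = requests.post(
        f"{QUICKWIT_URL}/api/v1/_elastic/_bulk",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/x-ndjson"},
        timeout=30,
    )
    resp.raise_for_status()


send_bulk([{"timestamp": "2025-01-01T00:00:00Z", "level": "INFO", "message": "hello"}])
```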
Fourth, and the most painful one - we spent a week fighting this. Basically, at random times indexing would just stop. We called it “stop the world.”

You can see the pattern on the screenshot. First, we found that it happens only when one specific index is active: if we stop ingestion for that index, the problem disappears. The only big difference was that this index used partitioning.

We first thought the issue was related to the number of partitions (not even that big - only 20). We managed to remove partitioning completely - it helped, but only for a while: the gaps between “STW” episodes just got longer.

Then we suspected the merge settings and tweaked them - no luck either.
During all this research we found this bug: #5240, but didn’t pay much attention to it - 1) it was about ingest_v2, and 2) we used to deliver large messages just fine. But as a last resort, we decided to cut off all messages larger than 1 MB.

And guess what - it worked! Everything went back to normal. We didn’t even bother bringing the partitions back - turns out we like it more without them.
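The cutoff itself is trivial - something along these lines on the delivery side, before documents hit the bulk endpoint (the field name and the drop-vs-truncate choice are just an illustration):

```python
# Sketch: drop (or truncate) any document whose serialized size exceeds ~1 MB
# before it is sent to the bulk API. Field names are illustrative.
import json

MAX_DOC_BYTES = 1_000_000  # ~1 MB cutoff


def filter_oversized(docs: list[dict]) -> list[dict]:
    kept = []
    for doc in docs:
        size = len(json.dumps(doc).encode("utf-8"))
        if size <= MAX_DOC_BYTES:
            kept.append(doc)
        else:
            # Alternative: truncate the big field instead of dropping the doc, e.g.
            # doc["message"] = doc["message"][:10_000]; kept.append(doc)
            print(f"dropping oversized log record ({size} bytes)")
    return kept
```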
Quickwit is still the best thing that happened to logs in recent years. We’ve been using it for over a year now, and nothing else gives the same performance/cost ratio.
What’s above isn’t really a complaint - it’s feedback, so others who try Quickwit don’t fall into the same traps. I really hope the dev team keeps pushing it forward - the community deserves a solid tool for serious workloads.
Thank you!
Replies: 1 comment

For the bug in the Helm chart when merging affinity, care to open a PR? I'll gladly review.