fix: use compressed event size to close chunk #7517

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 5 commits into main

Conversation

sspaink
Contributor

@sspaink sspaink commented Apr 15, 2025

The chunk encoder writes gzipped content to a buffer. The bug is that enc.buf.Len() doesn't represent the total chunk size, only the compressed size. Currently enc.WriteBytes compares enc.bytesWritten to enc.softLimit to determine whether the chunk should be closed and returned, while enc.reset() uses enc.buf.Len() to adjust enc.softLimit. It seems to me that enc.bytesWritten is the expected size and allows the encoder to adapt the soft limit correctly. The updated tests reflect the improvement, showing a more stable chunk size.

Something to think about: is decision_logs.reporting.upload_size_limit_bytes meant to limit the final compressed or uncompressed size? I wrote this pull request with the assumption that it represents the final uncompressed size. I'd also assume that's what a user would expect when configuring the limit, because they'd be more concerned with how the configured service will deal with the uncompressed size. That's just speculation of course, so I'm open for discussion; it could be this was meant to reduce network packet size while maximizing the number of events per upload. It could also be made configurable 🤔 although to use enc.buf.Len() I think you also have to call enc.w.Flush() to make sure all pending data is written.
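
To make the mismatch concrete, here is a minimal, self-contained sketch with made-up names (chunkEncoder, WriteBytes, and reset are stand-ins here, not the actual plugins/logs code): the write path closes the chunk based on the uncompressed byte count, while the reset path scales the soft limit from the compressed buffer length, so the chunk always looks underutilized.

```go
package main

import (
	"bytes"
	"compress/gzip"
)

// chunkEncoder is a stripped-down stand-in for the real encoder: bytesWritten
// tracks the uncompressed payload, while buf.Len() only reflects the
// compressed output produced by the gzip writer so far.
type chunkEncoder struct {
	softLimit    int64
	bytesWritten int
	buf          *bytes.Buffer
	w            *gzip.Writer
}

// WriteBytes decides whether to close the chunk based on the uncompressed size...
func (enc *chunkEncoder) WriteBytes(bs []byte) error {
	if int64(enc.bytesWritten+len(bs)) > enc.softLimit {
		return enc.reset()
	}
	n, err := enc.w.Write(bs)
	enc.bytesWritten += n
	return err
}

// ...while reset adapts the soft limit based on the compressed size, so the
// chunk always looks underutilized and the soft limit keeps scaling up.
func (enc *chunkEncoder) reset() error {
	// The gzip writer must be flushed before buf.Len() accounts for pending data.
	if err := enc.w.Flush(); err != nil {
		return err
	}
	if int64(enc.buf.Len()) < enc.softLimit {
		enc.softLimit *= 2 // placeholder for the exponential scale-up
	}
	enc.buf.Reset()
	enc.w.Reset(enc.buf)
	enc.bytesWritten = 0
	return nil
}

func main() {
	buf := &bytes.Buffer{}
	enc := &chunkEncoder{softLimit: 32768, buf: buf, w: gzip.NewWriter(buf)}
	_ = enc.WriteBytes([]byte(`{"decision_id":"example"}`))
}
```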

This bug definitely made me tear some of my hair out finding it haha. I found it while working on #7455. Currently the buffers reset the chunk encoder frequently, which hides the problem because the soft limit never gets out of control. I was working on updating the event buffer to reuse the same chunk encoder throughout its lifecycle, and this is where the problem revealed itself, because enc.softLimit began to overflow due to frequent calls to enc.reset()! 🚨

What was happening is that the encoder kept increasing the soft limit: because it was checking against the compressed size, it assumed the chunk buffer was constantly being underutilized. I also added a check to prevent enc.softLimit from overflowing by setting an upper limit on the growth (twice the hard limit or math.MaxInt64 - 1). This might not be required because enc.reset() shouldn't be called so aggressively anymore, but I added a unit test showing that the overflow is possible.
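
To give a feel for the overflow, here is a hypothetical illustration (the real scale-up formula is different; this just doubles the limit on every reset) showing that an int64 soft limit wraps negative after a few dozen consecutive scale-ups:

```go
package main

import "fmt"

func main() {
	// Start from a 32 KiB soft limit and double it on every reset; the int64
	// wraps into a negative value once it reaches 2^63.
	limit := int64(32768)
	for i := 1; ; i++ {
		limit *= 2
		if limit < 0 {
			fmt.Printf("soft limit overflowed after %d scale-ups\n", i)
			break
		}
	}
}
```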

sspaink added 2 commits April 14, 2025 22:21
The chunk encoder writes gzipped content to a buffer, so enc.buf.Len() doesn't represent the total chunk size, only the compressed size, while enc.bytesWritten is the expected size and allows the encoder to adapt the soft limit correctly. The updated tests reflect the improvement, showing a more stable chunk size.

Signed-off-by: sspaink <[email protected]>
@sspaink sspaink marked this pull request as ready for review April 15, 2025 03:44
@johanfylling
Contributor

From the description of decision_logs.reporting.upload_size_limit_bytes:

Decision log upload size limit in bytes. OPA will chunk uploads to cap message body to this limit.

Since this note talks about the "message body", my assumption is that this config param refers to the limit of the size in transit, i.e. the chunk in its compressed form.

Perhaps @ashutosh-narkar can shed some more light on this, as I believe he implemented the soft limit. (It's been a while since then though, so he'll be forgiven if he doesn't recall 😄)

Contributor

@johanfylling johanfylling left a comment


Gosh, math 😵‍💫. You're gonna need to lead me by the hand on this one 😄.

```go
if enc.metrics != nil {
	enc.metrics.Counter(encSoftLimitScaleUpCounterName).Incr()
}

mul := int64(math.Pow(float64(softLimitBaseFactor), float64(enc.softLimitScaleUpExponent+1)))
// this can cause enc.softLimit to overflow into a negative value
```
Contributor


What are the circumstances for a scenario where we reach an overflow? Since the thing we're exponentially increasing is upload bytes, for us to overflow, wouldn't the previous successful reset need to have had a soft-limit already terabytes in size?

This is intuition talking, and not me doing actual calculus, though, so I may be way off in my estimates. It's very likely I'm missing something here, since you've encountered this in your work and had to fix it.

Contributor Author


As we discussed, I updated the PR to enforce a maximum configurable limit of 4294967296 instead, removing the need to check whether the soft limit will ever overflow.

```go
if limit < 0 {
	limit = math.MaxInt64 - 1
}
enc.softLimit = limit
```
Contributor


As always when it comes to math, I'm a bit confused 😅.
Why are we setting the soft-limit to 2x the configured limit (or even higher) here? Won't that cause us to write past the configured limit in WriteBytes()? There is probably some detail I'm missing.

Contributor Author


Deleted the math, it won't hurt us anymore 😜

@ashutosh-narkar
Member

Since this note talks about the "message body", my assumption is that this config param refers to the limit of the size in transit, i.e. the chunk in its compressed form.

Yes that's correct. It's been a while since I looked into this. But the goal is to pack as much as possible in the uploaded packet. We have some explanation of the algorithm in the section on Decision Logs. It's possible there could be a bug in some calculation which we haven't seen before.

…unk body should be closed.

Also enforce a maximum allowed upload limit of 2^32.

Signed-off-by: sspaink <[email protected]>

…s dropping the nd cache is less likely

Signed-off-by: sspaink <[email protected]>
@sspaink
Contributor Author

sspaink commented Apr 15, 2025

@ashutosh-narkar thank you for the clarification! In that case the bug is in the WriteBytes function, which compares the uncompressed size against the limit to determine when to close the chunk.

The latest changes update WriteBytes to check the compressed size instead, matching the reset function. You can see in the TestPluginTriggerManual unit test that, because of this change, it is now possible to send many more events in a single chunk: instead of spreading the events over 3 chunks, it sends all of them in 1. The TestChunkMaxUploadSizeLimitNDBCacheDropping test also shows that the ND cache is less likely to be dropped because the comparison is now against the compressed size. I think this could be a noticeable improvement!
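
As a rough, self-contained sketch of what that check involves (the helper name is made up, not the real WriteBytes code): the gzip writer has to be flushed before buf.Len() reflects everything written so far.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
)

// compressedSizeExceeds flushes the gzip writer so buf.Len() reflects all
// pending data, then compares the compressed size against the soft limit.
func compressedSizeExceeds(buf *bytes.Buffer, w *gzip.Writer, softLimit int64) (bool, error) {
	if err := w.Flush(); err != nil {
		return false, err
	}
	return int64(buf.Len()) >= softLimit, nil
}

func main() {
	var buf bytes.Buffer
	w := gzip.NewWriter(&buf)
	if _, err := w.Write([]byte(`{"decision_id":"example"}`)); err != nil {
		panic(err)
	}
	exceeded, err := compressedSizeExceeds(&buf, w, 32768)
	fmt.Println(exceeded, err)
}
```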

Also added a maximum configurable upload size limit of 4294967296 bytes; if the user configures a larger value, a warning is printed saying the limit was capped. Thank you @johanfylling for this suggestion.

A unit test was added, but I also tested it out locally:

```yaml
decision_logs:
  service: fakeservice
  reporting:
    upload_size_limit_bytes: 4294967296000000
```

```console
➜  buffertest ./opa_darwin_arm64 run -c opa-conf.yaml --server ./example.rego --log-level=error
{"level":"warning","msg":"the configured `upload_size_limit_bytes` (4294967296000000) has been set to the maximum limit (4294967296)","plugin":"discovery","time":"2025-04-15T15:28:29-05:00"}
```

@sspaink sspaink changed the title fix: use enc.bytesWritten to update chunk encoder soft limit fix: use compressed event size to close chunk Apr 15, 2025