fix: use compressed event size to close chunk #7517

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 5 commits into main

Conversation

sspaink
Contributor

@sspaink sspaink commented Apr 15, 2025

The chunk encoder writes gzipped content to a buffer. The bug is that enc.buf.Len() doesn't represent the total chunk size, only the compressed size. Currently enc.WriteBytes compares enc.bytesWritten to enc.softLimit to determine whether the chunk should be closed and returned, while enc.reset() uses enc.buf.Len() to adjust enc.softLimit. It seems to me that enc.bytesWritten is the expected size and allows the encoder to adapt the soft limit correctly. The updated tests reflect the improvement, showing a more stable chunk size.

Something to think about: is decision_logs.reporting.upload_size_limit_bytes meant to limit the final compressed or uncompressed size? I wrote this pull request with the assumption that it represents the final uncompressed size. I'd also assume that's what a user would expect when configuring the limit, because they'd be more concerned with how the configured service will deal with the uncompressed size. That's just speculation of course, so I'm open for discussion; it could be this was meant to reduce network packet size while maximizing the number of events per upload. It could also be made configurable 🤔 although to use enc.buf.Len() I think you also have to call enc.w.Flush() to make sure all pending data is written.
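
To make the mismatch concrete, here is a minimal, self-contained sketch with made-up names (chunkEncoder, WriteBytes, and reset are stand-ins here, not the actual plugins/logs code): the write path closes the chunk based on the uncompressed byte count, while the reset path scales the soft limit from the compressed buffer length, so the chunk always looks underutilized.

```go
package main

import (
	"bytes"
	"compress/gzip"
)

// chunkEncoder is a stripped-down stand-in for the real encoder: bytesWritten
// tracks the uncompressed payload, while buf.Len() only reflects the
// compressed output produced by the gzip writer so far.
type chunkEncoder struct {
	softLimit    int64
	bytesWritten int
	buf          *bytes.Buffer
	w            *gzip.Writer
}

// WriteBytes decides whether to close the chunk based on the uncompressed size...
func (enc *chunkEncoder) WriteBytes(bs []byte) error {
	if int64(enc.bytesWritten+len(bs)) > enc.softLimit {
		return enc.reset()
	}
	n, err := enc.w.Write(bs)
	enc.bytesWritten += n
	return err
}

// ...while reset adapts the soft limit based on the compressed size, so the
// chunk always looks underutilized and the soft limit keeps scaling up.
func (enc *chunkEncoder) reset() error {
	// The gzip writer must be flushed before buf.Len() accounts for pending data.
	if err := enc.w.Flush(); err != nil {
		return err
	}
	if int64(enc.buf.Len()) < enc.softLimit {
		enc.softLimit *= 2 // placeholder for the exponential scale-up
	}
	enc.buf.Reset()
	enc.w.Reset(enc.buf)
	enc.bytesWritten = 0
	return nil
}

func main() {
	buf := &bytes.Buffer{}
	enc := &chunkEncoder{softLimit: 32768, buf: buf, w: gzip.NewWriter(buf)}
	_ = enc.WriteBytes([]byte(`{"decision_id":"example"}`))
}
```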

This bug definitely made me tear some of my hair out finding it haha. I found it while working on #7455. Currently the buffers reset the chunk encoder frequently, which hides the problem because the soft limit never gets out of control. I was working on updating the event buffer to reuse the same chunk encoder throughout its lifecycle, and this is where the problem revealed itself, because enc.softLimit began to overflow due to frequent calls to enc.reset()! 🚨

What was happening is that the encoder kept increasing the soft limit: because it was checking against the compressed size, it assumed the chunk buffer was constantly being underutilized. I also added a check to prevent enc.softLimit from overflowing by setting an upper limit on the growth (twice the hard limit or math.MaxInt64 - 1). This might not be required because enc.reset() shouldn't be called so aggressively anymore, but I added a unit test showing that the overflow is possible.
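
To give a feel for the overflow, here is a hypothetical illustration (the real scale-up formula is different; this just doubles the limit on every reset) showing that an int64 soft limit wraps negative after a few dozen consecutive scale-ups:

```go
package main

import "fmt"

func main() {
	// Start from a 32 KiB soft limit and double it on every reset; the int64
	// wraps into a negative value once it reaches 2^63.
	limit := int64(32768)
	for i := 1; ; i++ {
		limit *= 2
		if limit < 0 {
			fmt.Printf("soft limit overflowed after %d scale-ups\n", i)
			break
		}
	}
}
```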

sspaink added 2 commits April 14, 2025 22:21
The chunk encoder writes gzipped content to a buffer, so enc.buf.Len() doesn't represent the total chunk size, only the compressed size, while enc.bytesWritten is the expected size and allows the encoder to adapt the soft limit correctly. The updated tests reflect the improvement, showing a more stable chunk size.

Signed-off-by: sspaink <[email protected]>
@sspaink sspaink marked this pull request as ready for review April 15, 2025 03:44
@johanfylling
Contributor

From the description of decision_logs.reporting.upload_size_limit_bytes:

Decision log upload size limit in bytes. OPA will chunk uploads to cap message body to this limit.

Since this note talks about the "message body", my assumption is that this config param refers to the limit of the size in transit, i.e. the chunk in its compressed form.

Perhaps @ashutosh-narkar can shed some more light on this, as I believe he implemented the soft limit. (It's been a while since then though, so he'll be forgiven if he doesn't recall 😄)

Contributor

@johanfylling johanfylling left a comment


Gosh, math 😵‍💫. You're gonna need to lead me by the hand on this one 😄.

```go
if enc.metrics != nil {
	enc.metrics.Counter(encSoftLimitScaleUpCounterName).Incr()
}

mul := int64(math.Pow(float64(softLimitBaseFactor), float64(enc.softLimitScaleUpExponent+1)))
// this can cause enc.softLimit to overflow into a negative value
```
Contributor


What are the circumstances for a scenario where we reach an overflow? Since the thing we're exponentially increasing is upload bytes, for us to overflow, wouldn't the previous successful reset need to have had a soft-limit already terabytes in size?

This is intuition talking, and not me doing actual calculus, though, so I may be way off in my estimates. It's very likely I'm missing something here, since you've encountered this in your work and had to fix it.

Contributor Author


As we discussed, I updated the PR to enforce a maximum configurable limit of 4294967296 instead, removing the need to check whether the soft limit will ever overflow.

```go
if limit < 0 {
	limit = math.MaxInt64 - 1
}
enc.softLimit = limit
```
Contributor


As always when it comes to math, I'm a bit confused 😅.
Why are we setting the soft-limit to 2x the configured limit (or even higher) here? Won't that cause us to write past the configured limit in WriteBytes()? There is probably some detail I'm missing.

Contributor Author


Deleted the math, it won't hurt us anymore 😜

@ashutosh-narkar
Member

Since this note talks about the "message body", my assumption is that this config param refers to the limit of the size in transit, i.e. the chunk in its compressed form.

Yes that's correct. It's been a while since I looked into this. But the goal is to pack as much as possible in the uploaded packet. We have some explanation of the algorithm in the section on Decision Logs. It's possible there could be a bug in some calculation which we haven't seen before.

…unk body should be closed.

Also enforce a maximum allowed upload limit of 2^32.

Signed-off-by: sspaink <[email protected]>

…s dropping the nd cache is less likely

Signed-off-by: sspaink <[email protected]>
@sspaink
Contributor Author

sspaink commented Apr 15, 2025

@ashutosh-narkar thank you for the clarification! In that case the bug is in the WriteBytes function, which compares the uncompressed size against the limit to determine when to close the chunk.

The latest changes update WriteBytes to check the compressed size instead, matching the reset function. You can see in the TestPluginTriggerManual unit test that, because of this change, it is now possible to send many more events in a single chunk: instead of spreading the events over 3 chunks, it sends all of them in 1. The TestChunkMaxUploadSizeLimitNDBCacheDropping test also shows that the ND cache is less likely to be dropped because the comparison is now against the compressed size. I think this could be a noticeable improvement!
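
As a rough, self-contained sketch of what that check involves (the helper name is made up, not the real WriteBytes code): the gzip writer has to be flushed before buf.Len() reflects everything written so far.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
)

// compressedSizeExceeds flushes the gzip writer so buf.Len() reflects all
// pending data, then compares the compressed size against the soft limit.
func compressedSizeExceeds(buf *bytes.Buffer, w *gzip.Writer, softLimit int64) (bool, error) {
	if err := w.Flush(); err != nil {
		return false, err
	}
	return int64(buf.Len()) >= softLimit, nil
}

func main() {
	var buf bytes.Buffer
	w := gzip.NewWriter(&buf)
	if _, err := w.Write([]byte(`{"decision_id":"example"}`)); err != nil {
		panic(err)
	}
	exceeded, err := compressedSizeExceeds(&buf, w, 32768)
	fmt.Println(exceeded, err)
}
```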

Also added a maximum configurable upload size limit of 4294967296 bytes; if the user configures a larger value, a warning is printed saying the limit was capped. Thank you @johanfylling for this suggestion.

A unit test was added, but I also tested it out locally:

```yaml
decision_logs:
  service: fakeservice
  reporting:
    upload_size_limit_bytes: 4294967296000000
```

```console
➜  buffertest ./opa_darwin_arm64 run -c opa-conf.yaml --server ./example.rego --log-level=error
{"level":"warning","msg":"the configured `upload_size_limit_bytes` (4294967296000000) has been set to the maximum limit (4294967296)","plugin":"discovery","time":"2025-04-15T15:28:29-05:00"}
```

@sspaink sspaink changed the title fix: use enc.bytesWritten to update chunk encoder soft limit fix: use compressed event size to close chunk Apr 15, 2025