fix: use compressed event size to close chunk #7517
base: main
Conversation
The chunk encoder writes gzipped content to a buffer. Using `enc.buf.Len()` gives the compressed size, not the total chunk size, whereas `enc.bytesWritten` is the expected size and allows the encoder to adapt the soft limit correctly. The updated tests reflect the improvement, showing a more stable chunk size. Signed-off-by: sspaink <[email protected]>
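For context, here is a minimal standalone sketch (not the OPA encoder itself; the event payload and iteration count are made up for illustration) of how far apart the two measures can drift: the uncompressed bytes handed to a `gzip.Writer`, analogous to `enc.bytesWritten`, versus the compressed bytes that land in the underlying buffer, analogous to `enc.buf.Len()`.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
)

func main() {
	var buf bytes.Buffer
	w := gzip.NewWriter(&buf)

	// A made-up decision log event; real events vary in size and content.
	event := []byte(`{"decision_id":"abc","result":true,"input":{"user":"alice"}}`)

	bytesWritten := 0
	for i := 0; i < 1000; i++ {
		n, err := w.Write(event)
		if err != nil {
			panic(err)
		}
		bytesWritten += n // uncompressed bytes, analogous to enc.bytesWritten
	}

	// Flush so pending compressed data reaches the buffer before measuring it.
	if err := w.Flush(); err != nil {
		panic(err)
	}

	fmt.Println("uncompressed bytes written:", bytesWritten)
	fmt.Println("compressed buffer length:  ", buf.Len()) // much smaller, analogous to enc.buf.Len()
}
```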
From the description of `upload_size_limit_bytes`:
Since this note talks about the "message body", my assumption is that this config param refers to the limit of the size in transit, i.e. the chunk in its compressed form. Perhaps @ashutosh-narkar can shed some more light on this, as I believe he implemented the soft limit. (It's been a while since then though, so he'll be forgiven if he doesn't recall 😄)
Gosh, math 😵💫. You're gonna need to lead me by the hand on this one 😄.
v1/plugins/logs/encoder.go (Outdated)

```go
if enc.metrics != nil {
	enc.metrics.Counter(encSoftLimitScaleUpCounterName).Incr()
}

mul := int64(math.Pow(float64(softLimitBaseFactor), float64(enc.softLimitScaleUpExponent+1)))
// this can cause enc.softLimit to overflow into a negative value
```
What are the circumstances for a scenario where we reach an overflow? Since the thing we're exponentially increasing is upload bytes, for us to overflow, wouldn't the previous successful reset need to have had a soft limit already terabytes in size?
This is intuition talking, and not me doing actual calculus, though, so I may be way off in my estimates. It's very likely I'm missing something here, since you've encountered this in your work and had to fix it.
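To put rough numbers on this intuition, here is a quick sketch; the scale-up factor of 2 and the starting soft limit of 32768 bytes (the documented default for `upload_size_limit_bytes`) are assumptions for illustration. An int64 soft limit only survives a few dozen consecutive scale-ups before wrapping into a negative value, so a single overflowing step does indeed require a limit that has already grown into exabyte territory.

```go
package main

import "fmt"

func main() {
	// Assumed starting point: the default upload_size_limit_bytes (32768)
	// and a doubling scale-up, mirroring softLimitBaseFactor^exponent growth.
	softLimit := int64(32768)

	for i := 1; ; i++ {
		softLimit *= 2
		if softLimit <= 0 {
			fmt.Printf("soft limit wrapped negative after %d scale-ups: %d\n", i, softLimit)
			return
		}
	}
}
```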
As we discussed, I updated the PR to now enforce a maximum configurable limit of 4294967296 instead, removing the need to check if the soft limit will ever overflow.
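A hedged sketch of what such a cap could look like; the function name `capUploadLimit` and the constant name are illustrative, not the actual OPA implementation, though the warning text mirrors the log message shown later in the thread.

```go
package main

import "log"

// maxUploadSizeLimitBytes is the illustrative 2^32 ceiling discussed above.
const maxUploadSizeLimitBytes = int64(4294967296)

// capUploadLimit clamps a user-configured upload_size_limit_bytes and warns
// when the configured value had to be reduced.
func capUploadLimit(configured int64) int64 {
	if configured > maxUploadSizeLimitBytes {
		log.Printf("the configured `upload_size_limit_bytes` (%d) has been set to the maximum limit (%d)",
			configured, maxUploadSizeLimitBytes)
		return maxUploadSizeLimitBytes
	}
	return configured
}

func main() {
	log.Println(capUploadLimit(4294967296000000)) // capped to 4294967296
}
```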
v1/plugins/logs/encoder.go (Outdated)

```go
if limit < 0 {
	limit = math.MaxInt64 - 1
}
enc.softLimit = limit
```
As always when it comes to math, I'm a bit confused 😅.
Why are we setting the soft limit to 2x the configured limit (or even higher) here? Won't that cause us to write past the configured limit in WriteBytes()? There is probably some detail I'm missing.
Deleted the math, it won't hurt us anymore 😜
Yes that's correct. It's been a while since I looked into this. But the goal is to pack as much as possible in the uploaded packet. We have some explanation of the algorithm in the section on Decision Logs. It's possible there could be a bug in some calculation which we haven't seen before.
…unk body should be closed. Also enforce a maximum allowed upload limit of 2^32. Signed-off-by: sspaink <[email protected]>
…s dropping the nd cache is less likely Signed-off-by: sspaink <[email protected]>
@ashutosh-narkar thank you for the clarification! In that case the bug is in `enc.WriteBytes` comparing the uncompressed `enc.bytesWritten` against the soft limit; the latest changes now use the compressed size to determine when a chunk should be closed.

Also added a maximum configurable upload size limit of 4294967296 bytes; if the user exceeds it, a warning is printed saying the value was capped. Thank you @johanfylling for this suggestion. Unit test added, but also tested it out locally:

```yaml
decision_logs:
  service: fakeservice
  reporting:
    upload_size_limit_bytes: 4294967296000000
```

```
➜ buffertest ./opa_darwin_arm64 run -c opa-conf.yaml --server ./example.rego --log-level=error
{"level":"warning","msg":"the configured `upload_size_limit_bytes` (4294967296000000) has been set to the maximum limit (4294967296)","plugin":"discovery","time":"2025-04-15T15:28:29-05:00"}
```
The chunk encoder writes gzipped content to a buffer. The bug is that `enc.buf.Len()` doesn't represent the total chunk size, but the compressed size. Currently `enc.WriteBytes` compares `enc.bytesWritten` to `enc.softLimit` to determine if the chunk should be closed and returned, while within `enc.reset()` it uses `enc.buf.Len()` to adjust `enc.softLimit`. It seems to me that `enc.bytesWritten` is the expected size and allows the encoder to adapt the soft limit correctly. The updated tests reflect the improvement, showing a more stable chunk size.

Something to think about: is `decision_logs.reporting.upload_size_limit_bytes` meant to limit the final compressed OR uncompressed size? I wrote this pull request with the assumption that it is meant to represent the final uncompressed size. I'd also assume this is what a user would expect when configuring the limit, because they'd be more concerned with how the configured service will deal with the uncompressed size. Of course this is just speculative, so open for discussion; it could be this was meant to reduce network packet size but maximize the number of events. Could make this configurable 🤔 although to use `enc.buf.Len()` I think you also have to call `enc.w.Flush()` to make sure all pending data is written.

This bug definitely made me tear some of my hair out finding it haha. Found it while working on: #7455. Currently the buffers reset the chunk encoder frequently, hiding the problem because the soft limit never gets out of control. I was working on updating the event buffer to reuse the same chunk encoder throughout its lifecycle. This is where the problem revealed itself, because `enc.softLimit` began to overflow due to frequent calls to `enc.reset()`! 🚨

What was happening is that the encoder kept increasing the soft limit: because it was checking against the compressed size, it assumed the chunk buffer was constantly being underutilized. I also added a check to prevent `enc.softLimit` from overflowing, setting an upper limit on the growth (twice the hard limit or `math.MaxInt64 - 1`). This might not be required because `enc.reset()` shouldn't be called so aggressively anymore, but I added a unit test showing that it is possible.
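To illustrate that feedback loop, here is a simplified standalone simulation. It is not the real encoder: the 90% utilization threshold, the 32768-byte starting limit, and the doubling rule are assumptions for illustration. When the scale-up decision looks at the compressed size while the data being accumulated is measured uncompressed, every reset looks underutilized and the soft limit keeps growing.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
)

// compressedLen gzips p into a fresh buffer and reports the compressed size.
func compressedLen(p []byte) int {
	var buf bytes.Buffer
	w := gzip.NewWriter(&buf)
	if _, err := w.Write(p); err != nil {
		panic(err)
	}
	if err := w.Close(); err != nil {
		panic(err)
	}
	return buf.Len()
}

func main() {
	softLimit := int64(32768) // assumed starting soft limit

	// ~35 KB of highly repetitive JSON: large uncompressed, tiny compressed.
	chunk := bytes.Repeat([]byte(`{"decision_id":"abc","result":true}`), 1000)
	compressed := int64(compressedLen(chunk))

	for i := 1; i <= 10; i++ {
		// Hypothetical scale-up rule: if the measured size is well below the
		// soft limit, assume the chunk was underutilized and double the limit.
		// Measuring the compressed size means this branch is taken every time.
		if compressed < softLimit*9/10 {
			softLimit *= 2
		}
		fmt.Printf("reset %d: uncompressed=%d compressed=%d softLimit=%d\n",
			i, len(chunk), compressed, softLimit)
	}
}
```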