-
Notifications
You must be signed in to change notification settings - Fork 41
fix: Add state root calculation on payload resolve #334
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
5d86917 to
ba5ada0
Compare
| ctx = ctx.with_cancel(fb_cancel).with_extra_ctx(next_flashblocks_ctx); | ||
| }, | ||
| _ = block_cancel.cancelled() => { | ||
| self.resolve_best_payload(&mut state, &ctx, &mut info, &best_payload).await; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not guaranteed that rollup-boost will cancel building (send engine_getPayload)
I think correct way would be to calculate state root asynchronously after we sent last flashblock
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reason why it is placed here is because on the event where rollup-boost does not send the payload resolve, on a subsequent FCU, the payload will be resolved regardless.
Also, I placed it on resolve because we are keen in exploring using the flashblock payload builder directly on a custom reth node without the need to use rollup-boost. Maybe this implementation is better in terms of de-coupling with rollup boost service?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, it is hard to deterministically say for sure which is the last flashblock. As observed during our deployment, it seems like on the builder, on most cases, the payload building early cancels which results in the payload not ever reaching the target flashblock. This seems to me to be the most deterministic way of ensuring locally built payloads contain a state root.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, you need to use --flashblock.leeway-time
And for external state root it should be considerable as usually it used in cases when it's slow to calculate state root, so you need some extra time to do this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As for decoupling
I think some assumptions behind state root calculation weren't designed for this case, so it's better to give it a good thought on how to change it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
https://github.com/flashbots/op-rbuilder/pull/334/files#diff-20125ce86c70cfd520b09bc61e2032873f80249afe0027ee76dfbfcb503e7fb7R535-R542 - This issue should be resolved in the attached code block. We start the state root calculation immediately once the last target flashblock is built.
|
Hi! |
|
@sieniven could you please attach newPayload latency before and after fix? |
|
@SozinM Hey, thanks and appreciate your time in helping review this :) Converting this PR to draft first because there are still some issues when testing - will re-open it in abit once everything is working on our end. |
|
@SozinM I have re-opened the PR. From the above discussion thread, I understand there is the leeway time configuration which allows us to early package flashblocks. However on the assumption that we are trying to maximize throughput of the chain, setting too big a leeway time may tradeoff chain max throughput (gas executed per block). I am trying to think of the most optimal way of doing this - it still seems to me that unless the intent of the builder for flashbots is to always couple it with the rollup boost proxy service, perhaps we could explore the default behavior of the builder to calculate state root on payload resolve. Giving some context here - we found the issue with the builder sync-ing with tip to have severe effects on the chain's maximum throughput (gas per second) during our stress testing on the flashblocks architecture. It was found that during our stress tests which benchmarks sending ERC20 transactions, the tps of the chain drops significantly and a huge part of the bottleneck was coming from cache misses due to the mismatch in block hash. This PR has been tested on our end to work and resolves the cache miss issue, and the current design is such that the builder will be default calculate state root on payload resolution, which will also nicely help our use case as we intend to use op-rbuilder as a dependency library crate to integrate the flashblock payload builder directly into our node. Thanks again and appreciate the prompt review for this PR :) |
|
@SozinM another consideration to why this PR shouldnt have no impact with rollup boost is because engine_getPayload on rollup boost uses the underlying accumulated flashblocks' cache. Since the payload resolution on the builder is called async, together with the expensive state root calculation of the unsafe payload with the default EL on FCU, this fix on cache misses to locally built blocks should have little impact on the intended design of rollup boost? |
|
@sieniven Can you please show newPayload graph with and without this fix |
|
@sieniven reth already provides these metrics, you could set up their op-reth dashboard |
We used a different approach where we used the metrics on op-node instead. Since that will be a better metric in terms of understanding the impact of this PR? Comparing engine_newPayload without state root calculations on the builder will definitely be significantly slower since we now add state root calculations at the end of the payload resolution. |
|
@sieniven engine_newPayload metric would suffice. |
|
@SozinM Sure, we will generate both on the EL response times and on the CL response times for you, during our stress test environment (which is an ERC20 stress test). On a separate note, I think there are a few more issues which I have opened on rollup-boost which explains why cache misses will still happen in your current flashblock design:
Because the source of truth of the flashblock architecture is on the rollup-boost accumulated flashblock sequence, the consensus layer will always accept this as the unsafe payload (and subsequently calculate the external state root of that payload there). However, we cannot guarantee the state of the accumulated sequence of rollup boost and the builder to be consistent, thus resulting in cache misses on locally built payloads on engine_newPayload to the builder. We have tested and this PR is confirmed to have fixed cache misses on engine_newPayload if the payload state matches (on rollup-boost and on op-rbuilder during payload resolution). However, if the states are mismatched, payloads are still re-validated on the builder which affects the overall chain throughput. Also, it was noticed on my end that when stress testing, payload re-validations could take more than 2 seconds on the builder. There might be a mutex issue on the payload, but I have yet to really debug this fully. I cant replicate that anymore with the fixes on the state root calculations, but might be worth adding to your todo to investigate that. |
|
Wdym by
Is it payload built be fallback EL, or empty payload built by CL? |
@SozinM Sorry for the confusion. What i meant is in this specific case, payload is locally built by the op-rbuilder. However, due to the accumulated payload sequence on rollup-boost state being different from the builder, the eventual accepted unsafe payload returned to the op-node sequencer is different from what was actually built on the op-rbuilder (when payload is resolved on rollup-boost). This results in the subsequent engine_newPayload to still cache miss - which is what I meant by this PR not being able to fully resolve the cache miss issue. |
|
@SozinM We have generated the metrics you were asking for. Note that these were generated under stress test situation with full blocks filled with ERC20 transactions from multiple addresses to simulate high chain traffic situation. The metrics captures a 15minute duration, which sufficiently captures the results. Hope this helps.
Note that there are still occasional cache misses which was the default flashblocks design issue that I was mentioning above in this comment: https://github.com//pull/334#issuecomment-3595341837. Thus this PR still doesnt fully resolve the issue of cache misses (resulting in full validation of locally built payloads), but it significantly reduces this and should help with optimizing throughput of the default flashblocks architecture.
|


📝 Summary
✅ I have completed the following steps:
make lintmake test