Benchmarks for balance transfer check extension #3557
Conversation
Thanks, looks good!
I just had a look at the benchmark results in detail, and something isn't quite right. It looks like 1000 nested utility calls take the same time to process as one call.
use frame_system::pallet_prelude::RuntimeCallFor;

#[benchmark]
fn balance_transfer_check_mixed(c: Linear<0, 1000>) {
I'm not sure if this benchmark is meaningful, because it's just a linear combination of the other two nested calls. (And the results show it's the average of the other two, within a small level of variation.)
So either 1000 nested utility calls will give the highest value, or 1000 multisig calls will; this third alternative never actually appears in the results. It just makes the weight calculation slightly more expensive to run.
It would be more useful to have a benchmark containing multiple non-transfer calls, which is reasonably likely when using utility calls.
> It would be more useful to have a benchmark containing multiple non-transfer calls, which is reasonably likely when using utility calls.

Why would the result be any different? There is no special logic for non-transfer calls. We can use any type of call, and all of them would give the same result, since it's just pattern matching.
There is an early loop exit for transfer calls, making a run of non-transfer calls more expensive to process. It's likely that the optimiser will treat utility and non-transfer calls differently, like it treats utility and multisig calls differently.
But we'd need to run a benchmark to find out if the difference is significant.
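As a minimal standalone sketch of the point about early exits (a hypothetical, simplified call enum, not the actual runtime types): a transfer call stops the scan immediately, while a batch of non-transfer calls forces a full traversal.

```rust
// Hypothetical, simplified model of the extension's call-scanning loop;
// the real runtime call types and extension logic differ.
#[derive(Clone)]
enum Call {
    Transfer,
    Remark,
    Utility(Vec<Call>),
}

// Returns true as soon as any transfer is found (early exit), so a
// batch of non-transfer calls is the expensive case: every element
// must be visited before the function can return false.
fn contains_transfer(call: &Call) -> bool {
    match call {
        Call::Transfer => true,
        Call::Remark => false,
        Call::Utility(calls) => calls.iter().any(contains_transfer),
    }
}

fn main() {
    // Worst case: 1000 non-transfer calls, the whole list is scanned.
    let all_remarks = Call::Utility(vec![Call::Remark; 1000]);
    // Best case: the scan stops at the first element.
    let transfer_first = Call::Utility(vec![Call::Transfer; 1000]);
    assert!(!contains_transfer(&all_remarks));
    assert!(contains_transfer(&transfer_first));
}
```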
There are so many ways to construct a nested call. I think the goal is to find the worst/most expensive situation; since balance_transfer_check_mixed turns out not to be the one (and probably won't change in the future), I'm okay to remove it, but @teor2345 please also add the worst case you already have in mind.
I'll check the cases I listed above, and see which of them is the worst case (for a similar number of iterations).
A list of non-transfer calls is more expensive in some cases, so I replaced this benchmark with it:
https://github.com/autonomys/subspace/tree/benchmarks_disable_extension_exploration
I also tried tweaking the types of calls we're using in that branch, it wasn't more expensive, and in some cases it removed the variable CPU weight entirely. (So it was still prematurely optimised.)
// Estimated: `1486`
// Minimum execution time: 1_000_000 picoseconds.
Weight::from_parts(1_039_050, 0)
    .saturating_add(Weight::from_parts(0, 1486))
This result is suspicious, because it is a constant: there is no multiplication by `c`. We might have accidentally hit a highly optimised code path for nested vectors.
But we need a non-optimised worst case benchmark here. Maybe it would help to use black_box in our benchmark calls?
https://doc.rust-lang.org/beta/std/hint/fn.black_box.html
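As a standalone illustration (not the runtime code), `std::hint::black_box` hides a value from the optimiser, so the compiler can't pre-compute results from the fixed data a benchmark constructs:

```rust
use std::hint::black_box;

// Without black_box, the compiler can fold this entire loop into a
// constant, because the input below is fully known at compile time.
fn count_matches(calls: &[u32]) -> usize {
    calls.iter().filter(|&&c| c == 0).count()
}

fn main() {
    let calls = vec![0u32; 1000];
    // black_box forces the optimiser to treat the input as opaque,
    // so the measured work actually happens at runtime.
    let n = count_matches(black_box(&calls));
    assert_eq!(black_box(n), 1000);
}
```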
I don't know how black_box is supposed to help here. We just want to capture the scenario of finding transfer calls, if any.
Benchmarks are supposed to measure the performance on unknown or unpredictable data, but in this case we're giving the compiler fixed data to work with. This means it can do more work at compile time, incorrectly reducing the weight of the call.
This section of the docs goes into more detail about why that's a problem in benchmarks:
https://doc.rust-lang.org/beta/std/hint/fn.black_box.html#when-is-this-useful
There is no black box or anything similar in the macro expansion:
https://github.com/paritytech/polkadot-sdk/blob/master/substrate/frame/support/procedural/src/benchmark.rs#L1030
So our alternatives are:
- use black_box to avoid compile-time optimisations based on the exact content of the call (because the content of the call is unknown in production code)
- use inline(never) during runtime benchmarking on the runtime functions we call
- randomise how we construct the call
- all of the above
- assume it won't be an issue and accept the risk of reduced weights
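A sketch of the inline(never) alternative (illustrative only, with a stand-in function rather than the actual runtime code): keeping the checked function out-of-line prevents the optimiser from specialising its body to the constant data built in the benchmark.

```rust
// Illustrative stand-in for a runtime function we benchmark.
// #[inline(never)] keeps this as a real function call, so the
// optimiser can't fold its body away using the fixed input below.
#[inline(never)]
fn check_calls(calls: &[u8]) -> bool {
    // Early exit on the first "transfer-like" entry (0 here).
    calls.iter().any(|&c| c == 0)
}

fn main() {
    let calls = vec![1u8; 999];
    assert!(!check_calls(&calls));
}
```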
IMO we can try black_box and inline(never) to see if they help, but I'm leaning toward "assume it won't be an issue and accept the risk of reduced weights" because:
- This extension check is very lightweight
- balance_transfer_check_multisig will be used since it is larger than balance_transfer_check_utility, and we can delete the latter since it is suspicious
- Even in balance_transfer_check_multisig, the `c` doesn't make much difference (it is at most `19 * 1000` picoseconds)

Though I'm not sure if #3554 also has this problem if we change its benchmark to Linear.
I'm concerned that 19 is also an over-optimised estimate, I'll check it out today.

> Though I'm not sure if #3554 also has this problem if we change its benchmark to Linear.

I expect that it will, because we can use exactly the same nested utility call structure, and the call checking loop is very similar as well.
I get substantially different results for the linear CPU factor on my M1 Max MacBook Pro (20x more, 300-410 picoseconds * `c`), but the constant CPU is similar to this PR (always lower, and within 6%).
This suggests the results in this PR are an optimiser quirk on a particular platform or architecture.
@vedhavyas it makes me wonder if we're using different compiler commands, mine is:
cargo build --release --features runtime-benchmarks --bin subspace-node
This is using the nightly-2025-05-31 compiler and 2024 edition; I'm guessing you were on the previous nightly 2025-04-15 and 2021 edition?
Using black_box seems like the most helpful way to get reliable/reasonable benchmark figures here:
https://github.com/autonomys/subspace/tree/benchmarks_disable_extension_exploration
I also tried tweaking the types of calls we're using in that branch, it wasn't more expensive, and in some cases it removed the variable CPU weight entirely. (So it was still prematurely optimised.) Randomising the call types might help, but it's also a bit out of scope.
use sp_runtime::DispatchResult;

/// Maximum number of calls we benchmarked for.
const MAXIMUM_NUMBER_OF_CALLS: u32 = 1000;
What happens if the actual number of nested calls is greater than this? Does the submitter get the extra calls for free?
Yes, if it exceeds 1000, the remaining calls are free, but the max decoding depth for extrinsics is 256, so we will never reach that.
There are at least 3 ways to get more than 1000 loop iterations without exceeding the decoding depth:
- `utility([non_transfer_call; 2000])`: 2000 calls in one utility call
- `utility([utility([non_transfer_call; 100]); 100])`: 100 utility calls, each containing 100 calls
- a btree structure, with 2 calls at each level and 2 nested utility calls within them; this would take around 10 levels to have more than 1000 calls

It is likely that decoding will have different performance for case 3, due to branch misprediction (there are no long runs of the same item).
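A rough standalone sketch (hypothetical types, treating case 3 as a simple binary tree of utility calls) showing that around 10 levels of nesting already exceeds 1000 visited calls, while staying far below the 256 decoding-depth limit:

```rust
// Hypothetical call shape: each Utility node holds child calls.
enum Call {
    Leaf,
    Utility(Vec<Call>),
}

// Build a tree with 2 nested utility calls per level.
fn tree(depth: u32) -> Call {
    if depth == 0 {
        Call::Leaf
    } else {
        Call::Utility(vec![tree(depth - 1), tree(depth - 1)])
    }
}

// Count every call the checking loop would visit.
fn count(call: &Call) -> u32 {
    match call {
        Call::Leaf => 1,
        Call::Utility(children) => 1 + children.iter().map(count).sum::<u32>(),
    }
}

fn main() {
    // Depth 10 gives 2^11 - 1 = 2047 visited calls, well over 1000,
    // with only 10 levels of nesting (the decoding depth limit is 256).
    assert_eq!(count(&tree(10)), 2047);
    assert!(count(&tree(10)) > 1000);
}
```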
The MAXIMUM_NUMBER_OF_CALLS limit is inevitable, because the weight system works as:
- Charge for the worst-case weight
- Do execution/validation and get an accurate weight
- Refund any overcharged weight

So we have to first define the worst-case weight with MAXIMUM_NUMBER_OF_CALLS. The value of MAXIMUM_NUMBER_OF_CALLS is hard to determine, since we don't really have an explicit limit on the maximum nested calls. Options I can think of:
- Add such an explicit limit to the extension
- Calculate MAXIMUM_NUMBER_OF_CALLS by block_size / min_extrinsic_size
- Leave MAXIMUM_NUMBER_OF_CALLS as is, or pick another value; since the check is lightweight, the resulting weight won't be much different I guess.
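The charge/refund flow described above can be sketched as follows (a simplified model with made-up constants; the real Substrate weight types and extension APIs are more involved):

```rust
// Simplified model of pre-dispatch charge and post-dispatch refund.
const MAXIMUM_NUMBER_OF_CALLS: u64 = 1000;
const WEIGHT_PER_CALL: u64 = 19; // hypothetical picoseconds per call

// 1. Charge for the worst case up front.
fn pre_dispatch_weight() -> u64 {
    MAXIMUM_NUMBER_OF_CALLS * WEIGHT_PER_CALL
}

// 2. After execution, compute the accurate weight (capped at the maximum).
// 3. Refund whatever was overcharged.
fn refund(actual_calls: u64) -> u64 {
    let actual = actual_calls.min(MAXIMUM_NUMBER_OF_CALLS) * WEIGHT_PER_CALL;
    pre_dispatch_weight().saturating_sub(actual)
}

fn main() {
    // A 10-call extrinsic gets 990 calls' worth of weight refunded.
    assert_eq!(refund(10), 990 * WEIGHT_PER_CALL);
    // Beyond the cap, no extra weight is ever charged: this is the
    // "extra calls are free" concern discussed above.
    assert_eq!(refund(2000), 0);
}
```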
I have tried again with a max of 1000 nested utility calls, with 500 system.remark calls at each level until the last nested call, which included one balance transfer.
The allocator failed to allocate the memory. If such a case arises in practice, that extrinsic will never be included.
I'm not sure if that's the data structure I was talking about above.
Either way, there's going to be some number of calls that fits within the allocation limits. I'll try a few things and add a benchmark if it's heavier.
A list of 1000 non-transfer calls fits in memory and is heavier in some cases:
https://github.com/autonomys/subspace/tree/benchmarks_disable_extension_exploration
I haven't had time to explore the other data structures like squares and trees. Since both "long" and "tall" calls have similar results, I'm not sure it's going to be much different. Trees might be different if there's a lot of pointer dereferences, but I don't think it's a blocker for this PR.
> I just had a look at the benchmark results in detail, and something isn't quite right. It looks like 1000 nested utility calls take the same time to process as one call.

This benchmark number is an average of all the iterations within 0..1000. If there are no reads or writes for the entirety, then due to Rust's optimisation of pattern matching I believe this is expected.
I'm very confused by this statement. WASM should not have unknown or unpredictable data here. If you look closely at the way the calls are constructed, you will see the balance transfer is the last call, nested into multiple layers of utility calls. I do not see any reason why Rust's pattern matching and looping over calls should take the same or similar time as reading or writing data from/to the DB through host functions 🤔
Why do we care about the content of the call when all we care about is the type of call?
I have just tried this and it makes no significant difference actually.
Why would decoding be a problem here? The call is already decoded even before it reaches this extension. I'm a bit lost about what you are suggesting, and maybe @NingLin-P has some more insights?
I'll have a look at this today in detail, including some alternative data structures and benchmark wrappers like black_box.
I'm going to try and demonstrate this using code changes, because I think that might be clearer than discussions/documentation.
My concern is that one of our benchmarks is broken - it is not reflecting the actual cost of processing a call in production. And if one benchmark is obviously broken, it's possible multiple benchmarks are inaccurate, but in less obvious ways.
Your statement is correct, but it describes the cause of the problem we're seeing in the benchmark results. So I wonder if there's some kind of misunderstanding of how to create reliable benchmarks? Maybe re-reading this part of the Rust documentation could help clear things up? If it doesn't, can you help me understand which part needs more explanation?
I think there's been a bit of a miscommunication here. These kinds of optimisations happen at (or before) assembly level, so what matters for benchmark timing is the number of pointer dereferences and memory copies. There are no high-level data structure transformations going on.
I'm not sure why host functions are relevant here.
I think we're talking about the same thing using different words? But just to be clear, this is about automatic compiler optimisations based on specific benchmark data, not anything we've explicitly written in the code.
//! THIS FILE WAS AUTO-GENERATED USING THE SUBSTRATE BENCHMARK CLI VERSION 46.2.0
//! DATE: 2025-05-29, STEPS: `50`, REPEAT: `20`, LOW RANGE: `[]`, HIGH RANGE: `[]`
//! WORST CASE MAP SIZE: `1000000`
//! HOSTNAME: `macmini`, CPU: `<UNKNOWN>`
This is a minor issue, but would it be useful to edit the CPU type in here?
I tried a bunch of different changes in this branch, and left the code and results unsquashed so everyone can see it:
https://github.com/autonomys/subspace/tree/benchmarks_disable_extension_exploration
The useful changes were:
- a list of 1000 non-transfer calls in one utility call
- black_box wrapping the benchmark arguments
Let me know what you think, I'm happy to continue to add tree or square call structures, but I think we have good coverage and reasonable results now.
The results I have seen on your branch do align with the results I have generated, albeit with small fluctuations, which is expected over different benchmark runs with std-errors using the linear variant. I'm not sure why you believe having different variants of calls would help in this case. I'm specifically curious why you believe that non-transfer calls take more time 🤔 At the end of the day, it is still pattern matching. We do want the benchmarks for optimised code, since that is the code that runs on mainnet.
Thanks for working through this with me - I understand there's a lot going on right now. But I think there's been an ongoing miscommunication on this PR, and some important details are being lost. Having accurate benchmarks is security-critical, so I've highlighted the important parts of what I'm saying in italics.
What you say is only correct for individual calls; if there is more than one call in a batch, the call types matter. Checking multiple non-transfer calls contained in a utility batch call takes more time, because this loop exits at the first transfer call:
subspace/crates/subspace-runtime-primitives/src/extension.rs Lines 85 to 102 in 4f4900a
This has nothing to do with compiler optimisation settings; it's about code being optimised in different ways in benchmarks and in production. Processing the fixed calls used in benchmarks at compile time can create runtime benchmark results that significantly under-estimate weights for the variable calls we see at runtime on production nodes. This is a security and performance risk. These benchmarking concepts are explained in more detail here:
https://doc.rust-lang.org/beta/std/hint/fn.black_box.html#when-is-this-useful
Happy to explain further the next time we have a call?
I understood what you said; in other cases this is true, but in this case it absolutely does not matter.
Yes, and that is why my benchmarks do include one transfer call, while another benchmark includes a different call other than transfer. Let's assume, like you said, that different types of calls matter here (which I believe is not true, due to pattern matching): with your changes from your branch https://github.com/autonomys/subspace/blob/benchmarks_disable_extension_exploration/crates/subspace-runtime-primitives/src/extension/weights.rs I see no difference in the weights.
As @NingLin-P and I mentioned, this is why there is a need for an upper limit on the number of calls in the benchmark, as a worst-case scenario. Having said that, even if a call does exceed the upper limit, there will not be significant extra compute time comparatively, based on benchmark results from both branches. The only thing we can play around with is the upper limit on calls, which is still up for discussion if you prefer a higher number than 1000.
I would like to make progress on this and the other benchmark PR. Let's align and move forward. Thanks! Will fix conflicts once there is alignment.
I pushed two changes here that improve benchmark accuracy:
- using a single list of calls, rather than a nested list of mixed calls
- using 5,000 as the call limit, rather than 1,000
Happy for you to squash or drop those commits, whatever gets this merged.
Sure, but next time if you want to push changes, please fork this branch and create a PR with this branch as the base.
Closes: #3502