Replace Dart-flute-complex with Dart-flute-todomvc #97

Open · eqrion wants to merge 1 commit into main

Conversation

@eqrion eqrion commented Jul 23, 2025

This is the WIP commit I've been using to compare the behavior of Dart-flute-todomvc and Dart-flute-complex. I also built a local version of https://wonderous.app/web/ using Wasm-GC to compare as well.

From profiling I observed the following relative rates of allocation in bytes per frame:

  • Dart-flute-todomvc = 1x (baseline)
  • Dart-flute-complex = 39x
  • wonderous.app = 0.34x

This is from a profiler that samples 1% of allocations and reports the stack, bytes, etc. for each sample. So I have good relative numbers, but not precise absolute numbers.

In summary, complex allocates about 39x more bytes than todomvc, which in turn allocates about 3x more bytes than wonderous.app.
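
To make the relative numbers concrete, the ratios above can be read as sampled bytes per frame normalized to the todomvc baseline. A tiny sketch of that normalization (the byte counts below are placeholders, not the measured data):

const sampledBytesPerFrame = {
  // Placeholder values; only the ratios matter, since the profiler samples
  // roughly 1% of allocations.
  "Dart-flute-todomvc": 1.0e6,
  "Dart-flute-complex": 39e6,
  "wonderous.app": 0.34e6,
};

const baseline = sampledBytesPerFrame["Dart-flute-todomvc"];
for (const [name, bytes] of Object.entries(sampledBytesPerFrame)) {
  console.log(`${name} = ${(bytes / baseline).toFixed(2)}x of baseline`);
}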

In addition, over half of the allocated bytes in complex are coming from stacks with CupertinoTimePicker in their name. That seems excessive to me.

I would prefer that todomvc replace complex, as it seems more representative of realistic workloads. This PR includes both in case anyone else wants to run comparisons on them, but I would remove the complex benchmark if we go ahead with this.

Here are DartPad links for complex/todomvc if anyone wants to look at the rendered output of either.

cc @danleh @camillobruni @kmiller68 @mkustermann

Update to the latest wasm_gc_benchmarks commit, and pull in the todomvc
benchmark too.

I had to update the benchmark runner polyfill. I just copy-pasted things
from tools/run_wasm.js until it worked for me. This part likely needs
some filtering and cleanup.
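
For context, the shims involved are roughly of this shape; this is a minimal sketch for a bare JS shell, not the actual contents of tools/run_wasm.js (drainTimerQueue is a hypothetical helper name):

// setTimeout/clearTimeout shim for shells without an event-loop timer API.
// Callbacks are queued and drained explicitly by the harness.
var timerIdCounter = 0;
const timerQueue = new Map();

globalThis.setTimeout = function (f, _ms) {
  var id = timerIdCounter++;
  timerQueue.set(id, f);
  return id;
};

globalThis.clearTimeout = function (id) {
  timerQueue.delete(id);
};

// Hypothetical helper: run whatever callbacks the Dart runtime has scheduled.
function drainTimerQueue() {
  for (const [id, f] of [...timerQueue]) {
    timerQueue.delete(id);
    f();
  }
}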

netlify bot commented Jul 23, 2025

Deploy Preview for webkit-jetstream-preview ready!

  • 🔨 Latest commit: e16893c
  • 🔍 Latest deploy log: https://app.netlify.com/projects/webkit-jetstream-preview/deploys/68814d1b47b2f40007fa30c2
  • 😎 Deploy Preview: https://deploy-preview-97--webkit-jetstream-preview.netlify.app

@danleh danleh left a comment

Thanks a lot! Changing to todomvc as the Dart workload generally sounds fine to me (but it would be great if @mkustermann could chime in as the Dart expert).

In terms of the overall CPU profile, we see the following on d8 (complex vs. todomvc, where I already increased the iterations for todomvc to 30; see the comment in the driver):

  • Roughly the same number of samples is spent compiling code (2.2B vs. 2.1B), so given the shorter overall runtime, compilation accounts for a higher proportion of the samples in todomvc (5% vs. 11%).
  • todomvc is certainly much less GC-heavy, both in absolute (8.8B vs. 2.1B samples) and relative (20% vs. 11% of all samples) terms.
  • In terms of the bottom-up profile, they seem roughly similar, i.e., both contain similar high-self-time functions, e.g., _getMasqueradedRuntimeType, _HashSet.add, MultiChildRenderObjectElement.update, ComponentElement.performRebuild, _CompressedNode.get, SingleChildRenderObjectElement.update, Element.dependOnInheritedElement, _checkInstance (polymorphic dispatcher). The heaviest functions that were present in complex but are missing from the todomvc profile are BoxConstraints.==, RenderBox.layout, and RenderShiftedBox.paint. @mkustermann can you comment on whether that's realistic/fine/intended?

Also some smaller nits and fixes to make it run in d8; see the detailed comments.

try {
  await action();
} catch (e) {
  // JSC doesn't report/print uncaught async exceptions for some reason.

@kmiller68 Is this something to fix on the JSC side?
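
If the goal is just to make such failures visible regardless of engine behavior, a minimal pattern is to catch and print explicitly before rethrowing (a sketch, not the harness's actual code):

// Hypothetical wrapper name; illustrates the catch-and-print workaround only.
async function runAction(action) {
  try {
    await action();
  } catch (e) {
    // Print explicitly, since the shell may not report unhandled
    // async exceptions on its own.
    console.log("Uncaught async exception: " + e);
    throw e;
  }
}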

var id = timerIdCounter++;
// A callback can be scheduled at most once.
// (console.assert is only available on D8)
if (isD8) console.assert(f.$timerId === undefined);

While console.assert is available in d8, we overwrite/polyfill the console object in the harness (the console = { ... } block), so uncommenting this line causes an error. Let's just remove it.
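
For illustration, a console replacement of the kind referenced above typically defines only the methods the harness needs, which is why an uncommented console.assert call would throw (a minimal sketch assuming a shell-provided print function, not the actual JetStream polyfill):

// Sketch of a minimal console replacement: only log/error are provided,
// so console.assert(...) fails with "console.assert is not a function".
console = {
  log: (...args) => print(args.join(" ")),
  error: (...args) => print(args.join(" ")),
};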

self.dartUseDateNowForTicks = true;

function dartPrint(...args) { console.log(args); }
// globalThis.window ??= globalThis;

This line needs to be uncommented for d8.


// Signals `Stopwatch._initTicker` to use `Date.now` to get ticks instead of
// `performance.now`, as it's not available in d8.
self.dartUseDateNowForTicks = true;

I believe this is not required; d8 has had performance.now for a number of years: https://crsrc.org/c/v8/src/d8/d8.cc;drc=cb2b1a2ff084490f114e27a84b2a9a439af40ded;l=2180
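
If a fallback were still wanted for shells that genuinely lack performance.now, a feature check would avoid hard-coding the Date.now path (a sketch, not part of this PR):

// Only switch Stopwatch to Date.now-based ticks when the shell really
// lacks performance.now; current d8 and JSC both provide it.
self.dartUseDateNowForTicks =
  typeof performance === "undefined" ||
  typeof performance.now !== "function";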

@@ -19,7 +19,10 @@ mkdir -p build/ | tee -a "$BUILD_LOG"
# Generic Dart2wasm runner.
cp wasm_gc_benchmarks/tools/run_wasm.js build/ | tee -a "$BUILD_LOG"
# "Flute Complex" benchmark application.
cp wasm_gc_benchmarks/benchmarks-out/flute.dart2wasm.{mjs,wasm} build/ | tee -a "$BUILD_LOG"
# cp wasm_gc_benchmarks/benchmarks-out/flute.complex.dart2wasm.mjs build/flute.dart2wasm.mjs | tee -a "$BUILD_LOG"

nit: remove commented-out lines

@@ -1272,7 +1272,7 @@ class WSLBenchmark extends Benchmark {
benchmark.buildStdlib();
results.push(performance.now() - start);

performance.measure(markLabel, markLabel);
// performance.measure(markLabel, markLabel);

same

@@ -1283,7 +1283,7 @@ class WSLBenchmark extends Benchmark {
benchmark.run();
results.push(performance.now() - start);

performance.measure(markLabel, markLabel);
// performance.measure(markLabel, markLabel);

same

// `performance.now`, as it's not available in d8.
self.dartUseDateNowForTicks = true;

function dartPrint(...args) { console.log(args); }

Using print instead of console.log here hides the Dart console output, which I think makes sense.

],
preload: {
jsModule: "./Dart/build/flute.todomvc.dart2wasm.mjs",
wasmBinary: "./Dart/build/flute.todomvc.dart2wasm.wasm",
},
iterations: 15,

Given the shorter runtime of todomvc, I'd be happy to increase iterations to 30 (or so) to get a slightly less noisy measurement (still <5s wall time in d8).

},
iterations: 15,
worstCaseCount: 2,
tags: ["Wasm"],

Given #103, it probably makes sense to add the tag "disabled" for complex and "default" for todomvc.
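
For concreteness, the two benchmark entries could then look roughly like this, following the config shape in the snippets above (a sketch of the suggestion, not the final diff; fields other than those shown earlier are omitted):

// Dart-flute-todomvc: in the default set, with more iterations to reduce
// noise given its shorter runtime.
{
  preload: {
    jsModule: "./Dart/build/flute.todomvc.dart2wasm.mjs",
    wasmBinary: "./Dart/build/flute.todomvc.dart2wasm.wasm",
  },
  iterations: 30,
  worstCaseCount: 2,
  tags: ["Wasm", "default"],
},
// Dart-flute-complex: kept in the tree for comparisons, but excluded from
// the main score via the "disabled" tag.
{
  iterations: 15,
  worstCaseCount: 2,
  tags: ["Wasm", "disabled"],
},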

@danleh danleh requested a review from kmiller68 July 30, 2025 13:28
@mkustermann

In general I'm definitely in favor of including the TodoMVC benchmark.

These two Flutter-based benchmarks are just very different:

  • TodoMVC is mostly static: a new frame will add a new Todo item to a list, but the rest of the page stays the same. Complex instead rebuilds the entire page on each frame with modified widget positions, which causes a rebuild, re-layout, and re-draw of everything (which is why more time is spent in e.g. RenderBox.layout).
    => For real apps, a rebuild, re-layout & re-draw of the entire screen would happen e.g. on the first frame, when the window is resized, on navigation to a new screen, or when animations change the sizing of everything on screen.
    => The complex benchmark does this on every frame, whereas in real apps it happens only occasionally.
  • TodoMVC is a simpler page: it has fewer, and simpler, widgets being displayed. The Complex benchmark uses many more widgets. Especially the CupertinoTimePicker widget causes a lot of CPU time to be spent, lots of objects to be allocated, etc. (arguably its implementation is sub-optimal, but it's code that real Flutter apps out there use).

In some sense one could view

  • TodoMVC as exercising an "average frame" workload (i.e. updating a small part of the screen)
  • the Complex benchmark as exercising a "worst frame" workload (i.e. rebuilding the entire screen from scratch)

So if one optimizes for Complex, then in real apps one will be optimizing the worst-case/longest frames (which arguably is important, as the longest frames can cause UI jank).

What are the policies around benchmark inclusion into JetStream? Is there a concept of "experimental" workload where the bar for inclusion is lower (like in WebKit/Speedometer/experimental)? Is it a problem to keep both?

(If it's helpful, I'd actually be happy to supply more workloads. Recently I got dart2wasm compiling itself to work, which may also be quite interesting; there are big differences in CPU time and RAM consumption between D8, JSC, and JSShell. We also have a more real-world, larger Flutter UI app with an automated benchmark of a few screens that could be interesting, but I'd need to see if it can be made to run with the flute framework instead of real Flutter.)

@eqrion eqrion commented Jul 30, 2025

In some sense one could view

  • TodoMVC as exercising an "average frame" workload (i.e. updating a small part of the screen)
  • the Complex benchmark as exercising a "worst frame" workload (i.e. rebuilding the entire screen from scratch)

So if one optimizes for Complex, then in real apps one will be optimizing the worst-case/longest frames (which arguably is important, as the longest frames can cause UI jank).

I'm not opposed to trying to measure 'full frame redraw' workloads, but I'm concerned that 'complex' is pathological in a way that's much worse than we could realistically expect. From the first comment, I was seeing ~40x more bytes allocated per frame than TodoMVC, which itself allocated 3x more bytes than wonderous.app (which was the easiest thing I could get running by myself). And of all the bytes in 'complex', over half were coming from that time picker.

Is there a small modification you could make to TodoMVC to have it do some more full-frame redraws? Maybe resize something for a couple of frames?

What are the policies around benchmark inclusion into JetStream? Is there a concept of "experimental" workload where the bar for inclusion is lower (like in WebKit/Speedometer/experimental)? Is it a problem to keep both?

I don't think it's a problem to keep both if we apply the 'disabled' tag. That will keep it tested, but it won't be in the main final score.

(If it's helpful, I'd actually be happy to supply more workloads. Recently I got dart2wasm compiling itself to work, which may also be quite interesting; there are big differences in CPU time and RAM consumption between D8, JSC, and JSShell. We also have a more real-world, larger Flutter UI app with an automated benchmark of a few screens that could be interesting, but I'd need to see if it can be made to run with the flute framework instead of real Flutter.)

I'm happy to see more workloads, but I think for the main final score of the benchmark we need to balance things to not overly weight one particular language or workload. DotNet got two subtests, so I could see an argument for having two Dart subtests. But I don't want the other subtest to be 'complex' right now for the reasons stated above.

@mkustermann

but I'm concerned that 'complex' is pathological in a way that's much worse than we could realistically expect.

That's a very fair concern, and it is due to this inefficient CupertinoTimePicker implementation. I can try using a different widget instead of that one in the Complex benchmark and see how that influences the profile & allocation rate. (**)

Is there a small modification you could make to TodoMVC to have it do some more full frame redraws? Maybe resize something for a couple frames?

I'll try to make the (quickly made) TodoMVC example do more work. Amongst other things, I'd like it to scroll down the list when adding new items, as scrolling is an important thing for apps. (**)

I'm happy to see more workloads, but I think for the main final score of the benchmark we need to balance things to not overly weight one particular language or workload.

That makes sense. Happy to land some of those in a disabled state or make an aggregate number out of sub-benchmarks.

If all the vendors prefer TodoMVC over Complex atm, then how about we leave both in and make Complex disabled for now?

(**) I'll be OOO for a few more weeks over the summer, so it may only happen afterwards.

(Sidenote: Consider waiting until #95 is finished, or updating this PR to include the latest https://github.com/mkustermann/wasm_gc_benchmarks, to avoid updating the binaries twice.)
