Optimize computation algorithm #4
Conversation
Thank you for the PR. I read your comment and really appreciate your optimization hints! I tested your fractal_calculation.ts and got a speed increase from 5 to 7.7 fps (a 54% boost). However, the fractal itself was glitched (the white places are transparent, so I think they weren't filled). I want to try redoing your implementation from scratch to find out where the bug is. By the way, I already tried using the optimized comparison with a threshold of 0.001 and got an fps increase from 5 to 6.7 (with 0.000002 it speeds up to only 5.6 fps). I agree with most of your points and find all of them very helpful, but I'd like to discuss some of them further:
Isn't that what SIMD instructions were made for? I haven't delved into the history of SIMD instructions, but it seems to me that the fact that vector instructions do the same thing to several values in one go is essentially "doing everything in a row". I might be wrong, so please correct me if I am.
Done.
You mean loading 2 complex numbers at once? I don't quite understand what you mean. Thanks again for the valuable advice. I will try to improve my code according to your recommendations.
I realized yesterday while mulling it over that my math was wrong. (Counterexample to
I was referring to loading 4 entire complex numbers at once and then using the scalar algorithm, just rewritten to use SIMD operations. Essentially, what you'd have is this:

```rust
let low_2_complex_nums = *(roots_chunk.as_ptr() as *const v128);
let high_2_complex_nums = *(roots_chunk.as_ptr() as *const v128).offset(1);
let reals = i32x4_shuffle::<0, 2, 4, 6>(low_2_complex_nums, high_2_complex_nums);
let imaginaries = i32x4_shuffle::<1, 3, 5, 7>(low_2_complex_nums, high_2_complex_nums);
```

And then you'd do the scalar algorithm as you did with single complex numbers, branching and all, effectively following the scalar algorithm. In the end, it'd look similar to my JS code (modulo the logic error with the condition), just using vector operations instead. This is what would give you the speedup, as you're operating at just under scalar speed while processing 4 items at once.
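As a plain-JS illustration of what those two shuffles accomplish (a hypothetical scalar stand-in, not the actual WASM code; `deinterleave` is a made-up name): starting from four interleaved `[re, im]` pairs, they split the data into one vector of reals and one of imaginaries.

```javascript
// Hypothetical scalar model of the two SIMD shuffles above.
// `chunk` holds 4 complex numbers stored interleaved:
// [re0, im0, re1, im1, re2, im2, re3, im3].
function deinterleave(chunk) {
  // i32x4_shuffle::<0, 2, 4, 6> picks the even lanes across both halves...
  const reals = [chunk[0], chunk[2], chunk[4], chunk[6]];
  // ...and i32x4_shuffle::<1, 3, 5, 7> picks the odd lanes.
  const imaginaries = [chunk[1], chunk[3], chunk[5], chunk[7]];
  return { reals, imaginaries };
}

const { reals, imaginaries } = deinterleave([1, 10, 2, 20, 3, 30, 4, 40]);
// reals → [1, 2, 3, 4], imaginaries → [10, 20, 30, 40]
```

With the components split out like this, each arithmetic step of the scalar algorithm maps to one vector operation over all 4 numbers at once.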
Well, actually, even though your conditions are not identically equal, your variant of the check that used
Hm, I hadn't thought about that. I should try this approach, thank you for suggesting it!

UPD. I found that you used the wrong formula for the distance calculation:

```js
let ratio = real / imag;
return real * Math.sqrt(1 + ratio * ratio);
```

The result of the square root must be multiplied by `Math.abs(imag)`, not `real`:

```js
let ratio = real / imag;
return Math.abs(imag) * Math.sqrt(1 + ratio * ratio);
```
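For reference, the corrected formula is algebraically `sqrt(re² + im²)` and agrees with the built-in `Math.hypot`. A quick sketch (`stableDistance` is a hypothetical name, not from the repo):

```javascript
// Hypothetical sketch of the corrected distance formula from the comment above.
// sqrt(re^2 + im^2) rewritten as |im| * sqrt(1 + (re/im)^2), which avoids
// overflow in re*re + im*im when the components are large
// (most stable when |im| >= |re|, so the ratio stays <= 1).
function stableDistance(real, imag) {
  const ratio = real / imag;
  return Math.abs(imag) * Math.sqrt(1 + ratio * ratio);
}

console.log(stableDistance(3, 4)); // 5
console.log(Math.hypot(3, 4));     // 5 (built-in equivalent)
// The buggy real * Math.sqrt(...) version returns 3.75 here instead.
```

Multiplying the square root by `real` instead of `Math.abs(imag)` yields `re * sqrt(1 + re²/im²)`, which is not the norm at all, so the fix changes results everywhere, not just in edge cases.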
So I optimized the JS code and now it's as fast as wasm-scalar in Chrome; I'm honestly surprised. Though in Firefox, JS is still ~3 times slower than wasm.
Good catch, and that's probably where the glitchiness came from. Been a while since I've done numerical processing, can't you tell? 😅
Worth noting you may still derive some speed benefit from WebAssembly simply from using smaller integers. But yeah, JS after JIT compilation is usually reasonably close to native. (Firefox is probably just not inlining something it should - that's my guess.)
@dead-claudia I tried to rewrite the wasm-simd algorithm to make it load 4 complex numbers as you said. But it didn't raise performance: instead, fps dropped from 21 to 15. You can check the new code here. I think this has to do with the fact that when you process 4 complex numbers with one root, you still need to do the same number of calculations per cycle as when processing 2 complex numbers with one root, because you need two vectors instead of one to store the complex numbers. This means that, in theory, there shouldn't be a speed gain. (Although my initial algorithm processed one complex number with two roots.) The loss of speed, I think, is due to the inability to exit the approximation loop prematurely. Just compare this (old) to this (new). So, in my opinion, it's impossible to exactly replicate the scalar implementation using SIMD, or at least I don't know how to recreate this neat
Yeah, I was imagining a condition check there, using essentially the algorithm mentioned earlier. Based on your prior numbers, this might bring it roughly equal. (It's surprising, but it seems I didn't deconstruct the algorithm enough to confirm the number of operations per instruction first.)
This isn't all JS-specific (in fact, most of it isn't), though I used JS for familiarity. I haven't tested or benchmarked this (part of why it's a draft PR), but I created the PR anyway just to help explain.
The long 104-line comment explains most of it, but there are a few key insights into this:

- Use `a * sqrt(1 + (a/b)^2)` for computing the two-dimensional Euclidean norm, as it's more numerically stable. (JS has `Math.hypot(...components)` for this reason.) This happens to require an extra floating-point division, but you can reclaim that with a floating-point `fma` where available. (This applies across the board, though I see you're already using the WebGL `distance` function, so that's not an issue in the GPU code.)

Also, separately, I have a couple of larger nits:

- `type Root = [number, number]` and `roots: Root[]` would be a more idiomatic type for your `roots`. (You can also label them as `[x: number, y: number]` for added clarity.) Likewise, `type Color = [number, number, number, number]` + `colors: Color[]` would be clearer and more idiomatic. This would also cause the array's length to get checked, so you wouldn't be as likely to accidentally forget an entry or add one too many entries.
- `{i,u}8x16_swizzle(a)` may yield slightly faster code for the lane shifting than `*_shuffle(a, a)` due to better instruction selection, though that will need to be benchmarked.
- You can split the components out with `let real = f32x4_shuffle::<0, 2, 4, 6>(a, b); let imag = f32x4_shuffle::<1, 3, 5, 7>(a, b);` so you can do it in full batches of 4 using a similar algorithm to the scalar variant. This would require slightly more registers, of course, and it'd also complicate finding the matching lane (TL;DR: `let lane_idx = i32x4_bitmask(f32x4_lt(square_norm, f32x4_splat(0.000002))).trailing_zeros()`, lane found if `lane_idx < 4`), but you might derive a full 1.5x speedup from that, if not more, as you're doing even the expensive operations like `sqrt` fully in parallel.
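The lane-matching trick in that last point can be modeled in plain JS (a hypothetical scalar stand-in: the per-lane `f32x4_lt` comparison becomes mask bits, and `trailing_zeros` becomes `Math.clz32` arithmetic; `findMatchingLane` is a made-up name):

```javascript
// Hypothetical scalar model of the SIMD convergence check above.
// Builds the same bitmask i32x4_bitmask would produce: one bit per lane
// where square_norm < threshold (the f32x4_lt result).
function findMatchingLane(squareNorms, threshold) {
  let mask = 0;
  for (let i = 0; i < 4; i++) {
    if (squareNorms[i] < threshold) mask |= 1 << i;
  }
  // trailing_zeros(): index of the lowest set bit, or 32 if no bit is set.
  const laneIdx = mask === 0 ? 32 : 31 - Math.clz32(mask & -mask);
  return laneIdx < 4 ? laneIdx : -1; // -1 → no lane has converged yet
}

console.log(findMatchingLane([0.5, 0.1, 1e-7, 0.3], 0.000002)); // 2
console.log(findMatchingLane([0.5, 0.1, 0.2, 0.3], 0.000002));  // -1
```

The point of the bitmask is that one cheap scalar branch per iteration (`mask !== 0`) replaces four per-lane branches, which is what makes an early exit from the approximation loop feasible in the SIMD version.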