
Optimize computation algorithm #4


Open · wants to merge 1 commit into main

Conversation

dead-claudia

This isn't all JS-specific (in fact, most of it isn't), though I used JS for familiarity. I haven't tested or benchmarked this (part of why it's a draft PR), but I created the PR anyway just to help explain.

The long 104-line comment explains most of it, but there are a few key insights here:

  • Every object allocated in JS is allocated on the heap. (Even stacks are commonly heap-allocated.) In performance-critical code, it's common to avoid object allocation entirely. Temporaries that are immediately read from and dropped are typically optimized for, and they're usually allocated only in the really fast nursery, so the allocation cost here is relatively low (though still non-negligible in hot loops).
  • CPUs do better if you can do everything in a row, and memory loads outside L1 cache tend to be significantly costlier than the computation itself. This is why I flattened the roots: to ensure that the inner loop is dealing with exactly one contiguous array. (I'm using a typed array to match idioms, but normal arrays are more idiomatic.)
  • You should prefer a * sqrt(1 + (a/b)^2) for computing the two-dimensional Euclidean norm, as it's more numerically stable; see the sketch after this list. (JS has Math.hypot(...components) for this reason.) This happens to require an extra floating-point division, but you can reclaim it with a floating-point FMA where available. (This applies across the board, though I see you're already using the WebGL distance function, so that's not an issue in the GPU code.)
  • You can do a little math and take advantage of both sides being positive to do only one comparison in the inner loop. This may yield a moderate performance boost, as you wouldn't need the 2 absolute values + 2 floating-point comparisons, just a single comparison with an already-calculated result. I explain the math in the big comment as well. This holds for all variants, and would also simplify your SIMD condition substantially.
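For concreteness, here's a minimal sketch of that stable-norm idea in scalar Rust (illustrative, not code from this PR; note it scales by whichever component is larger in magnitude):

// Numerically stable 2D Euclidean norm: divide by the larger-magnitude
// component so the squared ratio stays <= 1 and can't overflow.
fn stable_hypot(re: f64, im: f64) -> f64 {
    let (big, small) = if re.abs() >= im.abs() { (re, im) } else { (im, re) };
    if big == 0.0 {
        return 0.0; // both components are zero
    }
    let ratio = small / big;
    big.abs() * (1.0 + ratio * ratio).sqrt()
}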

Also, separately, I have a couple larger nits:

  • In the TypeScript stuff, using type Root = [number, number] and root: Root[] would be a more idiomatic type for your roots. (You can also label the components as [x: number, y: number] for added clarity.) Likewise, type Color = [number, number, number, number] plus colors: Color[] would be clearer and more idiomatic. This would also cause the array's length to be checked, so you'd be less likely to accidentally omit an entry or add one too many.
  • You're using 32-bit floating-point arithmetic for WebAssembly, while JS natively uses 64-bit floating-point arithmetic, so it's not really an apples-to-apples comparison. This should be made clear in the demo.
  • In your SIMD calculation, {i,u}8x16_swizzle(a) may yield slightly faster code for the lane shifting than *_shuffle(a, a) due to better instruction selection, though that will need to be benchmarked.
  • In your SIMD calculation, I would suggest loading chunks of 4 and doing let real = f32x4_shuffle::<0, 2, 4, 6>(a, b); let imag = f32x4_shuffle::<1, 3, 5, 7>(a, b); so you can work in full batches of 4 using a similar algorithm to the scalar variant. This would require slightly more registers, of course, and it'd also complicate finding the matching lane (TL;DR: let lane_idx = i32x4_bitmask(f32x4_lt(square_norm, f32x4_splat(0.000002))).trailing_zeros(), lane found if lane_idx < 4; see the sketch after this list), but you might derive a full 1.5x speedup from that, if not more, as you're doing even the expensive operations like sqrt fully in parallel.
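To make that lane-finding TL;DR concrete, here's a hedged sketch using Rust's wasm32 intrinsics (it assumes square_norm already holds the four squared norms; the function name is illustrative):

use core::arch::wasm32::*;

// Returns the first of the 4 lanes whose squared norm is below the
// threshold, or None when no lane matches.
fn first_matching_lane(square_norm: v128) -> Option<u32> {
    // All-ones per matching lane, compressed to one bit per lane.
    let mask = i32x4_bitmask(f32x4_lt(square_norm, f32x4_splat(0.000002)));
    let lane_idx = (mask as u32).trailing_zeros(); // 32 when mask == 0
    if lane_idx < 4 { Some(lane_idx) } else { None }
}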

@alordash (Owner) commented Mar 26, 2022

Thank you for the PR. I read your comment and really appreciate your optimization hints!

I tested your fractal_calculation.ts and got a speed increase from 5 to 7.7 fps (a 54% boost). However, the fractal itself was glitched (the white areas are transparent, so I think they weren't filled):

[Screenshot: rendered Newton fractal with transparent, unfilled regions]
I want to try to redo your implementation from scratch to find out where the bug is. By the way, I already tried using the optimized comparison with a threshold of 0.001 and got an fps increase from 5 to 6.7 (with 0.000002 it only speeds up to 5.6 fps).

I agree with most of your points and find all of them very helpful, but I'd like to talk more about some of them:

  1. CPUs do better if you can do everything in a row

Isn't that what SIMD instructions were made for? I haven't delved into the history of SIMD, but it seems to me that the fact that vector instructions do the same thing to many values in one go is essentially "doing everything in a row". I might be wrong, so please correct me if I am.
I understand that JS doesn't have SIMD instructions, but if I were to sacrifice code readability for this kind of optimization, I would prefer to use SIMD calculations straight away. At least in this case it's not that hard.

  2. (About f32 and f64 numbers in wasm and JS) This should be made clear in the demo.

Done.

  3. In your SIMD calculation, I would suggest loading chunks of 4

You mean loading 2 complex numbers at once? I don't quite understand what a and b mean in your example.
I have tried loading 2 complex numbers instead of a single one in the simd_newton_method_approx function and got only a ~5% performance improvement. Here's that implementation. Actually, I don't think it has sped up the process significantly, or ever will, because even though I load twice as much data, I also need to do almost twice as many calculations. For example, I need to calculate the difference twice, because I need the difference between each of the loaded complex numbers and each root. It also greatly complicates the code (perhaps because I wrote it badly).

Thanks again for the valuable advice. I will try to improve my code according to your recommendations.

@dead-claudia (Author) commented Mar 28, 2022

@alordash

I tested your fractal_calculation.ts and got a speed increase from 5 to 7.7 fps (a 54% boost). However, the fractal itself was glitched (the white areas are transparent, so I think they weren't filled):

I want to try to redo your implementation from scratch to find out where the bug is. By the way, I already tried using the optimized comparison with a threshold of 0.001 and got an fps increase from 5 to 6.7 (with 0.000002 it only speeds up to 5.6 fps).

I realized yesterday while mulling that my math was wrong. (Counterexample to a > b AND c > d → a + c > b + d: a = 2, b = 1, c = 1, d = 1.5 → 2 > 1 AND 1 > 1.5 doesn't hold yet 2 + 1 > 1 + 1.5 does.) So feel free to ignore that optimization as it's not actually valid.

I have tried loading 2 complex numbers instead of a single one in the simd_newton_method_approx function and got only a ~5% performance improvement. Here's that implementation. Actually, I don't think it has sped up the process significantly, or ever will, because even though I load twice as much data, I also need to do almost twice as many calculations. For example, I need to calculate the difference twice, because I need the difference between each of the loaded complex numbers and each root. It also greatly complicates the code (perhaps because I wrote it badly).

I was referring to loading 4 entire complex numbers at once and then using the scalar algorithm, just rewritten to use SIMD operations. Essentially, what you'd have is this:

// Load 4 complex numbers (8 f32s) as two 16-byte vectors; the second
// pair of complex numbers sits one v128 (16 bytes) past the first.
let low_2_complex_nums = *(roots_chunk.as_ptr() as *const v128);
let high_2_complex_nums = *(roots_chunk.as_ptr() as *const v128).offset(1);
// Deinterleave the (re, im) pairs into a vector of reals and a vector
// of imaginaries.
let reals = i32x4_shuffle::<0, 2, 4, 6>(low_2_complex_nums, high_2_complex_nums);
let imaginaries = i32x4_shuffle::<1, 3, 5, 7>(low_2_complex_nums, high_2_complex_nums);

And then you'd run the scalar algorithm just as you did with single complex numbers, branching and all. In the end, it'd look similar to my JS code (modulo the logic error with the condition), just using vector operations instead.
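For instance, the distance step against all four roots at once might look roughly like this (a sketch; z_re and z_im are assumed names for the current point's components, not code from the repo):

// Broadcast the current point and compute its squared distance
// to all 4 roots in parallel.
let dx = f32x4_sub(f32x4_splat(z_re), reals);
let dy = f32x4_sub(f32x4_splat(z_im), imaginaries);
let square_norm = f32x4_add(f32x4_mul(dx, dx), f32x4_mul(dy, dy));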

This is what would give you the speedup, as you're operating at just under scalar speed while processing 4 items at once.

@alordash (Owner) commented Mar 30, 2022

@dead-claudia

I realized yesterday while mulling that my math was wrong. (Counterexample to a > b AND c > d → a + c > b + d: a = 2, b = 1, c = 1, d = 1.5 → 2 > 1 AND 1 > 1.5 doesn't hold yet 2 + 1 > 1 + 1.5 does.) So feel free to ignore that optimization as it's not actually valid.

Well, actually, even though your conditions are not equivalent, your variant of the check using squareNorm is correct. My original condition was the wrong one: |dx| < D AND |dy| < D is not the right way to determine whether the point (dx, dy) is close to the origin. The correct way is the following condition: dx^2 + dy^2 < D^2, and that's what you did. The fractal renders correctly using this condition.
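In code, the corrected check is simply (a minimal sketch):

// (dx, dy) lies within distance d of the origin iff
// dx*dx + dy*dy < d*d (for d > 0); no abs or sqrt needed.
fn is_close_to_origin(dx: f64, dy: f64, d: f64) -> bool {
    dx * dx + dy * dy < d * d
}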

I was referring to loading 4 entire complex numbers at once and then using the scalar algorithm, just rewritten to use SIMD operations.

This is what would give you the speedup, as you're operating at just under scalar speed while processing 4 items at once.

Hm, I didn't think about that. I should try this approach, thank you for the suggestion!

UPD: I found that you used the wrong formula for the distance calculation in the complexAbs function:

let ratio = real / imag;
return real * Math.sqrt(1 + ratio * ratio);

The square-root result must be multiplied by imag, not real. Also, since imag can be negative, we need to take its absolute value to match the original hypotenuse formula. The correct formula is:

let ratio = real / imag;
return Math.abs(imag) * Math.sqrt(1 + ratio * ratio);

@alordash (Owner) commented Mar 30, 2022

So I optimized the JS code and now it has the same speed as wasm-scalar in Chrome; I'm honestly surprised. Though in Firefox, JS is still ~3 times slower than wasm.
Test setup: 1248x734px canvas, 3 roots, 1 thread, 11 iterations, using the test-animation script.
Optimizations: PR #6.
Guess it's time to optimize the wasm implementation.

@dead-claudia (Author)

@alordash

The square-root result must be multiplied by imag, not real. Also, since imag can be negative, we need to take its absolute value to match the original hypotenuse formula. The correct formula is: [snip]

Good catch, and that's probably where the glitchiness came from. Been a while since I've done numerical processing, can't you tell? 😅

@dead-claudia (Author)

So I optimized the JS code and now it has the same speed as wasm-scalar in Chrome; I'm honestly surprised. Though in Firefox, JS is still ~3 times slower than wasm.

Worth noting you may still derive some speed benefit from WebAssembly simply from using smaller integers. But yeah, JS after JIT compilation is usually reasonably close to native. (Firefox is probably just not inlining something it should - that's my guess.)

@alordash (Owner) commented Apr 1, 2022

@dead-claudia I tried to rewrite the wasm-simd algorithm to make it load 4 complex numbers as you said. But it hasn't improved performance: instead, fps dropped from 21 to 15. You can check the new code here.

I think this has to do with the fact that when you process 4 complex numbers with one root, you still need to do the same number of calculations per cycle as when processing 2 complex numbers with one root, because you need two vectors instead of one to store the complex numbers. This means that, in theory, there shouldn't be a speed gain. My initial algorithm, by contrast, processed one complex number against two roots.

The loss of speed, I think, is due to the inability to exit the approximation loop prematurely. Just compare this (old) to this (new).
To be honest, I doubt that I made a mistake in the calculation formula: the fractal still renders correctly, and I checked the scalar and new SIMD implementations side by side; they're fundamentally identical.

So, in my opinion, it's impossible to exactly replicate the scalar implementation using SIMD, or at least I don't know how to recreate this neat if condition that returns from the function prematurely.

@dead-claudia (Author) commented Apr 4, 2022

The loss of speed, I think, is due to the inability to exit the approximation loop prematurely.

Yeah, I was imagining a condition check there, using essentially this algorithm (as mentioned earlier):

// Sets all bits of each matched lane
let cmp = f32x4_lt(square_norm, delta);
// Get the high bit of each lane (1 if
// `square_norm < delta`, 0 otherwise)
let bits = i32x4_bitmask(cmp);
if bits != 0 {
    // Index of the first set lane = find first set - 1 =
    // number of trailing zeros.
    // Add it to the current index and you have the
    // actual index.
    let index = current_index + bits.trailing_zeros();
    return index;
}

Based on your prior numbers, this might bring it roughly equal. (It's surprising, but it seems I didn't deconstruct the algorithm enough to confirm the number of operations per instruction first.)
