use SIMD.jl for x86 and naive_findmin for :aarch64
#151
|
After fixing that, on Apple M4, I get: this looks not bad??!? @graeme-a-stewart Although, there's an unfortunate degradation on 1.12 |
Co-authored-by: Jerry Ling <[email protected]>
Codecov Report
✅ All modified and coverable lines are covered by tests.
@@ Coverage Diff @@
## main #151 +/- ##
==========================================
+ Coverage 80.00% 80.46% +0.46%
==========================================
Files 21 21
Lines 1315 1341 +26
==========================================
+ Hits 1052 1079 +27
+ Misses 263 262 -1
|
|
Nvidia Grace; AMD Ryzen 7 5700G (benchmark figures not preserved) |
yeah ok so we use the SIMD.jl for x86 and this PR for ARM? |
Title changed: naive_findmin → naive_findmin for :aarch64
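The split discussed here (a vectorised path on x86, a plain scalar scan on :aarch64) could look roughly like the sketch below. It is an illustration only, not the code from this PR: `simd_findmin` is a hypothetical stand-in that just calls Base `findmin`, and the `(dij, n)` argument convention is an assumption.

```julia
# Hypothetical sketch, not the implementation from this PR.

# Scalar scan over the first `n` entries of `dij` (requires n >= 1).
# Returns (minimum value, index); ties resolve to the lowest index.
function naive_findmin(dij::AbstractVector, n::Integer)
    best_val = dij[1]
    best_idx = 1
    @inbounds for i in 2:n
        v = dij[i]
        if v < best_val
            best_val = v
            best_idx = i
        end
    end
    return best_val, best_idx
end

# Placeholder for a vectorised implementation (e.g. one built on SIMD.jl).
# Here it simply defers to Base.findmin so that the sketch runs as-is.
simd_findmin(dij::AbstractVector, n::Integer) = findmin(view(dij, 1:n))

# Choose the implementation once, at load time, from the host architecture.
# Sys.ARCH is a Symbol such as :x86_64 or :aarch64.
const fast_findmin = Sys.ARCH === :aarch64 ? naive_findmin : simd_findmin

dij = [3.0, 1.5, 4.0, 1.5, 0.2]
println(fast_findmin(dij, 4))
```

Binding the choice to a `const` keeps the call site type-stable, so the architecture check costs nothing inside the hot loop.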
|
what happened to CI lol... |
|
For completeness, with SIMD for x86_64 AMD Ryzen 7 5700G |
|
I will test against this PR later |
|
Would be good to run CI on aarch64 (Linux and/or macOS), no? |
so this is same as nightly, and slower than Julia 1.11 |
|
I took a look at performance on my M2 Pro (where the original patch in #83 was sucky). For 1.11.4 I got:
~/.julia/dev/JetReconstruction/examples/ [main] ./benchmark.sh
pp 14TeV Tiled
Processed 100 events 16 times
Average time per event 171.11460874999997 ± 4.616430628080113 μs
Lowest time per event 165.90792000000002 μs
Processed 100 events 16 times
Average time per event 674.727474375 ± 18.35030922586858 μs
Lowest time per event 656.97167 μs
ee H Durham
Processed 100 events 16 times
Average time per event 28.442970625 ± 0.30894069468529917 μs
Lowest time per event 27.61167 μs
and with this patch:
~/.julia/dev/JetReconstruction/examples/ [naive_findmin] ./benchmark.sh 256
pp 14TeV Tiled
Processed 100 events 16 times
Average time per event 175.676691875 ± 12.098880482560652 μs
Lowest time per event 167.775 μs
pp 14 TeV Plain
Processed 100 events 16 times
Average time per event 671.201953125 ± 5.289937381395405 μs
Lowest time per event 662.80833 μs
ee H Durham
Processed 100 events 16 times
Average time per event 28.853904375 ± 0.5296152717583557 μs
Lowest time per event 27.69208 μs
Performance loss is marginal - we can live with it. I also looked at 1.12.0-beta3:
~/.julia/dev/JetReconstruction/examples/ [main] ./benchmark.sh
pp 14TeV Tiled
Processed 100 events 16 times
Average time per event 186.92546749999997 ± 10.934245123593346 μs
Lowest time per event 175.50416 μs
pp 14 TeV Plain
Processed 100 events 16 times
Average time per event 684.5363275000001 ± 11.514369132124589 μs
Lowest time per event 674.45708 μs
ee H Durham
Processed 100 events 16 times
Average time per event 29.18406375 ± 1.3979762370840427 μs
Lowest time per event 28.45625 μs
~/.julia/dev/JetReconstruction/examples/ [naive_findmin] ./benchmark.sh
pp 14TeV Tiled
Processed 100 events 16 times
Average time per event 189.20445375 ± 8.512180852002729 μs
Lowest time per event 177.52208 μs
pp 14 TeV Plain
Processed 100 events 16 times
Average time per event 683.996173125 ± 9.641498982006677 μs
Lowest time per event 674.17833 μs
ee H Durham
Processed 100 events 16 times
Average time per event 30.5012775 ± 0.44837712099670635 μs
Lowest time per event 29.78125 μs
So the performance loss with 1.12 is more significant and we should report that. |
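To put numbers on "marginal" versus "more significant", the lowest times quoted above can be turned into relative changes. A minimal sketch: the figures are copied from the benchmark output, and the unlabelled second block of the first run is assumed to be pp 14 TeV Plain (its header is missing in the quoted output).

```julia
# Lowest time per event in μs, copied from the benchmark output above.
# Tuple order: (pp 14 TeV Tiled, pp 14 TeV Plain, ee H Durham).
# Assumption: the unlabelled second block of the first run is pp 14 TeV Plain.
const LOWEST = Dict(
    ("1.11.4", "main")                => (165.90792, 656.97167, 27.61167),
    ("1.11.4", "naive_findmin")       => (167.775, 662.80833, 27.69208),
    ("1.12.0-beta3", "main")          => (175.50416, 674.45708, 28.45625),
    ("1.12.0-beta3", "naive_findmin") => (177.52208, 674.17833, 29.78125),
)

# Percentage change of `new` relative to `old`; positive means slower.
pct_change(old, new) = 100 * (new - old) / old

# Patch versus main, for each Julia version.
for version in ("1.11.4", "1.12.0-beta3")
    base = LOWEST[(version, "main")]
    patched = LOWEST[(version, "naive_findmin")]
    println(version, " patch vs main (%): ",
            round.(pct_change.(base, patched); digits = 2))
end

# Julia 1.12.0-beta3 versus 1.11.4, both on main.
println("1.12.0-beta3 vs 1.11.4 on main (%): ",
        round.(pct_change.(LOWEST[("1.11.4", "main")],
                           LOWEST[("1.12.0-beta3", "main")]); digits = 2))
```

With these inputs the patch costs around 1% or less on 1.11.4, while the ee H Durham case on 1.12.0-beta3 comes out near 4.7% slower with the patch. These are single runs on one machine, so small differences sit within the quoted spread.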
I can't think of anything specific that would cause this. In fact, I think the micro benchmark for [...] Another piece of supporting evidence is that even with LV.jl, 1.12 is slower, and LV.jl should be independent of Julia versions (I don't believe LLVM regressed) |
|
FYI I also want to merge the immutable structs patch, #146, before considering merging this one |
|
Just to note, LoopVectorization seems to have been broken with Julia nightly for 2 weeks or so
Maybe the main work happens in the dependencies, but since the announcement they had 9 commits, of which only 1 wasn't documentation, CI or bumping compat. VectorizationBase still has this: https://github.com/JuliaSIMD/VectorizationBase.jl/blob/129a0e533202ee3bf1b60a925f74ad153e8bcdd7/README.md?plain=1#L12 |
|
Maybe we should get a collecting can and contribute to point 1... 💰 ? |
|
I was experimenting with Reactant.jl and from my understanding it's currently not useful for us as it requires separate compilation for each array size |
this is correct |
|
As LoopVectorization lives on, let's close this PR, but keep the branch, should it be needed later. |
|
Branches of closed PRs can always be restored at any point. |
|
There was another problem in the context of static compilation: LoopVectorization prevents trimming and is also super slow with PackageCompiler
|
Noted! But right now LoopVectorization is still the best solution for the Julia repository. I don't care much about PackageCompiler, but understanding / fixing the trimming issue would be nice. |
Just a test if I understand how to replace fast_findmin implementation as in #83