use SIMD.jl for x86 and `naive_findmin` for `:aarch64` #151

m-fila · 2025-05-14T15:36:49Z

Just a test if I understand how to replace fast_findmin implementation as in #83

src/Utils.jl

Moelf · 2025-05-14T17:09:10Z

after fixing that, on Apple M4, I get:

julia +1.12 --project instrumented-jetreco.jl --algorithm=AntiKt -R 0.4 ../test/data/events.pp13TeV.hepmc3.gz -m 32


# main
Processed 100 events 32 times
Average time per event 135.5111340625 ± 2.9835167009602035 μs
Lowest time per event 131.43084000000002 μs

# this PR
Processed 100 events 32 times
Average time per event 138.62045531249998 ± 3.5853189437051123 μs
Lowest time per event 134.35666 μs

# main with 1.12-beta3
Processed 100 events 32 times
Average time per event 147.96477937499995 ± 4.669684823430356 μs
Lowest time per event 139.74125 μs

# this PR with 1.12-beta3
Processed 100 events 32 times
Average time per event 152.88778656250003 ± 5.847938860407047 μs
Lowest time per event 142.36916000000002 μs

this looks not bad??!? @graeme-a-stewart Although, there's an unfortuante degredation on 1.12

Co-authored-by: Jerry Ling <[email protected]>

codecov · 2025-05-14T17:45:43Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 80.46%. Comparing base (d3c97f3) to head (bfbae69).
⚠️ Report is 3 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #151      +/-   ##
==========================================
+ Coverage   80.00%   80.46%   +0.46%     
==========================================
  Files          21       21              
  Lines        1315     1341      +26     
==========================================
+ Hits         1052     1079      +27     
+ Misses        263      262       -1

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

m-fila · 2025-05-14T18:59:11Z

Nvidia Grace

julia +1.12 --project instrumented-jetreco.jl --algorithm=AntiKt -R 0.4 ../test/data/events.pp13TeV.hepmc3.gz -m 32

# main
Processed 100 events 32 times
Average time per event 299.6035559375 ± 7.914782725959611 μs
Lowest time per event 280.66586 μs

# this PR
Processed 100 events 32 times
Average time per event 303.1709484375001 ± 8.819781288405308 μs
Lowest time per event 286.57968 μs

# main with 1.12-beta3
Processed 100 events 32 times
Average time per event 304.46982093750006 ± 13.922266876931305 μs
Lowest time per event 281.39805 μs

# this PR with 1.12-beta3
Processed 100 events 32 times
Average time per event 308.6548684375 ± 11.580493168517764 μs
Lowest time per event 288.65558 μs

AMD Ryzen 7 5700G

# main
Processed 100 events 32 times
Average time per event 188.28086562500002 ± 5.529621156269813 μs
Lowest time per event 181.3509 μs

# this PR
Processed 100 events 32 times
Average time per event 197.2376540625 ± 6.685658460456483 μs
Lowest time per event 186.34546999999998 μs

# main with 1.12-beta3
Processed 100 events 32 times
Average time per event 193.79122999999998 ± 11.42792473599183 μs
Lowest time per event 176.08296 μs

# this PR with 1.12-beta3
Processed 100 events 32 times
Average time per event 205.20970093750003 ± 9.736581454767695 μs
Lowest time per event 187.33396000000002 μs

Moelf · 2025-05-14T19:05:17Z

AMD Ryzen 7 5700G

yeah ok so we use the SIMD.jl for x86 and this PR for ARM?

Moelf · 2025-05-14T19:13:00Z

what happened to CI lol...

m-fila · 2025-05-14T19:33:12Z

For completeness, with SIMD for x86_64

AMD Ryzen 7 5700G

# main
Processed 100 events 32 times
Average time per event 187.38033500000006 ± 5.74337091484025 μs
Lowest time per event 177.03909 μs

# this PR
Processed 100 events 32 times
Average time per event 191.52783343750005 ± 6.355636113548961 μs
Lowest time per event 179.90162 μs

# main with 1.12-beta3
Processed 100 events 32 times
Average time per event 195.5927890625 ± 12.11654841840912 μs
Lowest time per event 178.41858 μs

# this PR with 1.12-beta3
Processed 100 events 32 times
Average time per event 194.36534874999998 ± 10.480610090304218 μs
Lowest time per event 176.65239000000003 μs

Moelf · 2025-05-15T05:24:31Z

JuliaLang/julia#58418

I will test against this PR later

giordano · 2025-05-15T10:54:21Z

Would be good to run CI on aarch64 (Linux and/or macOS), no?

Moelf · 2025-05-15T14:26:30Z

julia #58418 on M4
Processed 100 events 32 times
Average time per event 151.89809874999995 ± 5.095528318872066 μs
Lowest time per event 143.37917 μs

so this is same as nightly, and slower than Julia 1.11

graeme-a-stewart · 2025-05-19T13:07:03Z

I took a look at performance on my M2 Pro (where the original patch in #83 was sucky). For 1.11.4 I got:

 ~/.julia/dev/JetReconstruction/examples/ [main] ./benchmark.sh
pp 14TeV Tiled
Processed 100 events 16 times
Average time per event 171.11460874999997 ± 4.616430628080113 μs
Lowest time per event 165.90792000000002 μs
Processed 100 events 16 times
Average time per event 674.727474375 ± 18.35030922586858 μs
Lowest time per event 656.97167 μs
ee H Durham
Processed 100 events 16 times
Average time per event 28.442970625 ± 0.30894069468529917 μs
Lowest time per event 27.61167 μs

and with this patch:

 ~/.julia/dev/JetReconstruction/examples/ [naive_findmin] ./benchmark.sh 256
pp 14TeV Tiled
Processed 100 events 16 times
Average time per event 175.676691875 ± 12.098880482560652 μs
Lowest time per event 167.775 μs
pp 14 TeV Plain
Processed 100 events 16 times
Average time per event 671.201953125 ± 5.289937381395405 μs
Lowest time per event 662.80833 μs
ee H Durham
Processed 100 events 16 times
Average time per event 28.853904375 ± 0.5296152717583557 μs
Lowest time per event 27.69208 μs

Performance loss is marginal - we can live with it.

I also looked at 1.12.0-beta3:

 ~/.julia/dev/JetReconstruction/examples/ [main] ./benchmark.sh
pp 14TeV Tiled
Processed 100 events 16 times
Average time per event 186.92546749999997 ± 10.934245123593346 μs
Lowest time per event 175.50416 μs
pp 14 TeV Plain
Processed 100 events 16 times
Average time per event 684.5363275000001 ± 11.514369132124589 μs
Lowest time per event 674.45708 μs
ee H Durham
Processed 100 events 16 times
Average time per event 29.18406375 ± 1.3979762370840427 μs
Lowest time per event 28.45625 μs

 ~/.julia/dev/JetReconstruction/examples/ [naive_findmin] ./benchmark.sh
pp 14TeV Tiled
Processed 100 events 16 times
Average time per event 189.20445375 ± 8.512180852002729 μs
Lowest time per event 177.52208 μs
pp 14 TeV Plain
Processed 100 events 16 times
Average time per event 683.996173125 ± 9.641498982006677 μs
Lowest time per event 674.17833 μs
ee H Durham
Processed 100 events 16 times
Average time per event 30.5012775 ± 0.44837712099670635 μs
Lowest time per event 29.78125 μs

So the performance loss with 1.12 is more significant and we should report that.

Moelf · 2025-05-19T14:26:24Z

So the performance loss with 1.12 is more significant and we should report that.

I can't think of anything specific that would cuase this. In fact, I think the micro benchmark for findmin() shows that 1.12 and nightly are better than 1.11.5, so I suspect there may be regression in somewhere else.

Another supporting evidence is that even with LV.jl, 1.12 is slower, and LV.jl should be independent of Julia versions (I don't believe LLVM regressed )

graeme-a-stewart · 2025-05-21T09:57:56Z

FYI I also want to merge the immutable structs patch, #146, before considering merging this one

m-fila · 2025-08-27T07:59:04Z

Just to note, LoopVectorization seems to be broken with Julia nightly for 2 weeks or so

Moelf · 2025-08-27T14:16:30Z

https://github.com/JuliaSIMD/LoopVectorization.jl/blob/945c2f7b92df094cafe91ec29d56ee2fc29c5726/README.md?plain=1#L17-L19

let's see of this pan out?

m-fila · 2025-08-27T14:27:01Z

https://github.com/JuliaSIMD/LoopVectorization.jl/blob/945c2f7b92df094cafe91ec29d56ee2fc29c5726/README.md?plain=1#L17-L19

let's see of this pan out?

Maybe the main work happens in the dependencies but since the announcement they had 9 commits out of which only 1 wasn't documentation, CI or bumping compat

VectorizationBase still has this https://github.com/JuliaSIMD/VectorizationBase.jl/blob/129a0e533202ee3bf1b60a925f74ad153e8bcdd7/README.md?plain=1#L12

graeme-a-stewart · 2025-08-27T14:31:52Z

Maybe we should get a collecting can and contribute to point 1... 💰 ?

m-fila · 2025-08-27T18:11:10Z

I was experimenting with Reactant.jl and from my understanding it's currently not useful for us as it requires separate compilation for each array size

Moelf · 2025-08-27T20:16:57Z

as it requires separate compilation for each array size

this is correct

graeme-a-stewart · 2025-10-17T13:36:31Z

As LoopVectorisation lives on, let's close this PR, but keep the branch, should it be needed later.

giordano · 2025-10-17T13:46:36Z

Branches of closed PRs can always be restored at any point.

m-fila · 2025-10-17T15:57:22Z

There was another problem in context of static compilation since LoopVectorization prevents trimming and also is super slow with PackageCompiler

graeme-a-stewart · 2025-10-20T13:16:59Z

Noted! But right now LoopVectorisation is still the best solution for the Julia repository. I don't care much about PackageCompiler, but understanding / fixing the trimming issue would be nice.

using naive_findmin implementation

353f023

m-fila mentioned this pull request May 14, 2025

Test removal of LoopVectorization.jl #83

Closed

Moelf reviewed May 14, 2025

View reviewed changes

src/Utils.jl Show resolved Hide resolved

slice dij

6a6a6c8

Co-authored-by: Jerry Ling <[email protected]>

Moelf mentioned this pull request May 14, 2025

use SIMD.jl directly instead of LV.jl for fast_findmin() #84

Closed

Moelf changed the title ~~use naive_findmin~~ use SIMD.jl for x86 and naive_findmin for :aarch64 May 14, 2025

Moelf marked this pull request as ready for review May 14, 2025 19:12

diff implementation on :aarch64

30e237f

m-fila force-pushed the naive_findmin branch from 6eeb2f5 to 30e237f Compare May 14, 2025 19:19

Moelf added 2 commits May 14, 2025 16:56

cleanup(?) code

c9fe359

fix off by one

dd97a88

m-fila mentioned this pull request May 15, 2025

run ci with arm #152

Merged

format

8df5c2d

Moelf requested a review from graeme-a-stewart May 16, 2025 13:18

Merge branch 'main' into naive_findmin

18be031

graeme-a-stewart marked this pull request as draft May 21, 2025 09:58

Moelf added 2 commits May 21, 2025 10:26

Merge branch 'main' into naive_findmin

c3a5e08

Merge branch 'main' into naive_findmin

70cad13

Merge branch 'main' into naive_findmin

36704d9

Moelf marked this pull request as ready for review June 18, 2025 01:06

graeme-a-stewart mentioned this pull request Jun 19, 2025

Jet Flavour Extension #164

Closed

m-fila mentioned this pull request Jun 19, 2025

AirspeedVelocity.jl for benchmark in CI #162

Open

Moelf added 2 commits June 27, 2025 11:07

Merge branch 'main' into naive_findmin

0cb6ff5

Merge branch 'main' into naive_findmin

0fc6ccf

Merge branch 'main' into naive_findmin

bfbae69

graeme-a-stewart mentioned this pull request Aug 28, 2025

Valencia jets 🍊 #202

Open

graeme-a-stewart closed this Oct 17, 2025

use SIMD.jl for x86 and naive_findmin for :aarch64 #151

use SIMD.jl for x86 and naive_findmin for :aarch64 #151

Uh oh!

Conversation

m-fila commented May 14, 2025

Uh oh!

Uh oh!

Moelf commented May 14, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

codecov bot commented May 14, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

m-fila commented May 14, 2025

Uh oh!

Moelf commented May 14, 2025

Uh oh!

Moelf commented May 14, 2025

Uh oh!

m-fila commented May 14, 2025

Uh oh!

Moelf commented May 15, 2025

Uh oh!

giordano commented May 15, 2025

Uh oh!

Moelf commented May 15, 2025

Uh oh!

graeme-a-stewart commented May 19, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Moelf commented May 19, 2025

Uh oh!

graeme-a-stewart commented May 21, 2025

Uh oh!

m-fila commented Aug 27, 2025

Uh oh!

Moelf commented Aug 27, 2025

Uh oh!

m-fila commented Aug 27, 2025

Uh oh!

graeme-a-stewart commented Aug 27, 2025

Uh oh!

m-fila commented Aug 27, 2025

Uh oh!

Moelf commented Aug 27, 2025

Uh oh!

graeme-a-stewart commented Oct 17, 2025

Uh oh!

giordano commented Oct 17, 2025

Uh oh!

m-fila commented Oct 17, 2025

Uh oh!

graeme-a-stewart commented Oct 20, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

use SIMD.jl for x86 and `naive_findmin` for `:aarch64` #151

use SIMD.jl for x86 and `naive_findmin` for `:aarch64` #151

Moelf commented May 14, 2025 •

edited

Loading

codecov bot commented May 14, 2025 •

edited

Loading

graeme-a-stewart commented May 19, 2025 •

edited

Loading