Skip to content

Conversation

@m-fila
Copy link
Member

@m-fila m-fila commented May 14, 2025

Just a test if I understand how to replace fast_findmin implementation as in #83

@Moelf
Copy link
Member

Moelf commented May 14, 2025

after fixing that, on Apple M4, I get:

julia +1.12 --project instrumented-jetreco.jl --algorithm=AntiKt -R 0.4 ../test/data/events.pp13TeV.hepmc3.gz -m 32


# main
Processed 100 events 32 times
Average time per event 135.5111340625 ± 2.9835167009602035 μs
Lowest time per event 131.43084000000002 μs

# this PR
Processed 100 events 32 times
Average time per event 138.62045531249998 ± 3.5853189437051123 μs
Lowest time per event 134.35666 μs

# main with 1.12-beta3
Processed 100 events 32 times
Average time per event 147.96477937499995 ± 4.669684823430356 μs
Lowest time per event 139.74125 μs

# this PR with 1.12-beta3
Processed 100 events 32 times
Average time per event 152.88778656250003 ± 5.847938860407047 μs
Lowest time per event 142.36916000000002 μs

this looks not bad??!? @graeme-a-stewart Although, there's an unfortuante degredation on 1.12

Co-authored-by: Jerry Ling <[email protected]>
@codecov
Copy link

codecov bot commented May 14, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 80.46%. Comparing base (d3c97f3) to head (bfbae69).
⚠️ Report is 3 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #151      +/-   ##
==========================================
+ Coverage   80.00%   80.46%   +0.46%     
==========================================
  Files          21       21              
  Lines        1315     1341      +26     
==========================================
+ Hits         1052     1079      +27     
+ Misses        263      262       -1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@m-fila
Copy link
Member Author

m-fila commented May 14, 2025

Nvidia Grace

julia +1.12 --project instrumented-jetreco.jl --algorithm=AntiKt -R 0.4 ../test/data/events.pp13TeV.hepmc3.gz -m 32

# main
Processed 100 events 32 times
Average time per event 299.6035559375 ± 7.914782725959611 μs
Lowest time per event 280.66586 μs

# this PR
Processed 100 events 32 times
Average time per event 303.1709484375001 ± 8.819781288405308 μs
Lowest time per event 286.57968 μs

# main with 1.12-beta3
Processed 100 events 32 times
Average time per event 304.46982093750006 ± 13.922266876931305 μs
Lowest time per event 281.39805 μs

# this PR with 1.12-beta3
Processed 100 events 32 times
Average time per event 308.6548684375 ± 11.580493168517764 μs
Lowest time per event 288.65558 μs

AMD Ryzen 7 5700G

# main
Processed 100 events 32 times
Average time per event 188.28086562500002 ± 5.529621156269813 μs
Lowest time per event 181.3509 μs

# this PR
Processed 100 events 32 times
Average time per event 197.2376540625 ± 6.685658460456483 μs
Lowest time per event 186.34546999999998 μs

# main with 1.12-beta3
Processed 100 events 32 times
Average time per event 193.79122999999998 ± 11.42792473599183 μs
Lowest time per event 176.08296 μs

# this PR with 1.12-beta3
Processed 100 events 32 times
Average time per event 205.20970093750003 ± 9.736581454767695 μs
Lowest time per event 187.33396000000002 μs

@Moelf
Copy link
Member

Moelf commented May 14, 2025

AMD Ryzen 7 5700G

yeah ok so we use the SIMD.jl for x86 and this PR for ARM?

@Moelf Moelf changed the title use naive_findmin use SIMD.jl for x86 and naive_findmin for :aarch64 May 14, 2025
@Moelf Moelf marked this pull request as ready for review May 14, 2025 19:12
@Moelf
Copy link
Member

Moelf commented May 14, 2025

what happened to CI lol...

@m-fila
Copy link
Member Author

m-fila commented May 14, 2025

For completeness, with SIMD for x86_64

AMD Ryzen 7 5700G

# main
Processed 100 events 32 times
Average time per event 187.38033500000006 ± 5.74337091484025 μs
Lowest time per event 177.03909 μs

# this PR
Processed 100 events 32 times
Average time per event 191.52783343750005 ± 6.355636113548961 μs
Lowest time per event 179.90162 μs

# main with 1.12-beta3
Processed 100 events 32 times
Average time per event 195.5927890625 ± 12.11654841840912 μs
Lowest time per event 178.41858 μs

# this PR with 1.12-beta3
Processed 100 events 32 times
Average time per event 194.36534874999998 ± 10.480610090304218 μs
Lowest time per event 176.65239000000003 μs

@Moelf
Copy link
Member

Moelf commented May 15, 2025

JuliaLang/julia#58418

I will test against this PR later

@giordano
Copy link
Member

Would be good to run CI on aarch64 (Linux and/or macOS), no?

@m-fila m-fila mentioned this pull request May 15, 2025
@Moelf
Copy link
Member

Moelf commented May 15, 2025

julia #58418 on M4
Processed 100 events 32 times
Average time per event 151.89809874999995 ± 5.095528318872066 μs
Lowest time per event 143.37917 μs

so this is same as nightly, and slower than Julia 1.11

@Moelf Moelf requested a review from graeme-a-stewart May 16, 2025 13:18
@graeme-a-stewart
Copy link
Member

graeme-a-stewart commented May 19, 2025

I took a look at performance on my M2 Pro (where the original patch in #83 was sucky). For 1.11.4 I got:

~/.julia/dev/JetReconstruction/examples/ [main] ./benchmark.sh
pp 14TeV Tiled
Processed 100 events 16 times
Average time per event 171.11460874999997 ± 4.616430628080113 μs
Lowest time per event 165.90792000000002 μs
Processed 100 events 16 times
Average time per event 674.727474375 ± 18.35030922586858 μs
Lowest time per event 656.97167 μs
ee H Durham
Processed 100 events 16 times
Average time per event 28.442970625 ± 0.30894069468529917 μs
Lowest time per event 27.61167 μs

and with this patch:

~/.julia/dev/JetReconstruction/examples/ [naive_findmin] ./benchmark.sh 256
pp 14TeV Tiled
Processed 100 events 16 times
Average time per event 175.676691875 ± 12.098880482560652 μs
Lowest time per event 167.775 μs
pp 14 TeV Plain
Processed 100 events 16 times
Average time per event 671.201953125 ± 5.289937381395405 μs
Lowest time per event 662.80833 μs
ee H Durham
Processed 100 events 16 times
Average time per event 28.853904375 ± 0.5296152717583557 μs
Lowest time per event 27.69208 μs

Performance loss is marginal - we can live with it.

I also looked at 1.12.0-beta3:

~/.julia/dev/JetReconstruction/examples/ [main] ./benchmark.sh
pp 14TeV Tiled
Processed 100 events 16 times
Average time per event 186.92546749999997 ± 10.934245123593346 μs
Lowest time per event 175.50416 μs
pp 14 TeV Plain
Processed 100 events 16 times
Average time per event 684.5363275000001 ± 11.514369132124589 μs
Lowest time per event 674.45708 μs
ee H Durham
Processed 100 events 16 times
Average time per event 29.18406375 ± 1.3979762370840427 μs
Lowest time per event 28.45625 μs
~/.julia/dev/JetReconstruction/examples/ [naive_findmin] ./benchmark.sh
pp 14TeV Tiled
Processed 100 events 16 times
Average time per event 189.20445375 ± 8.512180852002729 μs
Lowest time per event 177.52208 μs
pp 14 TeV Plain
Processed 100 events 16 times
Average time per event 683.996173125 ± 9.641498982006677 μs
Lowest time per event 674.17833 μs
ee H Durham
Processed 100 events 16 times
Average time per event 30.5012775 ± 0.44837712099670635 μs
Lowest time per event 29.78125 μs

So the performance loss with 1.12 is more significant and we should report that.

@Moelf
Copy link
Member

Moelf commented May 19, 2025

So the performance loss with 1.12 is more significant and we should report that.

I can't think of anything specific that would cuase this. In fact, I think the micro benchmark for findmin() shows that 1.12 and nightly are better than 1.11.5, so I suspect there may be regression in somewhere else.

Another supporting evidence is that even with LV.jl, 1.12 is slower, and LV.jl should be independent of Julia versions (I don't believe LLVM regressed )

@graeme-a-stewart
Copy link
Member

FYI I also want to merge the immutable structs patch, #146, before considering merging this one

@graeme-a-stewart graeme-a-stewart marked this pull request as draft May 21, 2025 09:58
@Moelf Moelf marked this pull request as ready for review June 18, 2025 01:06
@m-fila
Copy link
Member Author

m-fila commented Aug 27, 2025

Just to note, LoopVectorization seems to be broken with Julia nightly for 2 weeks or so

@Moelf
Copy link
Member

Moelf commented Aug 27, 2025

@m-fila
Copy link
Member Author

m-fila commented Aug 27, 2025

https://github.com/JuliaSIMD/LoopVectorization.jl/blob/945c2f7b92df094cafe91ec29d56ee2fc29c5726/README.md?plain=1#L17-L19

let's see of this pan out?

Maybe the main work happens in the dependencies but since the announcement they had 9 commits out of which only 1 wasn't documentation, CI or bumping compat

VectorizationBase still has this https://github.com/JuliaSIMD/VectorizationBase.jl/blob/129a0e533202ee3bf1b60a925f74ad153e8bcdd7/README.md?plain=1#L12

@graeme-a-stewart
Copy link
Member

Maybe we should get a collecting can and contribute to point 1... 💰 ?

@m-fila
Copy link
Member Author

m-fila commented Aug 27, 2025

I was experimenting with Reactant.jl and from my understanding it's currently not useful for us as it requires separate compilation for each array size

@Moelf
Copy link
Member

Moelf commented Aug 27, 2025

as it requires separate compilation for each array size

this is correct

@graeme-a-stewart
Copy link
Member

As LoopVectorisation lives on, let's close this PR, but keep the branch, should it be needed later.

@giordano
Copy link
Member

Branches of closed PRs can always be restored at any point.

@m-fila
Copy link
Member Author

m-fila commented Oct 17, 2025

There was another problem in context of static compilation since LoopVectorization prevents trimming and also is super slow with PackageCompiler

@graeme-a-stewart
Copy link
Member

Noted! But right now LoopVectorisation is still the best solution for the Julia repository. I don't care much about PackageCompiler, but understanding / fixing the trimming issue would be nice.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants