Small Tags performance improvements #6185

Open: etki wants to merge 1 commit into main from tags-performance-boost

@etki etki commented Apr 26, 2025

This is the tiniest performance bump, born out of the same story as #6113. There are a few changes that each bring only an insignificant gain (single-digit percent), so it's completely OK to say "we'd want to keep things as we have here" (I'm only having fun and getting some extra experience here). It's easier to annotate specific code parts, so you will find individual comments below.

Performance was evaluated with the benchmarks from #6174 on OpenJDK 21.0.2 and an Intel N100 fixed at 2 GHz. Since different subtypes of Iterable are processed differently, the benchmarks take a parameter matrix with different distributions of these implementations. When the category is absent, all implementations are uniformly distributed; when the category is X, 90% of the input is X and the rest is evenly divided between the other implementations. Every instance except null contains 0-68 tags (the count is selected uniformly). Please also note that set uses HashSet, which should be slower than the JDK 9+ Set.of(). I don't know which of the two (Set.of() or HashSet) real clients use more.
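
The input distribution described above could be sketched roughly as follows. This is only an illustration and not the code from #6174: the class name, the `Kind` enum, and both method names are hypothetical, and treating 68 as an inclusive upper bound is an assumption.

```java
import java.util.Random;

// Hypothetical sketch of the benchmark input distribution; not the code from #6174.
final class SourceDistributionSketch {

    // Assumed names for the five input kinds that appear in the tables.
    enum Kind { NULL, TAGS, LIST, SET, ITERABLE }

    /**
     * Picks an input kind. With no dominant kind ("absent") every kind is
     * equally likely; otherwise the dominant kind is returned with
     * probability 0.9 and the other kinds share the remaining 0.1 evenly.
     */
    static Kind pick(Kind dominant, Random random) {
        Kind[] kinds = Kind.values();
        if (dominant == null) {
            return kinds[random.nextInt(kinds.length)];
        }
        if (random.nextDouble() < 0.9) {
            return dominant;
        }
        // Uniform choice among the remaining kinds: draw an index from the
        // shortened range and skip over the dominant kind's slot.
        int index = random.nextInt(kinds.length - 1);
        if (index >= dominant.ordinal()) {
            index++;
        }
        return kinds[index];
    }

    /** Uniformly selected tag count in 0..68 (inclusive bound is an assumption). */
    static int tagCount(Random random) {
        return random.nextInt(69);
    }
}
```

Calling `pick(Kind.SET, random)` repeatedly should return `SET` about nine times out of ten, which corresponds to the "set" rows in the tables below.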

Concat: common case, any implementation equiprobable, all standard counters
metric version score
nanoseconds / operation main 3453.615 ± 12.311 ns/op
nanoseconds / operation PR 3332.854 ± 15.092 ns/op
CPI main 0.492 clks/insn
CPI PR 0.496 clks/insn
IPC main 2.034 insns/clk
IPC PR 2.018 insns/clk
L1-dcache-loads:u main 4510.264 #/op
L1-dcache-loads:u PR 4294.050 #/op
L1-dcache-stores:u main 1118.892 #/op
L1-dcache-stores:u PR 1068.037 #/op
L1-icache-load-misses:u main 6.082 #/op
L1-icache-load-misses:u PR 4.619 #/op
L1-icache-loads:u main 3927.357 #/op
L1-icache-loads:u PR 3685.723 #/op
LLC-load-misses:u main 0.004 #/op
LLC-load-misses:u PR 0.004 #/op
LLC-loads:u main 1.251 #/op
LLC-loads:u PR 1.177 #/op
LLC-store-misses:u main 0.031 #/op
LLC-store-misses:u PR 0.037 #/op
LLC-stores:u main 0.548 #/op
LLC-stores:u PR 0.513 #/op
branch-misses:u main 100.710 #/op
branch-misses:u PR 101.320 #/op
branches:u main 2650.911 #/op
branches:u PR 2576.678 #/op
cycles:u main 6825.222 #/op
cycles:u PR 6590.561 #/op
dTLB-load-misses:u main 0.192 #/op
dTLB-load-misses:u PR 0.189 #/op
dTLB-loads:u main 4528.881 #/op
dTLB-loads:u PR 4288.404 #/op
dTLB-store-misses:u main 0.012 #/op
dTLB-store-misses:u PR 0.012 #/op
dTLB-stores:u main 1120.625 #/op
dTLB-stores:u PR 1067.922 #/op
gc.alloc.rate main 195.408 ± 0.699 MB/sec
gc.alloc.rate PR 186.338 ± 0.852 MB/sec
gc.alloc.rate.norm main 707.700 ± 0.005 B/op
gc.alloc.rate.norm PR 651.317 ± 0.004 B/op
gc.count main 843.000 counts
gc.count PR 808.000 counts
gc.time main 442.000 ms
gc.time PR 425.000 ms
iTLB-load-misses:u main 0.819 #/op
iTLB-load-misses:u PR 0.637 #/op
instructions:u main 13883.594 #/op
instructions:u PR 13299.872 #/op
Concat: matrix
source addition version ns/op instructions/op
absent absent main 3453.615 ± 12.311 13883.594
absent absent PR 3332.854 ± 15.092 13299.872
null absent main 1553.365 ± 8.611 6514.379
null absent PR 1503.860 ± 7.085 6321.385
tags absent main 2680.919 ± 15.952 10880.208
tags absent PR 2627.852 ± 12.536 10873.698
list absent main 3257.486 ± 8.496 13152.455
list absent PR 3125.463 ± 14.789 12622.147
set absent main 7297.398 ± 12.989 28640.856
set absent PR 7078.009 ± 6.843 27978.398
iterable absent main 3427.177 ± 18.206 13727.844
iterable absent PR 3313.541 ± 7.130 13313.240
absent null main 1476.249 ± 2.349 5986.755
absent null PR 1438.450 ± 2.499 5885.197
null null main 194.101 ± 1.208 1023.842
null null PR 193.595 ± 1.029 1003.814
tags null main 310.539 ± 1.333 1564.860
tags null PR 294.154 ± 1.905 1498.560
list null main 838.107 ± 1.863 3741.842
list null PR 843.229 ± 2.546 3783.434
set null main 4835.388 ± 4.700 19511.663
set null PR 4741.637 ± 11.517 19240.720
iterable null main 1477.644 ± 3.676 6062.766
iterable null PR 1421.480 ± 3.933 5870.912
absent tags main 2452.261 ± 13.277 9854.579
absent tags PR 2398.422 ± 12.465 9668.906
null tags main 335.398 ± 2.342 1686.879
null tags PR 310.834 ± 2.399 1531.988
tags tags main 1515.396 ± 3.010 6266.138
tags tags PR 1473.454 ± 7.279 5994.906
list tags main 2111.102 ± 6.931 8482.150
list tags PR 2019.852 ± 2.921 8381.749
set tags main 6109.320 ± 13.158 24362.321
set tags PR 5985.968 ± 11.379 23515.979
iterable tags main 2458.731 ± 5.363 9575.156
iterable tags PR 2414.234 ± 11.044 9666.771
absent list main 3033.881 ± 4.737 12102.216
absent list PR 3003.112 ± 14.427 12292.615
null list main 845.153 ± 2.803 3868.300
null list PR 849.731 ± 1.805 3804.974
tags list main 2067.637 ± 2.247 8646.913
tags list PR 2023.014 ± 5.055 8451.190
list list main 2634.236 ± 5.456 10851.092
list list PR 2586.595 ± 5.508 10808.443
set list main 6802.838 ± 10.706 26487.684
set list PR 6557.061 ± 16.576 25838.552
iterable list main 3054.109 ± 7.344 12233.005
iterable list PR 2915.368 ± 8.865 11786.043
absent set main 7288.231 ± 11.302 29111.822
absent set PR 7101.247 ± 16.265 28367.122
null set main 4811.101 ± 8.812 19770.286
null set PR 4657.866 ± 4.688 19259.373
tags set main 6104.217 ± 7.143 24668.652
tags set PR 5942.757 ± 11.124 23852.320
list set main 6679.737 ± 8.977 26837.767
list set PR 6518.054 ± 12.780 26184.015
set set main 10592.943 ± 24.444 42749.874
set set PR 10396.221 ± 19.124 41706.941
iterable set main 7270.077 ± 6.210 29241.466
iterable set PR 7126.497 ± 23.751 28332.141
absent iterable main 3440.212 ± 13.319 13685.000
absent iterable PR 3315.371 ± 10.643 13295.999
null iterable main 1568.332 ± 7.186 6629.074
null iterable PR 1490.758 ± 3.835 6334.845
tags iterable main 2664.903 ± 6.252 10839.812
tags iterable PR 2562.616 ± 15.844 10392.880
list iterable main 3212.028 ± 6.383 13003.759
list iterable PR 3151.855 ± 14.485 12674.469
set iterable main 7238.005 ± 14.185 28831.130
set iterable PR 7120.775 ± 18.257 28047.948
iterable iterable main 3493.186 ± 11.978 13820.076
iterable iterable PR 3303.420 ± 11.901 13352.936
Of: common case, all implementations equiprobable, standard counters
metric version score
nanoseconds / operation main 1252.774 ± 2.228 ns/op
nanoseconds / operation PR 1212.606 ± 3.810 ns/op
CPI main 0.481 clks/insn
CPI PR 0.469 clks/insn
IPC main 2.080 insns/clk
IPC PR 2.134 insns/clk
L1-dcache-loads:u main 1622.461 #/op
L1-dcache-loads:u PR 1609.423 #/op
L1-dcache-stores:u main 412.436 #/op
L1-dcache-stores:u PR 450.941 #/op
L1-icache-load-misses:u main 1.797 #/op
L1-icache-load-misses:u PR 1.488 #/op
L1-icache-loads:u main 1406.455 #/op
L1-icache-loads:u PR 1360.288 #/op
LLC-load-misses:u main 0.001 #/op
LLC-load-misses:u PR 0.001 #/op
LLC-loads:u main 0.154 #/op
LLC-loads:u PR 0.138 #/op
LLC-store-misses:u main 0.013 #/op
LLC-store-misses:u PR 0.012 #/op
LLC-stores:u main 0.169 #/op
LLC-stores:u PR 0.167 #/op
branch-misses:u main 35.651 #/op
branch-misses:u PR 34.963 #/op
branches:u main 998.491 #/op
branches:u PR 965.930 #/op
cycles:u main 2481.143 #/op
cycles:u PR 2398.149 #/op
dTLB-load-misses:u main 0.064 #/op
dTLB-load-misses:u PR 0.060 #/op
dTLB-loads:u main 1624.072 #/op
dTLB-loads:u PR 1610.435 #/op
dTLB-store-misses:u main 0.003 #/op
dTLB-store-misses:u PR 0.003 #/op
dTLB-stores:u main 412.881 #/op
dTLB-stores:u PR 450.939 #/op
gc.alloc.rate main 181.771 ± 0.276 MB/sec
gc.alloc.rate PR 173.941 ± 0.484 MB/sec
gc.alloc.rate.norm main 238.822 ± 0.002 B/op
gc.alloc.rate.norm PR 221.213 ± 0.001 B/op
gc.count main 971.000 counts
gc.count PR 879.000 counts
gc.time main 501.000 ms
gc.time PR 462.000 ms
iTLB-load-misses:u main 0.235 #/op
iTLB-load-misses:u PR 0.225 #/op
instructions:u main 5161.326 #/op
instructions:u PR 5117.102 #/op
Of: distributions

Not sure what's going on with list here. It just needs a more detailed look, but I'm not sure if I'll ever have the time.

source version ns/op instructions/op
absent main 1252.774 ± 2.228 5161.326
absent PR 1212.606 ± 3.810 5117.102
null main 73.245 ± 0.453 521.557
null PR 70.854 ± 0.376 504.099
tags main 88.168 ± 0.343 619.457
tags PR 69.111 ± 0.371 504.862
list main 589.633 ± 1.413 2770.170
list PR 603.960 ± 1.753 2773.680
set main 4578.833 ± 9.501 18547.898
set PR 4451.159 ± 9.031 18022.200
iterable main 1255.498 ± 3.867 5136.213
iterable PR 1226.269 ± 4.025 5058.964
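
To compare rows on one scale, the relative change of PR against main can be computed as (PR - main) / main. The helper below is a hypothetical sketch (the class and method names are not from the PR) and simply applies that formula to ns/op pairs copied from the table above.

```java
import java.util.Locale;

// Hypothetical helper for reading the tables; not part of the PR.
final class RelativeChangeSketch {

    /** Relative change of the PR score against main, in percent; negative means the PR is faster. */
    static double percentChange(double main, double pr) {
        return (pr - main) / main * 100.0;
    }

    /** Formats one table row as "source: +x.x%" with a fixed locale. */
    static String describe(String source, double main, double pr) {
        return String.format(Locale.ROOT, "%s: %+.1f%%", source, percentChange(main, pr));
    }
}
```

For example, `describe("tags", 88.168, 69.111)` gives `tags: -21.6%`, while `describe("list", 589.633, 603.960)` gives `list: +2.4%`, which is the list slowdown mentioned above.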

@etki etki force-pushed the tags-performance-boost branch from a268978 to 1e8788f Compare April 26, 2025 22:48
tags[j++] = tags[i];
current = next;
}
Author

I extracted dedup into a separate method and saw that, in isolation, this change reduces ~1000 ns to ~750 ns on my workstation (not sure how many elements I was feeding in, either 64 or 128). I haven't explicitly tested the same change above, but both can be dropped if you feel this is too much.
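
For context, the pattern in the snippet above (`tags[j++] = tags[i]; current = next;`) looks like an in-place dedup over a sorted array that carries the previously read element in a local variable instead of re-reading it from the array. The sketch below shows that idea on plain strings. It is an assumption about the shape of the change, not the PR's code: the class and method names are hypothetical, and real tags would presumably be compared by key rather than as whole strings.

```java
// Hypothetical sketch of dedup over a sorted array; not the PR's implementation.
final class SortedDedupSketch {

    /**
     * Compacts a sorted array in place and returns the number of unique
     * elements, which end up in sorted[0..result). From each run of equal
     * elements the last one is kept.
     */
    static int dedup(String[] sorted) {
        int n = sorted.length;
        if (n < 2) {
            return n;
        }
        int j = 0; // index of the next free slot for a unique element
        String current = sorted[0];
        for (int i = 0; i < n - 1; i++) {
            String next = sorted[i + 1];
            if (!current.equals(next)) {
                // sorted[i] closes a run of equal elements, so keep it
                sorted[j++] = sorted[i];
            }
            // reuse the element just read instead of loading it again next time
            current = next;
        }
        sorted[j++] = sorted[n - 1]; // the last element always closes a run
        return j;
    }
}
```

Since `j` never exceeds `i`, the writes cannot overwrite an element that has not been read yet.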

return toTags(tagsCollection.toArray(EMPTY_TAG_ARRAY));
}
else if (emptyIterable(tags)) {
Author

emptyIterable was used more widely in a version that never got published, so maybe we want to get rid of it completely. There are some repeated checks; the compiler should be able to pick up the invariants during inlining, but I've never explicitly checked.

return Tags.empty();
}
else if (tags instanceof Tags) {
if (tags instanceof Tags) {
Author

Since EMPTY is an instance of Tags, it will fall into this branch as well. instanceof requires one extra memory lookup though, so idk.

Null is also recognized only at the emptyIterable stage, but this shouldn't bring additional cost: I don't expect the JIT to check for null thrice.

}
if (Objects.equals(this, other)) {
Author

Objects.equals is basically a null-safe version of calling equals directly, and the first thing this code was doing is dereferencing other, which isn't possible in the null case, so the null-safety adds nothing here.
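
As a small illustration of that point, `java.util.Objects.equals(a, b)` is documented to behave like `(a == b) || (a != null && a.equals(b))`. The sketch below spells this out; the class and method names are hypothetical and the code is not taken from the PR.

```java
import java.util.Objects;

// Hypothetical illustration; not code from the PR.
final class NullSafeEqualsSketch {

    /** Hand-written equivalent of the documented Objects.equals behaviour. */
    static boolean nullSafeEquals(Object a, Object b) {
        return (a == b) || (a != null && a.equals(b));
    }

    /** True when the JDK method and the hand-written version agree for the given pair. */
    static boolean agree(Object a, Object b) {
        return Objects.equals(a, b) == nullSafeEquals(a, b);
    }

    /**
     * With a receiver that is known to be non-null (such as `this`), a direct
     * call gives the same answer, assuming equals follows its usual contract
     * of returning false for a null argument.
     */
    static boolean directEquals(Object receiver, Object other) {
        return receiver.equals(other);
    }
}
```

The null checks only matter when the first argument can be null, which is never the case for `this`.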

@etki etki force-pushed the tags-performance-boost branch from 1e8788f to a574645 Compare April 26, 2025 23:09
@etki etki force-pushed the tags-performance-boost branch from a574645 to c28c6a6 Compare April 26, 2025 23:11
@jonatan-ivanov jonatan-ivanov added the performance Issues related to general performance label Apr 28, 2025