changed mul_heuristic for non-float #514
base: master
In most cases - this does replace the heuristic for floats. The only exceptions are for |
You're right; I saw I will investigate heuristic performance for |
The results of the tests:

MMatrix

MINIMUM TIME (ns)
  t.loop  t.chunk t.unroll t.default  loop/default  heuristic         size
   2.073    4.901    2.074     2.075  0.999         (unroll default)  ( 2,  2,  2)
   7.730   10.557    7.726     7.727  1.000***      (unroll default)  ( 3,  3,  3)
   3.368    7.995    4.652     4.652  0.724         (unroll default)  ( 4,  3,  3)
  10.812   13.692   10.806    10.807  1.000         (unroll default)  ( 3,  3,  4)
  14.407   18.288   14.409    14.408  1.000         (unroll default)  ( 3,  4,  4)
   4.950   10.241    6.779     6.757  0.733         (unroll default)  ( 4,  4,  3)
   6.338   10.832    6.448     6.445  0.983***      (unroll default)  ( 4,  4,  4)
  29.619   23.715   30.400    30.395  0.974         (unroll default)  ( 5,  5,  5)
  45.788   52.059   47.297    47.295  0.968         (unroll default)  ( 6,  6,  6)
  58.872   65.491   53.873    53.679  1.097         (unroll default)  ( 8,  8,  8)
  60.449   57.003   55.448    55.440  1.090         (unroll default)  (14,  3,  9)
  60.964   50.867  121.848   123.269  0.495         (unroll default)  ( 9, 14,  3)
 115.015  121.230  107.557   107.557  1.069         (unroll default)  ( 3,  9, 14)
 117.127  214.249   95.178   215.998  0.542         (chunk default)   (14,  4, 12)
  53.251   63.855   86.645    63.409  0.840         (chunk default)   (12, 14,  4)
  67.459  322.691   77.533   321.113  0.210         (chunk default)   ( 4, 12, 14)

MEAN TIME (ns)
  t.loop  t.chunk t.unroll t.default  loop/default  heuristic         size
   2.175    5.521    2.205     2.315  0.939         (unroll default)  ( 2,  2,  2)
   9.154   11.515    8.598     8.361  1.095***      (unroll default)  ( 3,  3,  3)
   3.791   10.087    5.679     5.154  0.736         (unroll default)  ( 4,  3,  3)
  11.909   15.495   12.752    11.993  0.993         (unroll default)  ( 3,  3,  4)
  15.451   21.098   15.546    15.617  0.989         (unroll default)  ( 3,  4,  4)
   5.984   11.948    7.474     7.438  0.804         (unroll default)  ( 4,  4,  3)
   6.873   12.161    7.285     7.137  0.963***      (unroll default)  ( 4,  4,  4)
  32.455   26.016   34.875    33.796  0.960         (unroll default)  ( 5,  5,  5)
  51.458   57.137   53.678    52.737  0.976         (unroll default)  ( 6,  6,  6)
  65.131   72.903   61.722    65.883  0.989         (unroll default)  ( 8,  8,  8)
  70.474   65.449   63.732    62.473  1.128         (unroll default)  (14,  3,  9)
  68.526   57.733  140.109   139.151  0.492         (unroll default)  ( 9, 14,  3)
 124.767  141.089  117.295   116.641  1.070         (unroll default)  ( 3,  9, 14)
 131.910  240.933  100.761   229.787  0.574         (chunk default)   (14,  4, 12)
  55.108   66.231   94.707    68.277  0.807         (chunk default)   (12, 14,  4)
  71.788  337.778   83.411   331.628  0.216         (chunk default)   ( 4, 12, 14)

SizedMatrix

MINIMUM TIME (ns)
  t.loop  t.chunk t.unroll t.default  loop/default  heuristic         size
   2.076    5.332    2.075     2.333  0.890         (unroll default)  ( 2,  2,  2)
   7.730   12.676    7.728     7.727  1.000***      (unroll default)  ( 3,  3,  3)
   5.611   11.683    6.077     6.074  0.924         (unroll default)  ( 4,  3,  3)
  10.813   16.146   10.808    10.808  1.000         (unroll default)  ( 3,  3,  4)
  14.412   20.924   14.408    14.408  1.000         (unroll default)  ( 3,  4,  4)
   4.394   12.685    4.532     4.537  0.968         (unroll default)  ( 4,  4,  3)
   5.372   14.006    5.648     5.505  0.976***      (unroll default)  ( 4,  4,  4)
  28.421   31.509   29.633    29.627  0.959         (unroll default)  ( 5,  5,  5)
  31.403   52.888   47.851    47.851  0.656         (unroll default)  ( 6,  6,  6)
  58.365   61.879   54.455    54.254  1.076         (unroll default)  ( 8,  8,  8)
  63.941   67.916   58.543    58.743  1.088         (unroll default)  (14,  3,  9)
  59.923   73.609  112.948   114.238  0.525         (unroll default)  ( 9, 14,  3)
 111.190  117.087  105.574   105.481  1.054         (unroll default)  ( 3,  9, 14)
 115.633  125.493   98.594   121.604  0.951         (chunk default)   (14,  4, 12)
  45.285   69.400   80.747    69.402  0.653         (chunk default)   (12, 14,  4)
  47.402  125.304   64.636   125.528  0.378         (chunk default)   ( 4, 12, 14)

MEAN TIME (ns)
  t.loop  t.chunk t.unroll t.default  loop/default  heuristic         size
   2.581    6.403    2.476     2.978  0.867         (unroll default)  ( 2,  2,  2)
   9.001   13.992    8.679     8.724  1.032***      (unroll default)  ( 3,  3,  3)
   6.212   13.189    6.785     7.354  0.845         (unroll default)  ( 4,  3,  3)
  12.004   17.699   12.555    11.615  1.034         (unroll default)  ( 3,  3,  4)
  17.216   23.072   16.049    16.166  1.065         (unroll default)  ( 3,  4,  4)
   5.227   14.766    5.478     5.377  0.972         (unroll default)  ( 4,  4,  3)
   5.982   15.971    6.078     6.564  0.911***      (unroll default)  ( 4,  4,  4)
  30.591   35.521   34.247    32.578  0.939         (unroll default)  ( 5,  5,  5)
  33.845   58.170   55.175    55.989  0.604         (unroll default)  ( 6,  6,  6)
  71.162   69.424   58.067    60.038  1.185         (unroll default)  ( 8,  8,  8)
  68.896   76.411   66.313    69.118  0.997         (unroll default)  (14,  3,  9)
  67.477   82.571  131.625   126.844  0.532         (unroll default)  ( 9, 14,  3)
 119.902  130.886  118.403   117.470  1.021         (unroll default)  ( 3,  9, 14)
 129.110  139.837  114.198   142.633  0.905         (chunk default)   (14,  4, 12)
  51.730   78.507   92.106    77.134  0.671         (chunk default)   (12, 14,  4)
  51.967  137.541   71.263   138.583  0.375         (chunk default)   ( 4, 12, 14)

SMatrix

MINIMUM TIME (ns)
  t.loop  t.chunk t.unroll t.default  loop/default  heuristic         size
   1.819    5.154    2.068     1.822  0.998         (unroll default)  ( 2,  2,  2)
   7.729   10.531    7.727     7.733  0.999***      (unroll default)  ( 3,  3,  3)
   3.356    7.770    3.111     3.359  0.999         (unroll default)  ( 4,  3,  3)
  10.812   13.235   10.809    10.810  1.000         (unroll default)  ( 3,  3,  4)
  14.408   15.872   14.407    14.407  1.000         (unroll default)  ( 3,  4,  4)
   3.623    9.020    3.881     3.874  0.935         (unroll default)  ( 4,  4,  3)
   4.651   11.093    4.657     4.651  1.000***      (unroll default)  ( 4,  4,  4)
  27.926   25.224   27.258    28.663  0.974         (unroll default)  ( 5,  5,  5)
  29.991   39.040   29.194    29.304  1.023         (unroll default)  ( 6,  6,  6)
  52.218   61.553   66.954    75.179  0.695         (unroll default)  ( 8,  8,  8)
  90.728   78.258   60.175    53.989  1.680         (unroll default)  (14,  3,  9)
  60.112   59.161   85.392    86.895  0.692         (unroll default)  ( 9, 14,  3)
 112.377   99.059  112.281   106.007  1.060         (unroll default)  ( 3,  9, 14)
 113.125  105.840  130.766   111.691  1.013         (chunk default)   (14,  4, 12)
  45.473  101.002   87.289    71.480  0.636         (chunk default)   (12, 14,  4)
  47.969   87.554   61.987    83.615  0.574         (chunk default)   ( 4, 12, 14)

MEAN TIME (ns)
  t.loop  t.chunk t.unroll t.default  loop/default  heuristic         size
   1.973    5.698    2.340     2.044  0.965         (unroll default)  ( 2,  2,  2)
   8.508   11.575    8.467     8.435  1.009***      (unroll default)  ( 3,  3,  3)
   3.691    8.869    3.451     3.726  0.990         (unroll default)  ( 4,  3,  3)
  11.700   14.741   11.419    11.732  0.997         (unroll default)  ( 3,  3,  4)
  15.426   17.431   15.594    15.864  0.972         (unroll default)  ( 3,  4,  4)
   4.291    9.738    4.093     4.175  1.028         (unroll default)  ( 4,  4,  3)
   5.043   12.557    5.054     5.014  1.006***      (unroll default)  ( 4,  4,  4)
  30.257   28.311   30.703    31.825  0.951         (unroll default)  ( 5,  5,  5)
  31.887   44.124   33.123    31.700  1.006         (unroll default)  ( 6,  6,  6)
  59.612   68.653   73.701    82.356  0.724         (unroll default)  ( 8,  8,  8)
  98.975   88.194   66.272    60.927  1.624         (unroll default)  (14,  3,  9)
  71.787   65.768   94.558    99.952  0.718         (unroll default)  ( 9, 14,  3)
 123.761  120.008  128.358   114.799  1.078         (unroll default)  ( 3,  9, 14)
 123.287  116.476  146.230   124.308  0.992         (chunk default)   (14,  4, 12)
  53.789  111.039   98.186    78.839  0.682         (chunk default)   (12, 14,  4)
  52.806   99.494   71.560    95.252  0.554         (chunk default)   ( 4, 12, 14)

using StaticArrays
using BenchmarkTools
using Statistics
using Printf

# Each case is (i1, i2, i3): A is i1×i2 and B is i2×i3
test_cases = [
    (2,2,2),
    (3,3,3),
    (4,3,3),
    (3,3,4),
    (3,4,4),
    (4,4,3),
    (4,4,4),
    (5,5,5),
    (6,6,6),
    (8,8,8),
    (14,3,9),
    (9,14,3),
    (3,9,14),
    (14,4,12),
    (12,14,4),
    (4,12,14)
]
n_cases = length(test_cases)
# Columns: loop, chunks, unrolled, default (times in nanoseconds)
data_min = zeros(n_cases, 4)
data_mean = zeros(n_cases, 4)
for k = 1:n_cases
    i1, i2, i3 = test_cases[k]
    # Edit the conditions below by hand to select the matrix type under test
    if false
        println("SMatrix")
        A = rand(SMatrix{i1,i2,Float64,i1*i2})
        B = rand(SMatrix{i2,i3,Float64,i2*i3})
    elseif false
        println("MMatrix")
        A = rand(MMatrix{i1,i2,Float64,i1*i2})
        B = rand(MMatrix{i2,i3,Float64,i2*i3})
    else
        println("SizedMatrix")
        A = Size(i1,i2)(rand(i1,i2))
        B = Size(i2,i3)(rand(i2,i3))
    end
    # NOTE: At least as of Julia 1.4, the use of Ref is necessary to prevent unwanted compiler optimizations
    # info_loop = @benchmark StaticArrays.mul_loop($(Size(A)),$(Size(B)),$A,$B)
    # info_chunks = @benchmark StaticArrays.mul_unrolled_chunks($(Size(A)),$(Size(B)),$A,$B)
    # info_unrolled = @benchmark StaticArrays.mul_unrolled($(Size(A)),$(Size(B)),$A,$B)
    # info_default = @benchmark $A * $B
    info_loop = @benchmark StaticArrays.mul_loop($(Size(A)),$(Size(B)),$(Ref(A))[],$(Ref(B))[])
    info_chunks = @benchmark StaticArrays.mul_unrolled_chunks($(Size(A)),$(Size(B)),$(Ref(A))[],$(Ref(B))[])
    info_unrolled = @benchmark StaticArrays.mul_unrolled($(Size(A)),$(Size(B)),$(Ref(A))[],$(Ref(B))[])
    info_default = @benchmark $(Ref(A))[] * $(Ref(B))[]
    # Take the minimum explicitly instead of relying on the order of the samples
    min_loop = minimum(info_loop.times)
    min_chunks = minimum(info_chunks.times)
    min_unrolled = minimum(info_unrolled.times)
    min_default = minimum(info_default.times)
    data_min[k, :] = [min_loop, min_chunks, min_unrolled, min_default]
    mean_loop = mean(info_loop.times)
    mean_chunks = mean(info_chunks.times)
    mean_unrolled = mean(info_unrolled.times)
    mean_default = mean(info_default.times)
    data_mean[k, :] = [mean_loop, mean_chunks, mean_unrolled, mean_default]
    println("$k/$n_cases")
end

# Format a time with three decimals, left-padding small values so columns line up
function force_pad(x::Float64)
    s = @sprintf "% 1.3f" x
    (x < 100) && (s = " " * s)
    (x < 10) && (s = " " * s)
    return s
end

# data_mean holds mean times, so the second table is labelled MEAN, not MAXIMUM
for (description, data_compare) = [("MINIMUM TIME (ns)", data_min), ("MEAN TIME (ns)", data_mean)]
    println()
    println(description)
    println("   t.loop   t.chunk  t.unroll t.default loop/default")
    for k = 1:n_cases
        case_k = test_cases[k]
        time_k = data_compare[k, :]
        time_loop = time_k[1]
        time_default = time_k[4]
        for kk = 1:4
            print(force_pad(time_k[kk]), " ")
        end
        # Ratio of the loop time to the default-method time
        print(force_pad(time_loop / time_default))
        (case_k == (3, 3, 3) || case_k == (4, 4, 4)) ? print("*** ") : print("    ")
        if prod(case_k) <= 8^3
            print("(unroll default)")
        elseif maximum(case_k) <= 14
            print(" (chunk default)")
        else
            println("will unroll already no need to test this case")
        end
        string_size = " (" * lpad(case_k[1], 2) * "," * lpad(case_k[2], 3) * "," * lpad(case_k[3], 3) * ")"
        println(string_size)
    end
end
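
As a reading aid for the "(unroll default)" and "(chunk default)" labels in the tables, here is a minimal sketch of the size thresholds the script uses when printing them. The function name `default_label` and the `:other` symbol are hypothetical; this mirrors only the labelling logic of the script above and is not StaticArrays' internal dispatch code.

```julia
# Hypothetical helper: reproduces the thresholds used for the labels in the
# benchmark script (product of sizes <= 8^3, then largest size <= 14).
function default_label(dims::NTuple{3,Int})
    if prod(dims) <= 8^3
        return :unroll
    elseif maximum(dims) <= 14
        return :chunk
    else
        return :other   # sizes outside the range covered by the benchmark
    end
end

for dims in [(8, 8, 8), (14, 3, 9), (14, 4, 12), (16, 16, 16)]
    println(dims, " => ", default_label(dims))
end
```

Under these thresholds (14, 3, 9) still counts as an unroll case because 14*3*9 = 378 is below 512, while (14, 4, 12) falls into the chunk range.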
This needs a deeper look; possibly it's just ready to merge. Any thoughts @andyferris? From your comment I can't tell whether you're worried about replacing older heuristics or think it's fine.
I have re-run these benchmarks and for example for
And here is for
We do need better heuristics here but the one proposed in this PR is IMO worse for Ref: #814 . |
@mateuszbaran, times measured by benchmarking are in nanoseconds. We expect most operations to take at least 1.0 nanoseconds (about 3 clock cycles), so times like 0.025 nanoseconds are the result of unwanted compiler optimizations. In other words, the compiler is simplifying the expression before it can be benchmarked, so what you're measuring is the time it takes to evaluate something much simpler, a constant in many cases. As mentioned in #597 (comment), one way to circumvent this issue is to use Ref. In this case, using Ref has the side-effect of making the times of the default method equal to none of the measured methods, so you'll have to compare the first three columns, and not rely on the stated ratio.
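
A minimal sketch of the Ref workaround on a deliberately trivial expression, assuming BenchmarkTools is installed. The variables `a` and `b` are illustrative only; the interpolation pattern `$(Ref(x))[]` is the same one used in the benchmark script above.

```julia
using BenchmarkTools

a, b = 1.5, 2.5

# Interpolating the values directly lets the compiler treat them as constants,
# so the sum may be folded away and the reported time can be unrealistically small.
t_plain = @belapsed $a + $b

# Wrapping each value in a Ref and dereferencing it inside the benchmarked
# expression forces a real load, so the addition itself is what gets timed.
t_ref = @belapsed $(Ref(a))[] + $(Ref(b))[]

# @belapsed returns the minimum time in seconds; convert to nanoseconds.
println("plain: ", t_plain * 1e9, " ns")
println("Ref:   ", t_ref * 1e9, " ns")
```

Whether the plain version is actually folded to a constant depends on the Julia and BenchmarkTools versions, so the two numbers may or may not differ on a given machine.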
Yes, that's a good point, but still looping isn't consistently better or just as good as unrolled or chunked multiplication, even after adding |
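
To compare the first three columns directly, here is a small sketch that picks the fastest of loop, chunk and unroll for a few rows. The numbers are copied from the SMatrix MINIMUM TIME table earlier in this thread; the names `rows`, `labels` and `fastest` are hypothetical.

```julia
# Minimum times in nanoseconds, in the order (loop, chunk, unroll),
# copied from the SMatrix MINIMUM TIME table.
rows = [
    ((8, 8, 8),   [52.218, 61.553, 66.954]),
    ((14, 3, 9),  [90.728, 78.258, 60.175]),
    ((9, 14, 3),  [60.112, 59.161, 85.392]),
    ((4, 12, 14), [47.969, 87.554, 61.987]),
]
labels = (:loop, :chunk, :unroll)

# Return the label of the column with the smallest time.
fastest(times) = labels[argmin(times)]

for (dims, times) in rows
    println(dims, " => ", fastest(times))
end
```

For these four rows each of the three methods comes out fastest at least once, which is the inconsistency being pointed out here.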
Motivated by #513, this PR changes the multiplication heuristic for non-float types. This should fix a performance issue that affects Dual.

Although I didn't do it in this PR because it would be too controversial, I think that the same heuristic should be used for floats. Maybe the compiler has come a long way, but it seems to be able to do the right thing most of the time (mul_loop outperforms the current heuristic on average). Given the current state of the compiler (see item 4), I think that we should reconsider whether this is something we still need to do.

Fixes #513