SavicStefan
This PR adds an implementation of ACC_TYPE_VEC2. For non-coopmat shaders, using ACC_TYPE_VEC2 improves caching behavior, since accessing 32-bit values is generally more efficient than accessing 16-bit values.

Performance comparison (without coopmat and coopmat2), NVIDIA GeForce RTX 4060 Ti:

| Name | Before (us/run) | After (us/run) | Δ% (positive = improvement) |
|---|---|---|---|
| MUL_MAT(type_a=f32,type_b=f32,m=4096,n=512,k=14336) | 5767.64 | 5479.83 | +4.99% |
| MUL_MAT(type_a=f16,type_b=f32,m=4096,n=512,k=14336) | 5421.40 | 5047.91 | +6.88% |
| MUL_MAT(type_a=bf16,type_b=f32,m=4096,n=512,k=14336) | 5281.02 | 6002.14 | −13.66% |
| MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=512,k=14336) | 2741.43 | 2748.71 | −0.27% |
| MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=512,k=14336) | 2766.60 | 2764.23 | +0.09% |
| MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=512,k=14336) | 2877.49 | 2875.25 | +0.08% |
| MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=512,k=14336) | 2869.17 | 2867.33 | +0.06% |
| MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=512,k=14336) | 2887.17 | 2890.27 | −0.11% |
| MUL_MAT(type_a=mxfp4,type_b=f32,m=4096,n=512,k=14336) | 4976.57 | 4043.75 | +18.75% |
| MUL_MAT(type_a=q2_K,type_b=f32,m=4096,n=512,k=14336) | 4938.25 | 4120.32 | +16.56% |
| MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=512,k=14336) | 5287.85 | 4548.30 | +13.99% |
| MUL_MAT(type_a=q4_K,type_b=f32,m=4096,n=512,k=14336) | 5373.34 | 4566.63 | +15.01% |
| MUL_MAT(type_a=q5_K,type_b=f32,m=4096,n=512,k=14336) | 5769.13 | 4907.47 | +14.94% |
| MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=512,k=14336) | 5507.98 | 4524.96 | +17.85% |
| MUL_MAT(type_a=iq2_xxs,type_b=f32,m=4096,n=512,k=14336) | 4877.02 | 4043.75 | +17.07% |
| MUL_MAT(type_a=iq2_xs,type_b=f32,m=4096,n=512,k=14336) | 5010.98 | 4112.35 | +17.94% |
| MUL_MAT(type_a=iq2_s,type_b=f32,m=4096,n=512,k=14336) | 4863.99 | 4065.67 | +16.41% |
| MUL_MAT(type_a=iq3_xxs,type_b=f32,m=4096,n=512,k=14336) | 4957.83 | 4129.54 | +16.70% |
| MUL_MAT(type_a=iq1_s,type_b=f32,m=4096,n=512,k=14336) | 4583.30 | 3788.42 | +17.34% |
| MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=512,k=14336) | 5128.29 | 4280.64 | +16.52% |
| MUL_MAT(type_a=iq4_nl,type_b=f32,m=4096,n=512,k=14336) | 4885.91 | 3992.67 | +18.27% |
| MUL_MAT(type_a=iq3_s,type_b=f32,m=4096,n=512,k=14336) | 4933.56 | 4084.30 | +17.22% |
| MUL_MAT(type_a=iq4_xs,type_b=f32,m=4096,n=512,k=14336) | 5389.60 | 4489.23 | +16.67% |
Performance before (without coopmat and coopmat2), NVIDIA GeForce RTX 4060 Ti:
MUL_MAT(type_a=f32,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                   174 runs -  5767.64 us/run -  60.13 GFLOP/run -  10.43 TFLOPS
MUL_MAT(type_a=f16,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                   186 runs -  5421.40 us/run -  60.13 GFLOP/run -  11.09 TFLOPS
MUL_MAT(type_a=bf16,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                  190 runs -  5281.02 us/run -  60.13 GFLOP/run -  11.39 TFLOPS
MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                  366 runs -  2741.43 us/run -  60.13 GFLOP/run -  21.93 TFLOPS
MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                  362 runs -  2766.60 us/run -  60.13 GFLOP/run -  21.73 TFLOPS
MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                  348 runs -  2877.49 us/run -  60.13 GFLOP/run -  20.90 TFLOPS
MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                  350 runs -  2869.17 us/run -  60.13 GFLOP/run -  20.96 TFLOPS
MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                  348 runs -  2887.17 us/run -  60.13 GFLOP/run -  20.83 TFLOPS
MUL_MAT(type_a=mxfp4,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                 202 runs -  4976.57 us/run -  60.13 GFLOP/run -  12.08 TFLOPS
MUL_MAT(type_a=q2_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                  204 runs -  4938.25 us/run -  60.13 GFLOP/run -  12.18 TFLOPS
MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                  190 runs -  5287.85 us/run -  60.13 GFLOP/run -  11.37 TFLOPS
MUL_MAT(type_a=q4_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                  188 runs -  5373.34 us/run -  60.13 GFLOP/run -  11.19 TFLOPS
MUL_MAT(type_a=q5_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                  174 runs -  5769.13 us/run -  60.13 GFLOP/run -  10.42 TFLOPS
MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                  182 runs -  5507.98 us/run -  60.13 GFLOP/run -  10.92 TFLOPS
MUL_MAT(type_a=iq2_xxs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):               206 runs -  4877.02 us/run -  60.13 GFLOP/run -  12.33 TFLOPS
MUL_MAT(type_a=iq2_xs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                200 runs -  5010.98 us/run -  60.13 GFLOP/run -  12.00 TFLOPS
MUL_MAT(type_a=iq2_s,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                 206 runs -  4863.99 us/run -  60.13 GFLOP/run -  12.36 TFLOPS
MUL_MAT(type_a=iq3_xxs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):               202 runs -  4957.83 us/run -  60.13 GFLOP/run -  12.13 TFLOPS
MUL_MAT(type_a=iq1_s,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                 220 runs -  4583.30 us/run -  60.13 GFLOP/run -  13.12 TFLOPS
MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                 196 runs -  5128.29 us/run -  60.13 GFLOP/run -  11.73 TFLOPS
MUL_MAT(type_a=iq4_nl,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                206 runs -  4885.91 us/run -  60.13 GFLOP/run -  12.31 TFLOPS
MUL_MAT(type_a=iq3_s,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                 204 runs -  4933.56 us/run -  60.13 GFLOP/run -  12.19 TFLOPS
MUL_MAT(type_a=iq4_xs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                186 runs -  5389.60 us/run -  60.13 GFLOP/run -  11.16 TFLOPS

Performance after (without coopmat and coopmat2), NVIDIA GeForce RTX 4060 Ti:
MUL_MAT(type_a=f32,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                   184 runs -  5479.83 us/run -  60.13 GFLOP/run -  10.97 TFLOPS
MUL_MAT(type_a=f16,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                   200 runs -  5047.91 us/run -  60.13 GFLOP/run -  11.91 TFLOPS
MUL_MAT(type_a=bf16,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                  168 runs -  6002.14 us/run -  60.13 GFLOP/run -  10.02 TFLOPS
MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                  364 runs -  2748.71 us/run -  60.13 GFLOP/run -  21.88 TFLOPS
MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                  362 runs -  2764.23 us/run -  60.13 GFLOP/run -  21.75 TFLOPS
MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                  348 runs -  2875.25 us/run -  60.13 GFLOP/run -  20.91 TFLOPS
MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                  350 runs -  2867.33 us/run -  60.13 GFLOP/run -  20.97 TFLOPS
MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                  348 runs -  2890.27 us/run -  60.13 GFLOP/run -  20.80 TFLOPS
MUL_MAT(type_a=mxfp4,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                 248 runs -  4043.75 us/run -  60.13 GFLOP/run -  14.87 TFLOPS
MUL_MAT(type_a=q2_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                  244 runs -  4120.32 us/run -  60.13 GFLOP/run -  14.59 TFLOPS
MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                  220 runs -  4548.30 us/run -  60.13 GFLOP/run -  13.22 TFLOPS
MUL_MAT(type_a=q4_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                  220 runs -  4566.63 us/run -  60.13 GFLOP/run -  13.17 TFLOPS
MUL_MAT(type_a=q5_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                  204 runs -  4907.47 us/run -  60.13 GFLOP/run -  12.25 TFLOPS
MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                  222 runs -  4524.96 us/run -  60.13 GFLOP/run -  13.29 TFLOPS
MUL_MAT(type_a=iq2_xxs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):               248 runs -  4043.75 us/run -  60.13 GFLOP/run -  14.87 TFLOPS
MUL_MAT(type_a=iq2_xs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                244 runs -  4112.35 us/run -  60.13 GFLOP/run -  14.62 TFLOPS
MUL_MAT(type_a=iq2_s,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                 246 runs -  4065.67 us/run -  60.13 GFLOP/run -  14.79 TFLOPS
MUL_MAT(type_a=iq3_xxs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):               244 runs -  4129.54 us/run -  60.13 GFLOP/run -  14.56 TFLOPS
MUL_MAT(type_a=iq1_s,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                 264 runs -  3788.42 us/run -  60.13 GFLOP/run -  15.87 TFLOPS
MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                 234 runs -  4280.64 us/run -  60.13 GFLOP/run -  14.05 TFLOPS
MUL_MAT(type_a=iq4_nl,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                252 runs -  3992.67 us/run -  60.13 GFLOP/run -  15.06 TFLOPS
MUL_MAT(type_a=iq3_s,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                 246 runs -  4084.30 us/run -  60.13 GFLOP/run -  14.72 TFLOPS
MUL_MAT(type_a=iq4_xs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                224 runs -  4489.23 us/run -  60.13 GFLOP/run -  13.39 TFLOPS

SavicStefan requested a review from 0cc4m as a code owner on September 23, 2025.
The github-actions bot added the labels Vulkan (issues specific to the Vulkan backend) and ggml (changes relating to the ggml tensor library for machine learning) on Sep 23, 2025.