
Conversation

@danbev (Member) commented Sep 22, 2025

This PR consists of three commits, but the first two are part of #16004.

This commit adds support for testing the ggml-cpu repack feature, which
enables repacking quantized data into a more optimal layout for matrix
multiplication on specific hardware architectures.

The motivation is to enable testing a CPU backend that uses repacked
data against a reference CPU backend that does not use repacked data.
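
For intuition, here is a minimal toy sketch of the idea in C. This is not ggml's actual `block_q4_0` layout or its repack routines (the block type and function name are made up); it only illustrates the interleaving pattern suggested by names like `q4_0_8x8`: blocks from 8 consecutive rows are stored together so a kernel can read one block from each row in a single contiguous access.

```c
#include <stddef.h>

#define NROWS_INTERLEAVED 8

// hypothetical simplified block; a real q4_0 block holds an f16 scale
// plus 16 bytes of packed 4-bit quants (18 bytes total)
typedef struct { unsigned char bytes[18]; } block;

// src is row-major: row r starts at src + r * blocks_per_row.
// dst stores, for each band of 8 rows (nrows assumed a multiple of 8),
// block 0 of rows 0..7, then block 1 of rows 0..7, and so on.
static void repack_rows_8x(const block * src, block * dst,
                           int nrows, int blocks_per_row) {
    for (int r0 = 0; r0 < nrows; r0 += NROWS_INTERLEAVED) {
        for (int b = 0; b < blocks_per_row; ++b) {
            for (int r = 0; r < NROWS_INTERLEAVED; ++r) {
                *dst++ = src[(size_t)(r0 + r) * blocks_per_row + b];
            }
        }
    }
}
```

With the repacked layout, a kernel that computes 8 output rows at a time streams through `dst` linearly instead of striding across 8 separate rows.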

Building:
```console
$ cmake -B build \
    -DGGML_CPU_REF_BACKEND=ON \
    -DGGML_BACKEND_DL=ON \
    -DGGML_CPU_ALL_VARIANTS=ON
```

List available CPU architectures/variants:
```console
$ ./build/bin/test-backend-ops cpu-variants --list
CPU variants:
  CPU-alderlake   - 12th Gen Intel(R) Core(TM) i7-1260P
```

Run tests:
```console
$ ./build/bin/test-backend-ops cpu-variants \
    --variant CPU-alderlake \
    -o "MUL_MAT(type_a=q4_0,type_b=f32,m=16,n=1,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1)"

Testing CPU variant 'CPU-alderlake' against cpu-ref backend...

repack: repack tensor a with q4_0_8x8
  MUL_MAT(type_a=q4_0,type_b=f32,m=16,n=1,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1): OK
repack: repack tensor a with q4_0_8x8
  MUL_MAT(type_a=q4_0,type_b=f32,m=16,n=1,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1): OK
  14491/14491 tests passed
```

All matrix multiplication tests can be run by specifying `-o "MUL_MAT, MUL_MAT_ID"`, though with the broader filter it may be harder to spot the tests that use repacking.
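
For example:
```console
$ ./build/bin/test-backend-ops cpu-variants \
    --variant CPU-alderlake \
    -o "MUL_MAT, MUL_MAT_ID"
```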

This commit introduces a CPU reference implementation for GGML,
designed primarily for testing and validation purposes.

The motivation for this addition is to have a pure C CPU backend
implementation that does not use any hardware-specific optimizations
or intrinsics. This allows testing the CPU backend variants against
the reference implementation to ensure correctness.
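
To illustrate what "pure C, no intrinsics" means here, a minimal sketch of a generic scalar kernel (simplified to an f32 dot product; the function name is hypothetical, and the real backend operates on quantized blocks):

```c
// plain scalar loop: no SIMD intrinsics, no arch-specific code,
// so it compiles and behaves the same on any architecture
static void vec_dot_f32_generic(int n, float * s,
                                const float * x, const float * y) {
    float sum = 0.0f;
    for (int i = 0; i < n; ++i) {
        sum += x[i] * y[i];
    }
    *s = sum;
}
```

An arch-specific variant would replace the loop body with NEON or AVX intrinsics, which is exactly what the reference backend avoids so it can serve as the comparison baseline.
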
@github-actions bot added the testing (Everything test related) and ggml (changes relating to the ggml tensor library for machine learning) labels Sep 22, 2025
@danbev marked this pull request as ready for review September 24, 2025 08:37
@danbev requested a review from slaren as a code owner September 24, 2025 08:37

@ggerganov (Member) commented

I don't think the reference CPU backend is using the generic implementations for some reason. Here is what I do on Mac:

```console
cmake .. -DGGML_CPU_REF_BACKEND=ON -DGGML_BACKEND_DL=ON -DGGML_BLAS=OFF -DGGML_METAL=OFF
make -j && ./bin/test-backend-ops cpu-variants --list

load_backend: loaded CPU backend from llama.cpp/build-cpu-ref/bin/libggml-cpu.so
load_backend: loaded CPU backend from llama.cpp/build-cpu-ref/bin/libggml-cpu-ref.so
CPU variants:
  CPU             - Apple M4 Max

make -j && ./bin/test-backend-ops cpu-variants --variant CPU -o MUL_MAT -p "q4_0"
```

If I put prints like this, I see only the Arm implementation being executed:

```diff
diff --git a/ggml/src/ggml-cpu/arch/arm/quants.c b/ggml/src/ggml-cpu/arch/arm/quants.c
index aadbb487e..fff646064 100644
--- a/ggml/src/ggml-cpu/arch/arm/quants.c
+++ b/ggml/src/ggml-cpu/arch/arm/quants.c
@@ -138,6 +138,7 @@ void quantize_row_q8_K(const float * GGML_RESTRICT x, void * GGML_RESTRICT y, in
 //===================================== Dot products =================================

 void ggml_vec_dot_q4_0_q8_0(int n, float * GGML_RESTRICT s, size_t bs, const void * GGML_RESTRICT vx, size_t bx, const void * GGML_RESTRICT vy, size_t by, int nrc) {
+    printf("arm\n");
     const int qk = QK8_0;
     const int nb = n / qk;

diff --git a/ggml/src/ggml-cpu/quants.c b/ggml/src/ggml-cpu/quants.c
index 365cb36d2..3694ae145 100644
--- a/ggml/src/ggml-cpu/quants.c
+++ b/ggml/src/ggml-cpu/quants.c
@@ -113,6 +113,7 @@ void quantize_row_q8_K_generic(const float * GGML_RESTRICT x, void * GGML_RESTRI
 //===================================== Dot products =================================

 void ggml_vec_dot_q4_0_q8_0_generic(int n, float * GGML_RESTRICT s, size_t bs, const void * GGML_RESTRICT vx, size_t bx, const void * GGML_RESTRICT vy, size_t by, int nrc) {
+    printf("generic\n");
     const int qk = QK8_0;
     const int nb = n / qk;
```

@danbev (Member, Author) commented Sep 24, 2025

@ggerganov Thanks for trying this out. I'll take a closer look shortly 👍

With the fix in 22ef44d I was able to see both the arm and generic output, though I did have to add the generic printf to ggml-quants.c:

```diff
diff --git a/ggml/src/ggml-cpu/arch/arm/quants.c b/ggml/src/ggml-cpu/arch/arm/quants.c
index aadbb487e..689148ae7 100644
--- a/ggml/src/ggml-cpu/arch/arm/quants.c
+++ b/ggml/src/ggml-cpu/arch/arm/quants.c
@@ -138,6 +138,7 @@ void quantize_row_q8_K(const float * GGML_RESTRICT x, void * GGML_RESTRICT y, in
 //===================================== Dot products =================================

 void ggml_vec_dot_q4_0_q8_0(int n, float * GGML_RESTRICT s, size_t bs, const void * GGML_RESTRICT vx, size_t bx, const void * GGML_RESTRICT vy, size_t by, int nrc) {
+    printf("arm...\n");
     const int qk = QK8_0;
     const int nb = n / qk;

diff --git a/ggml/src/ggml-quants.c b/ggml/src/ggml-quants.c
index 727932123..058b0f6a6 100644
--- a/ggml/src/ggml-quants.c
+++ b/ggml/src/ggml-quants.c
@@ -197,6 +197,7 @@ void quantize_row_q5_1_ref(const float * GGML_RESTRICT x, block_q5_1 * GGML_REST

 // reference implementation for deterministic creation of model files
 void quantize_row_q8_0_ref(const float * GGML_RESTRICT x, block_q8_0 * GGML_RESTRICT y, int64_t k) {
+    printf("generic....\n");
     assert(k % QK8_0 == 0);
     const int nb = k / QK8_0;
```
I've also stepped through this in the debugger:

```console
(lldb) target create "build/bin/test-backend-ops"
Current executable set to '/Users/danbev/work/ai/llama.cpp/build/bin/test-backend-ops' (arm64).
(lldb) settings set -- target.run-args  "cpu-variants" "--variant" "CPU" "-o" "MUL_MAT(type_a=q4_0,type_b=f32,m=16,n=16,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1)"
(lldb) br set -f ggml-backend.cpp -l 2035
Breakpoint 1: where = libggml-base.dylib`::ggml_backend_compare_graph_backend(ggml_backend_t, ggml_backend_t, ggml_cgraph *, ggml_backend_eval_callback, void *, ggml_tensor *) + 668 at ggml-backend.cpp:2035:40, address = 0x0000000000022734
(lldb) r
Process 58647 launched: '/Users/danbev/work/ai/llama.cpp/build/bin/test-backend-ops' (arm64)
ggml_backend_load_best: failed to find ggml_backend_score in /Users/danbev/work/ai/llama.cpp/build/bin/libggml-cpu-ref.so
load_backend: loaded CPU backend from /Users/danbev/work/ai/llama.cpp/build/bin/libggml-cpu.so
register_backend: registered backend CPU (1 devices)
register_device: registered device CPU (Apple M3)
load_backend: loaded CPU backend from /Users/danbev/work/ai/llama.cpp/build/bin/libggml-cpu-ref.so
register_backend: registered backend CPU (1 devices)
register_device: registered device CPU-ref (Apple M3)
Testing CPU variant 'CPU' against cpu-ref backend...

Process 58647 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = breakpoint 1.1
    frame #0: 0x000000010041a734 libggml-base.dylib`ggml_backend_compare_graph_backend(backend1=0x00006000014a8090, backend2=0x00006000014a8000, graph=0x0000000100224020, callback=(test-backend-ops`test_case::eval(ggml_backend*, ggml_backend*, char const*, printer*)::'lambda'(int, ggml_tensor*, ggml_tensor*, void*)::__invoke(int, ggml_tensor*, ggml_tensor*, void*) at test-backend-ops.cpp:1204), user_data=0x000000016fdfddb8, test_node=0x0000000000000000) at ggml-backend.cpp:2035:40
   2032	            struct ggml_cgraph g1v = ggml_graph_view(g1, i, i + 1);
   2033	            struct ggml_cgraph g2v = ggml_graph_view(g2, i, i + 1);
   2034
-> 2035	            ggml_backend_graph_compute(backend1, &g1v);
   2036	            ggml_backend_graph_compute(backend2, &g2v);
   2037
   2038	            if (ggml_is_view_op(t1->op)) {
Target 0: (test-backend-ops) stopped.
(lldb) n
arm...
arm...
arm...
(... the same "arm..." line repeated, 64 lines in total ...)
Process 58647 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = step over
    frame #0: 0x000000010041a740 libggml-base.dylib`ggml_backend_compare_graph_backend(backend1=0x00006000014a8090, backend2=0x00006000014a8000, graph=0x0000000100224020, callback=(test-backend-ops`test_case::eval(ggml_backend*, ggml_backend*, char const*, printer*)::'lambda'(int, ggml_tensor*, ggml_tensor*, void*)::__invoke(int, ggml_tensor*, ggml_tensor*, void*) at test-backend-ops.cpp:1204), user_data=0x000000016fdfddb8, test_node=0x0000000000000000) at ggml-backend.cpp:2036:40
   2033	            struct ggml_cgraph g2v = ggml_graph_view(g2, i, i + 1);
   2034
   2035	            ggml_backend_graph_compute(backend1, &g1v);
-> 2036	            ggml_backend_graph_compute(backend2, &g2v);
   2037
   2038	            if (ggml_is_view_op(t1->op)) {
   2039	                continue;
Target 0: (test-backend-ops) stopped.
(lldb) n
generic....
generic....
generic....
(... the same "generic...." line repeated, 64 lines in total ...)
Process 58647 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = step over
    frame #0: 0x000000010041a748 libggml-base.dylib`ggml_backend_compare_graph_backend(backend1=0x00006000014a8090, backend2=0x00006000014a8000, graph=0x0000000100224020, callback=(test-backend-ops`test_case::eval(ggml_backend*, ggml_backend*, char const*, printer*)::'lambda'(int, ggml_tensor*, ggml_tensor*, void*)::__invoke(int, ggml_tensor*, ggml_tensor*, void*) at test-backend-ops.cpp:1204), user_data=0x000000016fdfddb8, test_node=0x0000000000000000) at ggml-backend.cpp:2038:33
   2035	            ggml_backend_graph_compute(backend1, &g1v);
   2036	            ggml_backend_graph_compute(backend2, &g2v);
   2037
-> 2038	            if (ggml_is_view_op(t1->op)) {
   2039	                continue;
   2040	            }
   2041
Target 0: (test-backend-ops) stopped.
```

I realized that this was because I had not disabled GGML_LLAMAFILE when building. Setting it to OFF then printed the output like you added. I'm looking into disabling this in the build now. I've disabled LLAMAFILE, HBM, OpenMP, and KleidiAI for the cpu reference backend now.
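
For reference, a manual configuration along these lines would look something like the following, using the existing ggml CMake options (the PR's build change may wire this up automatically, so treat this as a sketch rather than the final interface):

```console
$ cmake -B build \
    -DGGML_CPU_REF_BACKEND=ON \
    -DGGML_BACKEND_DL=ON \
    -DGGML_LLAMAFILE=OFF \
    -DGGML_CPU_HBM=OFF \
    -DGGML_OPENMP=OFF \
    -DGGML_CPU_KLEIDIAI=OFF
```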

Set GGML_SYSTEM_ARCH to cpu-ref when GGML_CPU_REF_BACKEND is enabled
to force a generic backend implementation.
Disable HBM, OpenMP, KleidiAI for the CPU ref backend.