ggml : add repack testing support #16182
base: master
Conversation
This commit introduces a CPU reference implementation for GGML, designed primarily for testing and validation. The motivation is to have a pure C CPU backend that does not use any hardware-specific optimizations or intrinsics, allowing the CPU backend variants to be tested against the reference implementation to ensure correctness.
This commit adds support for testing the ggml-cpu repack feature, which enables repacking quantized data into a more optimal layout for matrix multiplication on specific hardware architectures. The motivation is to enable testing a CPU backend that uses repacked data against a reference CPU backend that does not use repacked data.

Building:
```console
$ cmake -B build \
    -DGGML_CPU_REF_BACKEND=ON -DGGML_BACKEND_DL=ON \
    -DGGML_CPU_ALL_VARIANTS=ON
```

List available CPU architectures/variants:
```console
$ ./build/bin/test-backend-ops cpu-variants --list
CPU variants:
  CPU-alderlake - 12th Gen Intel(R) Core(TM) i7-1260P
```

Run tests:
```console
$ ./build-ref/bin/test-backend-ops cpu-variants \
    --variant CPU-alderlake \
    -o "MUL_MAT(type_a=q4_0,type_b=f32,m=16,n=1,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1)"
Testing CPU variant 'CPU-alderlake' against cpu-ref backend...
repack: repack tensor a with q4_0_8x8
  MUL_MAT(type_a=q4_0,type_b=f32,m=16,n=1,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1): OK
repack: repack tensor a with q4_0_8x8
  MUL_MAT(type_a=q4_0,type_b=f32,m=16,n=1,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1): OK
  14491/14491 tests passed
```

All matrix multiplication tests can be run by specifying `-o "MUL_MAT"`, but it may be harder to spot the ones that use repacking.
I don't think the reference CPU backend is being used:

```console
cmake .. -DGGML_CPU_REF_BACKEND=ON -DGGML_BACKEND_DL=ON -DGGML_BLAS=OFF -DGGML_METAL=OFF
make -j && ./bin/test-backend-ops cpu-variants --list
load_backend: loaded CPU backend from llama.cpp/build-cpu-ref/bin/libggml-cpu.so
load_backend: loaded CPU backend from llama.cpp/build-cpu-ref/bin/libggml-cpu-ref.so
CPU variants:
  CPU - Apple M4 Max
```

```console
make -j && ./bin/test-backend-ops cpu-variants --variant CPU -o MUL_MAT -p "q4_0"
```

If I put prints like this, I see only the Arm implementation being executed:

```diff
diff --git a/ggml/src/ggml-cpu/arch/arm/quants.c b/ggml/src/ggml-cpu/arch/arm/quants.c
index aadbb487e..fff646064 100644
--- a/ggml/src/ggml-cpu/arch/arm/quants.c
+++ b/ggml/src/ggml-cpu/arch/arm/quants.c
@@ -138,6 +138,7 @@ void quantize_row_q8_K(const float * GGML_RESTRICT x, void * GGML_RESTRICT y, in
 //===================================== Dot products =================================

 void ggml_vec_dot_q4_0_q8_0(int n, float * GGML_RESTRICT s, size_t bs, const void * GGML_RESTRICT vx, size_t bx, const void * GGML_RESTRICT vy, size_t by, int nrc) {
+    printf("arm\n");
     const int qk = QK8_0;
     const int nb = n / qk;

diff --git a/ggml/src/ggml-cpu/quants.c b/ggml/src/ggml-cpu/quants.c
index 365cb36d2..3694ae145 100644
--- a/ggml/src/ggml-cpu/quants.c
+++ b/ggml/src/ggml-cpu/quants.c
@@ -113,6 +113,7 @@ void quantize_row_q8_K_generic(const float * GGML_RESTRICT x, void * GGML_RESTRI
 //===================================== Dot products =================================

 void ggml_vec_dot_q4_0_q8_0_generic(int n, float * GGML_RESTRICT s, size_t bs, const void * GGML_RESTRICT vx, size_t bx, const void * GGML_RESTRICT vy, size_t by, int nrc) {
+    printf("generic\n");
     const int qk = QK8_0;
     const int nb = n / qk;
```
@ggerganov Thanks for trying this out. I'll take a closer look shortly 👍

With the fix in 22ef44d I was able to see both arm and generic output, though I did have to add the generic printf to ggml-quants.c:

```diff
diff --git a/ggml/src/ggml-cpu/arch/arm/quants.c b/ggml/src/ggml-cpu/arch/arm/quants.c
index aadbb487e..689148ae7 100644
--- a/ggml/src/ggml-cpu/arch/arm/quants.c
+++ b/ggml/src/ggml-cpu/arch/arm/quants.c
@@ -138,6 +138,7 @@ void quantize_row_q8_K(const float * GGML_RESTRICT x, void * GGML_RESTRICT y, in
 //===================================== Dot products =================================

 void ggml_vec_dot_q4_0_q8_0(int n, float * GGML_RESTRICT s, size_t bs, const void * GGML_RESTRICT vx, size_t bx, const void * GGML_RESTRICT vy, size_t by, int nrc) {
+    printf("arm...\n");
     const int qk = QK8_0;
     const int nb = n / qk;

diff --git a/ggml/src/ggml-quants.c b/ggml/src/ggml-quants.c
index 727932123..058b0f6a6 100644
--- a/ggml/src/ggml-quants.c
+++ b/ggml/src/ggml-quants.c
@@ -197,6 +197,7 @@ void quantize_row_q5_1_ref(const float * GGML_RESTRICT x, block_q5_1 * GGML_REST
 // reference implementation for deterministic creation of model files
 void quantize_row_q8_0_ref(const float * GGML_RESTRICT x, block_q8_0 * GGML_RESTRICT y, int64_t k) {
+    printf("generic....\n");
     assert(k % QK8_0 == 0);
     const int nb = k / QK8_0;
```
I've also stepped through this in the debugger:

```console
(lldb) target create "build/bin/test-backend-ops"
Current executable set to '/Users/danbev/work/ai/llama.cpp/build/bin/test-backend-ops' (arm64).
(lldb) settings set -- target.run-args "cpu-variants" "--variant" "CPU" "-o" "MUL_MAT(type_a=q4_0,type_b=f32,m=16,n=16,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1)"
(lldb) br set -f ggml-backend.cpp -l 2035
Breakpoint 1: where = libggml-base.dylib`::ggml_backend_compare_graph_backend(ggml_backend_t, ggml_backend_t, ggml_cgraph *, ggml_backend_eval_callback, void *, ggml_tensor *) + 668 at ggml-backend.cpp:2035:40, address = 0x0000000000022734
(lldb) r
Process 58647 launched: '/Users/danbev/work/ai/llama.cpp/build/bin/test-backend-ops' (arm64)
ggml_backend_load_best: failed to find ggml_backend_score in /Users/danbev/work/ai/llama.cpp/build/bin/libggml-cpu-ref.so
load_backend: loaded CPU backend from /Users/danbev/work/ai/llama.cpp/build/bin/libggml-cpu.so
register_backend: registered backend CPU (1 devices)
register_device: registered device CPU (Apple M3)
load_backend: loaded CPU backend from /Users/danbev/work/ai/llama.cpp/build/bin/libggml-cpu-ref.so
register_backend: registered backend CPU (1 devices)
register_device: registered device CPU-ref (Apple M3)
Testing CPU variant 'CPU' against cpu-ref backend...
Process 58647 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = breakpoint 1.1
    frame #0: 0x000000010041a734 libggml-base.dylib`ggml_backend_compare_graph_backend(backend1=0x00006000014a8090, backend2=0x00006000014a8000, graph=0x0000000100224020, callback=(test-backend-ops`test_case::eval(ggml_backend*, ggml_backend*, char const*, printer*)::'lambda'(int, ggml_tensor*, ggml_tensor*, void*)::__invoke(int, ggml_tensor*, ggml_tensor*, void*) at test-backend-ops.cpp:1204), user_data=0x000000016fdfddb8, test_node=0x0000000000000000) at ggml-backend.cpp:2035:40
   2032     struct ggml_cgraph g1v = ggml_graph_view(g1, i, i + 1);
   2033     struct ggml_cgraph g2v = ggml_graph_view(g2, i, i + 1);
   2034
-> 2035     ggml_backend_graph_compute(backend1, &g1v);
   2036     ggml_backend_graph_compute(backend2, &g2v);
   2037
   2038     if (ggml_is_view_op(t1->op)) {
Target 0: (test-backend-ops) stopped.
(lldb) n
arm...
arm...
arm...
[... "arm..." printed 64 times in total ...]
Process 58647 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = step over
    frame #0: 0x000000010041a740 libggml-base.dylib`ggml_backend_compare_graph_backend(backend1=0x00006000014a8090, backend2=0x00006000014a8000, graph=0x0000000100224020, callback=(test-backend-ops`test_case::eval(ggml_backend*, ggml_backend*, char const*, printer*)::'lambda'(int, ggml_tensor*, ggml_tensor*, void*)::__invoke(int, ggml_tensor*, ggml_tensor*, void*) at test-backend-ops.cpp:1204), user_data=0x000000016fdfddb8, test_node=0x0000000000000000) at ggml-backend.cpp:2036:40
   2033     struct ggml_cgraph g2v = ggml_graph_view(g2, i, i + 1);
   2034
   2035     ggml_backend_graph_compute(backend1, &g1v);
-> 2036     ggml_backend_graph_compute(backend2, &g2v);
   2037
   2038     if (ggml_is_view_op(t1->op)) {
   2039         continue;
Target 0: (test-backend-ops) stopped.
(lldb) n
generic....
generic....
generic....
[... "generic...." printed 64 times in total ...]
Process 58647 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = step over
    frame #0: 0x000000010041a748 libggml-base.dylib`ggml_backend_compare_graph_backend(backend1=0x00006000014a8090, backend2=0x00006000014a8000, graph=0x0000000100224020, callback=(test-backend-ops`test_case::eval(ggml_backend*, ggml_backend*, char const*, printer*)::'lambda'(int, ggml_tensor*, ggml_tensor*, void*)::__invoke(int, ggml_tensor*, ggml_tensor*, void*) at test-backend-ops.cpp:1204), user_data=0x000000016fdfddb8, test_node=0x0000000000000000) at ggml-backend.cpp:2038:33
   2035     ggml_backend_graph_compute(backend1, &g1v);
   2036     ggml_backend_graph_compute(backend2, &g2v);
   2037
-> 2038     if (ggml_is_view_op(t1->op)) {
   2039         continue;
   2040     }
   2041
Target 0: (test-backend-ops) stopped.
```

I realized that this was because I had not disabled GGML_LLAMAFILE when building. Setting this to OFF then printed the output like you added.
Set GGML_SYSTEM_ARCH to cpu-ref when GGML_CPU_REF_BACKEND is enabled to force a generic backend implementation.
Disable GGML_LLAMAFILE for ref backend.
Disable HBM, OpenMP, and KleidiAI for the CPU ref backend.
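In CMake terms, the three notes above amount to force-overriding options whenever the ref backend is enabled; a rough sketch, not the exact code from the commits (the option names `GGML_CPU_HBM` and `GGML_CPU_KLEIDIAI` are assumptions here):

```cmake
if (GGML_CPU_REF_BACKEND)
    # Force a fully generic implementation: no hand-written kernels or
    # third-party acceleration may shadow the reference code paths.
    set(GGML_SYSTEM_ARCH  "cpu-ref")
    set(GGML_LLAMAFILE    OFF)
    set(GGML_OPENMP       OFF)
    set(GGML_CPU_HBM      OFF)
    set(GGML_CPU_KLEIDIAI OFF)
endif()
```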
This PR consists of three commits but the first two are part of #16004.
This commit adds support for testing the ggml-cpu repack feature, which
enables repacking quantized data into a more optimal layout for matrix
multiplication on specific hardware architectures.
The motivation is to enable testing a CPU backend that uses repacked
data against a reference CPU backend that does not use repacked data.
Building:
List available CPU architectures/variants:
Run tests:
All matrix multiplication tests can be run by specifying
`-o "MUL_MAT, MUL_MAT_ID"`.