Speed Optimization tips
CUDA: Optimizing inference speed
- The optimal thread count is between 1 and 8. The system will try to choose a good number if you do not specify `-t`.
- For large prompts, a larger n_batch can speed up processing 10-20 times, but it requires an additional 500-1700 MB of VRAM: use `-b 512` (a combined example follows this list).
- Batch sizes up to 8k have been tested, but RAM usage increases quadratically with tokens and batch size.
- Multi-GPU systems can benefit from single-GPU processing when the model is small enough: use `--override-max-gpu 1`.
- Multi-GPU systems with different GPUs benefit from custom tensor splitting to load one GPU more heavily. To put more weight on the 2nd GPU: `--tensor-split 1,3 -mg 1`.
- Need to squeeze a model into VRAM but 1-2 layers don't fit? Try `--gpu-reserve-mb-main 1` to reduce the reserved VRAM to 1 MB; you can use negative numbers to force VRAM swapping.
- Wish to reduce VRAM usage and offload fewer layers? Use `-ngl 10` to load only 10 layers.
- Want to dive into details? Use `--debug-timings <1,2,3>` to get detailed statistics on the performance of each operation, how and where it was performed, and its total impact.
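For illustration, here is a sketch of how these flags can be combined on a dual-GPU system. The binary name `falcon_main`, the model paths, the prompt, and the concrete numbers are placeholders, not recommendations; adjust them to your own hardware and model.

```sh
# Hypothetical dual-GPU setup where the 2nd GPU (index 1) is the stronger card.
# Large prompt: big batch (-b 512), 2nd GPU weighted 3:1 and used as main GPU (-mg 1),
# reserved VRAM shrunk to 1 MB to squeeze in one or two extra layers.
./falcon_main -m models/falcon-40b.bin -t 6 -b 512 \
    --tensor-split 1,3 -mg 1 --gpu-reserve-mb-main 1 \
    -p "Once upon a time"

# Small model: restrict processing to a single GPU, and offload only 10 layers to save VRAM.
./falcon_main -m models/falcon-7b.bin --override-max-gpu 1 -ngl 10 -p "Hello"

# Detailed per-operation timing statistics (verbosity levels 1-3).
./falcon_main -m models/falcon-7b.bin --debug-timings 2 -p "Hello"
```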
CPU Optimization
- Use fewer threads (`-t x`) than you have physical cores.
- If your CPU has hybrid cores (such as Intel's Atom-based efficiency cores), your settings depend a lot on your workload and configuration. In most cases you will get the best performance with a thread count close to your number of performance cores.
- You can compile with OpenBLAS if you are ready for some hassle. This will boost the speed of batched prompt ingestion (`-b x`), so if you handle large prompts on CPU only, consider that route.
- For large prompts, experiment with batched processing; most CPUs will work fine with `-b 16` to `-b 64` (see the sketch below).
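As a sketch, a CPU-only run on a machine with 8 physical cores could look like the following; again, the binary and model names are placeholders.

```sh
# Hypothetical CPU-only run on 8 physical cores: fewer threads than cores (-t 6)
# and a moderate batch size for prompt processing (-b 32); -ngl 0 keeps all layers on the CPU.
./falcon_main -m models/falcon-7b.bin -t 6 -b 32 -ngl 0 -p "Once upon a time"
```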