Speed Optimization tips
CUDA: Optimizing inference speed
- The optimal thread count is between 1 and 8. The system will try to choose a good number if you do not specify `-t`.
- For large prompts, a larger n_batch can speed up processing 10-20 times, but it requires an additional 500-1700 MB of VRAM: use `-b 512` (a combined example follows this list).
- Batch sizes up to 8k have been tested, but RAM usage increases quadratically with tokens and batch size.
- Multi-GPU systems can benefit from single-GPU processing when the model is small enough: use `--override-max-gpu 1`.
- Multi-GPU systems with different GPUs benefit from custom tensor splitting to load one GPU more heavily. To put more weight on the 2nd GPU: `--tensor-split 1,3 -mg 1`.
- Need to squeeze a model into VRAM but 1-2 layers don't fit? Try `--gpu-reserve-mb-main 1` to reduce the reserved VRAM to 1 MB; you can use negative numbers to force VRAM swapping.
- Wish to reduce VRAM usage and offload fewer layers? Use `-ngl 10` to load only 10 layers.
- Want to dive into details? Use `--debug-timings <1,2,3>` to get detailed statistics on the performance of each operation, how and where it was performed, and its total impact.
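For illustration, here is a sketch of how these flags can be combined on a dual-GPU system. The binary name `falcon_main`, the model paths, the prompt, and the concrete numbers are placeholders, not recommendations; adjust them to your own hardware and model.

```sh
# Hypothetical dual-GPU setup where the 2nd GPU (index 1) is the stronger card.
# Large prompt: big batch (-b 512), 2nd GPU weighted 3:1 and used as main GPU (-mg 1),
# reserved VRAM shrunk to 1 MB to squeeze in one or two extra layers.
./falcon_main -m models/falcon-40b.bin -t 6 -b 512 \
    --tensor-split 1,3 -mg 1 --gpu-reserve-mb-main 1 \
    -p "Once upon a time"

# Small model: restrict processing to a single GPU, and offload only 10 layers to save VRAM.
./falcon_main -m models/falcon-7b.bin --override-max-gpu 1 -ngl 10 -p "Hello"

# Detailed per-operation timing statistics (verbosity levels 1-3).
./falcon_main -m models/falcon-7b.bin --debug-timings 2 -p "Hello"
```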
CPU Optimization
- Use fewer threads (`-t x`) than you have physical cores.
- If your CPU has hybrid cores (such as Intel's Atom-based efficiency cores), your settings depend a lot on your workload and configuration. In most cases you will get the best performance with a thread count close to your number of performance cores.
- You can compile with OpenBLAS if you are ready for some hassle. This will boost the speed of batched prompt ingestion (`-b x`), so if you handle large prompts on CPU only, consider that route.
- For large prompts, experiment with batched processing; most CPUs will work fine with `-b 16` to `-b 64` (see the sketch below).
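As a sketch, a CPU-only run on a machine with 8 physical cores could look like the following; again, the binary and model names are placeholders.

```sh
# Hypothetical CPU-only run on 8 physical cores: fewer threads than cores (-t 6)
# and a moderate batch size for prompt processing (-b 32); -ngl 0 keeps all layers on the CPU.
./falcon_main -m models/falcon-7b.bin -t 6 -b 32 -ngl 0 -p "Once upon a time"
```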