Speed Optimization tips

CUDA: Optimizing inference speed

  • The optimal thread count is usually between 1 and 8. If you do not specify -t, the system tries to pick a good value automatically.
  • For large prompts, a bigger n_batch can speed up prompt processing 10-20x, at the cost of an extra 500-1700 MB of VRAM. Example: -b 512 (see the command sketch after this list).
  • Batch sizes of up to 8k have been tested, but RAM usage grows quadratically with token count and batch size.
  • Multi-GPU systems can benefit from single-GPU processing when the model is small enough to fit on one card. Example: --override-max-gpu 1
  • Multi-GPU systems with different GPUs benefit from a custom tensor split that loads one GPU more heavily. To weight the 2nd GPU more strongly: --tensor-split 1,3 -mg 1 (see the multi-GPU sketch after this list).
  • Need to squeeze a model into VRAM but 1-2 layers don't fit? Try --gpu-reserve-mb-main 1 to reduce the reserved VRAM to 1 MB; negative values force VRAM swapping.
  • Want to reduce VRAM usage by offloading fewer layers? Use -ngl 10 to offload only 10 layers.
  • Want to dive into the details? Use --debug-timings <1,2,3> to get detailed statistics on the performance of each operation, how and where it was executed, and its total impact.
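
To make the single-GPU flags above concrete, here is a sketch of a combined invocation. The binary name (falcon_main), model path and prompt are placeholders for this example and depend on your build; the flags themselves are the ones described in the list above.

    # Large batch for fast prompt ingestion, only 1 MB of VRAM kept in reserve,
    # and per-operation timing statistics (binary and model names are assumed):
    ./falcon_main -m models/falcon-40b.bin -b 512 --gpu-reserve-mb-main 1 --debug-timings 1 -p "Your prompt"

If the model still does not fit, lower the number of offloaded layers (e.g. -ngl 10) rather than shrinking the reserve further.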
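For multi-GPU setups, a similar sketch of the two options mentioned above; again the binary, model and prompt are placeholders:

    # Small model: keep processing on a single GPU
    ./falcon_main -m models/falcon-7b.bin --override-max-gpu 1 -p "Your prompt"

    # Mismatched GPUs: put roughly 3/4 of the tensors on the 2nd GPU and make it the main GPU
    ./falcon_main -m models/falcon-40b.bin --tensor-split 1,3 -mg 1 -p "Your prompt"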

CPU Optimization

  • Use fewer threads (-t x) than you have physical cores.
  • If your CPU has hybrid cores (such as Intel's Atom-based efficiency cores), the best settings depend a lot on your workload and configuration. In most cases you will get the best performance with a thread count close to the number of performance cores.
  • You can compile with OpenBLAS if you are prepared for some hassle. This boosts the speed of batched prompt ingestion (-b x), so if you process large prompts on CPU only, consider that route.
  • For large prompts, experiment with batched processing; most CPUs work fine with -b 16 to -b 64 (a CPU-only sketch follows this list).
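
A CPU-only sketch combining the tips above, assuming a machine with 8 physical cores; the binary, model path and prompt are placeholders:

    # 6 threads (fewer than the 8 physical cores) and a moderate batch size for prompt ingestion
    ./falcon_main -m models/falcon-7b.bin -t 6 -b 32 -p "Your prompt"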