llama.cpp vs vllm performance comparison #15180
JohannesGaessler started this conversation in Show and tell
Replies: 1 comment, 3 replies
- What is FlashInfer and why wasn't it used?
I benchmarked llama.cpp vs. vllm. The TL;DR is that, in the space I tested, llama.cpp needed 93.6-100.2% of the time vllm needed to finish a request with a single parallel request, or 99.2-125.6% of the time with 16 parallel requests.
Methodology
Using the newly updated `scripts/server-bench.py` I sent parallel requests to the llama.cpp server (with `LLAMA_SET_ROWS=1`) or the vllm server (without FlashInfer) using the OAI-compatible APIs. Both servers served Qwen 2.5 Instruct 3B, vllm with BF16, llama.cpp with FP16 (because the support for BF16 still has some issues; vllm seems to be the same speed for FP16 and BF16). The hardware was single RTX 4090s frequency-limited to 1350 MHz. Each server received a fixed number of concurrent requests with a fixed number of prompt tokens and generation tokens. For 1/16 concurrent requests a maximum context of 32768/26624 tokens was probed. The number of requests per run was 32 × the number of parallel requests, and for each datapoint 6 independent runs were averaged. To separate the effects of the prompt length and the generation length, the following function was fit to the data:

$$\mathrm{rt}(n_p, n_g) = p_0 n_p + p_c \frac{n_p^2}{2} + g_0 n_g + g_c \, n_g \left( n_p + \frac{n_g}{2} \right)$$

where $\mathrm{rt}$ is the runtime in seconds, $n_p$ / $n_g$ are the numbers of prompt/generation tokens, $p_0$ / $g_0$ are the base runtimes per prompt/generation token, and $p_c$ / $g_c$ are the runtimes per prompt/generation token and per unit of context depth. In effect this function fits a runtime with a constant part per token (weights) and a part proportional to context depth (attention). Fits were done using kafe2.
Commands
Results
Data
Fit
1 concurrent request:
16 concurrent requests:
The runtimes are overall relatively close. I think the llama.cpp performance for 16 parallel requests could be improved by reducing the constant runtime per generated token. One thing that could be done is to move some of the samplers like top-k, top-p, and min-p into the ggml graph in order to cut down the number of token candidates before passing them to the rest of the sampler chain (see the sketch below). More operation fusion and using FP16/BF16 for the ggml graphs would probably also help. I'm not sure how reliable the estimates of the runtime per token and per context depth are, since the model is a poor fit to the vllm data; sadly I was not able to probe vllm at very deep contexts because the CUDA backend would crash.
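To illustrate the sampler idea, below is a plain NumPy sketch of the kind of candidate reduction that top-k, top-p, and min-p perform; it is not llama.cpp's actual sampler code, and the filter order and thresholds are only illustrative.

```python
# Illustrative top-k / top-p / min-p candidate reduction over a logit vector.
# This only demonstrates how much these filters shrink the candidate set
# before the rest of a sampler chain would run.
import numpy as np

def reduce_candidates(logits, top_k=40, top_p=0.95, min_p=0.05):
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]                  # candidates sorted by probability

    keep = order[:top_k]                             # top-k: keep the k most likely tokens

    cum = np.cumsum(probs[keep])                     # top-p: keep the smallest prefix whose
    keep = keep[: np.searchsorted(cum, top_p) + 1]   # cumulative probability reaches top_p

    keep = keep[probs[keep] >= min_p * probs[keep[0]]]  # min-p: threshold relative to the max
    return keep                                      # token ids surviving all three filters

vocab = 32000
logits = np.random.default_rng(0).normal(size=vocab)
survivors = reduce_candidates(logits)
print(f"{vocab} candidates reduced to {len(survivors)}")
```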