Description
Anything you want to discuss about vllm.
I am currently testing vLLM on an AMD GPU. As recommended in the official documentation, I use a Docker image and container to run vLLM with ROCm. I am running benchmark tests against the vLLM server with the Llama 3 8B Instruct model, using the benchmark_serving.py script provided by vLLM.
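For reference, the setup is roughly the following (the model name, dataset, and flags here are approximations of what I run, not the exact values):

```bash
# Start the OpenAI-compatible vLLM server inside the ROCm container
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Meta-Llama-3-8B-Instruct

# In a second shell, run the serving benchmark shipped with vLLM
python benchmarks/benchmark_serving.py \
    --backend vllm \
    --model meta-llama/Meta-Llama-3-8B-Instruct \
    --dataset-name sharegpt \
    --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json \
    --num-prompts 1000
```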
I noticed that when I deploy a new container and run the benchmark, I get a TPS of around 250. However, if I stop the vLLM server and start it again inside the same container, the second benchmark run with identical inputs gives a TPS of around 1000. On the other hand, if I remove the container and deploy a new one, I am back to around 250 TPS. Has anybody else seen this behavior?
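Concretely, the sequence looks roughly like this (image and container names are placeholders, and the ROCm device/permission flags are abbreviated):

```bash
# 1) Deploy a fresh container from the ROCm vLLM image
docker run -it --device /dev/kfd --device /dev/dri <vllm-rocm-image>
#    start the server, run the benchmark            -> ~250 TPS

# 2) In the same container: stop the vLLM server, start it again,
#    and re-run the benchmark with identical inputs -> ~1000 TPS

# 3) Remove the container and deploy a new one from the same image
docker rm -f <container-name>
docker run -it --device /dev/kfd --device /dev/dri <vllm-rocm-image>
#    start the server, run the benchmark            -> ~250 TPS again
```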
My guess is that it has something to do with the values used for PagedAttention being cached somewhere inside the container, and that they are only properly reset when the container is removed. If anyone can provide some insight, it would be very helpful.
Before submitting a new issue...
- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.