From 98a305ccda765c8b3f709d89c029ad8a0a982800 Mon Sep 17 00:00:00 2001
From: Zhang Jason
Date: Fri, 25 Jul 2025 13:54:04 +0800
Subject: [PATCH 1/3] Update README.md to add AMD GPU instructions

---
 README.md | 16 ++++++++++++++--
 1 file changed, 14 insertions(+), 2 deletions(-)

diff --git a/README.md b/README.md
index f56a7339..03209006 100644
--- a/README.md
+++ b/README.md
@@ -52,7 +52,19 @@ See `bench.py` for benchmark.
 | vLLM      | 133,966 | 98.37 | 1361.84 |
 | Nano-vLLM | 133,966 | 93.41 | 1434.13 |
 
-
+## Run Nano-vLLM on AMD GPUs
+### CDNA3 Datacenter GPUs
+Access AMD CDNA3 datacenter GPUs through the [AMD Developer Cloud](https://www.amd.com/en/developer/resources/cloud-access/amd-developer-cloud.html). It is strongly advised to use the pre-built vLLM GPU instance, which already has Triton and the ROCm version of the flash-attention library installed. If you have other MI300 GPU resources, you can try the rocm/vllm:latest Docker image to run Nano-vLLM.
+
+### RDNA3/4 Desktop GPUs
+1) Use a pre-built ROCm Docker image, such as rocm/pytorch:latest, to set up ROCm
+2) Install the Triton-based ROCm flash-attention library
+   a. git clone --recursive https://github.com/Dao-AILab/flash-attention.git
+   b. cd flash-attention
+   c. FLASH_ATTENTION_TRITON_AMD_ENABLE="TRUE" python setup.py install
+3) Install Nano-vLLM
+4) Run the benchmark: FLASH_ATTENTION_TRITON_AMD_ENABLE="TRUE" python3 bench.py
+
 ## Star History
 
-[![Star History Chart](https://api.star-history.com/svg?repos=GeeeekExplorer/nano-vllm&type=Date)](https://www.star-history.com/#GeeeekExplorer/nano-vllm&Date)
\ No newline at end of file
+[![Star History Chart](https://api.star-history.com/svg?repos=GeeeekExplorer/nano-vllm&type=Date)](https://www.star-history.com/#GeeeekExplorer/nano-vllm&Date)

From a44042ab2813758c17137dae31890fc876bd5b05 Mon Sep 17 00:00:00 2001
From: Zhang Jason
Date: Sat, 26 Jul 2025 09:51:19 +0800
Subject: [PATCH 2/3] Update README.md

Update the formatting of the markdown file.
---
 README.md | 14 ++++++++------
 1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/README.md b/README.md
index 03209006..34e14a11 100644
--- a/README.md
+++ b/README.md
@@ -56,14 +56,16 @@ See `bench.py` for benchmark.
 ### CDNA3 Datacenter GPUs
 Access AMD CDNA3 datacenter GPUs through the [AMD Developer Cloud](https://www.amd.com/en/developer/resources/cloud-access/amd-developer-cloud.html). It is strongly advised to use the pre-built vLLM GPU instance, which already has Triton and the ROCm version of the flash-attention library installed. If you have other MI300 GPU resources, you can try the rocm/vllm:latest Docker image to run Nano-vLLM.
 
-### RDNA3/4 Desktop GPUs
+### RDNA3 Desktop GPUs
 1) Use a pre-built ROCm Docker image, such as rocm/pytorch:latest, to set up ROCm
 2) Install the Triton-based ROCm flash-attention library
-   a. git clone --recursive https://github.com/Dao-AILab/flash-attention.git
-   b. cd flash-attention
-   c. FLASH_ATTENTION_TRITON_AMD_ENABLE="TRUE" python setup.py install
-3) Install Nano-vLLM
-4) Run the benchmark: FLASH_ATTENTION_TRITON_AMD_ENABLE="TRUE" python3 bench.py
+   ```bash
+   git clone --recursive https://github.com/Dao-AILab/flash-attention.git
+   cd flash-attention
+   FLASH_ATTENTION_TRITON_AMD_ENABLE="TRUE" python setup.py install
+   ```
+3) Install Nano-vLLM
+4) Run the benchmark: FLASH_ATTENTION_TRITON_AMD_ENABLE="TRUE" python3 bench.py
 
 ## Star History

From 476d27cc4949f9bd0ff287200d74e0dc785086c8 Mon Sep 17 00:00:00 2001
From: Zhang Jason
Date: Sat, 26 Jul 2025 10:14:13 +0800
Subject: [PATCH 3/3] Update README.md

---
 README.md | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/README.md b/README.md
index 34e14a11..8b6ea76f 100644
--- a/README.md
+++ b/README.md
@@ -57,15 +57,19 @@ See `bench.py` for benchmark.
 Access AMD CDNA3 datacenter GPUs through the [AMD Developer Cloud](https://www.amd.com/en/developer/resources/cloud-access/amd-developer-cloud.html). It is strongly advised to use the pre-built vLLM GPU instance, which already has Triton and the ROCm version of the flash-attention library installed. If you have other MI300 GPU resources, you can try the rocm/vllm:latest Docker image to run Nano-vLLM.
 
 ### RDNA3 Desktop GPUs
+#### Using a pre-built ROCm Navi vLLM Docker image
+In these [ROCm vLLM Navi images](https://hub.docker.com/r/rocm/vllm-dev/tags?name=navi), the flash-attention library is already installed. You can choose the latest version for your test.
+#### Using a pre-built ROCm PyTorch Docker image
 1) Use a pre-built ROCm Docker image, such as rocm/pytorch:latest, to set up ROCm
 2) Install the Triton-based ROCm flash-attention library
-   ```bash
+```bash
    git clone --recursive https://github.com/Dao-AILab/flash-attention.git
    cd flash-attention
    FLASH_ATTENTION_TRITON_AMD_ENABLE="TRUE" python setup.py install
-   ```
+```
 3) Install Nano-vLLM
 4) Run the benchmark: FLASH_ATTENTION_TRITON_AMD_ENABLE="TRUE" python3 bench.py
+
 ## Star History
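
Taken together, the RDNA3 desktop-GPU instructions in these patches amount to the end-to-end flow sketched below. This is an illustrative sketch, not part of the diff: the docker run device flags are the usual ROCm options, and the Nano-vLLM install step (cloning the repository and running pip install .) is an assumption, since the patches do not state how Nano-vLLM itself is installed.

```bash
# Illustrative end-to-end flow for an RDNA3 desktop GPU, following the steps in
# these patches. Container flags and the Nano-vLLM install command are assumptions.

# 1) Start a pre-built ROCm container (typical ROCm device flags; adjust as needed).
docker run -it --device=/dev/kfd --device=/dev/dri --group-add video \
  rocm/pytorch:latest /bin/bash

# --- the remaining steps run inside the container ---

# 2) Build and install the Triton-based ROCm flash-attention library.
git clone --recursive https://github.com/Dao-AILab/flash-attention.git
cd flash-attention
FLASH_ATTENTION_TRITON_AMD_ENABLE="TRUE" python setup.py install
cd ..

# 3) Install Nano-vLLM from a local clone (install command assumed, not stated in the diff).
git clone https://github.com/GeeeekExplorer/nano-vllm.git
cd nano-vllm
pip install .

# 4) Run the benchmark with the Triton flash-attention backend enabled,
#    mirroring the command added to the README.
FLASH_ATTENTION_TRITON_AMD_ENABLE="TRUE" python3 bench.py
```

Note that FLASH_ATTENTION_TRITON_AMD_ENABLE is set twice on purpose: once when building flash-attention and again when running bench.py, matching the two commands the patches add to the README.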