[Docs] Fix syntax highlighting of shell commands #19870

Merged: 1 commit, Jun 23, 2025
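Every hunk in this PR makes the same substitution: fenced shell snippets in the docs switch their info string from `console` to `bash`. A minimal before/after sketch of the markup, using a command taken from one of the touched files; the rendering note assumes a Pygments-style highlighter, where `console` typically expects `$`-prefixed session transcripts and leaves bare commands unhighlighted:

````markdown
<!-- Before: the `console` lexer tends to treat un-prompted lines as plain output -->
```console
vllm serve Qwen/Qwen1.5-32B-Chat-AWQ --max-model-len 4096
```

<!-- After: the `bash` lexer highlights the command itself -->
```bash
vllm serve Qwen/Qwen1.5-32B-Chat-AWQ --max-model-len 4096
```
````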
2 changes: 1 addition & 1 deletion .buildkite/nightly-benchmarks/nightly-annotation.md
@@ -16,7 +16,7 @@ Please download the visualization scripts in the post
- Download `nightly-benchmarks.zip`.
- In the same folder, run the following code:

-```console
+```bash
export HF_TOKEN=<your HF token>
apt update
apt install -y git
12 changes: 6 additions & 6 deletions docs/deployment/docker.md
@@ -10,7 +10,7 @@ title: Using Docker
vLLM offers an official Docker image for deployment.
The image can be used to run OpenAI compatible server and is available on Docker Hub as [vllm/vllm-openai](https://hub.docker.com/r/vllm/vllm-openai/tags).

-```console
+```bash
docker run --runtime nvidia --gpus all \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--env "HUGGING_FACE_HUB_TOKEN=<secret>" \
@@ -22,7 +22,7 @@ docker run --runtime nvidia --gpus all \

This image can also be used with other container engines such as [Podman](https://podman.io/).

-```console
+```bash
podman run --gpus all \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
@@ -71,7 +71,7 @@ You can add any other [engine-args][engine-args] you need after the image tag (`

You can build and run vLLM from source via the provided <gh-file:docker/Dockerfile>. To build vLLM:

-```console
+```bash
# optionally specify: --build-arg max_jobs=8 --build-arg nvcc_threads=2
DOCKER_BUILDKIT=1 docker build . \
--target vllm-openai \
@@ -99,7 +99,7 @@ of PyTorch Nightly and should be considered **experimental**. Using the flag `--

??? Command

-```console
+```bash
# Example of building on Nvidia GH200 server. (Memory usage: ~15GB, Build time: ~1475s / ~25 min, Image size: 6.93GB)
python3 use_existing_torch.py
DOCKER_BUILDKIT=1 docker build . \
@@ -118,7 +118,7 @@ of PyTorch Nightly and should be considered **experimental**. Using the flag `--

Run the following command on your host machine to register QEMU user static handlers:

-```console
+```bash
docker run --rm --privileged multiarch/qemu-user-static --reset -p yes
```

@@ -128,7 +128,7 @@ of PyTorch Nightly and should be considered **experimental**. Using the flag `--

To run vLLM with the custom-built Docker image:

-```console
+```bash
docker run --runtime nvidia --gpus all \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-p 8000:8000 \
2 changes: 1 addition & 1 deletion docs/deployment/frameworks/anything-llm.md
@@ -15,7 +15,7 @@ It allows you to deploy a large language model (LLM) server with vLLM as the bac

- Start the vLLM server with the supported chat completion model, e.g.

-```console
+```bash
vllm serve Qwen/Qwen1.5-32B-Chat-AWQ --max-model-len 4096
```

4 changes: 2 additions & 2 deletions docs/deployment/frameworks/autogen.md
@@ -11,7 +11,7 @@ title: AutoGen

- Setup [AutoGen](https://microsoft.github.io/autogen/0.2/docs/installation/) environment

-```console
+```bash
pip install vllm

# Install AgentChat and OpenAI client from Extensions
@@ -23,7 +23,7 @@ pip install -U "autogen-agentchat" "autogen-ext[openai]"

- Start the vLLM server with the supported chat completion model, e.g.

-```console
+```bash
python -m vllm.entrypoints.openai.api_server \
--model mistralai/Mistral-7B-Instruct-v0.2
```
6 changes: 3 additions & 3 deletions docs/deployment/frameworks/cerebrium.md
@@ -11,14 +11,14 @@ vLLM can be run on a cloud based GPU machine with [Cerebrium](https://www.cerebr

To install the Cerebrium client, run:

-```console
+```bash
pip install cerebrium
cerebrium login
```

Next, to create your Cerebrium project, run:

-```console
+```bash
cerebrium init vllm-project
```

@@ -58,7 +58,7 @@ Next, let us add our code to handle inference for the LLM of your choice (`mistr

Then, run the following code to deploy it to the cloud:

-```console
+```bash
cerebrium deploy
```

2 changes: 1 addition & 1 deletion docs/deployment/frameworks/chatbox.md
@@ -15,7 +15,7 @@ It allows you to deploy a large language model (LLM) server with vLLM as the bac

- Start the vLLM server with the supported chat completion model, e.g.

-```console
+```bash
vllm serve qwen/Qwen1.5-0.5B-Chat
```

4 changes: 2 additions & 2 deletions docs/deployment/frameworks/dify.md
@@ -18,13 +18,13 @@ This guide walks you through deploying Dify using a vLLM backend.

- Start the vLLM server with the supported chat completion model, e.g.

-```console
+```bash
vllm serve Qwen/Qwen1.5-7B-Chat
```

- Start the Dify server with docker compose ([details](https://github.com/langgenius/dify?tab=readme-ov-file#quick-start)):

-```console
+```bash
git clone https://github.com/langgenius/dify.git
cd dify
cd docker
4 changes: 2 additions & 2 deletions docs/deployment/frameworks/dstack.md
@@ -11,14 +11,14 @@ vLLM can be run on a cloud based GPU machine with [dstack](https://dstack.ai/),

To install dstack client, run:

-```console
+```bash
pip install "dstack[all]
dstack server
```

Next, to configure your dstack project, run:

-```console
+```bash
mkdir -p vllm-dstack
cd vllm-dstack
dstack init
4 changes: 2 additions & 2 deletions docs/deployment/frameworks/haystack.md
@@ -13,15 +13,15 @@ It allows you to deploy a large language model (LLM) server with vLLM as the bac

- Setup vLLM and Haystack environment

-```console
+```bash
pip install vllm haystack-ai
```

## Deploy

- Start the vLLM server with the supported chat completion model, e.g.

-```console
+```bash
vllm serve mistralai/Mistral-7B-Instruct-v0.1
```

4 changes: 2 additions & 2 deletions docs/deployment/frameworks/helm.md
@@ -22,15 +22,15 @@ Before you begin, ensure that you have the following:

To install the chart with the release name `test-vllm`:

-```console
+```bash
helm upgrade --install --create-namespace --namespace=ns-vllm test-vllm . -f values.yaml --set secrets.s3endpoint=$ACCESS_POINT --set secrets.s3bucketname=$BUCKET --set secrets.s3accesskeyid=$ACCESS_KEY --set secrets.s3accesskey=$SECRET_KEY
```

## Uninstalling the Chart

To uninstall the `test-vllm` deployment:

-```console
+```bash
helm uninstall test-vllm --namespace=ns-vllm
```

6 changes: 3 additions & 3 deletions docs/deployment/frameworks/litellm.md
@@ -18,7 +18,7 @@ And LiteLLM supports all models on VLLM.

- Setup vLLM and litellm environment

-```console
+```bash
pip install vllm litellm
```

@@ -28,7 +28,7 @@ pip install vllm litellm

- Start the vLLM server with the supported chat completion model, e.g.

-```console
+```bash
vllm serve qwen/Qwen1.5-0.5B-Chat
```

@@ -56,7 +56,7 @@ vllm serve qwen/Qwen1.5-0.5B-Chat

- Start the vLLM server with the supported embedding model, e.g.

-```console
+```bash
vllm serve BAAI/bge-base-en-v1.5
```

4 changes: 2 additions & 2 deletions docs/deployment/frameworks/open-webui.md
@@ -7,13 +7,13 @@ title: Open WebUI

2. Start the vLLM server with the supported chat completion model, e.g.

-```console
+```bash
vllm serve qwen/Qwen1.5-0.5B-Chat
```

1. Start the [Open WebUI](https://github.com/open-webui/open-webui) docker container (replace the vllm serve host and vllm serve port):

-```console
+```bash
docker run -d -p 3000:8080 \
--name open-webui \
-v open-webui:/app/backend/data \
12 changes: 6 additions & 6 deletions docs/deployment/frameworks/retrieval_augmented_generation.md
@@ -15,7 +15,7 @@ Here are the integrations:

- Setup vLLM and langchain environment

-```console
+```bash
pip install -U vllm \
langchain_milvus langchain_openai \
langchain_community beautifulsoup4 \
@@ -26,14 +26,14 @@ pip install -U vllm \

- Start the vLLM server with the supported embedding model, e.g.

-```console
+```bash
# Start embedding service (port 8000)
vllm serve ssmits/Qwen2-7B-Instruct-embed-base
```

- Start the vLLM server with the supported chat completion model, e.g.

-```console
+```bash
# Start chat service (port 8001)
vllm serve qwen/Qwen1.5-0.5B-Chat --port 8001
```
@@ -52,7 +52,7 @@ python retrieval_augmented_generation_with_langchain.py

- Setup vLLM and llamaindex environment

-```console
+```bash
pip install vllm \
llama-index llama-index-readers-web \
llama-index-llms-openai-like \
@@ -64,14 +64,14 @@ pip install vllm \

- Start the vLLM server with the supported embedding model, e.g.

-```console
+```bash
# Start embedding service (port 8000)
vllm serve ssmits/Qwen2-7B-Instruct-embed-base
```

- Start the vLLM server with the supported chat completion model, e.g.

-```console
+```bash
# Start chat service (port 8001)
vllm serve qwen/Qwen1.5-0.5B-Chat --port 8001
```
16 changes: 8 additions & 8 deletions docs/deployment/frameworks/skypilot.md
@@ -15,7 +15,7 @@ vLLM can be **run and scaled to multiple service replicas on clouds and Kubernet
- Check that you have installed SkyPilot ([docs](https://skypilot.readthedocs.io/en/latest/getting-started/installation.html)).
- Check that `sky check` shows clouds or Kubernetes are enabled.

-```console
+```bash
pip install skypilot-nightly
sky check
```
@@ -71,7 +71,7 @@ See the vLLM SkyPilot YAML for serving, [serving.yaml](https://github.com/skypil

Start serving the Llama-3 8B model on any of the candidate GPUs listed (L4, A10g, ...):

-```console
+```bash
HF_TOKEN="your-huggingface-token" sky launch serving.yaml --env HF_TOKEN
```

@@ -83,7 +83,7 @@ Check the output of the command. There will be a shareable gradio link (like the

**Optional**: Serve the 70B model instead of the default 8B and use more GPU:

-```console
+```bash
HF_TOKEN="your-huggingface-token" \
sky launch serving.yaml \
--gpus A100:8 \
@@ -159,15 +159,15 @@ SkyPilot can scale up the service to multiple service replicas with built-in aut

Start serving the Llama-3 8B model on multiple replicas:

-```console
+```bash
HF_TOKEN="your-huggingface-token" \
sky serve up -n vllm serving.yaml \
--env HF_TOKEN
```

Wait until the service is ready:

-```console
+```bash
watch -n10 sky serve status vllm
```

@@ -271,13 +271,13 @@ This will scale the service up to when the QPS exceeds 2 for each replica.

To update the service with the new config:

-```console
+```bash
HF_TOKEN="your-huggingface-token" sky serve update vllm serving.yaml --env HF_TOKEN
```

To stop the service:

-```console
+```bash
sky serve down vllm
```

@@ -317,7 +317,7 @@ It is also possible to access the Llama-3 service with a separate GUI frontend,

1. Start the chat web UI:

-```console
+```bash
sky launch \
-c gui ./gui.yaml \
--env ENDPOINT=$(sky serve status --endpoint vllm)
6 changes: 3 additions & 3 deletions docs/deployment/frameworks/streamlit.md
@@ -15,21 +15,21 @@ It can be quickly integrated with vLLM as a backend API server, enabling powerfu

- Start the vLLM server with the supported chat completion model, e.g.

-```console
+```bash
vllm serve qwen/Qwen1.5-0.5B-Chat
```

- Install streamlit and openai:

-```console
+```bash
pip install streamlit openai
```

- Use the script: <gh-file:examples/online_serving/streamlit_openai_chatbot_webserver.py>

- Start the streamlit web UI and start to chat:

-```console
+```bash
streamlit run streamlit_openai_chatbot_webserver.py

# or specify the VLLM_API_BASE or VLLM_API_KEY
2 changes: 1 addition & 1 deletion docs/deployment/integrations/llamastack.md
@@ -7,7 +7,7 @@ vLLM is also available via [Llama Stack](https://github.com/meta-llama/llama-sta

To install Llama Stack, run

-```console
+```bash
pip install llama-stack -q
```
