
Commit e7944d1

Docs: Updates Benchmark Guide (#789)
Signed-off-by: Daneyon Hansen <[email protected]>
1 parent 35d7f64 commit e7944d1

4 files changed (+44, -29 lines)


config/manifests/benchmark/benchmark.yaml (+1, -1)
@@ -37,7 +37,7 @@ spec:
         - name: BACKEND
           value: vllm
         - name: PORT
-          value: "8081"
+          value: "80"
         - name: INPUT_LENGTH
           value: "1024"
         - name: OUTPUT_LENGTH
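
The new PORT value lines up with the k8s Service exposed on port 80 later in this change. A quick way to confirm the port on a live cluster (the service name `vllm-llama3-8b-instruct` is taken from other commands in this commit, so treat it as an assumption) is:

```bash
# Print the port of the baseline Service the benchmark will target (expected: 80).
kubectl get service vllm-llama3-8b-instruct -o jsonpath='{.spec.ports[0].port}'
```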

site-src/guides/index.md (+1, -1)
@@ -76,7 +76,7 @@ This quickstart guide is intended for engineers familiar with k8s and model serv
    kubectl apply -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/raw/main/config/manifests/inferencemodel.yaml
    ```
 
-### Deploy the InferencePool and Extension
+### Deploy the InferencePool and Endpoint Picker Extension
 
 ```bash
 kubectl apply -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/raw/main/config/manifests/inferencepool-resources.yaml
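
After applying the manifest above, a minimal sanity check (assuming the quickstart CRDs register the `inferencepools` resource; the exact pool and pod names are not shown in this diff) might be:

```bash
# Confirm the InferencePool resource exists and the endpoint picker pods are up.
kubectl get inferencepools
kubectl get pods
```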
(+40, -25)

@@ -1,45 +1,49 @@
 # Benchmark
 
-This user guide shows how to run benchmarks against a vLLM deployment, by using both the Gateway API
-inference extension, and a Kubernetes service as the load balancing strategy. The
-benchmark uses the [Latency Profile Generator](https://github.com/AI-Hypercomputer/inference-benchmark) (LPG)
-tool to generate load and collect results.
+This user guide shows how to run benchmarks against a vLLM model server deployment by using both Gateway API
+Inference Extension, and a Kubernetes service as the load balancing strategy. The benchmark uses the
+[Latency Profile Generator](https://github.com/AI-Hypercomputer/inference-benchmark) (LPG) tool to generate
+load and collect results.
 
 ## Prerequisites
 
 ### Deploy the inference extension and sample model server
 
-Follow this user guide https://gateway-api-inference-extension.sigs.k8s.io/guides/ to deploy the
-sample vLLM application, and the inference extension.
+Follow the [getting started guide](https://gateway-api-inference-extension.sigs.k8s.io/guides/#getting-started-with-gateway-api-inference-extension)
+to deploy the vLLM model server, CRDs, etc.
+
+__Note:__ Only the GPU-based model server deployment option is supported for benchmark testing.
 
 ### [Optional] Scale the sample vLLM deployment
 
-You will more likely to see the benefits of the inference extension when there are a decent number of replicas to make the optimal routing decision.
+You are more likely to see the benefits of the inference extension when there are a decent number of replicas to make the optimal routing decision.
 
 ```bash
-kubectl scale --replicas=8 -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/raw/main/config/manifests/vllm/gpu-deployment.yaml
+kubectl scale deployment vllm-llama3-8b-instruct --replicas=8
 ```
 
 ### Expose the model server via a k8s service
 
-As the baseline, let's also expose the vLLM deployment as a k8s service:
+To establish a baseline, expose the vLLM deployment as a k8s service:
 
 ```bash
-kubectl expose -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/raw/main/config/manifests/vllm/gpu-deployment.yaml --port=8081 --target-port=8000 --type=LoadBalancer
+kubectl expose deployment vllm-llama3-8b-instruct --port=80 --target-port=8000 --type=LoadBalancer
 ```
 
 ## Run benchmark
 
-The LPG benchmark tool works by sending traffic to the specified target IP and port, and collect results. Follow the steps below to run a single benchmark. You can deploy multiple LPG instances if you want to run benchmarks in parallel against different targets.
+The LPG benchmark tool works by sending traffic to the specified target IP and port, and collecting the results.
+Follow the steps below to run a single benchmark. Multiple LPG instances can be deployed to run benchmarks in
+parallel against different targets.
 
 1. Check out the repo.
-
+
    ```bash
    git clone https://github.com/kubernetes-sigs/gateway-api-inference-extension
    cd gateway-api-inference-extension
    ```
 
-1. Get the target IP. Examples below show how to get the IP of a gateway or a LoadBalancer k8s service.
+1. Get the target IP. The examples below show how to get the IP of a gateway or a k8s service.
 
    ```bash
    # Get gateway IP
@@ -51,32 +55,43 @@ The LPG benchmark tool works by sending traffic to the specified target IP and p
    echo $SVC_IP
    ```
 
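
The IP lookup commands themselves fall outside the changed lines above. For reference, they typically look like the following sketch (the gateway name `inference-gateway` is an assumption based on the quickstart; the service name matches the `kubectl expose` command earlier in this diff):

```bash
# Address of the Gateway fronting the inference extension.
GW_IP=$(kubectl get gateway/inference-gateway -o jsonpath='{.status.addresses[0].value}')

# LoadBalancer IP of the baseline k8s Service.
SVC_IP=$(kubectl get service/vllm-llama3-8b-instruct -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
```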
-1. Then update the `<target-ip>` in `./config/manifests/benchmark/benchmark.yaml` to your target IP. Feel free to adjust other parameters such as request_rates as well. For a complete list of LPG configurations, pls refer to the [LPG user guide](https://github.com/AI-Hypercomputer/inference-benchmark?tab=readme-ov-file#configuring-the-benchmark).
+1. Then update the `<target-ip>` in `./config/manifests/benchmark/benchmark.yaml` to the value of `$SVC_IP` or `$GW_IP`.
+   Feel free to adjust other parameters such as `request_rates` as well. For a complete list of LPG configurations, refer to the
+   [LPG user guide](https://github.com/AI-Hypercomputer/inference-benchmark?tab=readme-ov-file#configuring-the-benchmark).
 
-1. Start the benchmark tool. `kubectl apply -f ./config/manifests/benchmark/benchmark.yaml`
+1. Start the benchmark tool.
 
-1. Wait for benchmark to finish and download the results. Use the `benchmark_id` environment variable
-to specify what this benchmark is for. For instance, `inference-extension` or `k8s-svc`. When the LPG tool finishes benchmarking, it will print a log line `LPG_FINISHED`,
-the script below will watch for that log line and then start downloading results.
+   ```bash
+   kubectl apply -f ./config/manifests/benchmark/benchmark.yaml
+   ```
+
+1. Wait for benchmark to finish and download the results. Use the `benchmark_id` environment variable to specify what this
+   benchmark is for. For instance, `inference-extension` or `k8s-svc`. When the LPG tool finishes benchmarking, it will print
+   a log line `LPG_FINISHED`. The script below will watch for that log line and then start downloading results.
 
    ```bash
-   benchmark_id='my-benchmark' ./tools/benchmark/download-benchmark-results.bash
+   benchmark_id='k8s-svc' ./tools/benchmark/download-benchmark-results.bash
    ```
-1. After the script finishes, you should see benchmark results under `./tools/benchmark/output/default-run/my-benchmark/results/json` folder. Here is a [sample json file](./sample.json).
+
+   After the script finishes, you should see benchmark results under `./tools/benchmark/output/default-run/k8s-svc/results/json` folder.
+   Here is a [sample json file](./sample.json). Replace `k8s-svc` with `inference-extension` when running an inference extension benchmark.
 
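
Taken together, the steps above are run once per target. A sketch of collecting both result sets (assuming `benchmark.yaml` is re-pointed at the other target and re-applied between passes) is:

```bash
# First pass: benchmark.yaml targets $SVC_IP (the baseline Service).
benchmark_id='k8s-svc' ./tools/benchmark/download-benchmark-results.bash

# Second pass: re-point benchmark.yaml at $GW_IP, re-apply it, then collect under a new id.
benchmark_id='inference-extension' ./tools/benchmark/download-benchmark-results.bash
```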
 ### Tips
 
+* When using a `benchmark_id` other than `k8s-svc` or `inference-extension`, the labels in `./tools/benchmark/benchmark.ipynb` must be
+  updated accordingly to analyze the results.
 * You can specify `run_id="runX"` environment variable when running the `./download-benchmark-results.bash` script.
   This is useful when you run benchmarks multiple times to get a more statistically meaningful results and group the results accordingly.
 * Update the `request_rates` that best suit your benchmark environment.
 
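
Combining the two environment variables mentioned in the tips might look like this (`run2` is an arbitrary id chosen for illustration):

```bash
# Group a repeat benchmark under its own run id so the notebook can compare runs.
run_id='run2' benchmark_id='inference-extension' ./tools/benchmark/download-benchmark-results.bash
```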
 ### Advanced Benchmark Configurations
 
-Pls refer to the [LPG user guide](https://github.com/AI-Hypercomputer/inference-benchmark?tab=readme-ov-file#configuring-the-benchmark) for a detailed list of configuration knobs.
+Refer to the [LPG user guide](https://github.com/AI-Hypercomputer/inference-benchmark?tab=readme-ov-file#configuring-the-benchmark) for a
+detailed list of configuration knobs.
 
 ## Analyze the results
 
-This guide shows how to run the jupyter notebook using vscode.
+This guide shows how to run the jupyter notebook using vscode after completing k8s service and inference extension benchmarks.
 
 1. Create a python virtual environment.
 
@@ -92,6 +107,6 @@ This guide shows how to run the jupyter notebook using vscode.
    ```
 
 1. Open the notebook `./tools/benchmark/benchmark.ipynb`, and run each cell. At the end you should
-   see a bar chart like below where **"ie"** represents inference extension. This chart is generated using this benchmarking tool with 6 vLLM (v1) model servers (H100 80 GB), [llama2-7b](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf/tree/main) and the [ShareGPT dataset](https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json).
-
-![alt text](example-bar-chart.png)
+   see a bar chart like below where __"ie"__ represents inference extension. This chart is generated using this benchmarking tool with 6 vLLM (v1) model servers (H100 80 GB), [llama2-7b](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf/tree/main) and the [ShareGPT dataset](https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json).
+
+   ![alt text](example-bar-chart.png)
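
The virtual-environment commands under "Create a python virtual environment" sit outside the changed lines and are not shown here. For orientation only, a typical sequence (package choice is a guess, not part of this commit) is:

```bash
# Create and activate a virtual environment, then install the notebook tooling.
python3 -m venv .venv
source .venv/bin/activate
pip install jupyter
```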

tools/benchmark/benchmark.ipynb (+2, -2)
@@ -21,7 +21,7 @@
 "#@title Configuration. Edit this before running the rest.\n",
 "\n",
 "OUTPUT_DIR='output'\n",
-"RUN_ID='example-run'\n",
+"RUN_ID='default-run'\n",
 "# Path to the benchmark dir under `gateway-api-inference-extension/benchmark`\n",
 "BENCHMARK_DIR =\"./\"\n",
 "# A regex to match the model name, which matches the output file name.\n",
@@ -229,7 +229,7 @@
 " plot_func(curAx, m)\n",
 " return fig, axes\n",
 "\n",
-"def plot_bar(labels, groups, metrics=CORE_METRICS, num_plots_per_row=NUM_PLOTS_PER_ROW, interactive=INTERACTIVE_PLOT, annotate=False):\n",
+"def plot_bar(labels, groups, metrics=CORE_METRICS, num_plots_per_row=NUM_PLOTS_PER_ROW, interactive=False, annotate=False):\n",
 " labels = [label.alias for label in labels]\n",
 " logger.debug(f'Prnting bar chart for {labels}')\n",
 " logger.debug(f'groups: {groups}')\n",
