Docs: Updates Benchmark Guide #789


Merged
merged 1 commit into from
May 7, 2025
2 changes: 1 addition & 1 deletion config/manifests/benchmark/benchmark.yaml
@@ -37,7 +37,7 @@ spec:
- name: BACKEND
value: vllm
- name: PORT
value: "8081"
value: "80"
Reviewer comment (Contributor): Great! 8081 was the port back then when we had envoy patches.

- name: INPUT_LENGTH
value: "1024"
- name: OUTPUT_LENGTH
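
If you want to confirm the change locally, a quick check (run from the repo root; nothing assumed beyond the file shown in this diff) is:

```bash
# Verify the benchmark manifest now targets port 80 rather than the old
# envoy-patch-era port 8081.
grep -A1 'name: PORT' config/manifests/benchmark/benchmark.yaml
# expected output:
#   - name: PORT
#     value: "80"
```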
2 changes: 1 addition & 1 deletion site-src/guides/index.md
@@ -76,7 +76,7 @@ This quickstart guide is intended for engineers familiar with k8s and model serv
kubectl apply -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/raw/main/config/manifests/inferencemodel.yaml
```

### Deploy the InferencePool and Extension
### Deploy the InferencePool and Endpoint Picker Extension

```bash
kubectl apply -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/raw/main/config/manifests/inferencepool-resources.yaml
65 changes: 40 additions & 25 deletions site-src/performance/benchmark/index.md
@@ -1,45 +1,49 @@
# Benchmark

This user guide shows how to run benchmarks against a vLLM deployment, by using both the Gateway API
inference extension, and a Kubernetes service as the load balancing strategy. The
benchmark uses the [Latency Profile Generator](https://github.com/AI-Hypercomputer/inference-benchmark) (LPG)
tool to generate load and collect results.
This user guide shows how to run benchmarks against a vLLM model server deployment by using both Gateway API
Inference Extension, and a Kubernetes service as the load balancing strategy. The benchmark uses the
[Latency Profile Generator](https://github.com/AI-Hypercomputer/inference-benchmark) (LPG) tool to generate
load and collect results.

## Prerequisites

### Deploy the inference extension and sample model server

Follow this user guide https://gateway-api-inference-extension.sigs.k8s.io/guides/ to deploy the
sample vLLM application, and the inference extension.
Follow the [getting started guide](https://gateway-api-inference-extension.sigs.k8s.io/guides/#getting-started-with-gateway-api-inference-extension)
to deploy the vLLM model server, CRDs, etc.

__Note:__ Only the GPU-based model server deployment option is supported for benchmark testing.
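
Before benchmarking, a quick sanity check along these lines can confirm the getting-started setup is in place; the CRD kinds are from this project, and the deployment name is an assumption taken from the scale step below:

```bash
# Rough sanity check that the getting-started resources exist before benchmarking.
kubectl get inferencepools,inferencemodels
kubectl get deployment vllm-llama3-8b-instruct
kubectl get pods
```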

### [Optional] Scale the sample vLLM deployment

You will more likely to see the benefits of the inference extension when there are a decent number of replicas to make the optimal routing decision.
You are more likely to see the benefits of the inference extension when there are a decent number of replicas to make the optimal routing decision.

```bash
kubectl scale --replicas=8 -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/raw/main/config/manifests/vllm/gpu-deployment.yaml
kubectl scale deployment vllm-llama3-8b-instruct --replicas=8
```

Reviewer comment (Contributor): nit: I suggest changing replicas to 6 as the example in the end uses 6 replicas, and the new regression test PR also uses 6 (#755).
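
If you follow the reviewer's suggestion and benchmark with 6 replicas (matching the example chart at the end of this guide), the equivalent commands would be roughly:

```bash
# Scale to 6 replicas instead (per the review comment) and wait for the rollout.
kubectl scale deployment vllm-llama3-8b-instruct --replicas=6
kubectl rollout status deployment/vllm-llama3-8b-instruct
```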

### Expose the model server via a k8s service

As the baseline, let's also expose the vLLM deployment as a k8s service:
To establish a baseline, expose the vLLM deployment as a k8s service:

```bash
kubectl expose -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/raw/main/config/manifests/vllm/gpu-deployment.yaml --port=8081 --target-port=8000 --type=LoadBalancer
kubectl expose deployment vllm-llama3-8b-instruct --port=80 --target-port=8000 --type=LoadBalancer
```
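
By default `kubectl expose deployment` names the service after the deployment, so a LoadBalancer service called `vllm-llama3-8b-instruct` should appear; it may take a minute for the external IP to be assigned:

```bash
# Watch until EXTERNAL-IP moves from <pending> to a real address.
kubectl get service vllm-llama3-8b-instruct -w
```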

## Run benchmark

The LPG benchmark tool works by sending traffic to the specified target IP and port, and collect results. Follow the steps below to run a single benchmark. You can deploy multiple LPG instances if you want to run benchmarks in parallel against different targets.
The LPG benchmark tool works by sending traffic to the specified target IP and port, and collecting the results.
Follow the steps below to run a single benchmark. Multiple LPG instances can be deployed to run benchmarks in
parallel against different targets.

1. Check out the repo.

```bash
git clone https://github.com/kubernetes-sigs/gateway-api-inference-extension
cd gateway-api-inference-extension
```

1. Get the target IP. Examples below show how to get the IP of a gateway or a LoadBalancer k8s service.
1. Get the target IP. The examples below shows how to get the IP of a gateway or a k8s service.
Suggested change
1. Get the target IP. The examples below shows how to get the IP of a gateway or a k8s service.
1. Get the target IP. The example below shows how to get the IP of a gateway or a k8s service.


```bash
# Get gateway IP
@@ -51,32 +55,43 @@ The LPG benchmark tool works by sending traffic to the specified target IP and p
echo $SVC_IP
```
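
The collapsed lines in the hunk above do the equivalent of the following sketch; the gateway name (`inference-gateway`) and the service name are assumptions based on the quickstart manifests, so adjust them to your setup:

```bash
# Sketch only: pull the gateway address and the service's LoadBalancer IP via jsonpath.
GW_IP=$(kubectl get gateway/inference-gateway -o jsonpath='{.status.addresses[0].value}')
SVC_IP=$(kubectl get service vllm-llama3-8b-instruct -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
echo $GW_IP $SVC_IP
```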

1. Then update the `<target-ip>` in `./config/manifests/benchmark/benchmark.yaml` to your target IP. Feel free to adjust other parameters such as request_rates as well. For a complete list of LPG configurations, pls refer to the [LPG user guide](https://github.com/AI-Hypercomputer/inference-benchmark?tab=readme-ov-file#configuring-the-benchmark).
1. Then update the `<target-ip>` in `./config/manifests/benchmark/benchmark.yaml` to the value of `$SVC_IP` or `$GW_IP`.
Feel free to adjust other parameters such as `request_rates` as well. For a complete list of LPG configurations, refer to the
[LPG user guide](https://github.com/AI-Hypercomputer/inference-benchmark?tab=readme-ov-file#configuring-the-benchmark).
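
One way to do the substitution from the command line (a sketch, not the only way; GNU `sed` shown, use `sed -i ''` on macOS):

```bash
# Drop the service (or gateway) IP into the benchmark manifest in place.
sed -i "s/<target-ip>/${SVC_IP}/" ./config/manifests/benchmark/benchmark.yaml
```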

1. Start the benchmark tool. `kubectl apply -f ./config/manifests/benchmark/benchmark.yaml`
1. Start the benchmark tool.

1. Wait for benchmark to finish and download the results. Use the `benchmark_id` environment variable
to specify what this benchmark is for. For instance, `inference-extension` or `k8s-svc`. When the LPG tool finishes benchmarking, it will print a log line `LPG_FINISHED`,
the script below will watch for that log line and then start downloading results.
```bash
kubectl apply -f ./config/manifests/benchmark/benchmark.yaml
```

1. Wait for benchmark to finish and download the results. Use the `benchmark_id` environment variable to specify what this
benchmark is for. For instance, `inference-extension` or `k8s-svc`. When the LPG tool finishes benchmarking, it will print
a log line `LPG_FINISHED`. The script below will watch for that log line and then start downloading results.

```bash
benchmark_id='my-benchmark' ./tools/benchmark/download-benchmark-results.bash
benchmark_id='k8s-svc' ./tools/benchmark/download-benchmark-results.bash
```
1. After the script finishes, you should see benchmark results under `./tools/benchmark/output/default-run/my-benchmark/results/json` folder. Here is a [sample json file](./sample.json).

After the script finishes, you should see benchmark results under `./tools/benchmark/output/default-run/k8s-svc/results/json` folder.
Here is a [sample json file](./sample.json). Replace `k8s-svc` with `inference-extension` when running an inference extension benchmark.
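
For example, after a `k8s-svc` run you can list the downloaded files directly:

```bash
# Results for the k8s-svc benchmark; swap in inference-extension for the other run.
ls ./tools/benchmark/output/default-run/k8s-svc/results/json
```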

### Tips

* When using a `benchmark_id` other than `k8s-svc` or `inference-extension`, the labels in `./tools/benchmark/benchmark.ipynb` must be
updated accordingly to analyze the results.
* You can specify the `run_id="runX"` environment variable when running the `./download-benchmark-results.bash` script.
This is useful when you run benchmarks multiple times to get more statistically meaningful results and group the results accordingly (see the sketch after this list).
* Update `request_rates` to values that best suit your benchmark environment.
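
Combining the tips above, a sketch of grouping repeated runs with `run_id` (the ids here are purely illustrative) might look like:

```bash
# Repeat each benchmark under the same run_id so the notebook can group them.
run_id='run1' benchmark_id='k8s-svc' ./tools/benchmark/download-benchmark-results.bash
run_id='run1' benchmark_id='inference-extension' ./tools/benchmark/download-benchmark-results.bash
```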

### Advanced Benchmark Configurations

Pls refer to the [LPG user guide](https://github.com/AI-Hypercomputer/inference-benchmark?tab=readme-ov-file#configuring-the-benchmark) for a detailed list of configuration knobs.
Refer to the [LPG user guide](https://github.com/AI-Hypercomputer/inference-benchmark?tab=readme-ov-file#configuring-the-benchmark) for a
detailed list of configuration knobs.

## Analyze the results

This guide shows how to run the jupyter notebook using vscode.
This guide shows how to run the jupyter notebook using vscode after completing k8s service and inference extension benchmarks.

1. Create a python virtual environment.

@@ -92,6 +107,6 @@ This guide shows how to run the jupyter notebook using vscode.
```
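
The collapsed commands are not shown in this hunk; a minimal sketch of creating and activating a virtual environment (assuming `python3` is on your PATH) is:

```bash
# Create and activate a virtual environment for the notebook tooling.
python3 -m venv .venv
source .venv/bin/activate
```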

1. Open the notebook `./tools/benchmark/benchmark.ipynb`, and run each cell. At the end you should
see a bar chart like below where **"ie"** represents inference extension. This chart is generated using this benchmarking tool with 6 vLLM (v1) model servers (H100 80 GB), [llama2-7b](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf/tree/main) and the [ShareGPT dataset](https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json).
![alt text](example-bar-chart.png)
see a bar chart like below where __"ie"__ represents inference extension. This chart is generated using this benchmarking tool with 6 vLLM (v1) model servers (H100 80 GB), [llama2-7b](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf/tree/main) and the [ShareGPT dataset](https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json).

![alt text](example-bar-chart.png)
4 changes: 2 additions & 2 deletions tools/benchmark/benchmark.ipynb
@@ -21,7 +21,7 @@
"#@title Configuration. Edit this before running the rest.\n",
"\n",
"OUTPUT_DIR='output'\n",
"RUN_ID='example-run'\n",
"RUN_ID='default-run'\n",
"# Path to the benchmark dir under `gateway-api-inference-extension/benchmark`\n",
"BENCHMARK_DIR =\"./\"\n",
"# A regex to match the model name, which matches the output file name.\n",
@@ -229,7 +229,7 @@
" plot_func(curAx, m)\n",
" return fig, axes\n",
"\n",
"def plot_bar(labels, groups, metrics=CORE_METRICS, num_plots_per_row=NUM_PLOTS_PER_ROW, interactive=INTERACTIVE_PLOT, annotate=False):\n",
"def plot_bar(labels, groups, metrics=CORE_METRICS, num_plots_per_row=NUM_PLOTS_PER_ROW, interactive=False, annotate=False):\n",
" labels = [label.alias for label in labels]\n",
" logger.debug(f'Prnting bar chart for {labels}')\n",
" logger.debug(f'groups: {groups}')\n",