
Commit e7944d1

Docs: Updates Benchmark Guide (#789)
Signed-off-by: Daneyon Hansen <[email protected]>
1 parent 35d7f64 commit e7944d1

4 files changed (+44, -29 lines)


config/manifests/benchmark/benchmark.yaml (+1, -1)
@@ -37,7 +37,7 @@ spec:
         - name: BACKEND
           value: vllm
         - name: PORT
-          value: "8081"
+          value: "80"
         - name: INPUT_LENGTH
           value: "1024"
         - name: OUTPUT_LENGTH
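
The new PORT value lines up with the k8s Service exposed on port 80 later in this change. A quick way to confirm the port on a live cluster (the service name `vllm-llama3-8b-instruct` is taken from other commands in this commit, so treat it as an assumption) is:

```bash
# Print the port of the baseline Service the benchmark will target (expected: 80).
kubectl get service vllm-llama3-8b-instruct -o jsonpath='{.spec.ports[0].port}'
```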

site-src/guides/index.md (+1, -1)
@@ -76,7 +76,7 @@ This quickstart guide is intended for engineers familiar with k8s and model serv
    kubectl apply -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/raw/main/config/manifests/inferencemodel.yaml
    ```
 
-### Deploy the InferencePool and Extension
+### Deploy the InferencePool and Endpoint Picker Extension
 
 ```bash
 kubectl apply -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/raw/main/config/manifests/inferencepool-resources.yaml
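
After applying the manifest above, a minimal sanity check (assuming the quickstart CRDs register the `inferencepools` resource; the exact pool and pod names are not shown in this diff) might be:

```bash
# Confirm the InferencePool resource exists and the endpoint picker pods are up.
kubectl get inferencepools
kubectl get pods
```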
(+40, -25)

@@ -1,45 +1,49 @@
 # Benchmark
 
-This user guide shows how to run benchmarks against a vLLM deployment, by using both the Gateway API
-inference extension, and a Kubernetes service as the load balancing strategy. The
-benchmark uses the [Latency Profile Generator](https://github.com/AI-Hypercomputer/inference-benchmark) (LPG)
-tool to generate load and collect results.
+This user guide shows how to run benchmarks against a vLLM model server deployment by using both Gateway API
+Inference Extension, and a Kubernetes service as the load balancing strategy. The benchmark uses the
+[Latency Profile Generator](https://github.com/AI-Hypercomputer/inference-benchmark) (LPG) tool to generate
+load and collect results.
 
 ## Prerequisites
 
 ### Deploy the inference extension and sample model server
 
-Follow this user guide https://gateway-api-inference-extension.sigs.k8s.io/guides/ to deploy the
-sample vLLM application, and the inference extension.
+Follow the [getting started guide](https://gateway-api-inference-extension.sigs.k8s.io/guides/#getting-started-with-gateway-api-inference-extension)
+to deploy the vLLM model server, CRDs, etc.
+
+__Note:__ Only the GPU-based model server deployment option is supported for benchmark testing.
 
 ### [Optional] Scale the sample vLLM deployment
 
-You will more likely to see the benefits of the inference extension when there are a decent number of replicas to make the optimal routing decision.
+You are more likely to see the benefits of the inference extension when there are a decent number of replicas to make the optimal routing decision.
 
 ```bash
-kubectl scale --replicas=8 -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/raw/main/config/manifests/vllm/gpu-deployment.yaml
+kubectl scale deployment vllm-llama3-8b-instruct --replicas=8
 ```
 
 ### Expose the model server via a k8s service
 
-As the baseline, let's also expose the vLLM deployment as a k8s service:
+To establish a baseline, expose the vLLM deployment as a k8s service:
 
 ```bash
-kubectl expose -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/raw/main/config/manifests/vllm/gpu-deployment.yaml --port=8081 --target-port=8000 --type=LoadBalancer
+kubectl expose deployment vllm-llama3-8b-instruct --port=80 --target-port=8000 --type=LoadBalancer
 ```
 
 ## Run benchmark
 
-The LPG benchmark tool works by sending traffic to the specified target IP and port, and collect results. Follow the steps below to run a single benchmark. You can deploy multiple LPG instances if you want to run benchmarks in parallel against different targets.
+The LPG benchmark tool works by sending traffic to the specified target IP and port, and collecting the results.
+Follow the steps below to run a single benchmark. Multiple LPG instances can be deployed to run benchmarks in
+parallel against different targets.
 
 1. Check out the repo.
-
+
    ```bash
    git clone https://github.com/kubernetes-sigs/gateway-api-inference-extension
    cd gateway-api-inference-extension
    ```
 
-1. Get the target IP. Examples below show how to get the IP of a gateway or a LoadBalancer k8s service.
+1. Get the target IP. The examples below show how to get the IP of a gateway or a k8s service.
 
    ```bash
    # Get gateway IP
@@ -51,32 +55,43 @@ The LPG benchmark tool works by sending traffic to the specified target IP and p
    echo $SVC_IP
    ```
 
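
The IP lookup commands themselves fall outside the changed lines above. For reference, they typically look like the following sketch (the gateway name `inference-gateway` is an assumption based on the quickstart; the service name matches the `kubectl expose` command earlier in this diff):

```bash
# Address of the Gateway fronting the inference extension.
GW_IP=$(kubectl get gateway/inference-gateway -o jsonpath='{.status.addresses[0].value}')

# LoadBalancer IP of the baseline k8s Service.
SVC_IP=$(kubectl get service/vllm-llama3-8b-instruct -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
```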
-1. Then update the `<target-ip>` in `./config/manifests/benchmark/benchmark.yaml` to your target IP. Feel free to adjust other parameters such as request_rates as well. For a complete list of LPG configurations, pls refer to the [LPG user guide](https://github.com/AI-Hypercomputer/inference-benchmark?tab=readme-ov-file#configuring-the-benchmark).
+1. Then update the `<target-ip>` in `./config/manifests/benchmark/benchmark.yaml` to the value of `$SVC_IP` or `$GW_IP`.
+   Feel free to adjust other parameters such as `request_rates` as well. For a complete list of LPG configurations, refer to the
+   [LPG user guide](https://github.com/AI-Hypercomputer/inference-benchmark?tab=readme-ov-file#configuring-the-benchmark).
 
-1. Start the benchmark tool. `kubectl apply -f ./config/manifests/benchmark/benchmark.yaml`
+1. Start the benchmark tool.
 
-1. Wait for benchmark to finish and download the results. Use the `benchmark_id` environment variable
-to specify what this benchmark is for. For instance, `inference-extension` or `k8s-svc`. When the LPG tool finishes benchmarking, it will print a log line `LPG_FINISHED`,
-the script below will watch for that log line and then start downloading results.
+   ```bash
+   kubectl apply -f ./config/manifests/benchmark/benchmark.yaml
+   ```
+
+1. Wait for benchmark to finish and download the results. Use the `benchmark_id` environment variable to specify what this
+   benchmark is for. For instance, `inference-extension` or `k8s-svc`. When the LPG tool finishes benchmarking, it will print
+   a log line `LPG_FINISHED`. The script below will watch for that log line and then start downloading results.
 
    ```bash
-   benchmark_id='my-benchmark' ./tools/benchmark/download-benchmark-results.bash
+   benchmark_id='k8s-svc' ./tools/benchmark/download-benchmark-results.bash
    ```
-1. After the script finishes, you should see benchmark results under `./tools/benchmark/output/default-run/my-benchmark/results/json` folder. Here is a [sample json file](./sample.json).
+
+   After the script finishes, you should see benchmark results under `./tools/benchmark/output/default-run/k8s-svc/results/json` folder.
+   Here is a [sample json file](./sample.json). Replace `k8s-svc` with `inference-extension` when running an inference extension benchmark.
 
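
Taken together, the steps above are run once per target. A sketch of collecting both result sets (assuming `benchmark.yaml` is re-pointed at the other target and re-applied between passes) is:

```bash
# First pass: benchmark.yaml targets $SVC_IP (the baseline Service).
benchmark_id='k8s-svc' ./tools/benchmark/download-benchmark-results.bash

# Second pass: re-point benchmark.yaml at $GW_IP, re-apply it, then collect under a new id.
benchmark_id='inference-extension' ./tools/benchmark/download-benchmark-results.bash
```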
 ### Tips
 
+* When using a `benchmark_id` other than `k8s-svc` or `inference-extension`, the labels in `./tools/benchmark/benchmark.ipynb` must be
+  updated accordingly to analyze the results.
 * You can specify `run_id="runX"` environment variable when running the `./download-benchmark-results.bash` script.
   This is useful when you run benchmarks multiple times to get a more statistically meaningful results and group the results accordingly.
 * Update the `request_rates` that best suit your benchmark environment.
 
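
Combining the two environment variables mentioned in the tips might look like this (`run2` is an arbitrary id chosen for illustration):

```bash
# Group a repeat benchmark under its own run id so the notebook can compare runs.
run_id='run2' benchmark_id='inference-extension' ./tools/benchmark/download-benchmark-results.bash
```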
 ### Advanced Benchmark Configurations
 
-Pls refer to the [LPG user guide](https://github.com/AI-Hypercomputer/inference-benchmark?tab=readme-ov-file#configuring-the-benchmark) for a detailed list of configuration knobs.
+Refer to the [LPG user guide](https://github.com/AI-Hypercomputer/inference-benchmark?tab=readme-ov-file#configuring-the-benchmark) for a
+detailed list of configuration knobs.
 
 ## Analyze the results
 
-This guide shows how to run the jupyter notebook using vscode.
+This guide shows how to run the jupyter notebook using vscode after completing k8s service and inference extension benchmarks.
 
 1. Create a python virtual environment.
 
@@ -92,6 +107,6 @@ This guide shows how to run the jupyter notebook using vscode.
    ```
 
 1. Open the notebook `./tools/benchmark/benchmark.ipynb`, and run each cell. At the end you should
-   see a bar chart like below where **"ie"** represents inference extension. This chart is generated using this benchmarking tool with 6 vLLM (v1) model servers (H100 80 GB), [llama2-7b](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf/tree/main) and the [ShareGPT dataset](https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json).
-
-![alt text](example-bar-chart.png)
+   see a bar chart like below where __"ie"__ represents inference extension. This chart is generated using this benchmarking tool with 6 vLLM (v1) model servers (H100 80 GB), [llama2-7b](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf/tree/main) and the [ShareGPT dataset](https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json).
+
+   ![alt text](example-bar-chart.png)
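
The virtual-environment commands under "Create a python virtual environment" sit outside the changed lines and are not shown here. For orientation only, a typical sequence (package choice is a guess, not part of this commit) is:

```bash
# Create and activate a virtual environment, then install the notebook tooling.
python3 -m venv .venv
source .venv/bin/activate
pip install jupyter
```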

tools/benchmark/benchmark.ipynb (+2, -2)
@@ -21,7 +21,7 @@
 "#@title Configuration. Edit this before running the rest.\n",
 "\n",
 "OUTPUT_DIR='output'\n",
-"RUN_ID='example-run'\n",
+"RUN_ID='default-run'\n",
 "# Path to the benchmark dir under `gateway-api-inference-extension/benchmark`\n",
 "BENCHMARK_DIR =\"./\"\n",
 "# A regex to match the model name, which matches the output file name.\n",
@@ -229,7 +229,7 @@
 " plot_func(curAx, m)\n",
 " return fig, axes\n",
 "\n",
-"def plot_bar(labels, groups, metrics=CORE_METRICS, num_plots_per_row=NUM_PLOTS_PER_ROW, interactive=INTERACTIVE_PLOT, annotate=False):\n",
+"def plot_bar(labels, groups, metrics=CORE_METRICS, num_plots_per_row=NUM_PLOTS_PER_ROW, interactive=False, annotate=False):\n",
 " labels = [label.alias for label in labels]\n",
 " logger.debug(f'Prnting bar chart for {labels}')\n",
 " logger.debug(f'groups: {groups}')\n",
