From 188d82e0507014d83bc3d15a3a39d795f032f3b8 Mon Sep 17 00:00:00 2001
From: Daneyon Hansen
Date: Tue, 6 May 2025 16:41:56 -0700
Subject: [PATCH] Docs: Updates Benchmark Guide

Signed-off-by: Daneyon Hansen
---
 config/manifests/benchmark/benchmark.yaml |  2 +-
 site-src/guides/index.md                  |  2 +-
 site-src/performance/benchmark/index.md   | 65 ++++++++++++++---------
 tools/benchmark/benchmark.ipynb           |  4 +-
 4 files changed, 44 insertions(+), 29 deletions(-)

diff --git a/config/manifests/benchmark/benchmark.yaml b/config/manifests/benchmark/benchmark.yaml
index c784730e8..abf9ae5f6 100644
--- a/config/manifests/benchmark/benchmark.yaml
+++ b/config/manifests/benchmark/benchmark.yaml
@@ -37,7 +37,7 @@ spec:
       - name: BACKEND
         value: vllm
       - name: PORT
-        value: "8081"
+        value: "80"
      - name: INPUT_LENGTH
         value: "1024"
       - name: OUTPUT_LENGTH
diff --git a/site-src/guides/index.md b/site-src/guides/index.md
index be1b972f8..89811263f 100644
--- a/site-src/guides/index.md
+++ b/site-src/guides/index.md
@@ -76,7 +76,7 @@ This quickstart guide is intended for engineers familiar with k8s and model serv
    kubectl apply -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/raw/main/config/manifests/inferencemodel.yaml
    ```
 
-### Deploy the InferencePool and Extension
+### Deploy the InferencePool and Endpoint Picker Extension
 
 ```bash
 kubectl apply -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/raw/main/config/manifests/inferencepool-resources.yaml
 ```
diff --git a/site-src/performance/benchmark/index.md b/site-src/performance/benchmark/index.md
index 39457bf66..160cc26fb 100644
--- a/site-src/performance/benchmark/index.md
+++ b/site-src/performance/benchmark/index.md
@@ -1,45 +1,49 @@
 # Benchmark
 
-This user guide shows how to run benchmarks against a vLLM deployment, by using both the Gateway API
-inference extension, and a Kubernetes service as the load balancing strategy. The
-benchmark uses the [Latency Profile Generator](https://github.com/AI-Hypercomputer/inference-benchmark) (LPG)
-tool to generate load and collect results.
+This user guide shows how to run benchmarks against a vLLM model server deployment by using both Gateway API
+Inference Extension, and a Kubernetes service as the load balancing strategy. The benchmark uses the
+[Latency Profile Generator](https://github.com/AI-Hypercomputer/inference-benchmark) (LPG) tool to generate
+load and collect results.
 
 ## Prerequisites
 
 ### Deploy the inference extension and sample model server
 
-Follow this user guide https://gateway-api-inference-extension.sigs.k8s.io/guides/ to deploy the
-sample vLLM application, and the inference extension.
+Follow the [getting started guide](https://gateway-api-inference-extension.sigs.k8s.io/guides/#getting-started-with-gateway-api-inference-extension)
+to deploy the vLLM model server, CRDs, etc.
+
+__Note:__ Only the GPU-based model server deployment option is supported for benchmark testing.
 
 ### [Optional] Scale the sample vLLM deployment
 
-You will more likely to see the benefits of the inference extension when there are a decent number of replicas to make the optimal routing decision.
+You are more likely to see the benefits of the inference extension when there are a decent number of replicas to make the optimal routing decision.
 ```bash
-kubectl scale --replicas=8 -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/raw/main/config/manifests/vllm/gpu-deployment.yaml
+kubectl scale deployment vllm-llama3-8b-instruct --replicas=8
 ```
 
 ### Expose the model server via a k8s service
 
-As the baseline, let's also expose the vLLM deployment as a k8s service:
+To establish a baseline, expose the vLLM deployment as a k8s service:
 
 ```bash
-kubectl expose -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/raw/main/config/manifests/vllm/gpu-deployment.yaml --port=8081 --target-port=8000 --type=LoadBalancer
+kubectl expose deployment vllm-llama3-8b-instruct --port=80 --target-port=8000 --type=LoadBalancer
 ```
 
 ## Run benchmark
 
-The LPG benchmark tool works by sending traffic to the specified target IP and port, and collect results. Follow the steps below to run a single benchmark. You can deploy multiple LPG instances if you want to run benchmarks in parallel against different targets.
+The LPG benchmark tool works by sending traffic to the specified target IP and port, and collecting the results.
+Follow the steps below to run a single benchmark. Multiple LPG instances can be deployed to run benchmarks in
+parallel against different targets.
 
 1. Check out the repo.
-    
+
    ```bash
    git clone https://github.com/kubernetes-sigs/gateway-api-inference-extension
    cd gateway-api-inference-extension
    ```
 
-1. Get the target IP. Examples below show how to get the IP of a gateway or a LoadBalancer k8s service.
+1. Get the target IP. The examples below show how to get the IP of a gateway or a k8s service.
 
    ```bash
    # Get gateway IP
@@ -51,32 +55,43 @@ The LPG benchmark tool works by sending traffic to the specified target IP and p
    echo $SVC_IP
    ```
 
-1. Then update the `<target-ip>` in `./config/manifests/benchmark/benchmark.yaml` to your target IP. Feel free to adjust other parameters such as request_rates as well. For a complete list of LPG configurations, pls refer to the [LPG user guide](https://github.com/AI-Hypercomputer/inference-benchmark?tab=readme-ov-file#configuring-the-benchmark).
+1. Then update the `<target-ip>` in `./config/manifests/benchmark/benchmark.yaml` to the value of `$SVC_IP` or `$GW_IP`.
+   Feel free to adjust other parameters such as `request_rates` as well. For a complete list of LPG configurations, refer to the
+   [LPG user guide](https://github.com/AI-Hypercomputer/inference-benchmark?tab=readme-ov-file#configuring-the-benchmark).
 
-1. Start the benchmark tool. `kubectl apply -f ./config/manifests/benchmark/benchmark.yaml`
-
-1. Wait for benchmark to finish and download the results. Use the `benchmark_id` environment variable
-to specify what this benchmark is for. For instance, `inference-extension` or `k8s-svc`. When the LPG tool finishes benchmarking, it will print a log line `LPG_FINISHED`,
-the script below will watch for that log line and then start downloading results.
+1. Start the benchmark tool.
+
+   ```bash
+   kubectl apply -f ./config/manifests/benchmark/benchmark.yaml
+   ```
+
+1. Wait for benchmark to finish and download the results. Use the `benchmark_id` environment variable to specify what this
+   benchmark is for. For instance, `inference-extension` or `k8s-svc`. When the LPG tool finishes benchmarking, it will print
+   a log line `LPG_FINISHED`. The script below will watch for that log line and then start downloading results.
 
    ```bash
-    benchmark_id='my-benchmark' ./tools/benchmark/download-benchmark-results.bash
+    benchmark_id='k8s-svc' ./tools/benchmark/download-benchmark-results.bash
    ```
-   1. After the script finishes, you should see benchmark results under `./tools/benchmark/output/default-run/my-benchmark/results/json` folder. Here is a [sample json file](./sample.json).
+
+   After the script finishes, you should see benchmark results under the `./tools/benchmark/output/default-run/k8s-svc/results/json` folder.
+   Here is a [sample json file](./sample.json). Replace `k8s-svc` with `inference-extension` when running an inference extension benchmark.
 
 ### Tips
 
+* When using a `benchmark_id` other than `k8s-svc` or `inference-extension`, the labels in `./tools/benchmark/benchmark.ipynb` must be
+  updated accordingly to analyze the results.
 * You can specify `run_id="runX"` environment variable when running the `./download-benchmark-results.bash` script. This is useful when you run benchmarks multiple times to get a more statistically meaningful results and group the results accordingly.
 * Update the `request_rates` that best suit your benchmark environment.
 
 ### Advanced Benchmark Configurations
 
-Pls refer to the [LPG user guide](https://github.com/AI-Hypercomputer/inference-benchmark?tab=readme-ov-file#configuring-the-benchmark) for a detailed list of configuration knobs.
+Refer to the [LPG user guide](https://github.com/AI-Hypercomputer/inference-benchmark?tab=readme-ov-file#configuring-the-benchmark) for a
+detailed list of configuration knobs.
 
 ## Analyze the results
 
-This guide shows how to run the jupyter notebook using vscode.
+This guide shows how to run the jupyter notebook using vscode after completing k8s service and inference extension benchmarks.
 
 1. Create a python virtual environment.
 
@@ -92,6 +107,6 @@ This guide shows how to run the jupyter notebook using vscode.
    ```
 
 1. Open the notebook `./tools/benchmark/benchmark.ipynb`, and run each cell. At the end you should
-   see a bar chart like below where **"ie"** represents inference extension. This chart is generated using this benchmarking tool with 6 vLLM (v1) model servers (H100 80 GB), [llama2-7b](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf/tree/main) and the [ShareGPT dataset](https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json).
-
-   ![alt text](example-bar-chart.png)
\ No newline at end of file
+   see a bar chart like below where __"ie"__ represents inference extension. This chart is generated using this benchmarking tool with 6 vLLM (v1) model servers (H100 80 GB), [llama2-7b](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf/tree/main) and the [ShareGPT dataset](https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json).
+
+   ![alt text](example-bar-chart.png)
diff --git a/tools/benchmark/benchmark.ipynb b/tools/benchmark/benchmark.ipynb
index 993279cb9..ffd4c455e 100644
--- a/tools/benchmark/benchmark.ipynb
+++ b/tools/benchmark/benchmark.ipynb
@@ -21,7 +21,7 @@
     "#@title Configuration. Edit this before running the rest.\n",
     "\n",
     "OUTPUT_DIR='output'\n",
-    "RUN_ID='example-run'\n",
+    "RUN_ID='default-run'\n",
     "# Path to the benchmark dir under `gateway-api-inference-extension/benchmark`\n",
     "BENCHMARK_DIR =\"./\"\n",
     "# A regex to match the model name, which matches the output file name.\n",
@@ -229,7 +229,7 @@
     "    plot_func(curAx, m)\n",
     "  return fig, axes\n",
     "\n",
-    "def plot_bar(labels, groups, metrics=CORE_METRICS, num_plots_per_row=NUM_PLOTS_PER_ROW, interactive=INTERACTIVE_PLOT, annotate=False):\n",
+    "def plot_bar(labels, groups, metrics=CORE_METRICS, num_plots_per_row=NUM_PLOTS_PER_ROW, interactive=False, annotate=False):\n",
     "  labels = [label.alias for label in labels]\n",
     "  logger.debug(f'Prnting bar chart for {labels}')\n",
     "  logger.debug(f'groups: {groups}')\n",
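
For reference, the two runs that the notebook compares can be chained together with the commands introduced in the updated guide. The block below is only an illustrative sketch: it assumes `<target-ip>` in `./config/manifests/benchmark/benchmark.yaml` is edited by hand to `$SVC_IP` or `$GW_IP` before each run, and the `kubectl delete` step between runs is an assumption rather than part of the documented workflow.

```bash
# Baseline run against the k8s service (set <target-ip> to $SVC_IP in benchmark.yaml first).
kubectl apply -f ./config/manifests/benchmark/benchmark.yaml
benchmark_id='k8s-svc' ./tools/benchmark/download-benchmark-results.bash

# Assumption: tear down the LPG deployment before benchmarking the next target.
kubectl delete -f ./config/manifests/benchmark/benchmark.yaml

# Second run against the inference gateway (set <target-ip> to $GW_IP in benchmark.yaml first).
kubectl apply -f ./config/manifests/benchmark/benchmark.yaml
benchmark_id='inference-extension' ./tools/benchmark/download-benchmark-results.bash
```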