Docs: Updates Benchmark Guide #789
Merged

# Benchmark

This user guide shows how to run benchmarks against a vLLM model server deployment by using both the Gateway API
Inference Extension and a Kubernetes service as the load balancing strategy. The benchmark uses the
[Latency Profile Generator](https://github.com/AI-Hypercomputer/inference-benchmark) (LPG) tool to generate
load and collect results.

## Prerequisites

### Deploy the inference extension and sample model server

Follow the [getting started guide](https://gateway-api-inference-extension.sigs.k8s.io/guides/#getting-started-with-gateway-api-inference-extension)
to deploy the vLLM model server, the CRDs, and the inference extension.

__Note:__ Only the GPU-based model server deployment option is supported for benchmark testing.

### [Optional] Scale the sample vLLM deployment

You are more likely to see the benefits of the inference extension when there are enough replicas for it to make optimal routing decisions.

```bash
kubectl scale deployment vllm-llama3-8b-instruct --replicas=8
```

> Reviewer nit: consider using 6 replicas instead, since the example at the end of this guide and the new regression test PR both use 6.
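
If you want to confirm the scale-out finished before benchmarking, a quick sanity check like the one below should work. This is not part of the original guide; it assumes the deployment runs in the default namespace.

```bash
# Wait for all replicas to become ready, then print the ready-replica count.
kubectl rollout status deployment/vllm-llama3-8b-instruct --timeout=15m
kubectl get deployment vllm-llama3-8b-instruct -o jsonpath='{.status.readyReplicas}{"\n"}'
```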

### Expose the model server via a k8s service

To establish a baseline, expose the vLLM deployment as a k8s service:

```bash
kubectl expose deployment vllm-llama3-8b-instruct --port=80 --target-port=8000 --type=LoadBalancer
```
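
Provisioning the external load balancer can take a minute or two depending on your environment. A quick way to check that the service has received an external address (again assuming the default namespace):

```bash
# The EXTERNAL-IP column should show an address once the load balancer is ready.
kubectl get service vllm-llama3-8b-instruct
```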

## Run benchmark

The LPG benchmark tool works by sending traffic to the specified target IP and port and collecting the results.
Follow the steps below to run a single benchmark. Multiple LPG instances can be deployed to run benchmarks in
parallel against different targets.

1. Check out the repo.

   ```bash
   git clone https://github.com/kubernetes-sigs/gateway-api-inference-extension
   cd gateway-api-inference-extension
   ```

1. Get the target IP. The examples below show how to get the IP of a gateway or a LoadBalancer k8s service.

   ```bash
   # Get gateway IP
   ...
   echo $SVC_IP
   ```
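
   The commands elided from this diff capture both addresses into environment variables. As a rough sketch (the gateway name `inference-gateway` is an assumption based on the getting started guide, and the service name comes from the expose step above):

   ```bash
   # Gateway address (assumes a Gateway named inference-gateway in the default namespace).
   GW_IP=$(kubectl get gateway inference-gateway -o jsonpath='{.status.addresses[0].value}')
   echo $GW_IP

   # LoadBalancer service address for the baseline k8s service.
   SVC_IP=$(kubectl get service vllm-llama3-8b-instruct -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
   echo $SVC_IP
   ```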

1. Then update the `<target-ip>` in `./config/manifests/benchmark/benchmark.yaml` to the value of `$SVC_IP` or `$GW_IP`.
   Feel free to adjust other parameters such as `request_rates` as well. For a complete list of LPG configurations, refer to the
   [LPG user guide](https://github.com/AI-Hypercomputer/inference-benchmark?tab=readme-ov-file#configuring-the-benchmark).
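
   The substitution can also be scripted. This is just a convenience sketch (GNU sed shown) and assumes the placeholder appears literally as `<target-ip>` in the manifest:

   ```bash
   # Point the benchmark at the baseline k8s service; use $GW_IP instead for the inference extension run.
   sed -i "s/<target-ip>/${SVC_IP}/g" ./config/manifests/benchmark/benchmark.yaml
   ```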

1. Start the benchmark tool.

   ```bash
   kubectl apply -f ./config/manifests/benchmark/benchmark.yaml
   ```
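
   To confirm the tool actually started, you can look for the pod it creates and follow its logs. The pod name and labels come from `benchmark.yaml`, so treat this as a sketch rather than the guide's official command:

   ```bash
   # Find the benchmark pod, then stream its logs (replace the placeholder with the real pod name).
   kubectl get pods
   kubectl logs -f <benchmark-pod-name>
   ```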

1. Wait for the benchmark to finish and download the results. Use the `benchmark_id` environment variable to specify what this
   benchmark is for, for instance `inference-extension` or `k8s-svc`. When the LPG tool finishes benchmarking, it prints the
   log line `LPG_FINISHED`. The script below watches for that log line and then starts downloading the results.

   ```bash
   benchmark_id='k8s-svc' ./tools/benchmark/download-benchmark-results.bash
   ```

   After the script finishes, you should see the benchmark results under the `./tools/benchmark/output/default-run/k8s-svc/results/json` folder.
   Here is a [sample json file](./sample.json). Replace `k8s-svc` with `inference-extension` when running an inference extension benchmark.
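
   To spot-check that the download succeeded, you can list the output folder (the exact file names depend on the LPG run):

   ```bash
   ls ./tools/benchmark/output/default-run/k8s-svc/results/json
   ```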

### Tips

* When using a `benchmark_id` other than `k8s-svc` or `inference-extension`, the labels in `./tools/benchmark/benchmark.ipynb` must be
  updated accordingly to analyze the results.
* You can set the `run_id="runX"` environment variable when running the `./download-benchmark-results.bash` script, as shown in the example below.
  This is useful when you run a benchmark multiple times to get more statistically meaningful results and want to group the runs accordingly.
* Adjust the `request_rates` to best suit your benchmark environment.
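
For example, grouping two repeated runs of the k8s service benchmark might look like this (the `run_id` values are arbitrary labels, not required names):

```bash
# Download results for two separate runs of the same benchmark target.
benchmark_id='k8s-svc' run_id='run1' ./tools/benchmark/download-benchmark-results.bash
benchmark_id='k8s-svc' run_id='run2' ./tools/benchmark/download-benchmark-results.bash
```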

### Advanced Benchmark Configurations

Refer to the [LPG user guide](https://github.com/AI-Hypercomputer/inference-benchmark?tab=readme-ov-file#configuring-the-benchmark) for a
detailed list of configuration knobs.

## Analyze the results

This guide shows how to run the Jupyter notebook using VS Code after completing the k8s service and inference extension benchmarks.

1. Create a Python virtual environment.

   ...

1. Open the notebook `./tools/benchmark/benchmark.ipynb`, and run each cell. At the end you should
   see a bar chart like the one below, where __"ie"__ represents the inference extension. This chart is generated using this benchmarking tool with 6 vLLM (v1) model servers (H100 80 GB), [llama2-7b](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf/tree/main) and the [ShareGPT dataset](https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json).

   

> Reviewer comment on the `kubectl expose` port change: "Great! 8081 was the port back then when we had envoy patches."