Commit 63fda7c

refine regression doc
1 parent dda45ee commit 63fda7c

File tree

1 file changed: site-src/performance/regression-testing (+43, -42 lines)

# Regression Testing

Regression testing verifies that recent code changes have not adversely affected the performance or stability of the Inference Gateway.

This guide explains how to run regression tests against the Gateway API inference extension using the [Latency Profile Generator (LPG)](https://github.com/AI-Hypercomputer/inference-benchmark/) to simulate traffic and collect performance metrics.

## Prerequisites

Refer to the [benchmark guide](/site-src/performance/benchmark/index.md) for common setup steps, including deployment of the inference extension, model server setup, scaling the vLLM deployment, and obtaining the Gateway IP.
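
That setup typically ends with a scaled vLLM deployment and a reachable Gateway address. A minimal sketch of those two steps is shown below; the resource names (`vllm-llama3-8b-instruct`, `inference-gateway`) are assumptions based on the sample manifests and may differ in your cluster.

```bash
# Sketch only: substitute the deployment and gateway names from your benchmark-guide setup.
kubectl scale deployment vllm-llama3-8b-instruct --replicas=10   # 10 replicas, matching the test cases below
IP=$(kubectl get gateway/inference-gateway -o jsonpath='{.status.addresses[0].value}')
PORT=80
echo "Gateway endpoint: ${IP}:${PORT}"
```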

## Create the LPG Docker Image

Follow the detailed instructions [here](https://github.com/AI-Hypercomputer/inference-benchmark/blob/1c92df607751a7ddb04e2152ed7f6aaf85bd9ca7/README.md) to build the LPG Docker image:

* Create an artifact repository:

```bash
gcloud artifacts repositories create ai-benchmark --location=us-central1 --repository-format=docker
```

* Prepare the datasets for [Infinity-Instruct](https://huggingface.co/datasets/BAAI/Infinity-Instruct) and [billsum](https://huggingface.co/datasets/FiscalNote/billsum):

```bash
pip install datasets transformers numpy pandas tqdm matplotlib
python datasets/import_dataset.py --hf_token YOUR_TOKEN
```

* Build the benchmark Docker image:

```bash
docker build -t inference-benchmark .
```

* Push the Docker image to your artifact registry:

```bash
docker tag inference-benchmark us-central1-docker.pkg.dev/{project-name}/ai-benchmark/inference-benchmark
docker push us-central1-docker.pkg.dev/{project-name}/ai-benchmark/inference-benchmark
```
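
If the push is rejected with an authentication error, you may first need to configure Docker credentials for Artifact Registry; a likely one-time step for the `us-central1` region used above:

```bash
# One-time Docker credential setup for Artifact Registry in us-central1 (adjust the region if yours differs).
gcloud auth configure-docker us-central1-docker.pkg.dev
```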

## Conduct Regression Tests

Run the benchmarks using the configurations below, which assume NVIDIA H100 GPUs (80 GB). Adjust them as needed for different hardware, backend counts, or datasets.

### Test Case 1: Single Workload

- **Dataset:** `billsum_conversations.json` (created from the [HuggingFace billsum dataset](https://huggingface.co/datasets/FiscalNote/billsum)).
  * This dataset features long prompts, making it prefill-heavy and well suited to scenarios that stress initial token generation.
- **Model:** [Llama 3 (8B)](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) (*critical*)
- **Replicas:** 10 (vLLM)
- **Request Rates:** 300–350 (increments of 10)

Refer to the example manifest `./config/manifests/regression-testing/single-workload-regression.yaml`, which encodes this configuration for the LPG tool.
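
One way to launch this test case, assuming the manifest runs the LPG benchmark as an in-cluster workload (the `benchmark-tool` name is an assumption; use whatever the manifest actually creates):

```bash
# Deploy the single-workload regression benchmark and follow its logs.
kubectl apply -f ./config/manifests/regression-testing/single-workload-regression.yaml
kubectl logs -f deployment/benchmark-tool   # assumed workload name; check `kubectl get pods` if it differs
```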

### Test Case 2: Multi-LoRA

- **Dataset:** `Infinity-Instruct_conversations.json` (created from the [HuggingFace Infinity-Instruct dataset](https://huggingface.co/datasets/BAAI/Infinity-Instruct)).
  * This dataset has long outputs, making it decode-heavy and useful for scenarios that stress sustained token generation.
- **Model:** [Llama 3 (8B)](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct)
- **LoRA Adapters:** 15 adapters of `nvidia/llama-3.1-nemoguard-8b-topic-control` (rank 8, <1% of base model size), all *critical*
- **Hardware:** NVIDIA H100 GPUs (80 GB)
- **Traffic Distribution:** 60% to the first 5 adapters (12% each), 30% to the next 5 (6% each), and 10% to the last 5 (2% each), simulating prod/dev/test tiers
- **Max LoRA:** 3
- **Replicas:** 10 (vLLM)
- **Request Rates:** 20–200 (increments of 20)

Optionally, you can also run benchmarks using the `ShareGPT` dataset for additional coverage.

Update your deployments for multi-LoRA support, as described above:

- vLLM Deployment: `./config/manifests/regression-testing/vllm/multi-lora-deployment.yaml`
- InferenceModel: `./config/manifests/inferencemodel.yaml`

Refer to the example manifest `./config/manifests/regression-testing/multi-lora-regression.yaml`, which encodes this configuration for the LPG tool.
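
A possible apply sequence for this test case, assuming you run it from the repository root with kubectl pointed at the benchmark cluster:

```bash
# Switch the model server and InferenceModel to the multi-LoRA configuration, then start the LPG run.
kubectl apply -f ./config/manifests/regression-testing/vllm/multi-lora-deployment.yaml
kubectl apply -f ./config/manifests/inferencemodel.yaml
kubectl apply -f ./config/manifests/regression-testing/multi-lora-regression.yaml
```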

### Execute Benchmarks

Collect benchmarks in two phases: once before applying your changes to the Inference Gateway, and again after.

- **Before changes:**

```bash
benchmark_id='regression-before' ./tools/benchmark/download-benchmark-results.bash
```

- **After changes:**

```bash
benchmark_id='regression-after' ./tools/benchmark/download-benchmark-results.bash
```

## Analyze Benchmark Results

Open the provided Jupyter notebook (`./tools/benchmark/benchmark.ipynb`) and run each cell to analyze the results:

- In the last cell, update the benchmark IDs to `regression-before` and `regression-after`.
- Compare latency and throughput between the two runs and perform regression analysis.
- Check the R² values specifically:
  - **Prompts Attempted/Succeeded:** Expect R² ≈ 1.
  - **Output Tokens per Minute, P90 per Output Token Latency, and P90 Latency:** Expect R² close to 1 (allowing minor variance).
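
For reference, the R² reported by the notebook is presumably the standard coefficient of determination from regressing the after-change metrics against the before-change metrics; in the usual notation:

```latex
% Coefficient of determination: y_i are observed values, \hat{y}_i fitted values, \bar{y} their mean.
R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}
```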

Identify any significant deviations, investigate their root causes, and confirm that the deployment maintains the expected performance standards.
