# Regression Testing

Regression testing verifies that recent code changes have not adversely affected the performance or stability of the Inference Gateway.

This guide explains how to run regression tests against the Gateway API inference extension using the [Latency Profile Generator (LPG)](https://github.com/AI-Hypercomputer/inference-benchmark/) to simulate traffic and collect performance metrics.

## Prerequisites

Refer to the [benchmark guide](/site-src/performance/benchmark/index.md) for common setup steps, including deployment of the inference extension, model server setup, scaling the vLLM deployment, and obtaining the Gateway IP.

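Once that setup is complete, you should have the vLLM deployment scaled out and the Gateway IP in hand. A minimal sketch of the final checks, assuming placeholder resource names (substitute the deployment and gateway names from your own setup):

```bash
# Scale the model server and capture the Gateway IP for the benchmark tool.
# <your-vllm-deployment> and <your-inference-gateway> are placeholders; use the names from the benchmark guide setup.
kubectl scale deployment <your-vllm-deployment> --replicas=10
IP=$(kubectl get gateway/<your-inference-gateway> -o jsonpath='{.status.addresses[0].value}')
PORT=80
echo "Benchmarks will target ${IP}:${PORT}"
```
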
## Create the LPG Docker Image

Follow the detailed instructions [here](https://github.com/AI-Hypercomputer/inference-benchmark/blob/1c92df607751a7ddb04e2152ed7f6aaf85bd9ca7/README.md) to build the LPG Docker image. Specifically, once you are in the `inference-benchmark` directory, follow the steps below:

* Create an Artifact Registry repository to store the image so your cluster can pull it:

```bash
gcloud artifacts repositories create ai-benchmark --location=us-central1 --repository-format=docker
```

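If your local Docker client is not yet authorized for this registry, you will likely also need a one-time credential setup; a sketch for the `us-central1` registry host used above:

```bash
# Configure the local Docker client to authenticate to Artifact Registry in us-central1.
gcloud auth configure-docker us-central1-docker.pkg.dev
```
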
* Prepare the [Infinity-Instruct](https://huggingface.co/datasets/BAAI/Infinity-Instruct) and [billsum](https://huggingface.co/datasets/FiscalNote/billsum) datasets:

```bash
pip install datasets transformers numpy pandas tqdm matplotlib
python datasets/import_dataset.py --hf_token YOUR_TOKEN
```

* Build the benchmark Docker image:

```bash
docker build -t inference-benchmark .
```

* Push the Docker image to your artifact registry:

```bash
docker tag inference-benchmark us-central1-docker.pkg.dev/{project-name}/ai-benchmark/inference-benchmark
docker push us-central1-docker.pkg.dev/{project-name}/ai-benchmark/inference-benchmark
```

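Optionally, confirm that the image is now available in the repository before referencing it from the LPG manifests; one way to check, using the same registry path as above:

```bash
# List images in the ai-benchmark repository to verify the push succeeded.
gcloud artifacts docker images list us-central1-docker.pkg.dev/{project-name}/ai-benchmark
```
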
## Conduct Regression Tests

Run benchmarks using the configurations below, which assume NVIDIA H100 GPUs (80 GB). Adjust them as needed for different hardware, backend counts, or datasets.

### Test Case 1: Single Workload

- **Dataset:** `billsum_conversations.json` (created from the [HuggingFace billsum dataset](https://huggingface.co/datasets/FiscalNote/billsum)).
  * This dataset features long prompts, making it prefill-heavy and ideal for testing scenarios that emphasize initial token generation.
- **Model:** [Llama 3 (8B)](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) (*critical*)
- **Replicas:** 10 (vLLM)
- **Request Rates:** 300–350 (increments of 10)

Refer to the example LPG manifest for this configuration:
`./config/manifests/regression-testing/single-workload-regression.yaml`

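To run this test case, deploy the manifest and let the benchmark finish before downloading results. A minimal sketch (the exact resource names and labels come from the manifest itself):

```bash
# Launch the single-workload LPG benchmark defined above.
kubectl apply -f ./config/manifests/regression-testing/single-workload-regression.yaml
# Watch the benchmark pods until the run completes (narrow the selector to the manifest's labels if desired).
kubectl get pods --watch
```
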
### Test Case 2: Multi-LoRA

- **Dataset:** `Infinity-Instruct_conversations.json` (created from the [HuggingFace Infinity-Instruct dataset](https://huggingface.co/datasets/BAAI/Infinity-Instruct)).
  * This dataset has long outputs, making it decode-heavy and useful for testing scenarios that focus on sustained token generation.
- **Model:** [Llama 3 (8B)](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct)
- **LoRA Adapters:** 15 adapters of `nvidia/llama-3.1-nemoguard-8b-topic-control` (rank 8, <1% of base model size, all *critical*)
- **Hardware:** NVIDIA H100 GPUs (80 GB)
- **Traffic Distribution:** 60% to the first 5 adapters (12% each), 30% to the next 5 (6% each), and 10% to the last 5 (2% each), simulating prod/dev/test tiers
- **Max LoRA:** 3
- **Replicas:** 10 (vLLM)
- **Request Rates:** 20–200 (increments of 20)

Optionally, you can also run benchmarks using the `ShareGPT` dataset for additional coverage.

Update your vLLM and InferenceModel deployments to support multi-LoRA as described above:

- vLLM deployment: `./config/manifests/regression-testing/vllm/multi-lora-deployment.yaml`
- InferenceModel: `./config/manifests/inferencemodel.yaml`

Refer to the example LPG manifest for this configuration:
`./config/manifests/regression-testing/multi-lora-regression.yaml`

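As with the single-workload case, a minimal sketch of running this test case, assuming the multi-LoRA deployment and InferenceModel manifests are applied before the benchmark is launched:

```bash
# Roll out the multi-LoRA model server and InferenceModel objects, then start the LPG benchmark.
kubectl apply -f ./config/manifests/regression-testing/vllm/multi-lora-deployment.yaml
kubectl apply -f ./config/manifests/inferencemodel.yaml
kubectl apply -f ./config/manifests/regression-testing/multi-lora-regression.yaml
```
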
### Execute Benchmarks

Collect benchmark results in two phases: first before making your changes, and then after applying them to the Inference Gateway.

- **Before changes:**

```bash
benchmark_id='regression-before' ./tools/benchmark/download-benchmark-results.bash
```

- **After changes:**

```bash
benchmark_id='regression-after' ./tools/benchmark/download-benchmark-results.bash
```

## Analyze Benchmark Results

Use the provided Jupyter notebook (`./tools/benchmark/benchmark.ipynb`) to analyze the results (a sketch for launching it locally follows the list below):

- Run each cell; in the last cell, update the benchmark IDs to `regression-before` and `regression-after`.
- The notebook compares latency and throughput between the two runs and performs a regression analysis.
- Check the R² values in particular:
  - **Prompts Attempted and Succeeded:** expect R² ≈ 1.
  - **Output Tokens per Minute, P90 per Output Token Latency, and P90 Latency:** expect R² values close to 1 (allowing acceptable variance).

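If you do not already have a Jupyter environment, one way to open the notebook locally (assuming a standard `pip`-based setup):

```bash
# Install Jupyter and open the analysis notebook.
pip install jupyter
jupyter notebook ./tools/benchmark/benchmark.ipynb
```
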
Identify any significant performance shifts, investigate their root causes, and confirm that the deployment continues to meet expected performance standards.