Commit d2283fb: "address comments"
1 parent 036a84d commit d2283fb

2 files changed: +81 -51 lines changed


Feature_Guide/Speculative_Decoding/EAGLE/README.md

Lines changed: 80 additions & 50 deletions
@@ -28,7 +28,7 @@
  # EAGLE Speculative Decoding

- This tutorial shows how to build and run a model using EAGLE speculative decoding ([paper](https://arxiv.org/pdf/2401.15077) | [github](hhttps://github.com/SafeAILab/EAGLE/tree/main) | [blog](https://sites.google.com/view/eagle-llm)) in Triton Inference Server with TensorRT-LLM backend on a single node with one GPU.
+ This tutorial shows how to build and run a model using EAGLE speculative decoding ([paper](https://arxiv.org/pdf/2401.15077) | [github](https://github.com/SafeAILab/EAGLE/tree/main) | [blog](https://sites.google.com/view/eagle-llm)) in Triton Inference Server with the TensorRT-LLM backend on a single node with one GPU.

  TensorRT-LLM is NVIDIA's recommended solution for running Large Language Models (LLMs) on NVIDIA GPUs. Read more about TensorRT-LLM [here](https://github.com/NVIDIA/TensorRT-LLM) and Triton's TensorRT-LLM Backend [here](https://github.com/triton-inference-server/tensorrtllm_backend).

@@ -230,62 +230,53 @@ You can read more about Gen-AI Perf [here](https://docs.nvidia.com/deeplearning/
  1. Prepare Dataset

- We will be using the HumanEval dataset for our evaluation, which is used in the original EAGLE paper. The HumanEval dataset has been converted to the format required by EAGLE and is available [here](https://github.com/SafeAILab/EAGLE/blob/main/eagle/data/humaneval/question.jsonl). To make it compatible for Gen-AI Perf, we need to do another conversion. You may use other datasets besides HumanEval as well, as long as it could be converted to the format required by Gen-AI Perf. Note that MT-bench could not be used since Gen-AI Perf does not support multiturn dataset as input yet. Follow the steps below to download and convert the dataset.
-
+ We will be using the HumanEval dataset for our evaluation, which is used in the original EAGLE paper. The HumanEval dataset has been converted to the format required by EAGLE and is available [here](https://github.com/SafeAILab/EAGLE/blob/main/eagle/data/humaneval/question.jsonl). To make it compatible with Gen-AI Perf, we need to do another conversion. You may use other datasets besides HumanEval as well, as long as they can be converted to the format required by Gen-AI Perf. Note that MT-Bench cannot be used, since Gen-AI Perf does not yet support multi-turn datasets as input. Follow the steps below to download and convert the dataset.

  ```bash
  wget https://raw.githubusercontent.com/SafeAILab/EAGLE/main/eagle/data/humaneval/question.jsonl

  python3 dataset-converter.py --input_file question.jsonl --output_file converted_humaneval.jsonl
  ```
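For orientation, here is a minimal sketch of the job dataset-converter.py performs, assuming EAGLE-style records keep their prompt in a `turns` list and that Gen-AI Perf accepts JSONL lines with a single `text` field (both field names are assumptions; the dataset-converter.py shipped with this tutorial is the authoritative implementation):

```python
# Illustrative sketch of the EAGLE-to-Gen-AI-Perf dataset conversion.
# ASSUMPTIONS: input records look like {"question_id": ..., "turns": ["..."]}
# and Gen-AI Perf consumes JSONL lines of the form {"text": "..."};
# defer to the tutorial's dataset-converter.py for the real format.
import argparse
import json

parser = argparse.ArgumentParser()
parser.add_argument("--input_file", required=True)
parser.add_argument("--output_file", required=True)
args = parser.parse_args()

with open(args.input_file) as src, open(args.output_file, "w") as dst:
    for line in src:
        record = json.loads(line)
        # Gen-AI Perf has no multi-turn support, so keep only the first turn.
        dst.write(json.dumps({"text": record["turns"][0]}) + "\n")
```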

- 2. Get Gen-AI Perf Tool
-
- Gen-AI Perf is available in the SDK container as shown in the [Send an Inference Request](#send-an-inference-request) section. The only difference is that you need to mount the converted dataset to the container:
+ 2. Install GenAI-Perf (Ubuntu 24.04, Python 3.10+)

  ```bash
- docker run --rm -it --net host --shm-size=2g \
-     --ulimit memlock=-1 --ulimit stack=67108864 --gpus all \
-     -v </path/to/tensorrtllm_backend/inflight_batcher_llm/client>:/tensorrtllm_client \
-     -v </path/to/eagle/and/base/model/>:/hf-models \
-     -v </path/to/converted/dataset/>:/data \
-     nvcr.io/nvidia/tritonserver:<xx.yy>-py3-sdk
+ pip install genai-perf
  ```
+ NOTE: You must already have CUDA 12 installed.

- 3. Run Gen-AI Perf Tool
+ 3. Run Gen-AI Perf

  Run the following command:
  ```bash
- pip3 install transformers sentencepiece
-
  genai-perf \
    profile \
    -m ensemble \
    --service-kind triton \
    --backend tensorrtllm \
-   --input-file /data/converted_humaneval.jsonl \
-   --tokenizer /hf-models/vicuna-7b-v1.3/ \
-   --concurrency 1 \
-   --measurement-interval 4000 \
+   --input-file /path/to/converted/dataset/converted_humaneval.jsonl \
+   --tokenizer /path/to/hf-models/vicuna-7b-v1.3/ \
    --profile-export-file my_profile_export.json \
-   --url localhost:8001
+   --url localhost:8001 \
+   --request-rate 2
  ```
  NOTE: You may need to change the input-file name according to your converted dataset; the command above uses converted_humaneval.jsonl as an example.
  A sample output looks like this:
  ```
-                               NVIDIA GenAI-Perf | LLM Metrics
- ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┓
- ┃ Statistic                         ┃      avg ┃    min ┃      max ┃      p99 ┃      p90 ┃      p75 ┃
- ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━┩
- │ Request Latency (ms)              │ 1,246.69 │ 611.48 │ 1,979.95 │ 1,968.77 │ 1,868.16 │ 1,738.61 │
- │ Output Sequence Length (tokens)   │   316.55 │ 153.00 │   433.00 │   431.70 │   420.00 │   399.50 │
- │ Input Sequence Length (tokens)    │   142.09 │  63.00 │   195.00 │   194.20 │   187.00 │   175.00 │
- │ Output Token Throughput (per sec) │   253.89 │    N/A │      N/A │      N/A │      N/A │      N/A │
- │ Request Throughput (per sec)      │     0.80 │    N/A │      N/A │      N/A │      N/A │      N/A │
- │ Request Count (count)             │    11.00 │    N/A │      N/A │      N/A │      N/A │      N/A │
- └───────────────────────────────────┴──────────┴────────┴──────────┴──────────┴──────────┴──────────┘
+                               NVIDIA GenAI-Perf | LLM Metrics
+ ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┓
+ ┃ Statistic                         ┃      avg ┃    min ┃       max ┃       p99 ┃       p90 ┃       p75 ┃
+ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━┩
+ │ Request Latency (ms)              │ 7,667.61 │ 940.38 │ 16,101.65 │ 15,439.36 │ 14,043.33 │ 10,662.31 │
+ │ Output Sequence Length (tokens)   │   319.87 │ 133.00 │    485.00 │    472.08 │    441.60 │    404.00 │
+ │ Input Sequence Length (tokens)    │   153.05 │  63.00 │    278.00 │    259.38 │    190.20 │    183.50 │
+ │ Output Token Throughput (per sec) │   360.53 │    N/A │       N/A │       N/A │       N/A │       N/A │
+ │ Request Throughput (per sec)      │     1.13 │    N/A │       N/A │       N/A │       N/A │       N/A │
+ │ Request Count (count)             │    39.00 │    N/A │       N/A │       N/A │       N/A │       N/A │
+ └───────────────────────────────────┴──────────┴────────┴───────────┴───────────┴───────────┴───────────┘
  ```
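The --profile-export-file flag in the command above also leaves these metrics on disk. A minimal sketch for inspecting them programmatically, assuming genai-perf writes an aggregated `my_profile_export_genai_perf.json` alongside the raw export and stores one dict of statistics per metric (both the file-name pattern and the layout are assumptions; check the files your run actually produced):

```python
# Print the average of every metric found in the aggregated stats file.
# ASSUMPTION: file name <stem>_genai_perf.json and a {metric: {"avg": ...}}
# layout mirroring the console table; verify against your own artifacts.
import json

with open("my_profile_export_genai_perf.json") as f:
    stats = json.load(f)

for metric, values in stats.items():
    if isinstance(values, dict) and "avg" in values:
        print(f"{metric}: avg = {values['avg']}")
```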

- 4. Run Gen-AI Perf Tool on Base Model
+ 4. Run Gen-AI Perf on Base Model

  To compare performance between the EAGLE and base models, we need to run Gen-AI Perf on the base model as well. To do so, repeat the steps above for the base model with minor changes.

@@ -348,31 +339,70 @@ genai-perf \
    -m ensemble \
    --service-kind triton \
    --backend tensorrtllm \
-   --input-file /data/converted_humaneval.jsonl \
-   --tokenizer /hf-models/vicuna-7b-v1.3/ \
-   --concurrency 1 \
-   --measurement-interval 4000 \
+   --input-file /path/to/converted/dataset/converted_humaneval.jsonl \
+   --tokenizer /path/to/hf-models/vicuna-7b-v1.3/ \
    --profile-export-file my_profile_export.json \
-   --url localhost:8001
+   --url localhost:8001 \
+   --request-rate 2
  ```

  Sample performance output for the base model:
  ```
-                               NVIDIA GenAI-Perf | LLM Metrics
- ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┓
- ┃ Statistic                         ┃      avg ┃      min ┃      max ┃      p99 ┃      p90 ┃      p75 ┃
- ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━┩
- │ Request Latency (ms)              │ 3,070.83 │ 1,681.38 │ 4,089.21 │ 4,088.88 │ 4,085.93 │ 4,081.01 │
- │ Output Sequence Length (tokens)   │   335.00 │   235.00 │   414.00 │   412.17 │   395.70 │   368.25 │
- │ Input Sequence Length (tokens)    │   143.25 │    97.00 │   187.00 │   186.13 │   178.30 │   165.25 │
- │ Output Token Throughput (per sec) │   109.09 │      N/A │      N/A │      N/A │      N/A │      N/A │
- │ Request Throughput (per sec)      │     0.33 │      N/A │      N/A │      N/A │      N/A │      N/A │
- │ Request Count (count)             │     4.00 │      N/A │      N/A │      N/A │      N/A │      N/A │
- └───────────────────────────────────┴──────────┴──────────┴──────────┴──────────┴──────────┴──────────┘
+                               NVIDIA GenAI-Perf | LLM Metrics
+ ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┓
+ ┃ Statistic                         ┃      avg ┃      min ┃       max ┃       p99 ┃       p90 ┃       p75 ┃
+ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━┩
+ │ Request Latency (ms)              │ 8,730.66 │ 1,792.94 │ 16,376.18 │ 16,054.17 │ 14,780.51 │ 12,529.04 │
+ │ Output Sequence Length (tokens)   │   353.32 │   153.00 │    534.00 │    508.65 │    445.30 │    428.25 │
+ │ Input Sequence Length (tokens)    │   156.62 │    63.00 │    296.00 │    288.98 │    196.60 │    185.00 │
+ │ Output Token Throughput (per sec) │   410.03 │      N/A │       N/A │       N/A │       N/A │       N/A │
+ │ Request Throughput (per sec)      │     1.16 │      N/A │       N/A │       N/A │       N/A │       N/A │
+ │ Request Count (count)             │    40.00 │      N/A │       N/A │       N/A │       N/A │       N/A │
+ └───────────────────────────────────┴──────────┴──────────┴───────────┴───────────┴───────────┴───────────┘
  ```

  5. Compare Performance

- From the sample runs above, we can see that the EAGLE model has a lower latency and higher throughput than the base model. Specifically, the EAGLE model can generate 253.89 tokens per second, while the base model can only generate 109.09 tokens per second with a speed up of 2.33x.
+ ```bash
+ genai-perf \
+   profile \
+   -m ensemble \
+   --service-kind triton \
+   --backend tensorrtllm \
+   --input-file /path/to/converted/dataset/converted_gsm8k.jsonl \
+   --tokenizer /path/to/hf-models/vicuna-7b-v1.3/ \
+   --profile-export-file my_profile_export.json \
+   --url localhost:8001 \
+   --request-rate 5
+ ```
+
+ EAGLE model performance output on GSM8K:
+ ```
+                               NVIDIA GenAI-Perf | LLM Metrics
+ ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┓
+ ┃ Statistic                         ┃      avg ┃   min ┃       max ┃       p99 ┃      p90 ┃      p75 ┃
+ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━┩
+ │ Request Latency (ms)              │ 5,633.32 │ 34.94 │ 13,317.99 │ 12,415.99 │ 9,931.24 │ 8,085.09 │
+ │ Output Sequence Length (tokens)   │   116.02 │ 23.00 │    353.00 │    348.77 │   305.30 │   126.00 │
+ │ Input Sequence Length (tokens)    │    66.70 │ 23.00 │    148.00 │    144.39 │   102.10 │    81.00 │
+ │ Output Token Throughput (per sec) │   389.08 │   N/A │       N/A │       N/A │      N/A │      N/A │
+ │ Request Throughput (per sec)      │     3.35 │   N/A │       N/A │       N/A │      N/A │      N/A │
+ │ Request Count (count)             │   120.00 │   N/A │       N/A │       N/A │      N/A │      N/A │
+ └───────────────────────────────────┴──────────┴───────┴───────────┴───────────┴──────────┴──────────┘
+ ```
+
+ Base model performance output on GSM8K:
+ ```
+                               NVIDIA GenAI-Perf | LLM Metrics
+ ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┓
+ ┃ Statistic                         ┃      avg ┃   min ┃      max ┃      p99 ┃      p90 ┃      p75 ┃
+ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━┩
+ │ Request Latency (ms)              │ 4,327.16 │ 32.04 │ 9,253.56 │ 9,033.99 │ 7,175.71 │ 6,257.44 │
+ │ Output Sequence Length (tokens)   │   116.09 │ 23.00 │   353.00 │   330.00 │   289.00 │   127.00 │
+ │ Input Sequence Length (tokens)    │    65.24 │ 23.00 │   148.00 │   139.83 │    98.40 │    79.00 │
+ │ Output Token Throughput (per sec) │   472.50 │   N/A │      N/A │      N/A │      N/A │      N/A │
+ │ Request Throughput (per sec)      │     4.07 │   N/A │      N/A │      N/A │      N/A │      N/A │
+ │ Request Count (count)             │   144.00 │   N/A │      N/A │      N/A │      N/A │      N/A │
+ └───────────────────────────────────┴──────────┴───────┴──────────┴──────────┴──────────┴──────────┘
+ ```

- As stated above, the number above is gathered from a single node with one GPU - RTX 5880 (48GB GPU memory). The actual number may vary due to the different hardware and environment.
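Turning two of these reports into a speedup figure is a pair of ratios. A minimal sketch using the sample HumanEval numbers from the tables above (your own measurements will differ with hardware, dataset, and request rate):

```python
# Compare the sample HumanEval runs above: EAGLE vs. the base model.
# Numbers are copied from the report tables; substitute your own results.
eagle = {"avg_latency_ms": 7667.61, "output_tok_per_sec": 360.53}
base = {"avg_latency_ms": 8730.66, "output_tok_per_sec": 410.03}

latency_speedup = base["avg_latency_ms"] / eagle["avg_latency_ms"]
throughput_ratio = eagle["output_tok_per_sec"] / base["output_tok_per_sec"]

print(f"EAGLE latency speedup over base: {latency_speedup:.2f}x")          # ~1.14x
print(f"EAGLE vs. base output-token throughput: {throughput_ratio:.2f}x")  # ~0.88x
```

At these request rates the EAGLE build finishes individual requests faster, while the base model sustains slightly higher raw token throughput; whether that trade-off favors EAGLE depends on your latency targets.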

Feature_Guide/Speculative_Decoding/README.md

Lines changed: 1 addition & 1 deletion
@@ -55,4 +55,4 @@ may prove simpler than generating a summary for an article.
  ## Speculative Decoding with Triton Inference Server
  Triton Inference Server supports speculative decoding with [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM). Currently, Triton Inference Server supports the following popular speculative decoding approaches:

- 1. [EAGLE](./EAGLE/README.md) speculative decoding ([paper](https://arxiv.org/pdf/2401.15077) | [github](hhttps://github.com/SafeAILab/EAGLE/tree/main) | [blog](https://sites.google.com/view/eagle-llm))
+ 1. [EAGLE](./EAGLE/README.md) speculative decoding ([paper](https://arxiv.org/pdf/2401.15077) | [github](https://github.com/SafeAILab/EAGLE/tree/main) | [blog](https://sites.google.com/view/eagle-llm))
