
Commit 29a12cf

cherry-pick speculative decoding related PR #133 and #135 (#136)
* docs: move Constrained_Decoding and Function_Calling to Feature_Guide | rm AI_Agents_Guide folder (#135)
* docs: Add EAGLE/SpS Speculative Decoding support with vLLM (#133)
1 parent f6fd598 commit 29a12cf

20 files changed: +356 −79 lines changed

AI_Agents_Guide/README.md

Lines changed: 0 additions & 62 deletions
This file was deleted.

Feature_Guide/Speculative_Decoding/README.md

Lines changed: 3 additions & 1 deletion
@@ -54,4 +54,6 @@ may prove simpler than generating a summary for an article. [Spec-Bench](https:/
 shows the performance of different speculative decoding approaches on different tasks.
 
 ## Speculative Decoding with Triton Inference Server
-Follow [here](TRT-LLM/README.md) to learn how Triton Inference Server supports speculative decoding with [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM).
+Triton Inference Server supports speculative decoding on different types of Triton backends. See what a Triton backend is [here](https://github.com/triton-inference-server/backend).
+- Follow [here](TRT-LLM/README.md) to learn how Triton Inference Server supports speculative decoding with [TensorRT-LLM Backend](https://github.com/triton-inference-server/tensorrtllm_backend).
+- Follow [here](vLLM/README.md) to learn how Triton Inference Server supports speculative decoding with [vLLM Backend](https://github.com/triton-inference-server/vllm_backend).

Feature_Guide/Speculative_Decoding/TRT-LLM/README.md

Lines changed: 22 additions & 15 deletions
@@ -28,10 +28,17 @@
 
 # Speculative Decoding with TensorRT-LLM
 
-This tutorial shows how to build and serve speculative decoding models in Triton Inference Server with [TensorRT-LLM backend](https://github.com/triton-inference-server/tensorrtllm_backend) on a single node with one GPU.
+- [About Speculative Decoding](#about-speculative-decoding)
+- [EAGLE](#eagle)
+- [MEDUSA](#medusa)
+- [Draft Model-Based Speculative Decoding](#draft-model-based-speculative-decoding)
+
+## About Speculative Decoding
+
+This tutorial shows how to build and serve speculative decoding models in Triton Inference Server with [TensorRT-LLM Backend](https://github.com/triton-inference-server/tensorrtllm_backend) on a single node with one GPU. Please go to [Speculative Decoding](../README.md) main page to learn more about other supported backends.
 
 According to [Spec-Bench](https://sites.google.com/view/spec-bench), EAGLE is currently the top-performing approach for speeding up LLM inference across different tasks.
-In this tutorial, we'll focus on [EAGLE](#eagle) and demonstrate how to make it work with Triton Inference Server. However, we'll also cover [MEDUSA](#medusa) and [Speculative Sampling (SpS)](#speculative-sampling) for those interested in exploring alternative methods. This way, you can choose the best fit for your needs.
+In this tutorial, we'll focus on [EAGLE](#eagle) and demonstrate how to make it work with Triton Inference Server. However, we'll also cover [MEDUSA](#medusa) and [Draft Model-Based Speculative Decoding](#draft-model-based-speculative-decoding) for those interested in exploring alternative methods. This way, you can choose the best fit for your needs.
 
 ## EAGLE
 
@@ -42,7 +49,7 @@ EAGLE ([paper](https://arxiv.org/pdf/2401.15077) | [github](https://github.com/S
 ### Acquiring EAGLE Model and its Base Model
 
 In this example, we will be using the [EAGLE-Vicuna-7B-v1.3](https://huggingface.co/yuhuili/EAGLE-Vicuna-7B-v1.3) model.
-More types of EAGLE models could be found [here](https://sites.google.com/view/eagle-llm). The base model [Vicuna-7B-v1.3](https://huggingface.co/lmsys/vicuna-7b-v1.3) is also needed for EAGLE to work.
+More types of EAGLE models can be found [here](https://huggingface.co/yuhuili). The base model [Vicuna-7B-v1.3](https://huggingface.co/lmsys/vicuna-7b-v1.3) is also needed for EAGLE to work.
 
 To download both models, run the following command:
 ```bash
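The download command itself is cut off by the diff context above. Purely as an illustration (not part of this commit), one way to fetch both checkpoints, assuming the Hugging Face Hub CLI is installed:

```bash
# Hypothetical download sketch -- the tutorial's actual command is not shown in this hunk.
# Assumes the CLI is available: pip install -U "huggingface_hub[cli]"
huggingface-cli download yuhuili/EAGLE-Vicuna-7B-v1.3 --local-dir EAGLE-Vicuna-7B-v1.3
huggingface-cli download lmsys/vicuna-7b-v1.3 --local-dir vicuna-7b-v1.3
```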
@@ -59,7 +66,7 @@ Launch Triton docker container with TensorRT-LLM backend.
 Note that we're mounting the downloaded EAGLE and base models to `/hf-models` in the docker container.
 Make an `engines` folder outside docker to reuse engines for future runs.
 Please, make sure to replace <xx.yy> with the version of Triton that you want
-to use (must be >= 25.01). The latest Triton Server container could be found [here](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/tritonserver/tags).
+to use (must be >= 25.01). The latest Triton Server container is recommended and can be found [here](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/tritonserver/tags).
 
 ```bash
 docker run --rm -it --net host --shm-size=2g \
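Only the first line of the `docker run` command appears in this hunk. As a rough sketch of what a complete invocation could look like (the mount paths, extra flags, and host directories below are assumptions for illustration, not taken from this commit):

```bash
# Illustrative sketch only -- the remainder of the command is truncated in the hunk above.
# Mounts the downloaded models at /hf-models and the engines folder at /engines, per the text above.
docker run --rm -it --net host --shm-size=2g \
    --gpus all \
    -v /path/to/hf-models:/hf-models \
    -v /path/to/engines:/engines \
    nvcr.io/nvidia/tritonserver:<xx.yy>-trtllm-python-py3
```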
@@ -195,7 +202,7 @@ python3 /tensorrtllm_client/inflight_batcher_llm_client.py --request-output-len
 > ...
 > ```
 
-2. The [generate endpoint](https://github.com/triton-inference-server/tensorrtllm_backend/tree/release/0.5.0#query-the-server-with-the-triton-generate-endpoint).
+2. The [generate endpoint](https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/protocol/extension_generate.html).
 
 ```bash
 curl -X POST localhost:8000/v2/models/ensemble/generate -d '{"text_input": "What is ML?", "max_tokens": 50, "bad_words": "", "stop_words": "", "pad_id": 2, "end_id": 2}'
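Not part of this commit, but for reference: the generate extension linked above also defines a streaming variant, `generate_stream`, which returns Server-Sent Events. A minimal sketch, assuming the served model is configured for decoupled/streaming responses:

```bash
# Hypothetical streaming request against the generate extension's SSE endpoint.
curl -X POST localhost:8000/v2/models/ensemble/generate_stream -d '{"text_input": "What is ML?", "max_tokens": 50, "bad_words": "", "stop_words": "", "pad_id": 2, "end_id": 2}'
```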
@@ -219,7 +226,7 @@ format required by Gen-AI Perf. Note that MT-bench could not be used since Gen-A
 ```bash
 wget https://raw.githubusercontent.com/SafeAILab/EAGLE/main/eagle/data/humaneval/question.jsonl
 
-# dataset-converter.py file can be found in the same folder as this README.
+# dataset-converter.py file can be found in the parent folder of this README.
 python3 dataset-converter.py --input_file question.jsonl --output_file converted_humaneval.jsonl
 ```
 
@@ -419,20 +426,20 @@ python3 ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/ensemble/config.pbtxt triton_
 python3 ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/tensorrt_llm/config.pbtxt triton_backend:${TRITON_BACKEND},triton_max_batch_size:${MAX_BATCH_SIZE},decoupled_mode:${DECOUPLED_MODE},engine_dir:${ENGINE_DIR},max_queue_delay_microseconds:${MAX_QUEUE_DELAY_MS},batching_strategy:inflight_fused_batching,encoder_input_features_data_type:TYPE_FP16,logits_datatype:${LOGITS_DATATYPE}
 ```
 
-## Speculative Sampling
+## Draft Model-Based Speculative Decoding
 
-Speculative Sampling (SpS) ([paper](https://arxiv.org/pdf/2302.01318)) is another (and earlier) approach to accelerate LLM inference, distinct from both EAGLE and MEDUSA. Here are the key differences:
+Draft Model-Based Speculative Decoding ([paper](https://arxiv.org/pdf/2302.01318)) is another (and earlier) approach to accelerate LLM inference, distinct from both EAGLE and MEDUSA. Here are the key differences:
 
-- Draft Generation: SpS uses a smaller, faster LLM as a draft model to predict multiple tokens ahead1. This contrasts with EAGLE's feature-level extrapolation and MEDUSA's additional decoding heads.
+- Draft Generation: it uses a smaller, faster LLM as a draft model to predict multiple tokens ahead. This contrasts with EAGLE's feature-level extrapolation and MEDUSA's additional decoding heads.
 
-- Verification Process: SpS employs a chain-like structure for draft generation and verification, unlike EAGLE and MEDUSA which use tree-based attention mechanisms.
+- Verification Process: it employs a chain-like structure for draft generation and verification, unlike EAGLE and MEDUSA which use tree-based attention mechanisms.
 
-- Consistency: SpS maintains distribution consistency with the target LLM in both greedy and non-greedy settings, similar to EAGLE but different from MEDUSA.
+- Consistency: it maintains distribution consistency with the target LLM in both greedy and non-greedy settings, similar to EAGLE but different from MEDUSA.
 
-- Efficiency: While effective, SpS is generally slower than both EAGLE and MEDUSA.
+- Efficiency: While effective, it is generally slower than both EAGLE and MEDUSA.
 
-- Implementation: SpS requires a separate draft model, which can be challenging to implement effectively for smaller target models. EAGLE and MEDUSA, in contrast, modify the existing model architecture.
+- Implementation: it requires a separate draft model, which can be challenging to implement effectively for smaller target models. EAGLE and MEDUSA, in contrast, modify the existing model architecture.
 
-- Accuracy: SpS's draft accuracy can vary depending on the draft model used, while EAGLE achieves a higher draft accuracy (about 0.8) compared to MEDUSA (about 0.6).
+- Accuracy: its draft accuracy can vary depending on the draft model used, while EAGLE achieves a higher draft accuracy (about 0.8) compared to MEDUSA (about 0.6).
 
-Please follow the steps [here](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/advanced/speculative-decoding.md#using-draft-target-model-approach-with-triton-inference-server) to run SpS with Triton Inference Server.
+Please follow the steps [here](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/advanced/speculative-decoding.md#using-draft-target-model-approach-with-triton-inference-server) to run Draft Model-Based Speculative Decoding with Triton Inference Server.
