Docs: add tutorials on EAGLE, MEDUSA, vanilla speculative decoding using TRT-LLM #131
Merged
7 commits
c722c19 add tutorials on speculative decoding main page and EAGLE sub page (ziqif-nv)
d481e41 minor change (ziqif-nv)
036a84d minor (ziqif-nv)
d2283fb address comments (ziqif-nv)
eb20da8 major refactor of EAGLE and added MEDUSA and SpS (ziqif-nv)
6b4a079 minor fix (ziqif-nv)
00e95b9 address comments (ziqif-nv)
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,57 @@ | ||
<!-- | ||
# Copyright 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved. | ||
# | ||
# Redistribution and use in source and binary forms, with or without | ||
# modification, are permitted provided that the following conditions | ||
# are met: | ||
# * Redistributions of source code must retain the above copyright | ||
# notice, this list of conditions and the following disclaimer. | ||
# * Redistributions in binary form must reproduce the above copyright | ||
# notice, this list of conditions and the following disclaimer in the | ||
# documentation and/or other materials provided with the distribution. | ||
# * Neither the name of NVIDIA CORPORATION nor the names of its | ||
# contributors may be used to endorse or promote products derived | ||
# from this software without specific prior written permission. | ||
# | ||
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY | ||
# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE | ||
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR | ||
# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR | ||
# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, | ||
# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, | ||
# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR | ||
# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY | ||
# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT | ||
# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE | ||
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. | ||
--> | ||
|
||
# Speculative Decoding | ||
|
||
- [About Speculative Decoding](#about-speculative-decoding) | ||
- [Performance Improvements](#performance-improvements) | ||
- [Speculative Decoding with Triton Inference Server](#speculative-decoding-with-triton-inference-server) | ||
|
||
|
||
## About Speculative Decoding | ||
|
||
Speculative Decoding (also referred to as Speculative Sampling) is a set of techniques designed to allow generation of more than one token per forward pass iteration. This can lead to a reduction in the average per-token latency **in situations where the GPU is underutilized due to small batch sizes.** | ||
|
||
Speculative decoding involves predicting a sequence of future tokens, referred to as draft tokens, using a method that is substantially more efficient than repeatedly executing the target Large Language Model (LLM). | ||
These draft tokens are then collectively validated by processing them through the target LLM in a single forward pass. The underlying assumptions are twofold: | ||
|
||
1. processing multiple draft tokens concurrently will be as rapid as processing a single token | ||
2. multiple draft tokens will be validated successfully over the course of the full generation | ||
|
||
If the first assumption holds true, the latency of speculative decoding will be no worse than the standard approach. If the second holds, output token generation advances by statistically more than one token per forward pass. | ||
Together, these two properties allow speculative decoding to reduce latency. | ||
|
||
## Performance Improvements | ||
|
||
It's important to note that the effectiveness of speculative decoding techniques is highly dependent | ||
on the specific task at hand. For instance, forecasting subsequent tokens in a code-completion scenario | ||
may prove simpler than generating a summary for an article. [Spec-Bench](https://sites.google.com/view/spec-bench) | ||
shows the performance of different speculative decoding approaches on different tasks. | ||
|
||
## Speculative Decoding with Triton Inference Server | ||
See the [TRT-LLM tutorial](TRT-LLM/README.md) to learn how Triton Inference Server supports speculative decoding with [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM). |
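
The draft-and-verify loop described in "About Speculative Decoding" can be illustrated with a small, self-contained sketch. This is not the TensorRT-LLM or Triton API: `speculative_decode`, `autoregressive_decode`, `target_next`, and `draft_next` are hypothetical names, the "models" are toy functions over integer tokens, and acceptance uses simple greedy matching rather than any particular sampling scheme. The single target pass that scores all draft positions is simulated here by one counted step.

```python
from typing import Callable, List, Tuple

Token = int
NextTokenFn = Callable[[List[Token]], Token]  # context -> greedy next token


def autoregressive_decode(target_next: NextTokenFn, prompt: List[Token],
                          num_new_tokens: int) -> List[Token]:
    """Baseline: one target-model pass per generated token."""
    tokens = list(prompt)
    for _ in range(num_new_tokens):
        tokens.append(target_next(tokens))
    return tokens


def speculative_decode(target_next: NextTokenFn, draft_next: NextTokenFn,
                       prompt: List[Token], num_new_tokens: int,
                       num_draft: int) -> Tuple[List[Token], int]:
    """Greedy draft-and-verify loop (illustrative assumption, not a real API).

    Returns the generated sequence and the number of target verification passes.
    """
    tokens = list(prompt)
    goal = len(prompt) + num_new_tokens
    target_passes = 0
    while len(tokens) < goal:
        # 1. Draft: the cheap model proposes num_draft tokens one by one.
        draft: List[Token] = []
        for _ in range(num_draft):
            draft.append(draft_next(tokens + draft))
        # 2. Verify: a real engine scores all draft positions in a single
        #    batched forward pass; we simulate that and count it as one pass.
        target_passes += 1
        verified = [target_next(tokens + draft[:i]) for i in range(num_draft + 1)]
        # 3. Accept the longest prefix on which draft and target agree, then
        #    append the target's own token at the first mismatch (or the extra
        #    token after the last draft position if every draft token matched).
        n_accept = 0
        while n_accept < num_draft and draft[n_accept] == verified[n_accept]:
            n_accept += 1
        tokens.extend(draft[:n_accept])
        tokens.append(verified[n_accept])
    return tokens[:goal], target_passes


def target_next(ctx: List[Token]) -> Token:
    """Toy target model: counts upward modulo 10."""
    return (ctx[-1] + 1) % 10


def draft_next(ctx: List[Token]) -> Token:
    """Toy draft model: agrees with the target except after tokens 1, 5, 9."""
    if ctx[-1] % 4 == 1:
        return (ctx[-1] + 2) % 10  # deliberately wrong guess
    return (ctx[-1] + 1) % 10


spec_tokens, passes = speculative_decode(target_next, draft_next, [0], 12, 3)
base_tokens = autoregressive_decode(target_next, [0], 12)
print(spec_tokens == base_tokens, passes)
```

In this toy setup the speculative output matches the baseline token for token, while the number of target passes is smaller than the number of generated tokens. How large that gap is in practice depends on how often the draft tokens are accepted, which is the task dependence discussed under "Performance Improvements".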
I would like to see which specific tasks this is efficient for, what kinds of models should be used, and examples of draft and target models, if any. I would like to see such recommendations.
As we discussed offline, we do not want to give customers strong recommendations but instead provide them with options. I have updated the tutorial to reflect that discussion.