Releases: pytorch/TensorRT
Torch-TensorRT v2.9.0
PyTorch 2.9, CUDA 13.0, TensorRT 10.13, Python 3.13
Torch-TensorRT 2.9.0 for Linux x86-64 and Windows targets PyTorch 2.9, TensorRT 10.13, CUDA 13.0/12.8/12.6, and Python 3.10 through 3.13.
Python
x86-64 Linux and Windows
- CUDA 13.0 + Python 3.10-3.13 is available via PyPI: https://pypi.org/project/torch-tensorrt/
- CUDA 12.6/12.8/13.0 + Python 3.10-3.13 is also available via the PyTorch index: https://download.pytorch.org/whl/torch-tensorrt
aarch64 SBSA Linux and Jetson Thor
- CUDA 13.0 + Python 3.10–3.13 + Torch 2.9 + TensorRT 10.13 (Python 3.12 is the only version verified for Thor)
- Available via PyPI: https://pypi.org/project/torch-tensorrt/
- Available via PyTorch index: https://download.pytorch.org/whl/torch-tensorrt
NOTE: On aarch64 platforms you must explicitly install the TensorRT wheel or use a system-installed TensorRT:
uv pip install torch torch-tensorrt tensorrt
aarch64 Jetson Orin
- There is no torch_tensorrt 2.9 release for Jetson Orin; please continue using the torch_tensorrt 2.8 release.
C++
x86-64 Linux and Windows
- CUDA 13.0 Tarball / Zip
Deprecations
FX Frontend
The FX frontend was the precursor to the Dynamo frontend, and a number of components were shared between the two. Now that the Dynamo frontend is stable and all shared components have been decoupled, we will no longer ship the FX frontend in binary releases starting in the first half of 2026. The FX frontend will remain in the source tree for the foreseeable future, so users who build from source can re-enable the frontend if necessary.
New Features
LLM and VLM improvements
In this release, we’ve introduced several key enhancements:
- Sliding Window Attention in SDPA Converter: Added support for sliding window attention, enabling successful compilation of the Gemma3 model (Gemma3-1B).
- Dynamic Custom Lowering Passes: Refactored the lowering framework to allow users to dynamically register custom passes based on the configuration of Hugging Face models.
- Vision-Language Model (VLM) Support
  - Added support for Eagle2 and Qwen2.5-VL models via the new run_vlm.py utility.
  - run_vlm.py enables compilation of both the vision and language components of a VLM model. It also supports KV caching for efficient VLM generation.
See the documentation for detailed instructions on running these models.
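As a rough, hedged illustration, an invocation of run_vlm.py might look like the following; the model ID and flags are assumptions made by analogy with the run_llm.py command shown for v2.8.0, so consult the documentation for the authoritative options.

python run_vlm.py --model Qwen/Qwen2.5-VL-3B-Instruct --prompt "Describe the image." --precision FP16 --num_tokens 128 --cache static_v2 --benchmark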
TensorRT-RTX
TensorRT-RTX is a JIT-first version of TensorRT. Whereas standard TensorRT performs tactic selection and fusions during a build phase, TensorRT-RTX allows you to distribute builds prior to specializing for specific hardware, so that one GPU-agnostic package can be distributed to all users of your builds. Then, on first use, TensorRT-RTX tunes for the specific hardware your users are running. Torch-TensorRT-RTX is a build of Torch-TensorRT that uses the TensorRT-RTX compiler stack in place of standard TensorRT. All APIs are identical to Torch-TensorRT; however, some features such as weak typing and compile-time post-training quantization are not supported.
- Added experimental support for Torch-TensorRT-RTX
- You can check out the details on how to build and run here: https://docs.pytorch.org/TensorRT/getting_started/tensorrt_rtx.html
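Because the APIs are identical, a minimal sketch of a Torch-TensorRT-RTX workflow looks the same as a standard Torch-TensorRT compile; the only assumption below is that the installed package is the Torch-TensorRT-RTX build, under which hardware-specific tuning is deferred to the first inference on the target GPU.

import torch
import torch_tensorrt
import torchvision.models as models

model = models.resnet18(pretrained=True).eval().to("cuda")
inputs = [torch.randn((1, 3, 224, 224)).to("cuda")]

# Same API as standard Torch-TensorRT; with a Torch-TensorRT-RTX build the
# TensorRT-RTX compiler stack is used and the engine specializes for the local
# GPU on first use.
trt_rtx_model = torch_tensorrt.compile(model, ir="dynamo", inputs=inputs)
trt_rtx_model(*inputs)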
Improvements
- Closed a number of performance gaps between graphs constructed by Torch-TensorRT and by ONNX-TensorRT
What's Changed
- fix the broken CC0 image link by @lanluo-nvidia in #3635
- upgrade torch_tensorrt version from 2.8.0.dev to 2.9.0.dev by @lanluo-nvidia in #3639
- Temporary fix to workaround the mutable decomposition error. by @lanluo-nvidia in #3636
- Fix dynamo core test failure on Windows by @HolyWu in #3642
- Closed the perf gap of resnet and enabled refit by @cehongwang in #3629
- feat: Refactor LLM model zoo and add KV cache support by @peri044 in #3527
- adding rotary embedding example, with graph rewrite for complex subgraph by @apbose in #3570
- feat: Add bf16 support to cast converter by @peri044 in #3643
- fix: replace add_identity by add_cast for type cast by @junstar92 in #3563
- Refit debug patch by @cehongwang in #3620
- fix compiler cl not found error in windows by @lanluo-nvidia in #3660
- slice scatter support for dynamic cases by @apbose in #3513
- fix the int8 quantization failure error by @lanluo-nvidia in #3663
- chore(deps): bump transformers from 4.48.0 to 4.52.1 in /tests/modules by @dependabot[bot] in #3670
- chore(deps): bump transformers from 4.50.0 to 4.51.0 in /examples/dynamo by @dependabot[bot] in #3669
- chore(deps): bump transformers from 4.49.0 to 4.51.0 in /tests/py by @dependabot[bot] in #3668
- remove tensorrt as build dependency by @lanluo-nvidia in #3681
- disable jetpack build for now by @lanluo-nvidia in #3685
- Fixed the CI problem by @cehongwang in #3680
- fix windows build failure: add /utf-8 by @lanluo-nvidia in #3684
- upgrade tensorrt from 10.11 to 10.12 by @lanluo-nvidia in #3686
- Add Flux fp4 support by @lanluo-nvidia in #3689
- feat: revert linear converter by @zewenli98 in #3703
- Fixed python only runtime bug by @cehongwang in #3701
- Disabled silu decomposition cast by @cehongwang in #3677
- Jetson distributed fix by @apbose in #3716
- Simplify the Group Norm converter by @zewenli98 in #3719
- fix conv1d/deconv1d bug with stride more than 1 by @lanluo-nvidia in #3737
- add test cases for strong typing by @lanluo-nvidia in #3739
- Upgrade perf_run script to support TRT 10 and fix some issues by @zewenli98 in #3650
- Fixed SDPA slow down and linear slow down by @cehongwang in #3700
- remove breakpoint() by @lanluo-nvidia in #3750
- add nvshmem in aarch64 by @lanluo-nvidia in #3769
- chore(deps): bump transformers from 4.51.3 to 4.53.0 in /tools/perf by @dependabot[bot] in #3754
- Cherry pick jetson enablement from 2.8 release branch to main by @lanluo-nvidia in #3765
- Breaking Change: Remove the deprecated int8 calibrator related by @lanluo-nvidia in #3759
- fix the typo by @lanluo-nvidia in #3773
- Removal of BAZEL build files from python package and changes to make cpp tests work by @apbose in #3641
- fix: atan2 strong type support & bug fix for integer dynamic shape by @chohk88 in #3751
- upgrade torchvision from 0.23.0 to 0.24.0 by @lanluo-nvidia in #3772
- chore: update resources in README.md by @peri044 in #3780
- disable python 3.14 in CI by @lanluo-nvidia in #3787
- fix: set example models to eval mode and follow the convention by @zewenli98 in #3770
- fix: prelu perf gap on Unet by @zewenli98 in #3717
- fix: batch norm issue encountered in RAFT by @zewenli98 in #3758
- feat: Add support for Groot N1.5 model by @peri044 in #3736
- skip flashinfer test due to torch upstream change by @lanluo-nvidia in #3794
- Add support for TensorRT-RTX by @lanluo-nvidia in #3753
- add fx deprecation notice + jetpack doc update by @lanluo-nvidia in #3795
- addressing ngc aarch64 error by @apbose in #3705
- fix pybind issue in windows by @lanluo-nvidia in #3801
- llm: register sdpa variant by @lanluo-nvidia in #3802
- fix bazel build //tests/core/runtime:runtime_tests issue by @lanluo-nvidia in #3804
- Simplify Release workflow and Add windows zip in the release artifacts by @lanluo-nvidia in #3800
- change llm model test from gemma3 to qwen to skip auth by @lanluo-nvidia in #3807
- replace allow_complex_guards_as_runtime_asserts with prefer_deferred_ru… by @lanluo-nvidia in #3809
- cherry pick 25.09 skip test to main by @lanluo-nvidia in #3810
- feat: support dynamics for all inputs for embedding_...
Torch-TensorRT v2.8.0
PyTorch 2.8, CUDA 12.8, TensorRT 10.12, Python 3.13
Torch-TensorRT 2.8.0 for standard Linux x86-64 and Windows targets PyTorch 2.8, TensorRT 10.12, CUDA 12.6/12.8/12.9, and Python 3.9 through 3.13.
- Linux x86-64 + Windows
- CUDA 12.8 + Python 3.9-3.13 is available via PyPI: https://pypi.org/project/torch-tensorrt/
- CUDA 12.6/12.8/12.9 + Python 3.9-3.13 is also available via the PyTorch index: https://download.pytorch.org/whl/torch-tensorrt
Platform support
In addition to the standard Windows x86-64 and Linux x86-64 releases, we now provide binary builds for SBSA and Jetson:
SBSA aarch64
- CUDA 12.9 + Python 3.9–3.13 + Torch 2.8 + TensorRT 10.12
- Available via PyPI: https://pypi.org/project/torch-tensorrt/
- Available via PyTorch index: https://download.pytorch.org/whl/torch-tensorrt
Jetson Orin
- CUDA 12.6 + Python 3.10 + Torch 2.8 + TensorRT 10.3.0
- Available at https://pypi.jetson-ai-lab.io/jp6/cu126
Deprecations
- TensorRT implicit quantization support has been deprecated since TensorRT 10.1. Torch-TensorRT APIs related to the INT8Calibrator will be removed in Torch-TensorRT 2.9.0. Quantization users should move to a workflow based on TensorRT-Model-Optimizer Toolkit. See: https://docs.pytorch.org/TensorRT/tutorials/_rendered_examples/dynamo/vgg16_ptq.html for more information
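As a hedged sketch of the Model Optimizer-based replacement workflow (the config name, calibration loop, and model are assumptions for illustration; see the linked tutorial for the supported flow), quantization is performed with modelopt before exporting and compiling with Torch-TensorRT:

import modelopt.torch.quantization as mtq
import torch
import torch_tensorrt

model = MyModel().eval().cuda()  # hypothetical model

def forward_loop(m):
    # run a few representative batches through the model for calibration
    for batch in calibration_dataloader:  # hypothetical dataloader
        m(batch.cuda())

# Insert Q/DQ nodes with Model Optimizer (INT8_DEFAULT_CFG is assumed here)
quantized_model = mtq.quantize(model, mtq.INT8_DEFAULT_CFG, forward_loop=forward_loop)

# Export and compile the quantized graph with Torch-TensorRT
inputs = [torch.randn((1, 3, 224, 224)).cuda()]
exp_program = torch.export.export(quantized_model, tuple(inputs))
trt_model = torch_tensorrt.dynamo.compile(exp_program, inputs=inputs)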
New Features
AOT-Inductor Pythonless Deployment
Stability: Beta
Historically, TorchScript has been used to run Torch-TensorRT programs outside of a Python interpreter. Both the dynamo/torch.compile frontend and the TorchScript frontend supported this TorchScript deployment workflow.
Old
trt_model = torch_tensorrt.compile(model, ir="dynamo", arg_inputs=[...])
ts_model = torch.jit.trace(trt_model, inputs=[...])
ts_model.save("trt_model.ts")
Now you can achieve a similar result using AOT-Inductor. AOTInductor is a specialized version of TorchInductor, designed to process exported PyTorch models, optimize them, and produce shared libraries as well as other relevant artifacts. These compiled artifacts are specifically crafted for deployment in non-Python environments.
Torch-TensorRT can embed TensorRT engines in AOTInductor libraries to accelerate models further. You are also able to combine Inductor kernels with TensorRT engines via this method. This allows users to deploy their models outside of Python using torch-compile native technologies.
New
with torch.no_grad():
    cg_trt_module = torch_tensorrt.compile(model, **compile_settings)
    torch_tensorrt.save(
        cg_trt_module,
        file_path=os.path.join(os.getcwd(), "model.pt2"),
        output_format="aot_inductor",
        retrace=True,
        arg_inputs=example_inputs,
    )
This model.pt2 file can then be loaded in either Python or C++ using Torch APIs.
import torch
import torch_tensorrt
model = torch._inductor.aoti_load_package(os.path.join(os.getcwd(), "model.pt2"))
#include <iostream>
#include <vector>
#include "torch/torch.h"
#include "torch/csrc/inductor/aoti_package/model_package_loader.h"
int main(int argc, const char* argv[]) {
std::string trt_aoti_module_path = "model.pt2";
c10::InferenceMode mode;
torch::inductor::AOTIModelPackageLoader loader(trt_aoti_module_path);
std::vector<torch::Tensor> inputs = {torch::randn({8, 10}, at::kCUDA)};
std::vector<torch::Tensor> outputs = loader.run(inputs);
std::cout << "Result from the first inference:"<< std::endl;
std::cout << outputs << std::endl;
return 0;
}
More information can be found here: https://docs.pytorch.org/TensorRT/user_guide/runtime.html, as well as a code example here: https://github.com/pytorch/TensorRT/blob/release/2.8/examples/torchtrt_aoti_example/inference.cpp
PTX Plugins
Stability: Stable
In Torch-TensorRT 2.7.0 we introduced auto-generated plugins, which allow users to automatically wrap kernels / PyTorch custom operators into TensorRT plugins and run their models without a graph break. In 2.8.0 we extend this system to support PTX-based plugins, which enable users to serialize and run their TensorRT engines without requiring any PyTorch / Triton / Python in the runtime or access to the original kernel implementation. This approach also has the added benefit of lower overhead than the auto-generated plugin system, helping achieve maximum performance.
An example showing how to register a custom operator, generate the necessary plugin, and integrate it into the TensorRT execution graph can be found here: https://github.com/pytorch/TensorRT/blob/main/examples/dynamo/aot_plugin.py
Hierarchical Multi-backend Adjacency Partitioner
Stability: Experimental
The Hierarchical Multi-backend Adjacency Partitioner enables sophisticated model partitioning strategies for distributing PyTorch models across multiple backends based on operator support and priority ordering. A prototype partitioner has been added to the package which allows graphs to be split across multiple backends (e.g., TensorRT, PyTorch Inductor) based on operator capabilities. Given a backend preference order, operators are assigned to the highest-priority backend that supports them.
Please refer to the example for usage.
Model Optimizer-Based NVFP4 Quantization (PTQ) Support for Linux
Stability: Stable
Introducing NVFP4 for efficient and accurate low-precision inference on the Blackwell GPU architecture.
Currently, the workflow supports quantizing models from FP16 → NVFP4.
Directly quantizing from FP32 → NVFP4 is not recommended as it may lead to accuracy degradation. Instead, first convert or train the model in FP16, then quantize to NVFP4.
Full example:
https://github.com/pytorch/TensorRT/blob/release/2.8/examples/apps/flux_demo.py
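A minimal sketch of the recommended FP16 → NVFP4 order of operations, assuming the Model Optimizer toolkit performs the quantization (the NVFP4 config name and calibration data are assumptions; the Flux demo linked above is the supported end-to-end example):

import modelopt.torch.quantization as mtq
import torch

# Convert (or train) the model in FP16 first, then quantize to NVFP4
model = MyModel().eval().half().cuda()  # hypothetical model

def forward_loop(m):
    for batch in calibration_dataloader:  # hypothetical calibration data
        m(batch.half().cuda())

nvfp4_model = mtq.quantize(model, mtq.NVFP4_DEFAULT_CFG, forward_loop=forward_loop)  # config name assumed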
run_llm and KV Caching
Stability: Beta
We’ve introduced a KV caching implementation for Torch-TensorRT using native TensorRT operations, yielding significant improvements in inference performance for autoregressive large language models (LLMs). KV caching is a crucial optimization that reduces latency by reusing attention activations across decoding steps. In our approach, the KV cache is modeled as fixed-size tensor inputs and outputs, with outputs from each decoding step looped back as inputs to update the cache incrementally. This update is performed using TensorRT-supported operations such as slice, concat, and pad. The design allows step-wise cache updates while preserving compatibility with TensorRT’s optimization workflow and engine serialization.
We’ve also introduced a new utility, run_llm.py, to run inference on popular LLMs with KV caching enabled.
To run a Qwen3 model using KV caching with Torch-TensorRT, use the following command:
python run_llm.py --model Qwen/Qwen3-8B --prompt "What is parallel programming?" --precision FP16 --num_tokens 128 --cache static_v2 --benchmark
Please refer to Compiling LLM models from Huggingface for more details and limitations.
Debugger
We introduced a new debugger to improve usability and the debugging experience for Torch-TensorRT. The debugger centralizes all debugging settings, such as the logging level (from critical to info) and engine profiling. We also introduced FX graph visualization in the debugger, letting you specify the lowering pass before or after which the graph should be drawn. Moreover, the debugger can produce engine profiling and layer information compatible with TREX, an engine visualization tool developed by the TensorRT team, which better explains the engine structure.
Model Zoo
We have expanded support to include several popular models from the Qwen3 and Llama3 series. In this release, we’ve also addressed various performance and accuracy issues to improve overall stability. For a complete list of supported models, please refer to the Supported Models section.
Bug Fixes
Refit
Refit has been re-enabled for Python 3.13 after being disabled in 2.7.0.
- Reduced memory overhead by offloading the model to CPU
Performance improvements
- Linear converter was reverted to the earlier implementation because it shows perf improvements in fp16 on some models (e.g., BERT)
- Group Norm converter was simplified to reduce unnecessary TensorRT ILayers
- The constants in the BatchNorm converter are now folded at compile time, leading to significant performance improvements.
- SDPA op decomposition is optimized, resulting in the same or better performance than ONNX-TensorRT for transformer-based diffusion models such as Stable Diffusion 3/WAN2.1/FLUX
What's Changed
- chore: bump torch to 2.8.0.dev by @zewenli98 in #3449
- Nccl ops correction changes by @apbose in #3387
- fix: Change the translational layer from numpy to torch during conversion to handle additional data types by @peri044 in #3445
- Fix grid_sample by @HolyWu in #3340
- fix: Destory cuda graphs before setting weight streaming by @keehyuna in #3461
- tool: uv setting to avoid the pip install -e by @narendasan in #3468
- chore: reenable py313 by @zewenli98 in #3455
- bf16 support for elementwise operation by @apbose in https://github.com/pytorch/TensorRT/pull/...
Torch-TensorRT v2.6.1
What's Changed
- remove breakpoint by @lanluo-nvidia in #3540
- fix the build issue for patch2.6.1 by @lanluo-nvidia in #3542
- update version to 2.6.1 by @lanluo-nvidia in #3545
- cherry pick 3505(windows driver upgrade) to release2.6.1 by @lanluo-nvidia in #3547
Full Changelog: v2.6.0...v2.6.1
Torch-TensorRT v2.7.0
PyTorch 2.7, CUDA 12.8, TensorRT 10.9, Python 3.13
Torch-TensorRT 2.7.0 targets PyTorch 2.7, TensorRT 10.9, and CUDA 12.8 (builds for CUDA 11.8/12.4 are available via the PyTorch package index - https://download.pytorch.org/whl/cu118, https://download.pytorch.org/whl/cu124). Python versions from 3.9 to 3.13 are supported. We no longer provide builds for the pre-cxx11 ABI; all wheels and tarballs will use the cxx11 ABI.
Known Issues
- Engine refitting is disabled in Python 3.13.
Using Self Defined Kernels in TensorRT Engines using Automatic Plugin Generation
Users may develop their own custom kernels using DSLs such as OpenAI Triton. Through the use of PyTorch Custom Ops and Torch-TensorRT Automatic Plugin Generation, these kernels can be called within the TensorRT engine with minimal extra code required.
@triton.jit
def elementwise_scale_mul_kernel(X, Y, Z, a, b, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(0)
    # Compute the range of elements that this thread block will work on
    block_start = pid * BLOCK_SIZE
    # Range of indices this thread will handle
    offsets = block_start + tl.arange(0, BLOCK_SIZE)
    # Load elements from the X and Y tensors
    x_vals = tl.load(X + offsets)
    y_vals = tl.load(Y + offsets)
    # Perform the element-wise multiplication
    z_vals = x_vals * y_vals * a + b
    # Store the result in Z
    tl.store(Z + offsets, z_vals)

@torch.library.custom_op("torchtrt_ex::elementwise_scale_mul", mutates_args=())  # type: ignore[misc]
def elementwise_scale_mul(
    X: torch.Tensor, Y: torch.Tensor, b: float = 0.2, a: int = 2
) -> torch.Tensor:
    # Ensure the tensors are on the GPU
    assert X.is_cuda and Y.is_cuda, "Tensors must be on CUDA device."
    assert X.shape == Y.shape, "Tensors must have the same shape."
    # Create output tensor
    Z = torch.empty_like(X)
    # Define block size
    BLOCK_SIZE = 1024
    # Grid of programs
    grid = lambda meta: (X.numel() // meta["BLOCK_SIZE"],)
    # Launch the kernel with parameters a and b
    elementwise_scale_mul_kernel[grid](X, Y, Z, a, b, BLOCK_SIZE=BLOCK_SIZE)
    return Z

@torch.library.register_fake("torchtrt_ex::elementwise_scale_mul")
def _(x: torch.Tensor, y: torch.Tensor, b: float = 0.2, a: int = 2) -> torch.Tensor:
    return x

torch_tensorrt.dynamo.conversion.plugins.custom_op("torchtrt_ex::elementwise_scale_mul", supports_dynamic_shapes=True, requires_output_allocator=False)
trt_mod_w_kernel = torch_tensorrt.compile(module, ...)
torch_tensorrt.dynamo.conversion.plugins.custom_op will generate a TensorRT plugin using the Quick Deploy Plugin (QDP) system and PyTorch's FakeTensor mode, reusing the information required to register a Torch custom op for use with TorchDynamo. It will also generate the Torch-TensorRT converter that inserts the plugin into the TensorRT engine.
QDP plugins for Torch custom ops and converters for QDP plugins can be generated individually using:
torch_tensorrt.dynamo.conversion.plugins.generate_plugin(
"torchtrt_ex::elementwise_scale_mul"
)
torch_tensorrt.dynamo.conversion.plugins.generate_plugin_converter(
"torchtrt_ex::elementwise_scale_mul",
supports_dynamic_shapes=True,
requires_output_allocator=False,
)
MutableTorchTensorRTModule improvements
MutableTorchTensorRTModule automatically recompiles if the engine becomes invalid. Previously, engines assumed static shapes, which meant that if a user provided a differently sized input, the graph would recompile or pull from the engine cache. Now developers are able to provide shape hints to the MutableTorchTensorRTModule, which allows the module to handle a broader range of inputs without recompiling. For example:
pipe.unet = torch_tensorrt.MutableTorchTensorRTModule(pipe.unet, **settings)
BATCH = torch.export.Dim("BATCH", min=2, max=24)
_HEIGHT = torch.export.Dim("_HEIGHT", min=16, max=32)
_WIDTH = torch.export.Dim("_WIDTH", min=16, max=32)
HEIGHT = 4 * _HEIGHT
WIDTH = 4 * _WIDTH
args_dynamic_shapes = ({0: BATCH, 2: HEIGHT, 3: WIDTH}, {})
kwargs_dynamic_shapes = {
"encoder_hidden_states": {0: BATCH},
"added_cond_kwargs": {
"text_embeds": {0: BATCH},
"time_ids": {0: BATCH},
},
"return_dict": None,
}
pipe.unet.set_expected_dynamic_shape_range(
args_dynamic_shapes, kwargs_dynamic_shapes
)
Data Dependent Shape support
For networks that produce outputs whose shapes are dependent on the shape of the input, the output buffer must be allocated at runtime. To support this use case we have added a new runtime mode Dynamic Output Allocation Mode to support Data Dependent Shape (DDS) operations, such as NonZero op. (#3388)
Note:
- Dynamic output allocation mode cannot be used in conjunction with CUDA Graphs or the pre-allocated outputs feature.
- Without dynamic output allocation, the output buffer is allocated based on the output shape inferred from the input size.
There are two scenarios in which dynamic output allocation is enabled:
- The model has been identified at compile time to require dynamic output allocation for at least one TensorRT subgraph. These models will engage the runtime mode automatically (with logging) and are incompatible with other runtime modes such as CUDA Graphs. Converters can declare that the subgraphs they produce will require the output allocator using requires_output_allocator=True, thereby forcing any model which utilizes the converter to automatically use the output allocator runtime mode. e.g.,
@dynamo_tensorrt_converter(
    torch.ops.aten.nonzero.default,
    supports_dynamic_shapes=True,
    requires_output_allocator=True,
)
def aten_ops_nonzero(
    ctx: ConversionContext,
    target: Target,
    args: Tuple[Argument, ...],
    kwargs: Dict[str, Argument],
    name: str,
) -> Union[TRTTensor, Sequence[TRTTensor]]:
    ...
- Users may manually enable dynamic output allocation mode via the torch_tensorrt.runtime.enable_output_allocator context manager.
# Enables Dynamic Output Allocation Mode, then resets the mode to its prior setting
with torch_tensorrt.runtime.enable_output_allocator(trt_module):
    ...
Tiling Optimization support
Tiling optimization enables cross-kernel tiled inference. This technique leverages on-chip caching for continuous kernels in addition to kernel-level tiling. It can significantly enhance performance on platforms constrained by memory bandwidth. (#3444)
We currently support four tiling strategies: "none", "fast", "moderate", and "full". A higher level allows TensorRT to spend more time searching for a better tiling strategy. Here's an example of enabling tiling optimization:
compiled_model = torch_tensorrt.compile(
model,
ir="dynamo",
inputs=inputs,
tiling_optimization_level="full",
l2_limit_for_tiling=10,
)
Model Zoo additions
- Added support for compiling the FLUX.1-dev 12B model in our model zoo. An example is available here. Quantized variants of FLUX are under development as part of future work.
General Improvements
- Improved BF16 support in model compilation by fixing bugs and adding new tests to cover both full-graph and graph-break scenarios.
- Significantly accelerated model compilation time (#3396)
Python 3.13 support
We added support for Python 3.13 (#3455). However, due to a Python object reference issue in PyTorch 2.7, we have disabled the refit-related features for Python 3.13 in this release. This issue should be fixed in the next release.
What's Changed
- Fix usage example by @ohadravid in #3337
- Bump TRT version to 10.7 by @zewenli98 in #3313
- using nccl ops from TRT-LLM namespace by @apbose in #3250
- feat: Trigger Actions to run multiple TRT versions weekly by @zewenli98 in #3346
- fix: torch 2.7 bump bug on the main branch by @zewenli98 in #3353
- fix: remove legacy conv converter by @chohk88 in #3343
- chore: flip use_cxx11_abi naming by @zewenli98 in #3361
- chore: address flaky test failures related to global partitioning by @peri044 in #3369
- fix(aten::instance_norm): Handle optional inputs in instance norm con… by @narendasan in #3367
- chore: moving away from tensorrt_bindings by @narendasan in #3365
- Use IUnsqueezeLayer in unsqueeze impl by @HolyWu in #3366
- Deprecate torchscript frontend by @narendasan in #3373
- feat: Add FLUX-1.dev model to the model zoo by @peri044 in #3382
- Accelerate network interpretation by 15x; fixed redundant code in TRT Interpreter by @cehongwang in #3396
- chore(deps): bump transformers from 4.40.2 to 4.48.0 in /tests/modules by @dependabot i...
Torch-TensorRT v2.6.0
PyTorch 2.6, CUDA 12.6 TensorRT 10.7, Python 3.12
Torch-TensorRT 2.6.0 targets PyTorch 2.6, TensorRT 10.7, and CUDA 12.6, (builds for CUDA 11.8/12.4 are available via the PyTorch package index - https://download.pytorch.org/whl/cu118 https://download.pytorch.org/whl/cu124). Python versions from 3.9-3.12 are supported. We do not support 3.13 in this release due to TensorRT not supporting that version of Python at this time.
Deprecation notice
The torchscript frontend will be deprecated in v2.6. Specifically, the following usage will no longer be supported and will issue a deprecation warning at runtime if used:
torch_tensorrt.compile(model, ir="torchscript")
Moving forward, we encourage users to transition to one of the supported options:
torch_tensorrt.compile(model)
torch_tensorrt.compile(model, ir="dynamo")
torch.compile(model, backend="tensorrt")
TorchScript will continue to be supported as a deployment format via post-compilation tracing:
dynamo_model = torch_tensorrt.compile(model, ir="dynamo", arg_inputs=[...])
ts_model = torch.jit.trace(dynamo_model, inputs=[...])
ts_model(...)
Please refer to the README for more information regarding our deprecation policy.
Cross-OS Compilation
In Torch-TensorRT 2.6 it is now possible to use a Linux host to compile Torch-TensorRT programs for Windows using the torch_tensorrt.cross_compile_for_windows API. These programs use a slightly different serialization format to facilitate this workflow and cannot be run on Linux. Therefore, when calling torch_tensorrt.cross_compile_for_windows, expect the program to be saved directly to disk. Developers should then use torch_tensorrt.load_cross_compiled_exported_program on the Windows target to load the serialized program. Torch-TensorRT programs now include target platform information to verify OS compatibility on deserialization. This in turn has caused an ABI bump for the runtime.
if load:
    # load the saved model in Windows
    if platform.system() != "Windows" or platform.machine() != "AMD64":
        raise ValueError(
            "cross runtime compiled model for windows can only be loaded in Windows system"
        )
    loaded_model = torchtrt.load_cross_compiled_exported_program(save_path).module()
    print(f"model has been successfully loaded from ${save_path}")
    # inference
    trt_output = loaded_model(input)
    print(f"inference result: {trt_output}")
else:
    if platform.system() != "Linux" or platform.architecture()[0] != "64bit":
        raise ValueError(
            "cross runtime compiled model for windows can only be compiled in Linux system"
        )
    compile_spec = {
        "debug": True,
        "min_block_size": 1,
    }
    torchtrt.cross_compile_for_windows(
        model, file_path=save_path, inputs=inputs, **compile_spec
    )
    print(
        f"model has been successfully cross compiled and saved in Linux to {args.path}"
    )
Runtime Weight Streaming
Weight Streaming in Torch-TensorRT is a memory optimization technique that helps deploy large models on memory-constrained devices by dynamically loading weights as needed during inference, reducing the overall memory footprint and enabling more efficient use of hardware resources. It is an opt-in feature that needs to be enabled at both build time and runtime.
trt_model = torch_tensorrt.dynamo.compile(
model,
inputs=input_tensors,
enabled_precisions={torch.float32}, # only float32 precision is allowed for strongly typed network
use_explicit_typing=True, # create a strongly typed network
enable_weight_streaming=True, # enable weight streaming
)
Control the weight streaming budget at runtime using the weight streaming context manager
with torch_tensorrt.runtime.weight_streaming(trt_model) as weight_streaming_ctx:
    # Get the total size of streamable weights in the engine
    streamable_budget = weight_streaming_ctx.total_device_budget
    # Set 50% weight streaming budget
    requested_budget = int(streamable_budget * 0.5)
    weight_streaming_ctx.device_budget = requested_budget
    trt_model(inputs)
Inter-Block CUDAGraphs
We updated CUDAGraphs API to support Inter-Block CUDAGraphs. When a compiled Torch-TensorRT module has graph breaks, previously, only TensorRT blocks could be run with CUDAGraph's optimized kernel launch. With Torch-TensorRT 2.6 the entire graph can be captured and executed in a unified CUDAGraph to minimize kernel launch overhead.
# Previous API
with torch_tensorrt.runtime.enable_cudagraphs():
    torchtrt_model(inputs)
# New API
with torch_tensorrt.runtime.enable_cudagraphs(torchtrt_model) as cudagraphs_model:
    cudagraphs_model(input)
Improvements to Engine Caching
First, there are some API changes:
- make_refittable was renamed to immutable_weights in preparation for a future release that will compile engines with the refit feature enabled by default, allowing the Torch-TensorRT engine cache to provide maximum benefit.
- refit_identical_engine_weights was added to specify whether to refit the engine with identical weights.
- strip_engine_weights was added to specify whether to strip the engine weights.
- The default disk size for engine caching was expanded to 5GB.
In addition, one of the capabilities of engine caching is to recognize whether two graphs are isomorphic. If a new graph is isomorphic to any previously compiled TensorRT engine, the engine cache will reuse that engine instead of recompiling the graph, thereby avoiding recompilation time. In the previous release, we used FxGraphCachePickler.get_hash(new_gm) from PyTorch to calculate hash values, which took up a large portion of the total compile time. In this release, we designed a new hash function that computes hash values quickly and then determines isomorphism, giving a ~4x speedup.
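Putting the renamed and new options together, a compile call that takes full advantage of the engine cache might look like the following sketch (the particular combination of settings is an illustrative assumption):

trt_gm = torch_tensorrt.dynamo.compile(
    exp_program,
    tuple(inputs),
    immutable_weights=False,                # replaces the old make_refittable flag
    refit_identical_engine_weights=False,   # refit cached engines with the new weights
    strip_engine_weights=False,             # keep weights inside the cached engine
    cache_built_engines=True,
    reuse_cached_engines=True,
)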
C++11 ABI Changes
To keep pace with PyTorch, as of release 2.6 we switched docker images from manylinux to manylinux2_28. In Torch/Torch-TensorRT 2.6, PRE_CXX11_ABI is used for CUDA 11.8 and 12.4, while CXX11_ABI is used for CUDA 12.6. For Torch/Torch-TensorRT 2.7, CXX11_ABI will be used for all of CUDA 11.8, 12.4, and 12.6.
Explicit Typing
We introduce a new compilation setting, use_explicit_typing, to enable mixed precision inference with Torch-TensorRT. When this flag is enabled, TensorRT operates in strong typing mode, ensuring that layer data types are preserved during compilation. For a detailed demonstration of this behavior, refer to the provided tutorial. To learn more about strong typing in TensorRT, refer to the relevant section of the TensorRT Developer Guide.
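As a minimal sketch (assuming a hypothetical model whose classifier has already been cast to FP16 by the author), enabling use_explicit_typing preserves the authored layer dtypes rather than letting TensorRT re-select precisions:

import torch
import torch_tensorrt

model = MyModel().eval().cuda()             # hypothetical model
model.classifier = model.classifier.half()  # author selects FP16 for part of the network

trt_model = torch_tensorrt.compile(
    model,
    ir="dynamo",
    inputs=[torch.randn((1, 3, 224, 224)).cuda()],
    use_explicit_typing=True,  # strong typing: the layer dtypes above are preserved
)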
Model Zoo
- We have added Segment Anything Model 2 compilation using Torch-TensorRT (SAM2) to our model zoo. The example can be found here
- We have also added a torch.compile example for GPT2 using the tensorrt backend. This example demonstrates the use of the Hugging Face generate API for auto-regressive decoding; a minimal sketch follows below. For the export-based workflow (ir="dynamo"), we provide a custom generate function to handle output decoding.
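In the sketch below, the model ID, the pattern of compiling the bound forward, and the generation arguments are illustrative assumptions; refer to the model zoo example for the full version.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval().cuda()

# Compile only the forward pass with the Torch-TensorRT backend; generate()
# then drives auto-regressive decoding through the compiled forward.
model.forward = torch.compile(model.forward, backend="tensorrt")

input_ids = tokenizer("What is parallel programming?", return_tensors="pt").input_ids.cuda()
output_ids = model.generate(input_ids, max_new_tokens=32)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))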
Multi-GPU Improvements
There are experimental improvements to multi-gpu workflows, including pulling NCCL operations into TensorRT subgraphs automatically. These should be considered alpha stability. More information can be found here: https://github.com/pytorch/TensorRT/tree/main/examples/distributed_inference
What's Changed
- upgrade modelopt by @lanluo-nvidia in #3160
- feat: exclude refit sensitive ops from TRT compilation by @peri044 in #3159
- tool: Adding support for the uv system by @narendasan in #3125
- upgrade torch from 2.5.0.dev to 2.6.0.dev in main branch by @lanluo-nvidia in #3165
- fix: Fix static arange export by @peri044 in #3194
- docs: A tutorial on how to overload converters in Torch-TensorRT by @narendasan in #3197
- Adjust cpp torch trt logging level with compiler option by @keehyuna in #3181
- extend the timeout-minutes in build/test from 60 min to 120 min by @lanluo-nvidia in #3203
- extend windows build from 60 min to 120 min by @lanluo-nvidia in #3218
- fix the global partitioner bug by @lanluo-nvidia in #3195
- feat: Implement FP32 accumulation for matmul by @peri044 in #3110
- chore: Make substitute-runner in Windows CI work again by @HolyWu in #3225
- Run test_base_fp8 for compute capability 8.9 or later by @HolyWu in #3164
- Fixed batchnorm bug by @cehongwang in #3170
- Fix for warning as default stream was used in enqueueV3 by @keehyuna in #3191
- chore: doc updates by @peri044 in #3238
- chore: Additional Doc fixes by @peri044 in #32...
Torch-TensorRT v2.5.0
PyTorch 2.5, CUDA 12.4, TensorRT 10.3, Python 3.12
Torch-TensorRT 2.5.0 targets PyTorch 2.5, TensorRT 10.3 and CUDA 12.4.
(builds for CUDA 11.8/12.1 are available via the PyTorch package index - https://download.pytorch.org/whl/cu118 https://download.pytorch.org/whl/cu121)
Deprecation notice
The torchscript frontend will be deprecated in v2.6. Specifically, the following usage will no longer be supported and will issue a deprecation warning at runtime if used:
torch_tensorrt.compile(model, ir="torchscript")
Moving forward, we encourage users to transition to one of the supported options:
torch_tensorrt.compile(model)
torch_tensorrt.compile(model, ir="dynamo")
torch.compile(model, backend="tensorrt")
TorchScript will continue to be supported as a deployment format via post-compilation tracing:
dynamo_model = torch_tensorrt.compile(model, ir="dynamo", arg_inputs=[...])
ts_model = torch.jit.trace(dynamo_model, inputs=[...])
ts_model(...)
Please refer to the README for more information regarding our deprecation policy.
Refit (Beta)
v2.5.0 introduces direct model refitting from PyTorch for your compiled Torch-TensorRT programs. Sometimes the weights need to change over the course of inference, and in the past a full recompilation was necessary to change out the weights of the model, either automatically through torch.compile or manually with torch_tensorrt.compile. Now, using the refit_module_weights API, compiled modules can be refitted by providing a new PyTorch module (with identical structure) containing the new weights. Compiled modules must be compiled with make_refittable to use this feature.
# Create and export the updated model
model2 = models.resnet18(pretrained=True).eval().to("cuda")
exp_program2 = torch.export.export(model2, tuple(inputs))
compiled_trt_ep = torch_trt.load("./compiled.ep")
# This returns a new module with updated weights
new_trt_gm = refit_module_weights(
compiled_module=compiled_trt_ep,
new_weight_module=exp_program2,
)
There are some ops that are not compatible with refit, such as ops that utilize the ILoop layer. When make_refittable is enabled, these ops will be forced to run in PyTorch. It should also be noted that refit-enabled engines may be slightly less performant than non-refittable engines, as TensorRT cannot tune for the specific weights it will see at execution time.
Refit Caching (Experimental)
Refitting on its own can help speed up model swap times by 0.5-2x. However, the speed of refit can be further improved by utilizing refit caching. Refit caching at compile time stores hints for a direct mapping from PyTorch module members to TRT layer names in the metadata of TorchTensorRTModule. This caching can speed up refit by orders of magnitude; however, it currently has limitations when dealing with layers that have compile-time optimizations. This feature is still experimental, as there may be some ops that are not amenable to refit caching. We still enable the cache by default when refitting to collect feedback on edge cases, and we provide an output validator which can be used to ensure that the refit occurred properly. When verify_outputs is True, if the refit failed the refitter will discard the cache and refit from scratch.
new_trt_gm = refit_module_weights(
compiled_module=compiled_trt_ep,
new_weight_module=exp_program2,
arg_inputs=inputs,
verify_outputs=True,
)
MutableTorchTensorRTModule (Experimental)
torch.compile is incredibly useful when trying to optimize models that may change over time, since it can automatically recompile the module when something changes. However, the major limitation of torch.compile is that it cannot be serialized. For users who are looking for similar flexibility but with the added ability to serialize and move their work, we have introduced the MutableTorchTensorRTModule. This module wraps a PyTorch module and exposes its members transparently; however, it injects listeners on setattr and overrides the forward function to use TensorRT-accelerated subgraphs. This means you can make changes to your module, such as applying adapters, and the MutableTorchTensorRTModule will detect the change and mark the function for refit or recompilation based on the change. Similar to torch.compile, this is done in a JIT manner, so the first inference after a change will perform the refit or recompile operation.
from diffusers import DiffusionPipeline
with torch.no_grad():
    settings = {
        "use_python_runtime": True,
        "enabled_precisions": {torch.float16},
        "debug": True,
        "make_refittable": True,
    }
    model_id = "runwayml/stable-diffusion-v1-5"
    device = "cuda:0"
    prompt = "house in forest, shuimobysim, wuchangshuo, best quality"
    negative = "(worst quality:2), (low quality:2), (normal quality:2), lowres, normal quality, out of focus, cloudy, (watermark:2),"
    pipe = DiffusionPipeline.from_pretrained(
        model_id, revision="fp16", torch_dtype=torch.float16
    )
    pipe.to(device)
    # The only extra line you need
    pipe.unet = torch_trt.MutableTorchTensorRTModule(pipe.unet, **settings)
    image = pipe(prompt, negative_prompt=negative, num_inference_steps=30).images[0]
    image.save("./without_LoRA_mutable.jpg")
    # Standard Huggingface LoRA loading procedure
    pipe.load_lora_weights(
        "stablediffusionapi/load_lora_embeddings",
        weight_name="moxin.safetensors",
        adapter_name="lora1",
    )
    pipe.set_adapters(["lora1"], adapter_weights=[1])
    pipe.fuse_lora()
    pipe.unload_lora_weights()
    # Refit triggered
    image = pipe(prompt, negative_prompt=negative, num_inference_steps=30).images[0]
    image.save("./with_LoRA_mutable.jpg")
Engine Caching
In some scenarios, users may compile a module multiple times, and each time it takes a long time to build a TensorRT engine in the backend. Engine caching boosts performance by reusing previously compiled TensorRT engines rather than recompiling every time, thereby avoiding recompilation time. When a cached engine is loaded, it will be refitted with the new module weights.
To make this more effective, as long as two graph modules have the same structure, we still consider them the same even though their weights are not, i.e., isomorphic graph modules. Isomorphic graph modules with the same compilation settings will share cached engines.
We implemented DiskEngineCache so that users can directly use its APIs to control how and where to save/load cached engines on the local machine's disk. For example:
trt_gm = torch_trt.dynamo.compile(
exp_program,
tuple(inputs),
make_refitable=True,
cache_built_engines=True,
reuse_cached_engines=True,
engine_cache_dir="/tmp/torch_trt_engine_cache",
engine_cache_size=1 << 30, # 1GB
)
In addition, considering that some users want to save engines to or load them from other servers, clusters, or the cloud, we also provide a base class, BaseEngineCache, so that users can easily implement their own logic to save and load engines. For example:
class MyEngineCache(BaseEngineCache):
    def __init__(
        self,
        addr: str,
    ) -> None:
        self.addr = addr

    def save(
        self,
        hash: str,
        blob: bytes,
        prefix: str = "blob",
    ):
        # user's customized function to save engines
        write_to(self.addr, name=f"{prefix}_{hash}.bin", content=blob)

    def load(self, hash: str, prefix: str = "blob") -> Optional[bytes]:
        # user's customized function to load engines
        return read_from(self.addr, name=f"{prefix}_{hash}.bin")
trt_gm = torch_trt.dynamo.compile(
exp_program,
tuple(inputs),
make_refitable=True,
cache_built_engines=True,
reuse_cached_engines=True,
custom_engine_cache=MyEngineCache("xxxxx"),
)
CUDA Graphs
In v2.5.0, CUDA Graph support for in-engine kernel launch optimization has been added through a new runtime mode. This mode can be activated from Python using:
import torch_tensorrt
my_torchtrt_model = torch_tensorrt.compile(...)
with torch_tensorrt.runtime.enable_cudagraphs():
    my_torchtrt_model(inputs)
This mode works by creating CUDAGraphs around individual TensorRT engines, which improves their efficiency. It creates the graph through a capture phase that is tied to the input shape of the engine. When the input shape changes, the graph is invalidated and automatically recaptured.
Model Optimizer-based Int8 Quantization (PTQ) support for Linux
This version introduces official support for the int8 Quantization via modelopt (https://github.com/NVIDIA/TensorRT-Model-Optimizer) 17.0 for Linux.
Full examples can be found at https://github.com/pytorch/TensorRT/blob/main/examples/dynamo/vgg16_ptq.py
running the vgg16 example for int8 ptq
step1: generate checkpoint file for vgg16:
cd examples/int8/training/vgg16
python main.py --lr 0.01 --batch-size 128 --drop-ratio 0.15 \
--ckpt-dir $(pwd)/vgg16_ckpts --epochs 20 --seed 545
this should produce a ckpt file at examples/int8/training/vgg16/vgg16_ckpts/ckpt_epoch20.pth
step2: run int8 ptq for vgg16:
python examples/dynamo/vgg16_fp8_ptq.py --batch-size 128 \
--ckpt=examples/int8/training/vgg16/vgg16_ckpts/ckpt_epoch20.pth \
--quantize-type=int8
LLM examples
We now offer dynamic shape support for all converters (covering core ATen operations). Dynamic shapes are widely utilized in leading LLM models, where input sequence lengths may vary. With this release, we showcase full graph compilation for Ll...
Torch-TensorRT v2.4.0
C++ runtime support on Windows, Enhanced Dynamic Shape support in Converters, PyTorch 2.4, CUDA 12.4, TensorRT 10.1, Python 3.12
Torch-TensorRT 2.4.0 targets PyTorch 2.4, CUDA 12.4 (builds for CUDA 11.8/12.1 are available via the PyTorch package index - https://download.pytorch.org/whl/cu118 https://download.pytorch.org/whl/cu121) and TensorRT 10.1.
This version introduces official support for the C++ runtime on the Windows platform, though it is limited to the dynamo frontend, supporting both AOT and JIT workflows. Users can now utilize both the Python and C++ runtimes on Windows. Additionally, this release expands support to include all ATen Core Operators except torch.nonzero, and significantly increases dynamic shape support across more converters. Python 3.12 is supported for the first time in this release.
Full Windows Support
In this release we introduce both C++ and Python runtime support on Windows. Users can now directly optimize PyTorch models with TensorRT on Windows with no code changes. The C++ runtime is the default option, and users can enable the Python runtime by specifying use_python_runtime=True
import torch
import torch_tensorrt
import torchvision.models as models
model = models.resnet18(pretrained=True).eval().to("cuda")
input = torch.randn((1, 3, 224, 224)).to("cuda")
trt_mod = torch_tensorrt.compile(model, ir="dynamo", inputs=[input])
trt_mod(input)
Enhanced Op support in Converters
Converter support now covers nearly 100% of core ATen. At this point, fallback to PyTorch execution is due either to specific limitations of converters or to some combination of user compiler settings (e.g. torch_executed_ops, dynamic shape). This release also expands the number of operators that support dynamic shape. dryrun will provide specific information on your model + settings support.
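For example, a dry run can be requested directly through the compile API; this is a sketch assuming dryrun is passed as a compilation setting alongside the usual arguments.

import torch
import torch_tensorrt
import torchvision.models as models

model = models.resnet18(pretrained=True).eval().to("cuda")
inputs = [torch.randn((1, 3, 224, 224)).to("cuda")]

# Reports which ops would run in TensorRT vs. PyTorch for this model + settings
# without building engines.
torch_tensorrt.compile(model, ir="dynamo", inputs=inputs, dryrun=True)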
What's Changed
- fix: FakeTensors appearing in get_attr calls by @gs-olive in #2669
- feat: support adaptive_avg_pool1d dynamo converter by @zewenli98 in #2614
- fix: Add cmake missing source file ref for core_lowering.passes by @Arktische in #2672
- ci: Torch nightly version upgrade to 2.4.0 by @gs-olive in #2704
- Add support for aten.pixel_unshuffle dynamo converter by @HolyWu in #2696
- feat: support aten.atan2 converter by @chohk88 in #2689
- feat: support aten.index_select converter by @chohk88 in #2710
- feat: support aten.isnan converter by @chohk88 in #2711
- feat: support adaptive avg pool 2d and 3d dynamo converters by @zewenli98 in #2632
- feat: support aten.expm1 converter by @chohk88 in #2714
- fix: Add dependencies to Docker container for apt versioning TRT by @gs-olive in #2746
- fix: Missing parameters in compiler settings by @gs-olive in #2749
- fix: param bug in test_binary_ops_aten by @zewenli98 in #2733
- aten::empty_like by @apbose in #2654
- empty_permute decomposition by @apbose in #2698
- Removing grid lowering by @apbose in #2686
- Selectively enable different frontends by @narendasan in #2693
- chore(deps): bump transformers from 4.33.2 to 4.36.0 in /tools/perf by @dependabot in #2555
- Fix upsample converter not properly registered by @HolyWu in #2683
- feat: TS Add converter support for aten::grid_sampler by @mfeliz-cruise in #2717
- fix: Bump torchvision version by @gs-olive in #2770
- fix: convert_module_to_trt_engine by @zewenli98 in #2728
- chore: cherry pick of save API by @peri044 in #2719
- chore: Upgrade TensorRT version to TRT 10 EA (#2699) by @peri044 in #2774
- Fix minor grammatical corrections by @aakashapoorv in #2779
- feat: cherry-pick of Implement symbolic shape propagation, sym_size converter by @peri044 in #2751
- feat: cherry-pick of torch.compile dynamic shapes by @peri044 in #2750
- chore: bump deps for default workspace file by @narendasan in #2786
- fix: Point infra branch to main by @gs-olive in #2785
- "empty_like" decomposition test correction by @apbose in #2784
- chore: Bump versions by @narendasan in #2787
- fix: refactor layer norm converter with INormalization Layer by @zewenli98 in #2755
- TRT-10 GA Support for main branch by @zewenli98 in #2781
- chore(//tests): Update tests to use assertEqual by @narendasan in #2800
- feat: Add support for is_causal argument in attention by @gs-olive in #2780
- feat: Adding support for native int64 by @narendasan in #2789
- chore: small mypy issue by @narendasan in #2803
- Rand converter - evaluator by @apbose in #2580
- cherry-pick: Python Runtime Windows Builds on TRT 10 (#2764) by @gs-olive in #2776
- feat: support 1d ITensor offsets for embedding_bag converter by @zewenli98 in #2677
- chore(deps): bump transformers from 4.36.0 to 4.38.0 in /tools/perf by @dependabot in #2766
- fix: a bug in func run_test_compare_tensor_attributes_only by @zewenli98 in #2809
- Fix ModuleNotFoundError in ptq by @HolyWu in #2814
- docs: Example on how to use custom kernels in Torch-TensorRT by @narendasan in #2812
- typo fix in doc on saving models by @laikhtewari in #2818
- chore: Remove CUDNN dependencies by @zewenli98 in #2804
- fix: bug in elementwise base for static inputs by @zewenli98 in #2819
- Use environment for docgen by @atalman in #2826
- tool: Opset coverage notebook by @narendasan in #2831
- ci: Add release flag for nightly build tag by @gs-olive in #2821
- [doc] Update options documentation for torch.compile by @lanluo-nvidia in #2834
- feat(//py/torch_tensorrt/dynamo): Support for BF16 by @narendasan in #2833
- feat: data parallel inference examples by @bowang007 in #2805
- fix: bugs in TRT 10 upgrade by @zewenli98 in #2832
- feat: support aten._cdist_forward converter by @chohk88 in #2726
- chore: cherry pick of #2805 by @bowang007 in #2851
- feat: Add support for multi-device safe mode in C++ by @gs-olive in #2824
- feat: support aten.log1p converter by @chohk88 in #2823
- feat: support aten.as_strided converter by @chohk88 in #2735
- fix: Fix deconv kernel channel num_output_maps where wts are ITensor by @andi4191 in #2678
- Aten scatter converter by @apbose in #2664
- fix user_guide and tutorial docs by @yoosful in #2854
- chore: Make from and to methods use the same TRT API by @narendasan in #2858
- add aten.topk implementation by @lanluo-nvidia in #2841
- feat: support aten.atan2.out converter by @chohk88 in #2829
- chore: update docker, refactor CI TRT dep to main by @peri044 in #2793
- feat: Cherry pick of Add validators for dynamic shapes in converter registration by @peri044 in #2849
- feat: support aten.diagonal converter by @chohk88 in #2856
- Remove ops from decompositions where converters exist by @HolyWu in #2681
- slice_scatter decomposition by @apbose in #2519
- select_scatter decomp by @apbose in #2515
- manylinux wheel file build update for TensorRT-10.0.1 by @lanluo-nvidia in #2868
- replace itemset due to numpy version 2.0 removed itemset api by @lanluo-nvidia in #2879
- chore: cherry-pick of DS feature by @peri044 in #2857
- feat: TS Add converter supp...
Torch-TensorRT v2.3.0
Windows Support, Dynamic Shape and Quantization in Dynamo , PyTorch 2.3, CUDA 12.1, TensorRT 10.0
Torch-TensorRT 2.3.0 targets PyTorch 2.3, CUDA 12.1 (builds for CUDA 11.8 are available via the PyTorch package index - https://download.pytorch.org/whl/cu118) and TensorRT 10.0. 2.3.0 adds official support for Windows as a platform. Windows will only support using the Dynamo frontend and currently users are required to use the Python-only runtime (support for the C++ runtime will be added in a future version). This release also adds support for Dynamic shape without recompilation. Users can also now use quantized models with Torch-TensorRT using the Model Optimizer toolkit (https://github.com/NVIDIA/TensorRT-Model-Optimizer).
Note: Python 3.12 is not supported as the Dynamo stack in PyTorch 2.3.0 does not support Python 3.12
Windows
In this release we introduce Windows support for the Python runtime using the Dynamo paths. Users can now directly optimize PyTorch models with TensorRT on Windows, with minimal code changes. This integration enables Python-only optimization in the Torch-TensorRT Dynamo compilation paths (ir="dynamo" and ir="torch_compile").
import torch
import torch_tensorrt
import torchvision.models as models
model = models.resnet18(pretrained=True).eval().to("cuda")
input = torch.randn((1, 3, 224, 224)).to("cuda")
trt_mod = torch_tensorrt.compile(model, ir="dynamo", inputs=[input])
trt_mod(input)
Dynamic Shaped Model Compilation in Dynamo
Dynamic shape support has become more robust in v2.3.0. Torch-TensorRT now leverages symbolic information in the graph to calculate intermediate shape ranges, which allows more dynamic shape cases to be supported. For AOT workflows using torch.export, using these new features requires no changes. For JIT workflows, which previously used torch.compile guards to automatically recompile the engines when the input size changes, users can now mark dynamic dimensions using torch APIs (https://pytorch.org/docs/stable/torch.compiler_dynamic_shapes.html). Using these APIs means that as long as inputs do not violate the specified constraints, engines will not recompile.
AOT workflow
import torch
import torch_tensorrt
compile_spec = {"inputs": [torch_tensorrt.Input(min_shape=(1, 3, 224, 224),
opt_shape=(4, 3, 224, 224),
max_shape=(8, 3, 224, 224),
dtype=torch.float32)],
"enabled_precisions": {torch.float}}
trt_model = torch_tensorrt.compile(model, **compile_spec)
JIT workflow
import torch
import torch_tensorrt
compile_spec = {"enabled_precisions": {torch.float}}
inputs = torch.randn((4, 3, 224, 224)).to("cuda")
# This indicates the dimension 0 is dynamic and the range is [1, 8]
torch._dynamo.mark_dynamic(inputs, 0, min=1, max=8)
trt_model = torch.compile(model, backend="tensorrt", options=compile_spec)
More information can be found here: https://pytorch.org/TensorRT/user_guide/dynamic_shapes.html
Explicit Dynamic Shape support in Converters
Converters now explicitly declare their support for dynamic shapes, and we are progressively adding and verifying this support. Converter writers can specify support for dynamic shapes using the supports_dynamic_shapes argument of the dynamo_tensorrt_converter decorator.
@dynamo_tensorrt_converter(
torch.ops.aten.convolution.default,
capability_validator=lambda conv_node: conv_node.args[7] in ([0], [0, 0], [0, 0, 0]),
supports_dynamic_shapes=True,
) # type: ignore[misc]
def aten_ops_convolution(
ctx: ConversionContext,
target: Target,
args: Tuple[Argument, ...],
kwargs: Dict[str, Argument],
name: str,
) -> Union[TRTTensor, Sequence[TRTTensor]]:
By default, if a converter has not been marked as supporting dynamic shape, its operator will be run in PyTorch when the user has specified the inputs as dynamic. This is done to ensure that compilation will succeed with some valid compiled module. However, many operators already support dynamic shape in an untested fashion. Therefore, users can decide to enable the full converter library for dynamic shape using the assume_dynamic_shape_support flag. This flag assumes all converters support dynamic shape, leading to more operations being run in TensorRT, with the potential drawback that some ops may cause compilation or runtime failures. Future releases will progressively add dynamic shape coverage for all Core ATen Operators.
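A sketch of opting in to assume_dynamic_shape_support (the flag as named above; the input specification mirrors the AOT example earlier in this section):

trt_model = torch_tensorrt.compile(
    model,
    ir="dynamo",
    inputs=[torch_tensorrt.Input(min_shape=(1, 3, 224, 224),
                                 opt_shape=(4, 3, 224, 224),
                                 max_shape=(8, 3, 224, 224),
                                 dtype=torch.float32)],
    assume_dynamic_shape_support=True,  # treat every converter as dynamic-shape capable
)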
Quantization in Dynamo
We introduce support for model quantization in FP8. We support models quantized using NVIDIA TensorRT-Model-Optimizer toolkit. This toolkit introduces quantization nodes in the graph which are converted and used by TensorRT to quantize the model into lower precision. Although the toolkit supports quantization in various datatypes, we only support FP8 in this release.
Please refer to our end-end example Torch Compile VGG16 with FP8 and PTQ on how to use this.
Engine Version and Hardware Compatibility
We introduce new compilation arguments, hardware_compatible: bool
and version_compatible: bool
, which enable two key features in TensorRT.
hardware_compatible
Enabling hardware compatibility mode will generate TRT Engines which are compatible with Ampere and newer GPUs. As a result, engines built on one GPU can later be run on others, without requiring recompilation.
version_compatible
Enabling version compatibility mode will generate TRT Engines which are compatible with newer versions of TensorRT. As a result, engines built with one version of TensorRT will be forward compatible with other TRT versions, without needing recompilation.
...
trt_mod = torch_tensorrt.compile(model, ir="dynamo", inputs=[input], hardware_compatible=True, version_compatible=True)
...
New Data Type Support
Torch-TensorRT includes a number of new data types that leverage dedicated hardware on Ampere, Hopper and future architectures.
bfloat16 has been added as a supported type alongside FP16 and FP32 that can be enabled for additional kernel tactic options. Models that contain BF16 weights can now be provided to Torch-TensorRT without modification. FP8 has been added, with support for Hopper and newer architectures, as a new quantization format (see below), similar to INT8. Finally, native support for INT64 inputs and computation has been added. In the past, the truncate_long_and_double feature flag had to be enabled in order to handle INT64 and FLOAT64 computation, inputs and weights. This flag would cause the compiler to truncate any INT64 or FLOAT64 objects to INT32 and FLOAT32 respectively. Now INT64 objects will not be truncated and remain in INT64. As such, the truncate_long_and_double flag has been renamed truncate_double, as FLOAT64 truncation is still required; truncate_long_and_double is now deprecated.
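For models that still contain FLOAT64 constants or computation, the renamed flag is passed at compile time; a sketch, with settings otherwise mirroring the examples above:

trt_mod = torch_tensorrt.compile(
    model,
    ir="dynamo",
    inputs=[input],
    truncate_double=True,  # demote FLOAT64 to FLOAT32; INT64 is now handled natively
)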
What's Changed
- feat: support group_norm, batch_norm, and layer_norm by @zewenli98 in #2330
- support argmax converter by @bowang007 in #2291
- feat: Decomposition for _unsafe_index by @gs-olive in #2386
- docs: Add documentation of torch.compile backend usage by @gs-olive in #2363
- fix: Remove supported ops from decompositions by @gs-olive in #2390
- fix: Converter, inputs, and utils bugfixes for Transformer XL by @gs-olive in #2404
- feat: support embedding_bag converter (1D input) by @zewenli98 in #2395
- feat: support chunk dynamo converter by @zewenli98 in #2401
- chore: Add documentation for dynamo.compile backend by @peri044 in #2389
- Support new FX Legacy Registry in opset coverage tool by @laikhtewari in #2366
- fix: type error in embedding_bag by @zewenli98 in #2418
- feat: support cumsum dynamo converter by @zewenli98 in #2403
- 2.0 docs overhaul by @narendasan in #2420
- feat: support tile dynamo converter by @zewenli98 in #2402
- chore: update perf tooling to add dynamo options by @peri044 in #2423
- feat: Add aten.unbind decomposition for VIT by @gs-olive in #2430
- fix: Segfault fix for Benchmarks by @gs-olive in #2432
- examples: Stable Diffusion torch.compile sample with output image by @gs-olive in #2417
- minor fix: Parse out slashes in Docker container name by @gs-olive in #2437
- fix: Docs rendering on PyTorch site by @gs-olive in #2440
- Numpy changes for aten::index converter by @apbose in #2396
- feat: a lowering pass to re-compose ops into aten.linear by @zewenli98 in #2411
- chore: fix docs for export by @peri044 in #2447
- chore: add additional BN native converter by @peri044 in #2446
- minor fix: Update Benchmark values by @gs-olive in #2453
- Dele...
Torch-TensorRT v2.2.0
Dynamo Frontend for Torch-TensorRT, PyTorch 2.2, CUDA 12.1, TensorRT 8.6
Torch-TensorRT 2.2.0 targets PyTorch 2.2, CUDA 12.1 (builds for CUDA 11.8 are available via the PyTorch package index - https://download.pytorch.org/whl/cu118) and TensorRT 8.6. This release is the second major release of Torch-TensorRT, as the default frontend has changed from TorchScript to Dynamo, allowing users to more easily control and customize the compiler in Python.
The Dynamo frontend supports both JIT workflows through torch.compile and AOT workflows through torch.export + torch_tensorrt.compile. It targets the Core ATen Opset (https://pytorch.org/docs/stable/torch.compiler_ir.html#core-aten-ir) and currently has 82% coverage. Just like in TorchScript, graphs will be partitioned based on the ability to map operators to TensorRT, in addition to any graph surgery done in Dynamo.
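A minimal sketch of the two workflows (MyModel and the input are placeholders; the torch.compile backend is registered under the name "torch_tensorrt"):
import torch
import torch_tensorrt

model = MyModel().eval().cuda()               # placeholder module
x = torch.randn(1, 3, 224, 224).cuda()

# JIT workflow: the TRT engine is built lazily on the first call
jit_mod = torch.compile(model, backend="torch_tensorrt")
jit_mod(x)

# AOT workflow: the module is exported with torch.export, then compiled ahead of time
trt_mod = torch_tensorrt.compile(model, ir="dynamo", inputs=[x])
trt_mod(x)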
Output Format
Through the Dynamo frontend, different output formats can be selected for AOT workflows via the output_format kwarg. The choices are: torchscript, where the resulting compiled module is traced with torch.jit.trace and is suitable for Python-less deployments; exported_program, a new serializable format for PyTorch models; and graph_module, which returns a torch.fx.GraphModule if you would like to run further graph transformations on the resulting model.
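A sketch of selecting each format via the kwarg described above (MyModel and the input are placeholders):
import torch
import torch_tensorrt

model = MyModel().eval().cuda()               # placeholder module
x = torch.randn(1, 3, 224, 224).cuda()

# Serializable torch.export.ExportedProgram
ep = torch_tensorrt.compile(model, ir="dynamo", inputs=[x], output_format="exported_program")

# Traced TorchScript module, suitable for Python-less deployments
ts = torch_tensorrt.compile(model, ir="dynamo", inputs=[x], output_format="torchscript")

# torch.fx.GraphModule for further graph transformations
gm = torch_tensorrt.compile(model, ir="dynamo", inputs=[x], output_format="graph_module")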
Multi-GPU Safety
To address a long-standing source of overhead, single-GPU systems will now operate without the typically required device checks. This check can be re-added when multiple GPUs are available to the host process using torch_tensorrt.runtime.set_multi_device_safe_mode
# Enables Multi Device Safe Mode
torch_tensorrt.runtime.set_multi_device_safe_mode(True)
# Disables Multi Device Safe Mode [Default Behavior]
torch_tensorrt.runtime.set_multi_device_safe_mode(False)
# Enables Multi Device Safe Mode, then resets the safe mode to its prior setting
with torch_tensorrt.runtime.set_multi_device_safe_mode(True):
...
More information can be found here: https://pytorch.org/TensorRT/user_guide/runtime.html
Capability Validators
In the Dynamo frontend, tests can be written and associated with converters to dynamically enable or disable them based on conditions in the target graph.
For example, the convolution converter in Dynamo only supports 1D, 2D, and 3D convolution. We can therefore create a lambda which, given a convolution FX node, can determine whether the convolution is supported:
@dynamo_tensorrt_converter(
torch.ops.aten.convolution.default,
capability_validator=lambda conv_node: conv_node.args[7] in ([0], [0, 0], [0, 0, 0])
) # type: ignore[misc]
def aten_ops_convolution(
ctx: ConversionContext,
target: Target,
args: Tuple[Argument, ...],
kwargs: Dict[str, Argument],
name: str,
) -> Union[TRTTensor, Sequence[TRTTensor]]:
    ...  # converter implementation elided
In such a case, where the Node is not supported, the node will be partitioned out and run in PyTorch.
All capability validators are run prior to partitioning, after the lowering phase.
More information on writing converters for the Dynamo frontend can be found here: https://pytorch.org/TensorRT/contributors/dynamo_converters.html
Breaking Changes
- Dynamo (torch.export) is now the default frontend for Torch-TensorRT. The TorchScript and FX frontends are now in maintenance mode. Therefore, any torch.nn.Modules or torch.fx.GraphModules provided to torch_tensorrt.compile will by default be exported using torch.export and then compiled. This default can be overridden by setting the ir=[torchscript|fx] kwarg. Any bugs reported will first be attempted to be resolved in the Dynamo stack before attempting other frontends; however, pull requests for additional functionality in the TorchScript and FX frontends from the community will still be accepted.
What's Changed
- chore: Update Torch and Torch-TRT versions and docs on main by @gs-olive in #1784
- fix: Repair invalid schema arising from lowering pass by @gs-olive in #1786
- fix: Allow full model compilation with collection inputs (input_signature) by @gs-olive in #1656
- feat(//core/conversion): Add support for aten::size with dynamic shaped models for Torchscript backend. by @peri044 in #1647
- feat: add support for aten::baddbmm by @mfeliz-cruise in #1806
- [feat] Add dynamic conversion path to aten::mul evaluator by @mfeliz-cruise in #1710
- [fix] aten::stack with dynamic inputs by @mfeliz-cruise in #1804
- fix undefined attr issue by @bowang007 in #1783
- fix: Out-Of-Bounds bug in Unsqueeze by @gs-olive in #1820
- feat: Upgrade Docker build to use custom TRT + CUDNN by @gs-olive in #1805
- fix: include str ivalue type conversion by @bowang007 in #1785
- fix: dependency order of inserted long input casts by @mfeliz-cruise in #1833
- feat: Add ts converter support for aten::all.dim by @mfeliz-cruise in #1840
- fix: Error caused by invalid binding name in TRTEngine.to_str() method by @gs-olive in #1846
- fix: Implement aten.mean.default and aten.mean.dim converters by @gs-olive in #1810
- feat: Add converter for aten::log2 by @mfeliz-cruise in #1866
- feat: Add support for aten::where with scalar other by @mfeliz-cruise in #1855
- feat: Add converter support for logical_and by @mfeliz-cruise in #1856
- feat: Refactor FX APIs under dynamo namespace for parity with TS APIs by @peri044 in #1807
- fix: Add version checking for torch._dynamo import in __init__ by @gs-olive in #1881
- fix: Improve Docker build robustness, add validation by @gs-olive in #1873
- fix: Improve input weight handling to acc_ops convolution layers in FX by @gs-olive in #1886
- fix: Upgrade main to TRT 8.6, CUDA 11.8, CuDNN 8.8, Torch Dev by @gs-olive in #1852
- feat: Wrap dynamic size handling in a compilation flag by @peri044 in #1851
- fix: Add torchvision legacy CI parameter by @gs-olive in #1918
- Sync fb internal change to OSS by @wushirong in #1892
- fix: Reorganize Dynamo directory + backends by @gs-olive in #1928
- fix: Improve partitioning + lowering systems in torch.compile path by @gs-olive in #1879
- fix: Upgrade TRT to 8.6.1, parallelize FX tests in CI by @gs-olive in #1930
- feat: Add issue template for Story by @gs-olive in #1936
- feat: support type promotion in aten::cat converter by @mfeliz-cruise in #1911
- Reorg for converters in (FX Converter Refactor [1/N]) by @narendasan in #1867
- fix: Add support for default dimension in aten.cat by @gs-olive in #1863
- Relaxing glob pattern for CUDA12 by @borisfom in #1950
- refactor: Centralizing sigmoid implementation (FX Converter Refactor [2/N]) <Target: converter_reorg_proto> by @narendasan in #1868
- fix: Address .numpy() issue on fake tensors by @gs-olive in #1949
- feat: Add support for passing through build issues in Dynamo compile by @gs-olive in #1952
- fix: int/int=float division by @mfeliz-cruise in #1957
- fix: Support dims < -1 in aten::stack converter by @mfeliz-cruise in #1947
- fix: Resolve issue in isInputDynamic with mixed static/dynamic shapes by @mfeliz-cruise in #1883
- DLFW changes by @apbose in #1878
- feat: Add converter for aten::isfinite by @mfeliz-cruise in #1841
- Reorg for converters in hardtanh(FX Converter Refactor [5/N]) <Target: converter_reorg_proto> by @apbose in #1901
- fix/feat: Add lowering pass to resolve most aten::Int.Tensor uses by @gs-olive in #1937
- fix: Add decomposition for aten.addmm by @gs-olive in #1953
- Reorg for converters tanh (FX Converter Refactor [4/N]) <Target: converter_reorg_proto> by @apbose in #1900
- Reorg for converters leaky_relu (FX Converter Refactor [6/N]) <Target: converter_reorg_proto> by @apbose in #1902
- Upstream 3 features to fx_ts_compat: MS, VC, Optimization Level by @wu6u3tw in #1935
- fix: Add lowering pass to remove output repacking in convert_method_to_trt_engine calls by @gs-olive in #1945
- Fixing aten::slice invalid schema and i...
Torch-TensorRT v1.4.0
PyTorch 2.0, CUDA 11.8, TensorRT 8.6, Support for the new torch.compile API, compatibility mode for FX frontend
Torch-TensorRT 1.4.0 targets PyTorch 2.0, CUDA 11.8, and TensorRT 8.6. This release introduces a number of beta features to set the stage for working with PyTorch and TensorRT in the 2.0 ecosystem. Primarily, this includes a new torch.compile backend targeting Torch-TensorRT. It also adds a compatibility layer that allows users of the TorchScript frontend for Torch-TensorRT to seamlessly try FX and Dynamo.
torch.compile Backend for Torch-TensorRT
One of the most prominent new features in PyTorch 2.0 is the torch.compile workflow, which enables users to accelerate code easily by specifying a backend of their choice. Torch-TensorRT 1.4.0 introduces a new backend for torch.compile as a beta feature, including a convenience frontend to perform accelerated inference. This frontend can be accessed in one of two ways:
import torch_tensorrt
torch_tensorrt.dynamo.compile(model, inputs, ...)
##### OR #####
torch_tensorrt.compile(model, ir="dynamo_compile", inputs=inputs, ...)
For more examples, see the provided sample scripts, which can be found here
This compilation method has a couple of key considerations:
- It can handle models with data-dependent control flow
- It automatically falls back to Torch if the TRT Engine Build fails for any reason
- It uses the Torch FX aten library of converters to accelerate models
- Recompilation can be caused by changing the batch size of the input, or providing an input which enters a new control flow branch
- Compiled models cannot be saved across Python sessions (yet)
The feature is currently in beta, and we expect updates, changes, and improvements to the above in the future.
fx_ts_compat Frontend
As the ecosystem transitions from TorchScript to Dynamo, users of Torch-TensorRT may want to start experimenting with this stack. As such, we have introduced a new frontend for Torch-TensorRT which exposes the same APIs as the TorchScript frontend but uses the FX/Dynamo compiler stack. You can try this frontend by using the ir="fx_ts_compat" setting:
torch_tensorrt.compile(..., ir="fx_ts_compat")
What's Changed
- Fix build by @yinghai in #1479
- add circle CI signal in README page by @yinghai in #1481
- fix eisum signature by @yinghai in #1480
- Fix link to CircleCI in README.md by @yinghai in #1483
- Minor changes by @yinghai in #1482
- [FX] Changes done internally at Facebook by @frank-wei in #1456
- chore: upload docs for 1.3.0 by @narendasan in #1504
- fix: Repair Citrinet-1024 compilation issues by @gs-olive in #1488
- refactor: Split elementwise tests by @peri044 in #1507
- [feat] Support 1D topk by @mfeliz-cruise in #1491
- Support aten::sum with bool tensor input by @mfeliz-cruise in #1512
- [fix]Disambiguate cast layer names by @mfeliz-cruise in #1513
- feat: Add functionality for easily benchmarking fx code on key models by @gs-olive in #1506
- [feat]Canonicalize aten::multiply to aten::mul by @mfeliz-cruise in #1517
- broadcast the two input shapes for transposed matmul by @nvpohanh in #1457
- make padding layer converter more efficient by @nvpohanh in #1470
- fix: Change equals-check from reference to value for BERT model not compiling in FX by @gs-olive in #1539
- Update README dependencies section for v1.3.0 by @take-cheeze in #1540
- fix: aten::where with differing-shape inputs bugfix by @gs-olive in #1533
- fix: Automatically send truncated long ints to cuda at shape analysis time by @gs-olive in #1541
- feat: Add functionality to FX benchmarking + Improve documentation by @gs-olive in #1529
- [fix] Fix crash when calling unbind on evaluated tensor by @mfeliz-cruise in #1554
- Update test_flatten_aten and test_reshape_aten due to PT2.0 changed tracer behavior for these ops by @frank-wei in #1559
- fix: Bugfix for align_corners=False - FX interpolate by @gs-olive in #1561
- fix: Properly cast intermediate Int8 tensors to TensorRT Engines in Fallback by @gs-olive in #1549
- Upgrade stack to Pytorch 2.0 + CUDA 11.7 + TRT 8.5 GA by @peri044 in #1477
- feat: Add option to specify int64 as an Input dtype by @gs-olive in #1551
- feat: Support int inputs to aten::max/min and aten::argmax/argmin by @mfeliz-cruise in #1574
- fix: Add aten::full_like evaluator by @gs-olive in #1584
- tools: assign 1 person to a bug instead of all by @narendasan in #1604
- feat: Add support for aten::meshgrid by @mfeliz-cruise in #1601
- [FX] Changes done internally at Facebook by @frank-wei in #1603
- chore: Add FX core test by @peri044 in #1593
- chore: Update dockerfile by @peri044 in #1581
- fix: Replace RemoveDropout lowering pass implementation with modified JIT pass by @gs-olive in #1589
- [FX] Changes done internally at Facebook by @frank-wei in #1625
- chore: Update Dockerfile to Ubuntu 20.04 + Crash Resolution by @gs-olive in #1639
- fix: Bugfix in Linear-to-AddMM Fusion Lowering Pass by @gs-olive in #1619
- fix: Resolve compilation bug for empty tensors in aten::select by @gs-olive in #1623
- Convolution cast by @apbose in #1609
- fix: Bugfix in TRT Engine deserialization indexing by @gs-olive in #1646
- fix: fix the inappropriate lowering pass of aten::to by @bowang007 in #1649
- Lowering aten::pad to aten::constant_pad_nd/aten::reflection_padXd/aten::replication_padXd by @ruoqianguo in #1588
- [fix] Disambiguate element-wise cast layer names by @mfeliz-cruise in #1630
- feat: Add optional tensor domain argument to Input class by @gs-olive in #1537
- Improve batch_norm fp16 accuracy by @mfeliz-cruise in #1450
- add an example of aten2trt, fix batch norm pass by @frank-wei in #1685
- fix: Issue in non-Tensor Input Resolution by @gs-olive in #1617
- Corrected a typo, which was raising an error by @zshn25 in #1694
- Cherry-pick manylinux compatible builds into main by @narendasan in #1677
- fix: Improve input handling for input_signature by @gs-olive in #1698
- Unsqueeze operator with dynamic inout by @apbose in #1624
- [feat] Add converter support for index_select by @mfeliz-cruise in #1692
- [feat] Add converter support for aten::logical_not by @mfeliz-cruise in #1705
- fix: Bugfix in convNd_to_convolution lowering pass by @gs-olive in #1693
- [feat] Add converter for aten::any.dim by @mfeliz-cruise in #1707
- [fix] resolve issue for single non-batch index tensor in aten::index by @mfeliz-cruise in #1700
- fix: Handle nonetype pad value for Constant pad by @peri044 in #1712
- infra: Add Torch 1.13.1 testing to nightly CI by @gs-olive in #1731
- fix: Allow full model compilation with collection outputs by @gs-olive in #1599
- fix: fix the prim::Loop fallback issue by @bowang007 in #1691
- feat: Add decorator utility to improve error messaging for legacy support by @gs-olive in #1738
- minor fix: Update default minimum torch version for aten tracer by @gs-olive in #1747
- Get windows build working by @bharrisau in #1711
- Update config.yml by @frank-wei in #1736
- fix: Bugfix in shape analysis for multi-GPU systems by @gs-olive in #1765
- fix: Add schemas to convolution lowering pass by @gs-olive in #1728...