Releases: NVIDIA/TensorRT-LLM

v1.2.0rc0

30 Sep 07:55
560ded5
Pre-release

Announcement Highlights

  • Model Support
    • Support nano_v2_vlm in pytorch backend (#7207)
    • Add Tencent HunYuanDenseV1 model support (#7081)
    • Support Seed-OSS model in pytorch backend (#7496)
    • GPT-OSS MXFP4 support (#7451)
  • API
    • Support new structural tag API (upgrade XGrammar to 0.1.25) (#7893)
    • Enable regex and EBNF grammar in trtllm-serve (#7925); a guided-decoding sketch follows these highlights
    • Optionally disable server GC and worker GC (#7995)
    • Add serialization/deserialization options for AutoTuner profiling cache (#7738)
    • Cherry-pick (#7598): make low_precision_combine an LLM arg (#7898)
  • Benchmark
    • Add gpt-oss serve benchmark tests (#7638)
    • Exit as early as possible and propagate exit status correctly for multi-node testing (#7739)
    • Add gpt oss model for trtllm perf test (#7328)
    • Add generation logits case for llama3 (#7759)
    • Fix model issue for disagg serving (#7785)
    • Add deepseek r1/v3 model with chunked prefill cases (#7124)
    • Add accuracy benchmark in stress test (#7561)
    • Add NoSmem epilogue schedule and dynamic cluster shape for sm10x group gemm (#7757)
    • Rename llm_perf_full to llm_perf_core and add missing cases (#7899)
    • Update benchmark script (#7860)
    • Add multi-nodes test for disagg-serving (#7470)
    • Update llm_models_root to improve path handling on BareMetal environment (#7876)
    • Add DS-R1/Qwen3 test cases for RTX 6000 (#7662)
    • Add NIM perf test cases (#7924)
    • Fix the tactic sorting in TrtllmGenBatchedGemmRunner::getValidConfigIndices (#7419)
    • Improve the failure message for accuracy test suite (#7994)
    • Update get_sysinfo.py to avoid UnboundLocalError (#7982)
    • Update disagg gen-only benchmark. (#7917)
  • Feature
    • Phi4-mm image modality inference optimization (#7918)
    • Add NVFP4 x FP8 moe kernels (#7821)
    • Enable KV cache reuse and chunked prefill for mistral3.1 (#7628)
    • Enable two-model spec dec for MTP Eagle (#7001)
    • Support EPLB in Qwen3 MoE (#7443)
    • Eagle3 cuda graph support for the first draft model inference (#7363)
    • Enable run_post_quant_allgather for MoE TRTLLM backend (#6794)
    • Enable gpt oss on DGX H100. (#6775)
    • Add gpt-oss chunked prefill tests (#7779)
    • Eagle, use last hidden post norm (#7546)
    • Optimize Qwen2/2.5-VL performance (#7250)
    • Support kvcache reuse and chunk prefill for phi4mm (#7723)
    • Support attention dp for qwen3 dense model (#7618)
    • AutoDeploy: fix memory leak in fuse_moe (#7844)
    • Enable overlap scheduler for two-model spec decoding (#7651)
    • Add support of CUDA13 and sm103 devices (#7568)
    • Add Cute DSL nvfp4 linear op (#7632)
    • Enable LM tp for MTP, under attention dp case (cherry-pick #7128) (#7571)
    • Add an example of KV cache host offloading (#7767)
    • Helix: make softmax stats pointer available to attention gen (#6865)
    • AutoDeploy: graph-less transformers mode for HF (#7635)
    • Cherry-pick DeepGEMM related commits from release/1.1.0rc2 (#7716)
    • Add swapab, tileN64, cga sync support for cute dsl nvfp4 gemm (#7764)
    • FP8 Context MLA integration (Cherry-pick #6059 from release/1.1.0rc2) (#7610)
    • Update CUTLASS to 4.2 and enable SM103 group gemm (#7832)
    • Cherry-pick fix to reuse pytorch memory segments occupied by cudagraph (#7747)
    • Helix: add custom position ids to MLA kernels (#6904)
    • Support for partial sharding from factory (#7393)
    • KV cache transmission in disagg with CP on gen side (#7624)
    • Cherry-pick from #7423 Support fp8 block wide ep cherry pick (#7712)
    • E-PD Disagg Support via llmapi (3/N) (#7577)
    • Add batch waiting when scheduling (#7416)
    • Use list instead of torch tensor for new tokens in update requests (#7730)
    • Support multi-threaded tokenizers for trtllm-serve (cherry-pick) (#7776)
    • Support JIT mha.cu for SPEC_DEC in runtime (#6078)
    • Batched sampling by strategy (supersedes enable_mixed_sampler, cf. TRTLLM-7156) (#7294)
    • Enable prompt_logprobs in pytorch backend (#7580)
    • Support SWA KV cache reuse (#6768)
    • Return topk logprobs in torch backend (#7756)
    • CapturedGraph to support max_batch_size > max(cuda_graph_batch_sizes) (#7888)
    • Revert " Return topk logprobs in torch backend (#7756)" (#7969)
    • DeepEP LL fp8 dispatch/combine (#7927)
    • Helix: add alltoall op (#6815)
    • Optimize kv cache transfer TEP (#7613)
    • Add environment variable to adjust block pool allocation ratio under kv cache manager (#7923)
    • Add a standalone buffer cache class and reuse buffers between cudagraph and no-graph flow (#7669)
    • Add static tree sampling and verification (#7161)
    • Add support for KVCache transfer from KVCache reuse path (#6348)
    • Added AutoDeploy backend support to test_perf.py (#7588)
    • Speed up concat k and copy k_nope in context phase using torch.compile (#8044)
  • Documentation
    • Fix the link in the doc (#7713)
    • Clean the doc folder and move the outdated docs into lega… (#7729)
    • Add doc for KV cache salting support (#7772)
    • Fix section header of llm_kv_cache_offloading example (#7795)
    • Update Documentation link to point to docs instead of docs source code (#6495)
    • Cherry-pick deployment guide update from 1.1.0rc2 branch to main branch (#7774)
    • Tech blog: Combining Guided Decoding and Speculative Decoding: Making CPU and GPU Cooperate Seamlessly (#7864)
    • Update tech blog12 (#7884)
    • Add known issues to llmapi doc (#7560)
    • Add blackwell information into support matrix (#6740)
    • Fix an invalid link and a typo. (#7634)
    • Use hash id for external link (#7641)
    • Add labels description note into llm api section (#7696)
    • Enhance api reference doc by labeling stable APIs (#7751)
    • Add 1.0 release notes (#7605)
    • Scaffolding tech blog part one (#7835)
    • Update docker cmd in quick start guide and trtllm-serve … (#7787)
    • Replace the main in the examples' link with commit id. (#7837)
    • Rename TensorRT-LLM to TensorRT LLM for homepage and the … (#7850)
    • Add a guide for modifying APIs (#7866)
    • Update Perf-Overview.md for release/1.0 (#7848)
    • Add stable label to all the un-labelled arguments in LLM class (#7863)
    • Fix invalid links in perf benchmarking. (#7933)
    • Add Llama PP known issue to release note (#7959)
    • Add acknowledgements in scaffolding tech blog (#7983)
    • Add scaffolding tech blog to cover (#8021)
    • Refine perf overview.md and correct the error link in per… (#8035)
    • Scaffolding tech blog fix a typo (#8042)
    • Document hang issue caused by UnpicklingError (#8049)
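
As a rough illustration of the guided-decoding items above (structural tag API, regex and EBNF grammar support), the sketch below shows regex-constrained generation through the Python LLM API. It assumes the GuidedDecodingParams helper and the guided_decoding_backend argument keep their current names; check the LLM API reference of this release for the exact fields.

```python
# Minimal sketch of regex-constrained generation via the LLM API.
# Assumes GuidedDecodingParams and guided_decoding_backend are available
# under these names; verify against the LLM API reference.
from tensorrt_llm import LLM, SamplingParams
from tensorrt_llm.llmapi import GuidedDecodingParams

llm = LLM(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",  # placeholder checkpoint
    guided_decoding_backend="xgrammar",          # XGrammar was upgraded to 0.1.25
)

# Constrain the completion to an ISO-8601 date such as 2025-09-30.
params = SamplingParams(
    max_tokens=16,
    guided_decoding=GuidedDecodingParams(regex=r"\d{4}-\d{2}-\d{2}"),
)

for output in llm.generate(["Today's date is "], params):
    print(output.outputs[0].text)
```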

What's Changed

  • [None][feat] Eagle, use last hidden post norm by @IzzyPutterman in #7546
  • [None][infra] AutoDeploy: codeowners for autodeploy unit tests by @lucaslie in #7743
  • [TRTLLM-6668][feat] Enable overlap scheduler for two-model spec decoding by @ziyixiong-nv in #7651
  • [None][ci] move qwen3 tests from GB200 to B200 by @QiJune in #7733
  • [None][feat] support attention dp for qwen3 dense model by @Nekofish-L in #7618
  • [None][doc] Fix the link in the doc by @Shixiaowei02 in #7713
  • [TRTLLM-4629] [feat] Add support of CUDA13 and sm103 devices by @VALLIS-NERIA in #7568
  • [TRTLLM-6295][test] Exit as early as possible and propagate exit status correctly for multi-node testing by @chzblych in #7739
  • [None][chore] Add failed cases into waives.txt by @xinhe-nv in #7735
  • [None][fix] Ensure that the W4A8 custom input scale remains aligned across all ranks by @yilin-void in #7614
  • [None][chore] Fix error when running trtllm-bench without cuda graph. by @bobboli in #7725
  • [None][doc] Clean the doc folder and move the outdated docs into lega… by @nv-guomingz in #7729
  • [TRTLLM-6898][feat] Add Cute DSL nvfp4 linear op by @limin2021 in #7632
  • [None] [chore] cherry pick changes on slurm scripts from release/1.1.0rc2 by @kaiyux in #7750
  • [https://nvbugs/5503529][fix] Change test_llmapi_example_multilora to get adapters path from cmd line to avoid downloading from HF by @amitz-nv in #7740
  • [TRTLLM-7070][feat] add gpt-oss serve benchmark tests by @xinhe-nv in #7638
  • [None][fix] waive hang tests on main by @xinhe-nv in #7720
  • [https://nvbugs/5471106][fix] Remove the waivers by @ziyixiong-nv in #7711
  • [None][chore] Add failed cases into waives.txt by @xinhe-nv in #7746
  • Revert "[None][feat] support attention dp for qwen3 dense model" by @byshiue in #7765
  • [TRTLLM-8044][refactor] Rename data -> cache for cacheTransceiver by @Tabrizian in #7659
  • [None][chore] AutoDeploy: neat disablement of transforms in pipeline by @lucaslie in #7736
  • [None][chore] Remove unused get_quant_scales methods by @achartier in #7687
  • [None][infra] add nspect allow list for false positive secrets by @yuanjingx87 in #5797
  • [TRTLLM-7398][doc] Add doc for KV cache salting support by @chang-l in #7772
  • [None][infra] Update CI allowlist 2025-09-16 ...

v1.0.0

24 Sep 12:53
ae8270b

TensorRT LLM Release 1.0

TensorRT LLM 1.0 brings two major changes: the PyTorch-based architecture is now stable and the default experience, and the LLM API is now stable. For more details on what is new in 1.0, see below.
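
With the PyTorch backend and the LLM API both stable, a minimal end-to-end run looks roughly like the sketch below (the checkpoint name is a placeholder; any supported Hugging Face model works):

```python
from tensorrt_llm import LLM, SamplingParams

# The PyTorch backend is now the default: no explicit TensorRT engine build step.
llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")  # placeholder HF checkpoint

prompts = ["Hello, my name is", "The capital of France is"]
for output in llm.generate(prompts, SamplingParams(max_tokens=32)):
    print(output.prompt, "->", output.outputs[0].text)
```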

Key Features and Enhancements

  • Model Support

    • Add Mistral3.1 VLM model support
    • Add TensorRT-Engine Qwen3 (dense) model support
    • Add phi-4-multimodal model support
    • Add EXAONE 4.0 model support
    • Add Qwen3 MoE support to TensorRT backend
  • Features

    • Add support for sm121
    • Add LoRA support for Gemma3
    • Support PyTorch LoRA adapter eviction
    • Add LoRA support for PyTorch backend in trtllm-serve
    • Add support of scheduling attention dp request
    • Remove padding of FusedMoE in attention DP
    • Support torch compile for attention dp
    • Add KV events support for sliding window attention
    • Add TRTLLM MoE nvfp4 cubins for mid-high concurrency; attention_dp for TRTLLM MoE
    • Add Piecewise CUDA Graph support for MLA
    • Support multiCtasKvMode for high-throughput MLA kernels
    • Enable kvcache to be reused during request generation
    • Add ADP schedule balance optimization
    • Add chunked prefill support for MLA (Blackwell)
    • Enable Multi-block mode for Hopper spec dec XQA kernel
    • Add vLLM KV Pool support for XQA kernel
    • Allow sending more than 2GiB through MPI by using mpi4py.util.pkl5
    • Add support for fused gate_up_proj scales for FP8 blockwise
    • Support FP8 row-wise dense GEMM in torch flow
    • Enable fp8 SwiGLU to minimize host overhead
    • Add Deepseek R1 FP8 Support on Blackwell
    • Add support for MXFP8xMXFP4 in pytorch
    • Support nvfp4 model and fp8 kv cache for MLA chunked prefill (Blackwell)
    • Open-source MoE MXFP8-MXFP4 implementation
    • Add support for Modelopt fp8_pb_wo quantization scheme
    • Support deepEP fp4 post quant all2all dispatch
    • Fuse w4a8 moe pre-quant scale on Hopper
    • Support Weight-Only-Quantization in PyTorch Workflow
    • Add support for per expert activation scaling factors
    • Add ReDrafter support for Qwen
    • Enable CUDA Graph for Nemotron-H
    • Add support for YARN in NemotronNAS models
    • Switch to internal version of MMProjector in Gemma3
    • Disable add_special_tokens for Llama3.3 70B
    • Auto-enable ngram with concurrency <= 32
    • Support turning on/off spec decoding dynamically
    • Support structural tag in C++ runtime and upgrade xgrammar to 0.1.21
    • Add support for external multimodal embeddings
    • Add support for disaggregation with pp with pytorch backend
    • Add status tags to LLM API reference
    • Support JSON Schema in OpenAI-Compatible API
    • Support chunked prefill on spec decode 2 model
    • Add KV cache reuse support for multimodal models
    • Support nanobind bindings
    • Add support for two-model engine KV cache reuse
    • Add Eagle-3 support for qwen3 dense model
    • Migrate Eagle-3 and draft/target speculation to Drafter
    • Enable guided decoding with overlap scheduler
    • Support n-gram speculative decoding with disagg
    • Add beam search support to the PyTorch Workflow
    • Add LLGuidance Support for PyTorch Backend
    • Add NGrams V2 support
    • Add MTP support for Online EPLB
    • Support disaggregated serving in TRTLLM Sampler
    • Add core infrastructure to enable loading of custom checkpoint formats
    • Support TRTLLM_DEEP_EP_TOKEN_LIMIT to allow running DeepEP on memory-constrained GPUs
    • Use huge page mapping for host accessible memory on GB200
    • Add user-provided speculative decoding support
    • Add streaming scaffolding_llm.generate_async support
    • Detokenize option in /v1/completions request
    • Integrate TRT-LLM Gen FP4 block scale MoE with Pytorch workflow kernel autotuner
    • Remove support for llmapi + TRT backend in Triton
    • Add request_perf_metrics to triton LLMAPI backend
    • Add support for Triton request cancellation
  • Benchmark:

    • Add support for benchmarking individual gemms in MOE benchmark (#6080)
    • Add speculative metrics for trtllm-bench
    • Add the ability to write a request timeline for trtllm-bench
    • Add no_kv_cache_reuse option and streaming support for trtllm-serve bench
    • Add latency support for trtllm-bench
    • Add Acceptance Rate calculation to benchmark_serving
    • Add wide-ep benchmarking scripts
    • Update trtllm-bench to support new Pytorch default
    • Add support for TRTLLM CustomDataset
    • Make benchmark_serving part of the library
  • Documentation:

    • Refactored the doc structure to focus on the PyTorch workflow.
    • Improved the LLMAPI and API reference documentation. Stable APIs are now protected and will remain consistent in subsequent versions following v1.0.
    • Removed legacy documentation related to the TensorRT workflow.

Infrastructure Changes

  • The base Docker image for TensorRT-LLM is updated to nvcr.io/nvidia/pytorch:25.06-py3.
  • The base Docker image for TensorRT-LLM Backend is updated to nvcr.io/nvidia/tritonserver:25.06-py3.
  • The dependent NVIDIA ModelOpt version is updated to 0.33.
  • The dependent xgrammar version is updated to 0.1.21.
  • The dependent transformers version is updated to 4.53.1.

API Changes

  • BREAKING CHANGE Promote PyTorch to be the default LLM backend
  • BREAKING CHANGE Change default backend to PyTorch in trtllm-serve
  • BREAKING CHANGE Unify KvCacheConfig in LLM class for pytorch backend (see the sketch after this list)
  • BREAKING CHANGE Rename cuda_graph_config padding_enabled field
  • BREAKING CHANGE Rename mixed_sampler to enable_mixed_sampler
  • BREAKING CHANGE Rename LLM.autotuner_enabled to enable_autotuner
  • Add back allreduce_strategy parameter into TorchLlmArgs
  • Add LLmArgs option to force using dynamic quantization
  • Change default LoRA cache sizes and change peft_cache_config cache size fields to take effect when not explicitly set in lora_config
  • Remove deprecated LoRA LLM args that are already specified in lora_config
  • Add request_perf_metrics to LLMAPI
  • Remove batch_manager::KvCacheConfig and use executor::KvCacheConfig instead
  • Remove TrtGptModelOptionalParams
  • Remove ptuning knobs from TorchLlmArgs
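
Put together, the renames above change LLM construction roughly as in the sketch below; treat the keyword names as illustrative and confirm them (for example enable_mixed_sampler and enable_autotuner) against the 1.0 LLM API reference.

```python
from tensorrt_llm import LLM
from tensorrt_llm.llmapi import KvCacheConfig

# Sketch of the renamed/unified 1.0 options; values are placeholders.
llm = LLM(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    # KvCacheConfig is now the single KV-cache knob for the PyTorch backend.
    kv_cache_config=KvCacheConfig(
        free_gpu_memory_fraction=0.8,
        enable_block_reuse=True,
    ),
    enable_mixed_sampler=True,  # previously: mixed_sampler
    enable_autotuner=True,      # previously: LLM.autotuner_enabled
)
```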

Fixed Issues

  • Fix illegal memory access in MLA (#6437)
  • Fix nemotronNAS loading for TP>1 (#6447)
  • Fix wide EP when using DeepEP with online EPLB (#6429)
  • Fix bugs caused by None attention_bias during Qwen3 model convert engine (#6344)
  • Fix PD + MTP + overlap scheduler accuracy issue (#6136)
  • Fix bug of Qwen3 when using fp4 on sm120 (#6065)
  • Fix TMA error with GEMM+AR on TP=2 (#6075)
  • Fix scaffolding aime test in test_e2e (#6140)
  • Fix KV Cache overrides in trtllm-bench (#6103)
  • Fix MOE benchmark to rotate buffers to prevent L2 cache reuse (#4135)
  • Fix eagle3 two model disaggregated serving test (#6014)
  • Fix chunked prefill + overlap scheduling (#5761)
  • Fix mgmn postprocess error (#5835)
  • Fallback to cubins for fp8 fmha kernels on Ada (#5779)
  • Fix disagg + speculative decoding (#5558)
  • Fix test_generate_with_seed CI failure. (#5772)
  • Fix prompt adapter TP2 case (#5782)
  • Fix disaggregate serving with attention DP (#4993)
  • Fix a quote error introduced in #5534 (#5816)
  • Fix the accuracy issue when reduce_fusion is enabled for GEMMA model. (#5801)
  • Fix lost requests for disaggregated serving (#5815)
  • Update unit tests: skip all_close assert for dropout in attention, increase tolerance for rope op test (#5855)
  • Fix GEMM+AR fusion on blackwell (#5563)
  • Fix llama4 multimodal support (#5809)
  • Fix Llama4 Scout FP4 crash issue (#5925)
  • Fix max batch size and max tokens in kv cache estimations for Nemotron-H (#5371)
  • Fix moe regression for sm120 (#5823)
  • Fix Qwen2.5VL FP8 support (#5029)
  • Fix the illegal memory access issue in moe gemm on SM120 (#5636)
  • Fix the case where tileN is not divisible by 16 and support sm89 deepgemm bmm (#5531)
  • Fix incremental detokenization (#5825)
  • Fix MoE workspace info by storing Torch tensor itself instead of data_ptr (#5900)
  • Fix mistral unit tests due to transformers upgrade (#5904)
  • Fix the Llama3.1 405B hanging issue. (#5698) (#5925)
  • Fix Gemma3 unit tests due to transformers upgrade (#5921)
  • Fix alltoall for llama4 (apply_router_weight_on_input=True) (#5902)
  • Remove SpecConfig and fix thread leak issues (#5931)
  • Fast redux detection in trtllm gen routing kernel (#5941)
  • Fix cancel request logic (#5800)
  • Fix errors in wide-ep scripts (#5992)
  • Fix error in post-merge-tests (#5949)
  • Fix missing arg to alltoall_prepare_maybe_dispatch (#5669)
  • Fix attention DP not working with embedding TP (#5642)
  • Fix broken cyclic reference detection (#5417)
  • Fix permission for local user issues in NGC docker container. (#5373)
  • Fix mtp vanilla draft inputs (#5568)
  • Fix mPtrExpertCounts allocation in MoE TRT-LLM backend (nvfp4) (#5519)
  • Fix block scale fp8 support for deepseek v3 on Blackwell. (#5514)
  • Fix the issue MoE autotune fallback failed to query default heuristic (#5520)
  • Fix the unexpected keyword argument 'streaming' (#5436)

Known Issues

  • When using disaggregated serving with pipeline parallelism and KV cache reuse, a hang can occur. This will be fixed in a future release. In the meantime, disabling KV cache reuse avoids the issue.
  • Running multi-node cases where each node has just a single GPU is known to fail. This will be addressed in a future release.
  • For the Llama 3.x and Llama 4 models, there is an issue with pipeline parallelism when using FP8 and NVFP4 weights. As a workaround, set the environment variable TRTLLM_LLAMA_EAGER_FUSION_DISABLED=1 (for example, export TRTLLM_LLAMA_EAGER_FUSION_DISABLED=1 in the shell), as shown in the snippet below.
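
A small sketch of that workaround applied from Python, for cases where exporting the variable in the shell is inconvenient (the checkpoint and parallelism settings are placeholders):

```python
import os

# Work around the Llama 3.x / Llama 4 pipeline-parallel issue with FP8 / NVFP4 weights.
# Set the variable before constructing the LLM (equivalent to exporting it in the shell).
os.environ["TRTLLM_LLAMA_EAGER_FUSION_DISABLED"] = "1"

from tensorrt_llm import LLM

llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",  # placeholder checkpoint
    tensor_parallel_size=4,
    pipeline_parallel_size=2,
)
```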

What's Changed


v1.1.0rc5

18 Sep 01:49
0c9430e
Pre-release

Announcement Highlights

  • Model Support
    • Enable NvFP4/FP8 quantization for Nemotron-H architecture (#7589)
    • Enable KV-cache reuse and add E2E tests for llava-next (#7349)
    • Support gpt-oss with fp8 kv cache (#7612)
    • Support kvcache reuse for phi4mm (#7563)
  • API
    • Add TorchLlmArgs to the connector api (#7493)
  • Benchmark
    • Extend test_perf.py to add disagg-serving perf tests (#7503)
    • Add accuracy test for deepseek-r1 with chunked_prefill (#7365)
  • Feature
    • Optimize MLA kernels with separate reduction kernels (#7597)
    • Wrap MOE with custom op (#7277)
    • Make the should_use_spec_decode logic a bit smarter (#7112)
    • Use a shell context to install dependencies (#7383)
    • Topk logprobs for TRT backend and top1 logprob for PyT backend (#6097)
    • Support chunked prefill for multimodal models (#6843); see the sketch after these highlights
    • Optimize MLA chunked prefill and support FP8 MLA chunked prefill (#7477)
    • Disable deep_gemm for Qwen3 QKNormRoPEAttention and Linear layers due to accuracy issues (#7616)
    • Add deepseek r1-w4afp8 quickstart (#7645)
    • Nanobind: Allow none types for fields in result (#7672)
    • Use arrival time in llmapi when creating LlmRequest in pytorch workflow (#7553)
    • UCX ZMQ IP: support IPv6 (#7530)
    • Refactor: Quantization Transforms with Inheritance (#7227)
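
As a rough sketch of how the chunked-prefill work above is enabled from the LLM API (the knob names enable_chunked_prefill and max_num_tokens are assumed to be unchanged; the multimodal checkpoint is a placeholder):

```python
from tensorrt_llm import LLM

# Chunked prefill splits long context phases into token-budget-sized chunks
# so context and generation requests can share an iteration.
llm = LLM(
    model="Qwen/Qwen2.5-VL-7B-Instruct",  # placeholder multimodal checkpoint
    enable_chunked_prefill=True,
    max_num_tokens=2048,                  # per-iteration token budget / chunk size
)
```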

What's Changed

  • [None][chore] Remove closed bugs by @xinhe-nv in #7591
  • [https://nvbugs/5454559][fix] handle bias term in fuse_gate_mlp by @Linda-Stadter in #7449
  • [None][fix] enable NvFP4/FP8 quantization for Nemotron-H architecture by @tomeras91 in #7589
  • [None][feat] Optimize MLA kernels with separate reduction kernels by @PerkzZheng in #7597
  • [https://nvbugs/5445466][fix] unwaive DS R1 test cases with bug already fixed by @lancelly in #7429
  • [#6798][fix] fix compilation error in ub_allocator in single device build by @WilliamTambellini in #6874
  • [https://nvbugs/5434424][fix] A quick fix for the wrong output issue of SM89 blocked scaling batched GEMM when the input tensor is non-contiguous. by @StudyingShao in #7615
  • [None][chore] add TorchLlmArgs to the connector api by @richardhuo-nv in #7493
  • [TRTLLM-6707][fix] nanobind fix for executor exit call by @Linda-Stadter in #7565
  • [None][ci] add DGX_H100-2_GPUs-PyTorch-Others-1 pipeline by @QiJune in #7629
  • [TRTLLM-7408][feat] Wrap MOE with custom op. by @liji-nv in #7277
  • [TRTLLM-5059][feat] Enable KV-cache reuse and add E2E tests for llava-next by @chang-l in #7349
  • [None][fix] fix post-merge issue raised by #5488 by @nv-guomingz in #7655
  • [https://nvbugs/5410687][test] Add deepseek r1-w4afp8 quickstart by @fredricz-20070104 in #7645
  • [None][fix]UCX zmq ip support ipv6 by @chuangz0 in #7530
  • [None][feat] Make the should_use_spec_decode logic a bit smarter by @zheyuf in #7112
  • [#5861][autodeploy] Refactor: Quantization Transforms with Inheritance by @Fridah-nv in #7227
  • [#7208][fix] Fix config type of MedusaConfig by @karljang in #7320
  • [None][infra] Bump version to 1.1.0rc5 by @yiqingy0 in #7668
  • [TRTLLM-7871][infra] Extend test_perf.py to add disagg-serving perf tests. by @bo-nv in #7503
  • [https://nvbugs/5494698][fix] skip gemma3 27b on blackwell by @xinhe-nv in #7505
  • [https://nvbugs/5477359][fix] Nanobind: Allow none types for fields in result by @Linda-Stadter in #7672
  • [None][chore] remove executor config in kv cache creator by @leslie-fang25 in #7526
  • [https://nvbugs/5488212][waive] Waive failed tests for L20 by @nvamyt in #7664
  • [None][feat] Use a shell context to install dependancies by @v-shobhit in #7383
  • [https://nvbugs/5505402] [fix] Disable deep_gemm for Qwen3 QKNormRoPEAttention and Linear layers due to accuracy issues by @DomBrown in #7616
  • [None][infra] Waive failed cases on main 0910 by @EmmaQiaoCh in #7676
  • [None][infra] Adjust labeling llm prompt for bug issues by @karljang in #7385
  • [None][ci] move some test cases from l40s to a30 by @QiJune in #7684
  • [None][fix] Fix the incorrect header file import in dataType.h by @Fan-Yunfan in #7133
  • [https://nvbugs/5498165][fix] fix permission error for config file lock by @chang-l in #7656
  • [https://nvbugs/5513192][fix] Add the missing param for kv_cache_tran… by @nv-guomingz in #7679
  • [TRTLLM-1302][feat] Topk logprobs for TRT backend and top1 logprob for PyT backend by @LinPoly in #6097
  • [TRTLLM-7169][infra] Fix Slurm multi-node test showing "Submit Test Results" in the test name by @ZhanruiSunCh in #6856
  • [TRTLLM-6791][infra] Add check for uploading stage name and avoid overriding test result tar file by @ZhanruiSunCh in #6742
  • [None][ci] Some improvements for Slurm CI by @chzblych in #7689
  • [None][ci] Test waives for the main branch 09/14 by @chzblych in #7698
  • [None][feat] support gpt-oss with fp8 kv cache by @PerkzZheng in #7612
  • [TRTLLM-6903][feat] Support chunked prefill for multimodal models by @chang-l in #6843
  • [None][chore] Add failed cases into waives.txt by @xinhe-nv in #7682
  • [None][chore] Enable multiple postprocess workers tests for chat completions api by @JunyiXu-nv in #7602
  • [TRTLLM-7279][test] add accuracy test for deepseek-r1 with chunked_prefill by @crazydemo in #7365
  • [https://nvbugs/5467981][fix] Fix Qwen2.5-VL fails with cuda graph padding by @DylanChen-NV in #7122
  • [None][chore] move some cases from post-merge to pre-merge to detect errors in early stage by @HuiGao-NV in #7699
  • [TRTLLM-7918][feat] Support kvcache reuse for phi4mm by @Wanli-Jiang in #7563
  • [None][test] add test for min_tokens by @ixlmar in #7678
  • [TRTLLM-7918][feat] Revert "Support kvcache reuse for phi4mm (#7563)" by @Wanli-Jiang in #7722
  • [None][fix] using arrival time in llmapi when creating LlmRequest in pytorch workflow by @zhengd-nv in #7553
  • [TRTLLM-7192][feat] optimize MLA chunked prefill && support fp8 mla chunked prefill by @jmydurant in #7477
  • [None][ci] Test waives for the main branch 09/15 by @chzblych in #7709

New Contributors

Full Changelog: v1.1.0rc4...v1.1.0rc5

v1.1.0rc4

10 Sep 07:32
62b564a
Pre-release

Announcement Highlights:

  • Model Support
    • Support phi-4 model in pytorch backend (#7371)
    • Support Aggregate mode for phi4-mm (#7521)
  • API
    • Implement basic functionalities for Responses API (#7341)
    • Support multiple postprocess workers for chat completions API (#7508)
    • Report failing requests (#7060)
  • Benchmark
    • Test trtllm-serve with --extra_llm_api_options (#7492); a config sketch follows these highlights
  • Feature
    • Add MOE support for dynamic cluster shapes and custom epilogue schedules (#6126)
    • Autotune TRT-LLM Gen MoE when using CUDA graphs (#7285)
    • Enable guided decoding with speculative decoding (part 2: one-model engine) (#6948)
    • Separate run_shape_prop as another graph utility (#7313)
    • MultiLayer Eagle (#7234)
    • Fuse d2t to logitsBitmaskKernel and fix a race condition in one-model spec (#7481)
    • Add NVFP4 x FP8 (#6809)
    • Support hashing and KV cache reuse for videos (#7360)
    • Add MCTS and TOT tree-based inference controllers to Scaffolding (#7490)
    • Introduce QKNormRoPEAttention module (#6830)
    • AutoDeploy: flexible args for sequence interface + AD multi-modal input processor + llama4 VLM example (#7221)
    • Support KV cache salting for secure KV cache reuse (#7106)
    • trtllm-gen kernels support sm103 (#7570)
    • Move stop_criteria to sample_async (#7041)
    • KV cache transfer for uneven pp (#7117)
    • Update multimodal utility get_num_tokens_per_image for better generalization (#7544)
    • AutoDeploy: set torch recompile_limit based on cuda_graph_batch_sizes and refactored (#7219)
    • Add Request specific exception (#6931)
    • Add DeepSeek-v3-0324 e2e torch test (#7413)
    • Add 8-GPU test cases for RTX6000 (#7083)
    • add gptoss 20g tests (#7361)
    • Nixl support for GDS (#5488)
    • CMake option to link statically with cublas/curand (#7178)
    • Extend VLM factory and add Mistral3 factory (#7583)
  • Documentation
    • fix example in docstring (#7410)
    • Fix formatting error in Gemma3 readme (#7352)
    • Add note about trtllm-serve to the devel container (#7483)
    • add GPT OSS Eagle3 blog (#7140)
    • 1.0 Documentation. (#6696)
    • Update kvcache part (#7549)
    • Rename TensorRT-LLM to TensorRT LLM. (#7554)
    • refine docs for accuracy evaluation of gpt-oss models (#7252)
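
For context on the --extra_llm_api_options flag exercised in the benchmark item above: it points trtllm-serve at a YAML file whose keys mirror LLM API arguments. A hedged sketch, with placeholder keys that should be checked against the LLM API reference:

```python
import subprocess

# Keys in this YAML mirror LLM API arguments; the values below are placeholders.
extra_options = """\
kv_cache_config:
  free_gpu_memory_fraction: 0.85
  enable_block_reuse: true
enable_chunked_prefill: true
"""

with open("extra_llm_api_options.yaml", "w", encoding="utf-8") as f:
    f.write(extra_options)

# Launch the OpenAI-compatible server with the extra options applied.
subprocess.run(
    [
        "trtllm-serve",
        "TinyLlama/TinyLlama-1.1B-Chat-v1.0",  # placeholder checkpoint
        "--extra_llm_api_options", "extra_llm_api_options.yaml",
    ],
    check=True,
)
```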

What's Changed

  • [https://nvbugs/5485430][fix] Copy the nanobind file when using precompiled package by @jiaganc in #7334
  • [None][infra] Using local variables in rerun function by @yiqingy0 in #7198
  • [None][ci] Correct docker args for GPU devices and remove some stale CI codes by @chzblych in #7417
  • [https://nvbugs/5476580][fix] unwaive test_nvfp4_4gpus by @Superjomn in #7454
  • [None][test] auto reuse torch empty cache on qa test by @crazydemo in #7421
  • [None][doc] fix example in docstring by @tomeras91 in #7410
  • [TRTLLM-6643][feat] Add DeepSeek-v3-0324 e2e torch test by @aalanwyr in #7413
  • [None][infra] waive test case failed on post-merge by @HuiGao-NV in #7471
  • [TRTLLM-7208][feat] Implement basic functionalities for Responses API by @JunyiXu-nv in #7341
  • [https://nvbugs/5453992][unwaive] Unwaive llama quickstart test by @peaceh-nv in #7242
  • [None][infra] Waive failed tests on main branch 0902 by @EmmaQiaoCh in #7482
  • [None][chore] Fix formatting error in Gemma3 readme by @karljang in #7352
  • [https://nvbugs/5470782][fix] Add specific test names for test_deepseek.py by @SimengLiu-nv in #7318
  • [https://nvbugs/5458798][fix] Disabled test_trtllm_bench_backend_comparison due to timeout by @MrGeva in #7397
  • [None][chore] Add note about trtllm-serve to the devel container by @MartinMarciniszyn in #7483
  • [None][chore] rm executor config in kv cache connector by @leslie-fang25 in #7372
  • [None][perf] Add MOE support for dynamic cluster shapes and custom epilogue … by @djns99 in #6126
  • [None][perf] Autotune TRT-LLM Gen MoE when using CUDA graphs by @jinyangyuan-nvidia in #7285
  • [TRTLLM-7261][feat] Support phi-4 model in pytorch backend by @Wanli-Jiang in #7371
  • [https://nvbugs/5480289][fix] release slot manager in mtp MTPHiddenStatesManager by @yweng0828 in #7340
  • [https://nvbugs/5488141][fix] Unwaive llama3 test_eagle3 by @mikeiovine in #7486
  • [https://nvbugs/5472947][fix] wait on isend handles before reusing buffers by @amukkara in #7462
  • [TRTLLM-7363][test] Add 8-GPU test cases for RTX6000 by @StanleySun639 in #7083
  • [https://nvbugs/5485593][fix] improve accuracy/test_disaggregated_serving.py by @reasonsolo in #7366
  • [None][doc] add GPT OSS Eagle3 blog by @IzzyPutterman in #7140
  • [None][fix] Fix KV cache recompute in draft_target spec decode by @mikeiovine in #7348
  • [TRTLLM-7028][feat] Enable guided decoding with speculative decoding (part 2: one-model engine) by @syuoni in #6948
  • [None][chore] Remove two unused parameters in create_py_executor by @leslie-fang25 in #7458
  • [#7222][autodeploy] Separate run_shape_prop as another graph utility by @Fridah-nv in #7313
  • [None][fix] Fix a numerical stability issue for XQA with spec dec by @lowsfer in #7114
  • [https://nvbugs/5470769][fix] fix disagg-serving accuracy test case by @reasonsolo in #7338
  • [TRTLLM-7876][test] Test trtllm-serve with --extra_llm_api_options by @StanleySun639 in #7492
  • [https://nvbugs/5485102][fix] Correctly set stride for piecewise outp… by @liji-nv in #7442
  • [TRTLLM-7442][model] Remove unnecessary D2H copies by @2ez4bz in #7273
  • [TRTLLM-6199][infra] Update for using open driver from BSL by @EmmaQiaoCh in #7430
  • [None][fix] Fix a typo in the Slurm CI codes by @chzblych in #7485
  • [TRTLLM-6342][fix] Fixed triggering BMM sharding by @greg-kwasniewski1 in #7389
  • [None][fix] fix hunyuan_moe init bug by @sorenwu in #7502
  • [None][chore] Bump version to 1.1.0rc4 by @yiqingy0 in #7525
  • [https://nvbugs/5485886][fix] Fix resource free of Eagle3ResourceManager by @kris1025 in #7437
  • [TRTLLM-6893][infra] Disable the x86 / SBSA build stage when run BuildDockerImage by @ZhanruiSunCh in #6729
  • [https://nvbugs/5477730][fix] Fix the alltoall case when tp_size larger than ep_size by @WeiHaocheng in #7331
  • [TRTLLM-6308][feat] Support Aggregate mode for phi4-mm by @Wanli-Jiang in #7521
  • [None][ci] set TORCHINDUCTOR_COMPILE_THREADS for thop/parallel tests by @QiJune in #7489
  • [None][test] update nim and full test list by @crazydemo in #7468
  • [None][feat] MultiLayer Eagle by @IzzyPutterman in #7234
  • [TRTLLM-7027][feat] Fuse d2t to logitsBitmaskKernel and fix a race condition in one-model spec by @syuoni in #7481
  • [OMNIML-2336][feat] Add NVFP4 x FP8 by @sychen52 in #6809
  • [https://nvbugs/5492485][fix] Use offline dataset from llm-models instead. by @yuxianq in #7435
  • [TRTLLM-7410][feat] Support hashing and KV cache reuse for videos by @chang-l in #7360
  • [https://nvbugs/5369366] [fix] Report failing requests by @arekay in #7060
  • [None][feat] Add Request specific exception by @Shunkangz in #6931
  • [#3325][feat] Add MCTS and TOT tree-based inference controllers to Scaffolding by @therealnaveenkamal in #7490
  • [https://nvbugs/5483615][fix] Remove unnecessary assertion to let mai… by @liji-nv in #7441
  • [None][ci] remove unnecessary test_modeling_deepseek.py by @QiJune in #7542
  • [None][chore] Remove closed bugs by @xinhe-nv in #7408
  • [TRTLLM-6642][feat] add gptoss 20g tests by @xinhe-nv in #7361
  • [None][ci] Increase the number of retries in docker image generation by @chzblych in #7557
  • [None][infra] update nspect version by @niukuo in #7552
    *...

v1.1.0rc2.post2

15 Sep 05:11
ef0d06d
Pre-release

Announcement Highlights

  • Feature
    • Add MNNVL AlltoAll tests to pre-merge (#7465)
    • Support multi-threaded tokenizers for trtllm-serve (#7515)
    • FP8 Context MLA integration (#7581)
    • Support block wise FP8 in wide ep (#7423)
    • Cherry-pick Responses API and multiple postprocess workers support for chat harmony (#7600)
    • Make low_precision_combine an LLM arg (#7598)
  • Documentation
    • Update deployment guide and cherry-pick CI test fix from main (#7623)

What's Changed

  • [None] [test] Add MNNVL AlltoAll tests to pre-merge by @kaiyux in #7465
  • [TRTLLM-7292][feat] Support multi-threaded tokenizers for trtllm-serve by @nv-yilinf in #7515
  • [None][fix] trtllm-serve yaml loading by @Superjomn in #7551
  • [None][chore] Bump version to 1.1.0rc2.post2 by @yiqingy0 in #7582
  • [https://nvbugs/5498967][fix] Downgrade NCCL by @yizhang-nv in #7556
  • [TRTLLM-6994][feat] FP8 Context MLA integration. by @yuxianq in #7581
  • [TRTLLM-7831][feat] Support block wise FP8 in wide ep by @xxi-nv in #7423
  • [None][chore] Make use_low_precision_moe_combine as a llm arg by @zongfeijing in #7598
  • [None][fix] Update deployment guide and cherry-pick CI test fix from main by @dongfengy in #7623
  • [None][feat] Cherry-pick Responses API and multiple postprocess workers support for chat harmony by @JunyiXu-nv in #7600
  • [None][chore] Fix kernel launch param and add TRTLLM MoE backend test by @pengbowang-nv in #7524

New Contributors

Full Changelog: v1.1.0rc2.post1...v1.1.0rc2.post2

v1.1.0rc2.post1

06 Sep 00:06
9d6e87a
Pre-release

Announcement Highlights:

  • API
    • Update TargetInfo to accommodate CP in disagg (#7224)
  • Benchmark
    • Minor fixes to slurm and benchmark scripts (#7453)
  • Feature
    • Support DeepGEMM swap-AB on sm100 (#7355)
    • Merge add sparse exp and shared exp into local re… (#7422)
    • Add batch waiting when scheduling (#7287)
    • Reuse pytorch memory segments occupied by cudagraph pool (#7457)
    • Complete the last missing allreduce op in Llama3/4 (#7420)
  • Documentation
    • Exposing the ADP balance strategy tech blog (#7380)
    • Update Dynasor paper info (#7137)
    • store blog 10 media via lfs (#7375)

What's Changed

  • [None][doc] Exposing the ADP balance strategy tech blog by @juney-nvidia in #7380
  • [None][feat] Update TargetInfo to accommodate CP in disagg by @brb-nv in #7224
  • [None][docs] Update Dynasor paper info by @AndyDai-nv in #7137
  • [None] [fix] store blog 10 media via lfs by @Funatiq in #7375
  • [TRTLLM-7250][fix] Add failed cases into waives.txt by @xinhe-nv in #7342
  • [None][chore] bump version to 1.1.0rc2.post1 by @litaotju in #7396
  • [TRTLLM-6747][feat] Merge add sparse exp and shared exp into local re… by @zongfeijing in #7422
  • [None] [fix] Fix nsys in slurm scripts by @kaiyux in #7409
  • [None][feat] Support DeepGEMM swap-AB on sm100 by @Barry-Delaney in #7355
  • [None] [fix] Minor fixes to slurm and benchmark scripts by @kaiyux in #7453
  • [None][fix] Fix possible mpi broadcast and gather issue on large object by @dongxuy04 in #7507
  • [TRTLLM-7008][fix] Add automatic shared memory delete if already exist by @dongxuy04 in #7377
  • [None][ci] Cherry-pick some improvements for Slurm CI setup from main branch by @chzblych in #7479
  • [https://nvbugs/5481434][feat] Reuse pytorch memory segments occupied by cudagraph pool by @HuiGao-NV in #7457
  • [None][fix] Update DG side branch name by @Barry-Delaney in #7491
  • [None][fix] Update DG commit by @Barry-Delaney in #7534
  • [None][fix] Fix a typo in the Slurm CI codes (#7485) by @chzblych in #7538
  • [https://nvbugs/5488582][fix] Avoid unexpected Triton recompilation in DG fused_moe. by @hyukn in #7495
  • [None][fix] Cherry-pick 6850: Complete the last missing allreduce op in Llama3/4. by @hyukn in #7420
  • [None][opt] Add batch waiting when scheduling by @yunruis in #7287
  • [https://nvbugs/5485325][fix] Add a postprocess to the model engine to fix the CUDA graph warmup issue when using speculative decoding by @lfr-0531 in #7373
  • [None][fix] Cherry-Pick MNNVLAllreduce Fixes into release/1.1.0rc2 branch by @timlee0212 in #7487

New Contributors

Full Changelog: v1.1.0rc2...v1.1.0rc2.post1

v1.1.0rc3

04 Sep 08:24
e81c50d
Pre-release

Announcement Highlights:

  • Model Support
    • Add fp8 support for Mistral Small 3.1 (#6731)
  • Benchmark
    • add benchmark TRT flow test for MIG (#6884)
    • Mistral Small 3.1 accuracy tests (#6909)
  • Feature
    • Update TargetInfo to accommodate CP in disagg (#7224)
    • Merge add sparse exp and shared exp into local reduction (#7369)
    • Support NVFP4 KV Cache (#6244)
    • Allocate MoE workspace only when necessary (release/1.0 retargeted) (#6955)
    • Implement capturable drafting loops for speculation (#7100)
    • Revert phi4-mm aggregate mode (#6907)
    • Complete the last missing allreduce op in Llama3/4. (#6850)
  • Documentation
    • Exposing the ADP balance strategy tech blog (#7380)
    • Update Dynasor paper info (#7137)
    • Add docs for Gemma3 VLMs (#6880)
    • add legacy section for tensorrt engine (#6724)
    • Update DeepSeek example doc (#7358)

What's Changed

New Contributors

Full Changelog: v1.1.0rc2...v1.1.0rc3

v1.1.0rc2

31 Aug 02:22
15ec2b8
Pre-release

Announcement Highlights:

  • Model Support

    • Refactor llama4 for multimodal encoder IFB (#6844)
  • API

    • Add standalone multimodal encoder (#6743)
    • Enable Cross-Attention to use XQA kernels for Whisper (#7035)
    • Enable nanobind as the default binding library (#6608)
    • trtllm-serve + autodeploy integration (#7141)
    • Chat completions API for gpt-oss (#7261); a request sketch follows these highlights
    • KV Cache Connector API (#7228)
    • Create PyExecutor from TorchLlmArgs Part 1 (#7105)
    • TP Sharding read from the model config (#6972)
  • Benchmark

    • add llama4 tp4 tests (#6989)
    • add test_multi_nodes_eval tests (#7108)
    • nsys profile output kernel classifier (#7020)
    • add kv cache size in bench metric and fix failed cases (#7160)
    • add perf metrics endpoint to openai server and openai disagg server (#6985)
    • add gpt-oss tests to sanity list (#7158)
    • add l20 specific qa test list (#7067)
    • Add beam search CudaGraph + Overlap Scheduler tests (#7326)
    • Update qwen3 timeout to 60 minutes (#7200)
    • Update maxnt of llama_v3.2_1b bench (#7279)
    • Improve performance of PyTorchModelEngine._get_lora_params_from_requests (#7033)
    • Accelerate global scale calculations for deepEP fp4 combine (#7126)
    • Remove and fuse some element-wise ops in the ds-r1-fp8 model (#7238)
    • Balance the request based on number of tokens in AttentionDP (#7183)
    • Wrap the swiglu into custom op to avoid redundant device copy (#7021)
  • Feature

    • Add QWQ-32b torch test (#7284)
    • Fix llama4 multimodal by skipping request validation (#6957)
    • Add group attention pattern for solar-pro-preview (#7054)
    • Add Mistral Small 3.1 multimodal in Triton Backend (#6714)
    • Update lora for phi4-mm (#6817)
    • refactor the CUDA graph runner to manage all CUDA graphs (#6846)
    • Enable chunked prefill for Nemotron-H (#6334)
    • Add customized default routing method (#6818)
    • Testing cache transmission functionality in Python (#7025)
    • Simplify decoder state initialization for speculative decoding (#6869)
    • Support MMMU for multimodal models (#6828)
    • Deepseek: Start Eagle work (#6210)
    • Optimize and refactor alltoall in WideEP (#6973)
    • Apply AutoTuner to fp8_block_scale_deep_gemm to trigger JIT ahead of time (#7113)
    • Hopper Fp8 context mla (#7116)
    • Padding for piecewise cudagraph (#6750)
    • Add low precision all2all for mnnvl (#7155)
    • Use numa to bind CPU (#7304)
    • Skip prefetching consolidated safetensors when appropriate (#7013)
    • Unify sampler handle logits implementation (#6867)
    • Move fusion, kvcache, and compile to modular inference optimizer (#7057)
    • Make finalize fusion part of the tactic selection logic (#6915)
    • Fuse slicing into MoE (#6728)
    • Add logging for OAI disagg server (#7232)
  • Documentation

    • Update gpt-oss deployment guide to latest release image (#7101)
    • update stale link for AutoDeploy (#7135)
    • Add GPT-OSS Deployment Guide into official doc site (#7143)
    • Refine GPT-OSS doc (#7180)
    • update feature_combination_matrix doc (#6691)
    • update disagg doc about UCX_MAX_RNDV_RAILS (#7205)
    • Display tech blog for nvidia.github.io domain (#7241)
    • Updated blog9_Deploying_GPT_OSS_on_TRTLLM (#7260)
    • Update autodeploy README.md, deprecate lm_eval in examples folder (#7233)
    • add adp balance blog (#7213)
    • fix doc formula (#7367)
    • update disagg readme and scripts for pipeline parallelism (#6875)
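
As an illustration of the OpenAI-compatible surface touched by several items above (for example the gpt-oss chat completions API), a request against a running trtllm-serve instance follows the standard Chat Completions shape. A sketch, assuming the default localhost:8000 endpoint; the port and model name are placeholders:

```python
import requests

# Standard OpenAI-style Chat Completions request against a local trtllm-serve
# instance; port and model name are assumptions for illustration.
resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "openai/gpt-oss-20b",  # whichever model trtllm-serve was launched with
        "messages": [{"role": "user", "content": "Give me one fun fact about GPUs."}],
        "max_tokens": 128,
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```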

What's Changed

  • [None][fix] Fix assertion errors of quantization when using online EPLB by @jinyangyuan-nvidia in #6922
  • [None][autodeploy] Add group attention pattern that supports attention masks by @Fridah-nv in #7054
  • [None][chore] unwaive test_disaggregated_genbs1 by @bo-nv in #6944
  • [None][fix] fix llmapi import error by @crazydemo in #7030
  • [TRTLLM-7326][feat] Add standalone multimodal encoder by @chang-l in #6743
  • [None][infra] update feature_combination_matrix of disaggregated and chunked prefill by @leslie-fang25 in #6661
  • [TRTLLM-7205][feat] add llama4 tp4 tests by @xinhe-nv in #6989
  • [None][infra] "[TRTLLM-6960][fix] enable scaled_mm tests (#6936)" by @Tabrizian in #7059
  • [TRTLLM-6341][chore] Preliminary refactors on the kv cache manager before supporting swa kv cache reuse by @eopXD in #6767
  • [None][fix] fix scaffolding dynasor test by @dc3671 in #7070
  • [None][chore] Update namelist in blossom-ci by @karljang in #7015
  • [None][ci] move unittests to sub-directories by @Funatiq in #6635
  • [None][infra] Waive failed tests on main branch 8/20 by @EmmaQiaoCh in #7092
  • [None][fix] Fix W4A8 MoE kernel issue by @yuhyao in #7072
  • [TRTLLM-7348] [feat] Enable Cross-Attention to use XQA kernels for Whisper by @DomBrown in #7035
  • [None][chore] Only check the bindings lib for current build by @liji-nv in #7026
  • [None][ci] move some tests of b200 to post merge by @QiJune in #7093
  • [https://nvbugs/5457489][fix] unwaive some tests by @byshiue in #6991
  • [TRTLLM-6771][feat] Support MMMU for multimodal models by @yechank-nvidia in #6828
  • [None][fix] Fix llama4 multimodal by skipping request validation by @chang-l in #6957
  • [None][infra] Upgrade UCX to v1.19.x and NIXL to 0.5.0 by @BatshevaBlack in #7024
  • [None][fix] update accelerate dependency to 1.7+ for AutoDeploy by @Fridah-nv in #7077
  • [None][fix] Fix const modifier inconsistency in log function declaration/implementation by @Fan-Yunfan in #6679
  • [None][chore] waive failed cases on H100 by @xinhe-nv in #7084
  • [None][fix] Use safeInitRowMax instead of fp32_lowest to avoid NaN by @lowsfer in #7087
  • [https://nvbugs/5443039][fix] Fix AutoDeploy pattern matcher for torch 2.8 by @Fridah-nv in #7076
  • [https://nvbugs/5437405][fix] qwen3 235b eagle3 ci by @byshiue in #7000
  • [None][doc] Update gpt-oss deployment guide to latest release image by @farshadghodsian in #7101
  • [https://nvbugs/5392414] [fix] Add customized default routing method by @ChristinaZ in #6818
  • [https://nvbugs/5453827][fix] Fix RPATH of th_common shared library to find pip-installed NCCL by @tongyuantongyu in #6984
  • [None][chore] No-op changes to support context parallelism in disaggregated serving later by @brb-nv in #7063
  • [https://nvbugs/5394409][feat] Support Mistral Small 3.1 multimodal in Triton Backend by @dbari in #6714
  • [None][infra] Waive failed case for main branch 08/21 by @EmmaQiaoCh in #7129
  • [#4403][refactor] Move fusion, kvcache, and compile to modular inference optimizer by @Fridah-nv in #7057
  • [None][perf] Make finalize fusion part of the tactic selection logic by @djns99 in #6915
  • [None][chore] Mass integration of release/1.0 by @dominicshanshan in #6864
  • [None][docs] update stale link for AutoDeploy by @suyoggupta in #7135
  • [TRTLLM-6825][fix] Update lora for phi4-mm by @Wanli-Jiang in #6817
  • [None][chore] Add failed cases into waives.txt by @xinhe-nv in #7109
  • [None][fix] Fix mm_placholder_counts extraction issue. by @hyukn in #7118
  • [TRTLLM-7155][feat] Unify sampler handle logits implementation. by @dcampora in #6867
  • [TRTLLM-5801][infra] Add more RTX Pro 6000 test stages by @EmmaQiaoCh in #5126
  • [None][feat] Enable nanobind as the default binding library by @Linda-Stadter in #6608
  • [TRTLLM-7321][doc] Add GPT-OSS Deployment Guide into official doc site by @dongfengy in #7143
  • [TRTLLM-7245][feat] add test_multi_nodes_eval tests by @xinhe-nv in #7108
  • [None][ci] move all B200 TensorRT test cases to post merge by @QiJune in #7165
  • [None][chore] Bump version to 1.1.0rc2 by @yiqingy0 in #7167
  • [#7136][feat] trtllm-serve + autodeploy integration by @suyoggupta in #7141
  • [TRTLLM-4921][feat] Enable chunked prefill for Nemotron-H by @tomeras91 in #6334
  • [None][refactor] Simplify decoder state initialization for speculative decoding by @Funatiq in #6869
  • [None][feat] Deepseek: Start Eag...

v1.1.0rc1

22 Aug 10:02
7334f93
Pre-release

Announcement Highlights:

  • Model Support

    • Add Tencent HunYuanMoEV1 model support (#5521)
    • Support Yarn on Qwen3 (#6785)
  • API

    • BREAKING CHANGE: Introduce sampler_type, detect sampler according to options (#6831)
    • Introduce sampler options in trtllm bench (#6855)
    • Support accurate device iter time (#6906)
    • Add batch wait timeout in fetching requests (#6923)
  • Benchmark

    • Add accuracy evaluation for AutoDeploy (#6764)
    • Add accuracy test for context and generation workers with different models (#6741)
    • Add DeepSeek-R1 FP8 accuracy tests on Blackwell (#6710)
    • Add NIM Related Cases [StarCoder2_7B] and [Codestral_22B_V01] (#6939)
    • Add NIM Related Cases Part 1 (#6684)
  • Feature

    • Support MoE INT8 Weight-Only-Quantization in PyTorch Workflow (#6629)
    • Add single block version renormalized routing kernel (#6756)
    • Use Separate QKV Input Layout for Context MLA (#6538)
    • Enable accuracy test for MTP and chunked prefill (#6314)
  • Documentation

    • Update gpt-oss doc on MoE support matrix (#6908)
    • Modify the description for MLA chunked context (#6929)
    • Update wide-ep doc (#6933)
    • Update gpt oss doc (#6954)
    • Add more documents for large scale EP (#7029)
    • Add documentation for relaxed test threshold (#6997)

What's Changed

  • [https://nvbugs/5455651][fix] Make ngram use XQA attention on Blackwell by @mikeiovine in #6873
  • [https://nvbugs/5441714][chore] remove skip on disagg n-gram test by @raayandhar in #6872
  • [None] [feat] Add Tencent HunYuanMoEV1 model support by @qianbiaoxiang in #5521
  • [None][chore] Add tests for non-existent and completed request cancellation by @achartier in #6840
  • [None][doc] Update gpt-oss doc on MoE support matrix by @hlu1 in #6908
  • [https://nvbugs/5394685][fix] using static scheduler 2CTA MLA as WAR for an accuracy issue by @PerkzZheng in #6896
  • [https://nvbugs/5437106][fix] Add L4 Scout benchmarking WAR option in deploy guide by @JunyiXu-nv in #6829
  • [None][fix] Fix the issue of responsibility boundary between the assert and tllmException files by @Fan-Yunfan in #6723
  • [None][fix] Correct reporting of torch_dtype for ModelConfig class. by @FrankD412 in #6800
  • [None][fix] Fix perfect router. by @bobboli in #6797
  • [https://nvbugs/5415862][fix] Update cublas as 12.9.1 and cuda memory alignment as 256 by @Wanli-Jiang in #6501
  • [None][fix] Update tests to use standardized uppercase backend identifiers by @bo-nv in #6921
  • [TRTLLM-7141][infra] Use repo mirrors to avoid intermittent network failures by @chzblych in #6836
  • [None][doc] Modify the description for mla chunked context by @jmydurant in #6929
  • [None][chore] Add failed cases into waives.txt by @xinhe-nv in #6914
  • [None][chore] add a EditorConfig config by @zhenhuaw-me in #6897
  • [https://nvbugs/5451373][fix] : Fix the accuracy issue when using FP8 context MLA by @peaceh-nv in #6881
  • [https://nvbugs/5405041][fix] Update wide-ep doc by @qiaoxj07 in #6933
  • [None] [chore] Mamba cache in separate file by @tomeras91 in #6796
  • [https://nvbugs/5427801][fix] Torch compile support for Llama4 and Ea… by @liji-nv in #6858
  • [https://nvbugs/5394685][fix] proper fix for the accuracy issue in 2CTA MLA kernels by @PerkzZheng in #6941
  • [https://nvbugs/5394392][fix] Enlarge scheduler capacity under disagg bs == 1 by @yifeizhang-c in #6537
  • [None][test] Add accuracy evaluation for AutoDeploy by @ajrasane in #6764
  • [None][fix] Make TP working for Triton MOE (in additional to EP we are using) by @dongfengy in #6722
  • [TRTLLM-5863][feat] Support MoE INT8 Weight-Only-Quantization in PyTorch Workflow by @Yuening-wa in #6629
  • [https://nvbugs/5401114][fix] Unwaive Gemma3 tests by @brb-nv in #6952
  • [None][chore] Bump version to 1.1.0rc1 by @yiqingy0 in #6953
  • [TRTLLM-7157][feat] BREAKING CHANGE Introduce sampler_type, detect sampler according to options by @dcampora in #6831
  • [None][fix] Skip Topk if 0 by @IzzyPutterman in #6934
  • [None][fix] Fix: Using RAII to automatically manage the allocation and release of va_list for potential resource leak by @Fan-Yunfan in #6758
  • [None][feat] Support Yarn on Qwen3 by @byshiue in #6785
  • [None][feat] Add single block version renormalized routing kernel by @ChristinaZ in #6756
  • [None][infra] Waive failed cases in main branch by @EmmaQiaoCh in #6951
  • [https://nvbugs/5390853][fix] Fix _test_openai_lora.py - disable cuda graph by @amitz-nv in #6965
  • [https://nvbugs/5451028][fix] Constrain NemotronSuper test parameters to prevent OOMs by @Naveassaf in #6970
  • [None][infra] update feature_combination_matrix of disaggregated and Eagle3 by @leslie-fang25 in #6945
  • [None][doc] Update gpt oss doc by @bobboli in #6954
  • [None] [feat] Support accurate device iter time by @kaiyux in #6906
  • [TRTLLM-7030][fix] uppercase def value in pd-config by @Shixiaowei02 in #6981
  • [None] [fix] Fix the macro name by @ChristinaZ in #6983
  • [None][infra] Waive failed tests on main 0818 by @EmmaQiaoCh in #6992
  • [None][chore] Remove duplicate test waives by @yiqingy0 in #6998
  • [None][fix] Clean up linking to CUDA stub libraries in build_wheel.py by @MartinMarciniszyn in #6823
  • [None][infra] Cherry-pick #6836 from main branch and improve SSH connection (#6971) by @chzblych in #7005
  • [TRTLLM-7158][feat] Introduce sampler options in trtllm bench by @dcampora in #6855
  • [None][infra] Enable accuracy test for mtp and chunked prefill by @leslie-fang25 in #6314
  • [None][autodeploy] Doc: fix link path in trtllm bench doc by @Fridah-nv in #7007
  • [https://nvbugs/5371480][fix] Enable test_phi3_small_8k by @Wanli-Jiang in #6938
  • [TRTLLM-7014][chore] Add accuracy test for ctx and gen workers with different models by @reasonsolo in #6741
  • [None][refactor] Refactor Torch Compile Backend, MoeLoadBalancer and warmup Logic by @yizhang-nv in #6615
  • [None] [infra] stricter coderabbit pr title generation instructions by @venkywonka in #6918
  • [TRTLLM-6960][fix] enable scaled_mm tests by @dc3671 in #6936
  • [TRTLLM-6991][chore] add DeepSeek-R1 FP8 accuracy tests on Blackwell by @lfr-0531 in #6710
  • [TRTLLM-6541][test] Add NIM Related Cases [StarCoder2_7B] and [Codestral_22B_V01] by @fredricz-20070104 in #6939
  • [https://nvbugs/5454875][ci] Unwaive Mistral Small 3.1 test by @2ez4bz in #7011
  • [TRTLLM-6541][test] Add NIM Related Cases Part 1 by @crazydemo in #6684
  • [https://nvbugs/5458798][fix] Relaxed test threshold, added documentation by @MrGeva in #6997
  • [None][opt] Add batch wait timeout in fetching requests by @Shunkangz in #6923
  • [None][chore] Remove closed bugs by @xinhe-nv in #6969
  • [None][fix] acceptance rate calculation fix in benchmark_serving by @zerollzeng in #6746
  • [None] [doc] Add more documents for large scale EP by @kaiyux in #7029
  • [None] [chore] Update wide-ep genonly scripts by @qiaoxj07 in #6995
  • [TRTLLM-7263][fix] Prevent recreation of cublas handles in lora_grouped_gemm every call by @amitz-nv in #6968
  • [https://nvbugs/5458874][fix] Fix Nemotron-H flaky CUDA graph / overlap scheduler test by @tomeras91 in #6996
  • [https://nvbugs/5455140][fix] unwaive DSR1-fp4 throughput_tp8 by @lfr-0531 in #7022
  • [None][chore] Remo...

v1.1.0rc0

16 Aug 00:09
26f413a
Pre-release

Announcement Highlights:

  • Model Support

    • Add model gpt-oss (#6645)
    • Support Aggregate mode for phi4-mm (#6184)
    • Add support for Eclairv2 model - cherry-pick changes and minor fix (#6493)
    • Support heterogeneous model execution for Nemotron-H (#6866)
    • Add whisper support (Bert Attention on SM100 and GPTAttention for cross attention on SM100) (#5527)
  • API

    • BREAKING CHANGE Enable TRTLLM sampler by default (#6216)
  • Benchmark

    • Improve Llama4 performance for small max_seqlen cases (#6306)
    • Multimodal benchmark_serving support (#6622)
    • Add perf-sweep scripts (#6738)
  • Feature

    • Support LoRA reload CPU cache evicted adapter (#6510)
    • Add FP8 context MLA support for SM120 (#6059)
    • Enable guided decoding with speculative decoding (part 1: two-model engine) (#6300)
    • Include attention dp rank info with KV cache events (#6563)
    • Clean up ngram auto mode, add max_concurrency to configs (#6676)
    • Add NCCL Symmetric Integration for All Reduce (#4500)
    • Remove input_sf swizzle for module WideEPMoE (#6231)
    • Enable guided decoding with disagg serving (#6704)
    • Make fused_moe_cute_dsl work on blackwell (#6616)
    • Move kv cache measure into transfer session (#6633)
    • Optimize CUDA graph memory usage for spec decode cases (#6718)
    • Core Metrics Implementation (#5785)
    • Resolve KV cache divergence issue (#6628)
    • AutoDeploy: Optimize prepare_inputs (#6634)
    • Enable FP32 mamba ssm cache (#6574)
    • Support SharedTensor on MultimodalParams (#6254)
    • Improve dataloading for benchmark_dataset by using batch processing (#6548)
    • Store the block of context request into kv cache (#6683)
    • Add standardized GitHub issue templates and disable blank issues (#6494)
    • Improve the performance of online EPLB on Hopper by better overlapping (#6624)
    • Enable guided decoding with CUDA graph padding and draft model chunked prefill (#6774)
    • CUTLASS MoE FC2+Finalize fusion (#3294)
    • Add GPT OSS support for AutoDeploy (#6641)
    • Add LayerNorm module (#6625)
    • Support custom repo_dir for SLURM script (#6546)
    • DeepEP LL combine FP4 (#6822)
    • AutoTuner tuning config refactor and valid tactic generalization (#6545)
    • Hopper W4A8 MoE supports ModelOpt ckpt for PyT backend (#6200)
    • Add support for Hopper MLA chunked prefill (#6655)
    • Helix: extend mapping to support different CP types (#6816)
  • Documentation

    • Remove the outdated features which marked as Experimental (#5995)
    • Add LoRA feature usage doc (#6603)
    • Add deployment guide section for VDR task (#6669)
    • Add doc for multimodal feature support matrix (#6619)
    • Move AutoDeploy README.md to torch docs (#6528)
    • Add checkpoint refactor docs (#6592)
    • Add K2 tool calling examples (#6667)
    • Add the workaround doc for H200 OOM (#6853)
    • Update moe support matrix for DS R1 (#6883)
    • BREAKING CHANGE: Mismatch between docs and actual commands (#6323)

What's Changed

  • Qwen3: Fix eagle hidden states by @IzzyPutterman in #6199
  • [None][fix] Upgrade dependencies version to avoid security vulnerability by @yibinl-nvidia in #6506
  • [None][chore] update readme for perf release test by @ruodil in #6664
  • [None][test] remove trt backend cases in release perf test and move NIM cases to llm_perf_nim.yml by @ruodil in #6662
  • [None][fix] Explicitly add tiktoken as required by kimi k2 by @pengbowang-nv in #6663
  • [None][doc]: remove the outdated features which marked as Experimental by @nv-guomingz in #5995
  • [https://nvbugs/5375966][chore] Unwaive test_disaggregated_deepseek_v3_lite_fp8_attention_dp_one by @yweng0828 in #6658
  • [TRTLLM-6892][infra] Run guardwords scan first in Release Check stage by @yiqingy0 in #6659
  • [None][chore] optimize kv cache transfer for context TEP and gen DEP by @chuangz0 in #6657
  • [None][chore] Bump version to 1.1.0rc0 by @yiqingy0 in #6651
  • [TRTLLM-6683][feat] Support LoRA reload CPU cache evicted adapter by @amitz-nv in #6510
  • [None][test] correct test-db context for perf yaml file by @ruodil in #6686
  • [None] [feat] Add model gpt-oss by @hlu1 in #6645
  • [https://nvbugs/5409414][fix] fix Not registered specs by @xinhe-nv in #6660
  • [None][feat] : Add FP8 context MLA support for SM120 by @peaceh-nv in #6059
  • [TRTLLM-6092][doc] Add LoRA feature usage doc by @shaharmor98 in #6603
  • [TRTLLM-6409][feat] Enable guided decoding with speculative decoding (part 1: two-model engine) by @syuoni in #6300
  • [TRTLLM-6881][feat] Include attention dp rank info with KV cache events by @pcastonguay in #6563
  • [None][infra] Fix guardwords by @EmmaQiaoCh in #6711
  • [None][package] Pin cuda-python version to >=12,<13 by @yiqingy0 in #6702
  • [None][doc] Add deployment guide section to the official doc website by @nv-guomingz in #6669
  • [None][fix] disagg ctx pp4 + gen pp4 integ test by @raayandhar in #6489
  • [None][feat] Clean up ngram auto mode, add max_concurrency to configs by @mikeiovine in #6676
  • [None][chore] Remove py_executor from disagg gh team by @pcastonguay in #6716
  • [https://nvbugs/5423962][fix] Address broken links by @chenopis in #6531
  • [None][fix] Migrate to new cuda binding package name by @tongyuantongyu in #6700
  • [https://nvbugs/5410687][fix] Hopper w4a8 groupwise MoE interleave by @symphonylyh in #6708
  • [None][feat] Add NCCL Symmetric Integration for All Reduce by @Tabrizian in #4500
  • [TRTLLM-6785][feat] BREAKING CHANGE Enable TRTLLM sampler by default by @dcampora in #6216
  • [TRTQA-2920][fix] Add failed cases into waives.txt by @xinhe-nv in #6719
  • [TRTLLM-5252][test] add for mistral_small_3.1_24b perf test by @ruodil in #6685
  • [TRTLLM-6744][feat] Remove input_sf swizzle for module WideEPMoE by @StudyingShao in #6231
  • [None][fix] Fix unnecessary GPU synchronization in torch sampler caused by incorrect tensor reference by @zhanghaotong in #6626
  • [TRTLLM-6854][feat] Enable guided decoding with disagg serving by @syuoni in #6704
  • [TRTLLM-5252][fix] Propagate mapping to intermediate layers by @2ez4bz in #6611
  • [None][test] fix yml condition error under qa folder by @ruodil in #6734
  • [None][doc] Add doc for multimodal feature support matrix by @chang-l in #6619
  • [TRTLLM-6898][feat] make fused_moe_cute_dsl work on blackwell by @limin2021 in #6616
  • [https://nvbugs/5436461][infra] Adjust free_gpu_memory_fraction of test_eagle3 to prevent OOM on CI by @leslie-fang25 in #6631
  • [None][refactor] Combine resmooth_to_fp8_e8m0 and transform_sf_into_required_layout by @yuxianq in #6654
  • [https://nvbugs/5437106][fix] Fix llama4 scout TRTLLM attn_backend by @JunyiXu-nv in #6690
  • [None][fix] Remove lock related typo in py_executor by @lancelly in #6653
  • [None][feat] move kv cache measure into transfer session by @zhengd-nv in #6633
  • [None][fix]revert kvcache transfer by @chuangz0 in #6709
  • [TRTLLM-6650][fix] Enhance CUDA graph + Beam search to correctly handle padding by @stnie in #6665
  • [TRTLLM-6308][feat] Support Aggregate mode for phi4-mm by @Wanli-Jiang in #6184
  • [None][feat] Optimize CUDA graph memory usage for spec decode cases by @mikeiovine in #6718
  • [TRTLLM-7025] [infra] Reorganize CODEOWNERS to rectify examples mapping by @venkywonka in #6762
  • [None][doc] Move AutoDeploy README.md to torch docs by @Fridah-nv in #6528
  • [None][fix] WAR GPT OSS on H20 with Triton MOE by @dongfengy in #6721
  • [TRTLLM-6420][feat] add support for Eclairv2 model - cherry-pick changes and minor fix by @yibinl-nvidia in #6493
  • [None][feat] Core Metrics Implementation by @hcyezhang in #5785
  • [https://nvbugs/5398180][feat] Improve Llama4 performance for small max_seqlen cases by @nv-yilinf in #6306
  • [TRTLLM-6637][feat]...