Releases · tile-ai/tilelang
v0.1.6.post2
The Last Release for Python 3.8 (without tvm-ffi) 🚀
What's Changed
- [Analyzer] Enhance ConstIntBoundAnalyzer and IntervalSet with modular set analysis by @LeiWang1999 in #856
- [Doc] Optimize the quickstart guide for clarity and not just for CUDA by @LeiWang1999 in #858
- [TMA] Bugfix when a shared buffer is both issued with tma store and tma load by @LeiWang1999 in #857
- [AMD][MLA] Fix mla autotune for rocm by @LeiWang1999 in #861
- [Bugfix] Ensure correct handling for cases where `seq_q < seq_kv` in flash attention examples by @Rachmanino in #864
- [AMD] refactor MatrixCoreIntrinEmitter by @Paran0idy in #860
- [Feat] Add fast sine and cosine definitions in CUDA templates by @Rachmanino in #865
- [Layout] Support layout forward with multi dimension by @LeiWang1999 in #867
- [Autotune][Conv] optimize convolution examples to use autotune by @LeiWang1999 in #866
- [Example] Add examples to support efficient attention sink forward process by @Rachmanino in #853
- [Parser] Adapt Parser to work with Python 3.8 in some cases by @LeiWang1999 in #869
- [Fix] Fix bug 0905: tilelang doesn't vectorize `B[i, j] = c[i] + A[i, j]` by @kurisu6912 in #798
- [Language] Support sequence comparisons by @LeiWang1999 in #872
- [Language] Support loop_break primitive by @chengyupku in #873
- [Bugfix] Use `ExprDeepEqual` instead of `StructuralEqual` when merging consecutive If statements by @LeiWang1999 in #876
- [Language] Support atomic add with ret by @LeiWang1999 in #870
- [Cython] Remove an incorrect check by @LJC00118 in #880
- Update amd_ci.yml by @Alex4210987 in #881
- [FastMath] Disable default TVM fastmath intrinsic dispatch and add explicit fastmath op to invoke by @LeiWang1999 in #875
- [Example] Add efficient attention sink backward implementations and tests by @Rachmanino in #877
- [Precision] Introduce `T.ieee_rsqrt` and related high precision ops by @LeiWang1999 in #882
- [Dist] Provide an option to include commit ID in version by @LeiWang1999 in #884
- [Example] Optimize sink attention forward via swizzled layout and report benchmark results by @Rachmanino in #885
- [Layout] Introduce Flexible Parallel to Support T.serial and local buffers inside T.Parallel loop by @LeiWang1999 in #844
- [Bugfix][Enhancement] Fix a bug in previous commit and enhance cuda backend by @Hamerlate in #887
- [Bugfix] Fix CopyNode Lower method to include disable_tma flag in GetCopyInst by @Rachmanino in #888
- [Layout] Fix plot layout by @Paran0idy in #890
- [Example] Add example by @LeiWang1999 in #894
- [News] Add announcement of support for Huawei Ascend chips by @xwhzz in #895
- [Example] Add sparse mla examples by @LeiWang1999 in #896
- [Typo] Fix backend name for Huawei Ascend by @xwhzz in #898
- [CI] Legalize math related test by @LeiWang1999 in #899
- [Bugfix] Fix flops comp and softmax scale in mla by @Edenzzzz in #900
- [Example] Specify a fixed commit for the flash-linear-attention repository and optimize nsa examples by @LeiWang1999 in #913
- [CI] optimize CI time for sparse gemm by @botbw in #906
- [Enhancement] Include compile flags into the hash key of cached kernels by @Rachmanino in #911
- [Bugfix] Fix saving kernel source code where JITKernel.artifact is None by @zjudmd1015 in #921
- [CI] Refactor import paths in dequantization examples to use dequantize_utils by @LeiWang1999 in #914
- [Example] Add MLA decode ws example by @chengyupku in #928
- [CI] Fix documentation runner by adding 'nvidia' tag by @xwhzz in #927
- [Layout] Strict annotate completed replicated layout for fragment with constant index by @LeiWang1999 in #929
- [Bugfix] Fix tensor memory copy layout by @Hamerlate in #933
- [Example] Optimize online_softmax example by @lijinpei in #934
- [Example] Add correctness assert into dsa example by @LeiWang1999 in #937
- [Enhancement] Enhance and add new GQA backward examples for Hopper by @Rachmanino in #930
- [Enhancement] Fix lint to improve grouped GEMM performance with TMA by @Cunxiao2002 in #938
- [Example] Introduce split+sum template, and optimize `atomic_add` performance for bwd examples by @LeiWang1999 in #940
- [Example] Disable TMA and enable FastMath for NSA Examples by @LeiWang1999 in #941
- [Example] Revert the atomic/split&sum templates in MHA backward examples by @Rachmanino in #943
- [Example] Add sparse mla bwd example for deepseek_v32 by @Zhichenzzz in #919
- [Profiler] Adds CUPTI profiler support by @Cunxiao2002 in #936
- [Enhancement] Support Copy for Buffer Load with scalar indices by @LeiWang1999 in #946
- [Code Style] Refine nvrtc compile related check style by @BBuf in #945
- [Backend] Add metal backend by @oraluben in #799
- [CI] enable dependabot for GHA workflows by @XuehaiPan in #950
- Modify the SM architecture number to support Thor’s sm110. by @iloveai8086 in #957
- [CI] auto-cancel in-progress PR CI when new commits are pushed by @XuehaiPan in #956
- [bug] fix type object is not subscriptable in py38 by @BBuf in #959
- [Bugfix][Doc] Add astroid version constraint to requirements.txt by @xwhzz in #958
- [CI]: Bump actions/setup-python from 2 to 6 by @dependabot[bot] in #951
- [CI]: Bump astral-sh/setup-uv from 6 to 7 by @dependabot[bot] in #952
- [CI]: Bump actions/github-script from 7 to 8 by @dependabot[bot] in #954
- [CI]: Bump actions/checkout from 2 to 5 by @dependabot[bot] in #953
- [TileOp] Implement WGMMA for T.gemm_v2 by @LeiWang1999 in #813
- [Docs] add CODE_OF_CONDUCT.md by @XuehaiPan in #965
- [Example] Add support for `bfloat16` and user-defined `sm_scale` in attention sink examples by @Rachmanino in #924
- [Bugfix] Do not force inline let stmt by @LeiWang1999 in #947
- [CI] Add `pre-commit` integration by @XuehaiPan in #955
- [Doc] Add a Docker install method to the installation docs by @BBuf in #961
- [Bugfix] Fix dummy kernel compilation by @SiriusNEO in #962
- [CI][Refactor] Refactor non-test CI workflow files by @XuehaiPan in #971
- [TileOp] Implement `CumSum1D` by @LeiWang1999 in #978
- [Language] Enhance `T.alloc_var` for AugAssign and AnnAssign by @LeiWang1999 in #979 (see the sketch after this list)
- [Refactor] Refactor pass `InjectFenceProxy` and expose some warp group primitives in frontend by @LeiWang1999 in #977
- [Typo] Remove debug print by @LeiWang1999 in #980
- [Bugfix] Use `access_ptr("r")` instead of `access_ptr("w")` for correct pipeline analysis by @LeiWang1999 in #983
- [Feature][Example] Support TMA reduce operation and update GQA bwd example by @chengyupku in #969
- [Bugfix] Add NVIDIA HPC SDK support in CUDA detection (#974) by @Degeneracy-Evil in #976
- [BugFix] Robust gemm policy for sparse_mla_fwd in Hopper and Ada Lovelace architectures by @tzj-fxz in #984
- [Bugfix] Fallback `torch.accelerator.synchronize()` to `torch.cuda.synchronize()` by @yyttt6 in #987
...
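As a quick illustration of the frontend changes above (notably the `T.alloc_var` augmented-assignment support in #979), here is a minimal, hypothetical sketch. It is not taken from the release: it assumes the `T.Kernel`/`T.serial` frontend from the project quickstart, the helper name `running_sum` is made up, and the exact `T.alloc_var` signature and scalar-assignment semantics may differ between versions.

```python
import tilelang
import tilelang.language as T


def running_sum(N, dtype="float32"):
    # Hypothetical helper, for illustration only.
    @T.prim_func
    def main(A: T.Tensor((N,), dtype), Out: T.Tensor((1,), dtype)):
        with T.Kernel(1, threads=32) as bx:
            acc = T.alloc_var(dtype)  # scalar variable, analogous to `float acc;`
            acc = 0.0
            for i in T.serial(N):
                acc += A[i]  # AugAssign on a T.alloc_var (the enhancement in #979)
            Out[0] = acc
    return main
```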
v0.1.6.post1
In version 0.1.6, libgcc and libg++ were statically linked to improve version compatibility. However, this could introduce unpredictable risks in some programs.
In post1, the build process was reworked to follow the PyTorch build workflow, eliminating those risks while keeping the improved compatibility. This is why version 0.1.6.post1 was released.
v0.1.6
What's Changed
- [Bugfix] Added missing thread offsets and other information to reduce by @LeiWang1999 in #646
- [Bugfix] Adjust role assignment in warp specialization based on read access by @chengyupku in #647
- Fix/jit kernel use target by @meinie0826 in #648
- [Bugfix] Remove small array reuse condition in shared memory allocation merging by @LeiWang1999 in #654
- [Enhancement] Add role assignment for AllocateNode in warp specialization by @chengyupku in #657
- [Bugfix][CI] Bug fixing and migrate CI from ada to hopper by @xwhzz in #652
- [CI] Enable cache for virtual env and parallelize pytest via xdist by @LeiWang1999 in #660
- [Cache] Support shared cache directories for multiple process by @LeiWang1999 in #649
- [Enhancement] Add compile_flags parameter to JIT kernel and adapter classes for improved compilation control by @xwhzz in #656
- add the support of rocm arch detecting by @zhangnju in #661
- [BugFix] Do not modify strict layout in common or relax level of layout inference. More conditions on layout checking by @tzj-fxz in #653
- [Bugfix][Docs] Update documentation build process and configurations for autoapi support by @xwhzz in #663
- [Enhancement] Improve buffer conflict detection in thread storage synchronization by @LeiWang1999 in #658
- [Bugfix] Consider buffer data type into indices provably disjoint analysis by @LeiWang1999 in #664
- [Bugfix] Remove redundant T.fill to fix precision issue by @xuchangtolearn in #667
- [Enhancement] Refactor buffer index handling for improved precision a… by @Alex4210987 in #671
- Reverts #671 by @LeiWang1999 in #672
- [Bugfix] Passing correct nvcc to cmake by @chenyang78 in #670
- [CI] Improve format check output and automate commit of changes by @xwhzz in #669
- [Bugfix][CI] Use valid runner labels in workflow by @xwhzz in #674
- [Enhancement] passing verbose to LibraryGenerator by @chenyang78 in #673
- [Enhancement] Enhance lint error messaging in CI by @xwhzz in #675
- Refactor to support upstream tvm by @Hzfengsy in #595
- Do not check for short variables by @oraluben in #676
- [Refactor] Phaseout version with commit id in editable model by @LeiWang1999 in #677
- [CI] Update CI workflow to use Python 3.12 by @LeiWang1999 in #679
- [Enhancement] Output cache-file-related messages with verbose=True by @chenyang78 in #683
- [Enhancement] Enhance warp specialization logic by @chengyupku in #680
- Add Flash Attn example on amd mi300 series by @Alex4210987 in #682
- [Enhancement] Refactored buffer detection logic in warp_specialized_rewriter.cc by @chengyupku in #685
- [Fix] fix some issues with JIT decorators existing in the examples by @Cunxiao2002 in #681
- [Enhancement] Add `--ptxas-options=--register-usage-level=10` option by @LeiWang1999 in #684
- [Feature] Add auto vectorize for atomic add by @yyttt6 in #686
- [Refactor] Rebase pipeline injector from upstream tvm by @LeiWang1999 in #687
- [Refactor] Introduce GemmInst for different targets handling by @LeiWang1999 in #688
- [Enhancement] Optimize BF16 casting performance by @xwhzz in #689
- [Smem Reuse] Optimize to do memory alignment on identical buffers. by @LeiWang1999 in #693
- [Version] Keep local commit id as it somehow help with debugging by @LeiWang1999 in #697
- [Example] Optimize warp specialize flashmla example by @LeiWang1999 in #698
- Bump transformers from 4.52.1 to 4.53.0 in /examples/bitnet-1.58b by @dependabot[bot] in #700
- Gated Delta Net(GDN) kernel implementation in TileLang by @tzj-fxz in #695
- Trivial update to calculate target arch by @oraluben in #702
- [CI] Remove Flash Attention dependency by @LeiWang1999 in #705
- [Layout] Introduce a new layout inference mechanism by @LeiWang1999 in #699
- [Pipeline] Optimize inject software pipeline and pipeline planning pass by @LeiWang1999 in #706
- Low-bit kernels fix and implementation by @tzj-fxz in #704
- [Feat] Support gemm with stride by @smallscientist1 in #701
- [Enhancement] Add eviction policy support for TMA operations, enhance CUDA codegen, and introduce new pass config by @xwhzz in #690
- [Enhancement] Enhance the robustness and generality of MLA examples by @Rachmanino in #709
- [Refactor] MergeAnnotations function to accept Map<Any, Any> instead of Map<String, Any> by @LeiWang1999 in #710
- [Pipeline] Phaseout fragment and double buffer info from pipeline pass by @LeiWang1999 in #711
- [Pipeline] Skip condition expression analysis for global reading by @LeiWang1999 in #713
- [Index] Relocate Int64 Auto Promoter to ConfigBitWidth Pass, removing it from FlattenBuffer by @LeiWang1999 in #714
- [CI] Bind build-test CI to NVIDIA as AMD runners are being introduced by @LeiWang1999 in #718
- fix: NVRTC backend by @lucifer1004 in #717
- [CUDA] Init support for sm_120 by @oraluben in #716
- [Bugfix] Correct git configuration in docs CI by @xwhzz in #720
- [Chore] fix typos by @lucifer1004 in #719
- [CI][AMD] Add AMD GPU CI and fix some related bugs by @Alex4210987 in #694
- [Carver][Bugfix] Correct score function for warp tile selection in tensorcore policy by @NaOHCC in #724
- [Refactor] Refactor CUDA code generation to simplify eviction policy handling by @LeiWang1999 in #721
- [Language] Introduce `StridedTensor` to support non-contiguous torch inputs by @LeiWang1999 in #722
- [Enhancement][Bugfix] Fix bug in warp specialized pass and add gemm_sr fallback support for Hopper by @xwhzz in #712
- 📝 Add docstrings to `fix` by @coderabbitai[bot] in #726
- fix amd ci & add examples by @Alex4210987 in #729
- [Feature] Low-bit twiddling dequantization and FP4 GEMM by @tzj-fxz in #725
- 📝 Add docstrings to `mxfp4` by @coderabbitai[bot] in #732
- [Refactor] Refactor env into a more flexible version by @LeiWang1999 in #740
- [Bugfix] Align stride index validation with torch in CythonKernelWrapper by @LeiWang1999 in #743
- [Bugfix]:Fix atomic add auto vectorize memory access out of bound error by @yyttt6 in #742
- 📝 Add docstrings to `main` by @coderabbitai[bot] in #745
- [Refactor] Refactor barrier management by @LeiWang1999 in #744
- [Refactor] Merge bulk copy into copy and improve layout inference for bulk copy by @LeiWang1999 in #746
- [Refactor] Merge ThreadPartialSync and ThreadStorageSync by @LeiWang1999 in #741
- [Enhancement] Optimize loop body handling in IR by @chengyupku in #749
- [MXFP4] Fix bugs and optimize exponential operation by @tzj-fxz in #750
- [Enhancement] Add DispatchInstruction specialization for fp8 types in gemm_sm90.h by @LeiWang1999 in #751
- [Enhancement] Add shape checking for reduce options by @kurisu6912 in #748
- [Bugfix] Add missing FP8 header include by @LeiWang1999 in #752
- [MXFP4] Add bias to MXFP4 GEMM kernel by @tzj-fxz in #753
- [Bugfix][WS] Consider loop min e...
v0.1.5
What's Changed
- [Release] Bump version from 0.1.3 into 0.1.4 by @LeiWang1999 in #375
- [Enhancement] Remove redundant recursive rewrite rule for FloorDiv in RewriteSimplifier by @LeiWang1999 in #408
- [Docker] cu128 Support by @andyluo03 in #410
- [Refactor] Phaseout python dependencies `attrs` and `decorator` by @LeiWang1999 in #411
- [Language] make linter and type checker happy with mocking by @YouJiacheng in #407
- [Bugfix] Support larger than 256 box size tma copy by @LeiWang1999 in #413
- [Enhancement] Add get_nvcc_compiler function to retrieve nvcc path by @LeiWang1999 in #414
- Update lower.py to set default value for params by @Alex4210987 in #416
- [Enhancement] Support Auto Layout Inference and Parallelism with variable constraint by @LeiWang1999 in #417
- [Enhancement] Support to find Cython path more automatically by @FrozenGene in #418
- [Refactor] Enhance layout inference logic in ParallelOp by @chengyupku in #420
- [BugFix] Fix tvm simplify pass by @smallscientist1 in #421
- [Enhancement] Add TMA+WS support in pipeline planning logic by @chengyupku in #422
- [Language] Support tile operator `T.cumsum` by @LeiWang1999 in #423
- Delete testing/python/language/test_tilelang_language_reduce_sum.py by @LeiWang1999 in #424
- [Bugfix] Fix a bug for simplifier by @LeiWang1999 in #425
- [Layout] Enhance layout inference pass by @LeiWang1999 in #427
- [Enhancement] Remove DeReplicate during parallel loop layout inference by @LeiWang1999 in #430
- [Bugfix] Fix the test data distribution of cumsum by @LeiWang1999 in #432
- [Enhancement] Support cute mma tile mxn8ky by @LeiWang1999 in #434
- [Bugfix] Removed the behavior that treated global -> local as a copy operation. by @LeiWang1999 in #435
- [Language] Support accumulative `T.reduce_sum` by @LeiWang1999 in #436
- [Bugfix] Fix the unexpected keyword error of autotune by @yyttt6 in #438
- [Testing] Add atomic add test by @LeiWang1999 in #439
- [Typo] Rename warp_source to wrap_source by @lucifer1004 in #440
- [Refactor] Update KernelLaunch to clarify block name by @LeiWang1999 in #441
- [Enhancement] Reduce CPU overhead during kernel execution by @Cunxiao2002 in #437
- [Enhancement] Improve layout inference accuracy in ParallelOp by @LeiWang1999 in #442
- [Bugfix] Fix layout inference for free fragment buffer by @LeiWang1999 in #443
- Bump transformers from 4.48.0 to 4.50.0 in /examples/bitnet-1.58b by @dependabot in #444
- [Language] Support explicit programming for identified warp groups by @LeiWang1999 in #445
- [Bugfix] Fix safe memory legalization for fragment store by @LeiWang1999 in #446
- [Refactor] Separate warp specialize rewriter and tma barrier injector pass by @LeiWang1999 in #447
- [Enhancement] Add new examples for warp specialization and TMA integration by @LeiWang1999 in #448
- [Refactor] Phaseout torch>=2.2.0 dependency by @LeiWang1999 in #451
- [Feature] Add TILELANG_CHECK_LAST_ERROR macro for improved error handling in CUDA and HIP by @LeiWang1999 in #450
- [Enhancement] Introduce pass_configs parameter for kernel Caching by @LeiWang1999 in #452
- [Feature] Add cache directory management functions in tilelang.cache by @LeiWang1999 in #453
- [Bugfix] Fix get_swizzle_layout implementation. by @cherichy in #455
- [Refactor] Update barrier functions and add new example for GEMM with warp specialization by @LeiWang1999 in #456
- [Refactor] Include examples in CI by @LeiWang1999 in #457
- docs: add llvm version info to installation.md. by @AsakusaRinne in #459
- [CI] Add elementwise and gemv examples to CI. by @Cunxiao2002 in #458
- [Bugfix] Fix for T.copy with dynamic range by @LeiWang1999 in #462
- [Bugfix] Fix copy region automation for dynamic extent by @LeiWang1999 in #465
- [Feature] Implement fast integer power operation and related API by @LeiWang1999 in #466
- [Typo] Rename `power_of_int` to `pow_of_int` for consistency by @LeiWang1999 in #468
- [CI] Add BlocksparseGemm, Dynamic, and Cast examples to CI by @tzj-fxz in #467
- [Refactor] Update set_compile_args to allow None for out_idx parameter by @LeiWang1999 in #469
- [Refactor] Simplify buffer_region_to_tile_region function in copy.py by @LeiWang1999 in #470
- [CI] Add Convolution example to CI by @xwhzz in #473
- [BugFix] Correct argparse for example_convolution test by @xwhzz in #474
- [Refactor] set USE_LLVM to optional. by @hyx1999 in #476
- [CI] Add Analyzer and blocksparse_attention examples to CI by @yyttt6 in #472
- [Refactor] Skip patchelf if not installed by @LeiWang1999 in #477
- [Refactor] Improve layout equality checks and error messaging by @LeiWang1999 in #471
- [Doc] Update version retrieval in conf.py to read from VERSION file by @xwhzz in #478
- Fix Device Consistency in Autotuner Threads and Add Manual Profiler Check by @yuanjypku in #481
- [Bugfix] Check CUDA target before checking for TMA by @gau-nernst in #482
- [Bugfix] Use AutoTune cache_input_tensors properly by @yyttt6 in #483
- Revert "[Bugfix] Use AutoTune cache_input_tensors properly" by @LeiWang1999 in #488
- [Enhancement] Support register input for gemm when trans_a or trans_b is true by @LeiWang1999 in #490
- [CI] Add flash_decoding example to CI by @xuchangtolearn in #487
- [CI] Add Reminder Bot for pull request contributions by @xwhzz in #491
- [Refactor] Introduce quantize components of TileLang and add testing for dequant gemm example by @LeiWang1999 in #494
- [Enhancement] Introduce flag to visualize shared memory merge plan by @LeiWang1999 in #496
- [Refactor] Update main function structure in example scripts and add tests by @chengyupku in #475
- [Bugfix] Fix Hopper GEMM layout for small tile size by @LeiWang1999 in #497
- [Enhancement] Fallback transposed_ldmatrix into `SM75_U16x4_LDSM_N` when warp_n is 8 by @LeiWang1999 in #498
- [Bugfix] Rename SM75_U16x8_LDSM_N to SM75_U16x8_LDSM_T to reflect correct matrix type by @LeiWang1999 in #499
- [Refactor] Update GEMM layout and operand traits for improved CUDA compatibility by @LeiWang1999 in #500
- [Refactor] Update JIT kernel functions and streamline GEMM tests by @LeiWang1999 in #501
- Fix AMD Docker issues related to conda environment setup by @Hamerlate in #503
- [Refactor] Refactor `jit` to `_JitImplementation` to support `@tilelang.jit` by @LeiWang1999 in #502
- [Refactor] Adjust in fragment GEMM layout by @LeiWang1999 in #504
- [Refactor] Update GlobalMemChecker to Detect Lower Bound illegal memory access automatically by @LeiWang1999 in #505
- [Enhancement] Enhance ReduceOp and JITKernel for improved dimension handling and initialization by @LeiWang1999 in #507
- [Refactor] Update buffer handling in layout transformation to support layout on `T.view` by @LeiWang1999 in #509
- [Bugfix] Enhance smem copy selector for uncommon shape by @LeiWang1999 in https://github.com/tile-ai/tilelang...
v0.1.4
What's Changed
- [Bugfix] Support `T.clear` for let binding by @LeiWang1999 in #268
- [Bugfix] Add TMA and Producer Buffer Analysis in Warp Specialized Rewriter by @chengyupku in #269
- [Refactor] Improve flash attention example and layout comparison logic by @LeiWang1999 in #270
- [Bugfix]Add CUDA availability check in CtypesKernelAdapter by @XueSongTap in #267
- [CI] Add gemm performance test by @xwhzz in #274
- [Language] Introduce `T.ptr` and `T.Tensor` by @LeiWang1999 in #276 (see the sketch after this list)
- [Refactor] Enhance Autotune by @yyttt6 in #266
- [Refactor] Update cache key generation in KernelCache by @LeiWang1999 in #283
- [Docs][Tutorial] Add tutorial for auto-tuning by @yyttt6 in #285
- [Refactor] Deprecated `T.Buffer` as arguments and rename related calls into `T.Tensor` by @LeiWang1999 in #281
- [Doc] Update README.md to correct documentation link for TileLang debug tools by @chengyupku in #286
- [Feature] Introduce NoSetMaxNReg for warp specialization by @chengyupku in #289
- [Language] Proxy tvm ir to make linter happy by @LeiWang1999 in #287
- [Bugfix] Enable bfloat16 atomic operations only for CUDA architectures greater than 7.5 by @LeiWang1999 in #291
- [Doc] Update Python API docs generation by @xwhzz in #278
- [Doc] Remove citation page by @LeiWang1999 in #292
- [Dev] Correcting cxx compiler by @penguin-wwy in #294
- [doc/example] add gemv doc and examples by @botbw in #293
- [Feature] Implement ParallelLoopTransformer for enhanced loop analysis by @LeiWang1999 in #295
- [Enhancement] Update AtomicAdd functions for BFLOAT16 in common.h by @LeiWang1999 in #297
- [Refactor] Improve documentation and add detailed docstrings across multiple modules by @LeiWang1999 in #298
- [Bugfix] Correct method call for block reduction check when analyzing memory footprint by @NaOHCC in #299
- [Dynamic Symbolic] Refactor passes with dynamic symbolic and check shape bound precisely by @tzj-fxz in #302
- Add autotune to conv example by @yyttt6 in #301
- [Bugfix] Resolve autotuner bugs for blocksparse GEMM example by @tth37 in #300
- [Bugfix] Replace profiler.mod with profiler.adapter to fix AttributeError by @LeslinD in #305
- [Enhancement] Add support for CUDA architecture 8.9 in GEMM template by @LeiWang1999 in #304
- [BugFix] Fix unintended Git config overrides in CI runners by @xwhzz in #306
- [Cache] Implement in-memory cache by @LeiWang1999 in #308
- [Bugfix] Updated autotune usage in the examples to align with the latest changes by @LeiWang1999 in #309
- [Bugfix] Fix dynamic axis with variable extent by @LeiWang1999 in #311
- [Bugfix] Fix layout conflict issue for gqa decoding examples by @LeiWang1999 in #314
- [Bugfix] Fixed the handling logic of IfThenElseNode in if_stmt_binding by @chengyupku in #315
- [Bugfix] Fix logic error in ReduceOp when handling CUDA architecture by @chengyupku in #316
- [CostModel] Introduce cuda driver api to get precise shared memory capacity by @LeiWang1999 in #317
- [Dev] Add FP8 Quantization Examples and Absolute Maximum Reduction Operation Support by @chengyupku in #320
- [Tools] Summarize TFLOPS Information from a tilelang program by @yyttt6 in #321
- Support block_N sizes that are 2^n in deepgemm example by @zcnrex in #319
- [Feat] Enhance CUDA Property Handling by @LeiWang1999 in #322
- [Bugfix] add a patch to fix T.abs on float16 by @botbw in #325
- [AMD] Adapt rocm and support `T.gemm` with transpose_b=False for amd backend by @LeiWang1999 in #327
- [Dynamic Symbolic] Adaptively vectorize with different condition expressions by @tzj-fxz in #326
- [Bugfix] Fix fragment layout annotation in example gqa decode by @LeiWang1999 in #329
- [AMD] Support `Transpose_A=True` and GEMM_RS for hip backend by @LeiWang1999 in #331
- [Refactor] Optimize RMS normalization kernel in rms_norm.py by @chengyupku in #333
- [AMD] Fix for missing composable kernel include path when compile kernels on amd gpus by @LeiWang1999 in #334
- [Example] Add sparse gqa decode example by @xiayuqing0622 in #332
- [Enhancement] Enhance FP8/FP4 type handling in CUDA codegen by @LeiWang1999 in #323
- [Doc] Fix typo and heading level in GEMV tutorial by @yeh-sudo in #337
- [Dev] Add Group Cast FP8 Example by @chengyupku in #338
- [Enhancement] Support region padding when convert buffer load to buffer region by @LeiWang1999 in #342
- [Example] Add triton block sparse gqa decode by @YizhaoGao in #341
- [Enhancement] Support index bit width configuration by @LeiWang1999 in #343
- [Bugfix] Fix X_amax Correctness Issue in Group Cast FP8 by @chengyupku in #345
- [Bugfix] Fix Transposed Fragment Layout for amd GEMM_RS matrix core by @LeiWang1999 in #346
- [AutoTune] Refactor AutoTuneArtifact to utilize kernel as context instead of profiler by @LeiWang1999 in #344
- [Bugfix] Compile/"cached" still not loading cached kernel for example in example_mha_bwd by @Alex4210987 in #339
- [Refactor] Implement thread-local storage for FrameStack in frame.py and kernel.py by @LeiWang1999 in #352
- [Typo] Replace `kernel.func` with `kernel` in mla benchmark scripts by @LeiWang1999 in #354
- [AMD][Docker] Create Dockerfile for ROCm environment setup by @LeiWang1999 in #355
- [Enhancement] Update group_per_split_token_cast_to_fp8 to support multiple data types by @chengyupku in #356
- [Enhancement] Support pass config `disable_warp_specialize` to disable auto specialization on hopper by @LeiWang1999 in #357
- [Example] Introduce autotuning example for GEMM with enhanced configuration options by @chengyupku in #360
- [Example] Handle Scenarios in Which a Threadblock is Assigned Only Invalid Block Indices for Sparse Attention by @xiayuqing0622 in #361
- [Bugfix] Correct dynamic shared memory size error handling in HIP by @LeiWang1999 in #362
- [AMD] Implement Deepseek MLA for AMD by @LeiWang1999 in #363
- [Bugfix] Fix compilation issues for amd cdna element size check by @LeiWang1999 in #364
- [AMD] Support FlashMLA with num split template for AMD gpus by @LeiWang1999 in #366
- [MLA][AMD] Add amd mla benchmarking by @LeiWang1999 in #367
- [Bugfix] Adjust Autotuner threadpool `max_workers` limit to available CPUs by @tth37 in #368
- [Language] Introduce `T.any_of` and `T.all_of` to reduce a bool array by @LeiWang1999 in #371
- [AMD][Setup] Support HIP in setup.py by @zhhangBian in #369
- [Typo] Remove debug print by @LeiWang1999 in #373
- [Docs] Add AMD Flash MLA Documentation to Tutorials Section by @LeiWang1999 in #376
- [Bugfix] Add filelock for cython build by @LeiWang1999 in #377
- [Typo] Remove unused comments generated by copilot by @LeiWang1999 in #379
- [Doc] Add deepseek_mla to documentation index by @LeiWang1999 in #380
- [Refactor] Remove debug message in pass legalize_safe_memory_access by @LeiWang1999 in #381
- [Enhancement][Pipeline] More precise copy code block detection in pipeline by ...
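As background for the `T.Tensor` argument annotation introduced in #276 and adopted in place of `T.Buffer` in #281 (both listed above), here is a minimal, hedged sketch in the style of the project's elementwise examples. The helper name `add_one` and the tile sizes are illustrative, not from the release notes, and option names may differ between versions.

```python
import tilelang
import tilelang.language as T


def add_one(M, N, block_M=128, block_N=128, dtype="float32"):
    # Illustrative elementwise kernel: arguments are declared with T.Tensor, not T.Buffer.
    @T.prim_func
    def main(
        A: T.Tensor((M, N), dtype),
        B: T.Tensor((M, N), dtype),
    ):
        with T.Kernel(T.ceildiv(N, block_N), T.ceildiv(M, block_M), threads=128) as (bx, by):
            for i, j in T.Parallel(block_M, block_N):
                # each block handles a block_M x block_N tile
                B[by * block_M + i, bx * block_N + j] = A[by * block_M + i, bx * block_N + j] + 1.0
    return main
```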
v0.1.3
What's Changed
- [Docker] Add libstdcxx-ng-12 to Dockerfiles for CUDA versions by @LeiWang1999 in #160
- Add cpu jit with backend ctypes by @xs-keju in #154
- [Carver] Multi-Threads Compilation for Fast Auto Tuning by @SiriusNEO in #156
- [Refactor] Replace T.If with native Python if statement for mla paged kernel by @LeiWang1999 in #162
- [Enhancement] Improve CUDA path detection by @xwhzz in #157
- [Refactor] Replace `T.thread_binding` with `T.get_thread_binding` in examples and test cases by @LeiWang1999 in #163
- [Bugfix] Cast bool dtype into int8 in blocksparse examples by @LeiWang1999 in #167
- [Example] Implement NSA Decode tilelang examples by @LeiWang1999 in #168
- [Release] Bump version to v0.1.2.post1 by @LeiWang1999 in #166
- Use SS-GEMM for PV in mla by @YouJiacheng in #165
- [Example] Implement tilelang native sparse attention varlen example by @LeiWang1999 in #170
- [Bugfix] Implement boundary check for the buffer shape with dynamic symbolic by @LeiWang1999 in #173
- [AutoTune] Enable config-performance trace by @LeiWang1999 in #174
- [Feat] Append Pass Context and TMA lowering configuration option by @LeiWang1999 in #175
- [Feat] Introduce new caching mechanism for compiled kernels by @LeiWang1999 in #176
- [Refactor] Enhance GPU Kernel Launch with Environment Thread Creation by @LeiWang1999 in #178
- [Bugfix] Improve Thread Variable Handling in Layout Inference by @LeiWang1999 in #179
- [Examples] Implement NSA Backward kernels by @LeiWang1999 in #180
- [Enhancement] Optimize CMake build process with dynamic job count calculation by @LeiWang1999 in #183
- [Bugfix] Add dynamic shape support with out_idx in Cython JIT kernel compilation by @LeiWang1999 in #185
- [Dev][Bugfix] Add RMS Normalization Kernels and Fix Reduce Bug by @chengyupku in #188
- [Dev] Add the failed nvcc command to the exception message by @penguin-wwy in #189
- [Bugfix] Fix `T.copy` for scalar datatypes by @LeiWang1999 in #190
- [Enhancement] Simplify GEMM example with direct kernel compilation by @LeiWang1999 in #191
- [Bugfix] Make quickstart work properly on cu118 by @penguin-wwy in #193
- [Language] Support clamp in language by @hyx1999 in #192
- [Refactor] Add SetMaxNRegCollector to Improve Register Hint Handling in Warp Specialized Rewriter by @chengyupku in #194
- [Feature] Add TMA Store Synchronization Support by @chengyupku in #195
- Update expired example code. by @66RING in #196
- [CMake] Add CUDA Major Version Detection for Conditional Compilation by @chengyupku in #197
- [Feature] Support Async Pipeline inference within if scope by @LeiWang1999 in #198
- [Dev] Add new example for FlashAttention with pipelined execution by @chengyupku in #200
- [Enhancement] Enhancing the handling of conditional statements in the pipeline by @LeiWang1999 in #201
- [Feature] Upgrade cutlass version and support fp8 T.gemm by @zqh-wz in #202
- [Docker] Update Dockerfiles to specify exact version of libstdcxx-ng by @LeiWang1999 in #203
- [Dev] Add GQA backward example by @chengyupku in #205
- [LICENSE] Typo fix in LICENSE by @LeiWang1999 in #208
- [Enhancement] Allow mma fallback when wgmma is not supported by @LeiWang1999 in #206
- [Examples] Expand tuning configurations for FlashAttention example by @chenghuaWang in #204
- [Enhancement] Avoid tvm ffi handling when out_idx is specified by @LeiWang1999 in #209
- [Fix] Fix K // block_K to T.ceildiv(K,block_K) and add tests by @hyx1999 in #210
- [Dev] Implement IfStmtBinding and MergeIfStmt transformations by @chengyupku in #211
- [Language] Introduce `T.reshape` and `T.view` by @LeiWang1999 in #212
- [Enhancement] Improve device handling in Cython kernel adapter by @LeiWang1999 in #220
- [Enhancement] Update format script to support force compare with upstream by @LeiWang1999 in #221
- [Refactor] Introduce KernelParam integration across modules by @LeiWang1999 in #223
- [Bugfix] Fix mismatch of shared memory layout and mma atom on Hopper by @zqh-wz in #224
- [Refactor] Update kernel compilation and profiling in examples by @chengyupku in #225
- [Examples] Add fp8 gemm 2xAcc and deepgemm example by @cherichy in #217
- [Doc] Add instructions for installing nightly version by @xwhzz in #226
- [Bugfix] Disable force inline for ldmatrix by @LeiWang1999 in #227
- [Bugfix] Support duplicate tma desc declaration by @LeiWang1999 in #228
- [Refactor] Rename clamp functions and enhance dtype handling in tests by @LeiWang1999 in #232
- [Enhancement] Simplify kernel source extraction in JIT adapters by @LeiWang1999 in #230
- [Feature] Add reduce_max corresponding tests by @LeiWang1999 in #236
- [BugFix] Fix bug of missing MBarrierExpectTX by @chengyupku in #241
- [Refactor] Refactor for Better Layout Conflict Handling by @LeiWang1999 in #240
- [Refactor] Align torch_assert_close tensor comparison with torch.testing.assert_close by @xwhzz in #239
- [Dev] Implement FlashAttention3 Backward by @chengyupku in #244
- [BugFix] Fix bug of mismatching dtype in testing by @xwhzz in #245
- [Enhancement] Add zero initialization option to GEMM operations by @chengyupku in #246
- [Enhancement][CUDA] Avoid C7508 for CUDA backend via assigning default value to `minBlocksPerMultiprocesor` by @cherichy in #248
- [Feature] Add database storage for JITKernel cache with Cython and Ctypes adapters by @Alex4210987 in #213
- [Examples] Implement elementwise add kernel by @chenghuaWang in #219
- [Refactor] Phaseout LLVM Dependency by Making it Optional by @LeiWang1999 in #247
- [Readme] Update Bib Citation Section by @LeiWang1999 in #249
- [Enhancement] Support float variable as arguments by @LeiWang1999 in #250
- add autotune to example_gemm.py by @yyttt6 in #252
- [Language] Introduce `T.alloc_var` to define a variable like `int var;` by @LeiWang1999 in #255
- [Example] Implement Kernel Example cumsum by @LeiWang1999 in #258
- [Refactor] Refactor CUDA post-processing callback registration in TileLang by @LeiWang1999 in #259
- [Refactor] Move compilation outside critical section by @YouJiacheng in #260
- [CI] Use auditwheel to generate manylinux wheels by @oraluben in #251
- [Bugfix] Fix Benchmark/Example Code for Autotuning by @SiriusNEO in #254
- [Language] Enhance alias to support blockwise memory load by @LeiWang1999 in #261
- [Bugfix] Fix auto tuning tma handling by @LeiWang1999 in #263
- [Release] Bump version to 0.1.3 by @LeiWang1999 in #264
New Contributors
- @xs-keju made their first contribution in #154
- @YouJiacheng made their first contribution in #165
- @penguin-wwy made their first contribution in #189
- @hyx1999 made their first contribution in #192
- @66RING made their first contribution in https://github.com/tile-ai/tilelang/pull/...
v0.1.2.post1
Why do we need this post release?
The v0.1.2 prebuilt package used a legacy Cython file, which could lead to some bugs.
What's Changed
- [Docker] Add libstdcxx-ng-12 to Dockerfiles for CUDA versions by @LeiWang1999 in #160
- Add cpu jit with backend ctypes by @xs-keju in #154
- [Carver] Multi-Threads Compilation for Fast Auto Tuning by @SiriusNEO in #156
- [Refactor] Replace T.If with native Python if statement for mla paged kernel by @LeiWang1999 in #162
- [Enhancement] Improve CUDA path detection by @xwhzz in #157
- [Refactor] Replace `T.thread_binding` with `T.get_thread_binding` in examples and test cases by @LeiWang1999 in #163
- [Bugfix] Cast bool dtype into int8 in blocksparse examples by @LeiWang1999 in #167
- [Example] Implement NSA Decode tilelang examples by @LeiWang1999 in #168
New Contributors
- @xs-keju made their first contribution in #154
Full Changelog: v0.1.2...v0.1.2.post1
v0.1.2
What's Changed
- [Dev] Add MLA and GQA decode examples by @chengyupku in #109
- [Example] Add Split-K and Stream-K Examples and move MLA from fld to mla by @LeiWang1999 in #110
- [Typo] Fix a typo in gemm splitk examples by @LeiWang1999 in #111
- [Typo] Fix links in installation instructions in README.md by @xwhzz in #112
- [Typo] Fix formatting in installation instructions in README.md by @xwhzz in #113
- [Benchmark] Add benchmark scripts for block sparse attention by @LeiWang1999 in #114
- [Dev] Support vectorized value pack and atomicAdd for BFloat16 DType by @LeiWang1999 in #116
- [Bugfix] Bugfix of pass order for hopper by @chengyupku in #117
- [Dev] Update MLA decode kernel by @chengyupku in #120
- [Example] Add GQA Example by @LeiWang1999 in #118
- [Example] Implement TileLang Native Sparse Attention Kernel by @LeiWang1999 in #121
- [Doc] Update README.md with new example links for Flash MLA Decoding and Native Sparse Attention by @chengyupku in #122
- [Example] Update GEMM FP8 Example by @LeiWang1999 in #123
- [Dev] Add RetNet Linear Attention example by @chengyupku in #124
- [JIT] Enhance cython/ctypes wrapper for tma descriptor by @LeiWang1999 in #126
- [Dev][Bugfix] Fix bug in ThreadTagChecker; Add WgmmaSync rewriter and add MHA WGMMA pipelined example by @chengyupku in #128
- [Dev] Remove buffer flatten when debug print a shared buffer by @LeiWang1999 in #129
- [Debug] Support `T.print` for `fragment` scope by @LeiWang1999 in #130
- [Example] Implement FMHA Varlen Example by @LeiWang1999 in #131
- [Refactor] Set default log level from warning to info by @LeiWang1999 in #132
- [Kernel] Implement different SEQ Q/KV examples with block sparse by @LeiWang1999 in #133
- [Dev][Doc] Add DeepSeek MLA Decode Example with Documentation and Performance Benchmarks by @chengyupku in #134
- [Doc] Update MLA Documentation by @chengyupku in #135
- [Debug] Improve Memory Layout Plot by @LeiWang1999 in #136
- [Doc] Add MLA Decoding Performance Benchmarks and Documentation by @chengyupku in #137
- [Bugfix] Add missing definition for AtomicAdd by @LeiWang1999 in #138
- [Dev][Doc] Enhance Flash Attention Implementation in GQA Decoding Example and Fix Typo by @chengyupku in #139
- [Dev] Adjust computation logic to avoid precision loss when casting acc_s from float to float16 by @chengyupku in #141
- [Refactor] Rename gemm fp8 example as we currently lack `T.gemm` support for fp8 by @LeiWang1999 in #144
- [Enhancement] Support debug print for unsigned char datatype by @LeiWang1999 in #145
- [Enhancement] Enable runtime tensor data type validation by @LeiWang1999 in #146
- [Refactor] Adapt Caver to benchmark by @LeiWang1999 in #148
- [Refactor] Remove BitBLAS Import Check in Benchmark by @SiriusNEO in #150
- [Enhancement] Optimize TileLang install scripts with Dynamic CPU Cores by @LeiWang1999 in #152
- [Carver] Enhance Carver Adaptation for MatMul Benchmarking by @LeiWang1999 in #153
- [Dev][Benchmark] Add MLA paged decoding example and benchmark script by @chengyupku in #158
- [Release] Bump Version to v0.1.2 by @LeiWang1999 in #155
New Contributors
- @SiriusNEO made their first contribution in #150
Full Changelog: v0.1.1...v0.1.2
v0.1.1
What's Changed
- [Doc] Update release news by @LeiWang1999 in #80
- [Doc] Convert docs from rst format to Markdown format. by @xwhzz in #82
- [Bugfix] Bugfix of installing with develop mode by @LeiWang1999 in #81
- [WHL] Support whl building for different python versions via tox by @LeiWang1999 in #83
- [Refactor] Separate tilelang Pass Thread Sync (with Hopper support) from tvm by @LeiWang1999 in #85
- [Backend][WebGPU] Support WebGPU WGSL code generation by @LeiWang1999 in #86
- [Wheel] Support pypi build scripts for different python via tox by @LeiWang1999 in #93
- [Wrap] Use a ctypes-based kernel wrapper instead of dlpack for runtime efficiency by @LeiWang1999 in #95
- [Bugfix] Update Dockerfile.cu120 by @LeiWang1999 in #98
- [Bugfix] Put `InjectPtxAsyncCopy` Pass behind `ThreadSync` Pass by @LeiWang1999 in #97
- [Feature] Add CTypes JIT kernel support by @LeiWang1999 in #100
- [Docker] Add Dockerfiles for multiple CUDA versions by @LeiWang1999 in #103
- [JIT] Support Cython jit and make cython a default execution backend by @LeiWang1999 in #102
- [Refactor] Phase out torch cpp extension backend by @LeiWang1999 in #104
- [Wheel] Provide a bare docker scripts to help build wheels for manylinux by @LeiWang1999 in #105
- [Example] Implement simple block sparse kernel by @LeiWang1999 in #106
- [Release] Bump version to v0.1.1 by @LeiWang1999 in #107
Full Changelog: v0.1.0...v0.1.1
v0.1.0
What's Changed
- [LICENSE] Add LICENSE for flashinfer by @LeiWang1999 in #19
- [Doc] Fix installation scripts and docs for dequantize gemm by @LeiWang1999 in #20
- [Doc] Use sphinx to generate docs. by @xwhzz in #21
- [Doc] update installation.md and readme by @Cunxiao2002 in #22
- [Doc] fix a typo in installation.rst by @Cunxiao2002 in #24
- [Doc] Remove legacy files and update reference by @LeiWang1999 in #25
- [CI][Test] Add test cases for tilelang transforms `AnnotateDeviceRegions` and `MakePackedAPI` by @LeiWang1999 in #26
- [Doc] Create a workflow to host docs using GitHub Pages. by @xwhzz in #28
- [CI][Test] Add test cases for tilelang transform InjectSoftwarePipeline and FrontendLegalize by @Cunxiao2002 in #30
- [Bugfix] Replace thread binding detector in LayoutInference Pass by @LeiWang1999 in #31
- [CI] Comprehensive Test cases Implementation of Matmul Dequantize by @LeiWang1999 in #32
- [Doc] Update GitHub Actions workflow for documentation deployment and add CNAME file. by @xwhzz in #33
- [Refactor] Simplify interface via replacing argument thread binding of intrinsics with `KernelFrame.Current` by @LeiWang1999 in #34
- [Bugfix] Reorder Passes: Place Vectorize Loop Before StorageFlatten and FlattenBuffer to Prevent Redundant Allocations by @LeiWang1999 in #37
- [Doc] Update documentation structure and content by @LeiWang1999 in #39
- [Doc][CI] Update GitHub Actions workflow for documentation build and deployment. by @xwhzz in #42
- [CI] Allow manual triggering of documentation workflow in addition to… by @xwhzz in #43
- [CI][Test] Add test cases for tilelang transform PipelinePlanning by @Cunxiao2002 in #44
- [CI][Test] Add test cases for tilelang transforms `LayoutInference` and `LowerTileOp` on loop tail split functionality by @tzj-fxz in #29
- [Debug] Introduce `T.print` for buffer and variables logging on frontend by @LeiWang1999 in #45 (see the sketch after this list)
- [CI] Change pull request trigger to `pull_request_target` for documen… by @xwhzz in #48
- [Dev] Add FlashDecoding example by @chengyupku in #46
- [Doc] update README that tilelang has been used in AttentionEngine by @smallscientist1 in #50
- [Doc] Remove unnecessary layout annotation by @LeiWang1999 in #49
- [CI][Test] Add test cases for tilelang kernel convolution by @chengyupku in #51
- [Dev] Implement test case for tilelang transformations by @LeiWang1999 in #53
- [CI][Test] Add test cases for tilelang kernel FlashAttention by @chengyupku in #54
- [CI][Test] Add test cases for element_add by @Cunxiao2002 in #47
- [CI] Clean up target repository before publishing documentation. by @xwhzz in #55
- [CI][Test] Add test cases for tilelang transform ClusterPlanning by @chengyupku in #57
- [Doc] Append debug relevant testing and documentations by @LeiWang1999 in #58
- [CI][Test] Add test cases for tilelang transform LowerHopperIntrin by @chengyupku in #59
- [Doc] Add matmul kernel tutorial with tile library by @LeiWang1999 in #60
- [Dev] Separate `LoopVectorize` Pass from upstream tvm by @LeiWang1999 in #62
- [Dev] Support FP8 Codegen for cuda backend by @LeiWang1999 in #64
- [Dev] Add test case for bfloat16 and int4 gemm with mma by @LeiWang1999 in #65
- [CI][Test] Add test cases for tilelang transform InjectFenceProxy by @chengyupku in #66
- [Tools] Introduce `plot_layout` to visualize the fragment layout by @LeiWang1999 in #68
- [Dev] Remove unnecessary python dependencies by @LeiWang1999 in #69
- [Carver] Introduce a tile-structure based cost model for auto tuning by @LeiWang1999 in #70
- [Bugfix] bug fix for bitblas dependency by @LeiWang1999 in #71
- [CI][Test] Add test cases for tilelang transform MultiVersionBuffer and WarpSpecialized by @chengyupku in #72
- [CostModel][Carver] Support Hint Recommend for Shared memory Kernel Fusion by @LeiWang1999 in #73
- [Carver] Remove legacy todo items in carver's readme by @LeiWang1999 in #74
- [Dev] Add mha backward example by @chengyupku in #77
- [Release] Bump version into v0.1.0 by @LeiWang1999 in #76
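To give a flavor of the frontend debug printing introduced in #45 above, here is a small, illustrative sketch. It is written against the current `T.Tensor`-style frontend rather than the v0.1.0-era API, the helper name `debug_copy` is made up, and the optional `msg` label on `T.print` is an assumption to be checked against the debug-tools documentation.

```python
import tilelang
import tilelang.language as T


def debug_copy(M=16, N=16, dtype="float16"):
    # Hypothetical kernel that copies A into B and prints intermediate values.
    @T.prim_func
    def main(A: T.Tensor((M, N), dtype), B: T.Tensor((M, N), dtype)):
        with T.Kernel(1, threads=128) as bx:
            A_shared = T.alloc_shared((M, N), dtype)
            T.copy(A, A_shared)
            T.print(A_shared)  # dump a shared buffer from device code
            T.print(bx, msg="block index")  # `msg=` label is assumed, not confirmed here
            T.copy(A_shared, B)
    return main
```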
New Contributors
- @xwhzz made their first contribution in #21
- @Cunxiao2002 made their first contribution in #22
- @tzj-fxz made their first contribution in #29
- @chengyupku made their first contribution in #46
- @smallscientist1 made their first contribution in #50
Full Changelog: v0.0.1...v0.1.0