v0.1.6
What's Changed
- [Bugfix] Add missing thread offsets and other information to the reduce op by @LeiWang1999 in #646
- [Bugfix] Adjust role assignment in warp specialization based on read access by @chengyupku in #647
- Fix JIT kernels to use the specified target by @meinie0826 in #648
- [Bugfix] Remove small array reuse condition in shared memory allocation merging by @LeiWang1999 in #654
- [Enhancement] Add role assignment for AllocateNode in warp specialization by @chengyupku in #657
- [Bugfix][CI] Bug fixing and migrate CI from ada to hopper by @xwhzz in #652
- [CI] Enable cache for virtual env and parallelize pytest via xdist by @LeiWang1999 in #660
- [Cache] Support shared cache directories for multiple processes by @LeiWang1999 in #649
- [Enhancement] Add a `compile_flags` parameter to JIT kernel and adapter classes for improved compilation control by @xwhzz in #656 (see the first sketch after this list)
- Add support for ROCm arch detection by @zhangnju in #661
- [BugFix] Do not modify strict layouts at the common or relaxed level of layout inference; add more conditions to layout checking by @tzj-fxz in #653
- [Bugfix][Docs] Update documentation build process and configurations for autoapi support by @xwhzz in #663
- [Enhancement] Improve buffer conflict detection in thread storage synchronization by @LeiWang1999 in #658
- [Bugfix] Take buffer data type into account in the provably-disjoint index analysis by @LeiWang1999 in #664
- [Bugfix] Remove redundant T.fill to fix precision issue by @xuchangtolearn in #667
- [Enhancement] Refactor buffer index handling for improved precision a… by @Alex4210987 in #671
- Reverts #671 by @LeiWang1999 in #672
- [Bugfix] Pass the correct nvcc to CMake by @chenyang78 in #670
- [CI] Improve format check output and automate commit of changes by @xwhzz in #669
- [Bugfix][CI] Use valid runner labels in workflow by @xwhzz in #674
- [Enhancement] Pass verbose to LibraryGenerator by @chenyang78 in #673
- [Enhancement] Enhance lint error messaging in CI by @xwhzz in #675
- Refactor to support upstream tvm by @Hzfengsy in #595
- Do not check for short variables by @oraluben in #676
- [Refactor] Phase out version with commit id in editable mode by @LeiWang1999 in #677
- [CI] Update CI workflow to use Python 3.12 by @LeiWang1999 in #679
- [Enhancement] Output cache-file-related messages with verbose=True by @chenyang78 in #683
- [Enhancement] Enhance warp specialization logic by @chengyupku in #680
- Add a Flash Attention example for the AMD MI300 series by @Alex4210987 in #682
- [Enhancement] Refactored buffer detection logic in warp_specialized_rewriter.cc by @chengyupku in #685
- [Fix] Fix some issues with JIT decorators in the examples by @Cunxiao2002 in #681
- [Enhancement] Add the `--ptxas-options=--register-usage-level=10` option by @LeiWang1999 in #684
- [Feature] Add auto-vectorization for atomic add by @yyttt6 in #686
- [Refactor] Rebase pipeline injector from upstream tvm by @LeiWang1999 in #687
- [Refactor] Introduce GemmInst for different targets handling by @LeiWang1999 in #688
- [Enhancement] Optimize BF16 casting performance by @xwhzz in #689
- [Smem Reuse] Optimize memory alignment of identical buffers by @LeiWang1999 in #693
- [Version] Keep the local commit id, as it helps with debugging by @LeiWang1999 in #697
- [Example] Optimize warp specialize flashmla example by @LeiWang1999 in #698
- Bump transformers from 4.52.1 to 4.53.0 in /examples/bitnet-1.58b by @dependabot[bot] in #700
- Gated Delta Net (GDN) kernel implementation in TileLang by @tzj-fxz in #695
- Trivial update to calculate target arch by @oraluben in #702
- [CI] Remove Flash Attention dependency by @LeiWang1999 in #705
- [Layout] Introduce a new layout inference mechanism by @LeiWang1999 in #699
- [Pipeline] Optimize the inject-software-pipeline and pipeline-planning passes by @LeiWang1999 in #706
- Low-bit kernel fixes and implementation by @tzj-fxz in #704
- [Feat] Support gemm with stride by @smallscientist1 in #701
- [Enhancement] Add eviction policy support for TMA operations, enhance CUDA codegen, and introduce new pass config by @xwhzz in #690
- [Enhancement] Enhance the robustness and generality of MLA examples by @Rachmanino in #709
- [Refactor] Make MergeAnnotations accept `Map<Any, Any>` instead of `Map<String, Any>` by @LeiWang1999 in #710
- [Pipeline] Phaseout fragment and double buffer info from pipeline pass by @LeiWang1999 in #711
- [Pipeline] Skip condition expression analysis for global reading by @LeiWang1999 in #713
- [Index] Relocate Int64 Auto Promoter to ConfigBitWidth Pass, removing it from FlattenBuffer by @LeiWang1999 in #714
- [CI] Bind build-test CI to NVIDIA as AMD runners are being introduced by @LeiWang1999 in #718
- fix: NVRTC backend by @lucifer1004 in #717
- [CUDA] Init support for sm_120 by @oraluben in #716
- [Bugfix] Correct git configuration in docs CI by @xwhzz in #720
- [Chore] Fix typos by @lucifer1004 in #719
- [CI][AMD] Add AMD GPU CI and fix some related bugs by @Alex4210987 in #694
- [Carver][Bugfix] Correct score function for warp tile selection in tensorcore policy by @NaOHCC in #724
- [Refactor] Refactor CUDA code generation to simplify eviction policy handling by @LeiWang1999 in #721
- [Language] Introduce `StridedTensor` to support non-contiguous torch inputs by @LeiWang1999 in #722 (usage sketched after this list)
- [Enhancement][Bugfix] Fix bug in warp specialized pass and add gemm_sr fallback support for Hopper by @xwhzz in #712
- 📝 Add docstrings to `fix` by @coderabbitai[bot] in #726
- Fix AMD CI and add examples by @Alex4210987 in #729
- [Feature] Low-bit twiddling dequantization and FP4 GEMM by @tzj-fxz in #725
- 📝 Add docstrings to `mxfp4` by @coderabbitai[bot] in #732
- [Refactor] Refactor env into a more flexible version by @LeiWang1999 in #740
- [Bugfix] Align stride index validation with torch in CythonKernelWrapper by @LeiWang1999 in #743
- [Bugfix] Fix out-of-bounds memory access in atomic add auto-vectorization by @yyttt6 in #742
- 📝 Add docstrings to `main` by @coderabbitai[bot] in #745
- [Refactor] Refactor barrier management by @LeiWang1999 in #744
- [Refactor] Merge bulk copy into copy and improve layout inference for bulk copy by @LeiWang1999 in #746
- [Refactor] Merge ThreadPartialSync and ThreadStorageSync by @LeiWang1999 in #741
- [Enhancement] Optimize loop body handling in IR by @chengyupku in #749
- [MXFP4] Fix bugs and optimize exponential operation by @tzj-fxz in #750
- [Enhancement] Add DispatchInstruction specialization for fp8 types in gemm_sm90.h by @LeiWang1999 in #751
- [Enhancement] Add shape checking for reduce options by @kurisu6912 in #748
- [Bugfix] Add missing FP8 header include by @LeiWang1999 in #752
- [MXFP4] Add bias to MXFP4 GEMM kernel by @tzj-fxz in #753
- [Bugfix][WS] Consider loop min extent when computing phase id by @LeiWang1999 in #754
- [Typo] Remove `disable_cache` in some tests by @LeiWang1999 in #755
- [README] Update GDN README for clarity and add acknowledgements by @chengyupku in #758
- CUTLASS v4.2.0 with CUDA 13 support by @johnnynunez in #760
- [Feature] Add 1D TMA support by @tzj-fxz in #761
- [Example] Add vertical slash sparse attention pattern by @xwhzz in #762
- [Bugfix] Address PassContext contamination from CI and fix incorrect rewrites in warp specialized pass by @xwhzz in #767
- [MXFP4] Add 1D TMA copy for Scale tensor in MXFP4 GEMM by @tzj-fxz in #766
- [CUTLASS] Hotfix for Blackwell by @johnnynunez in #768
- [Refactor] Refactor `Operator` into `TileOperator` with tvm reflection by @LeiWang1999 in #763
- [Reducer] Introduce `alloc_reducer` to separate inter- and intra-warp reduction by @LeiWang1999 in #757
- 📝 Add docstrings to `pytile_0826` by @coderabbitai[bot] in #770
- [Bugfix] Fix a negative optimization in atomic add auto-vectorization by @yyttt6 in #765
- 📝 Add docstrings to `reducer_0825` by @coderabbitai[bot] in #772
- Allow filling global buffers by @kurisu6912 in #774
- [BugFix] Refactor the op check in the LowerTileOp pass to use a member function instead of string matching by @tzj-fxz in #771
- [Enhancement] Add exp fallback for bf16 by @xwhzz in #776
- [Lint] Introduce clang-tidy into format.sh by @LeiWang1999 in #777
- [Cache] Introduce detailed target information for the disk kernel cache by @LeiWang1999 in #780
- [Example] Add an example for the top-k operation by @Cunxiao2002 in #775
- [Math] Dispatch `T.rsqrt(x)` to a CUDA intrinsic instead of `1 / T.sqrt(x)` by @LeiWang1999 in #781 (sketched after this list)
- [CI] Add pytest-durations for test timing by @Cunxiao2002 in #782
- [Refactor] Support python reflection for tile operators by @LeiWang1999 in #783
- Fix AMD TIR and add examples by @Alex4210987 in #784
- [Nvidia][SM121] Add intrin.h include to gemm_mma.h for sm120+ by @HaoKang-Timmy in #785
- [Feat] Add tilelang `T.assume` support and assume injection for buffer shapes by @kurisu6912 in #787
- [Bugfix] Fix incorrect synchronization bug in minference example by @xwhzz in #786
- [AMD] Fix bugs in warp shuffle by @txs19991 in #790
- [AMD] Fix the mfma op interface by @Paran0idy in #791
- [TMA] Automatically lower 1D TMA in appropriate cases by @LeiWang1999 in #788
- [CI] Add pytest timeout to CI by @Cunxiao2002 in #792
- [Enhancement] Resolve reference cycle by @LeiWang1999 in #795
- [Bugfix] Fix index handling to promote 64-bit integers by @LeiWang1999 in #796
- [AMD] Support mfma i32_16x16x32_i8 by @Paran0idy in #800
- [TileOp] Introduce an experimental Python-defined `T.gemm_v2` by @LeiWang1999 in #793
- [Bugfix] Expose the `alloc_reducer` definition to the Python side by @LeiWang1999 in #802
- [Refactor] Use new namespace and enhance dispatch macros for mma by @LeiWang1999 in #801
- [AMD] Support fp8 `T.gemm` by @txs19991 in #804
- [AMD] Support preshuffled-weight mfma by @Paran0idy in #806
- Add pytest-durations to requirements for ROCm by @Alex4210987 in #810
- Add ruff config to check for useless spaces by @oraluben in #807
- [Feature] Add ptx_cp_async_barrier_noinc intrinsic and related functionality by @chengyupku in #809
- [Fix] Fix lower bug when buffer store is not guarded by any tile op by @kurisu6912 in #794
- [Feat] Support gemm_sp for the Ampere arch by @botbw in #691
- [Refactor] Update TVM subproject and refactor BlockNode handling in warp_specialized_rewriter.cc by @chengyupku in #812
- [Refactor] Reopen #794: fix lower bug when buffer store is not guarded by any tile op by @kurisu6912 in #817
- [Refactor] Update TVM subproject and streamline buffer store handling by @chengyupku in #816
- [Example] Add a w4a8 GEMM kernel by @Cunxiao2002 in #815
- [CI] Fix ROCm CI by @Cunxiao2002 in #819
- [Example] Fix unused param in mhs example by @botbw in #821
- [DSL] Support Python ternary if-then-else expressions by @LeiWang1999 in #822 (sketched after this list)
- [Bugfix] Fix failure when the git command is not installed by @LeiWang1999 in #823
- [Bugfix] Skip fp4 dtype binding when using older versions of ml_dtypes by @LeiWang1999 in #824
- [Enhancement] Add a MXFP4 grouped GEMM example for FusedMoE by @Rachmanino in #811
- [CMake] Add support for statically linking the system libc by @LeiWang1999 in #825
- [Refactor] Refactor some build related configurations by @LeiWang1999 in #827
- [CI] Test fix: handle BufferLoad nodes when `T.gemm` input has a stride by @LeiWang1999 in #843
- [Refactor] Turn off `ENABLE_FAST_MATH` by default by @LeiWang1999 in #846
- [AMD] Fix bf16x2 dtype codegen by @Paran0idy in #847
- [Typing] Fallback from Python 3.10+ type syntax for compatibility by @LeiWang1999 in #848
- [TIR] Refactor division simplification in RewriteSimplifier by @LeiWang1999 in #849
- [Py38] Revert typing and parser updates for Python 3.8 compatibility by @LeiWang1999 in #850
- [Bugfix] Disable memory info analysis for `local.var` by @LeiWang1999 in #851
- [Release] Bump version to 0.1.6 by @LeiWang1999 in #818
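
A few hedged usage sketches for items flagged above. First, the `compile_flags` parameter from #656, combined with the `--ptxas-options=--register-usage-level=10` option from #684. The decorator arguments and kernel body below are illustrative assumptions, not the exact API:

```python
import tilelang
import tilelang.language as T

# Hedged sketch: forward extra compiler flags through the JIT path (#656).
# The parameter name follows the PR title; its exact placement may differ.
@tilelang.jit(compile_flags=["--ptxas-options=--register-usage-level=10"])
def copy_kernel(N=1024, block_N=128, dtype="float16"):

    @T.prim_func
    def main(A: T.Tensor((N,), dtype), B: T.Tensor((N,), dtype)):
        with T.Kernel(T.ceildiv(N, block_N), threads=block_N) as bx:
            for i in T.Parallel(block_N):
                B[bx * block_N + i] = A[bx * block_N + i]

    return main
```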
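
Second, non-contiguous inputs via `StridedTensor` (#722), together with the stride-index validation fix (#743). A hypothetical usage sketch; `kernel` stands in for any tilelang kernel compiled with strided-input support:

```python
import torch

a = torch.randn(1024, 512, device="cuda", dtype=torch.float16)
a_t = a.t()  # non-contiguous view: shape (512, 1024), strides (1, 512)
assert not a_t.is_contiguous()
# out = kernel(a_t)  # with StridedTensor support, no a_t.contiguous() copy
```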
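
Third, the `T.rsqrt` dispatch from #781: the expression below should lower to CUDA's rsqrt intrinsic rather than a reciprocal plus `sqrt`. Shapes and names are assumptions, not code from the PR:

```python
import tilelang
import tilelang.language as T

@tilelang.jit(out_idx=[-1])
def rsqrt_kernel(N=1024, block_N=128, dtype="float32"):

    @T.prim_func
    def main(A: T.Tensor((N,), dtype), B: T.Tensor((N,), dtype)):
        with T.Kernel(T.ceildiv(N, block_N), threads=block_N) as bx:
            for i in T.Parallel(block_N):
                # After #781 this should map to the rsqrt intrinsic
                # instead of the two-op 1 / T.sqrt(x) sequence.
                B[bx * block_N + i] = T.rsqrt(A[bx * block_N + i])

    return main
```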
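
Finally, the Python ternary support in the DSL (#822); the ReLU-style body is an assumed example, not taken from the PR:

```python
import tilelang
import tilelang.language as T

@tilelang.jit(out_idx=[-1])
def relu_kernel(N=1024, block_N=128, dtype="float32"):

    @T.prim_func
    def main(A: T.Tensor((N,), dtype), B: T.Tensor((N,), dtype)):
        with T.Kernel(T.ceildiv(N, block_N), threads=block_N) as bx:
            for i in T.Parallel(block_N):
                # A Python ternary should now parse to an if-then-else expr.
                B[bx * block_N + i] = (
                    A[bx * block_N + i] if A[bx * block_N + i] > 0 else 0.0
                )

    return main
```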
New Contributors
- @meinie0826 made their first contribution in #648
- @zhangnju made their first contribution in #661
- @chenyang78 made their first contribution in #670
- @Hzfengsy made their first contribution in #595
- @coderabbitai[bot] made their first contribution in #726
- @kurisu6912 made their first contribution in #748
- @johnnynunez made their first contribution in #760
- @HaoKang-Timmy made their first contribution in #785
- @txs19991 made their first contribution in #790
- @Paran0idy made their first contribution in #791
Full Changelog: https://github.com/tile-ai/tilelang/commits/0.1.6