v0.1.6
What's Changed
- [Bugfix] Add missing thread offsets and other information to the reduce op by @LeiWang1999 in #646
- [Bugfix] Adjust role assignment in warp specialization based on read access by @chengyupku in #647
- Fix JIT kernels to use the specified target by @meinie0826 in #648
- [Bugfix] Remove small array reuse condition in shared memory allocation merging by @LeiWang1999 in #654
- [Enhancement] Add role assignment for AllocateNode in warp specialization by @chengyupku in #657
- [Bugfix][CI] Bug fixing and migrate CI from ada to hopper by @xwhzz in #652
- [CI] Enable cache for virtual env and parallelize pytest via xdist by @LeiWang1999 in #660
- [Cache] Support shared cache directories for multiple processes by @LeiWang1999 in #649
- [Enhancement] Add a `compile_flags` parameter to JIT kernel and adapter classes for improved compilation control by @xwhzz in #656 (see the first sketch after this list)
- Add support for ROCm arch detection by @zhangnju in #661
- [BugFix] Do not modify strict layouts at the common or relaxed level of layout inference; add more conditions to layout checking by @tzj-fxz in #653
- [Bugfix][Docs] Update documentation build process and configurations for autoapi support by @xwhzz in #663
- [Enhancement] Improve buffer conflict detection in thread storage synchronization by @LeiWang1999 in #658
- [Bugfix] Take buffer data type into account in the provably-disjoint index analysis by @LeiWang1999 in #664
- [Bugfix] Remove redundant T.fill to fix precision issue by @xuchangtolearn in #667
- [Enhancement] Refactor buffer index handling for improved precision a… by @Alex4210987 in #671
- Reverts #671 by @LeiWang1999 in #672
- [Bugfix] Pass the correct nvcc to CMake by @chenyang78 in #670
- [CI] Improve format check output and automate commit of changes by @xwhzz in #669
- [Bugfix][CI] Use valid runner labels in workflow by @xwhzz in #674
- [Enhancement] Pass verbose to LibraryGenerator by @chenyang78 in #673
- [Enhancement] Enhance lint error messaging in CI by @xwhzz in #675
- Refactor to support upstream tvm by @Hzfengsy in #595
- Do not check for short variables by @oraluben in #676
- [Refactor] Phase out version with commit id in editable mode by @LeiWang1999 in #677
- [CI] Update CI workflow to use Python 3.12 by @LeiWang1999 in #679
- [Enhancement] Output cache-file-related messages with verbose=True by @chenyang78 in #683
- [Enhancement] Enhance warp specialization logic by @chengyupku in #680
- Add a Flash Attention example for the AMD MI300 series by @Alex4210987 in #682
- [Enhancement] Refactored buffer detection logic in warp_specialized_rewriter.cc by @chengyupku in #685
- [Fix] Fix some issues with JIT decorators in the examples by @Cunxiao2002 in #681
- [Enhancement] Add the `--ptxas-options=--register-usage-level=10` option by @LeiWang1999 in #684
- [Feature] Add auto-vectorization for atomic add by @yyttt6 in #686
- [Refactor] Rebase pipeline injector from upstream tvm by @LeiWang1999 in #687
- [Refactor] Introduce GemmInst for different targets handling by @LeiWang1999 in #688
- [Enhancement] Optimize BF16 casting performance by @xwhzz in #689
- [Smem Reuse] Optimize memory alignment of identical buffers by @LeiWang1999 in #693
- [Version] Keep the local commit id, as it helps with debugging by @LeiWang1999 in #697
- [Example] Optimize warp specialize flashmla example by @LeiWang1999 in #698
- Bump transformers from 4.52.1 to 4.53.0 in /examples/bitnet-1.58b by @dependabot[bot] in #700
- Gated Delta Net (GDN) kernel implementation in TileLang by @tzj-fxz in #695
- Trivial update to calculate target arch by @oraluben in #702
- [CI] Remove Flash Attention dependency by @LeiWang1999 in #705
- [Layout] Introduce a new layout inference mechanism by @LeiWang1999 in #699
- [Pipeline] Optimize the inject-software-pipeline and pipeline-planning passes by @LeiWang1999 in #706
- Low-bit kernel fixes and implementation by @tzj-fxz in #704
- [Feat] Support gemm with stride by @smallscientist1 in #701
- [Enhancement] Add eviction policy support for TMA operations, enhance CUDA codegen, and introduce new pass config by @xwhzz in #690
- [Enhancement] Enhance the robustness and generality of MLA examples by @Rachmanino in #709
- [Refactor] Make MergeAnnotations accept `Map<Any, Any>` instead of `Map<String, Any>` by @LeiWang1999 in #710
- [Pipeline] Phaseout fragment and double buffer info from pipeline pass by @LeiWang1999 in #711
- [Pipeline] Skip condition expression analysis for global reading by @LeiWang1999 in #713
- [Index] Relocate Int64 Auto Promoter to ConfigBitWidth Pass, removing it from FlattenBuffer by @LeiWang1999 in #714
- [CI] Bind build-test CI to NVIDIA as AMD runners are being introduced by @LeiWang1999 in #718
- fix: NVRTC backend by @lucifer1004 in #717
- [CUDA] Init support for sm_120 by @oraluben in #716
- [Bugfix] Correct git configuration in docs CI by @xwhzz in #720
- [Chore] Fix typos by @lucifer1004 in #719
- [CI][AMD] Add AMD GPU CI and fix some related bugs by @Alex4210987 in #694
- [Carver][Bugfix] Correct score function for warp tile selection in tensorcore policy by @NaOHCC in #724
- [Refactor] Refactor CUDA code generation to simplify eviction policy handling by @LeiWang1999 in #721
- [Language] Introduce `StridedTensor` to support non-contiguous torch inputs by @LeiWang1999 in #722 (usage sketched after this list)
- [Enhancement][Bugfix] Fix bug in warp specialized pass and add gemm_sr fallback support for Hopper by @xwhzz in #712
- 📝 Add docstrings to `fix` by @coderabbitai[bot] in #726
- Fix AMD CI and add examples by @Alex4210987 in #729
- [Feature] Low-bit twiddling dequantization and FP4 GEMM by @tzj-fxz in #725
- 📝 Add docstrings to `mxfp4` by @coderabbitai[bot] in #732
- [Refactor] Refactor env into a more flexible version by @LeiWang1999 in #740
- [Bugfix] Align stride index validation with torch in CythonKernelWrapper by @LeiWang1999 in #743
- [Bugfix] Fix out-of-bounds memory access in atomic add auto-vectorization by @yyttt6 in #742
- 📝 Add docstrings to `main` by @coderabbitai[bot] in #745
- [Refactor] Refactor barrier management by @LeiWang1999 in #744
- [Refactor] Merge bulk copy into copy and improve layout inference for bulk copy by @LeiWang1999 in #746
- [Refactor] Merge ThreadPartialSync and ThreadStorageSync by @LeiWang1999 in #741
- [Enhancement] Optimize loop body handling in IR by @chengyupku in #749
- [MXFP4] Fix bugs and optimize exponential operation by @tzj-fxz in #750
- [Enhancement] Add DispatchInstruction specialization for fp8 types in gemm_sm90.h by @LeiWang1999 in #751
- [Enhancement] Add shape checking for reduce options by @kurisu6912 in #748
- [Bugfix] Add missing FP8 header include by @LeiWang1999 in #752
- [MXFP4] Add bias to MXFP4 GEMM kernel by @tzj-fxz in #753
- [Bugfix][WS] Consider loop min extent when computing phase id by @LeiWang1999 in #754
- [Typo] Remove `disable_cache` in some tests by @LeiWang1999 in #755
- [README] Update GDN README for clarity and add acknowledgements by @chengyupku in #758
- CUTLASS v4.2.0 with CUDA 13 support by @johnnynunez in #760
- [Feature] Add 1D TMA support by @tzj-fxz in #761
- [Example] Add vertical slash sparse attention pattern by @xwhzz in #762
- [Bugfix] Address PassContext contamination from CI and fix incorrect rewrites in warp specialized pass by @xwhzz in #767
- [MXFP4] Add 1D TMA copy for Scale tensor in MXFP4 GEMM by @tzj-fxz in #766
- [CUTLASS] Hotfix for Blackwell by @johnnynunez in #768
- [Refactor] Refactor `Operator` into `TileOperator` with tvm reflection by @LeiWang1999 in #763
- [Reducer] Introduce `alloc_reducer` to separate inter- and intra-warp reduction by @LeiWang1999 in #757
- 📝 Add docstrings to `pytile_0826` by @coderabbitai[bot] in #770
- [Bugfix] Fix a negative optimization in atomic add auto-vectorization by @yyttt6 in #765
- 📝 Add docstrings to `reducer_0825` by @coderabbitai[bot] in #772
- Allow filling global buffers by @kurisu6912 in #774
- [BugFix] Refactor the op check in the LowerTileOp pass to use a member function instead of string matching by @tzj-fxz in #771
- [Enhancement] Add exp fallback for bf16 by @xwhzz in #776
- [Lint] Introduce clang-tidy into format.sh by @LeiWang1999 in #777
- [Cache] Introduce detailed target information for the disk kernel cache by @LeiWang1999 in #780
- [Example] Add an example for the top-k operation by @Cunxiao2002 in #775
- [Math] Dispatch `T.rsqrt(x)` to a CUDA intrinsic instead of `1 / T.sqrt(x)` by @LeiWang1999 in #781 (sketched after this list)
- [CI] Add pytest-durations for test timing by @Cunxiao2002 in #782
- [Refactor] Support python reflection for tile operators by @LeiWang1999 in #783
- Fix AMD TIR and add examples by @Alex4210987 in #784
- [Nvidia][SM121] Add intrin.h include to gemm_mma.h for sm120+ by @HaoKang-Timmy in #785
- [Feat] Add tilelang `T.assume` support and assume injection for buffer shapes by @kurisu6912 in #787
- [Bugfix] Fix incorrect synchronization bug in minference example by @xwhzz in #786
- [AMD] Fix bugs in warp shuffle by @txs19991 in #790
- [AMD] Fix the mfma op interface by @Paran0idy in #791
- [TMA] Automatically lower 1D TMA in appropriate cases by @LeiWang1999 in #788
- [CI] Add pytest timeout to CI by @Cunxiao2002 in #792
- [Enhancement] Resolve reference cycle by @LeiWang1999 in #795
- [Bugfix] Fix index handling to promote 64-bit integers by @LeiWang1999 in #796
- [AMD] Support mfma i32_16x16x32_i8 by @Paran0idy in #800
- [TileOp] Introduce an experimental Python-defined `T.gemm_v2` by @LeiWang1999 in #793
- [Bugfix] Expose the `alloc_reducer` definition to the Python side by @LeiWang1999 in #802
- [Refactor] Use new namespace and enhance dispatch macros for mma by @LeiWang1999 in #801
- [AMD] Support fp8 `T.gemm` by @txs19991 in #804
- [AMD] Support preshuffled-weight mfma by @Paran0idy in #806
- Add pytest-durations to requirements for ROCm by @Alex4210987 in #810
- Add ruff config to check for useless spaces by @oraluben in #807
- [Feature] Add ptx_cp_async_barrier_noinc intrinsic and related functionality by @chengyupku in #809
- [Fix] Fix lower bug when buffer store is not guarded by any tile op by @kurisu6912 in #794
- [Feat] Support gemm_sp for the Ampere arch by @botbw in #691
- [Refactor] Update TVM subproject and refactor BlockNode handling in warp_specialized_rewriter.cc by @chengyupku in #812
- [Refactor] Reopen #794: fix lower bug when buffer store is not guarded by any tile op by @kurisu6912 in #817
- [Refactor] Update TVM subproject and streamline buffer store handling by @chengyupku in #816
- [Example] Add a w4a8 GEMM kernel by @Cunxiao2002 in #815
- [CI] Fix ROCm CI by @Cunxiao2002 in #819
- [Example] Fix unused param in mhs example by @botbw in #821
- [DSL] Support Python ternary if-then-else expressions by @LeiWang1999 in #822 (sketched after this list)
- [Bugfix] Fix failure when the git command is not installed by @LeiWang1999 in #823
- [Bugfix] Skip fp4 dtype binding when using older versions of ml_dtypes by @LeiWang1999 in #824
- [Enhancement] Add a MXFP4 grouped GEMM example for FusedMoE by @Rachmanino in #811
- [CMake] Add support for statically linking the system libc by @LeiWang1999 in #825
- [Refactor] Refactor some build related configurations by @LeiWang1999 in #827
- [CI] Test fix: handle BufferLoad nodes when `T.gemm` input has a stride by @LeiWang1999 in #843
- [Refactor] Turn off `ENABLE_FAST_MATH` by default by @LeiWang1999 in #846
- [AMD] Fix bf16x2 dtype codegen by @Paran0idy in #847
- [Typing] Fallback from Python 3.10+ type syntax for compatibility by @LeiWang1999 in #848
- [TIR] Refactor division simplification in RewriteSimplifier by @LeiWang1999 in #849
- [Py38] Revert typing and parser updates for Python 3.8 compatibility by @LeiWang1999 in #850
- [Bugfix] Disable memory info analysis for `local.var` by @LeiWang1999 in #851
- [Release] Bump version to 0.1.6 by @LeiWang1999 in #818
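
A few hedged usage sketches for items flagged above. First, the `compile_flags` parameter from #656, combined with the `--ptxas-options=--register-usage-level=10` option from #684. The decorator arguments and kernel body below are illustrative assumptions, not the exact API:

```python
import tilelang
import tilelang.language as T

# Hedged sketch: forward extra compiler flags through the JIT path (#656).
# The parameter name follows the PR title; its exact placement may differ.
@tilelang.jit(compile_flags=["--ptxas-options=--register-usage-level=10"])
def copy_kernel(N=1024, block_N=128, dtype="float16"):

    @T.prim_func
    def main(A: T.Tensor((N,), dtype), B: T.Tensor((N,), dtype)):
        with T.Kernel(T.ceildiv(N, block_N), threads=block_N) as bx:
            for i in T.Parallel(block_N):
                B[bx * block_N + i] = A[bx * block_N + i]

    return main
```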
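
Second, non-contiguous inputs via `StridedTensor` (#722), together with the stride-index validation fix (#743). A hypothetical usage sketch; `kernel` stands in for any tilelang kernel compiled with strided-input support:

```python
import torch

a = torch.randn(1024, 512, device="cuda", dtype=torch.float16)
a_t = a.t()  # non-contiguous view: shape (512, 1024), strides (1, 512)
assert not a_t.is_contiguous()
# out = kernel(a_t)  # with StridedTensor support, no a_t.contiguous() copy
```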
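
Third, the `T.rsqrt` dispatch from #781: the expression below should lower to CUDA's rsqrt intrinsic rather than a reciprocal plus `sqrt`. Shapes and names are assumptions, not code from the PR:

```python
import tilelang
import tilelang.language as T

@tilelang.jit(out_idx=[-1])
def rsqrt_kernel(N=1024, block_N=128, dtype="float32"):

    @T.prim_func
    def main(A: T.Tensor((N,), dtype), B: T.Tensor((N,), dtype)):
        with T.Kernel(T.ceildiv(N, block_N), threads=block_N) as bx:
            for i in T.Parallel(block_N):
                # After #781 this should map to the rsqrt intrinsic
                # instead of the two-op 1 / T.sqrt(x) sequence.
                B[bx * block_N + i] = T.rsqrt(A[bx * block_N + i])

    return main
```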
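
Finally, the Python ternary support in the DSL (#822); the ReLU-style body is an assumed example, not taken from the PR:

```python
import tilelang
import tilelang.language as T

@tilelang.jit(out_idx=[-1])
def relu_kernel(N=1024, block_N=128, dtype="float32"):

    @T.prim_func
    def main(A: T.Tensor((N,), dtype), B: T.Tensor((N,), dtype)):
        with T.Kernel(T.ceildiv(N, block_N), threads=block_N) as bx:
            for i in T.Parallel(block_N):
                # A Python ternary should now parse to an if-then-else expr.
                B[bx * block_N + i] = (
                    A[bx * block_N + i] if A[bx * block_N + i] > 0 else 0.0
                )

    return main
```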
New Contributors
- @meinie0826 made their first contribution in #648
- @zhangnju made their first contribution in #661
- @chenyang78 made their first contribution in #670
- @Hzfengsy made their first contribution in #595
- @coderabbitai[bot] made their first contribution in #726
- @kurisu6912 made their first contribution in #748
- @johnnynunez made their first contribution in #760
- @HaoKang-Timmy made their first contribution in #785
- @txs19991 made their first contribution in #790
- @Paran0idy made their first contribution in #791
Full Changelog: https://github.com/tile-ai/tilelang/commits/0.1.6