v0.1.6

@LeiWang1999 released this 19 Sep 15:28 · 306 commits to main since this release · 1ad6e46

What's Changed

  • [Bugfix] Added missing thread offsets and other information to reduce by @LeiWang1999 in #646
  • [Bugfix] Adjust role assignment in warp specialization based on read access by @chengyupku in #647
  • Fix/jit kernel use target by @meinie0826 in #648
  • [Bugfix] Remove small array reuse condition in shared memory allocation merging by @LeiWang1999 in #654
  • [Enhancement] Add role assignment for AllocateNode in warp specialization by @chengyupku in #657
  • [Bugfix][CI] Bug fixing and migrate CI from ada to hopper by @xwhzz in #652
  • [CI] Enable cache for virtual env and parallelize pytest via xdist by @LeiWang1999 in #660
  • [Cache] Support shared cache directories for multiple processes by @LeiWang1999 in #649
  • [Enhancement] Add compile_flags parameter to JIT kernel and adapter classes for improved compilation control (see the sketch below) by @xwhzz in #656
  • Add support for ROCm arch detection by @zhangnju in #661
  • [BugFix] Do not modify strict layout in common or relax level of layout inference. More conditions on layout checking by @tzj-fxz in #653
  • [Bugfix][Docs] Update documentation build process and configurations for autoapi support by @xwhzz in #663
  • [Enhancement] Improve buffer conflict detection in thread storage synchronization by @LeiWang1999 in #658
  • [Bugfix] Consider buffer data type into indices provably disjoint analysis by @LeiWang1999 in #664
  • [Bugfix] Remove redundant T.fill to fix precision issue by @xuchangtolearn in #667
  • [Enhancement] Refactor buffer index handling for improved precision a… by @Alex4210987 in #671
  • Reverts #671 by @LeiWang1999 in #672
  • [Bugfix] Passing correct nvcc to cmake by @chenyang78 in #670
  • [CI] Improve format check output and automate commit of changes by @xwhzz in #669
  • [Bugfix][CI] Use valid runner labels in workflow by @xwhzz in #674
  • [Enhancement] passing verbose to LibraryGenerator by @chenyang78 in #673
  • [Enhancement] Enhance lint error messaging in CI by @xwhzz in #675
  • Refactor to support upstream tvm by @Hzfengsy in #595
  • Do not check for short variables by @oraluben in #676
  • [Refactor] Phase out version with commit id in editable mode by @LeiWang1999 in #677
  • [CI] Update CI workflow to use Python 3.12 by @LeiWang1999 in #679
  • [Enhancement] Output cache-file-related messages with verbose=True by @chenyang78 in #683
  • [Enhancement] Enhance warp specialization logic by @chengyupku in #680
  • Add Flash Attn example on amd mi300 series by @Alex4210987 in #682
  • [Enhancement] Refactored buffer detection logic in warp_specialized_rewriter.cc by @chengyupku in #685
  • [Fix] fix some issues with JIT decorators existing in the examples by @Cunxiao2002 in #681
  • [Enhancement] Add --ptxas-options=--register-usage-level=10 option by @LeiWang1999 in #684
  • [Feature]: Add auto-vectorize for atomic add by @yyttt6 in #686
  • [Refactor] Rebase pipeline injector from upstream tvm by @LeiWang1999 in #687
  • [Refactor] Introduce GemmInst for handling different targets by @LeiWang1999 in #688
  • [Enhancement] Optimize BF16 casting performance by @xwhzz in #689
  • [Smem Reuse] Optimize memory alignment on identical buffers by @LeiWang1999 in #693
  • [Version] Keep local commit id as it somehow helps with debugging by @LeiWang1999 in #697
  • [Example] Optimize warp specialize flashmla example by @LeiWang1999 in #698
  • Bump transformers from 4.52.1 to 4.53.0 in /examples/bitnet-1.58b by @dependabot[bot] in #700
  • Gated Delta Net (GDN) kernel implementation in TileLang by @tzj-fxz in #695
  • Trivial update to calculate target arch by @oraluben in #702
  • [CI] Remove Flash Attention dependency by @LeiWang1999 in #705
  • [Layout] Introduce a new layout inference mechanism by @LeiWang1999 in #699
  • [Pipeline] Optimize inject software pipeline and pipeline planning pass by @LeiWang1999 in #706
  • Low-bit kernels fix and implementation by @tzj-fxz in #704
  • [Feat] Support gemm with stride by @smallscientist1 in #701
  • [Enhancement] Add eviction policy support for TMA operations, enhance CUDA codegen, and introduce new pass config by @xwhzz in #690
  • [Enhancement] Enhance the robustness and generality of MLA examples by @Rachmanino in #709
  • [Refactor] MergeAnnotations function to accept Map<Any, Any> instead of Map<String, Any> by @LeiWang1999 in #710
  • [Pipeline] Phaseout fragment and double buffer info from pipeline pass by @LeiWang1999 in #711
  • [Pipeline] Skip condition expression analysis for global reading by @LeiWang1999 in #713
  • [Index] Relocate Int64 Auto Promoter to ConfigBitWidth Pass, removing it from FlattenBuffer by @LeiWang1999 in #714
  • [CI] Bind build-test CI to NVIDIA as AMD runners are being introduced by @LeiWang1999 in #718
  • fix: NVRTC backend by @lucifer1004 in #717
  • [CUDA] Init support for sm_120 by @oraluben in #716
  • [Bugfix] Correct git configuration in docs CI by @xwhzz in #720
  • [Chore] fix typos by @lucifer1004 in #719
  • [CI][AMD] Add AMD GPU CI and fix some related bugs by @Alex4210987 in #694
  • [Carver][Bugfix] Correct score function for warp tile selection in tensorcore policy by @NaOHCC in #724
  • [Refactor] Refactor CUDA code generation to simplify eviction policy handling by @LeiWang1999 in #721
  • [Language] Introduce StridedTensor to support non-contiguous torch inputs (see the sketch below) by @LeiWang1999 in #722
  • [Enhancement][Bugfix] Fix bug in warp specialized pass and add gemm_sr fallback support for Hopper by @xwhzz in #712
  • 📝 Add docstrings to fix by @coderabbitai[bot] in #726
  • Fix AMD CI & add examples by @Alex4210987 in #729
  • [Feature] Low-bit twiddling dequantization and FP4 GEMM by @tzj-fxz in #725
  • 📝 Add docstrings to mxfp4 by @coderabbitai[bot] in #732
  • [Refactor] Refactor env into a more flexible version by @LeiWang1999 in #740
  • [Bugfix] Align stride index validation with torch in CythonKernelWrapper by @LeiWang1999 in #743
  • [Bugfix]: Fix atomic add auto-vectorize out-of-bound memory access error by @yyttt6 in #742
  • 📝 Add docstrings to main by @coderabbitai[bot] in #745
  • [Refactor] Refactor barrier management by @LeiWang1999 in #744
  • [Refactor] Merge bulk copy into copy and improve layout inference for bulk copy by @LeiWang1999 in #746
  • [Refactor] Merge ThreadPartialSync and ThreadStorageSync by @LeiWang1999 in #741
  • [Enhancement] Optimize loop body handling in IR by @chengyupku in #749
  • [MXFP4] Fix bugs and optimize exponential operation by @tzj-fxz in #750
  • [Enhancement] Add DispatchInstruction specialization for fp8 types in gemm_sm90.h by @LeiWang1999 in #751
  • [Enhancement] Add shape checking for reduce options by @kurisu6912 in #748
  • [Bugfix] Add missing FP8 header include by @LeiWang1999 in #752
  • [MXFP4] Add bias to MXFP4 GEMM kernel by @tzj-fxz in #753
  • [Bugfix][WS] Consider loop min extent when computing phase id by @LeiWang1999 in #754
  • [Typo] Remove disable_cache in some tests by @LeiWang1999 in #755
  • [README] Update GDN README for clarity and add acknowledgements by @chengyupku in #758
  • CUTLASS v4.2.0, supporting CUDA 13 by @johnnynunez in #760
  • [Feature] Add 1D TMA support by @tzj-fxz in #761
  • [Example] Add vertical slash sparse attention pattern by @xwhzz in #762
  • [Bugfix] Address PassContext contamination from CI and fix incorrect rewrites in warp specialized pass by @xwhzz in #767
  • [MXFP4] Add 1D TMA copy for Scale tensor in MXFP4 GEMM by @tzj-fxz in #766
  • [CUTLASS] Hotfix for Blackwell by @johnnynunez in #768
  • [Refactor] Refactor Operator into TileOperator with tvm reflection by @LeiWang1999 in #763
  • [Reducer] Introduce alloc_reducer to separate inter and intra warp reduction by @LeiWang1999 in #757
  • 📝 Add docstrings to pytile_0826 by @coderabbitai[bot] in #770
  • [Bugfix]: Fix atomic add auto-vectorize performance regression by @yyttt6 in #765
  • 📝 Add docstrings to reducer_0825 by @coderabbitai[bot] in #772
  • Allow fill global buffer by @kurisu6912 in #774
  • [BugFix] Refactor the op check in LowerTileOp pass using the member function instead of string match by @tzj-fxz in #771
  • [Enhancement] Add exp fallback for bf16 by @xwhzz in #776
  • [Lint] Introduce clang-tidy into format.sh by @LeiWang1999 in #777
  • [Cache] Introduce detailed target information for the disk kernel cache by @LeiWang1999 in #780
  • [Example] Adds example for top-k operation by @Cunxiao2002 in #775
  • [Math] Dispatch T.rsqrt(x) to a CUDA intrinsic instead of 1 / T.sqrt(x) (see the sketch below) by @LeiWang1999 in #781
  • [CI] Adds pytest-durations for test timing by @Cunxiao2002 in #782
  • [Refactor] Support python reflection for tile operators by @LeiWang1999 in #783
  • Fix AMD TIR & add examples by @Alex4210987 in #784
  • [Nvidia][SM121] Add intrin.h include to gemm_mma.h for sm120+ by @HaoKang-Timmy in #785
  • [Feat] Add tilelang T.assume support and assume injection for buffer shapes by @kurisu6912 in #787
  • [Bugfix] Fix incorrect synchronization bug in minference example by @xwhzz in #786
  • [AMD] fix bugs in warp shuffle by @txs19991 in #790
  • [AMD] fix mfma op interface by @Paran0idy in #791
  • [TMA] Automatically lower 1D TMA in appropriate cases by @LeiWang1999 in #788
  • [CI] Adds pytest timeout to CI by @Cunxiao2002 in #792
  • [Enhancement] Resolve reference cycle by @LeiWang1999 in #795
  • [Bugfix] Fix index handling to promote 64-bit integers by @LeiWang1999 in #796
  • [AMD] support mfma i32_16x16x32_i8 by @Paran0idy in #800
  • [TileOp] Introduce an experimental Python-defined T.gemm_v2 by @LeiWang1999 in #793
  • [Bugfix] Expose alloc_reducer definition to the python side by @LeiWang1999 in #802
  • [Refactor] Use new namespace and enhance dispatch macros for mma by @LeiWang1999 in #801
  • [AMD] support fp8 T.gemm by @txs19991 in #804
  • [AMD] support preshuffle weight mfma by @Paran0idy in #806
  • Add pytest-durations to requirements for ROCm by @Alex4210987 in #810
  • Add ruff config to check for useless spaces by @oraluben in #807
  • [Feature] Add ptx_cp_async_barrier_noinc intrinsic and related functionality by @chengyupku in #809
  • [Fix] Fix lower bug when buffer store is not guarded by any tile op by @kurisu6912 in #794
  • [feat] support gemm_sp for ampere arch by @botbw in #691
  • [Refactor] Update TVM subproject and refactor BlockNode handling in warp_specialized_rewriter.cc by @chengyupku in #812
  • [Refactor] Reopen #794 Fix lower bug when buffer store is not guarded by any tile op by @kurisu6912 in #817
  • [Refactor] Update TVM subproject and streamline buffer store handling by @chengyupku in #816
  • [Example] add w4a8 gemm kernel by @Cunxiao2002 in #815
  • [CI] fix rocm ci by @Cunxiao2002 in #819
  • [example] fix unused param in mhs example by @botbw in #821
  • [DSL] Support Python ternary if-then-else expressions (see the sketch below) by @LeiWang1999 in #822
  • [Bugfix] Bug fix when git command is not installed by @LeiWang1999 in #823
  • [Bugfix] Skip fp4 dtype binding when using older versions of ml_dtypes by @LeiWang1999 in #824
  • [Enhancement] Add a MXFP4 grouped GEMM example for FusedMoE by @Rachmanino in #811
  • [CMake] Added support for statically linked system libc library by @LeiWang1999 in #825
  • [Refactor] Refactor some build related configurations by @LeiWang1999 in #827
  • [CI] Test Fix: Handle BufferLoad nodes when T.gemm input has a stride by @LeiWang1999 in #843
  • [Refactor] Turn off ENABLE_FAST_MATH by default by @LeiWang1999 in #846
  • [AMD] fix bf16x2 dtype codegen by @Paran0idy in #847
  • [Typing] Fallback from Python 3.10+ type syntax for compatibility by @LeiWang1999 in #848
  • [TIR] Refactor division simplification in RewriteSimplifier by @LeiWang1999 in #849
  • [Py38] Revert typing and parser updates for Python 3.8 compatibility by @LeiWang1999 in #850
  • [Bugfix] Disable Memory Info Analysis for local.var by @LeiWang1999 in #851
  • [Release] Bump Version to 0.1.6 by @LeiWang1999 in #818
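
Example Sketches

A few of the changes above add user-facing API surface; the sketches below illustrate them. They are reconstructions from the PR titles, not code taken from the PRs, and any signature not documented in the list above is an assumption.

#656 adds a compile_flags parameter to the JIT kernel and adapter classes. A minimal sketch of how such a knob might be used, assuming the flags are forwarded to the underlying compiler invocation; the kernel scaffolding follows the style of the repository's README:

```python
import tilelang
import tilelang.language as T

@T.prim_func
def copy_kernel(X: T.Tensor((1024,), "float32"),
                Y: T.Tensor((1024,), "float32")):
    with T.Kernel(8, threads=128) as bx:
        for i in T.Parallel(128):
            Y[bx * 128 + i] = X[bx * 128 + i]

# compile_flags (per #656) should pass extra options through to the
# compiler; "-O3" here is an ordinary nvcc flag, not one the PR mandates.
kernel = tilelang.compile(copy_kernel, compile_flags=["-O3"])
```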
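#722's StridedTensor is about accepting non-contiguous torch inputs without a copy. The torch side of the sketch below is exact; the comment about the kernel-side declaration is an assumption about the new API:

```python
import torch

# A transposed view is the classic non-contiguous input: same storage,
# swapped strides, no data movement.
a = torch.randn(256, 512, device="cuda", dtype=torch.float16)
a_t = a.t()                      # shape (512, 256)

assert not a_t.is_contiguous()
print(a_t.shape, a_t.stride())   # torch.Size([512, 256]) (1, 512)

# Before #722, a kernel expecting a dense T.Tensor would need this copy:
a_dense = a_t.contiguous()

# With StridedTensor, a kernel whose signature declares a strided layout
# (hypothetically something like T.StridedTensor in place of T.Tensor)
# should be able to consume a_t directly and skip the copy.
```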
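#781 changes only what codegen emits for T.rsqrt: a dedicated CUDA reciprocal-square-root intrinsic (rsqrtf for fp32) instead of a sqrt followed by a full-precision divide. A sketch at the DSL level, with the kernel scaffolding again assumed from the README style:

```python
import tilelang.language as T

@T.prim_func
def rsqrt_kernel(X: T.Tensor((1024,), "float32"),
                 Y: T.Tensor((1024,), "float32")):
    with T.Kernel(8, threads=128) as bx:
        for i in T.Parallel(128):
            # After #781 this lowers to the CUDA rsqrtf intrinsic;
            # previously it was emitted as the equivalent of
            # 1.0f / sqrtf(x), i.e. a sqrt plus a divide.
            Y[bx * 128 + i] = T.rsqrt(X[bx * 128 + i])
```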
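#822 teaches the parser Python's conditional expression, so `a if cond else b` can be written inline instead of an explicit if/else statement. Under the same scaffolding assumptions as above:

```python
import tilelang.language as T

@T.prim_func
def relu(X: T.Tensor((1024,), "float32"),
         Y: T.Tensor((1024,), "float32")):
    with T.Kernel(8, threads=128) as bx:
        for i in T.Parallel(128):
            # The Python ternary now parses inside the DSL; the natural
            # lowering is a select rather than divergent control flow.
            Y[bx * 128 + i] = X[bx * 128 + i] if X[bx * 128 + i] > 0 else 0.0
```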

Full Changelog: https://github.com/tile-ai/tilelang/commits/0.1.6