Releases · tile-ai/tilelang
v0.1.6.post2
The Last Release for Python 3.8 (without tvm-ffi) 🚀
What's Changed
- [Analyzer] Enhance ConstIntBoundAnalyzer and IntervalSet with modular set analysis by @LeiWang1999 in #856
- [Doc] Optimize the quickstart guide for clarity and not just for CUDA by @LeiWang1999 in #858
- [TMA] Bugfix when a shared buffer is both issued with tma store and tma load by @LeiWang1999 in #857
- [AMD][MLA] Fix mla autotune for rocm by @LeiWang1999 in #861
- [Bugfix] Ensure correct handling for cases where `seq_q < seq_kv` in flash attention examples by @Rachmanino in #864
- [AMD] refactor MatrixCoreIntrinEmitter by @Paran0idy in #860
- [Feat] Add fast sine and cosine definitions in CUDA templates by @Rachmanino in #865
- [Layout] Support layout forward with multi dimension by @LeiWang1999 in #867
- [Autotune][Conv] optimize convolution examples to use autotune by @LeiWang1999 in #866
- [Example] Add examples to support efficient attention sink forward process by @Rachmanino in #853
- [Parser] Adapt Parser to work with Python 3.8 in some cases by @LeiWang1999 in #869
- [Fix] Fix bug 0905: tilelang doesn't vectorize `B[i, j] = c[i] + A[i, j]` by @kurisu6912 in #798
- [Language] Support sequence comparisons by @LeiWang1999 in #872
- [Language] Support loop_break primitive by @chengyupku in #873
- [Bugfix] Use `ExprDeepEqual` instead of `StructuralEqual` when merging consecutive If statements by @LeiWang1999 in #876
- [Language] Support atomic add with ret by @LeiWang1999 in #870
- [Cython] Remove an incorrect check by @LJC00118 in #880
- Update amd_ci.yml by @Alex4210987 in #881
- [FastMath] Disable default TVM fastmath intrinsic dispatch and add explicit fastmath op to invoke by @LeiWang1999 in #875
- [Example] Add efficient attention sink backward implementations and tests by @Rachmanino in #877
- [Precision] Introduce `T.ieee_rsqrt` and related high precision ops by @LeiWang1999 in #882
- [Dist] Provide an option to include commit ID in version by @LeiWang1999 in #884
- [Example] Optimize sink attention forward via swizzled layout and report benchmark results by @Rachmanino in #885
- [Layout] Introduce Flexible Parallel to Support T.serial and local buffers inside T.Parallel loop by @LeiWang1999 in #844
- [Bugfix][Enhancement] Fix a bug in previous commit and enhance cuda backend by @Hamerlate in #887
- [Bugfix] Fix CopyNode Lower method to include disable_tma flag in GetCopyInst by @Rachmanino in #888
- [Layout] Fix plot layout by @Paran0idy in #890
- [Example] Add example by @LeiWang1999 in #894
- [News] Add announcement of support for Huawei Ascend chips by @xwhzz in #895
- [Example] Add sparse mla examples by @LeiWang1999 in #896
- [Typo] Fix backend name for Huawei Ascend by @xwhzz in #898
- [CI] Legalize math related test by @LeiWang1999 in #899
- [Bugfix] Fix flops comp and softmax scale in mla by @Edenzzzz in #900
- [Example] Specify a fixed commit for the flash-linear-attention repository and optimize nsa examples by @LeiWang1999 in #913
- [CI] optimize CI time for sparse gemm by @botbw in #906
- [Enhancement] Include compile flags into the hash key of cached kernels by @Rachmanino in #911
- [Bugfix] Fix saving kernel source code where JITKernel.artifact is None by @zjudmd1015 in #921
- [CI] Refactor import paths in dequantization examples to use dequantize_utils by @LeiWang1999 in #914
- [Example] Add MLA decode ws example by @chengyupku in #928
- [CI] Fix documentation runner by adding 'nvidia' tag by @xwhzz in #927
- [Layout] Strict annotate completed replicated layout for fragment with constant index by @LeiWang1999 in #929
- [Bugfix] Fix tensor memory copy layout by @Hamerlate in #933
- [Example] Optimize online_softmax example by @lijinpei in #934
- [Example] Add correctness assert into dsa example by @LeiWang1999 in #937
- [Enhancement] Enhance and add new GQA backward examples for Hopper by @Rachmanino in #930
- [Enhancement] Fix lint to improve grouped GEMM performance with TMA by @Cunxiao2002 in #938
- [Example] Introduce split+sum template, and optimize `atomic_add` performance for bwd examples by @LeiWang1999 in #940
- [Example] Disable TMA and enable FastMath for NSA Examples by @LeiWang1999 in #941
- [Example] Revert the atomic/split&sum templates in MHA backward examples by @Rachmanino in #943
- [Example] Add sparse mla bwd example for deepseek_v32 by @Zhichenzzz in #919
- [Profiler] Adds CUPTI profiler support by @Cunxiao2002 in #936
- [Enhancement] Support Copy for Buffer Load with scalar indices by @LeiWang1999 in #946
- [Code Style] Refine nvrtc compile related check style by @BBuf in #945
- [Backend] Add metal backend by @oraluben in #799
- [CI] enable dependabot for GHA workflows by @XuehaiPan in #950
- Modify the SM architecture number to support Thor’s sm110. by @iloveai8086 in #957
- [CI] auto-cancel in-progress PR CI when new commits are pushed by @XuehaiPan in #956
- [bug] fix type object is not subscriptable in py38 by @BBuf in #959
- [Bugfix][Doc] Add astroid version constraint to requirements.txt by @xwhzz in #958
- [CI]: Bump actions/setup-python from 2 to 6 by @dependabot[bot] in #951
- [CI]: Bump astral-sh/setup-uv from 6 to 7 by @dependabot[bot] in #952
- [CI]: Bump actions/github-script from 7 to 8 by @dependabot[bot] in #954
- [CI]: Bump actions/checkout from 2 to 5 by @dependabot[bot] in #953
- [TileOp] Implement WGMMA for T.gemm_v2 by @LeiWang1999 in #813
- [Docs] add CODE_OF_CONDUCT.md by @XuehaiPan in #965
- [Example] Add support for `bfloat16` and user-defined `sm_scale` in attention sink examples by @Rachmanino in #924
- [Bugfix] Do not force inline let stmt by @LeiWang1999 in #947
- [CI] Add `pre-commit` integration by @XuehaiPan in #955
- [Doc] Add a Docker install method to the installation docs by @BBuf in #961
- [Bugfix] Fix dummy kernel compilation by @SiriusNEO in #962
- [CI][Refactor] Refactor non-test CI workflow files by @XuehaiPan in #971
- [TileOp] Implement `CumSum1D` by @LeiWang1999 in #978
- [Language] Enhance `T.alloc_var` for AugAssign and AnnAssign by @LeiWang1999 in #979 (see the sketch after this list)
- [Refactor] Refactor pass `InjectFenceProxy` and expose some warp group primitives in frontend by @LeiWang1999 in #977
- [Typo] Remove debug print by @LeiWang1999 in #980
- [Bugfix] Use `access_ptr("r")` instead of `access_ptr("w")` for correct pipeline analysis by @LeiWang1999 in #983
- [Feature][Example] Support TMA reduce operation and update GQA bwd example by @chengyupku in #969
- [Bugfix] Add NVIDIA HPC SDK support in CUDA detection (#974) by @Degeneracy-Evil in #976
- [BugFix] Robust gemm policy for sparse_mla_fwd in Hopper and Ada Lovelace architectures by @tzj-fxz in #984
- [Bugfix] Fallback `torch.accelerator.synchronize()` to `torch.cuda.synchronize()` by @yyttt6 in #987
...
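As a quick illustration of the frontend changes above (notably the `T.alloc_var` augmented-assignment support in #979), here is a minimal, hypothetical sketch. It is not taken from the release: it assumes the `T.Kernel`/`T.serial` frontend from the project quickstart, the helper name `running_sum` is made up, and the exact `T.alloc_var` signature and scalar-assignment semantics may differ between versions.

```python
import tilelang
import tilelang.language as T


def running_sum(N, dtype="float32"):
    # Hypothetical helper, for illustration only.
    @T.prim_func
    def main(A: T.Tensor((N,), dtype), Out: T.Tensor((1,), dtype)):
        with T.Kernel(1, threads=32) as bx:
            acc = T.alloc_var(dtype)  # scalar variable, analogous to `float acc;`
            acc = 0.0
            for i in T.serial(N):
                acc += A[i]  # AugAssign on a T.alloc_var (the enhancement in #979)
            Out[0] = acc
    return main
```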
v0.1.6.post1
In version 0.1.6, libgcc and libg++ were statically linked to improve version compatibility. However, this could introduce unpredictable risks in some programs.
In post1, the build process was reworked to follow the PyTorch build workflow, eliminating those risks while keeping the improved compatibility. This is why version 0.1.6.post1 was released.
v0.1.6
What's Changed
- [Bugfix] Added missing thread offsets and other information to reduce by @LeiWang1999 in #646
- [Bugfix] Adjust role assignment in warp specialization based on read access by @chengyupku in #647
- Fix/jit kernel use target by @meinie0826 in #648
- [Bugfix] Remove small array reuse condition in shared memory allocation merging by @LeiWang1999 in #654
- [Enhancement] Add role assignment for AllocateNode in warp specialization by @chengyupku in #657
- [Bugfix][CI] Bug fixing and migrate CI from ada to hopper by @xwhzz in #652
- [CI] Enable cache for virtual env and parallelize pytest via xdist by @LeiWang1999 in #660
- [Cache] Support shared cache directories for multiple process by @LeiWang1999 in #649
- [Enhancement] Add compile_flags parameter to JIT kernel and adapter classes for improved compilation control by @xwhzz in #656
- add the support of rocm arch detecting by @zhangnju in #661
- [BugFix] Do not modify strict layout in common or relax level of layout inference. More conditions on layout checking by @tzj-fxz in #653
- [Bugfix][Docs] Update documentation build process and configurations for autoapi support by @xwhzz in #663
- [Enhancement] Improve buffer conflict detection in thread storage synchronization by @LeiWang1999 in #658
- [Bugfix] Consider buffer data type into indices provably disjoint analysis by @LeiWang1999 in #664
- [Bugfix] Remove redundant T.fill to fix precision issue by @xuchangtolearn in #667
- [Enhancement] Refactor buffer index handling for improved precision a… by @Alex4210987 in #671
- Reverts #671 by @LeiWang1999 in #672
- [Bugfix] Passing correct nvcc to cmake by @chenyang78 in #670
- [CI] Improve format check output and automate commit of changes by @xwhzz in #669
- [Bugfix][CI] Use valid runner labels in workflow by @xwhzz in #674
- [Enhancement] passing verbose to LibraryGenerator by @chenyang78 in #673
- [Enhancement] Enhance lint error messaging in CI by @xwhzz in #675
- Refactor to support upstream tvm by @Hzfengsy in #595
- Do not check for short variables by @oraluben in #676
- [Refactor] Phaseout version with commit id in editable model by @LeiWang1999 in #677
- [CI] Update CI workflow to use Python 3.12 by @LeiWang1999 in #679
- [Enhancement] Output cache-file-related messages with verbose=True by @chenyang78 in #683
- [Enhancement] Enhance warp specialization logic by @chengyupku in #680
- Add Flash Attn example on amd mi300 series by @Alex4210987 in #682
- [Enhancement] Refactored buffer detection logic in warp_specialized_rewriter.cc by @chengyupku in #685
- [Fix] fix some issues with JIT decorators existing in the examples by @Cunxiao2002 in #681
- [Enhancement] Add `--ptxas-options=--register-usage-level=10` option by @LeiWang1999 in #684
- [Feature] Add auto vectorize for atomic add by @yyttt6 in #686
- [Refactor] Rebase pipeline injector from upstream tvm by @LeiWang1999 in #687
- [Refactor] Introduce GemmInst for different targets handling by @LeiWang1999 in #688
- [Enhancement] Optimize BF16 casting performance by @xwhzz in #689
- [Smem Reuse] Optimize to do memory alignment on identical buffers. by @LeiWang1999 in #693
- [Version] Keep local commit id as it somehow help with debugging by @LeiWang1999 in #697
- [Example] Optimize warp specialize flashmla example by @LeiWang1999 in #698
- Bump transformers from 4.52.1 to 4.53.0 in /examples/bitnet-1.58b by @dependabot[bot] in #700
- Gated Delta Net(GDN) kernel implementation in TileLang by @tzj-fxz in #695
- Trivial update to calculate target arch by @oraluben in #702
- [CI] Remove Flash Attention dependency by @LeiWang1999 in #705
- [Layout] Introduce a new layout inference mechanism by @LeiWang1999 in #699
- [Pipeline] Optimize inject software pipeline and pipeline planning pass by @LeiWang1999 in #706
- Low-bit kernels fix and implementation by @tzj-fxz in #704
- [Feat] Support gemm with stride by @smallscientist1 in #701
- [Enhancement] Add eviction policy support for TMA operations, enhance CUDA codegen, and introduce new pass config by @xwhzz in #690
- [Enhancement] Enhance the robustness and generality of MLA examples by @Rachmanino in #709
- [Refactor] MergeAnnotations function to accept Map<Any, Any> instead of Map<String, Any> by @LeiWang1999 in #710
- [Pipeline] Phaseout fragment and double buffer info from pipeline pass by @LeiWang1999 in #711
- [Pipeline] Skip condition expression analysis for global reading by @LeiWang1999 in #713
- [Index] Relocate Int64 Auto Promoter to ConfigBitWidth Pass, removing it from FlattenBuffer by @LeiWang1999 in #714
- [CI] Bind build-test CI to NVIDIA as AMD runners are being introduced by @LeiWang1999 in #718
- fix: NVRTC backend by @lucifer1004 in #717
- [CUDA] Init support for sm_120 by @oraluben in #716
- [Bugfix] Correct git configuration in docs CI by @xwhzz in #720
- [Chore] fix typos by @lucifer1004 in #719
- [CI][AMD] Add AMD GPU CI and fix some related bugs by @Alex4210987 in #694
- [Carver][Bugfix] Correct score function for warp tile selection in tensorcore policy by @NaOHCC in #724
- [Refactor] Refactor CUDA code generation to simplify eviction policy handling by @LeiWang1999 in #721
- [Language] Introduce `StridedTensor` to support non-contiguous torch inputs by @LeiWang1999 in #722
- [Enhancement][Bugfix] Fix bug in warp specialized pass and add gemm_sr fallback support for Hopper by @xwhzz in #712
- 📝 Add docstrings to `fix` by @coderabbitai[bot] in #726
- fix amd ci & add examples by @Alex4210987 in #729
- [Feature] Low-bit twiddling dequantization and FP4 GEMM by @tzj-fxz in #725
- 📝 Add docstrings to `mxfp4` by @coderabbitai[bot] in #732
- [Refactor] Refactor env into a more flexible version by @LeiWang1999 in #740
- [Bugfix] Align stride index validation with torch in CythonKernelWrapper by @LeiWang1999 in #743
- [Bugfix]:Fix atomic add auto vectorize memory access out of bound error by @yyttt6 in #742
- 📝 Add docstrings to `main` by @coderabbitai[bot] in #745
- [Refactor] Refactor barrier management by @LeiWang1999 in #744
- [Refactor] Merge bulk copy into copy and improve layout inference for bulk copy by @LeiWang1999 in #746
- [Refactor] Merge ThreadPartialSync and ThreadStorageSync by @LeiWang1999 in #741
- [Enhancement] Optimize loop body handling in IR by @chengyupku in #749
- [MXFP4] Fix bugs and optimize exponential operation by @tzj-fxz in #750
- [Enhancement] Add DispatchInstruction specialization for fp8 types in gemm_sm90.h by @LeiWang1999 in #751
- [Enhancement] Add shape checking for reduce options by @kurisu6912 in #748
- [Bugfix] Add missing FP8 header include by @LeiWang1999 in #752
- [MXFP4] Add bias to MXFP4 GEMM kernel by @tzj-fxz in #753
- [Bugfix][WS] Consider loop min e...
v0.1.5
What's Changed
- [Release] Bump version from 0.1.3 into 0.1.4 by @LeiWang1999 in #375
- [Enhancement] Remove redundant recursive rewrite rule for FloorDiv in RewriteSimplifier by @LeiWang1999 in #408
- [Docker] cu128 Support by @andyluo03 in #410
- [Refactor] Phaseout python dependencies `attrs` and `decorator` by @LeiWang1999 in #411
- [Language] make linter and type checker happy with mocking by @YouJiacheng in #407
- [Bugfix] Support larger than 256 box size tma copy by @LeiWang1999 in #413
- [Enhancement] Add get_nvcc_compiler function to retrieve nvcc path by @LeiWang1999 in #414
- Update lower.py to set default value for params by @Alex4210987 in #416
- [Enhancement] Support Auto Layout Inference and Parallelism with variable constraint by @LeiWang1999 in #417
- [Enhancement] Support to find Cython path more automatically by @FrozenGene in #418
- [Refactor] Enhance layout inference logic in ParallelOp by @chengyupku in #420
- [BugFix] Fix tvm simplify pass by @smallscientist1 in #421
- [Enhancement] Add TMA+WS support in pipeline planning logic by @chengyupku in #422
- [Language] Support tile operator `T.cumsum` by @LeiWang1999 in #423
- Delete testing/python/language/test_tilelang_language_reduce_sum.py by @LeiWang1999 in #424
- [Bugfix] Fix a bug for simplifier by @LeiWang1999 in #425
- [Layout] Enhance layout inference pass by @LeiWang1999 in #427
- [Enhancement] Remove DeReplicate during parallel loop layout inference by @LeiWang1999 in #430
- [Bugfix] Fix the test data distribution of cumsum by @LeiWang1999 in #432
- [Enhancement] Support cute mma tile mxn8ky by @LeiWang1999 in #434
- [Bugfix] Removed the behavior that treated global -> local as a copy operation. by @LeiWang1999 in #435
- [Language] Support accumulative `T.reduce_sum` by @LeiWang1999 in #436
- [Bugfix] Fix the unexpected keyword error of autotune by @yyttt6 in #438
- [Testing] Add atomic add test by @LeiWang1999 in #439
- [Typo] Rename warp_source to wrap_source by @lucifer1004 in #440
- [Refactor] Update KernelLaunch to clarify block name by @LeiWang1999 in #441
- [Enhancement] Reduce CPU overhead during kernel execution by @Cunxiao2002 in #437
- [Enhancement] Improve layout inference accuracy in ParallelOp by @LeiWang1999 in #442
- [Bugfix] Fix layout inference for free fragment buffer by @LeiWang1999 in #443
- Bump transformers from 4.48.0 to 4.50.0 in /examples/bitnet-1.58b by @dependabot in #444
- [Language] Support explicit programming for identified warp groups by @LeiWang1999 in #445
- [Bugfix] Fix safe memory legalization for fragment store by @LeiWang1999 in #446
- [Refactor] Separate warp specialize rewriter and tma barrier injector pass by @LeiWang1999 in #447
- [Enhancement] Add new examples for warp specialization and TMA integration by @LeiWang1999 in #448
- [Refactor] Phaseout torch>=2.2.0 dependency by @LeiWang1999 in #451
- [Feature] Add TILELANG_CHECK_LAST_ERROR macro for improved error handling in CUDA and HIP by @LeiWang1999 in #450
- [Enhancement] Introduce pass_configs parameter for kernel Caching by @LeiWang1999 in #452
- [Feature] Add cache directory management functions in tilelang.cache by @LeiWang1999 in #453
- [Bugfix] Fix get_swizzle_layout implementation. by @cherichy in #455
- [Refactor] Update barrier functions and add new example for GEMM with warp specialization by @LeiWang1999 in #456
- [Refactor] Include examples in CI by @LeiWang1999 in #457
- docs: add llvm version info to installation.md. by @AsakusaRinne in #459
- [CI] Add elementwise and gemv examples to CI. by @Cunxiao2002 in #458
- [Bugfix] Fix for T.copy with dynamic range by @LeiWang1999 in #462
- [Bugfix] Fix copy region automation for dynamic extent by @LeiWang1999 in #465
- [Feature] Implement fast integer power operation and related API by @LeiWang1999 in #466
- [Typo] Rename `power_of_int` to `pow_of_int` for consistency by @LeiWang1999 in #468
- [CI] Add BlocksparseGemm, Dynamic, and Cast examples to CI by @tzj-fxz in #467
- [Refactor] Update set_compile_args to allow None for out_idx parameter by @LeiWang1999 in #469
- [Refactor] Simplify buffer_region_to_tile_region function in copy.py by @LeiWang1999 in #470
- [CI] Add Convolution example to CI by @xwhzz in #473
- [BugFix] Correct argparse for example_convolution test by @xwhzz in #474
- [Refactor] set USE_LLVM to optional. by @hyx1999 in #476
- [CI] Add Analyzer and blocksparse_attention examples to CI by @yyttt6 in #472
- [Refactor] Skip patchelf if not installed by @LeiWang1999 in #477
- [Refactor] Improve layout equality checks and error messaging by @LeiWang1999 in #471
- [Doc] Update version retrieval in conf.py to read from VERSION file by @xwhzz in #478
- Fix Device Consistency in Autotuner Threads and Add Manual Profiler Check by @yuanjypku in #481
- [Bugfix] Check CUDA target before checking for TMA by @gau-nernst in #482
- [Bugfix] Use AutoTune cache_input_tensors properly by @yyttt6 in #483
- Revert "[Bugfix] Use AutoTune cache_input_tensors properly" by @LeiWang1999 in #488
- [Enhancement] Support register input for gemm when trans_a or trans_b is true by @LeiWang1999 in #490
- [CI] Add flash_decoding example to CI by @xuchangtolearn in #487
- [CI] Add Reminder Bot for pull request contributions by @xwhzz in #491
- [Refactor] Introduce quantize components of TileLang and add testing for dequant gemm example by @LeiWang1999 in #494
- [Enhancement] Introduce flag to visualize shared memory merge plan by @LeiWang1999 in #496
- [Refactor] Update main function structure in example scripts and add tests by @chengyupku in #475
- [Bugfix] Fix Hopper GEMM layout for small tile size by @LeiWang1999 in #497
- [Enhancement] Fallback transposed_ldmatrix into `SM75_U16x4_LDSM_N` when warp_n is 8 by @LeiWang1999 in #498
- [Bugfix] Rename SM75_U16x8_LDSM_N to SM75_U16x8_LDSM_T to reflect correct matrix type by @LeiWang1999 in #499
- [Refactor] Update GEMM layout and operand traits for improved CUDA compatibility by @LeiWang1999 in #500
- [Refactor] Update JIT kernel functions and streamline GEMM tests by @LeiWang1999 in #501
- Fix AMD Docker issues related to conda environment setup by @Hamerlate in #503
- [Refactor] Refactor `jit` to `_JitImplementation` to support `@tilelang.jit` by @LeiWang1999 in #502
- [Refactor] Adjust in fragment GEMM layout by @LeiWang1999 in #504
- [Refactor] Update GlobalMemChecker to Detect Lower Bound illegal memory access automatically by @LeiWang1999 in #505
- [Enhancement] Enhance ReduceOp and JITKernel for improved dimension handling and initialization by @LeiWang1999 in #507
- [Refactor] Update buffer handling in layout transformation to support layout on `T.view` by @LeiWang1999 in #509
- [Bugfix] Enhance smem copy selector for uncommon shape by @LeiWang1999 in https://github.com/tile-ai/tilelang...
v0.1.4
What's Changed
- [Bugfix] Support `T.clear` for let binding by @LeiWang1999 in #268
- [Bugfix] Add TMA and Producer Buffer Analysis in Warp Specialized Rewriter by @chengyupku in #269
- [Refactor] Improve flash attention example and layout comparison logic by @LeiWang1999 in #270
- [Bugfix]Add CUDA availability check in CtypesKernelAdapter by @XueSongTap in #267
- [CI] Add gemm performance test by @xwhzz in #274
- [Language] Introduce `T.ptr` and `T.Tensor` by @LeiWang1999 in #276 (see the sketch after this list)
- [Refactor] Enhance Autotune by @yyttt6 in #266
- [Refactor] Update cache key generation in KernelCache by @LeiWang1999 in #283
- [Docs][Tutorial] Add tutorial for auto-tuning by @yyttt6 in #285
- [Refactor] Deprecated `T.Buffer` as arguments and rename related calls into `T.Tensor` by @LeiWang1999 in #281
- [Doc] Update README.md to correct documentation link for TileLang debug tools by @chengyupku in #286
- [Feature] Introduce NoSetMaxNReg for warp specialization by @chengyupku in #289
- [Language] Proxy tvm ir to make linter happy by @LeiWang1999 in #287
- [Bugfix] Enable bfloat16 atomic operations only for CUDA architectures greater than 7.5 by @LeiWang1999 in #291
- [Doc] Update Python API docs generation by @xwhzz in #278
- [Doc] Remove citation page by @LeiWang1999 in #292
- [Dev] Correcting cxx compiler by @penguin-wwy in #294
- [doc/example] add gemv doc and examples by @botbw in #293
- [Feature] Implement ParallelLoopTransformer for enhanced loop analysis by @LeiWang1999 in #295
- [Enhancement] Update AtomicAdd functions for BFLOAT16 in common.h by @LeiWang1999 in #297
- [Refactor] Improve documentation and add detailed docstrings across multiple modules by @LeiWang1999 in #298
- [Bugfix] Correct method call for block reduction check when analyzing memory footprint by @NaOHCC in #299
- [Dynamic Symbolic] Refactor passes with dynamic symbolic and check shape bound precisely by @tzj-fxz in #302
- Add autotune to conv example by @yyttt6 in #301
- [Bugfix] Resolve autotuner bugs for blocksparse GEMM example by @tth37 in #300
- [Bugfix] Replace profiler.mod with profiler.adapter to fix AttributeError by @LeslinD in #305
- [Enhancement] Add support for CUDA architecture 8.9 in GEMM template by @LeiWang1999 in #304
- [BugFix] Fix unintended Git config overrides in CI runners by @xwhzz in #306
- [Cache] Implement in-memory cache by @LeiWang1999 in #308
- [Bugfix] Updated autotune usage in the examples to align with the latest changes by @LeiWang1999 in #309
- [Bugfix] Fix dynamic axis with variable extent by @LeiWang1999 in #311
- [Bugfix] Fix layout conflict issue for gqa decoding examples by @LeiWang1999 in #314
- [Bugfix] Fixed the handling logic of IfThenElseNode in if_stmt_binding by @chengyupku in #315
- [Bugfix] Fix logic error in ReduceOp when handling CUDA architecture by @chengyupku in #316
- [CostModel] Introduce cuda driver api to get precise shared memory capacity by @LeiWang1999 in #317
- [Dev] Add FP8 Quantization Examples and Absolute Maximum Reduction Operation Support by @chengyupku in #320
- [Tools] Summarize TFLOPS Information from a tilelang program by @yyttt6 in #321
- Support block_N sizes that are 2^n in deepgemm example by @zcnrex in #319
- [Feat] Enhance CUDA Property Handling by @LeiWang1999 in #322
- [Bugfix] add a patch to fix T.abs on float16 by @botbw in #325
- [AMD] Adapt rocm and support `T.gemm` with transpose_b=False for amd backend by @LeiWang1999 in #327
- [Dynamic Symbolic] Adaptively vectorize with different condition expressions by @tzj-fxz in #326
- [Bugfix] Fix fragment layout annotation in example gqa decode by @LeiWang1999 in #329
- [AMD] Support `Transpose_A=True` and GEMM_RS for hip backend by @LeiWang1999 in #331
- [Refactor] Optimize RMS normalization kernel in rms_norm.py by @chengyupku in #333
- [AMD] Fix for missing composable kernel include path when compile kernels on amd gpus by @LeiWang1999 in #334
- [Example] Add sparse gqa decode example by @xiayuqing0622 in #332
- [Enhancement] Enhance FP8/FP4 type handling in CUDA codegen by @LeiWang1999 in #323
- [Doc] Fix typo and heading level in GEMV tutorial by @yeh-sudo in #337
- [Dev] Add Group Cast FP8 Example by @chengyupku in #338
- [Enhancement] Support region padding when convert buffer load to buffer region by @LeiWang1999 in #342
- [Example] Add triton block sparse gqa decode by @YizhaoGao in #341
- [Enhancement] Support index bit width configuration by @LeiWang1999 in #343
- [Bugfix] Fix X_amax Correctness Issue in Group Cast FP8 by @chengyupku in #345
- [Bugfix] Fix Transposed Fragment Layout for amd GEMM_RS matrix core by @LeiWang1999 in #346
- [AutoTune] Refactor AutoTuneArtifact to utilize kernel as context instead of profiler by @LeiWang1999 in #344
- [Bugfix] Compile/"cached" still not loading cached kernel for example in example_mha_bwd by @Alex4210987 in #339
- [Refactor] Implement thread-local storage for FrameStack in frame.py and kernel.py by @LeiWang1999 in #352
- [Typo] Replace `kernel.func` with `kernel` in mla benchmark scripts by @LeiWang1999 in #354
- [AMD][Docker] Create Dockerfile for ROCm environment setup by @LeiWang1999 in #355
- [Enhancement] Update group_per_split_token_cast_to_fp8 to support multiple data types by @chengyupku in #356
- [Enhancement] Support pass config `disable_warp_specialize` to disable auto specialization on hopper by @LeiWang1999 in #357
- [Example] Introduce autotuning example for GEMM with enhanced configuration options by @chengyupku in #360
- [Example] Handle Scenarios in Which a Threadblock is Assigned Only Invalid Block Indices for Sparse Attention by @xiayuqing0622 in #361
- [Bugfix] Correct dynamic shared memory size error handling in HIP by @LeiWang1999 in #362
- [AMD] Implement Deepseek MLA for AMD by @LeiWang1999 in #363
- [Bugfix] Fix compilation issues for amd cdna element size check by @LeiWang1999 in #364
- [AMD] Support FlashMLA with num split template for AMD gpus by @LeiWang1999 in #366
- [MLA][AMD] Add amd mla benchmarking by @LeiWang1999 in #367
- [Bugfix] Adjust Autotuner threadpool `max_workers` limit to available CPUs by @tth37 in #368
- [Language] Introduce `T.any_of` and `T.all_of` to reduce a bool array by @LeiWang1999 in #371
- [AMD][Setup] Support HIP in setup.py by @zhhangBian in #369
- [Typo] Remove debug print by @LeiWang1999 in #373
- [Docs] Add AMD Flash MLA Documentation to Tutorials Section by @LeiWang1999 in #376
- [Bugfix] Add filelock for cython build by @LeiWang1999 in #377
- [Typo] Remove unused comments generated by copilot by @LeiWang1999 in #379
- [Doc] Add deepseek_mla to documentation index by @LeiWang1999 in #380
- [Refactor] Remove debug message in pass legalize_safe_memory_access by @LeiWang1999 in #381
- [Enhancement][Pipeline] More precise copy code block detection in pipeline by ...
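As background for the `T.Tensor` argument annotation introduced in #276 and adopted in place of `T.Buffer` in #281 (both listed above), here is a minimal, hedged sketch in the style of the project's elementwise examples. The helper name `add_one` and the tile sizes are illustrative, not from the release notes, and option names may differ between versions.

```python
import tilelang
import tilelang.language as T


def add_one(M, N, block_M=128, block_N=128, dtype="float32"):
    # Illustrative elementwise kernel: arguments are declared with T.Tensor, not T.Buffer.
    @T.prim_func
    def main(
        A: T.Tensor((M, N), dtype),
        B: T.Tensor((M, N), dtype),
    ):
        with T.Kernel(T.ceildiv(N, block_N), T.ceildiv(M, block_M), threads=128) as (bx, by):
            for i, j in T.Parallel(block_M, block_N):
                # each block handles a block_M x block_N tile
                B[by * block_M + i, bx * block_N + j] = A[by * block_M + i, bx * block_N + j] + 1.0
    return main
```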
v0.1.3
What's Changed
- [Docker] Add libstdcxx-ng-12 to Dockerfiles for CUDA versions by @LeiWang1999 in #160
- Add cpu jit with backend ctypes by @xs-keju in #154
- [Carver] Multi-Threads Compilation for Fast Auto Tuning by @SiriusNEO in #156
- [Refactor] Replace T.If with native Python if statement for mla paged kernel by @LeiWang1999 in #162
- [Enhancement] Improve CUDA path detection by @xwhzz in #157
- [Refactor] Replace `T.thread_binding` with `T.get_thread_binding` in examples and test cases by @LeiWang1999 in #163
- [Bugfix] Cast bool dtype into int8 in blocksparse examples by @LeiWang1999 in #167
- [Example] Implement NSA Decode tilelang examples by @LeiWang1999 in #168
- [Release] Bump version to v0.1.2.post1 by @LeiWang1999 in #166
- Use SS-GEMM for PV in mla by @YouJiacheng in #165
- [Example] Implement tilelang native sparse attention varlen example by @LeiWang1999 in #170
- [Bugfix] Implement boundary check for the buffer shape with dynamic symbolic by @LeiWang1999 in #173
- [AutoTune] Enable config-performance trace by @LeiWang1999 in #174
- [Feat] Append Pass Context and TMA lowering configuration option by @LeiWang1999 in #175
- [Feat] Introduce new caching mechanism for compiled kernels by @LeiWang1999 in #176
- [Refactor] Enhance GPU Kernel Launch with Environment Thread Creation by @LeiWang1999 in #178
- [Bugfix] Improve Thread Variable Handling in Layout Inference by @LeiWang1999 in #179
- [Examples] Implement NSA Backward kernels by @LeiWang1999 in #180
- [Enhancement] Optimize CMake build process with dynamic job count calculation by @LeiWang1999 in #183
- [Bugfix] Add dynamic shape support with out_idx in Cython JIT kernel compilation by @LeiWang1999 in #185
- [Dev][Bugfix] Add RMS Normalization Kernels and Fix Reduce Bug by @chengyupku in #188
- [Dev] Add the failed nvcc command to the exception message by @penguin-wwy in #189
- [Bugfix] Fix `T.copy` for scalar datatypes by @LeiWang1999 in #190
- [Enhancement] Simplify GEMM example with direct kernel compilation by @LeiWang1999 in #191
- [Bugfix] Make quickstart work properly on cu118 by @penguin-wwy in #193
- [Language] Support clamp in language by @hyx1999 in #192
- [Refactor] Add SetMaxNRegCollector to Improve Register Hint Handling in Warp Specialized Rewriter by @chengyupku in #194
- [Feature] Add TMA Store Synchronization Support by @chengyupku in #195
- Update expired example code. by @66RING in #196
- [CMake] Add CUDA Major Version Detection for Conditional Compilation by @chengyupku in #197
- [Feature] Support Async Pipeline inference within if scope by @LeiWang1999 in #198
- [Dev] Add new example for FlashAttention with pipelined execution by @chengyupku in #200
- [Enhancement] Enhancing the handling of conditional statements in the pipeline by @LeiWang1999 in #201
- [Feature] Upgrade cutlass version and support fp8 T.gemm by @zqh-wz in #202
- [Docker] Update Dockerfiles to specify exact version of libstdcxx-ng by @LeiWang1999 in #203
- [Dev] Add GQA backward example by @chengyupku in #205
- [LICENSE] Typo fix in LICENSE by @LeiWang1999 in #208
- [Enhancement] Allow mma fallback when wgmma is not supported by @LeiWang1999 in #206
- [Examples] Expand tuning configurations for FlashAttention example by @chenghuaWang in #204
- [Enhancement] Avoid tvm ffi handling when out_idx is specified by @LeiWang1999 in #209
- [Fix] Fix K // block_K to T.ceildiv(K,block_K) and add tests by @hyx1999 in #210
- [Dev] Implement IfStmtBinding and MergeIfStmt transformations by @chengyupku in #211
- [Language] Introduce `T.reshape` and `T.view` by @LeiWang1999 in #212
- [Enhancement] Improve device handling in Cython kernel adapter by @LeiWang1999 in #220
- [Enhancement] Update format script to support force compare with upstream by @LeiWang1999 in #221
- [Refactor] Introduce KernelParam integration across modules by @LeiWang1999 in #223
- [Bugfix] Fix mismatch of shared memory layout and mma atom on Hopper by @zqh-wz in #224
- [Refactor] Update kernel compilation and profiling in examples by @chengyupku in #225
- [Examples] Add fp8 gemm 2xAcc and deepgemm example by @cherichy in #217
- [Doc] Add instructions for installing nightly version by @xwhzz in #226
- [Bugfix] Disable force inline for ldmatrix by @LeiWang1999 in #227
- [Bugfix] Support duplicate tma desc declaration by @LeiWang1999 in #228
- [Refactor] Rename clamp functions and enhance dtype handling in tests by @LeiWang1999 in #232
- [Enhancement] Simplify kernel source extraction in JIT adapters by @LeiWang1999 in #230
- [Feature] Add reduce_max corresponding tests by @LeiWang1999 in #236
- [BugFix] Fix bug of missing MBarrierExpectTX by @chengyupku in #241
- [Refactor] Refactor for Better Layout Conflict Handling by @LeiWang1999 in #240
- [Refactor] Align torch_assert_close tensor comparison with torch.testing.assert_close by @xwhzz in #239
- [Dev] Implement FlashAttention3 Backward by @chengyupku in #244
- [BugFix] Fix bug of mismatching dtype in testing by @xwhzz in #245
- [Enhancement] Add zero initialization option to GEMM operations by @chengyupku in #246
- [Enhancement][CUDA] Avoid C7508 for CUDA backend via assigning default value to `minBlocksPerMultiprocesor` by @cherichy in #248
- [Feature] Add database storage for JITKernel cache with Cython and Ctypes adapters by @Alex4210987 in #213
- [Examples] Implement elementwise add kernel by @chenghuaWang in #219
- [Refactor] Phaseout LLVM Dependency by Making it Optional by @LeiWang1999 in #247
- [Readme] Update Bib Citation Section by @LeiWang1999 in #249
- [Enhancement] Support float variable as arguments by @LeiWang1999 in #250
- add autotune to example_gemm.py by @yyttt6 in #252
- [Language] Introduce `T.alloc_var` to define a variable like `int var;` by @LeiWang1999 in #255
- [Example] Implement Kernel Example cumsum by @LeiWang1999 in #258
- [Refactor] Refactor CUDA post-processing callback registration in TileLang by @LeiWang1999 in #259
- [Refactor] Move compilation outside critical section by @YouJiacheng in #260
- [CI] Use auditwheel to generate manylinux wheels by @oraluben in #251
- [Bugfix] Fix Benchmark/Example Code for Autotuning by @SiriusNEO in #254
- [Language] Enhance alias to support blockwise memory load by @LeiWang1999 in #261
- [Bugfix] Fix auto tuning tma handling by @LeiWang1999 in #263
- [Release] Bump version to 0.1.3 by @LeiWang1999 in #264
New Contributors
- @xs-keju made their first contribution in #154
- @YouJiacheng made their first contribution in #165
- @penguin-wwy made their first contribution in #189
- @hyx1999 made their first contribution in #192
- @66RING made their first contribution in https://github.com/tile-ai/tilelang/pull/...
v0.1.2.post1
Why do we need this post release?
The v0.1.2 prebuilt package used a legacy Cython file, which could lead to some bugs.
What's Changed
- [Docker] Add libstdcxx-ng-12 to Dockerfiles for CUDA versions by @LeiWang1999 in #160
- Add cpu jit with backend ctypes by @xs-keju in #154
- [Carver] Multi-Threads Compilation for Fast Auto Tuning by @SiriusNEO in #156
- [Refactor] Replace T.If with native Python if statement for mla paged kernel by @LeiWang1999 in #162
- [Enhancement] Improve CUDA path detection by @xwhzz in #157
- [Refactor] Replace `T.thread_binding` with `T.get_thread_binding` in examples and test cases by @LeiWang1999 in #163
- [Bugfix] Cast bool dtype into int8 in blocksparse examples by @LeiWang1999 in #167
- [Example] Implement NSA Decode tilelang examples by @LeiWang1999 in #168
New Contributors
- @xs-keju made their first contribution in #154
Full Changelog: v0.1.2...v0.1.2.post1
v0.1.2
What's Changed
- [Dev] Add MLA and GQA decode examples by @chengyupku in #109
- [Example] Add Split-K and Stream-K Examples and move MLA from fld to mla by @LeiWang1999 in #110
- [Typo] Fix a typo in gemm splitk examples by @LeiWang1999 in #111
- [Typo] Fix links in installation instructions in README.md by @xwhzz in #112
- [Typo] Fix formatting in installation instructions in README.md by @xwhzz in #113
- [Benchmark] Add benchmark scripts for block sparse attention by @LeiWang1999 in #114
- [Dev] Support vectorized value pack and atomicAdd for BFloat16 DType by @LeiWang1999 in #116
- [Bugfix] Bugfix of pass order for hopper by @chengyupku in #117
- [Dev] Update MLA decode kernel by @chengyupku in #120
- [Example] Add GQA Example by @LeiWang1999 in #118
- [Example] Implement TileLang Native Sparse Attention Kernel by @LeiWang1999 in #121
- [Doc] Update README.md with new example links for Flash MLA Decoding and Native Sparse Attention by @chengyupku in #122
- [Example] Update GEMM FP8 Example by @LeiWang1999 in #123
- [Dev] Add RetNet Linear Attention example by @chengyupku in #124
- [JIT] Enhance cython/ctypes wrapper for tma descriptor by @LeiWang1999 in #126
- [Dev][Bugfix] Fix bug in ThreadTagChecker; Add WgmmaSync rewriter and add MHA WGMMA pipelined example by @chengyupku in #128
- [Dev] Remove buffer flatten when debug print a shared buffer by @LeiWang1999 in #129
- [Debug] Support `T.print` for `fragment` scope by @LeiWang1999 in #130
- [Example] Implement FMHA Varlen Example by @LeiWang1999 in #131
- [Refactor] Set default log level from warning to info by @LeiWang1999 in #132
- [Kernel] Implement different SEQ Q/KV examples with block sparse by @LeiWang1999 in #133
- [Dev][Doc] Add DeepSeek MLA Decode Example with Documentation and Performance Benchmarks by @chengyupku in #134
- [Doc] Update MLA Documentation by @chengyupku in #135
- [Debug] Improve Memory Layout Plot by @LeiWang1999 in #136
- [Doc] Add MLA Decoding Performance Benchmarks and Documentation by @chengyupku in #137
- [Bugfix] Add missing definition for AtomicAdd by @LeiWang1999 in #138
- [Dev][Doc] Enhance Flash Attention Implementation in GQA Decoding Example and Fix Typo by @chengyupku in #139
- [Dev] Adjust computation logic to avoid precision loss when casting acc_s from float to float16 by @chengyupku in #141
- [Refactor] Rename gemm fp8 example as we currently lack `T.gemm` support for fp8 by @LeiWang1999 in #144
- [Enhancement] Support debug print for unsigned char datatype by @LeiWang1999 in #145
- [Enhancement] Enable runtime tensor data type validation by @LeiWang1999 in #146
- [Refactor] Adapt Caver to benchmark by @LeiWang1999 in #148
- [Refactor] Remove BitBLAS Import Check in Benchmark by @SiriusNEO in #150
- [Enhancement] Optimize TileLang install scripts with Dynamic CPU Cores by @LeiWang1999 in #152
- [Carver] Enhance Carver Adaptation for MatMul Benchmarking by @LeiWang1999 in #153
- [Dev][Benchmark] Add MLA paged decoding example and benchmark script by @chengyupku in #158
- [Release] Bump Version to v0.1.2 by @LeiWang1999 in #155
New Contributors
- @SiriusNEO made their first contribution in #150
Full Changelog: v0.1.1...v0.1.2
v0.1.1
What's Changed
- [Doc] Update release news by @LeiWang1999 in #80
- [Doc] Convert docs from rst format to Markdown format. by @xwhzz in #82
- [Bugfix] Bugfix of installing with develop mode by @LeiWang1999 in #81
- [WHL] Support whl building for different python versions via tox by @LeiWang1999 in #83
- [Refactor] Separate tilelang Pass Thread Sync (with Hopper support) from tvm by @LeiWang1999 in #85
- [Backend][WebGPU] Support WebGPU WGSL code generation by @LeiWang1999 in #86
- [Wheel] Support pypi build scripts for different python via tox by @LeiWang1999 in #93
- [Wrap] Use a ctypes-based kernel wrapper instead of dlpack for runtime efficiency by @LeiWang1999 in #95
- [Bugfix] Update Dockerfile.cu120 by @LeiWang1999 in #98
- [Bugfix] Put `InjectPtxAsyncCopy` Pass behind `ThreadSync` Pass by @LeiWang1999 in #97
- [Feature] Add CTypes JIT kernel support by @LeiWang1999 in #100
- [Docker] Add Dockerfiles for multiple CUDA versions by @LeiWang1999 in #103
- [JIT] Support Cython jit and make cython a default execution backend by @LeiWang1999 in #102
- [Refactor] Phase out torch cpp extension backend by @LeiWang1999 in #104
- [Wheel] Provide a bare docker scripts to help build wheels for manylinux by @LeiWang1999 in #105
- [Example] Implement simple block sparse kernel by @LeiWang1999 in #106
- [Release] Bump version to v0.1.1 by @LeiWang1999 in #107
Full Changelog: v0.1.0...v0.1.1
v0.1.0
What's Changed
- [LICENSE] Add LICENSE for flashinfer by @LeiWang1999 in #19
- [Doc] Fix installation scripts and docs for dequantize gemm by @LeiWang1999 in #20
- [Doc] Use sphinx to generate docs. by @xwhzz in #21
- [Doc] update installation.md and readme by @Cunxiao2002 in #22
- [Doc] fix a typo in installation.rst by @Cunxiao2002 in #24
- [Doc] Remove legacy files and update reference by @LeiWang1999 in #25
- [CI][Test] Add test cases for tilelang transforms `AnnotateDeviceRegions` and `MakePackedAPI` by @LeiWang1999 in #26
- [Doc] Create a workflow to host docs using GitHub Pages. by @xwhzz in #28
- [CI][Test] Add test cases for tilelang transform InjectSoftwarePipeline and FrontendLegalize by @Cunxiao2002 in #30
- [Bugfix] Replace thread binding detector in LayoutInference Pass by @LeiWang1999 in #31
- [CI] Comprehensive Test cases Implementation of Matmul Dequantize by @LeiWang1999 in #32
- [Doc] Update GitHub Actions workflow for documentation deployment and add CNAME file. by @xwhzz in #33
- [Refactor] Simplify interface via replacing argument thread binding of intrinsics with `KernelFrame.Current` by @LeiWang1999 in #34
- [Bugfix] Reorder Passes: Place Vectorize Loop Before StorageFlatten and FlattenBuffer to Prevent Redundant Allocations by @LeiWang1999 in #37
- [Doc] Update documentation structure and content by @LeiWang1999 in #39
- [Doc][CI] Update GitHub Actions workflow for documentation build and deployment. by @xwhzz in #42
- [CI] Allow manual triggering of documentation workflow in addition to… by @xwhzz in #43
- [CI][Test] Add test cases for tilelang transform PipelinePlanning by @Cunxiao2002 in #44
- [CI][Test] Add test cases for tilelang transforms `LayoutInference` and `LowerTileOp` on loop tail split functionality by @tzj-fxz in #29
- [Debug] Introduce `T.print` for buffer and variables logging on frontend by @LeiWang1999 in #45 (see the sketch after this list)
- [CI] Change pull request trigger to `pull_request_target` for documen… by @xwhzz in #48
- [Dev] Add FlashDecoding example by @chengyupku in #46
- [Doc] update README that tilelang has been used in AttentionEngine by @smallscientist1 in #50
- [Doc] Remove unnecessary layout annotation by @LeiWang1999 in #49
- [CI][Test] Add test cases for tilelang kernel convolution by @chengyupku in #51
- [Dev] Implement test case for tilelang transformations by @LeiWang1999 in #53
- [CI][Test] Add test cases for tilelang kernel FlashAttention by @chengyupku in #54
- [CI][Test] Add test cases for element_add by @Cunxiao2002 in #47
- [CI] Clean up target repository before publishing documentation. by @xwhzz in #55
- [CI][Test] Add test cases for tilelang transform ClusterPlanning by @chengyupku in #57
- [Doc] Append debug relevant testing and documentations by @LeiWang1999 in #58
- [CI][Test] Add test cases for tilelang transform LowerHopperIntrin by @chengyupku in #59
- [Doc] Add matmul kernel tutorial with tile library by @LeiWang1999 in #60
- [Dev] Separate `LoopVectorize` Pass from upstream tvm by @LeiWang1999 in #62
- [Dev] Support FP8 Codegen for cuda backend by @LeiWang1999 in #64
- [Dev] Add test case for bfloat16 and int4 gemm with mma by @LeiWang1999 in #65
- [CI][Test] Add test cases for tilelang transform InjectFenceProxy by @chengyupku in #66
- [Tools] Introduce `plot_layout` to visualize the fragment layout by @LeiWang1999 in #68
- [Dev] Remove unnecessary python dependencies by @LeiWang1999 in #69
- [Carver] Introduce a tile-structure based cost model for auto tuning by @LeiWang1999 in #70
- [Bugfix] bug fix for bitblas dependency by @LeiWang1999 in #71
- [CI][Test] Add test cases for tilelang transform MultiVersionBuffer and WarpSpecialized by @chengyupku in #72
- [CostModel][Carver] Support Hint Recommend for Shared memory Kernel Fusion by @LeiWang1999 in #73
- [Carver] Remove legacy todo items in carver's readme by @LeiWang1999 in #74
- [Dev] Add mha backward example by @chengyupku in #77
- [Release] Bump version into v0.1.0 by @LeiWang1999 in #76
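To give a flavor of the frontend debug printing introduced in #45 above, here is a small, illustrative sketch. It is written against the current `T.Tensor`-style frontend rather than the v0.1.0-era API, the helper name `debug_copy` is made up, and the optional `msg` label on `T.print` is an assumption to be checked against the debug-tools documentation.

```python
import tilelang
import tilelang.language as T


def debug_copy(M=16, N=16, dtype="float16"):
    # Hypothetical kernel that copies A into B and prints intermediate values.
    @T.prim_func
    def main(A: T.Tensor((M, N), dtype), B: T.Tensor((M, N), dtype)):
        with T.Kernel(1, threads=128) as bx:
            A_shared = T.alloc_shared((M, N), dtype)
            T.copy(A, A_shared)
            T.print(A_shared)  # dump a shared buffer from device code
            T.print(bx, msg="block index")  # `msg=` label is assumed, not confirmed here
            T.copy(A_shared, B)
    return main
```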
New Contributors
- @xwhzz made their first contribution in #21
- @Cunxiao2002 made their first contribution in #22
- @tzj-fxz made their first contribution in #29
- @chengyupku made their first contribution in #46
- @smallscientist1 made their first contribution in #50
Full Changelog: v0.0.1...v0.1.0