# v0.1.4

## What's Changed
- [Bugfix] Support `T.clear` for let binding by @LeiWang1999 in #268
- [Bugfix] Add TMA and Producer Buffer Analysis in Warp Specialized Rewriter by @chengyupku in #269
- [Refactor] Improve flash attention example and layout comparison logic by @LeiWang1999 in #270
- [Bugfix] Add CUDA availability check in CtypesKernelAdapter by @XueSongTap in #267
- [CI] Add gemm performance test by @xwhzz in #274
- [Language] Introduce `T.ptr` and `T.Tensor` by @LeiWang1999 in #276
- [Refactor] Enhance Autotune by @yyttt6 in #266
- [Refactor] Update cache key generation in KernelCache by @LeiWang1999 in #283
- [Docs][Tutorial] Add tutorial for auto-tuning by @yyttt6 in #285
- [Refactor] Deprecate `T.Buffer` as an argument type and rename related calls to `T.Tensor` by @LeiWang1999 in #281
- [Doc] Update README.md to correct documentation link for TileLang debug tools by @chengyupku in #286
- [Feature] Introduce NoSetMaxNReg for warp specialization by @chengyupku in #289
- [Language] Proxy tvm ir to make linter happy by @LeiWang1999 in #287
- [Bugfix] Enable bfloat16 atomic operations only for CUDA architectures greater than 7.5 by @LeiWang1999 in #291
- [Doc] Update Python API docs generation by @xwhzz in #278
- [Doc] Remove citation page by @LeiWang1999 in #292
- [Dev] Correct cxx compiler by @penguin-wwy in #294
- [Doc][Example] Add GEMV doc and examples by @botbw in #293
- [Feature] Implement ParallelLoopTransformer for enhanced loop analysis by @LeiWang1999 in #295
- [Enhancement] Update AtomicAdd functions for BFLOAT16 in common.h by @LeiWang1999 in #297
- [Refactor] Improve documentation and add detailed docstrings across multiple modules by @LeiWang1999 in #298
- [Bugfix] Correct method call for block reduction check when analyzing memory footprint by @NaOHCC in #299
- [Dynamic Symbolic] Refactor passes with dynamic symbolic and check shape bound precisely by @tzj-fxz in #302
- Add autotune to conv example by @yyttt6 in #301
- [Bugfix] Resolve autotuner bugs for blocksparse GEMM example by @tth37 in #300
- [Bugfix] Replace profiler.mod with profiler.adapter to fix AttributeError by @LeslinD in #305
- [Enhancement] Add support for CUDA architecture 8.9 in GEMM template by @LeiWang1999 in #304
- [BugFix] Fix unintended Git config overrides in CI runners by @xwhzz in #306
- [Cache] Implement in-memory cache by @LeiWang1999 in #308
- [Bugfix] Updated autotune usage in the examples to align with the latest changes by @LeiWang1999 in #309
- [Bugfix] Fix dynamic axis with variable extent by @LeiWang1999 in #311
- [Bugfix] Fix layout conflict issue for gqa decoding examples by @LeiWang1999 in #314
- [Bugfix] Fixed the handling logic of IfThenElseNode in if_stmt_binding by @chengyupku in #315
- [Bugfix] Fix logic error in ReduceOp when handling CUDA architecture by @chengyupku in #316
- [CostModel] Introduce CUDA driver API to get precise shared memory capacity by @LeiWang1999 in #317
- [Dev] Add FP8 Quantization Examples and Absolute Maximum Reduction Operation Support by @chengyupku in #320
- [Tools] Summarize TFLOPS Information from a tilelang program by @yyttt6 in #321
- Support block_N sizes that are 2^n in deepgemm example by @zcnrex in #319
- [Feat] Enhance CUDA Property Handling by @LeiWang1999 in #322
- [Bugfix] add a patch to fix T.abs on float16 by @botbw in #325
- [AMD] Adapt ROCm and support `T.gemm` with `transpose_b=False` for the AMD backend by @LeiWang1999 in #327
- [Dynamic Symbolic] Adaptively vectorize with different condition expressions by @tzj-fxz in #326
- [Bugfix] Fix fragment layout annotation in example gqa decode by @LeiWang1999 in #329
- [AMD] Support `Transpose_A=True` and GEMM_RS for the HIP backend by @LeiWang1999 in #331
- [Refactor] Optimize RMS normalization kernel in rms_norm.py by @chengyupku in #333
- [AMD] Fix missing composable kernel include path when compiling kernels on AMD GPUs by @LeiWang1999 in #334
- [Example] Add sparse gqa decode example by @xiayuqing0622 in #332
- [Enhancement] Enhance FP8/FP4 type handling in CUDA codegen by @LeiWang1999 in #323
- [Doc] Fix typo and heading level in GEMV tutorial by @yeh-sudo in #337
- [Dev] Add Group Cast FP8 Example by @chengyupku in #338
- [Enhancement] Support region padding when convert buffer load to buffer region by @LeiWang1999 in #342
- [Example] Add triton block sparse gqa decode by @YizhaoGao in #341
- [Enhancement] Support index bit width configuration by @LeiWang1999 in #343
- [Bugfix] Fix X_amax Correctness Issue in Group Cast FP8 by @chengyupku in #345
- [Bugfix] Fix transposed fragment layout for AMD GEMM_RS matrix core by @LeiWang1999 in #346
- [AutoTune] Refactor AutoTuneArtifact to utilize kernel as context instead of profiler by @LeiWang1999 in #344
- [Bugfix] Fix compile/cached path still not loading the cached kernel in example_mha_bwd by @Alex4210987 in #339
- [Refactor] Implement thread-local storage for FrameStack in frame.py and kernel.py by @LeiWang1999 in #352
- [Typo] Replace `kernel.func` with `kernel` in MLA benchmark scripts by @LeiWang1999 in #354
- [AMD][Docker] Create Dockerfile for ROCm environment setup by @LeiWang1999 in #355
- [Enhancement] Update group_per_split_token_cast_to_fp8 to support multiple data types by @chengyupku in #356
- [Enhancement] Support pass config `disable_warp_specialize` to disable auto specialization on Hopper by @LeiWang1999 in #357
- [Example] Introduce autotuning example for GEMM with enhanced configuration options by @chengyupku in #360
- [Example] Handle Scenarios in Which a Threadblock is Assigned Only Invalid Block Indices for Sparse Attention by @xiayuqing0622 in #361
- [Bugfix] Correct dynamic shared memory size error handling in HIP by @LeiWang1999 in #362
- [AMD] Implement Deepseek MLA for AMD by @LeiWang1999 in #363
- [Bugfix] Fix compilation issues for amd cdna element size check by @LeiWang1999 in #364
- [AMD] Support FlashMLA with num split template for AMD gpus by @LeiWang1999 in #366
- [MLA][AMD] Add amd mla benchmarking by @LeiWang1999 in #367
- [Bugfix] Adjust Autotuner thread pool `max_workers` limit to available CPUs by @tth37 in #368
- [Language] Introduce `T.any_of` and `T.all_of` to reduce a bool array by @LeiWang1999 in #371
- [AMD][Setup] Support HIP in setup.py by @zhhangBian in #369
- [Typo] Remove debug print by @LeiWang1999 in #373
- [Docs] Add AMD Flash MLA Documentation to Tutorials Section by @LeiWang1999 in #376
- [Bugfix] Add filelock for cython build by @LeiWang1999 in #377
- [Typo] Remove unused comments generated by copilot by @LeiWang1999 in #379
- [Doc] Add deepseek_mla to documentation index by @LeiWang1999 in #380
- [Refactor] Remove debug message in pass legalize_safe_memory_access by @LeiWang1999 in #381
- [Enhancement][Pipeline] More precise copy code block detection in pipeline by @LeiWang1999 in #384
- [Revert] Revert modifications for pass FlattenBuffer by @LeiWang1999 in #385
- [Dynamic Symbolic] Add pass_config to customize vectorization and tail split by @tzj-fxz in #383
- [Pytest Fix] Wrap tests in dynamic benchmark by @tzj-fxz in #387
- [Doc] Update README.md for deepseek_mla on AMD by @LeiWang1999 in #389
- [Pipeline][Enhancement] Add copy_prepare stage to support mask and index caching by @LeiWang1999 in #392
- [Refactor] Refactor warp_specialized_rewriter to support multiple acquire/release patterns by @chengyupku in #391
- [Enhancement] Report Error Body in ParallelOp Layout Inference by @chengyupku in #394
- [Bugfix] Support `T.Parallel` with local register assignment by @LeiWang1999 in #395
- [Enhancement] Introduce a smarter warp partition strategy by @LeiWang1999 in #396
- [Example] Add bitnet-1.58b examples by @LeiWang1999 in #399
- Bump transformers from 4.40 to 4.48.0 in /examples/bitnet-1.58b by @dependabot in #400
- [BugFix] Address should be aligned with access size in tail split by @tzj-fxz in #401
- [Enhancement] Move T.any_of and T.all_of op registration from python into cpp by @Cunxiao2002 in #398
- Add preliminary support for bf16 for AMD by @OscarSavolainen in #388
- [BugFix] Improve condition robustness in dynamic vectorize by @tzj-fxz in #404
- [CI] Update CI configuration to run pytest with automatic parallelization by @LeiWang1999 in #393
- [Documentation] Fix Installation Documentation by @andyluo03 in #405
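Among the language additions above, #371 introduces `T.any_of` and `T.all_of` for reducing a boolean array inside a kernel. As a rough, host-side sketch of the intended semantics only (the in-kernel TileLang API differs; Python's built-ins are used here purely as an analogue):

```python
# Illustrative sketch only: models the reduction semantics of
# T.any_of / T.all_of (#371) with plain Python built-ins.
# This is NOT the TileLang API itself.
def any_of(flags):
    # True if at least one element is True
    return any(flags)

def all_of(flags):
    # True only if every element is True
    return all(flags)

flags = [True, False, True]
print(any_of(flags))  # True
print(all_of(flags))  # False
```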
## New Contributors
- @XueSongTap made their first contribution in #267
- @botbw made their first contribution in #293
- @NaOHCC made their first contribution in #299
- @tth37 made their first contribution in #300
- @LeslinD made their first contribution in #305
- @zcnrex made their first contribution in #319
- @xiayuqing0622 made their first contribution in #332
- @yeh-sudo made their first contribution in #337
- @YizhaoGao made their first contribution in #341
- @zhhangBian made their first contribution in #369
- @dependabot made their first contribution in #400
- @OscarSavolainen made their first contribution in #388
- @andyluo03 made their first contribution in #405
**Full Changelog**: v0.1.3...v0.1.4