v0.1.4

@LeiWang1999 LeiWang1999 released this 18 Apr 09:14
· 916 commits to main since this release
a41a473

What's Changed

  • [Bugfix] Support T.clear for let binding by @LeiWang1999 in #268
  • [Bugfix] Add TMA and Producer Buffer Analysis in Warp Specialized Rewriter by @chengyupku in #269
  • [Refactor] Improve flash attention example and layout comparison logic by @LeiWang1999 in #270
  • [Bugfix] Add CUDA availability check in CtypesKernelAdapter by @XueSongTap in #267
  • [CI] Add gemm performance test by @xwhzz in #274
  • [Language] Introduce T.ptr and T.Tensor by @LeiWang1999 in #276
  • [Refactor] Enhance Autotune by @yyttt6 in #266
  • [Refactor] Update cache key generation in KernelCache by @LeiWang1999 in #283
  • [Docs][Tutorial] Add tutorial for auto-tuning by @yyttt6 in #285
  • [Refactor] Deprecate T.Buffer as an argument type and rename related calls to T.Tensor by @LeiWang1999 in #281
  • [Doc] Update README.md to correct documentation link for TileLang debug tools by @chengyupku in #286
  • [Feature] Introduce NoSetMaxNReg for warp specialization by @chengyupku in #289
  • [Language] Proxy tvm ir to make linter happy by @LeiWang1999 in #287
  • [Bugfix] Enable bfloat16 atomic operations only for CUDA architectures greater than 7.5 by @LeiWang1999 in #291
  • [Doc] Update Python API docs generation by @xwhzz in #278
  • [Doc] Remove citation page by @LeiWang1999 in #292
  • [Dev] Correcting cxx compiler by @penguin-wwy in #294
  • [doc/example] add gemv doc and examples by @botbw in #293
  • [Feature] Implement ParallelLoopTransformer for enhanced loop analysis by @LeiWang1999 in #295
  • [Enhancement] Update AtomicAdd functions for BFLOAT16 in common.h by @LeiWang1999 in #297
  • [Refactor] Improve documentation and add detailed docstrings across multiple modules by @LeiWang1999 in #298
  • [Bugfix] Correct method call for block reduction check when analyzing memory footprint by @NaOHCC in #299
  • [Dynamic Symbolic] Refactor passes with dynamic symbolic and check shape bound precisely by @tzj-fxz in #302
  • Add autotune to conv example by @yyttt6 in #301
  • [Bugfix] Resolve autotuner bugs for blocksparse GEMM example by @tth37 in #300
  • [Bugfix] Replace profiler.mod with profiler.adapter to fix AttributeError by @LeslinD in #305
  • [Enhancement] Add support for CUDA architecture 8.9 in GEMM template by @LeiWang1999 in #304
  • [BugFix] Fix unintended Git config overrides in CI runners by @xwhzz in #306
  • [Cache] Implement in-memory cache by @LeiWang1999 in #308
  • [Bugfix] Updated autotune usage in the examples to align with the latest changes by @LeiWang1999 in #309
  • [Bugfix] Fix dynamic axis with variable extent by @LeiWang1999 in #311
  • [Bugfix] Fix layout conflict issue for gqa decoding examples by @LeiWang1999 in #314
  • [Bugfix] Fixed the handling logic of IfThenElseNode in if_stmt_binding by @chengyupku in #315
  • [Bugfix] Fix logic error in ReduceOp when handling CUDA architecture by @chengyupku in #316
  • [CostModel] Introduce cuda driver api to get precise shared memory capacity by @LeiWang1999 in #317
  • [Dev] Add FP8 Quantization Examples and Absolute Maximum Reduction Operation Support by @chengyupku in #320
  • [Tools] Summarize TFLOPS Information from a tilelang program by @yyttt6 in #321
  • Support block_N sizes that are 2^n in deepgemm example by @zcnrex in #319
  • [Feat] Enhance CUDA Property Handling by @LeiWang1999 in #322
  • [Bugfix] add a patch to fix T.abs on float16 by @botbw in #325
  • [AMD] Adapt ROCm and support T.gemm with transpose_b=False for the AMD backend by @LeiWang1999 in #327
  • [Dynamic Symbolic] Adaptively vectorize with different condition expressions by @tzj-fxz in #326
  • [Bugfix] Fix fragment layout annotation in example gqa decode by @LeiWang1999 in #329
  • [AMD] Support Transpose_A=True and GEMM_RS for hip backend by @LeiWang1999 in #331
  • [Refactor] Optimize RMS normalization kernel in rms_norm.py by @chengyupku in #333
  • [AMD] Fix missing composable kernel include path when compiling kernels on AMD GPUs by @LeiWang1999 in #334
  • [Example] Add sparse gqa decode example by @xiayuqing0622 in #332
  • [Enhancement] Enhance FP8/FP4 type handling in CUDA codegen by @LeiWang1999 in #323
  • [Doc] Fix typo and heading level in GEMV tutorial by @yeh-sudo in #337
  • [Dev] Add Group Cast FP8 Example by @chengyupku in #338
  • [Enhancement] Support region padding when convert buffer load to buffer region by @LeiWang1999 in #342
  • [Example] Add triton block sparse gqa decode by @YizhaoGao in #341
  • [Enhancement] Support index bit width configuration by @LeiWang1999 in #343
  • [Bugfix] Fix X_amax Correctness Issue in Group Cast FP8 by @chengyupku in #345
  • [Bugfix] Fix Transposed Fragment Layout for amd GEMM_RS matrix core by @LeiWang1999 in #346
  • [AutoTune] Refactor AutoTuneArtifact to utilize kernel as context instead of profiler by @LeiWang1999 in #344
  • [Bugfix] Compiled/cached kernel still not loading for the example_mha_bwd example by @Alex4210987 in #339
  • [Refactor] Implement thread-local storage for FrameStack in frame.py and kernel.py by @LeiWang1999 in #352
  • [Typo] Replace kernel.func with kernel in mla benchmark scripts by @LeiWang1999 in #354
  • [AMD][Docker] Create Dockerfile for ROCm environment setup by @LeiWang1999 in #355
  • [Enhancement] Update group_per_split_token_cast_to_fp8 to support multiple data types by @chengyupku in #356
  • [Enhancement] Support pass config disable_warp_specialize to disable auto specialization on hopper by @LeiWang1999 in #357
  • [Example] Introduce autotuning example for GEMM with enhanced configuration options by @chengyupku in #360
  • [Example] Handle Scenarios in Which a Threadblock is Assigned Only Invalid Block Indices for Sparse Attention by @xiayuqing0622 in #361
  • [Bugfix] Correct dynamic shared memory size error handling in HIP by @LeiWang1999 in #362
  • [AMD] Implement Deepseek MLA for AMD by @LeiWang1999 in #363
  • [Bugfix] Fix compilation issues for amd cdna element size check by @LeiWang1999 in #364
  • [AMD] Support FlashMLA with num split template for AMD gpus by @LeiWang1999 in #366
  • [MLA][AMD] Add amd mla benchmarking by @LeiWang1999 in #367
  • [Bugfix] Adjust Autotuner threadpool max_workers limit to available CPUs by @tth37 in #368
  • [Language] Introduce T.any_of and T.all_of to reduce a bool array by @LeiWang1999 in #371
  • [AMD][Setup] Support HIP in setup.py by @zhhangBian in #369
  • [Typo] Remove debug print by @LeiWang1999 in #373
  • [Docs] Add AMD Flash MLA Documentation to Tutorials Section by @LeiWang1999 in #376
  • [Bugfix] Add filelock for cython build by @LeiWang1999 in #377
  • [Typo] Remove unused comments generated by copilot by @LeiWang1999 in #379
  • [Doc] Add deepseek_mla to documentation index by @LeiWang1999 in #380
  • [Refactor] Remove debug message in pass legalize_safe_memory_access by @LeiWang1999 in #381
  • [Enhancement][Pipeline] More precise copy code block detection in pipeline by @LeiWang1999 in #384
  • [Revert] Revert modifications for pass FlattenBuffer by @LeiWang1999 in #385
  • [Dynamic Symbolic] Add pass_config to customize vectorization and tail split by @tzj-fxz in #383
  • [Pytest Fix] Wrap tests in dynamic benchmark by @tzj-fxz in #387
  • [Doc] Update README.md for deepseek_mla on AMD by @LeiWang1999 in #389
  • [Pipeline][Enhancement] Add copy_prepare stage to support mask and index caching by @LeiWang1999 in #392
  • [Refactor] Refactor warp_specialized_rewriter to support multiple acquire/release patterns by @chengyupku in #391
  • [Enhancement] Report Error Body in ParallelOp Layout Inference by @chengyupku in #394
  • [Bugfix] Support T.Parallel with local register assignment by @LeiWang1999 in #395
  • [Enhancement] Introduce a smarter warp partition strategy by @LeiWang1999 in #396
  • [Example] Add bitnet-1.58b examples by @LeiWang1999 in #399
  • Bump transformers from 4.40 to 4.48.0 in /examples/bitnet-1.58b by @dependabot in #400
  • [BugFix] Address should be aligned with access size in tail split by @tzj-fxz in #401
  • [Enhancement] Move T.any_of and T.all_of op registration from python into cpp by @Cunxiao2002 in #398
  • Add preliminary support for bf16 for AMD by @OscarSavolainen in #388
  • [BugFix] Improve condition robustness in dynamic vectorize by @tzj-fxz in #404
  • [CI] Update CI configuration to run pytest with automatic parallelization by @LeiWang1999 in #393
  • [Documentation] Fix Installation Documentation by @andyluo03 in #405

New Contributors

Full Changelog: v0.1.3...v0.1.4