Release v0.1.5 · tile-ai/tilelang

What's Changed

[Release] Bump version from 0.1.3 into 0.1.4 by @LeiWang1999 in #375
[Enhancement] Remove redundant recursive rewrite rule for FloorDiv in RewriteSimplifier by @LeiWang1999 in #408
[Docker] cu128 Support by @andyluo03 in #410
[Refactor] Phaseout python dependency attrs and decorator by @LeiWang1999 in #411
[Language] make linter and type checker happy with mocking by @YouJiacheng in #407
[Bugfix] Support larger than 256 box size tma copy by @LeiWang1999 in #413
[Enhancement] Add get_nvcc_compiler function to retrieve nvcc path by @LeiWang1999 in #414
Update lower.py to set default value for params by @Alex4210987 in #416
[Enhancement] Support Auto Layout Inference and Parallelism with variable constraint by @LeiWang1999 in #417
[Enhancement] Support to find Cython path more automatically by @FrozenGene in #418
[Refactor] Enhance layout inference logic in ParallelOp by @chengyupku in #420
[BugFix] Fix tvm simplify pass by @smallscientist1 in #421
[Enhancement] Add TMA+WS support in pipeline planning logic by @chengyupku in #422
[Language] Support tile operator T.cumsum by @LeiWang1999 in #423
Delete testing/python/language/test_tilelang_language_reduce_sum.py by @LeiWang1999 in #424
[Bugfix] Fix a bug for simplifier by @LeiWang1999 in #425
[Layout] Enhance layout inference pass by @LeiWang1999 in #427
[Enhancement] Remove DeReplicate during parallel loop layout inference by @LeiWang1999 in #430
[Bugfix] Fix the test data distribution of cumsum by @LeiWang1999 in #432
[Enhancement] Support cute mma tile mxn8ky by @LeiWang1999 in #434
[Bugfix] Removed the behavior that treated global -> local as a copy operation. by @LeiWang1999 in #435
[Language] Support accumulative T.reduce_sum by @LeiWang1999 in #436
[Bugfix] fix the unexpected keyword error of autotune by @yyttt6 in #438
[Testing] Add atomic add test by @LeiWang1999 in #439
[Typo] Rename warp_source to wrap_source by @lucifer1004 in #440
[Refactor] Update KernelLaunch to clarify block name by @LeiWang1999 in #441
[Enhancement] Reduce CPU overhead during kernel execution by @Cunxiao2002 in #437
[Enhancement] Improve layout inference accuracy in ParallelOp by @LeiWang1999 in #442
[Bugfix] Fix layout inference for free fragment buffer by @LeiWang1999 in #443
Bump transformers from 4.48.0 to 4.50.0 in /examples/bitnet-1.58b by @dependabot in #444
[Language] Support explicit programming for identified warp groups by @LeiWang1999 in #445
[Bugfix] Fix safe memory legalization for fragment store by @LeiWang1999 in #446
[Refactor] Separate warp specialize rewriter and tma barrier injector pass by @LeiWang1999 in #447
[Enhancement] Add new examples for warp specialization and TMA integration by @LeiWang1999 in #448
[Refactor] Phaseout torch>=2.2.0 dependency by @LeiWang1999 in #451
[Feature] Add TILELANG_CHECK_LAST_ERROR macro for improved error handling in CUDA and HIP by @LeiWang1999 in #450
[Enhancement] Introduce pass_configs parameter for kernel Caching by @LeiWang1999 in #452
[Feature] Add cache directory management functions in tilelang.cache by @LeiWang1999 in #453
[Bugfix] Fix get_swizzle_layout implementation. by @cherichy in #455
[Refactor] Update barrier functions and add new example for GEMM with warp specialization by @LeiWang1999 in #456
[Refactor] Include examples in CI by @LeiWang1999 in #457
docs: add llvm version info to installation.md. by @AsakusaRinne in #459
[CI] Add elementwise and gemv examples to CI. by @Cunxiao2002 in #458
[Bugfix] Fix for T.copy with dynamic range by @LeiWang1999 in #462
[Bugfix] Fix copy region automation for dynamic extent by @LeiWang1999 in #465
[Feature] Implement fast integer power operation and related API by @LeiWang1999 in #466
[Typo] Rename power_of_int with pow_of_int for consistency by @LeiWang1999 in #468
[CI] Add BlocksparseGemm, Dynamic, and Cast examples to CI by @tzj-fxz in #467
[Refactor] Update set_compile_args to allow None for out_idx parameter by @LeiWang1999 in #469
[Refactor] Simplify buffer_region_to_tile_region function in copy.py by @LeiWang1999 in #470
[CI] Add Convolution example to CI by @xwhzz in #473
[BugFix] Correct argparse for example_convolution test by @xwhzz in #474
[Refactor] set USE_LLVM to optional. by @hyx1999 in #476
[CI] Add Analyzer and blocksparse_attention examples to CI by @yyttt6 in #472
[Refactor] Skip patchelf if not installed by @LeiWang1999 in #477
[Refactor] Improve layout equality checks and error messaging by @LeiWang1999 in #471
[Doc] Update version retrieval in conf.py to read from VERSION file by @xwhzz in #478
Fix Device Consistency in Autotuner Threads and Add Manual Profiler Check by @yuanjypku in #481
[Bugfix] Check CUDA target before checking for TMA by @gau-nernst in #482
[Bugfix] Use AutoTune cache_input_tensors properly by @yyttt6 in #483
Revert "[Bugfix] Use AutoTune cache_input_tensors properly" by @LeiWang1999 in #488
[Enhancement] Support register input for gemm when trans_a or trans_b is true by @LeiWang1999 in #490
[CI] Add flash_decoding example to CI by @xuchangtolearn in #487
[CI] Add Reminder Bot for pull request contributions by @xwhzz in #491
[Refactor] Introduce quantize components of TileLang and add testing for dequant gemm example by @LeiWang1999 in #494
[Enhancement] Introduce flag to visualize shared memory merge plan by @LeiWang1999 in #496
[Refactor] Update main function structure in example scripts and add tests by @chengyupku in #475
[Bugfix] Fix Hopper GEMM layout for small tile size by @LeiWang1999 in #497
[Enhancement] Fallback transposed_ldmatrix into SM75_U16x4_LDSM_N when warp_n is 8 by @LeiWang1999 in #498
[Bugfix] Rename SM75_U16x8_LDSM_N to SM75_U16x8_LDSM_T to reflect correct matrix type by @LeiWang1999 in #499
[Refactor] Update GEMM layout and operand traits for improved CUDA compatibility by @LeiWang1999 in #500
[Refactor] Update JIT kernel functions and streamline GEMM tests by @LeiWang1999 in #501
Fix AMD Docker issues related to conda environment setup by @Hamerlate in #503
[Refactor] Refactor jit to _JitImplementation to support @tilelang.jit by @LeiWang1999 in #502
[Refactor] Adjust in fragment GEMM layout by @LeiWang1999 in #504
[Refactor] Update GlobalMemChecker to Detect Lower Bound illegal memory access automatically by @LeiWang1999 in #505
[Enhancement] Enhance ReduceOp and JITKernel for improved dimension handling and initialization by @LeiWang1999 in #507
[Refactor] Update buffer handling in layout transformation to support layout on T.view by @LeiWang1999 in #509
[Bugfix] Enhance smem copy selector for uncommon shape by @LeiWang1999 in #510
[Enhancement] Introduce padding annotation and improve legalize safe memory access pass by @LeiWang1999 in #511
[Refactor] Enhance MergeSharedMemoryAllocations Pass for Improved Liveness Analysis and Scope Management by @LeiWang1999 in #508
[Dev] Add grouped GEMM example with TileLang and PyTorch integration by @chengyupku in #514
[Dev] Add grouped GEMM backward example scripts by @chengyupku in #515
Fix deepgemm example by @benenzhu in #513
[Refactor] Support auto index bitwidth casting by @LeiWang1999 in #517
[Enhancement] Support auto synchronization for global memory access by @LeiWang1999 in #519
[Refactor] Replace default fp8 dtype with cute to perform fast cast by @LeiWang1999 in #520
[Enhancement] Add atomicAdd for FLOAT16x2 and FLOAT16x4 by @LeiWang1999 in #522
[Refactor] Reorganize Thread Synchronization Steps to make sure global synchronization can be correctly lowered by @LeiWang1999 in #521
[Enhancement] Add commit ID to versioning and improve logging initialization by @LeiWang1999 in #524
[Enhancement] Add warp specialization attribute handling in IR and rewriter by @chengyupku in #518
[CI] Add gemm and gemm_fp8 example to CI by @LeslinD in #516
[Refactor] Refactor convolution example to streamline configuration and remove unused code by @LeiWang1999 in #530
[Autotune] Introduce cache mechanism for auto tuner by @LeiWang1999 in #527
[Refactor]: add autotune example to convolution examples by @yyttt6 in #536
[Refactor] Disable legacy vectorization for buffer allocation by @LeiWang1999 in #535
[Language] Support T.annotate_l2_hit_ratio via cudaStreamSetAttribute by @LeiWang1999 in #539
[Bugfix] Fix a bug when simplifying warp combination for T.gemm by @LeiWang1999 in #540
[AMD] Support float8 matrix core by @LeiWang1999 in #537
[Doc] Include DeepWiki badge in README by @xwhzz in #541
[CI] Add hadamard example to CI by @Rachmanino in #549
[chore] set default build type to release if not provided by @botbw in #548
[Refactor] Include several examples into ci by @LeiWang1999 in #531
[AMD][Enhancement] Add support for Vectorized FP8 DataPacking by @LeiWang1999 in #542
[Bugfix] Enhance layout inference pass for flexibility by @LeiWang1999 in #550
[Autotune] Remove the out_idx argument from the autotune cache by @LeiWang1999 in #553
[CI] Add linear attention examples to CI by @Rachmanino in #552
[CI] Add norm and layout_plot by @Alex4210987 in #534

New Contributors

@FrozenGene made their first contribution in #418
@lucifer1004 made their first contribution in #440
@AsakusaRinne made their first contribution in #459
@yuanjypku made their first contribution in #481
@gau-nernst made their first contribution in #482
@xuchangtolearn made their first contribution in #487
@Hamerlate made their first contribution in #503
@benenzhu made their first contribution in #513
@Rachmanino made their first contribution in #549

Full Changelog: v0.1.4...v0.1.5
