v0.1.5

@LeiWang1999 released this 05 Jun 08:32
· 927 commits to main since this release
a32009b

What's Changed

  • [Release] Bump version from 0.1.3 into 0.1.4 by @LeiWang1999 in #375
  • [Enhancement] Remove redundant recursive rewrite rule for FloorDiv in RewriteSimplifier by @LeiWang1999 in #408
  • [Docker] cu128 Support by @andyluo03 in #410
  • [Refactor] Phase out Python dependencies attrs and decorator by @LeiWang1999 in #411
  • [Language] make linter and type checker happy with mocking by @YouJiacheng in #407
  • [Bugfix] Support larger than 256 box size tma copy by @LeiWang1999 in #413
  • [Enhancement] Add get_nvcc_compiler function to retrieve nvcc path by @LeiWang1999 in #414
  • Update lower.py to set default value for params by @Alex4210987 in #416
  • [Enhancement] Support Auto Layout Inference and Parallelism with variable constraint by @LeiWang1999 in #417
  • [Enhancement] Support finding the Cython path automatically by @FrozenGene in #418
  • [Refactor] Enhance layout inference logic in ParallelOp by @chengyupku in #420
  • [BugFix] Fix tvm simplify pass by @smallscientist1 in #421
  • [Enhancement] Add TMA+WS support in pipeline planning logic by @chengyupku in #422
  • [Language] Support tile operator T.cumsum by @LeiWang1999 in #423
  • Delete testing/python/language/test_tilelang_language_reduce_sum.py by @LeiWang1999 in #424
  • [Bugfix] Fix a bug for simplifier by @LeiWang1999 in #425
  • [Layout] Enhance layout inference pass by @LeiWang1999 in #427
  • [Enhancement] Remove DeReplicate during parallel loop layout inference by @LeiWang1999 in #430
  • [Bugfix] Fix the test data distribution of cumsum by @LeiWang1999 in #432
  • [Enhancement] Support cute mma tile mxn8ky by @LeiWang1999 in #434
  • [Bugfix] Removed the behavior that treated global -> local as a copy operation. by @LeiWang1999 in #435
  • [Language] Support accumulative T.reduce_sum by @LeiWang1999 in #436
  • [Bugfix] fix the unexpected keyword error of autotune by @yyttt6 in #438
  • [Testing] Add atomic add test by @LeiWang1999 in #439
  • [Typo] Rename warp_source to wrap_source by @lucifer1004 in #440
  • [Refactor] Update KernelLaunch to clarify block name by @LeiWang1999 in #441
  • [Enhancement] Reduce CPU overhead during kernel execution by @Cunxiao2002 in #437
  • [Enhancement] Improve layout inference accuracy in ParallelOp by @LeiWang1999 in #442
  • [Bugfix] Fix layout inference for free fragment buffer by @LeiWang1999 in #443
  • Bump transformers from 4.48.0 to 4.50.0 in /examples/bitnet-1.58b by @dependabot in #444
  • [Language] Support explicit programming for identified warp groups by @LeiWang1999 in #445
  • [Bugfix] Fix safe memory legalization for fragment store by @LeiWang1999 in #446
  • [Refactor] Separate warp specialize rewriter and tma barrier injector pass by @LeiWang1999 in #447
  • [Enhancement] Add new examples for warp specialization and TMA integration by @LeiWang1999 in #448
  • [Refactor] Phase out torch>=2.2.0 dependency by @LeiWang1999 in #451
  • [Feature] Add TILELANG_CHECK_LAST_ERROR macro for improved error handling in CUDA and HIP by @LeiWang1999 in #450
  • [Enhancement] Introduce pass_configs parameter for kernel Caching by @LeiWang1999 in #452
  • [Feature] Add cache directory management functions in tilelang.cache by @LeiWang1999 in #453
  • [Bugfix] Fix get_swizzle_layout implementation. by @cherichy in #455
  • [Refactor] Update barrier functions and add new example for GEMM with warp specialization by @LeiWang1999 in #456
  • [Refactor] Include examples in CI by @LeiWang1999 in #457
  • docs: add llvm version info to installation.md. by @AsakusaRinne in #459
  • [CI] Add elementwise and gemv examples to CI. by @Cunxiao2002 in #458
  • [Bugfix] Fix for T.copy with dynamic range by @LeiWang1999 in #462
  • [Bugfix] Fix copy region automation for dynamic extent by @LeiWang1999 in #465
  • [Feature] Implement fast integer power operation and related API by @LeiWang1999 in #466
  • [Typo] Rename power_of_int to pow_of_int for consistency by @LeiWang1999 in #468
  • [CI] Add BlocksparseGemm, Dynamic, and Cast examples to CI by @tzj-fxz in #467
  • [Refactor] Update set_compile_args to allow None for out_idx parameter by @LeiWang1999 in #469
  • [Refactor] Simplify buffer_region_to_tile_region function in copy.py by @LeiWang1999 in #470
  • [CI] Add Convolution example to CI by @xwhzz in #473
  • [BugFix] Correct argparse for example_convolution test by @xwhzz in #474
  • [Refactor] set USE_LLVM to optional. by @hyx1999 in #476
  • [CI] Add Analyzer and blocksparse_attention examples to CI by @yyttt6 in #472
  • [Refactor] Skip patchelf if not installed by @LeiWang1999 in #477
  • [Refactor] Improve layout equality checks and error messaging by @LeiWang1999 in #471
  • [Doc] Update version retrieval in conf.py to read from VERSION file by @xwhzz in #478
  • Fix Device Consistency in Autotuner Threads and Add Manual Profiler Check by @yuanjypku in #481
  • [Bugfix] Check CUDA target before checking for TMA by @gau-nernst in #482
  • [Bugfix] Use AutoTune cache_input_tensors properly by @yyttt6 in #483
  • Revert "[Bugfix] Use AutoTune cache_input_tensors properly" by @LeiWang1999 in #488
  • [Enhancement] Support register input for gemm when trans_a or trans_b is true by @LeiWang1999 in #490
  • [CI] Add flash_decoding example to CI by @xuchangtolearn in #487
  • [CI] Add Reminder Bot for pull request contributions by @xwhzz in #491
  • [Refactor] Introduce quantize components of TileLang and add testing for dequant gemm example by @LeiWang1999 in #494
  • [Enhancement] Introduce flag to visualize shared memory merge plan by @LeiWang1999 in #496
  • [Refactor] Update main function structure in example scripts and add tests by @chengyupku in #475
  • [Bugfix] Fix Hopper GEMM layout for small tile size by @LeiWang1999 in #497
  • [Enhancement] Fallback transposed_ldmatrix into SM75_U16x4_LDSM_N when warp_n is 8 by @LeiWang1999 in #498
  • [Bugfix] Rename SM75_U16x8_LDSM_N to SM75_U16x8_LDSM_T to reflect correct matrix type by @LeiWang1999 in #499
  • [Refactor] Update GEMM layout and operand traits for improved CUDA compatibility by @LeiWang1999 in #500
  • [Refactor] Update JIT kernel functions and streamline GEMM tests by @LeiWang1999 in #501
  • Fix AMD Docker issues related to conda environment setup by @Hamerlate in #503
  • [Refactor] Refactor jit to _JitImplementation to support @tilelang.jit by @LeiWang1999 in #502
  • [Refactor] Adjust in fragment GEMM layout by @LeiWang1999 in #504
  • [Refactor] Update GlobalMemChecker to Detect Lower Bound illegal memory access automatically by @LeiWang1999 in #505
  • [Enhancement] Enhance ReduceOp and JITKernel for improved dimension handling and initialization by @LeiWang1999 in #507
  • [Refactor] Update buffer handling in layout transformation to support layout on T.view by @LeiWang1999 in #509
  • [Bugfix] Enhance smem copy selector for uncommon shape by @LeiWang1999 in #510
  • [Enhancement] Introduce padding annotation and improve legalize safe memory access pass by @LeiWang1999 in #511
  • [Refactor] Enhance MergeSharedMemoryAllocations Pass for Improved Liveness Analysis and Scope Management by @LeiWang1999 in #508
  • [Dev] Add grouped GEMM example with TileLang and PyTorch integration by @chengyupku in #514
  • [Dev] Add grouped GEMM backward example scripts by @chengyupku in #515
  • Fix deepgemm example by @benenzhu in #513
  • [Refactor] Support auto index bitwidth casting by @LeiWang1999 in #517
  • [Enhancement] Support auto synchronization for global memory access by @LeiWang1999 in #519
  • [Refactor] Replace default fp8 dtype with cute to perform fast cast by @LeiWang1999 in #520
  • [Enhancement] Add atomicAdd for FLOAT16x2 and FLOAT16x4 by @LeiWang1999 in #522
  • [Refactor] Reorganize Thread Synchronization Steps to make sure global synchronization can be correctly lowered by @LeiWang1999 in #521
  • [Enhancement] Add commit ID to versioning and improve logging initialization by @LeiWang1999 in #524
  • [Enhancement] Add warp specialization attribute handling in IR and rewriter by @chengyupku in #518
  • [CI] Add gemm and gemm_fp8 example to CI by @LeslinD in #516
  • [Refactor] Refactor convolution example to streamline configuration and remove unused code by @LeiWang1999 in #530
  • [Autotune] Introduce cache mechanism for auto tuner by @LeiWang1999 in #527
  • [Refactor] Add autotune example to convolution examples by @yyttt6 in #536
  • [Refactor] Disable legacy vectorization for buffer allocation by @LeiWang1999 in #535
  • [Language] Support T.annotate_l2_hit_ratio via cudaStreamSetAttribute by @LeiWang1999 in #539
  • [Bugfix] Fix a bug when simplifying warp combination for T.gemm by @LeiWang1999 in #540
  • [AMD] Support float8 matrix core by @LeiWang1999 in #537
  • [Doc] Include DeepWiki badge in README by @xwhzz in #541
  • [CI] Add hadamard example to CI by @Rachmanino in #549
  • [chore] set default build type to release if not provided by @botbw in #548
  • [Refactor] Include several examples into ci by @LeiWang1999 in #531
  • [AMD][Enhancement] Add support for Vectorized FP8 DataPacking by @LeiWang1999 in #542
  • [Bugfix] Enhance layout inference pass for flexibility by @LeiWang1999 in #550
  • [Autotune] Remove the out_idx argument from the autotune cache by @LeiWang1999 in #553
  • [CI] Add linear attention examples to CI by @Rachmanino in #552
  • [CI] Add norm and layout_plot by @Alex4210987 in #534

New Contributors

Full Changelog: v0.1.4...v0.1.5