[Testing] Move TMA 1D and test for its functionality #1167

tzj-fxz · 2025-10-31T08:48:22Z

As title

Summary by CodeRabbit

Tests
- Reorganized elementwise operation tests for improved clarity and maintainability.
- Removed a deprecated test case.
- Added a programmatic test flow that runs elementwise add across multiple parameter combinations.
- Consolidated test setup to simplify imports and ensure consistent execution and validation.

github-actions · 2025-10-31T08:48:32Z

👋 Hi! Thank you for contributing to the TileLang project.

Please remember to run pre-commit run --all-files in the root directory of the project to ensure your changes are properly linted and formatted. This will help ensure your contribution passes the format check.

We appreciate you taking this step! Our team will review your contribution, and we look forward to your awesome work! 🚀

coderabbitai · 2025-10-31T08:48:32Z

Walkthrough

Removed one example-based test and refactored a language-level test to add a callable helper run_elementwise_add(M, N) and invoke it for three (M, N) configurations instead of using argparse-based CLI parsing.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Example tests removed: `examples/elementwise/test_example_elementwise.py` | Removed the import and the `test_example_elementwise_add_tma_1d` test function; left other tests in that file unchanged. |
| Language test refactor / helper added: `testing/python/language/test_tilelang_language_tma_1d.py` | Replaced argparse CLI flow with a new `run_elementwise_add(M, N)` function that configures and runs the elementwise-add kernel, validates results, and performs kernel-source checks; wired explicit calls for (128,128), (256,128), (256,256); removed redundant torch import. |

Sequence Diagram(s)

sequenceDiagram
    actor Tester
    participant test_file as test_tilelang_language_tma_1d.py
    participant Kernel as ElementwiseAddKernel
    participant Reference as ReferenceCompute

    Tester->>test_file: import & invoke run_elementwise_add for (128,128)
    activate test_file
    test_file->>Kernel: configure + launch kernel
    Kernel-->>test_file: output tensor
    test_file->>Reference: compute expected result
    Reference-->>test_file: reference tensor
    test_file->>test_file: compare outputs & check kernel source
    deactivate test_file
    Note over test_file,Kernel: repeated for (256,128) and (256,256)

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Areas to check:
- Correctness of run_elementwise_add verification (numerical comparison tolerances and shapes).
- Kernel-source/content assertions conditioned on block_N vs N.
- Any test teardown or device/context assumptions for repeated invocations.
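The first item above (shape and tolerance checks) can be illustrated with a dependency-free sketch. This is not the code in the PR, which presumably compares GPU tensors with the project's own utilities; the helper name `assert_close_2d`, the tolerance values, and the sample data are all assumptions chosen for illustration.

```python
import math


def assert_close_2d(actual, expected, rel_tol=1e-2, abs_tol=1e-2):
    """Check two row-major 2D lists for equal shape and element-wise closeness.

    Hypothetical stand-in for a tensor comparison; tolerances are illustrative.
    """
    assert len(actual) == len(expected), "row count mismatch"
    for i, (row_a, row_e) in enumerate(zip(actual, expected)):
        assert len(row_a) == len(row_e), f"column count mismatch in row {i}"
        for j, (a, e) in enumerate(zip(row_a, row_e)):
            assert math.isclose(a, e, rel_tol=rel_tol, abs_tol=abs_tol), (
                f"mismatch at ({i}, {j}): got {a}, expected {e}"
            )


# Stand-in inputs and a reference elementwise add computed on the host.
a = [[1.0, 2.0], [3.0, 4.0]]
b = [[0.5, 0.5], [0.5, 0.5]]
expected = [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

# Pretend kernel output with a small rounding error in the last element.
kernel_output = [[1.5, 2.5], [3.5, 4.501]]
assert_close_2d(kernel_output, expected)
```

Checking the shape before the values keeps a shape bug from being reported as a confusing numerical mismatch.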

Suggested reviewers

LeiWang1999

Poem

🐇 I hopped through tests both old and new,

Removed a file, then wrote a few,
A helper runs three sizes through the night,
Checking kernels, tensors tight —
Small paws, sharp eyes, tests glowing bright.

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | You can run `@coderabbitai generate docstrings` to improve docstring coverage. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title Check | ✅ Passed | The title "[Testing] Move TMA 1D and test for its functionality" is directly related to the main changes in the pull request. The changeset shows removal of the TMA 1D test from `examples/elementwise/test_example_elementwise.py` and the introduction of a new `run_elementwise_add()` function in `testing/python/language/test_tilelang_language_tma_1d.py` with proper validation and testing harness wiring. This clearly represents a relocation of TMA 1D testing from examples to the testing folder, along with enhancements to test functionality. The title is concise, specific enough to convey the primary change, and appropriately uses the [Testing] prefix to indicate the category of work. |


coderabbitai

Actionable comments posted: 0

🧹 Nitpick comments (1)

testing/python/language/test_tilelang_language_tma_1d.py (1)
33-50: Consider adding clarifying comments for the code inspection logic.

The helper function is well-structured and correctly validates both functional behavior and TMA code generation. However, the assertion logic at lines 46-49 would benefit from a brief comment explaining that when N == block_N, the kernel uses a single-tile-width layout that doesn't require a CUtensorMap, whereas multi-tile-width scenarios do.

Example:
     code = kernel.get_kernel_source()
+    # Single-tile-width (N == block_N) uses simpler TMA without CUtensorMap
     if N == block_N:
         assert "tma_load" in code and "CUtensorMap" not in code
+    # Multi-tile-width requires CUtensorMap for tiling
     else:
         assert "tma_load" in code and "CUtensorMap" in code

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 10911e2 and eec7e3d.

📒 Files selected for processing (2)

examples/elementwise/test_example_elementwise.py (0 hunks)
testing/python/language/test_tilelang_language_tma_1d.py (2 hunks)

💤 Files with no reviewable changes (1)

examples/elementwise/test_example_elementwise.py

🔇 Additional comments (2)

testing/python/language/test_tilelang_language_tma_1d.py (2)

2-2: LGTM!

The import cleanup looks good.

52-55: LGTM!

The test cases provide good coverage of TMA behavior across single-tile and multi-tile scenarios.

coderabbitai

Actionable comments posted: 0

🧹 Nitpick comments (3)

testing/python/language/test_tilelang_language_tma_1d.py (3)
37-38: Consider documenting why block sizes are fixed.

The hard-coded block sizes (128×128) are marked as "Default config" but the rationale isn't clear. A brief comment explaining that these specific dimensions are chosen to test TMA 1D optimization behavior (single vs. multiple blocks) would improve maintainability.

44-48: Kernel source validation is implementation-dependent.

String-based validation of generated kernel code is a common testing pattern for compiler optimizations, but it's brittle and tightly couples the test to current code generation patterns. If the tilelang compiler changes its code generation (e.g., renames variables, changes string formatting, or optimizes differently), this test will break even if the functionality remains correct.

Consider whether:

This validation is essential for verifying TMA 1D correctness, or

Functional correctness (line 42) is sufficient, with code inspection as a separate compiler test

If string checks remain, document what TMA 1D behavior is being validated (i.e., single-block dimensions use direct loads without tensor map descriptors).
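If the string checks are kept, one way to document the validated behavior is to move them into a small helper with a docstring and descriptive assertion messages. The sketch below is only an illustration: the helper name `check_tma_1d_source` and the sample source strings are hypothetical and are not real TileLang output. The two substrings it looks for, `tma_load` and `CUtensorMap`, are the ones quoted in the review above.

```python
def check_tma_1d_source(code: str, N: int, block_N: int) -> None:
    """Assert the expected TMA pattern in generated kernel source.

    Per the review discussion: when N == block_N the kernel is expected to
    load via TMA without a tensor map descriptor; otherwise a CUtensorMap
    descriptor is expected in addition to the TMA load.
    """
    assert "tma_load" in code, "expected a TMA load in the generated kernel"
    has_descriptor = "CUtensorMap" in code
    if N == block_N:
        assert not has_descriptor, (
            f"N == block_N == {N}: did not expect a CUtensorMap descriptor"
        )
    else:
        assert has_descriptor, (
            f"N={N} != block_N={block_N}: expected a CUtensorMap descriptor"
        )


# Hypothetical source fragments used only to exercise the helper.
single_width_src = "tl::tma_load(dst, src, bytes);"
multi_width_src = "const CUtensorMap desc; tl::tma_load(desc, dst, x, y);"

check_tma_1d_source(single_width_src, N=128, block_N=128)
check_tma_1d_source(multi_width_src, N=256, block_N=128)
```

The check stays coupled to code generation details, but a failure message now states which expectation was violated and for which configuration.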

32-58: Optional: Consider pytest integration.

Since this file is in the testing/python/language/ directory, integrating with pytest would enable:

Automatic test discovery

Better test reporting and failure diagnostics

Parameterized testing (e.g., @pytest.mark.parametrize for the three configurations)

Skipping tests when CUDA is unavailable

Example refactor:
import pytest

@pytest.mark.parametrize("M,N", [(128, 128), (256, 128), (256, 256)])
def test_elementwise_add_tma_1d(M, N):
    run_elementwise_add(M, N)
This would also allow adding @pytest.mark.skipif(not torch.cuda.is_available(), reason="CUDA required") for better CI/CD integration.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between eec7e3d and 4779fc2.

📒 Files selected for processing (1)

testing/python/language/test_tilelang_language_tma_1d.py (2 hunks)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)

GitHub Check: Test for Python 3.12 with Metal (on macos-latest)
GitHub Check: Test for Python 3.12 with CUDA-12.8 (on self-hosted-nvidia)

🔇 Additional comments (3)

testing/python/language/test_tilelang_language_tma_1d.py (3)

1-4: LGTM! Clean import structure.

The simplified imports align well with the refactored test approach.

51-54: LGTM! Well-chosen test configurations.

With the 128×128 block configuration, the three test cases cover different TMA 1D scenarios: a single block (128×128), two blocks along M with a single block along N (256×128), and multiple blocks along both dimensions (256×256).
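To see how the three configurations map onto blocks, the following sketch computes the number of blocks along each dimension by ceiling division. It assumes the 128×128 block sizes mentioned earlier in the review; the function names and the (blocks along M, blocks along N) ordering are illustrative choices, not taken from the test file.

```python
def ceildiv(a: int, b: int) -> int:
    """Integer ceiling division for positive b."""
    return -(-a // b)


def grid_shape(M: int, N: int, block_M: int = 128, block_N: int = 128):
    """Return (blocks along M, blocks along N) for an M x N problem."""
    return ceildiv(M, block_M), ceildiv(N, block_N)


for M, N in [(128, 128), (256, 128), (256, 256)]:
    print((M, N), "->", grid_shape(M, N))
# (128, 128) -> (1, 1)
# (256, 128) -> (2, 1)
# (256, 256) -> (2, 2)
```

Only the last configuration has more than one block along N, which is the case where the source check expects a `CUtensorMap` descriptor.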

57-58: LGTM!

Standard and correct entry point pattern.

* [Testing] Move TMA 1D and test for its functionality
* [Lint]

* [Test] Add cp async to avoid register spill
* [BugFix] GQA fwd and bwd
  - Fix the undefined behavior of -inf in acc_s
  - Fix the causal loop range in varlen scenario
* [TMA] Move on to TMA and locate the register spill issue
* [Debug] Not the reason of zero-assignment. Probably the combination of Parallel op & conditional qkT
* [Debug] The SIMT copy in producer occupies too many registers
* [BugFix] Use 3D lse and delta to avoid illegal instruction
* [Perf] Relaxed order for dQ and SIMT store for dKdV
* [Feat] For atomic add version
* [Lint]
* [Bugfix] Enable code lowering with producer‑copy‑only program (#1168)
* bugfix
* lint fix
* Enhance warp group register allocation to handle missing consumer bodies gracefully. Updated logic to annotate producer side when consumer is absent, ensuring robustness in degenerate warp-specialized patterns.
* Refactor VisitExpr_ method in inject_tma_barrier.cc for improved readability. Adjusted formatting and spacing for clarity in barrier handling logic.
* Update barrier handling in inject_tma_barrier.cc to accommodate newly appended entries. Adjusted the size of the replace vector to ensure it covers the full needed length, and modified the logic for appending barriers based on the updated replace conditions.
* [Bugfix] Support 16bits shfl_sync (#1169)
* Add type-safe warp shuffle helpers for 16-bit float types in common.h
  - Introduced generic passthrough functions for warp shuffle operations: `shfl_xor_sync`, `shfl_down_sync`, `shfl_up_sync`, and `shfl_sync`.
  - Added specializations for `cutlass::half_t` and `cutlass::bfloat16_t` to ensure type safety during shuffle operations.
  - Updated `reduce.h` to utilize the new shuffle functions, enhancing code clarity and maintainability.
* lint fix
* [Testing] Move TMA 1D and test for its functionality (#1167)
* [Testing] Move TMA 1D and test for its functionality
* [Lint]
* [Refactor]: Change the params in pytest to avoid oom error during ci (#1170)
* [Refactor]: Change the params in pytest to avoid oom error during ci
* format
* fix
* Update test_example_cast.py
* Update parameters in test_example_cast
* Update test_example_flash_attention.py
* update
* format
* fix
* fix
* format
* [Bugfix] Fix tvm import path for editable build (#1172)
* [Language] Expose `T.warpgroup_fence_operand` for nvcc code motion (#986)
* remove debug print
* pipeline fix
* use the correct buffer access scope
* rs support
* warp warpgroup_fence_operand
* fix
* fp8 dtype ptx enhance
* mma fix
* TCGEN05 Interface
* tcgen05 support
* rebase
* update
* Enhance TCGEN05 support by adding new intrinsic operations and descriptors. Introduced `ptx_tcgen05_mma_ts` for tensor-memory to shared-memory instructions and `tcgen05_mma_arrive` for signaling barrier completion. Updated existing descriptors and code generation logic to accommodate these changes, ensuring compatibility with new instruction sets. Refactored related allocation functions and improved handling of shared memory descriptors.
* lint fix
* Refactor buffer reference handling in CUDA code generation and update test execution in tilelang. Ensure default annotations for unrolling are set correctly in TIR IR module.
* wgmma fix

---------

Co-authored-by: Zhiwen Mo <[email protected]>

* [Language] Add Correctness and performance check scripts for V2 (#1174)
* fix
* lint fix
* fix
* lint fix
* fix
* upd
* [Bugfix] Legalize Datatype for mma intrinisc codegen (#1179)
* fix
* lint fix
* Enhance CUDA code generation by updating register type handling for float data types. Introduced a workaround for TF32 type compatibility and improved the registration of MMA register types for A and B operands.
* [Perf] Add layout and use_tma to boost performance
* [Lint]
* [Note]

---------

Co-authored-by: Lei Wang <[email protected]>
Co-authored-by: Yuqi Dong <[email protected]>
Co-authored-by: Zhiwen Mo <[email protected]>

[Testing] Move TMA 1D and test for its functionality

eec7e3d

coderabbitai bot reviewed Oct 31, 2025


[Lint]

4779fc2

coderabbitai bot reviewed Oct 31, 2025


LeiWang1999 approved these changes Oct 31, 2025


LeiWang1999 merged commit 5c62d00 into tile-ai:main Nov 1, 2025
4 of 6 checks passed

kurisu6912 mentioned this pull request Nov 3, 2025

[Fix] fix type imcompatible error in #1115 #1180

Merged

tzj-fxz added a commit to tzj-fxz/tilelang that referenced this pull request Nov 3, 2025

[Testing] Move TMA 1D and test for its functionality (tile-ai#1167)

3311b94

* [Testing] Move TMA 1D and test for its functionality * [Lint]

This was referenced Nov 4, 2025

[Feat] Add swap like grammar in tuple assignment #1185

Merged

[Fix] Remove unsupported type params #1186

Merged

[Feat] Add support for T.serial with step and negative step #1188

Merged

[Feat] Add A Pass to Handle Negative Index #1192

Merged

This was referenced Nov 10, 2025

[Fix] Fix buffer re-import typo in tilelang.languge #1214

Merged

[Fix] Fix a type that make wrong T.macro backtrace #1234

Merged

[Language] Add type stubs for tir op #1239

Merged

This was referenced Nov 21, 2025

[Feat] Add missing support for uint32x2, add unsigned implicit cast in bitwise op, add T.Ref as macro annotation #1302

Closed

[Fix] Remove unused let_bindings_ in CodeGenC to fix #1300 #1305

Merged

[Fix] Fix frame scope error in T.macro #1308

Open
