
Conversation

@rdspring1 (Collaborator) commented Sep 4, 2025

This PR adds a hybrid GitHub Markdown / C++ tutorial that shows how to create CuTe Thread-Value (TV) layouts with NvFuser.

TODO:

  • Extract stride of loop iterDomains from TensorIndexer.
  • Separate Thread and Value dimensions based on thread parallelization.
  • Diagrams currently use different colors for threads; also apply colors based on values and replication patterns.
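For background, a Thread-Value layout is just a map from a (thread, value) coordinate to a linear offset into the tensor's data. Below is a minimal standalone C++ sketch of that mapping; the shapes and strides are illustrative and are not taken from the tutorial itself.

#include <array>
#include <cstdio>

// A rank-2 layout in the CuTe sense: offset = thread * stride[0] + value * stride[1].
struct TVLayout {
  std::array<int, 2> shape;   // {number of threads, values per thread}
  std::array<int, 2> stride;  // strides of the thread and value modes
  int operator()(int thread, int value) const {
    return thread * stride[0] + value * stride[1];
  }
};

int main() {
  // Illustrative vectorized layout: 32 threads, each owning 4 contiguous
  // elements of a 128-element buffer (value stride 1, thread stride 4).
  TVLayout tv{{32, 4}, {4, 1}};
  for (int t = 0; t < 2; ++t) {
    for (int v = 0; v < 4; ++v) {
      std::printf("thread %d, value %d -> offset %d\n", t, v, tv(t, v));
    }
  }
  return 0;
}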

@github-actions bot commented Sep 4, 2025

Review updated until commit 67d9c1e

Description

  • Add tutorial on CuTe Thread-Value layouts in NvFuser

  • Implement example TV layouts: simple, vectorized, WGMMA

  • Include visual diagrams for TV layout examples

  • Integrate tutorial into CMake test suite


Changes walkthrough 📝

Relevant files

Documentation

tests/cpp/tutorial_cute_tv_layout.cpp: Add hybrid tutorial file link (+1/-0)

  • Added symlink to cute_tv_layout.md as a hybrid CPP/Markdown tutorial
  • Enables the file to be both compiled and rendered as documentation

doc/dev/cute_tv_layout.md: Create CuTe TV layout tutorial (+461/-0)

  • Created comprehensive tutorial on CuTe TV layouts
  • Added three example layouts: simple, vectorized, and Hopper WGMMA
  • Included SVG diagrams for visual explanation
  • Provided detailed code comments and layout algebra background

Configuration changes

CMakeLists.txt: Register new tutorial in build (+1/-0)

  • Added tutorial_cute_tv_layout.cpp to the list of tutorial sources
  • Ensures the new tutorial is included in the test suite
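The doc/dev/cute_tv_layout.md walkthrough above mentions a Hopper WGMMA example and layout algebra background. As a rough illustration of the underlying idea (the sub-mode sizes and strides below are made up for illustration, not the tutorial's actual WGMMA layout), a hierarchical TV layout splits both the thread and value modes into sub-modes, and the offset is still the sum of coordinate times stride over every sub-mode.

#include <cstdio>

// Hierarchical TV layout: thread mode = (lanes per quad, quads),
// value mode = (pairs, elements per pair). Strides are illustrative only.
int tv_offset(int lane, int quad, int pair, int elem) {
  constexpr int kLaneStride = 2;
  constexpr int kQuadStride = 8;
  constexpr int kPairStride = 64;
  constexpr int kElemStride = 1;
  return lane * kLaneStride + quad * kQuadStride +
         pair * kPairStride + elem * kElemStride;
}

int main() {
  // First few (thread, value) coordinates and the offsets they map to.
  std::printf("lane 0, quad 0, value (0,0) -> %d\n", tv_offset(0, 0, 0, 0));
  std::printf("lane 0, quad 0, value (0,1) -> %d\n", tv_offset(0, 0, 0, 1));
  std::printf("lane 1, quad 0, value (0,0) -> %d\n", tv_offset(1, 0, 0, 0));
  std::printf("lane 0, quad 1, value (0,0) -> %d\n", tv_offset(0, 1, 0, 0));
  return 0;
}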

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

🧪 PR contains tests
⚡ Recommended focus areas for review

Possible Issue

The comment in the code indicates that the input and output tensors are column-major with shape (16, 64) and stride (1, 16), but the actual tensor dimensions defined in the test are (64, 24), which may lead to confusion or incorrect assumptions about the layout.

// The input and output tensors are column-major with shape (16, 64) and stride (1, 16).
__global__ void CUDAGeneratedKernel(Tensor<__bfloat, 2, 2> T0, Tensor<__bfloat, 2, 2> T1) {

Possible Issue

The PTX WGMMA comment references a (16, 64) shape, but the defined tensor dimensions are (64, 24), indicating a mismatch between the documented example and the actual implementation.

// The input and output tensors are column-major with shape (16, 64) and stride (1, 16).
__global__ void CUDAGeneratedKernel(Tensor<__bfloat, 2, 2> T0, Tensor<__bfloat, 2, 2> T1) {
  #pragma unroll
  for(nvfuser_index_t i0 = 0LL; i0 < 64LL; ++i0) {
    nvfuser_index_t i1;
    i1 = 16LL * i0;
    #pragma unroll
    for(nvfuser_index_t i2 = 0LL; i2 < 16LL; ++i2) {
      nvfuser_index_t i3;
      i3 = i1 + i2;
      T1[i3] = T0[i3];
    }
  }
}
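One way to sanity-check the documented shape against the generated kernel (an illustrative check, not part of the PR) is to read the layout back out of the index arithmetic: with i1 = 16 * i0, i3 = i1 + i2, i0 in [0, 64) and i2 in [0, 16), the loop nest touches exactly the offsets of a column-major (16, 64) tensor with stride (1, 16).

#include <cassert>
#include <vector>

int main() {
  // Replay the generated kernel's index math and confirm it covers each of
  // the 16 * 64 elements exactly once, i.e. shape (16, 64), stride (1, 16).
  std::vector<bool> seen(16 * 64, false);
  for (long i0 = 0; i0 < 64; ++i0) {
    for (long i2 = 0; i2 < 16; ++i2) {
      long i3 = 16 * i0 + i2;  // same as i1 = 16 * i0; i3 = i1 + i2;
      assert(!seen[i3]);       // no offset visited twice
      seen[i3] = true;
    }
  }
  for (bool visited : seen) {
    assert(visited);           // every offset visited
  }
  return 0;
}

If the test really defines (64, 24) tensors, that does not match this arithmetic, which is the mismatch the reviewer guide flags above.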

@rdspring1 force-pushed the cute_tv_layout branch 2 times, most recently from 072adf2 to 7d1afed on September 4, 2025 at 17:47
@zasdfgbnm (Collaborator) left a comment

Let's merge this. It's good reference material for learning and understanding CuTe.

@rdspring1 (Collaborator, Author)

!build

@rdspring1 merged commit b272c97 into main on Sep 23, 2025
17 checks passed
@rdspring1 deleted the cute_tv_layout branch on September 23, 2025 at 19:30
