Conversation

jacobhinkle (Collaborator)

This refactors CompiledKernel by extracting a new base class, CompiledKernelBase, which handles kernel naming and records the generated CUDA code. It then introduces CutlassCompiledKernel, which actually compiles and executes our customized CUTLASS kernels. This will be used in CutlassExecutor in future PRs.

Stacked on #5087.
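The class split described above can be sketched roughly as follows. This is an illustrative hierarchy only; the member and method names here are hypothetical stand-ins, not the actual nvFuser API:

```cpp
#include <cassert>
#include <string>
#include <utility>

// Hypothetical sketch of the refactor: the base class owns kernel naming and
// the recorded CUDA source; derived classes own compilation and execution.
class CompiledKernelBase {
 public:
  explicit CompiledKernelBase(std::string kernel_name)
      : kernel_name_(std::move(kernel_name)) {}
  virtual ~CompiledKernelBase() = default;

  const std::string& kernelName() const { return kernel_name_; }
  const std::string& cudaCode() const { return cuda_code_; }

  virtual void compile() = 0;  // each backend compiles differently

 protected:
  void recordCudaCode(std::string code) { cuda_code_ = std::move(code); }

 private:
  std::string kernel_name_;
  std::string cuda_code_;  // generated CUDA source, recorded by the base
};

// Sketch of the CUTLASS-specific derived class: it generates CUTLASS source
// and, in the real PR, shells out to nvcc and dlopen()s the result.
class CutlassCompiledKernel : public CompiledKernelBase {
 public:
  using CompiledKernelBase::CompiledKernelBase;

  void compile() override {
    recordCudaCode("// CUTLASS kernel for " + kernelName());
    compiled_ = true;  // stand-in for the nvcc + dlopen pipeline
  }

  bool isCompiled() const { return compiled_; }

 private:
  bool compiled_ = false;
};
```

The point of the split is that CompiledKernel (the existing NVRTC path) and CutlassCompiledKernel share naming and code recording but nothing about how compilation happens.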


github-actions bot commented Aug 29, 2025

Review updated until commit fa97d4e

Description

  • Refactor CompiledKernel into base and derived classes

  • Introduce CutlassCompiledKernel for CUTLASS kernel execution

  • Add input validation for tensor contiguity in GEMM

  • Implement NVCC-based compilation for CUTLASS kernels


Changes walkthrough 📝

Relevant files

Enhancement

csrc/cutlass/gemm.cpp: Add tensor validation and TensorArg abstraction (+63/-40)

  • Add input validation to ensure tensor contiguity
  • Introduce TensorArg struct for CUDA tensor representation
  • Update function signatures to use TensorArg instead of at::Tensor
  • Enhance kernel code generation with dimension checks

csrc/runtime/compiled_kernel.cpp: Refactor into CompiledKernelBase hierarchy (+59/-59)

  • Move common functionality to CompiledKernelBase
  • Implement queryTargetGPUVersion as standalone function
  • Update constructor to initialize base class properly
  • Move kernel ID creation to base class

csrc/runtime/cutlass_compiled_kernel.cpp: Implement CUTLASS kernel compilation and execution (+357/-0)

  • Implement CutlassCompiledKernel with NVCC compilation
  • Add code generation and dynamic loading via dlopen
  • Implement run method with TensorArg packing
  • Add comprehensive error handling and logging

csrc/runtime/compiled_kernel.h: Define CompiledKernelBase hierarchy (+151/-117)

  • Introduce CompiledKernelBase as common base class
  • Move shared members and methods to base class
  • Update CompiledKernel to inherit from base
  • Add queryTargetGPUVersion declaration

csrc/runtime/cutlass_compiled_kernel.h: Declare CutlassCompiledKernel interface (+69/-0)

  • Declare CutlassCompiledKernel class
  • Define interface for CUTLASS kernel compilation
  • Include necessary headers for compilation
  • Implement proper inheritance from CompiledKernelBase

Tests

tests/cpp/test_cutlass_scheduler.cpp: Add comprehensive CutlassCompiledKernel test (+66/-6)

  • Update test to use CutlassCompiledKernel
  • Add environment check for CUTLASS_PATH
  • Implement full end-to-end execution test
  • Add proper tensor allocation and validation

Configuration changes

CMakeLists.txt: Include cutlass_compiled_kernel in build (+1/-0)

  • Add cutlass_compiled_kernel.cpp to source list
  • Ensure proper compilation of new source file
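The walkthrough mentions a TensorArg struct and a run method that packs tensors before invoking the dlopen-loaded entry point. A simplified, self-contained sketch of that packing is below; the field names and the fixed-rank limit are illustrative assumptions, not the actual struct in csrc/cutlass/gemm.cpp:

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <initializer_list>

// Illustrative stand-in for the PR's TensorArg: a plain-data view of a tensor
// that can cross the dlopen() boundary without linking against ATen.
constexpr int64_t kMaxDims = 8;  // assumed rank limit for this sketch

struct TensorArg {
  void* data = nullptr;
  int64_t ndim = 0;
  std::array<int64_t, kMaxDims> sizes{};
  std::array<int64_t, kMaxDims> strides{};
};

// The generated kernel would be loaded with dlopen() and resolved with
// dlsym(); its entry point takes TensorArg values rather than at::Tensor.
using KernelEntryFn = void (*)(const TensorArg* args, int64_t num_args);

// Pack raw pointer/shape/stride data into a TensorArg before launch.
// Assumes sizes.size() <= kMaxDims; a real implementation would validate.
TensorArg packTensorArg(
    void* data,
    std::initializer_list<int64_t> sizes,
    std::initializer_list<int64_t> strides) {
  TensorArg arg;
  arg.data = data;
  arg.ndim = static_cast<int64_t>(sizes.size());
  int64_t i = 0;
  for (int64_t s : sizes) {
    arg.sizes[i++] = s;
  }
  i = 0;
  for (int64_t s : strides) {
    arg.strides[i++] = s;
  }
  return arg;
}
```

Keeping the argument struct free of ATen types is what lets the nvcc-compiled shared object stay ABI-independent of the host process's PyTorch build.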

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

🧪 PR contains tests
⚡ Recommended focus areas for review

Possible Issue

The tensor contiguity validation checks only the TensorView's contiguity vector; it does not verify that the tensor passed at runtime is actually contiguous in memory, which might lead to incorrect assumptions about memory layout.

```cpp
// Validate that the inputs and scale factors are all contiguous
for (TensorView* tv : {a, b, a_scale, b_scale}) {
  for (const auto& c : tv->getContiguity()) {
    if (c.has_value()) {
      NVF_ERROR(
          c.value() == true,
          "We require input TensorView ",
          tv->toString(),
          " to be fully contiguous for the Cutlass executor.");
    }
  }
}
```
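As the reviewer note suggests, the compile-time contiguity flags could be cross-checked against actual strides at runtime. A generic sketch of such a check (independent of ATen, assuming row-major layout; not the PR's code):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Returns true if `strides` describe a densely packed row-major layout for
// `sizes`: the innermost stride is 1 and each outer stride equals the product
// of all inner extents. Size-1 dimensions accept any stride, since their
// stride is never used to address more than one element.
bool isRowMajorContiguous(
    const std::vector<int64_t>& sizes,
    const std::vector<int64_t>& strides) {
  if (sizes.size() != strides.size()) {
    return false;
  }
  int64_t expected = 1;
  for (int64_t i = static_cast<int64_t>(sizes.size()) - 1; i >= 0; --i) {
    if (sizes[i] != 1 && strides[i] != expected) {
      return false;
    }
    expected *= sizes[i];
  }
  return true;
}
```

In the actual executor, ATen already provides `at::Tensor::is_contiguous()` for this purpose; the sketch just shows what that predicate computes.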
Performance Concern

The compile command generation appends `--expt-relaxed-constexpr` and `--expt-extended-lambda` twice (once mid-command and again at the end), and carries flags such as `-forward-unknown-to-host-compiler` whose necessity should be reviewed. Exact duplicate flags are harmless to correctness, but they bloat the command line and obscure intent.

```cpp
compile_cmd = "nvcc";
compile_cmd += " -forward-unknown-to-host-compiler";

// Disable some warnings in host code
compile_cmd += " -Wno-conversion";

for (const std::string& path : include_paths) {
  compile_cmd += " -I" + path;
}

compile_cmd +=
    " -Xcudafe "
    "--diag_suppress=cc_clobber_ignored,"
    "--diag_suppress=field_without_dll_interface,"
    "--diag_suppress=base_class_has_different_dll_interface,"
    "--diag_suppress=dll_interface_conflict_none_assumed,"
    "--diag_suppress=dll_interface_conflict_dllexport_assumed,"
    "--diag_suppress=bad_friend_decl";

compile_cmd += " --expt-relaxed-constexpr --expt-extended-lambda";

compile_cmd += " -O3";

compile_cmd += " -std=c++17";

std::string arch = getComputeCapabilityString(compute_capability);
// 90 -> 90a  or  100 -> 100a
// Note that without this we get errors like
//   Trying to use TMA Descriptor Prefetch without CUTE_ARCH_TMA_SM90_ENABLED.
// TODO: This should be done properly
// https://github.com/nvidia/cutlass#target-architecture
arch += "a";
std::string compute_arch = "compute_" + arch;
std::string sm_arch = "sm_" + arch;
compile_cmd +=
    " \"--generate-code="
    "arch=" +
    compute_arch +
    ","
    "code=[" +
    compute_arch + "," + sm_arch + "]\"";

compile_cmd +=
    " -DCUTE_USE_PACKED_TUPLE=1"
    " -DCUTLASS_ENABLE_TENSOR_CORE_MMA=1"
    " -DCUTLASS_VERSIONS_GENERATED"
    " -DCUTLASS_DEBUG_TRACE_LEVEL=0";

compile_cmd +=
    " --expt-relaxed-constexpr --expt-extended-lambda "
    "--threads=32";
```
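One low-risk cleanup for the duplicated flags would be to collect flags in a list and deduplicate before joining into a command line. A sketch of that approach (this is a suggestion, not the PR's actual code):

```cpp
#include <cassert>
#include <string>
#include <unordered_set>
#include <vector>

// Join compiler flags into a command line, dropping exact duplicates while
// preserving first-occurrence order. Duplicate nvcc flags do not change the
// output but make the command longer and harder to audit.
std::string buildCommand(
    const std::string& compiler,
    const std::vector<std::string>& flags) {
  std::string cmd = compiler;
  std::unordered_set<std::string> seen;
  for (const std::string& flag : flags) {
    if (seen.insert(flag).second) {  // insert() reports whether flag was new
      cmd += " " + flag;
    }
  }
  return cmd;
}
```

This only catches byte-identical repeats; flags with differing spellings (e.g. attached vs. separated arguments) would still need manual review.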
Possible Issue

The compute capability string hardcodes an appended "a" suffix (e.g. 90 -> 90a) without validating that an "a" variant exists for the target architecture, which might cause compatibility issues on GPUs that have no architecture-specific feature set.

```cpp
std::string arch = getComputeCapabilityString(compute_capability);
// 90 -> 90a  or  100 -> 100a
// Note that without this we get errors like
//   Trying to use TMA Descriptor Prefetch without CUTE_ARCH_TMA_SM90_ENABLED.
// TODO: This should be done properly
// https://github.com/nvidia/cutlass#target-architecture
arch += "a";
std::string compute_arch = "compute_" + arch;
std::string sm_arch = "sm_" + arch;
compile_cmd +=
```
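A more guarded mapping could append the suffix only for families known to have an "a" variant. Per the CUTLASS target-architecture docs, the "a" suffix (sm_90a, sm_100a) opts into architecture-specific features such as TMA and WGMMA; which majors take the suffix in this sketch is illustrative, not exhaustive:

```cpp
#include <cassert>
#include <string>

// Sketch: map a compute capability to an nvcc arch string, appending the "a"
// feature suffix only for architectures known (in this sketch) to have one.
// Appending "a" unconditionally, as the reviewed code does, is what the
// review comment flags.
std::string archString(int major, int minor) {
  std::string arch = std::to_string(major) + std::to_string(minor);
  // Assumption for illustration: Hopper (9.x) and Blackwell (10.x) have an
  // architecture-specific "a" variant; earlier architectures do not.
  if (major == 9 || major == 10) {
    arch += "a";
  }
  return arch;
}
```

Note the trade-off the existing TODO comment hints at: without the suffix, CUTLASS kernels using TMA fail to build (the CUTE_ARCH_TMA_SM90_ENABLED error above), but with it, the generated SASS is not forward-compatible with later architectures.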

@jacobhinkle (Collaborator, Author)

    !test

@jacobhinkle marked this pull request as ready for review August 29, 2025 18:33
@jacobhinkle changed the title from "[WIP] Add CutlassCompiledKernel deriving CompiledKernelBase" to "Add CutlassCompiledKernel deriving CompiledKernelBase" Aug 29, 2025
@jacobhinkle (Collaborator, Author)

!test

Base automatically changed from jh/cutlass_executor.codegen to main September 3, 2025 17:34
@jacobhinkle (Collaborator, Author)

This test is currently skipped in CI because $CUTLASS_PATH is not set in the CI environment. With this change we will need that path available at runtime, so at some point we will want to set up our CI to point to CUTLASS.

@jacobhinkle (Collaborator, Author)

!test
