Conversation

jacobhinkle (Collaborator)

This refactors CompiledKernel by extracting a new base class, CompiledKernelBase, which handles kernel naming and records the generated CUDA code. It then introduces CutlassCompiledKernel, which actually compiles and executes our customized CUTLASS kernels. This will be used in CutlassExecutor in future PRs.

Stacked on #5087.
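The class split described above can be sketched roughly as follows. This is an illustrative hierarchy only; the member and method names here are hypothetical stand-ins, not the actual nvFuser API:

```cpp
#include <cassert>
#include <string>
#include <utility>

// Hypothetical sketch of the refactor: the base class owns kernel naming and
// the recorded CUDA source; derived classes own compilation and execution.
class CompiledKernelBase {
 public:
  explicit CompiledKernelBase(std::string kernel_name)
      : kernel_name_(std::move(kernel_name)) {}
  virtual ~CompiledKernelBase() = default;

  const std::string& kernelName() const { return kernel_name_; }
  const std::string& cudaCode() const { return cuda_code_; }

  virtual void compile() = 0;  // each backend compiles differently

 protected:
  void recordCudaCode(std::string code) { cuda_code_ = std::move(code); }

 private:
  std::string kernel_name_;
  std::string cuda_code_;  // generated CUDA source, recorded by the base
};

// Sketch of the CUTLASS-specific derived class: it generates CUTLASS source
// and, in the real PR, shells out to nvcc and dlopen()s the result.
class CutlassCompiledKernel : public CompiledKernelBase {
 public:
  using CompiledKernelBase::CompiledKernelBase;

  void compile() override {
    recordCudaCode("// CUTLASS kernel for " + kernelName());
    compiled_ = true;  // stand-in for the nvcc + dlopen pipeline
  }

  bool isCompiled() const { return compiled_; }

 private:
  bool compiled_ = false;
};
```

The point of the split is that CompiledKernel (the existing NVRTC path) and CutlassCompiledKernel share naming and code recording but nothing about how compilation happens.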


github-actions bot commented Aug 29, 2025

Review updated until commit fa97d4e

Description

  • Refactor CompiledKernel into base and derived classes

  • Introduce CutlassCompiledKernel for CUTLASS kernel execution

  • Add input validation for tensor contiguity in GEMM

  • Implement NVCC-based compilation for CUTLASS kernels


Changes walkthrough 📝

Relevant files

Enhancement

csrc/cutlass/gemm.cpp: Add tensor validation and TensorArg abstraction (+63/-40)

  • Add input validation to ensure tensor contiguity
  • Introduce TensorArg struct for CUDA tensor representation
  • Update function signatures to use TensorArg instead of at::Tensor
  • Enhance kernel code generation with dimension checks

csrc/runtime/compiled_kernel.cpp: Refactor into CompiledKernelBase hierarchy (+59/-59)

  • Move common functionality to CompiledKernelBase
  • Implement queryTargetGPUVersion as standalone function
  • Update constructor to initialize base class properly
  • Move kernel ID creation to base class

csrc/runtime/cutlass_compiled_kernel.cpp: Implement CUTLASS kernel compilation and execution (+357/-0)

  • Implement CutlassCompiledKernel with NVCC compilation
  • Add code generation and dynamic loading via dlopen
  • Implement run method with TensorArg packing
  • Add comprehensive error handling and logging

csrc/runtime/compiled_kernel.h: Define CompiledKernelBase hierarchy (+151/-117)

  • Introduce CompiledKernelBase as common base class
  • Move shared members and methods to base class
  • Update CompiledKernel to inherit from base
  • Add queryTargetGPUVersion declaration

csrc/runtime/cutlass_compiled_kernel.h: Declare CutlassCompiledKernel interface (+69/-0)

  • Declare CutlassCompiledKernel class
  • Define interface for CUTLASS kernel compilation
  • Include necessary headers for compilation
  • Implement proper inheritance from CompiledKernelBase

Tests

tests/cpp/test_cutlass_scheduler.cpp: Add comprehensive CutlassCompiledKernel test (+66/-6)

  • Update test to use CutlassCompiledKernel
  • Add environment check for CUTLASS_PATH
  • Implement full end-to-end execution test
  • Add proper tensor allocation and validation

Configuration changes

CMakeLists.txt: Include cutlass_compiled_kernel in build (+1/-0)

  • Add cutlass_compiled_kernel.cpp to source list
  • Ensure proper compilation of new source file
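The walkthrough mentions a TensorArg struct and a run method that packs tensors before invoking the dlopen-loaded entry point. A simplified, self-contained sketch of that packing is below; the field names and the fixed-rank limit are illustrative assumptions, not the actual struct in csrc/cutlass/gemm.cpp:

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <initializer_list>

// Illustrative stand-in for the PR's TensorArg: a plain-data view of a tensor
// that can cross the dlopen() boundary without linking against ATen.
constexpr int64_t kMaxDims = 8;  // assumed rank limit for this sketch

struct TensorArg {
  void* data = nullptr;
  int64_t ndim = 0;
  std::array<int64_t, kMaxDims> sizes{};
  std::array<int64_t, kMaxDims> strides{};
};

// The generated kernel would be loaded with dlopen() and resolved with
// dlsym(); its entry point takes TensorArg values rather than at::Tensor.
using KernelEntryFn = void (*)(const TensorArg* args, int64_t num_args);

// Pack raw pointer/shape/stride data into a TensorArg before launch.
// Assumes sizes.size() <= kMaxDims; a real implementation would validate.
TensorArg packTensorArg(
    void* data,
    std::initializer_list<int64_t> sizes,
    std::initializer_list<int64_t> strides) {
  TensorArg arg;
  arg.data = data;
  arg.ndim = static_cast<int64_t>(sizes.size());
  int64_t i = 0;
  for (int64_t s : sizes) {
    arg.sizes[i++] = s;
  }
  i = 0;
  for (int64_t s : strides) {
    arg.strides[i++] = s;
  }
  return arg;
}
```

Keeping the argument struct free of ATen types is what lets the nvcc-compiled shared object stay ABI-independent of the host process's PyTorch build.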

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

🧪 PR contains tests
⚡ Recommended focus areas for review

Possible Issue

The tensor contiguity validation checks only the TensorView's contiguity vector; it does not verify that the tensor passed at runtime is actually contiguous in memory, which might lead to incorrect assumptions about memory layout.

```cpp
// Validate that the inputs and scale factors are all contiguous
for (TensorView* tv : {a, b, a_scale, b_scale}) {
  for (const auto& c : tv->getContiguity()) {
    if (c.has_value()) {
      NVF_ERROR(
          c.value() == true,
          "We require input TensorView ",
          tv->toString(),
          " to be fully contiguous for the Cutlass executor.");
    }
  }
}
```
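As the reviewer note suggests, the compile-time contiguity flags could be cross-checked against actual strides at runtime. A generic sketch of such a check (independent of ATen, assuming row-major layout; not the PR's code):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Returns true if `strides` describe a densely packed row-major layout for
// `sizes`: the innermost stride is 1 and each outer stride equals the product
// of all inner extents. Size-1 dimensions accept any stride, since their
// stride is never used to address more than one element.
bool isRowMajorContiguous(
    const std::vector<int64_t>& sizes,
    const std::vector<int64_t>& strides) {
  if (sizes.size() != strides.size()) {
    return false;
  }
  int64_t expected = 1;
  for (int64_t i = static_cast<int64_t>(sizes.size()) - 1; i >= 0; --i) {
    if (sizes[i] != 1 && strides[i] != expected) {
      return false;
    }
    expected *= sizes[i];
  }
  return true;
}
```

In the actual executor, ATen already provides `at::Tensor::is_contiguous()` for this purpose; the sketch just shows what that predicate computes.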
Performance Concern

The compile command generation appends `--expt-relaxed-constexpr` and `--expt-extended-lambda` twice (once mid-command and again at the end), and carries flags such as `-forward-unknown-to-host-compiler` whose necessity should be reviewed. Exact duplicate flags are harmless to correctness, but they bloat the command line and obscure intent.

```cpp
compile_cmd = "nvcc";
compile_cmd += " -forward-unknown-to-host-compiler";

// Disable some warnings in host code
compile_cmd += " -Wno-conversion";

for (const std::string& path : include_paths) {
  compile_cmd += " -I" + path;
}

compile_cmd +=
    " -Xcudafe "
    "--diag_suppress=cc_clobber_ignored,"
    "--diag_suppress=field_without_dll_interface,"
    "--diag_suppress=base_class_has_different_dll_interface,"
    "--diag_suppress=dll_interface_conflict_none_assumed,"
    "--diag_suppress=dll_interface_conflict_dllexport_assumed,"
    "--diag_suppress=bad_friend_decl";

compile_cmd += " --expt-relaxed-constexpr --expt-extended-lambda";

compile_cmd += " -O3";

compile_cmd += " -std=c++17";

std::string arch = getComputeCapabilityString(compute_capability);
// 90 -> 90a  or  100 -> 100a
// Note that without this we get errors like
//   Trying to use TMA Descriptor Prefetch without CUTE_ARCH_TMA_SM90_ENABLED.
// TODO: This should be done properly
// https://github.com/nvidia/cutlass#target-architecture
arch += "a";
std::string compute_arch = "compute_" + arch;
std::string sm_arch = "sm_" + arch;
compile_cmd +=
    " \"--generate-code="
    "arch=" +
    compute_arch +
    ","
    "code=[" +
    compute_arch + "," + sm_arch + "]\"";

compile_cmd +=
    " -DCUTE_USE_PACKED_TUPLE=1"
    " -DCUTLASS_ENABLE_TENSOR_CORE_MMA=1"
    " -DCUTLASS_VERSIONS_GENERATED"
    " -DCUTLASS_DEBUG_TRACE_LEVEL=0";

compile_cmd +=
    " --expt-relaxed-constexpr --expt-extended-lambda "
    "--threads=32";
```
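One low-risk cleanup for the duplicated flags would be to collect flags in a list and deduplicate before joining into a command line. A sketch of that approach (this is a suggestion, not the PR's actual code):

```cpp
#include <cassert>
#include <string>
#include <unordered_set>
#include <vector>

// Join compiler flags into a command line, dropping exact duplicates while
// preserving first-occurrence order. Duplicate nvcc flags do not change the
// output but make the command longer and harder to audit.
std::string buildCommand(
    const std::string& compiler,
    const std::vector<std::string>& flags) {
  std::string cmd = compiler;
  std::unordered_set<std::string> seen;
  for (const std::string& flag : flags) {
    if (seen.insert(flag).second) {  // insert() reports whether flag was new
      cmd += " " + flag;
    }
  }
  return cmd;
}
```

This only catches byte-identical repeats; flags with differing spellings (e.g. attached vs. separated arguments) would still need manual review.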
Possible Issue

The compute capability string hardcodes an appended "a" suffix (e.g. 90 -> 90a) without validating that an "a" variant exists for the target architecture, which might cause compatibility issues on GPUs that have no architecture-specific feature set.

```cpp
std::string arch = getComputeCapabilityString(compute_capability);
// 90 -> 90a  or  100 -> 100a
// Note that without this we get errors like
//   Trying to use TMA Descriptor Prefetch without CUTE_ARCH_TMA_SM90_ENABLED.
// TODO: This should be done properly
// https://github.com/nvidia/cutlass#target-architecture
arch += "a";
std::string compute_arch = "compute_" + arch;
std::string sm_arch = "sm_" + arch;
compile_cmd +=
```
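A more guarded mapping could append the suffix only for families known to have an "a" variant. Per the CUTLASS target-architecture docs, the "a" suffix (sm_90a, sm_100a) opts into architecture-specific features such as TMA and WGMMA; which majors take the suffix in this sketch is illustrative, not exhaustive:

```cpp
#include <cassert>
#include <string>

// Sketch: map a compute capability to an nvcc arch string, appending the "a"
// feature suffix only for architectures known (in this sketch) to have one.
// Appending "a" unconditionally, as the reviewed code does, is what the
// review comment flags.
std::string archString(int major, int minor) {
  std::string arch = std::to_string(major) + std::to_string(minor);
  // Assumption for illustration: Hopper (9.x) and Blackwell (10.x) have an
  // architecture-specific "a" variant; earlier architectures do not.
  if (major == 9 || major == 10) {
    arch += "a";
  }
  return arch;
}
```

Note the trade-off the existing TODO comment hints at: without the suffix, CUTLASS kernels using TMA fail to build (the CUTE_ARCH_TMA_SM90_ENABLED error above), but with it, the generated SASS is not forward-compatible with later architectures.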

@jacobhinkle (Collaborator, Author)

    !test

@jacobhinkle marked this pull request as ready for review August 29, 2025 18:33
@jacobhinkle changed the title from "[WIP] Add CutlassCompiledKernel deriving CompiledKernelBase" to "Add CutlassCompiledKernel deriving CompiledKernelBase" Aug 29, 2025
@jacobhinkle (Collaborator, Author)

!test

Base automatically changed from jh/cutlass_executor.codegen to main September 3, 2025 17:34
@jacobhinkle (Collaborator, Author)

This test is currently skipped in CI because $CUTLASS_PATH is not set in the CI environment. With this change we will need that path available at runtime, so at some point we will want to set up our CI to point to CUTLASS.

@jacobhinkle (Collaborator, Author)

!test
