
Conversation

@rdspring1 (Collaborator) added the Direct Bindings (Python extension with direct mapping to NvFuser CPP objects) and Cutlass labels on Sep 2, 2025

github-actions bot commented Sep 2, 2025

Review updated until commit 34cc5f5

Description

  • Add grouped_mm support for bf16 and fp16 on Blackwell

  • Implement CUDA kernel for grouped GEMM memory layout

  • Add test suite for grouped_mm with multiple configurations

  • Update build system and headers for new kernel


Changes walkthrough 📝

Relevant files

Enhancement

  python/python_direct/cutlass.cpp: Add grouped_mm Python binding (+10/-0)
    • Add Python binding for grouped_mm function
    • Expose grouped_mm with full parameter documentation

  cutlass/group_mm.cu: Implement grouped_mm CUDA kernel (+524/-0)
    • Implement grouped_mm CUDA kernel for Blackwell
    • Support bf16 and fp16 through template specialization
    • Include memory layout and offset computation
    • Validate inputs and handle error cases

  cutlass/nvf_cutlass.h: Declare grouped_mm in header (+26/-0)
    • Add grouped_mm function declaration
    • Document parameters and return value

Tests

  tests/python/direct/test_cutlass_gemm.py: Add grouped_mm test suite (+76/-0)
    • Add comprehensive test for grouped_mm
    • Test multiple configs and dtypes (bf16, fp16)
    • Validate output against PyTorch reference

Formatting

  cutlass/nvfp4_scaled_group_mm.cu: Clean up includes and comments (+10/-23)
    • Remove unused includes
    • Fix comment formatting and parameter descriptions

  cutlass/nvfp4_scaled_mm_blockscale.cu: Remove unnecessary includes (+0/-4)
    • Remove unused includes
    • Clean up header dependencies

Configuration changes

  CMakeLists.txt: Include group_mm.cu in build (+1/-0)
    • Add group_mm.cu to build system

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

🧪 PR contains tests
⚡ Recommended focus areas for review

Potential Overflow Risk

The kernel get_group_gemm_starts uses threadIdx.x as the expert_id, but the bounds check compares it against gridDim.x * blockDim.x (the total launch size) rather than the number of experts, so any thread whose expert_id exceeds the expert count could read out of bounds from expert_offsets and problem_sizes_as_shapes.

    int64_t expert_id = threadIdx.x;
    if (expert_id >= gridDim.x * blockDim.x) {
      return;
    }
    // Upcast from int32_t to int64_t to avoid overflow during offset calculations
    int64_t expert_offset = static_cast<int64_t>(expert_offsets[expert_id]);
    int64_t n = static_cast<int64_t>(problem_sizes_as_shapes[expert_id * 3 + 1]);
    int64_t k = static_cast<int64_t>(problem_sizes_as_shapes[expert_id * 3 + 2]);
    assert((n == N && k == K) && "Unexpected problem sizes");
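
A minimal sketch of the kind of guard this observation points at; num_experts is a hypothetical extra kernel parameter, not something the PR passes as written:

    #include <cstdint>

    __global__ void get_group_gemm_starts_guarded(
        const int32_t* expert_offsets,
        const int32_t* problem_sizes_as_shapes,
        int64_t num_experts /* hypothetical: total number of experts */) {
      int64_t expert_id = threadIdx.x;
      // Guard against the expert count, not the launch size, so surplus threads
      // never index past the end of expert_offsets / problem_sizes_as_shapes.
      if (expert_id >= num_experts) {
        return;
      }
      int64_t expert_offset = static_cast<int64_t>(expert_offsets[expert_id]);
      // ... pointer/offset setup for this expert continues as in the PR ...
    }
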
Incomplete Validation

    The function validateInputsGroupMm does not validate the ab_strides and c_strides tensors, which are critical for correct memory access patterns in the grouped GEMM operation.

    void validateInputsGroupMm(
        const torch::Tensor& a,
        const torch::Tensor& b,
        const torch::Tensor& problem_sizes,
        const torch::Tensor& expert_offsets) {
      // Check data types
      NVF_CHECK(
          a.scalar_type() == at::ScalarType::BFloat16 ||
              a.scalar_type() == at::ScalarType::Half,
          "Expected BFloat16 or Half for Operand A.")
      NVF_CHECK(
          b.scalar_type() == at::ScalarType::BFloat16 ||
              b.scalar_type() == at::ScalarType::Half,
          "Expected BFloat16 or Half for Operand B.")
    
      // Check CUDA device
      NVF_CHECK(a.is_cuda(), "Expected CUDA tensor for Operand A.")
      NVF_CHECK(b.is_cuda(), "Expected CUDA tensor for Operand B.")
    
      // Check contiguity
      NVF_CHECK(a.is_contiguous(), "Expected contiguous tensor for Operand A.")
      NVF_CHECK(b.is_contiguous(), "Expected contiguous tensor for Operand B.")
    
      // Check shapes
      NVF_CHECK(problem_sizes.dim() == 2, "problem_sizes must be  a 2D tensor");
      NVF_CHECK(
          problem_sizes.size(1) == 3,
          "problem_sizes must have the shape (num_experts, 3)");
      NVF_CHECK(
          problem_sizes.size(0) == expert_offsets.size(0),
          "Number of experts in problem_sizes must match expert_offsets");
      NVF_CHECK(
          problem_sizes.dtype() == torch::kInt32, "problem_sizes must be int32.");
    }
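
A hedged sketch of the additional stride checks this observation calls for; the dtype and per-expert shape expectations below are assumptions about how ab_strides and c_strides are laid out, not something stated in the PR:

    // Hypothetical companion to validateInputsGroupMm (assumes one stride entry
    // per expert and int64 stride tensors; adjust to the kernel's real layout).
    void validateStridesGroupMm(
        const torch::Tensor& ab_strides,
        const torch::Tensor& c_strides,
        const torch::Tensor& problem_sizes) {
      NVF_CHECK(ab_strides.is_cuda(), "Expected CUDA tensor for ab_strides.");
      NVF_CHECK(c_strides.is_cuda(), "Expected CUDA tensor for c_strides.");
      NVF_CHECK(
          ab_strides.dtype() == torch::kInt64, "ab_strides must be int64.");
      NVF_CHECK(c_strides.dtype() == torch::kInt64, "c_strides must be int64.");
      NVF_CHECK(
          ab_strides.size(0) == problem_sizes.size(0),
          "Number of experts in ab_strides must match problem_sizes");
      NVF_CHECK(
          c_strides.size(0) == problem_sizes.size(0),
          "Number of experts in c_strides must match problem_sizes");
    }
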
Missing Error Handling

The Python binding for grouped_mm does not include error handling or validation of input tensor properties beyond what is done in the C++ implementation, potentially allowing invalid inputs to reach the CUDA kernel.

    cutlass.def(
        "grouped_mm",
        &cutlass_kernels::grouped_mm,
        R"(Computes grouped matmul and returns bf16 or fp16 output tensor.
           grouped_mm(Tensor a,
                      Tensor b,
                      Tensor ab_strides,
                      Tensor c_strides,
                      Tensor problem_sizes,
                      Tensor expert_offsets) -> Tensor output)");

@jjsjann123 (Collaborator) left a comment

Same comment as in the other PR.
On top of that, a nitpick: the comments should drop the nvfp4-related material that is no longer relevant for the 16-bit types.

@rdspring1 force-pushed the cutlass_grouped_gemm_refactor branch from 4fcd1a3 to f43d0a5 on September 4, 2025 02:25
Base automatically changed from cutlass_grouped_gemm_refactor to main on September 5, 2025 01:58
@rdspring1 force-pushed the cutlass_grouped_gemm_bf16 branch 2 times, most recently from 3e2daf1 to 3eba5ae on September 5, 2025 23:19
@rdspring1 requested a review from jjsjann123 on September 6, 2025 17:04
@rdspring1 force-pushed the cutlass_grouped_gemm_bf16 branch from ff39085 to 21f32c9 on September 6, 2025 17:09

@rdspring1 (Collaborator, Author) commented:

!test

@jjsjann123 (Collaborator) left a comment:

lgtm~

    #include "cutlass/util/reference/host/tensor_compare.h"
    #include "cutlass/util/reference/host/tensor_fill.h"
    #include "cutlass/util/reference/host/tensor_norm.h"
    #include "cutlass/util/tensor_view_io.h"

Collaborator comment:

Thanks for cleaning up the header~~~

@jacobhinkle (Collaborator) left a comment

LGTM. Left some minor comments. These comments apply to the scaled input version as well; I just didn't notice previously.

Comment on lines +35 to +57:

    template <typename T, bool is_single_sm>
    struct KernelTraits;

    // Kernel traits for FP16 output
    template <>
    struct KernelTraits<cutlass::half_t, true> {
      using MmaTileShape = Shape<_128, _256, Int<128 / sizeof(cutlass::half_t)>>;
      using ClusterShape = Shape<_1, _1, _1>;
      using KernelSchedule =
          cutlass::gemm::KernelPtrArrayTmaWarpSpecialized1SmSm100;
      using EpilogueSchedule = cutlass::epilogue::PtrArrayTmaWarpSpecialized1Sm;
    };

    // Kernel traits for BFloat16 output
    template <>
    struct KernelTraits<cutlass::bfloat16_t, true> {
      using MmaTileShape =
          Shape<_128, _256, Int<128 / sizeof(cutlass::bfloat16_t)>>;
      using ClusterShape = Shape<_1, _1, _1>;
      using KernelSchedule =
          cutlass::gemm::KernelPtrArrayTmaWarpSpecialized1SmSm100;
      using EpilogueSchedule = cutlass::epilogue::PtrArrayTmaWarpSpecialized1Sm;
    };

Collaborator comment:

I guess we could just handle this in a single template definition using sizeof(T) and using

      std::conditional<is_single_sm,
          cutlass::gemm::KernelPtrArrayTmaWarpSpecialized1SmSm100,
          cutlass::gemm::KernelPtrArrayTmaWarpSpecialized2SmSm100>::type
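
A sketch of that suggestion, assuming the cute shape aliases and CUTLASS schedule types already pulled in by group_mm.cu; the 2-SM epilogue name (PtrArrayTmaWarpSpecialized2Sm) is an unverified assumption by analogy with the 1-SM one, and this is not the code that was merged:

    #include <type_traits>

    // Single definition covering half_t/bfloat16_t and both SM modes.
    template <typename T, bool is_single_sm>
    struct KernelTraits {
      using MmaTileShape = Shape<_128, _256, Int<128 / sizeof(T)>>;
      // Note: the 2-SM schedules would also need a compatible ClusterShape
      // (e.g. _2 in M); kept at 1x1x1 here to mirror the merged code.
      using ClusterShape = Shape<_1, _1, _1>;
      using KernelSchedule = typename std::conditional<
          is_single_sm,
          cutlass::gemm::KernelPtrArrayTmaWarpSpecialized1SmSm100,
          cutlass::gemm::KernelPtrArrayTmaWarpSpecialized2SmSm100>::type;
      // Assumed analogous name for the 2-SM epilogue schedule; verify against
      // the CUTLASS release in use.
      using EpilogueSchedule = typename std::conditional<
          is_single_sm,
          cutlass::epilogue::PtrArrayTmaWarpSpecialized1Sm,
          cutlass::epilogue::PtrArrayTmaWarpSpecialized2Sm>::type;
    };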

Comment on lines +299 to +311:

    run_get_group_gemm_starts(
        a_ptrs,
        b_ptrs,
        out_ptrs,
        a,
        b,
        output,
        expert_offsets,
        problem_sizes,
        M,
        N,
        K,
        stream);

Collaborator comment:

Is it normal to need to launch two kernels for every invocation? I would imagine that for common use cases we'd have a static shape, meaning all the offsets could be precomputed (even if they live on the GPU) and we'd just change the base data pointer for each invocation. I wonder what is done for small problems or inference use cases.

@rdspring1 (Collaborator, Author) replied:

Yes. Can the CutlassExecutor cache based on the shapes to avoid calling run_get_group_gemm_starts?

Collaborator reply:

Good question. Maybe we could if we know the offsets are constant, but that data is on the device, right?

Comment on lines +326 to +334:

    scheduler.raster_order = RasterOrderOptions::AlongM;
    hw_info.device_id = a.get_device();
    static std::unordered_map<int, int> cached_sm_counts;
    if (cached_sm_counts.find(hw_info.device_id) == cached_sm_counts.end()) {
      cached_sm_counts[hw_info.device_id] =
          cutlass::KernelHardwareInfo::query_device_multiprocessor_count(
              hw_info.device_id);
    }
    hw_info.sm_count = min(cached_sm_counts[hw_info.device_id], INT_MAX);

Collaborator comment:

I assume we don't need to set hw_info.max_active_clusters since the cluster size is 1,1,1. If we make this adjustable I guess we'd need to update that here.

Collaborator comment:

BTW I think we can also just call KernelHardwareInfo<GemmKernel>::make_kernel_hardware_info, which will automatically initialize these.
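
A short sketch of that alternative; it assumes the CUTLASS helper is the kernel-templated static factory on KernelHardwareInfo that the reviewer describes, so treat the exact spelling as unverified, and GemmKernel stands in for whatever kernel type group_mm.cu defines:

    // Hypothetical replacement for the manual SM-count caching above.
    cutlass::KernelHardwareInfo hw_info =
        cutlass::KernelHardwareInfo::make_kernel_hardware_info<GemmKernel>(
            a.get_device());
    // If the helper behaves as described, hw_info.sm_count and
    // hw_info.max_active_clusters are filled in for us, so no separate
    // query_device_multiprocessor_count call would be needed.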

@rdspring1 force-pushed the cutlass_grouped_gemm_bf16 branch from 21f32c9 to 34cc5f5 on September 10, 2025 21:44

@rdspring1 (Collaborator, Author) commented:

!build

@rdspring1 merged commit 515a337 into main on Sep 10, 2025
17 checks passed
@rdspring1 deleted the cutlass_grouped_gemm_bf16 branch on September 10, 2025 23:18
wujingyue added a commit that referenced this pull request on Sep 11, 2025

Labels

Cutlass, Direct Bindings (Python extension with direct mapping to NvFuser CPP objects)

4 participants