Conversation


@liqiangxl liqiangxl commented Sep 30, 2025

Summary

  • Enable clustered inner reductions on Hopper+ for large-vocab workloads (cross-entropy).
  • [Minor refactor] Centralize CUDA context initialization and ensure a valid context before occupancy queries.

Motivation

Large-vocabulary reductions benefit from CTA clustering on Hopper. This PR introduces a targeted heuristic and scheduling path that leverages clusters where profitable, while preserving existing behavior on pre-Hopper GPUs.

Key Changes

  • Scheduler/Heuristics:

    • Add ReductionParams::cross_cluster_reduction and clustered splitting/mapping in scheduleReductionTV.
    • Propagate clustered block parallelization in parallelizeAllLike.
    • Add scheduler_utils::getMaxClusterSize() (Hopper-aware, occupancy-based) and validation checks for clustered reductions.
    • Extend normalization persistent buffer sizing and heuristics for clusters.
  • Runtime/Init:

    • Extract executor_utils::initializeCudaContext(); use it in compiled_kernel.cpp, executor.cpp, and before occupancy queries in matmul_utils.cpp.
    • Register cuCtxGetCurrent in driver_api.h.
  • Tests:

    • Add a clustered autoscheduler test and an invalid cluster-size runtime check.
    • Skip certain SMEM-persistent tests on Hopper+; adjust Welford translation sizes.

Compatibility

  • Pre-Hopper (SM < 90): getMaxClusterSize() returns 1; existing SMEM paths preserved.
  • No user-visible API changes; scheduling additions are internal.

Performance

Benchmark results assume #5337 is merged.
See https://docs.google.com/document/d/1zrAoWMXm8pC5lG2STFzY7bKkcL1ZzdioxGe0nriiHhE/edit?usp=sharing

@liqiangxl liqiangxl changed the title from "inner persistent scheudler uses cluster reduction" to "inner persistent scheduler uses cluster reduction" Sep 30, 2025
@liqiangxl liqiangxl force-pushed the llu/cluster_reduction_scheduler branch from d010e02 to 89153ce on September 30, 2025 00:27
@liqiangxl
Collaborator Author

!test


github-actions bot commented Sep 30, 2025

Review updated until commit ed220ce

Description

  • Enable clustered reductions on Hopper+ GPUs for large-vocab workloads

  • Centralize CUDA context initialization for occupancy and runtime safety

  • Add validation for clustered reduction domains and cluster size limits

  • Extend inner persistent scheduler with cluster-aware heuristics and buffer sizing


Changes walkthrough 📝

Relevant files

Enhancement (10 files)
  validation.cpp             Add validation for clustered reductions        +59/-0
  compiled_kernel.cpp        Use centralized CUDA context init              +2/-26
  executor.cpp               Use centralized CUDA context init              +1/-1
  executor_utils.cpp         Extract CUDA context initialization            +15/-0
  normalization_inner.cpp    Add cluster heuristic for large reductions     +83/-7
  reduction_utils.cpp        Schedule reductions with cluster support       +24/-0
  utils.cpp                  Add getMaxClusterSize and propagate clusters   +22/-0
  executor_utils.h           Declare context initialization                 +1/-0
  reduction_heuristic.h      Add cross-cluster reduction flag               +6/-0
  utils.h                    Declare getMaxClusterSize                      +3/-0

Bug fix (1 file)
  matmul_utils.cpp           Ensure CUDA context for occupancy              +9/-3

Tests (3 files)
  test_cluster.cpp           Add autoscheduler and invalid size tests       +106/-0
  test_persistent_buffer.cpp Skip SMEM tests on Hopper+                     +13/-0
  test_welford.cpp           Adjust Welford test sizes for clusters         +8/-4

Dependencies (1 file)
  driver_api.h               Register cuCtxGetCurrent                       +2/-1

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

🧪 PR contains tests
⚡ Recommended focus areas for review

Heuristic Assumption

The heuristic in innerPersistentHeuristicCluster assumes 2 blocks per SM and fixed register usage, which may not hold across different GPU architectures or workloads, potentially leading to suboptimal performance or resource exhaustion.

const int64_t register_per_block =
    scheduler_utils::register_file_size_bit / 2;
int64_t blocks_per_cluster =
    ceilDiv(properties.max_persistent_buffer_size_bit, register_per_block);
blocks_per_cluster = scheduler_utils::roundUpPow2(blocks_per_cluster);
blocks_per_cluster =
    std::min(blocks_per_cluster, scheduler_utils::getMaxClusterSize());
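To make the assumption concrete, here is a minimal host-side sketch of that arithmetic. The helpers ceilDiv and roundUpPow2 are reimplemented here for illustration (assumed to match the scheduler_utils versions), and the numbers in the usage note are hypothetical, not measured from the PR: a 256 KiB register file per SM is 2'097'152 bits, so half of it (the assumed per-block budget) is 1'048'576 bits.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Illustrative helpers assumed to mirror scheduler_utils.
int64_t ceilDiv(int64_t a, int64_t b) {
  return (a + b - 1) / b;
}
int64_t roundUpPow2(int64_t x) {
  int64_t p = 1;
  while (p < x) {
    p *= 2;
  }
  return p;
}

// Sketch of the heuristic: split the persistent buffer across just enough
// blocks that each block's share fits in half the register file, round the
// block count up to a power of two, then cap it at the device's max
// cluster size.
int64_t blocksPerCluster(
    int64_t buffer_size_bit,
    int64_t register_file_size_bit,
    int64_t max_cluster_size) {
  const int64_t register_per_block = register_file_size_bit / 2;
  int64_t blocks = ceilDiv(buffer_size_bit, register_per_block);
  blocks = roundUpPow2(blocks);
  return std::min(blocks, max_cluster_size);
}
```

For example, a 3'000'000-bit persistent buffer against a 2'097'152-bit register file needs ceilDiv(3'000'000, 1'048'576) = 3 blocks, rounded up to 4; a very large buffer saturates at the cluster-size cap.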
Validation Scope

The validation for clustered reductions is only applied when a clustered domain is detected, but there is no check ensuring that clustering is only used on supported hardware (e.g., Hopper+), which could lead to runtime errors on unsupported devices.

if (std::any_of(
        out->getLoopDomain().begin(),
        out->getLoopDomain().end(),
        [](IterDomain* id) { return id->isClusteredBlockDim(); })) {
  validateClusterReduction(rop);
}
Cluster Size Calculation

The getMaxClusterSize function starts from a hardcoded cluster size of 16 and halves it until a valid size is found. This explores only power-of-two sizes, so it may miss intermediate cluster configurations that would be optimal for certain workloads.

int64_t getMaxClusterSize() {
  // return 1 for pre-Hopper devices
  if (at::cuda::getCurrentDeviceProperties()->major < 9) {
    return 1;
  }
  int cluster_size = 16;
  while (cluster_size > 1 &&
         matmul_utils::getMaxActiveClusters(
             MatmulParams::ClusterDims{cluster_size, 1}) < 1) {
    cluster_size /= 2;
  }
  return cluster_size;
}

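The halving search can be exercised in isolation by injecting the occupancy check as a predicate. In this sketch, maxClusterSize and fitsOnDevice are illustrative names, not the PR's API: fitsOnDevice stands in for the matmul_utils::getMaxActiveClusters(...) >= 1 query, and the pre-Hopper flag stands in for the compute-capability check.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>

// Sketch of the halving search in getMaxClusterSize, with the occupancy
// query abstracted as an injected predicate so the control flow can be
// tested without a GPU.
int64_t maxClusterSize(
    bool is_hopper_or_newer,
    const std::function<bool(int64_t)>& fitsOnDevice) {
  if (!is_hopper_or_newer) {
    return 1; // pre-Hopper devices do not support CTA clusters
  }
  int64_t cluster_size = 16; // hardcoded upper bound, as in the PR snippet
  // Halve until the cluster fits: tries 16 -> 8 -> 4 -> 2, then gives up at 1.
  while (cluster_size > 1 && !fitsOnDevice(cluster_size)) {
    cluster_size /= 2;
  }
  return cluster_size;
}
```

For instance, if the device can co-schedule clusters of at most 8 CTAs, the search rejects 16 and returns 8; if nothing fits, it falls through to 1, which degenerates to an unclustered launch.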
@liqiangxl
Collaborator Author

!test

2 similar comments
@liqiangxl
Collaborator Author

!test

@liqiangxl
Collaborator Author

!test

@liqiangxl liqiangxl force-pushed the llu/cluster_reduction_scheduler branch from f624c88 to b2e89a2 on October 2, 2025 16:05
@liqiangxl
Collaborator Author

!test

@liqiangxl
Collaborator Author

!test

@liqiangxl liqiangxl changed the base branch from main to llu/bm_quack October 10, 2025 13:28
@liqiangxl liqiangxl changed the base branch from llu/bm_quack to main October 10, 2025 13:29
@liqiangxl
Collaborator Author

!test

@liqiangxl
Collaborator Author

!test

1 similar comment
@liqiangxl
Collaborator Author

!test

@liqiangxl
Collaborator Author

!test

@liqiangxl liqiangxl marked this pull request as ready for review October 13, 2025 12:56
@liqiangxl liqiangxl requested a review from naoyam October 14, 2025 14:24
@liqiangxl
Collaborator Author

!test
