Conversation


@liqiangxl liqiangxl commented Sep 30, 2025

Summary

  • Enable clustered inner reductions on Hopper+ for large-vocab workloads (cross-entropy).
  • [Minor refactor] Centralize CUDA context initialization and ensure a valid context before occupancy queries.

Motivation

Large-vocabulary reductions benefit from CTA clustering on Hopper. This PR introduces a targeted heuristic and scheduling path that leverages clusters where profitable, while preserving existing behavior on pre-Hopper GPUs.

Key Changes

  • Scheduler/Heuristics:

    • Add ReductionParams::cross_cluster_reduction and clustered splitting/mapping in scheduleReductionTV.
    • Propagate clustered block parallelization in parallelizeAllLike.
    • Add scheduler_utils::getMaxClusterSize() (Hopper-aware, occupancy-based) and validation checks for clustered reductions.
    • Extend normalization persistent buffer sizing and heuristics for clusters.
  • Runtime/Init:

    • Extract executor_utils::initializeCudaContext(); use it in compiled_kernel.cpp, executor.cpp, and before occupancy queries in matmul_utils.cpp.
    • Register cuCtxGetCurrent in driver_api.h.
  • Tests:

    • Add a clustered autoscheduler test and an invalid cluster-size runtime check.
    • Skip certain SMEM-persistent tests on Hopper+; adjust Welford translation sizes.

Compatibility

  • Pre-Hopper (SM < 90): getMaxClusterSize() returns 1; existing SMEM paths preserved.
  • No user-visible API changes; scheduling additions are internal.

Performance

Benchmark results assume #5337 is merged.
See https://docs.google.com/document/d/1zrAoWMXm8pC5lG2STFzY7bKkcL1ZzdioxGe0nriiHhE/edit?usp=sharing

@liqiangxl liqiangxl changed the title from "inner persistent scheudler uses cluster reduction" to "inner persistent scheduler uses cluster reduction" Sep 30, 2025
@liqiangxl liqiangxl force-pushed the llu/cluster_reduction_scheduler branch from d010e02 to 89153ce on September 30, 2025 00:27
@liqiangxl
Collaborator Author

!test


github-actions bot commented Sep 30, 2025

Review updated until commit ed220ce

Description

  • Enable clustered reductions on Hopper+ GPUs for large-vocab workloads

  • Centralize CUDA context initialization for occupancy and runtime safety

  • Add validation for clustered reduction domains and cluster size limits

  • Extend inner persistent scheduler with cluster-aware heuristics and buffer sizing


Changes walkthrough 📝

Relevant files

Enhancement (10 files)
  validation.cpp             Add validation for clustered reductions        +59/-0
  compiled_kernel.cpp        Use centralized CUDA context init              +2/-26
  executor.cpp               Use centralized CUDA context init              +1/-1
  executor_utils.cpp         Extract CUDA context initialization            +15/-0
  normalization_inner.cpp    Add cluster heuristic for large reductions     +83/-7
  reduction_utils.cpp        Schedule reductions with cluster support       +24/-0
  utils.cpp                  Add getMaxClusterSize and propagate clusters   +22/-0
  executor_utils.h           Declare context initialization                 +1/-0
  reduction_heuristic.h      Add cross-cluster reduction flag               +6/-0
  utils.h                    Declare getMaxClusterSize                      +3/-0

Bug fix (1 file)
  matmul_utils.cpp           Ensure CUDA context for occupancy              +9/-3

Tests (3 files)
  test_cluster.cpp           Add autoscheduler and invalid size tests       +106/-0
  test_persistent_buffer.cpp Skip SMEM tests on Hopper+                     +13/-0
  test_welford.cpp           Adjust Welford test sizes for clusters         +8/-4

Dependencies (1 file)
  driver_api.h               Register cuCtxGetCurrent                       +2/-1

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

🧪 PR contains tests
⚡ Recommended focus areas for review

Heuristic Assumption

The heuristic in innerPersistentHeuristicCluster assumes 2 blocks per SM and fixed register usage, which may not hold across different GPU architectures or workloads, potentially leading to suboptimal performance or resource exhaustion.

const int64_t register_per_block =
    scheduler_utils::register_file_size_bit / 2;
int64_t blocks_per_cluster =
    ceilDiv(properties.max_persistent_buffer_size_bit, register_per_block);
blocks_per_cluster = scheduler_utils::roundUpPow2(blocks_per_cluster);
blocks_per_cluster =
    std::min(blocks_per_cluster, scheduler_utils::getMaxClusterSize());
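To make the assumption concrete, here is a minimal host-side sketch of that arithmetic. The helpers ceilDiv and roundUpPow2 are reimplemented here for illustration (assumed to match the scheduler_utils versions), and the numbers in the usage note are hypothetical, not measured from the PR: a 256 KiB register file per SM is 2'097'152 bits, so half of it (the assumed per-block budget) is 1'048'576 bits.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Illustrative helpers assumed to mirror scheduler_utils.
int64_t ceilDiv(int64_t a, int64_t b) {
  return (a + b - 1) / b;
}
int64_t roundUpPow2(int64_t x) {
  int64_t p = 1;
  while (p < x) {
    p *= 2;
  }
  return p;
}

// Sketch of the heuristic: split the persistent buffer across just enough
// blocks that each block's share fits in half the register file, round the
// block count up to a power of two, then cap it at the device's max
// cluster size.
int64_t blocksPerCluster(
    int64_t buffer_size_bit,
    int64_t register_file_size_bit,
    int64_t max_cluster_size) {
  const int64_t register_per_block = register_file_size_bit / 2;
  int64_t blocks = ceilDiv(buffer_size_bit, register_per_block);
  blocks = roundUpPow2(blocks);
  return std::min(blocks, max_cluster_size);
}
```

For example, a 3'000'000-bit persistent buffer against a 2'097'152-bit register file needs ceilDiv(3'000'000, 1'048'576) = 3 blocks, rounded up to 4; a very large buffer saturates at the cluster-size cap.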
Validation Scope

The validation for clustered reductions is only applied when a clustered domain is detected, but there is no check ensuring that clustering is only used on supported hardware (e.g., Hopper+), which could lead to runtime errors on unsupported devices.

if (std::any_of(
        out->getLoopDomain().begin(),
        out->getLoopDomain().end(),
        [](IterDomain* id) { return id->isClusteredBlockDim(); })) {
  validateClusterReduction(rop);
}
Cluster Size Calculation

The getMaxClusterSize function starts from a hardcoded cluster size of 16 and halves it until a valid size is found. This explores only power-of-two sizes, so it may miss intermediate cluster configurations that would be optimal for certain workloads.

int64_t getMaxClusterSize() {
  // return 1 for pre-Hopper devices
  if (at::cuda::getCurrentDeviceProperties()->major < 9) {
    return 1;
  }
  int cluster_size = 16;
  while (cluster_size > 1 &&
         matmul_utils::getMaxActiveClusters(
             MatmulParams::ClusterDims{cluster_size, 1}) < 1) {
    cluster_size /= 2;
  }
  return cluster_size;
}

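The halving search can be exercised in isolation by injecting the occupancy check as a predicate. In this sketch, maxClusterSize and fitsOnDevice are illustrative names, not the PR's API: fitsOnDevice stands in for the matmul_utils::getMaxActiveClusters(...) >= 1 query, and the pre-Hopper flag stands in for the compute-capability check.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>

// Sketch of the halving search in getMaxClusterSize, with the occupancy
// query abstracted as an injected predicate so the control flow can be
// tested without a GPU.
int64_t maxClusterSize(
    bool is_hopper_or_newer,
    const std::function<bool(int64_t)>& fitsOnDevice) {
  if (!is_hopper_or_newer) {
    return 1; // pre-Hopper devices do not support CTA clusters
  }
  int64_t cluster_size = 16; // hardcoded upper bound, as in the PR snippet
  // Halve until the cluster fits: tries 16 -> 8 -> 4 -> 2, then gives up at 1.
  while (cluster_size > 1 && !fitsOnDevice(cluster_size)) {
    cluster_size /= 2;
  }
  return cluster_size;
}
```

For instance, if the device can co-schedule clusters of at most 8 CTAs, the search rejects 16 and returns 8; if nothing fits, it falls through to 1, which degenerates to an unclustered launch.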
@liqiangxl
Collaborator Author

!test

2 similar comments
@liqiangxl
Collaborator Author

!test

@liqiangxl
Collaborator Author

!test

@liqiangxl liqiangxl force-pushed the llu/cluster_reduction_scheduler branch from f624c88 to b2e89a2 on October 2, 2025 16:05
@liqiangxl
Collaborator Author

!test

@liqiangxl
Collaborator Author

!test

@liqiangxl liqiangxl changed the base branch from main to llu/bm_quack October 10, 2025 13:28
@liqiangxl liqiangxl changed the base branch from llu/bm_quack to main October 10, 2025 13:29
@liqiangxl
Collaborator Author

!test

@liqiangxl
Collaborator Author

!test

1 similar comment
@liqiangxl
Collaborator Author

!test

@liqiangxl
Collaborator Author

!test

@liqiangxl liqiangxl marked this pull request as ready for review October 13, 2025 12:56
@liqiangxl liqiangxl requested a review from naoyam October 14, 2025 14:24
@liqiangxl
Collaborator Author

!test
