inner persistent scheduler uses cluster reduction #5268
Summary
Motivation
Large vocabulary reductions benefit from CTA clustering on Hopper. This introduces a targeted heuristic and scheduling path to leverage clusters where profitable, preserving behavior on pre-Hopper GPUs.
Key Changes
Scheduler/Heuristics:
- ReductionParams.cross_cluster_reduction and clustered splitting/mapping in scheduleReductionTV.
- parallelizeAllLike.
- scheduler_utils::getMaxClusterSize() (Hopper-aware, occupancy-based) and validation checks for clustered reductions.
Runtime/Init:
- executor_utils::initializeCudaContext() extracted; used by compiled_kernel.cpp, executor.cpp, and before occupancy in matmul_utils.cpp.
- cuCtxGetCurrent in driver_api.h.
Tests:
Compatibility
- getMaxClusterSize() returns 1; existing SMEM paths preserved.
Performance
Assuming #5337 is merged.
See https://docs.google.com/document/d/1zrAoWMXm8pC5lG2STFzY7bKkcL1ZzdioxGe0nriiHhE/edit?usp=sharing
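For reviewers who want a mental model of the "Hopper-aware, occupancy-based" fallback described under Key Changes and Compatibility, here is a minimal sketch. It is not the code in this PR: `DeviceInfo`, `OccupancyProbe`, and `maxClusterSizeSketch` are hypothetical names, and the occupancy query is abstracted into a callback (in a real build it might wrap something like `cudaOccupancyMaxActiveClusters`, which is an assumption here). The sketch assumes clusters require compute capability 9.0 or newer and that 8 is the portable upper bound on cluster size.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>

// Hypothetical stand-in for a device query; not nvFuser's actual type.
struct DeviceInfo {
  int compute_capability_major;  // 9 corresponds to Hopper
};

// Hypothetical occupancy probe: given a cluster size, returns how many
// clusters of that size can be resident at once (0 if none can launch).
using OccupancyProbe = std::function<int(int cluster_size)>;

// Sketch of a Hopper-aware, occupancy-based cluster size heuristic.
// Returns 1 when clusters are unavailable or unprofitable, so callers
// fall back to the existing non-clustered path.
int64_t maxClusterSizeSketch(
    const DeviceInfo& dev,
    const OccupancyProbe& active_clusters) {
  assert(dev.compute_capability_major > 0);

  // Assumption: thread block clusters need compute capability >= 9.0.
  if (dev.compute_capability_major < 9) {
    return 1;
  }

  // Assumption: 8 is the portable maximum cluster size. Try the largest
  // power-of-two size first and shrink until one fits on the device.
  for (int size = 8; size > 1; size /= 2) {
    if (active_clusters(size) > 0) {
      return size;
    }
  }
  return 1;
}
```

With a probe that always reports zero resident clusters, or on a pre-Hopper device, the function returns 1, which mirrors the compatibility behavior stated above.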