Lowering scatter accumulate #4764
Conversation
!test
This PR introduces preliminary support for lowering the scatter operation instead of falling back to ATen. The primary motivation is to generate a single fused kernel for fusions like [SgLangMoeTest.ComputeArgSort](https://github.com/NVIDIA/Fuser/blob/main/tests/cpp/test_moe.cpp#L155-L197). It is not yet piped through FusionExecutorCache, so nothing should be impacted as long as nvFuser is used through FusionExecutorCache.

Scatter is inherently in-place, which doesn't mix well with the overall semantics of the Fusion IR. From the user's perspective, scatter is provided as an out-of-place operation, like below:

```cpp
auto tv4 = scatter(tv2, 0, tv1, tv3);
```

https://github.com/NVIDIA/Fuser/pull/4742/files#diff-a50219bc583905a766ab511e0af91ba8af96a821a93bb19f20d4b550c18a9f5cR49

Here, `tv2` and `tv4` are different tensors in the fusion, and the user is free to use them separately. However, when generating a CUDA kernel, we want to implement the operation in-place, so at the time of lowering it is [validated](https://github.com/NVIDIA/Fuser/pull/4742/files#diff-b8542908e49882b02549144d87bbf19225a253305e26a0f7ea1665a05cfc30f4R1338-R1342) that the scatter input is only used by the scatter operation itself. This restriction should be enforced by the fusion segmenter.

~Before lowering, the loop domain of `tv4` is meant to be updated to use the logical domain of the index tensor. This is currently done manually as shown [here](https://github.com/NVIDIA/Fuser/pull/4742/files#diff-a50219bc583905a766ab511e0af91ba8af96a821a93bb19f20d4b550c18a9f5cR53-R54).~ Edit: I decided to do this from the beginning as part of the `TensorDomain` constructor.

At the time of lowering, once the validation passes, a new lowering pass, `setInplaceAlias`, modifies the allocation nodes of the scatter input and output so that the output becomes an alias of the input (except when the output is also a fusion output, in which case the input becomes an alias of the output). I initially considered extending the existing memory reuse pass but decided to add a new, separate pass for simplicity. Once the aliasing is done, the rest is just a matter of minor adjustments here and there.

With this PR, `ComputeArgSort` can be manually scheduled as shown [here](https://github.com/NVIDIA/Fuser/pull/4742/files#diff-e116aa2fb290f929889bb62f657b591fea932461e103a2073db6e75fbb45f6c4R231-R247). Similarly, `ComputeProblemSizes` can also be lowered when the index size is 1, since in that case there's no accumulation. That should correspond to the decode pass.

Note that this PR does not support scatter with multi-dimensional tensors. In PyTorch scatter, non-indexed dimensions are not guaranteed to have the same extents across all the tensors, so there's no ID mapping, meaning there's no indexing path. I think we should represent this as a separate resize op, but that is not yet done.

#4764 is a follow-up PR extending this to the accumulation case. More thorough testing as well as actual automatic scheduling support will be done in future PRs.
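To make the out-of-place IR semantics concrete, here is a minimal sketch in the style of nvFuser's C++ tests (includes omitted; `makeContigTensor` and the tensor names are illustrative, not taken from the PR's test):

```cpp
// Minimal sketch, not the PR's actual test: a 1D scatter expressed
// out-of-place in the Fusion IR. At lowering, the setInplaceAlias pass
// described above makes the output share the input's allocation.
Fusion fusion;
FusionGuard fg(&fusion);

auto tv0 = makeContigTensor(1, DataType::Int); // data to scatter into
auto tv1 = makeContigTensor(1, DataType::Int); // index tensor
auto tv3 = makeContigTensor(1, DataType::Int); // source values
fusion.addInput(tv0);
fusion.addInput(tv1);
fusion.addInput(tv3);

auto tv2 = set(tv0);                  // scatter input; consumed only by the scatter
auto tv4 = scatter(tv2, 0, tv1, tv3); // out-of-place in the IR: tv2 != tv4
fusion.addOutput(tv4);
```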
!test
!test --diff
KernelExecutor ke;
ke.compile(&fusion, {t0});

GTEST_SKIP() << "Missing predication. Fix pending: "
I'll update this part once #5107 is done.
LGTM~
🚢
switch (sop->accumulateOp()) {
  case BinaryOpType::Add:
    if (sop->in()->dtype() == DataType::Int) {
      // atomicAdd does not provide an overload for int64_t
😮💨
Out of curiosity, do you happen to know why atomicAdd only has a uint64_t overload but not an int64_t one, while the max/min versions have both?
The programming guide discusses only floating-point types... https://docs.nvidia.com/cuda/cuda-c-programming-guide/#atomicadd
No idea.
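For reference, a common workaround (not necessarily the exact helper this PR generates) is to route the signed 64-bit add through the `unsigned long long` overload; two's-complement addition produces the same bit pattern for signed and unsigned operands:

```cuda
// Sketch of the usual cast-based workaround, shown here only for context.
__device__ int64_t atomicAddInt64(int64_t* address, int64_t val) {
  auto old = atomicAdd(
      reinterpret_cast<unsigned long long*>(address),
      static_cast<unsigned long long>(val));
  return static_cast<int64_t>(old);
}
```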
tv2->setMemoryType(MemoryType::Shared);
tv2->setAllocationDomain(tv2->getLogicalDomain(), true);
tv4_cache->setMemoryType(MemoryType::Shared);
In this scheduling, we do atomic writes to shared memory and then write the result to global memory afterwards.
For my own curiosity: I think we could also skip the cacheBefore on tv4 and rely on atomicAdd directly to global memory, and that should still work?!
Yes, that should work too.
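For illustration only (this is not nvFuser-generated code), the two strategies discussed above could look roughly like this for a 1D scatter-add, assuming int64_t data and a single CTA for the shared-memory variant:

```cuda
// (a) Accumulate into shared memory with atomics, then write out once.
//     Assumes a single CTA and that `out` already holds the scatter input
//     values (matching the in-place aliasing described in the PR).
__global__ void scatterAddViaSmem(
    int64_t* out, const int64_t* idx, const int64_t* src, int n, int out_n) {
  extern __shared__ int64_t smem[];
  for (int i = threadIdx.x; i < out_n; i += blockDim.x) {
    smem[i] = 0;
  }
  __syncthreads();
  for (int i = threadIdx.x; i < n; i += blockDim.x) {
    atomicAdd(
        reinterpret_cast<unsigned long long*>(&smem[idx[i]]),
        static_cast<unsigned long long>(src[i]));
  }
  __syncthreads();
  for (int i = threadIdx.x; i < out_n; i += blockDim.x) {
    out[i] += smem[i];
  }
}

// (b) Skip the shared-memory staging and atomically add straight to global
//     memory, at the cost of one global atomic per element.
__global__ void scatterAddDirect(
    int64_t* out, const int64_t* idx, const int64_t* src, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) {
    atomicAdd(
        reinterpret_cast<unsigned long long*>(&out[idx[i]]),
        static_cast<unsigned long long>(src[i]));
  }
}
```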
!test
Adds an optional accumulate parameter to `ScatterOp` so that it can be used both with and without accumulation. I'll look into consolidating `IndexPutAccumulateOp` as well in the future.
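For context, a hypothetical usage sketch, assuming the accumulate op is exposed as a `BinaryOpType` argument on the scatter frontend (the exact signature added by this PR is not shown in this excerpt):

```cpp
// Hypothetical signature for illustration; the actual frontend API may differ.
// Scatter src (tv3) into tv2 along dim 0 at positions tv1, accumulating with Add.
auto tv4 = scatter(tv2, /*dim=*/0, tv1, tv3, BinaryOpType::Add);
```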