
Conversation

@naoyam (Collaborator) commented Jul 10, 2025

Adds an optional accumulate parameter to ScatterOp so that it can be used both with and without accumulation.

I'll look into consolidating IndexPutAccumulateOp as well in the future.
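For context, a hedged sketch of how the extended op might be used in a fusion definition; the tensor names and the exact form of the accumulate argument are assumptions for illustration, not taken verbatim from this PR:

```
// Plain out-of-place scatter (existing behavior): values in tv3 overwrite
// the elements of tv2 selected along dim 0 by the indices in tv1.
auto tv4 = scatter(tv2, /*dim=*/0, /*index=*/tv1, /*src=*/tv3);

// Hypothetical accumulating variant: colliding indices combine their
// values with the given binary op instead of overwriting each other.
auto tv5 = scatter(tv2, /*dim=*/0, /*index=*/tv1, /*src=*/tv3, BinaryOpType::Add);
```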

github-actions bot commented Jul 10, 2025

Review updated until commit 3411324

Description

  • Added optional accumulation to ScatterOp

  • Replaced indexPutAccumulate with scatter

  • Implemented atomic accumulation in codegen

  • Updated tests for scatter accumulation


Changes walkthrough 📝

Relevant files

| Category    | File             | Summary                                     | Lines   |
| ----------- | ---------------- | ------------------------------------------- | ------- |
| Enhancement | codegen.cpp      | Add scatter accumulation codegen            | +67/-7  |
| Enhancement | index.cpp        | Update ScatterOp lowering with accumulate   | +2/-2   |
| Enhancement | nodes.cpp        | Implement ScatterOp accumulation logic      | +51/-12 |
| Enhancement | indexing.cpp     | Update scatter to support accumulation      | +16/-12 |
| Enhancement | test_moe.cpp     | Replace indexPutAccumulate with scatter     | +27/-8  |
| Enhancement | internal_nodes.h | Update ScatterOp interface for accumulation | +9/-4   |
| Enhancement | indexing.h       | Update scatter API for accumulation         | +2/-9   |
| Cleanup     | type.cpp         | Remove ScatterOpType output operator        | +0/-7   |
| Cleanup     | type.h           | Remove unused ScatterOpType enum            | +0/-3   |
| Tests       | test_scatter.cpp | Add scatter accumulation tests              | +89/-0  |

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

🧪 PR contains tests
⚡ Recommended focus areas for review

Possible Issue

The non-deterministic check ANDs the floating-point type check with a disjunction that is always true: the accumulate op can never equal both Max and Min, so `!= Max || != Min` holds for every op (the intent was presumably `&&`). This may lead to incorrect validation of allowed operations.

```
const bool non_deterministic = isFloatingPointType(sop->src()->dtype()) &&
    (sop->accumulateOp() != BinaryOpType::Max ||
     sop->accumulateOp() != BinaryOpType::Min);
```
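A hedged sketch of the corrected condition, assuming the intent is to treat only order-dependent (non-Max/Min) accumulation of floating-point values as non-deterministic:

```
// Max/Min accumulation is order-independent, so only the remaining
// floating-point accumulate ops (e.g., Add) are flagged.
const bool non_deterministic = isFloatingPointType(sop->src()->dtype()) &&
    sop->accumulateOp() != BinaryOpType::Max &&
    sop->accumulateOp() != BinaryOpType::Min;
```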
Possible Issue

The evaluation path for scatter with accumulation does not support scalar src inputs, but the code allows scalar inputs in the non-accumulate case. This creates an inconsistency in supported functionality.

Performance Concern

The atomicAdd implementation for int64_t uses reinterpret_cast to unsigned long long, which may have undefined behavior if the value is negative. This could lead to incorrect accumulation results.

code_ << "atomicAdd("
      << "reinterpret_cast<unsigned long long*>(&" << dst << "), "
      << "static_cast<unsigned long long>(" << src << "));\n";

@naoyam (Collaborator, Author) commented Jul 10, 2025

!test

@naoyam naoyam mentioned this pull request Jul 10, 2025
naoyam added a commit that referenced this pull request Aug 8, 2025
This PR introduces preliminary support for lowering the scatter
operation rather than falling back to ATen. The primary motivation is to
generate a single fused kernel for fusions like
[SgLangMoeTest.ComputeArgSort](https://github.com/NVIDIA/Fuser/blob/main/tests/cpp/test_moe.cpp#L155-L197).

It is not yet piped through FusionExecutorCache, so nothing should be
impacted as long as nvFuser is used through FusionExecutorCache.

Scatter is inherently in-place, which doesn't mix well with the overall
semantics of the Fusion IR. Here, from the user's perspective, scatter
is provided as an out-of-place operation, as shown below:

```
auto tv4 = scatter(tv2, 0, tv1, tv3);
```

https://github.com/NVIDIA/Fuser/pull/4742/files#diff-a50219bc583905a766ab511e0af91ba8af96a821a93bb19f20d4b550c18a9f5cR49

Here, `tv2` and `tv4` are different tensors in the fusion. The user is
free to use `tv2` and `tv4` separately in the fusion. However, when
generating a CUDA kernel, we would want to implement the operation as an
in-place operation, so at the time of lowering, it is
[validated](https://github.com/NVIDIA/Fuser/pull/4742/files#diff-b8542908e49882b02549144d87bbf19225a253305e26a0f7ea1665a05cfc30f4R1338-R1342)
such that the scatter input is only used by the scatter operation
itself. This restriction should be enforced by the fusion segmenter.
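For illustration, a hedged sketch of what that use restriction means in terms of the example above (tensor names are assumptions, not taken verbatim from the PR):

```
// OK: tv2 has no other consumer, so tv4 can safely alias tv2's buffer.
auto tv4 = scatter(tv2, 0, tv1, tv3);
auto tv5 = add(tv4, tv4);

// Not lowerable as a single kernel: another consumer of tv2 would observe
// the in-place update, so validation rejects it and the segmenter is
// expected to split the fusion instead.
// auto tv6 = add(tv2, tv2);
```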

~Before lowering, the loop domain of `tv4` is meant to be updated to use
the logical domain of the index tensor. This is currently manually done
as shown
[here](https://github.com/NVIDIA/Fuser/pull/4742/files#diff-a50219bc583905a766ab511e0af91ba8af96a821a93bb19f20d4b550c18a9f5cR53-R54).~
Edit: I decided to do this from the beginning as part of the
`TensorDomain` constructor.

At the time of lowering, once the validation passes, a new lowering
pass, `setInplaceAlias`, modifies the allocation nodes of the scatter
input and output such that the output becomes an alias of the input
(except when the output is also a fusion output, in which case the input
becomes an alias of the output). I initially considered extending the
existing memory reuse pass but decided to add a new separate pass for
simplicity.

Once the aliasing is done, the rest is just a matter of minor
adjustments here and there.

With this PR, `ComputeArgSort` can be manually scheduled as shown
[here](https://github.com/NVIDIA/Fuser/pull/4742/files#diff-e116aa2fb290f929889bb62f657b591fea932461e103a2073db6e75fbb45f6c4R231-R247).
Similarly, `ComputeProblemSizes` can also be lowered when the index
size is 1, since in that case there's no accumulation; that should
correspond to the decode pass.

Note that this PR does not support scatter with multi-dimensional
tensors. This is because in PyTorch scatter, non-indexed dimensions are
not guaranteed to have the same extents between all the tensors, so
there's no ID mapping, meaning there's no indexing path. I think we
should represent this as a separate resize op, but that is not done yet.

#4764 is a follow-up PR to extend this for the accumulation case.

More thorough testing as well as actual automatic scheduling support
should be done in future PRs.
Base automatically changed from scatter to main August 8, 2025 18:33
@naoyam (Collaborator, Author) commented Sep 3, 2025

!test

@naoyam (Collaborator, Author) commented Sep 3, 2025

!test --diff

@naoyam naoyam marked this pull request as ready for review September 3, 2025 22:29
@naoyam naoyam requested a review from jjsjann123 September 3, 2025 22:29
```
KernelExecutor ke;
ke.compile(&fusion, {t0});

GTEST_SKIP() << "Missing predication. Fix pending: "
```
naoyam (Collaborator, Author):

I'll update this part once #5107 is done.

@jjsjann123 (Collaborator) left a comment

LGTM~

🚢

```
switch (sop->accumulateOp()) {
  case BinaryOpType::Add:
    if (sop->in()->dtype() == DataType::Int) {
      // atomicAdd does not provide an overload for int64_t
```
Collaborator:

😮‍💨

Collaborator:

Out of curiosity, do you happen to know why atomicAdd only has an overload for uint64_t (unsigned long long) but not for int64_t, while the max/min versions have one?

The programming guide discusses only floating-point types... https://docs.nvidia.com/cuda/cuda-c-programming-guide/#atomicadd

naoyam (Collaborator, Author):

No idea.


```
tv2->setMemoryType(MemoryType::Shared);
tv2->setAllocationDomain(tv2->getLogicalDomain(), true);
tv4_cache->setMemoryType(MemoryType::Shared);
```
Collaborator:

In this scheduling, we do atomic writes to shared memory and then write to global memory afterwards.

Out of my own curiosity: I think we could also skip the cacheBefore on tv4 and rely on atomicAdd going directly to global memory, and that should still work?!

naoyam (Collaborator, Author):

Yes, that should work too.

@naoyam (Collaborator, Author) commented Sep 5, 2025

!test

@naoyam naoyam merged commit 6048003 into main Sep 6, 2025
51 of 52 checks passed
@naoyam naoyam deleted the scatter-accumulate branch September 6, 2025 01:36