Lowering scatter accumulate #4764
Conversation
!test
This PR introduces preliminary support for lowering the scatter operation instead of falling back to ATen. The primary motivation is to generate a single fused kernel for fusions like [SgLangMoeTest.ComputeArgSort](https://github.com/NVIDIA/Fuser/blob/main/tests/cpp/test_moe.cpp#L155-L197). It is not yet piped through FusionExecutorCache, so nothing should be impacted as long as nvFuser is used through FusionExecutorCache.

Scatter is inherently in-place, which doesn't mix well with the overall semantics of the Fusion IR. From the user's perspective, scatter is provided as an out-of-place operation, like below:

```cpp
auto tv4 = scatter(tv2, 0, tv1, tv3);
```

https://github.com/NVIDIA/Fuser/pull/4742/files#diff-a50219bc583905a766ab511e0af91ba8af96a821a93bb19f20d4b550c18a9f5cR49

Here, `tv2` and `tv4` are different tensors in the fusion, and the user is free to use them separately. However, when generating a CUDA kernel, we want to implement the operation in-place, so at the time of lowering it is [validated](https://github.com/NVIDIA/Fuser/pull/4742/files#diff-b8542908e49882b02549144d87bbf19225a253305e26a0f7ea1665a05cfc30f4R1338-R1342) that the scatter input is only used by the scatter operation itself. This restriction should be enforced by the fusion segmenter.

~Before lowering, the loop domain of `tv4` is meant to be updated to use the logical domain of the index tensor. This is currently done manually as shown [here](https://github.com/NVIDIA/Fuser/pull/4742/files#diff-a50219bc583905a766ab511e0af91ba8af96a821a93bb19f20d4b550c18a9f5cR53-R54).~ Edit: I decided to do this from the beginning as part of the `TensorDomain` constructor.

At the time of lowering, once the validation passes, a new lowering pass, `setInplaceAlias`, modifies the allocation nodes of the scatter input and output so that the output becomes an alias of the input (except when the output is also a fusion output, in which case the input becomes an alias of the output). I initially considered extending the existing memory reuse pass but decided to add a new, separate pass for simplicity. Once the aliasing is done, the rest is just a matter of minor adjustments here and there.

With this PR, `ComputeArgSort` can be manually scheduled as shown [here](https://github.com/NVIDIA/Fuser/pull/4742/files#diff-e116aa2fb290f929889bb62f657b591fea932461e103a2073db6e75fbb45f6c4R231-R247). Similarly, `ComputeProblemSizes` can also be lowered when the index size is 1, since in that case there's no accumulation. That should correspond to the decode pass.

Note that this PR does not support scatter with multi-dimensional tensors. In PyTorch scatter, non-indexed dimensions are not guaranteed to have the same extents across all the tensors, so there's no ID mapping, meaning there's no indexing path. I think we should represent this as a separate resize op, but that is not yet done.

#4764 is a follow-up PR extending this to the accumulation case. More thorough testing as well as actual automatic scheduling support will be done in future PRs.
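To make the out-of-place IR semantics concrete, here is a minimal sketch in the style of nvFuser's C++ tests (includes omitted; `makeContigTensor` and the tensor names are illustrative, not taken from the PR's test):

```cpp
// Minimal sketch, not the PR's actual test: a 1D scatter expressed
// out-of-place in the Fusion IR. At lowering, the setInplaceAlias pass
// described above makes the output share the input's allocation.
Fusion fusion;
FusionGuard fg(&fusion);

auto tv0 = makeContigTensor(1, DataType::Int); // data to scatter into
auto tv1 = makeContigTensor(1, DataType::Int); // index tensor
auto tv3 = makeContigTensor(1, DataType::Int); // source values
fusion.addInput(tv0);
fusion.addInput(tv1);
fusion.addInput(tv3);

auto tv2 = set(tv0);                  // scatter input; consumed only by the scatter
auto tv4 = scatter(tv2, 0, tv1, tv3); // out-of-place in the IR: tv2 != tv4
fusion.addOutput(tv4);
```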
!test
!test --diff
KernelExecutor ke;
ke.compile(&fusion, {t0});

GTEST_SKIP() << "Missing predication. Fix pending: "
I'll update this part once #5107 is done.
LGTM~
🚢
switch (sop->accumulateOp()) {
  case BinaryOpType::Add:
    if (sop->in()->dtype() == DataType::Int) {
      // atomicAdd does not provide an overload for int64_t
😮💨
Out of curiosity, do you happen to know why atomicAdd only has a uint64_t overload but not an int64_t one, while the max/min versions have both?
The programming guide discusses only floating-point types... https://docs.nvidia.com/cuda/cuda-c-programming-guide/#atomicadd
No idea.
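For reference, a common workaround (not necessarily the exact helper this PR generates) is to route the signed 64-bit add through the `unsigned long long` overload; two's-complement addition produces the same bit pattern for signed and unsigned operands:

```cuda
// Sketch of the usual cast-based workaround, shown here only for context.
__device__ int64_t atomicAddInt64(int64_t* address, int64_t val) {
  auto old = atomicAdd(
      reinterpret_cast<unsigned long long*>(address),
      static_cast<unsigned long long>(val));
  return static_cast<int64_t>(old);
}
```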
tv2->setMemoryType(MemoryType::Shared);
tv2->setAllocationDomain(tv2->getLogicalDomain(), true);
tv4_cache->setMemoryType(MemoryType::Shared);
In this scheduling, we do atomic writes to shared memory and then write the result to global memory afterwards.
For my own curiosity: I think we could also skip the cacheBefore on tv4 and rely on atomicAdd directly to global memory, and that should still work?!
Yes, that should work too.
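For illustration only (this is not nvFuser-generated code), the two strategies discussed above could look roughly like this for a 1D scatter-add, assuming int64_t data and a single CTA for the shared-memory variant:

```cuda
// (a) Accumulate into shared memory with atomics, then write out once.
//     Assumes a single CTA and that `out` already holds the scatter input
//     values (matching the in-place aliasing described in the PR).
__global__ void scatterAddViaSmem(
    int64_t* out, const int64_t* idx, const int64_t* src, int n, int out_n) {
  extern __shared__ int64_t smem[];
  for (int i = threadIdx.x; i < out_n; i += blockDim.x) {
    smem[i] = 0;
  }
  __syncthreads();
  for (int i = threadIdx.x; i < n; i += blockDim.x) {
    atomicAdd(
        reinterpret_cast<unsigned long long*>(&smem[idx[i]]),
        static_cast<unsigned long long>(src[i]));
  }
  __syncthreads();
  for (int i = threadIdx.x; i < out_n; i += blockDim.x) {
    out[i] += smem[i];
  }
}

// (b) Skip the shared-memory staging and atomically add straight to global
//     memory, at the cost of one global atomic per element.
__global__ void scatterAddDirect(
    int64_t* out, const int64_t* idx, const int64_t* src, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) {
    atomicAdd(
        reinterpret_cast<unsigned long long*>(&out[idx[i]]),
        static_cast<unsigned long long>(src[i]));
  }
}
```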
!test
Adds an optional accumulate parameter to `ScatterOp` so that it can be used both with and without accumulation. I'll look into consolidating `IndexPutAccumulateOp` as well in the future.
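For context, a hypothetical usage sketch, assuming the accumulate op is exposed as a `BinaryOpType` argument on the scatter frontend (the exact signature added by this PR is not shown in this excerpt):

```cpp
// Hypothetical signature for illustration; the actual frontend API may differ.
// Scatter src (tv3) into tv2 along dim 0 at positions tv1, accumulating with Add.
auto tv4 = scatter(tv2, /*dim=*/0, tv1, tv3, BinaryOpType::Add);
```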