Lowering scatter accumulate #4764
@@ -59,10 +59,6 @@ class SgLangMoETest : public NVFuserFixtureParamTest<MoEConfig> {
 };
 
 TEST_P(SgLangMoETest, ComputeProblemSizes) {
-  if (manual_scheduling) {
-    GTEST_SKIP() << "No manual scheduling implemented";
-  }
-
   auto fusion_ptr = std::make_unique<Fusion>();
   Fusion& fusion = *fusion_ptr.get();
   FusionGuard fg(&fusion);

@@ -78,16 +74,39 @@ TEST_P(SgLangMoETest, ComputeProblemSizes) {
 
   auto tv3 = ones({IrBuilder::create<Val>(num_tokens * topk)}, DataType::Int);
 
-  auto tv4 = indexPutAccumulate(tv2, tv1, tv3);
+  auto tv4 = scatter(tv2, 0, tv1, tv3, BinaryOpType::Add);
 
   fusion.addOutput(tv4);
 
   auto options = at::TensorOptions().dtype(at::kLong).device(at::kCUDA, 0);
   auto t0 = at::randint(0, num_experts, {num_tokens, topk}, options);
 
-  FusionExecutorCache executor_cache(std::move(fusion_ptr));
-  auto outputs = executor_cache.runFusionWithInputs({t0});
-  testValidate(executor_cache.fusion(), outputs, {t0}, __LINE__, __FILE__);
+  if (manual_scheduling) {
+    auto tv4_cache = tv4->cacheBefore();
+
+    // Scheduling all tensors as 1D tensors
+    for (auto tv : fusion.allTvs()) {
+      tv->flatten();
+      tv->axis(0)->parallelize(ParallelType::TIDx);
+    }
+
+    tv2->setMemoryType(MemoryType::Shared);
+    tv2->setAllocationDomain(tv2->getLogicalDomain(), true);
+    tv4_cache->setMemoryType(MemoryType::Shared);
Review comment: In this scheduling, we are doing an atomic write to shared memory, and then writing to global memory afterwards. For my own curiosity, I think we could also skip the cacheBefore on tv4 and rely on atomicAdd directly to global memory, and that should still work?!
Reply: Yes, that should work too.
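For illustration, here is a minimal CUDA sketch of the two strategies discussed in this thread. This is not the nvFuser-generated kernel; the kernel and parameter names are made up, a single-block launch is assumed, and the global output is assumed to be zero-initialized (as tv2 is in the fusion).

#include <cstdint>

// Variant 1: atomics into a shared-memory staging buffer (the cacheBefore
// path), followed by a plain, non-atomic writeback to global memory.
// Launch with num_experts * sizeof(int64_t) bytes of dynamic shared memory.
__global__ void countPerExpertShared(
    const int64_t* index, int64_t* out, int num_experts, int num_indices) {
  extern __shared__ int64_t smem[];  // one counter per expert
  for (int i = threadIdx.x; i < num_experts; i += blockDim.x) {
    smem[i] = 0;
  }
  __syncthreads();
  for (int i = threadIdx.x; i < num_indices; i += blockDim.x) {
    // 64-bit integer atomicAdd is exposed via the unsigned long long overload.
    atomicAdd(reinterpret_cast<unsigned long long*>(&smem[index[i]]), 1ull);
  }
  __syncthreads();
  for (int i = threadIdx.x; i < num_experts; i += blockDim.x) {
    out[i] = smem[i];  // writeback to global memory
  }
}

// Variant 2: no shared-memory staging (no cacheBefore); the atomics resolve
// directly on the global-memory output.
__global__ void countPerExpertGlobal(
    const int64_t* index, int64_t* out, int num_indices) {
  for (int i = threadIdx.x; i < num_indices; i += blockDim.x) {
    atomicAdd(reinterpret_cast<unsigned long long*>(&out[index[i]]), 1ull);
  }
}

Variant 1 roughly mirrors the scheduling in this test (tv2 and the tv4 cache live in shared memory); variant 2 corresponds to skipping cacheBefore and letting the atomics contend in global memory. Both are expected to produce the same counts.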
+    tv4_cache->setAllocationDomain(tv4_cache->getLogicalDomain(), true);
+
+    KernelExecutor ke;
+    ke.compile(&fusion, {t0});
+
+    GTEST_SKIP() << "Missing predication. Fix pending: "
+                    "https://github.com/NVIDIA/Fuser/pull/5107";
Review comment: I'll update this part once #5107 is done.
+    auto outputs = ke.run({t0});
+    testValidate(&fusion, outputs, {t0}, __LINE__, __FILE__);
+  } else {
+    FusionExecutorCache executor_cache(std::move(fusion_ptr));
+    auto outputs = executor_cache.runFusionWithInputs({t0});
+    testValidate(executor_cache.fusion(), outputs, {t0}, __LINE__, __FILE__);
+  }
 }
 
 TEST_P(SgLangMoETest, ComputeExpertOffsets) {
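The key change in this hunk is replacing indexPutAccumulate with scatter(tv2, 0, tv1, tv3, BinaryOpType::Add). As a reference for what the test validates, here is a minimal host-side sketch of scatter-accumulate semantics. This is plain C++, not the nvFuser API; it assumes tv1 is the flattened [num_tokens * topk] tensor of expert ids and tv2 is a zero-initialized tensor with one entry per expert.

#include <cstdint>
#include <vector>

// Reference semantics of scatter along dim 0 with BinaryOpType::Add:
// out[index[i]] += src[i] for every i. With src = ones, this is a
// per-expert token count.
std::vector<int64_t> scatterAddReference(
    const std::vector<int64_t>& index,  // flattened expert ids
    const std::vector<int64_t>& src,    // all ones in this test
    int64_t num_experts) {
  std::vector<int64_t> out(num_experts, 0);  // assumed zero-initialized, like tv2
  for (size_t i = 0; i < index.size(); ++i) {
    out[index[i]] += src[i];
  }
  return out;
}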
Review comment: 😮‍💨
Review comment: Out of curiosity, do you happen to know the reason that atomicAdd only has a uint64_t overload but not one for int64_t, yet the max/min versions have both? The programming guide discusses only floating-point types: https://docs.nvidia.com/cuda/cuda-c-programming-guide/#atomicadd
Reply: No idea.
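For reference, the 64-bit integer overload of atomicAdd is declared as atomicAdd(unsigned long long*, unsigned long long). A common workaround for signed 64-bit accumulation is to reinterpret the pointer: addition is bit-identical for signed and unsigned two's-complement operands, so the result is value-preserving. A minimal sketch (the wrapper name is made up):

#include <cstdint>

// Hypothetical helper: signed 64-bit atomic add built on the
// unsigned long long overload of atomicAdd. Two's-complement addition
// produces the same bit pattern regardless of signedness, so the
// reinterpretation does not change the stored value.
__device__ int64_t atomicAddI64(int64_t* address, int64_t val) {
  unsigned long long old = atomicAdd(
      reinterpret_cast<unsigned long long*>(address),
      static_cast<unsigned long long>(val));
  return static_cast<int64_t>(old);  // like atomicAdd, returns the old value
}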