Improve signal/wait performance and fix barrier issue #499
Conversation
Copilot reviewed 5 out of 6 changed files in this pull request and generated 1 comment.
Files not reviewed (1)
- CMakeLists.txt: Language not supported
Hi @Binyang2014, the pull request looks good to me. However, I noticed a compile issue, so I am unable to benchmark the code. I am curious whether a memoryOrderRelaxed atomicLoad is faster than a memoryOrderAcquire one. If so, maybe it is more efficient to use a memoryOrderRelaxed atomicLoad in the poll loop.
Yes, it's faster. But I am not sure about the correctness in this case. If we get the flag via a memoryOrderRelaxed atomicLoad, can the system guarantee that a following memoryOrderAcquire atomicLoad will see the same value? I can't find any documentation for this. Also, I don't see any compile issue with pip install; are you using the mscclpp container image?
I am suggesting something similar to #497: we keep the memoryOrderRelaxed atomicStore and atomicLoad, but add fences before and after. For the documentation from CUDA, please refer to #497 (comment), especially item 3 of 8.8. Basically, if one thread writes a value that is read by another thread, and there are fences before the write and after the read, then the two threads have a release-acquire relation even if the read and write themselves are relaxed. We should establish a release-acquire relation between any two threadblocks participating in the DeviceSyncer. I think we should probably also add fence2 to #497, like the following:

```cpp
fence_acq_rel_gpu();  // fence1
unsigned int tmp = preFlag_ ^ 1;
if (atomicInc(&count_, maxOldCnt) == maxOldCnt) {
  fence_acq_rel_gpu();  // fence2
  atomicStore(&flag_, tmp, memoryOrderRelaxed);
} else {
  POLL_MAYBE_JAILBREAK((atomicLoad(&flag_, memoryOrderRelaxed) != tmp), maxSpinCount);
}
preFlag_ = tmp;
fence_acq_rel_gpu();  // fence3
```

Suppose the threadblocks reach the atomicInc in order 1, ..., n. Then from any block x to block y, the release-acquire relation has two cases:
And of course, this only establishes the release-acquire relation between thread 0 of all blocks. The __syncthreads before and after all of this extends the release-acquire relation to any two threads of all blocks.
Looks good to me. The release atomicFetchAdd and the acquire atomicLoad, both on the same memory location, can establish a mutual release-acquire relation between threadblocks. Please try replacing the memoryOrderAcquire atomicLoad in the poll loop with a memoryOrderRelaxed one and adding an acquire fence after the poll loop ends, if this brings a performance improvement. Correctness is guaranteed by item 3 of the 8.8 acquire pattern.
Thanks @liangyuRain, I tried with
/azp run |
Azure Pipelines successfully started running 3 pipeline(s). |
/azp run |
Azure Pipelines successfully started running 3 pipeline(s). |
Remove __assert_fail for the release build. This reduces the number of PTX instructions inside the loop, from 8 to 6, and also tries to resolve the issue reported in #497.
8-rank signal/wait improves from 3.2 us to 2.8 us on NDv5.
Also, the NDEBUG flag is confusing here: sometimes it is not set. Use a customized flag for the debug build.
Here is the current PTX:
If we change to

```cpp
asm volatile("ld.global.acquire.sys.b64 %0, [%1];" : "=l"(flag) : "l"(flag_addr));
```

it reduces to 4 instructions. We can get 2.1 us for 8-rank signal/wait.