
Improve signal/wait performance and fix barrier issue #499


Merged · 14 commits merged into main · Apr 16, 2025

Conversation

Binyang2014 (Contributor) commented Apr 11, 2025

Remove __assert_fail for release builds. This reduces the number of PTX instructions inside the loop from 8 to 6, and also tries to resolve the issue reported in #497.
With 8 ranks, signal/wait latency drops from 3.2 us to 2.8 us on NDv5.
Also, the NDEBUG flag is confusing here: sometimes it is not set, so use a customized flag for debug builds.
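As a rough sketch of the intent (the macro names here are illustrative, not the exact flags added in this PR):

    // Hypothetical guard: device-side asserts compile in only for an explicit
    // debug build, instead of depending on NDEBUG being set correctly.
    #if defined(MSCCLPP_DEBUG_BUILD)
    #include <cassert>
    #define MSCCLPP_ASSERT(cond) assert(cond)
    #else
    // Release build: expands to nothing, so no __assert_fail PTX in hot loops.
    #define MSCCLPP_ASSERT(cond) ((void)0)
    #endif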

Here is the current PTX:

      ld.u64  %rd12, [%rd2+-24];
      mov.u64         %rd13, %rd12;
      mov.u64         %rd11, %rd13;
      ld.acquire.sys.b64 %rd10,[%rd11];
      setp.lt.u64     %p1, %rd10, %rd3;
      @%p1 bra        $L__BB2_1;

If we change the load to asm volatile("ld.global.acquire.sys.b64 %0, [%1];" : "=l"(flag) : "l"(flag_addr));, the loop shrinks to 4 instructions, and 8-rank signal/wait drops to 2.1 us:

        ld.u64  %rd9, [%rd1+-24];
        ld.global.acquire.sys.b64 %rd8, [%rd9];
        setp.lt.u64     %p1, %rd8, %rd2;
        @%p1 bra        $L__BB2_1;
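
For context, a minimal sketch of how that inline load could sit in a wait loop (waitFlag, flag_addr, and expected are illustrative names, assuming a 64-bit flag in global memory):

    #include <cstdint>

    // Hypothetical spin-wait built around the inline-PTX acquire load above.
    __device__ void waitFlag(uint64_t* flag_addr, uint64_t expected) {
      uint64_t flag;
      do {
        // One acquire load straight from global memory, avoiding the extra
        // register moves in the generic codegen.
        asm volatile("ld.global.acquire.sys.b64 %0, [%1];" : "=l"(flag) : "l"(flag_addr));
      } while (flag < expected);
    }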

Binyang2014 changed the title from Binyli/signal to Improve signal/wait performance and fix barrier issue · Apr 11, 2025
Binyang2014 mentioned this pull request · Apr 11, 2025
Binyang2014 marked this pull request as ready for review · April 11, 2025 23:53
Binyang2014 requested review from chhwang and Copilot · April 11, 2025 23:54

Copilot AI left a comment


Copilot reviewed 5 out of 6 changed files in this pull request and generated 1 comment.

Files not reviewed (1)
  • CMakeLists.txt: Language not supported

@liangyuRain

Hi @Binyang2014, the pull request looks good to me. However, I noticed that changing NDEBUG to DEBUG_BUILD causes a compile error for any assert in my program. It looks like when building mscclpp with pip install, there are still macros using NDEBUG. I am cherry-picking the commits on top of 0f21ed4 given #496.

I am unable to benchmark the code. I am curious whether a memoryOrderRelaxed atomicLoad is faster than a memoryOrderAcquire one. If so, maybe it is more efficient to use a memoryOrderRelaxed atomicLoad in POLL_MAYBE_JAILBREAK and have an acquire fence only once after the loop.

POLL_MAYBE_JAILBREAK((atomicLoad<unsigned int, scopeDevice>(&flag_, memoryOrderAcquire) != tmp), maxSpinCount);
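
A sketch of that suggestion, assuming the same atomicLoad helper and a raw PTX fence for the acquire (fence.acq_rel.sys here is my guess at the right scope, not code from this PR):

    // Hypothetical: relaxed loads inside the poll loop, then one acquire-strength
    // fence after the loop to order everything that follows.
    POLL_MAYBE_JAILBREAK((atomicLoad<unsigned int, scopeDevice>(&flag_, memoryOrderRelaxed) != tmp),
                         maxSpinCount);
    asm volatile("fence.acq_rel.sys;" ::: "memory");  // pay for ordering once, after the spin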

Binyang2014 (Contributor, Author) commented Apr 14, 2025

> Hi @Binyang2014, the pull request looks good to me. However, I noticed that changing NDEBUG to DEBUG_BUILD causes a compile error for any assert in my program. It looks like when building mscclpp with pip install, there are still macros using NDEBUG. I am cherry-picking the commits on top of 0f21ed4 given #496.
>
> I am unable to benchmark the code. I am curious whether a memoryOrderRelaxed atomicLoad is faster than a memoryOrderAcquire one. If so, maybe it is more efficient to use a memoryOrderRelaxed atomicLoad in POLL_MAYBE_JAILBREAK and have an acquire fence only once after the loop.
>
>     POLL_MAYBE_JAILBREAK((atomicLoad<unsigned int, scopeDevice>(&flag_, memoryOrderAcquire) != tmp), maxSpinCount);

Yes, it's faster. But I am not sure about the correctness in this case. If we get the flag via a memoryOrderRelaxed atomicLoad, can the system guarantee that a following memoryOrderAcquire atomicLoad will see the same value? I can't find any documentation on this.

Also, I don't see any compile issue with pip install; are you using the mscclpp container image?


liangyuRain commented Apr 14, 2025

I am suggesting something similar to #497: we keep the memoryOrderRelaxed atomicStore and atomicLoad, but add fences before and after. For the CUDA documentation, please refer to #497 (comment), especially item 3 of 8.8. Basically, if one thread writes a value that is read by another thread, and there are fences before the write and after the read, then the two threads have a release-acquire relation even if the read and write are relaxed. We should establish a release-acquire relation between any two threadblocks participating in the DeviceSyncer. I think we should probably also add fence2 to #497, like the following:

fence_acq_rel_gpu(); // fence1
unsigned int tmp = preFlag_ ^ 1;
if (atomicInc(&count_, maxOldCnt) == maxOldCnt) {
  fence_acq_rel_gpu(); // fence2
  atomicStore(&flag_, tmp, memoryOrderRelaxed);
} else {
  POLL_MAYBE_JAILBREAK((atomicLoad(&flag_, memoryOrderRelaxed) != tmp), maxSpinCount);
}
preFlag_ = tmp;
fence_acq_rel_gpu(); // fence3

Suppose the threadblocks reach the atomicInc in the order 1, ..., n. Then from any block x to any block y, the release-acquire relation has two cases:

  • x<y:
    • block x fence1 + atomicInc release -> block y atomicInc + fence3 acquire
  • x>y:
    • block x fence1 + atomicInc release -> block n atomicInc + fence2 acquire
    • block n fence2 + atomicStore release -> block y atomicLoad + fence3 acquire

And of course, this only establishes release-acquire between thread 0 of all blocks. The __syncthreads before and after all of this extends the release-acquire relation to any two threads of all blocks.
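
A minimal two-thread illustration of that acquire pattern (item 3 of 8.8), using the same hypothetical helpers as the snippet above:

    // Writer thread:
    data = 42;                                   // plain write
    fence_acq_rel_gpu();                         // fence before the relaxed write (release side)
    atomicStore(&flag, 1u, memoryOrderRelaxed);  // relaxed store observed by the reader

    // Reader thread:
    while (atomicLoad(&flag, memoryOrderRelaxed) != 1u) {}  // relaxed load in a spin
    fence_acq_rel_gpu();                         // fence after the relaxed read (acquire side)
    // Per item 3 of 8.8, writer and reader now have a release-acquire
    // relation, so data == 42 is guaranteed to be visible here.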


chhwang commented Apr 14, 2025

> I am suggesting something similar to #497: we keep the memoryOrderRelaxed atomicStore and atomicLoad, but add fences before and after. […]

atomicInc is only for picking the last block. I don't think we need a release-acquire relation there.

@liangyuRain

Looks good to me. The release atomicFetchAdd and the acquire atomicLoad, both on the same memory location, can establish a mutual release-acquire relation between threadblocks. Please try replacing the memoryOrderAcquire atomicLoad in the poll loop with a memoryOrderRelaxed one and adding an acquire fence after the poll loop ends, if this brings a performance improvement. Correctness is guaranteed by the acquire pattern in item 3 of 8.8.

@Binyang2014

> Looks good to me. The release atomicFetchAdd and the acquire atomicLoad, both on the same memory location, can establish a mutual release-acquire relation between threadblocks. Please try replacing the memoryOrderAcquire atomicLoad in the poll loop with a memoryOrderRelaxed one and adding an acquire fence after the poll loop ends, if this brings a performance improvement. Correctness is guaranteed by the acquire pattern in item 3 of 8.8.

Thanks @liangyuRain, I tried with asm volatile("fence.gpu;"); and the performance was even worse. Only a spin loop with memoryOrderRelaxed followed by a memoryOrderAcquire atomic load shows the performance gain, but I am not sure about the correctness.
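
For reference, a sketch of the variant that did show the gain (relaxed polling, then a single acquire load once the flag flips; whether that re-read is sufficient for ordering is exactly the open question above):

    // Hypothetical: spin with relaxed loads, then one acquire load afterwards.
    POLL_MAYBE_JAILBREAK((atomicLoad<unsigned int, scopeDevice>(&flag_, memoryOrderRelaxed) != tmp),
                         maxSpinCount);
    (void)atomicLoad<unsigned int, scopeDevice>(&flag_, memoryOrderAcquire);  // ordering-only re-read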

@Binyang2014

/azp run


Azure Pipelines successfully started running 3 pipeline(s).

@Binyang2014

/azp run


Azure Pipelines successfully started running 3 pipeline(s).

Binyang2014 merged commit e412804 into main · Apr 16, 2025
14 of 25 checks passed
Binyang2014 deleted the binyli/signal branch · April 16, 2025 21:22