
Improve signal/wait performance and fix barrier issue #499


Merged · 14 commits merged into main · Apr 16, 2025

Conversation

Binyang2014 (Contributor) commented Apr 11, 2025

Remove __assert_fail for release builds. This reduces the number of PTX instructions inside the loop from 8 to 6, and also tries to resolve the issue reported in #497.
With 8 ranks, signal/wait latency drops from 3.2 us to 2.8 us on NDv5.
Also, the NDEBUG flag is confusing here: sometimes it is not set, so use a customized flag for debug builds.
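As a rough sketch of the intent (the macro names here are illustrative, not the exact flags added in this PR):

    // Hypothetical guard: device-side asserts compile in only for an explicit
    // debug build, instead of depending on NDEBUG being set correctly.
    #if defined(MSCCLPP_DEBUG_BUILD)
    #include <cassert>
    #define MSCCLPP_ASSERT(cond) assert(cond)
    #else
    // Release build: expands to nothing, so no __assert_fail PTX in hot loops.
    #define MSCCLPP_ASSERT(cond) ((void)0)
    #endif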

Here is the current PTX:

      ld.u64  %rd12, [%rd2+-24];
      mov.u64         %rd13, %rd12;
      mov.u64         %rd11, %rd13;
      ld.acquire.sys.b64 %rd10,[%rd11];
      setp.lt.u64     %p1, %rd10, %rd3;
      @%p1 bra        $L__BB2_1;

If we change the load to asm volatile("ld.global.acquire.sys.b64 %0, [%1];" : "=l"(flag) : "l"(flag_addr));, the loop shrinks to 4 instructions, and 8-rank signal/wait drops to 2.1 us:

        ld.u64  %rd9, [%rd1+-24];
        ld.global.acquire.sys.b64 %rd8, [%rd9];
        setp.lt.u64     %p1, %rd8, %rd2;
        @%p1 bra        $L__BB2_1;
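
For context, a minimal sketch of how that inline load could sit in a wait loop (waitFlag, flag_addr, and expected are illustrative names, assuming a 64-bit flag in global memory):

    #include <cstdint>

    // Hypothetical spin-wait built around the inline-PTX acquire load above.
    __device__ void waitFlag(uint64_t* flag_addr, uint64_t expected) {
      uint64_t flag;
      do {
        // One acquire load straight from global memory, avoiding the extra
        // register moves in the generic codegen.
        asm volatile("ld.global.acquire.sys.b64 %0, [%1];" : "=l"(flag) : "l"(flag_addr));
      } while (flag < expected);
    }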

Binyang2014 changed the title from Binyli/signal to Improve signal/wait performance and fix barrier issue · Apr 11, 2025
Binyang2014 mentioned this pull request · Apr 11, 2025
Binyang2014 marked this pull request as ready for review · April 11, 2025 23:53
Binyang2014 requested review from chhwang and Copilot · April 11, 2025 23:54

Copilot AI left a comment


Copilot reviewed 5 out of 6 changed files in this pull request and generated 1 comment.

Files not reviewed (1)
  • CMakeLists.txt: Language not supported

@liangyuRain

Hi @Binyang2014, the pull request looks good to me. However, I noticed that changing NDEBUG to DEBUG_BUILD causes a compile error for any assert in my program. It looks like when building mscclpp with pip install, there are still macros using NDEBUG. I am cherry-picking the commits on top of 0f21ed4 given #496.

I am unable to benchmark the code. I am curious whether a memoryOrderRelaxed atomicLoad is faster than a memoryOrderAcquire one. If so, maybe it is more efficient to use a memoryOrderRelaxed atomicLoad in POLL_MAYBE_JAILBREAK and have an acquire fence only once after the loop.

POLL_MAYBE_JAILBREAK((atomicLoad<unsigned int, scopeDevice>(&flag_, memoryOrderAcquire) != tmp), maxSpinCount);
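
A sketch of that suggestion, assuming the same atomicLoad helper and a raw PTX fence for the acquire (fence.acq_rel.sys here is my guess at the right scope, not code from this PR):

    // Hypothetical: relaxed loads inside the poll loop, then one acquire-strength
    // fence after the loop to order everything that follows.
    POLL_MAYBE_JAILBREAK((atomicLoad<unsigned int, scopeDevice>(&flag_, memoryOrderRelaxed) != tmp),
                         maxSpinCount);
    asm volatile("fence.acq_rel.sys;" ::: "memory");  // pay for ordering once, after the spin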

Binyang2014 (Contributor, Author) commented Apr 14, 2025

> Hi @Binyang2014, the pull request looks good to me. However, I noticed that changing NDEBUG to DEBUG_BUILD causes a compile error for any assert in my program. It looks like when building mscclpp with pip install, there are still macros using NDEBUG. I am cherry-picking the commits on top of 0f21ed4 given #496.
>
> I am unable to benchmark the code. I am curious whether a memoryOrderRelaxed atomicLoad is faster than a memoryOrderAcquire one. If so, maybe it is more efficient to use a memoryOrderRelaxed atomicLoad in POLL_MAYBE_JAILBREAK and have an acquire fence only once after the loop.
>
>     POLL_MAYBE_JAILBREAK((atomicLoad<unsigned int, scopeDevice>(&flag_, memoryOrderAcquire) != tmp), maxSpinCount);

Yes, it's faster. But I am not sure about the correctness in this case. If we get the flag via a memoryOrderRelaxed atomicLoad, can the system guarantee that a following memoryOrderAcquire atomicLoad will see the same value? I can't find any documentation on this.

Also, I don't see any compile issue with pip install; are you using the mscclpp container image?


liangyuRain commented Apr 14, 2025

I am suggesting something similar to #497: we keep the memoryOrderRelaxed atomicStore and atomicLoad, but add fences before and after. For the CUDA documentation, please refer to #497 (comment), especially item 3 of 8.8. Basically, if one thread writes a value that is read by another thread, and there are fences before the write and after the read, then the two threads have a release-acquire relation even if the read and write are relaxed. We should establish a release-acquire relation between any two threadblocks participating in the DeviceSyncer. I think we should probably also add fence2 to #497, like the following:

fence_acq_rel_gpu(); // fence1
unsigned int tmp = preFlag_ ^ 1;
if (atomicInc(&count_, maxOldCnt) == maxOldCnt) {
  fence_acq_rel_gpu(); // fence2
  atomicStore(&flag_, tmp, memoryOrderRelaxed);
} else {
  POLL_MAYBE_JAILBREAK((atomicLoad(&flag_, memoryOrderRelaxed) != tmp), maxSpinCount);
}
preFlag_ = tmp;
fence_acq_rel_gpu(); // fence3

Suppose the threadblocks reach the atomicInc in the order 1, ..., n. Then from any block x to any block y, the release-acquire relation has two cases:

  • x<y:
    • block x fence1 + atomicInc release -> block y atomicInc + fence3 acquire
  • x>y:
    • block x fence1 + atomicInc release -> block n atomicInc + fence2 acquire
    • block n fence2 + atomicStore release -> block y atomicLoad + fence3 acquire

And of course, this only establishes release-acquire between thread 0 of all blocks. The __syncthreads before and after all of this extends the release-acquire relation to any two threads of all blocks.
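
A minimal two-thread illustration of that acquire pattern (item 3 of 8.8), using the same hypothetical helpers as the snippet above:

    // Writer thread:
    data = 42;                                   // plain write
    fence_acq_rel_gpu();                         // fence before the relaxed write (release side)
    atomicStore(&flag, 1u, memoryOrderRelaxed);  // relaxed store observed by the reader

    // Reader thread:
    while (atomicLoad(&flag, memoryOrderRelaxed) != 1u) {}  // relaxed load in a spin
    fence_acq_rel_gpu();                         // fence after the relaxed read (acquire side)
    // Per item 3 of 8.8, writer and reader now have a release-acquire
    // relation, so data == 42 is guaranteed to be visible here.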


chhwang commented Apr 14, 2025

> I am suggesting something similar to #497: we keep the memoryOrderRelaxed atomicStore and atomicLoad, but add fences before and after. […]

atomicInc is only for picking the last block. I don't think we need a release-acquire relation there.

@liangyuRain

Looks good to me. The release atomicFetchAdd and the acquire atomicLoad, both on the same memory location, can establish a mutual release-acquire relation between threadblocks. Please try replacing the memoryOrderAcquire atomicLoad in the poll loop with a memoryOrderRelaxed one and adding an acquire fence after the poll loop ends, if this brings a performance improvement. Correctness is guaranteed by the acquire pattern in item 3 of 8.8.

@Binyang2014

> Looks good to me. The release atomicFetchAdd and the acquire atomicLoad, both on the same memory location, can establish a mutual release-acquire relation between threadblocks. Please try replacing the memoryOrderAcquire atomicLoad in the poll loop with a memoryOrderRelaxed one and adding an acquire fence after the poll loop ends, if this brings a performance improvement. Correctness is guaranteed by the acquire pattern in item 3 of 8.8.

Thanks @liangyuRain, I tried with asm volatile("fence.gpu;"); and the performance was even worse. Only a spin loop with memoryOrderRelaxed followed by a memoryOrderAcquire atomic load shows the performance gain, but I am not sure about the correctness.
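
For reference, a sketch of the variant that did show the gain (relaxed polling, then a single acquire load once the flag flips; whether that re-read is sufficient for ordering is exactly the open question above):

    // Hypothetical: spin with relaxed loads, then one acquire load afterwards.
    POLL_MAYBE_JAILBREAK((atomicLoad<unsigned int, scopeDevice>(&flag_, memoryOrderRelaxed) != tmp),
                         maxSpinCount);
    (void)atomicLoad<unsigned int, scopeDevice>(&flag_, memoryOrderAcquire);  // ordering-only re-read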

@Binyang2014

/azp run


Azure Pipelines successfully started running 3 pipeline(s).

@Binyang2014

/azp run


Azure Pipelines successfully started running 3 pipeline(s).

Binyang2014 merged commit e412804 into main · Apr 16, 2025
14 of 25 checks passed
Binyang2014 deleted the binyli/signal branch · April 16, 2025 21:22