
@wfaderhold21 (Collaborator) commented Mar 17, 2025

What

Switch from using a pSync array with atomic increments to the TL/UCP barrier for synchronization.

Why?

There are multiple reasons for this switch: the knomial barrier scales better and outperforms the atomic-increment scheme (see the numbers below), and, once PR #1070 is merged, it allows this algorithm to be used with memory handles.
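For context, here is a minimal sketch of the old pSync-style scheme at the UCP level (function and variable names are illustrative assumptions, not this codebase's): each rank posts one network atomic add per peer and then polls its own counter, so synchronization cost grows linearly with team size, whereas the knomial barrier completes in O(log n) rounds.

```c
#include <ucp/api/ucp.h>

static void req_free_cb(void *request, ucs_status_t status, void *user_data)
{
    ucp_request_free(request);
}

/* Old scheme, sketched: atomically bump every peer's pSync counter, then
 * spin until our own counter shows team_size signals -- O(n) network
 * atomics per rank per alltoall.  The PR replaces this kind of routine
 * with the TL/UCP knomial barrier (O(log n) rounds).  Error handling is
 * omitted for brevity. */
static void psync_signal_and_wait(ucp_worker_h worker, ucp_ep_h *eps,
                                  uint64_t *psync_raddrs, ucp_rkey_h *rkeys,
                                  volatile uint64_t *my_psync, int team_size)
{
    static const uint64_t one = 1;

    for (int peer = 0; peer < team_size; peer++) {   /* includes self-ep */
        ucp_request_param_t param = {
            .op_attr_mask = UCP_OP_ATTR_FIELD_CALLBACK |
                            UCP_OP_ATTR_FIELD_DATATYPE,
            .cb.send      = req_free_cb,
            .datatype     = ucp_dt_make_contig(sizeof(uint64_t))
        };
        ucs_status_ptr_t req =
            ucp_atomic_op_nbx(eps[peer], UCP_ATOMIC_OP_ADD, &one, 1,
                              psync_raddrs[peer], rkeys[peer], &param);
        (void)req;   /* NULL: completed inline; pointer: freed by callback */
    }

    while (*my_psync < (uint64_t)team_size) {
        ucp_worker_progress(worker);   /* poll until every peer signaled */
    }
}
```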

Node Bandwidth

Tested on Thor with 32 nodes, 1 PPN:

| Size (bytes) | This PR | Current Algorithm |
| ---: | ---: | ---: |
| 8 | 15.62 | 9.3 |
| 16 | 30.99 | 18.27 |
| 32 | 61.81 | 35.65 |
| 64 | 122.12 | 72.89 |
| 128 | 235.66 | 145.33 |
| 256 | 478.43 | 279.93 |
| 512 | 918.63 | 549.35 |
| 1024 | 1663.85 | 1022.45 |
| 2048 | 2919.6 | 1852.31 |
| 4096 | 4571.7 | 3151.6 |
| 8192 | 6527.59 | 4753.05 |
| 16384 | 8583.94 | 7749.33 |
| 32768 | 10381.33 | 9651.23 |
| 65536 | 11574.64 | 10882.64 |
| 131072 | 12039.43 | 11585.79 |
| 262144 | 12456.95 | 12065.97 |
| 524288 | 12151.77 | 12350.47 |
| 1048576 | 12785.3 | 12427.42 |

Tested on Thor with 32 nodes, 32 PPN:

| Size (bytes) | This PR | Current Algorithm |
| ---: | ---: | ---: |
| 8 | 93.92 | 37.46 |
| 16 | 188.45 | 72.11 |
| 32 | 380.5 | 110.44 |
| 64 | 754.7 | 301.35 |
| 128 | 1500.11 | 460.3 |
| 256 | 2190.99 | 917.62 |
| 512 | 4178.51 | 1982.53 |
| 1024 | 7749.3 | 2867.44 |
| 2048 | 9093.23 | 4287.41 |
| 4096 | 9529.68 | 7078.85 |
| 8192 | 9858.2 | 8398.87 |
| 16384 | 9615.04 | 9826.92 |
| 32768 | 10004.92 | 10975.51 |
| 65536 | 10021.39 | 11901.37 |
| 131072 | 11287.29 | 11982.72 |
| 262144 | 11635.63 | 11803.15 |
| 524288 | 11623.31 | 11894.45 |
| 1048576 | 11621.41 | 11991.38 |

@swx-jenkins3

Can one of the admins verify this patch?

@janjust (Collaborator) commented Mar 26, 2025

@wfaderhold21 didn't we say we were also going to change the test to reflect oshmem behavior?

Edit: never mind, I just realized it's the other PR.

@janjust (Collaborator) commented Apr 9, 2025

@wfaderhold21

> (2) there can be instances where processes leave the alltoall collective before remote writes have been completed.

We had a discussion during our code review, and if I recall correctly we concluded this is not the case in a two-sided model, correct? We still need the user to issue a flush on the symmetric heap. Please correct me if I misunderstood.

@wfaderhold21 (Collaborator, Author)

> @wfaderhold21 (2) there can be instances where processes leave the alltoall collective before remote writes have been completed. We had a discussion during our code review, and if I recall we concluded this is not the case in a 2-sided model, correct? We still need the user to issue a flush on the symmetric heap. Please correct me if I misunderstood.

@janjust This is correct. To ensure completion of writes to the remote processes, we need to issue a flush; a sketch of what that looks like at the UCP level follows.
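For illustration, here is a minimal sketch of such a flush, using a blocking wait for simplicity (the helper name and surrounding progress loop are assumptions, not the PR's actual code):

```c
#include <ucp/api/ucp.h>

/* Block until all outstanding operations on `ep` (e.g., the alltoall's
 * PUTs) are remotely complete.  Sketch only; names are illustrative. */
static ucs_status_t flush_ep(ucp_worker_h worker, ucp_ep_h ep)
{
    ucp_request_param_t param = { .op_attr_mask = 0 };
    void               *req   = ucp_ep_flush_nbx(ep, &param);

    if (req == NULL) {
        return UCS_OK;                    /* completed immediately */
    }
    if (UCS_PTR_IS_ERR(req)) {
        return UCS_PTR_STATUS(req);
    }

    ucs_status_t status;
    do {
        ucp_worker_progress(worker);      /* drive communication forward */
        status = ucp_request_check_status(req);
    } while (status == UCS_INPROGRESS);

    ucp_request_free(req);
    return status;
}
```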

@nsarka (Collaborator) commented Apr 10, 2025

> In order to ensure completion of writes to the remote processes, we need to issue a flush.

Does the flush become a no-op (or just unnecessary) if RC is used? I'm just wondering how the transport changes this requirement (if at all).

@wfaderhold21 (Collaborator, Author)

> > In order to ensure completion of writes to the remote processes, we need to issue a flush.
>
> Does the flush become a no-op (or just unnecessary) if RC is used? I'm just wondering how the transport changes this requirement (if at all)

I believe ordering should be maintained when using RC, so a flush is not strictly required in that sense: subsequent PUTs, sends, and AMOs will complete after the PUT. However, UCP returns success on a PUT as soon as the source buffer is ready for reuse; there is no guarantee that the PUT has completed at the remote target (e.g., with a buffered copy). See the sketch below.
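A hedged sketch of those semantics (names are illustrative, not from the PR): the PUT's completion callback only signals that the source buffer is reusable; remote visibility still requires a flush like the one sketched in the earlier comment.

```c
#include <ucp/api/ucp.h>

/* ucp_put_nbx completes "locally": a NULL return or a fired callback means
 * the source buffer may be reused, NOT that the data has arrived at the
 * target -- regardless of the transport's (e.g. RC) ordering guarantees. */
static void put_local_done(void *request, ucs_status_t status, void *user_data)
{
    /* Safe to reuse the source buffer here; the remote window may still be
     * stale (e.g., the transport buffered a copy). */
    ucp_request_free(request);
}

static void post_put(ucp_ep_h ep, const void *src, size_t len,
                     uint64_t remote_addr, ucp_rkey_h rkey)
{
    ucp_request_param_t param = {
        .op_attr_mask = UCP_OP_ATTR_FIELD_CALLBACK,
        .cb.send      = put_local_done
    };
    ucs_status_ptr_t req = ucp_put_nbx(ep, src, len, remote_addr, rkey,
                                       &param);
    if (UCS_PTR_IS_ERR(req)) {
        /* handle error */
    }
    /* NULL means immediate local completion; a flush (see the helper
     * sketched above) is still needed before peers may read remote_addr. */
}
```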

@wfaderhold21 (Collaborator, Author)

@janjust @Sergei-Lebedev @nsarka I believe this is finally in a reviewable state.

@wfaderhold21 wfaderhold21 changed the title TL/UCP: transition to barrier for sync for onesided a2a TL/UCP: congestion aovidant onesided a2a Sep 10, 2025
@wfaderhold21 wfaderhold21 changed the title TL/UCP: congestion aovidant onesided a2a TL/UCP: congestion avoidant onesided a2a Sep 10, 2025
@manjugv (Contributor) commented Sep 11, 2025

@Sergei-Lebedev Any comments on this? Let's push this out.

@wfaderhold21 (Collaborator, Author)

@janjust @Sergei-Lebedev @samnordmann @ikryukov I believe I have addressed the feedback in this PR. Please let me know if you have any other changes you wish to see.

@Sergei-Lebedev (Contributor)

ok to test

@dpressle (Collaborator) commented Oct 8, 2025

bot:retest

@Sergei-Lebedev (Contributor)

@wfaderhold21 seems like a relevant issue in CI:

```
14:59:58  ===== UCC MPI TEST INFO =======
14:59:58  seed:         24653
14:59:58  collectives:  Barrier, Bcast, Reduce, Allreduce, Allgather, Allgatherv, Alltoall, Alltoallv, Reduce_scatter, Reduce_scatterv, Gather, Gatherv, Scatter, Scatterv
14:59:58  data types:   int16, int32, int64, uint16, uint32, uint64, float32, float64, float64_complex
14:59:58  memory types: Host, Cuda
14:59:58  teams:        world, reverse, half, odd_even
15:00:44  FAILURE in: tc=Onesided Alltoall team=half mtype=host msgsize=512 persistent=0 local_registration=0 dt=int16
15:00:44  
15:00:44  ===== UCC MPI TEST REPORT =====
15:00:44  collective                 tests    passed    failed   skipped
15:00:44  Allgather                   1008       936         0        72
15:00:44  Allgatherv                  1008       936         0        72
15:00:44  Allreduce                   2240      2240         0         0
15:00:44  Alltoall                     756       702         1        54
15:00:44  Alltoallv                   3024      2376         0       648
15:00:44  Barrier                        4         4         0         0
15:00:44  Bcast                       1008      1008         0         0
15:00:44  Gather                      2016      1872         0       144
15:00:44  Gatherv                     2016      1872         0       144
15:00:44  Reduce                      4480      4480         0         0
15:00:44  Reduce_scatter              2240      1656         0       584
15:00:44  Reduce_scatterv             2240      1904         0       336
15:00:44  Scatter                     2016        32         0      1984
15:00:44  Scatterv                    2016      1872         0       144
15:00:44  
15:00:44  ===== UCC MPI TEST SUMMARY =====
15:00:44  total tests:  26072
15:00:44  passed:       21890
15:00:44  skipped:      4182
15:00:44  failed:       1
15:00:44  elapsed:      48s
```

@wfaderhold21 (Collaborator, Author)

@Sergei-Lebedev I believe the changes I made should fix this error and ensure correctness for the PUT-based algorithm; the fix updates the endpoint flush to use a completion callback (commit below).
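For reference, a hedged sketch of what a callback-driven endpoint flush can look like, in the spirit of the commit title below (the completion-flag bookkeeping is an assumption; the actual PR code may differ):

```c
#include <ucp/api/ucp.h>

/* Sketch: non-blocking ep flush whose completion is observed via callback,
 * so a collective task can keep making progress instead of spin-waiting. */
static void flush_done(void *request, ucs_status_t status, void *user_data)
{
    int *completed = (int *)user_data;   /* task-side completion flag */
    *completed = 1;
    ucp_request_free(request);
}

static ucs_status_t start_ep_flush(ucp_ep_h ep, int *completed)
{
    ucp_request_param_t param = {
        .op_attr_mask = UCP_OP_ATTR_FIELD_CALLBACK |
                        UCP_OP_ATTR_FIELD_USER_DATA,
        .cb.send      = flush_done,
        .user_data    = completed
    };
    ucs_status_ptr_t req = ucp_ep_flush_nbx(ep, &param);

    if (req == NULL) {
        *completed = 1;                  /* finished inline, no callback */
        return UCS_OK;
    }
    return UCS_PTR_IS_ERR(req) ? UCS_PTR_STATUS(req) : UCS_OK;
}
```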

TL/UCP: update ep flush with cb
@manjugv manjugv merged commit b3ea442 into openucx:master Oct 10, 2025
8 of 9 checks passed
