Commit 3e871f0

Adding the circulant graph queued variable ring algorithm for Bcast.
This algorithm achieves better performance than existing algorithms for both small and large message sizes. The algorithm is based on the circulant graph abstraction and Jesper Larsson Traff's recent paper: https://dl.acm.org/doi/full/10.1145/3735139. It creates communication schedules around various rings in the circulant graph, then repeats the schedule to pipeline message chunks. We introduce a FIFO queue for overlapping sends and receives across communication rounds, which particularly benefits small messages. In the graph below, we show the algorithm's performance for a fixed chunk size (256k) and queue length (24) at various scales (N, PPN) on ANL Aurora. The baseline for this graph is the best-performing algorithm currently in MPICH, so every speedup shown is an improvement over all algorithms currently in the library. We note that performance drops for message sizes near our selected chunk size (256k). By tuning the chunk size for message sizes in that range, it is possible to achieve a speedup across all message sizes at all scales.
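
As a rough illustration of the ideas in this message (not the code added by this commit), the sketch below shows how a message might be split into pipeline chunks and how ring neighbors in a circulant graph can be derived from a list of skips. All function names are hypothetical, and the halving rule used to generate the skips is an assumption about the construction, not something this commit message confirms.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: number of chunks needed to pipeline a message.
 * Assumes chunk_bytes > 0; the last chunk may be shorter than the rest. */
static size_t num_chunks(size_t msg_bytes, size_t chunk_bytes)
{
    assert(chunk_bytes > 0);
    return (msg_bytes + chunk_bytes - 1) / chunk_bytes;
}

/* Hypothetical helper: fill skips[] by repeatedly halving p (rounding up)
 * until the skip reaches 1. Returns the number of skips written.
 * ASSUMPTION: this halving rule is illustrative only. */
static int circ_skips(int p, int *skips, int max_skips)
{
    int n = 0;
    int s = p;
    while (s > 1 && n < max_skips) {
        s = (s + 1) / 2;        /* ceil(s / 2) */
        skips[n++] = s;
    }
    return n;
}

/* In a circulant graph on p ranks, each skip defines a ring:
 * a rank sends to (rank + skip) mod p and receives from (rank - skip) mod p. */
static int circ_to(int rank, int skip, int p)
{
    return (rank + skip) % p;
}

static int circ_from(int rank, int skip, int p)
{
    return (rank - skip % p + p) % p;
}
```

For example, assuming "256k" means 256 KiB, `num_chunks(1u << 20, 256u * 1024u)` gives 4 chunks for a 1 MiB message, and `circ_skips(17, skips, 32)` writes the skips 9, 5, 3, 2, 1.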
1 parent 7fcdc20 commit 3e871f0

7 files changed: +463 -1 lines changed

src/mpi/coll/bcast/Makefile.mk

Lines changed: 1 addition & 0 deletions
@@ -13,6 +13,7 @@ mpi_core_sources += \
     src/mpi/coll/bcast/bcast_intra_binomial.c \
     src/mpi/coll/bcast/bcast_intra_scatter_recursive_doubling_allgather.c \
     src/mpi/coll/bcast/bcast_intra_scatter_ring_allgather.c \
+    src/mpi/coll/bcast/bcast_intra_circ_qvring.c \
     src/mpi/coll/bcast/bcast_intra_smp.c \
     src/mpi/coll/bcast/bcast_intra_tree.c \
     src/mpi/coll/bcast/bcast_intra_pipelined_tree.c \
