Meeting 22.11

roy-sc edited this page Nov 22, 2021 · 12 revisions

Algorithm Selection

Allreduce flavors

allreduce
allreduce_rabenseifner
allreduce_native_ring
allreduce_native_basic_linear
allreduce_native_rabenseifner
allreduce_native_nonoverlapping
allreduce_native_recursive_doubling
allreduce_native_segmented_ring
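
To make the ring flavor concrete, here is a minimal single-process sketch of a ring allreduce (reduce-scatter followed by allgather) using a sum reduction. It is an illustration only and not the implementation behind `allreduce_native_ring`; the function name `ring_allreduce` and the list-of-lists representation of the ranks are assumptions made for this example, and no MPI calls are involved.

```python
def ring_allreduce(vectors):
    """Simulate a sum allreduce over p 'ranks' arranged in a ring.

    vectors[r] is the local input of rank r (hypothetical representation).
    Returns the per-rank buffers after the allreduce.
    """
    p = len(vectors)
    n = len(vectors[0])
    # Split the index range into p nearly equal chunks.
    bounds = [(i * n) // p for i in range(p + 1)]
    data = [list(v) for v in vectors]

    def chunk(c):
        return range(bounds[c], bounds[c + 1])

    # Phase 1, reduce-scatter: in step s, rank r sends chunk (r - s) mod p
    # to rank r + 1, which adds it to its own copy of that chunk.
    for s in range(p - 1):
        # Collect all messages first so the sends of one step are simultaneous.
        msgs = []
        for r in range(p):
            c = (r - s) % p
            msgs.append(((r + 1) % p, c, [data[r][i] for i in chunk(c)]))
        for dst, c, payload in msgs:
            for i, v in zip(chunk(c), payload):
                data[dst][i] += v

    # Now rank r holds the fully reduced chunk (r + 1) mod p.
    # Phase 2, allgather: in step s, rank r forwards chunk (r + 1 - s) mod p
    # to rank r + 1, which overwrites its copy with the reduced values.
    for s in range(p - 1):
        msgs = []
        for r in range(p):
            c = (r + 1 - s) % p
            msgs.append(((r + 1) % p, c, [data[r][i] for i in chunk(c)]))
        for dst, c, payload in msgs:
            for i, v in zip(chunk(c), payload):
                data[dst][i] = v

    return data
```

Each rank sends and receives 2(p - 1) chunks of roughly n/p elements, which is why the ring variant tends to suit large vectors better than small ones.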

Allgather flavors

allgather
allgather-async
to be added: native variations

New Ideas

Know your topology and optimize based on that
How do other HPC applications handle that?
Refine cost model
Implement asynchronous versions of the promising algorithms
Generalized Rabenseifner
Hierarchical computing?
Optimize the outer product (Eigen? BLAS? Manual implementation [vectorization, loop unrolling, cache considerations])
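
As a possible starting point for refining the cost model, the sketch below evaluates the common latency-bandwidth-compute model for three allreduce algorithms on a power-of-two number of ranks. The function name `allreduce_costs` and all parameter values are assumptions for illustration: `alpha` is the per-message latency, `beta` the transfer time per byte, and `gamma` the reduction time per byte. It ignores topology, contention, and non-power-of-two rank counts.

```python
def allreduce_costs(p, n_bytes, alpha, beta, gamma):
    """Estimated allreduce time for p ranks (power of two) and n_bytes of data.

    alpha: latency per message, beta: transfer time per byte,
    gamma: reduction time per byte (all hypothetical machine parameters).
    """
    if p < 2 or p & (p - 1):
        raise ValueError("this sketch assumes p is a power of two, p >= 2")
    lg = p.bit_length() - 1          # exact log2(p) for a power of two
    frac = (p - 1) / p
    n = n_bytes
    return {
        # log2(p) rounds, each exchanging and reducing the full vector.
        "recursive_doubling": lg * (alpha + n * beta + n * gamma),
        # Recursive-halving reduce-scatter + recursive-doubling allgather.
        "rabenseifner": 2 * lg * alpha + 2 * frac * n * beta + frac * n * gamma,
        # Ring reduce-scatter + ring allgather: 2(p - 1) small messages.
        "ring": 2 * (p - 1) * alpha + 2 * frac * n * beta + frac * n * gamma,
    }
```

Under this model, recursive doubling wins for small messages (fewest latency terms), while Rabenseifner and ring share the same bandwidth term and differ only in latency, so measured differences between them would point to effects the model does not capture.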

Todo / Questions

Does MPI_ALLREDUCE compute while receiving? (allreduce-ring is close to allreduce)
Adjust vector sizes (1000 to 8000? In increments of 500?)
What are the expected {vector sizes, number of nodes}?
Are only powers of 2 okay?
What did Saleh mean by asynchronous computation? Sending chunks and operating on them as they arrive (pipelined)?
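
One possible reading of the pipelined idea, sketched in a single process: the incoming vector arrives in fixed-size chunks, and each chunk is reduced into the local buffer as soon as it is available instead of waiting for the full message. This is only a guess at what was meant; the names `iter_chunks` and `pipelined_reduce` and the `chunk_len` parameter are hypothetical. In a real MPI version the loop body would presumably overlap with nonblocking receives (for example `MPI_Irecv` and `MPI_Wait`), which this sketch does not model.

```python
def iter_chunks(vec, chunk_len):
    """Yield (start_index, chunk) pairs; stands in for chunks arriving over the network."""
    for start in range(0, len(vec), chunk_len):
        yield start, vec[start:start + chunk_len]


def pipelined_reduce(local, incoming, chunk_len):
    """Sum `incoming` into a copy of `local`, one chunk at a time."""
    if chunk_len < 1:
        raise ValueError("chunk_len must be positive")
    result = list(local)
    for start, payload in iter_chunks(incoming, chunk_len):
        # Reduce this chunk immediately; with real communication, the next
        # chunk could already be in flight while this loop runs.
        for offset, value in enumerate(payload):
            result[start + offset] += value
    return result
```

The result is identical to reducing the whole vector at once; any benefit would come purely from overlapping communication with computation, so the chunk size becomes another parameter to tune.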