
Conversation

@mjwilkins18 (Contributor)

This algorithm achieves better performance than existing bcast algorithms for both small and large message sizes.

The algorithm is based on the circulant graph abstraction and Jesper Larsson Traff's recent paper: https://dl.acm.org/doi/full/10.1145/3735139. It creates communication schedules around various rings in the circulant graph, then repeats the schedule to pipeline message chunks. We introduce a FIFO queue for overlapping sends and receives across communication rounds, which particularly benefits small messages.
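To illustrate the queue idea, here is a minimal sketch (not the PR's implementation; req_fifo_t, fifo_push, and fifo_drain are hypothetical names): nonblocking requests from successive rounds are pushed into a bounded FIFO, and the oldest request is only completed once the queue is full, so up to QLEN operations stay in flight across rounds.

#include <mpi.h>

#define QLEN 24  /* queue length; 24 matches the setting in the results below */

typedef struct {
    MPI_Request reqs[QLEN];
    int head;   /* index of the oldest outstanding request */
    int count;  /* number of outstanding requests */
} req_fifo_t;

/* Enqueue a request; if the queue is full, retire the oldest one first
 * so completions happen in FIFO order. */
static void fifo_push(req_fifo_t *q, MPI_Request req)
{
    if (q->count == QLEN) {
        MPI_Wait(&q->reqs[q->head], MPI_STATUS_IGNORE);
        q->head = (q->head + 1) % QLEN;
        q->count--;
    }
    q->reqs[(q->head + q->count) % QLEN] = req;
    q->count++;
}

/* Complete all remaining requests in FIFO order. */
static void fifo_drain(req_fifo_t *q)
{
    while (q->count > 0) {
        MPI_Wait(&q->reqs[q->head], MPI_STATUS_IGNORE);
        q->head = (q->head + 1) % QLEN;
        q->count--;
    }
}

In each pipelined round one would post the round's MPI_Irecv and MPI_Isend, push both requests, and call fifo_drain after the last round so every transfer completes.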

In the graph below, we show the algorithm's performance for a fixed chunk size (256k) and queue length (24) at various scales on ANL Aurora (N nodes, PPN processes per node). The baseline for this graph is the best-performing algorithm currently in MPICH, so all speedups represent improvements over every algorithm currently in the library. We note that the performance drops around our selected chunk size (256k). By tuning the chunk size near this message size, it is possible to achieve a speedup across all message sizes at all scales.
[Figure: speedup over the best existing MPICH bcast algorithm vs. message size, across (N, PPN) scales on Aurora]
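One hypothetical way to express that tuning (the heuristic and choose_chunk_size are my assumptions, not the PR's logic): shrink the chunk when the total message is near the fixed chunk size, so the pipeline still has several chunks in flight at the point where the dip appears.

#include <stddef.h>

/* Hypothetical tuning heuristic, not taken from the PR: keep at least
 * `depth` chunks in the pipeline when the message is close to the
 * default chunk size (e.g. 256k), where the dip in the graph appears. */
static size_t choose_chunk_size(size_t msg_bytes, size_t default_chunk, size_t depth)
{
    if (msg_bytes <= default_chunk * depth)
        return (msg_bytes + depth - 1) / depth;  /* ceil(msg_bytes / depth) */
    return default_chunk;
}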

Pull Request Description

Author Checklist

  • Provide Description
    Particularly focus on why, not what. Reference background, issues, test failures, xfail entries, etc.
  • Commits Follow Good Practice
    Commits are self-contained and do not do two things at once.
    Commit message is of the form: module: short description
    Commit message explains what's in the commit.
  • Passes All Tests
    Whitespace checker. Warnings test. Additional tests via comments.
  • Contribution Agreement
    For non-Argonne authors, check contribution agreement.
If necessary, request an explicit comment from your company's PR approval manager.

Comment on lines 25 to 29
static int all_blocks(int r, int r_, int s, int e, int k, int* buffer, struct sched_args_t* args);
static void gen_rsched(int r, int* buffer, struct sched_args_t* args);
static void gen_ssched(int r, struct sched_args_t* args);

static int get_baseblock(int r, struct sched_args_t* args);
Contributor

Please use more descriptive variable names to make this easier to understand. Aside from iterator variables (i, j, k), IMO we should avoid single-letter names.

Contributor Author

I agree in general. In this case, the single-letter variables align with the notation used in the paper. The algorithm is complex, meaning readers will likely need to reference the paper to fully understand it, and changing the variable names may make that more difficult. Shall I change them anyway?

Contributor

I think short parameter names in local static functions can be relaxed if:

  • the names are consistent throughout the file. For example, if r here consistently refers to rank and only ever refers to rank.
  • To help, one can add a comment block listing the common local variable names and their purposes.

Contributor

I could live with a comment explaining the names. r_ in combination with r particularly offends me but I'll wait for the explainer before getting too riled up 😅.

Contributor Author

I added variable glossary comments to each of the helper functions and changed r_ to r_prime to avoid riling up @raffenet 😄
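Such a glossary block might look roughly like the sketch below (the definitions here are illustrative guesses on my part, not the PR's actual comments; only r referring to rank is confirmed in this thread, and the real meanings follow the paper's notation):

/* Variable glossary (illustrative example only):
 *   r       - rank of the calling process in the circulant graph
 *   r_prime - peer rank for the current round (previously r_)
 *   s, e    - start and end of the block range handled this round
 *   k       - block (chunk) index within the schedule
 */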
