-
Notifications
You must be signed in to change notification settings - Fork 308
coll: Circulant graph queued variable ring bcast algorithm #7539
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
| static int all_blocks(int r, int r_, int s, int e, int k, int* buffer, struct sched_args_t* args); | ||
| static void gen_rsched(int r, int* buffer, struct sched_args_t* args); | ||
| static void gen_ssched(int r, struct sched_args_t* args); | ||
|
|
||
| static int get_baseblock(int r, struct sched_args_t* args); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please use more descriptive variable names to make this easier to understand. aside from iterator variables (i,j,k), IMO we should avoid single letter naming.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree in general. In this case, the single letter variables align with single variable variables in the paper. The algorithm is complex, meaning readers will likely need to reference the paper to fully understand it, and changing the variable names may make this more difficult. Shall I change them anyways?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think short parameter names in local static functions can be relaxed if -
- the names are consistent throughout the file. For example, if
rhere consistently refers to rank and only refers to rank. - To help, one can add a comment block listing common local variable names and its purposes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I could live with a comment explaining the names. r_ in combination with r particularly offends me but I'll wait for the explainer before getting too riled up 😅.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added variable glossary comments to each of the helper functions and changed r_ to r_prime to avoid riling up @raffenet 😄
This algorithm achieves better performance than existing algorithms for both small and large message sizes. The algorithms is based on the circulant graph abstraction and Jesper Larsson Traff's recent paper: https://dl.acm.org/doi/full/10.1145/3735139. It creates communication schedules around various rings in the circulant graph, then repeats the schedule to pipeline message chunks. We introduce a FIFO queue for overlapping sends and receives across communication rounds, which particularly benefits small messages. In the graph below, we show the algorithm's performance for a fixed chunk size (256k) and queue length (24) for various scales on ANL Aurora (N, PPN). The baseline for this graph is the best-performing algorithm currently in MPICH, so all speedups represent improvements over all algorithms currently in the library. We note that the performance drops around our selected chunk size (256k). By tuning the chunk size near this message size, it is possible to achieve a speedup across all message sizes for all scales.
This algorithm achieves better performance than existing bcast algorithms for both small and large message sizes.
The algorithm is based on the circulant graph abstraction and Jesper Larsson Traff's recent paper: https://dl.acm.org/doi/full/10.1145/3735139. It creates communication schedules around various rings in the circulant graph, then repeats the schedule to pipeline message chunks. We introduce a FIFO queue for overlapping sends and receives across communication rounds, which particularly benefits small messages.
In the graph below, we show the algorithm's performance for a fixed chunk size (256k) and queue length (24) for various scales on ANL Aurora (N, PPN). The baseline for this graph is the best-performing algorithm currently in MPICH, so all speedups represent improvements over all algorithms currently in the library. We note that the performance drops around our selected chunk size (256k). By tuning the chunk size near this message size, it is possible to achieve a speedup across all message sizes for all scales.

Pull Request Description
Author Checklist
Particularly focus on why, not what. Reference background, issues, test failures, xfail entries, etc.
Commits are self-contained and do not do two things at once.
Commit message is of the form:
module: short descriptionCommit message explains what's in the commit.
Whitespace checker. Warnings test. Additional tests via comments.
For non-Argonne authors, check contribution agreement.
If necessary, request an explicit comment from your companies PR approval manager.