@hzhou hzhou commented Aug 13, 2025

Pull Request Description

  • Adapt p2p tests to multi-pair tests. Simply launch with more processes to run more pairs. For example:
    mpiexec -n 8 ./p2p_bw.c

This runs bandwidth tests on 4 pairs of consecutive even/odd ranks (0-1, 2-3, 4-5, 6-7).

  • Add p2p_self tests. They are comparable to a memcpy bandwidth test.

  • Rotate buffers in p2p tests. This avoids undesired caching effects, especially for small messages.

  • Adjust window size in bandwidth tests. This makes large-message measurements more accurate and faster.

[skip warnings]

Author Checklist

  • Provide Description
    Particularly focus on why, not what. Reference background, issues, test failures, xfail entries, etc.
  • Commits Follow Good Practice
    Commits are self-contained and do not do two things at once.
    Commit message is of the form: module: short description
    Commit message explains what's in the commit.
  • Passes All Tests
    Whitespace checker. Warnings test. Additional tests via comments.
  • Contribution Agreement
    For non-Argonne authors, check contribution agreement.
    If necessary, request an explicit comment from your company's PR approval manager.

@hzhou hzhou force-pushed the 2508_bench branch 3 times, most recently from 52c723a to 51a11c1 on August 13, 2025 14:37
When p2p bench tests are launched with more than 2 procs, form even/odd
pairs and perform pair-wise tests.

This patch adapts p2p_one for now.
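
The even/odd pairing described above can be sketched as a small rank-mapping helper. This is only an illustration, not the code in the patch: the name `get_pair_role`, the `struct pair_role` layout, the choice of the even rank as the sending side, and the handling of a leftover rank when the process count is odd are all assumptions.

```c
#include <stdbool.h>

/* Hypothetical helper type: one rank's role in an even/odd pairing. */
struct pair_role {
    int partner;     /* rank of the peer, or -1 if this rank has no peer */
    int pair_id;     /* ranks 2k and 2k+1 form pair k */
    bool is_sender;  /* assumption: the even rank drives the send side */
};

/* Map a rank to its pair, given the total number of processes. */
static struct pair_role get_pair_role(int rank, int nprocs)
{
    struct pair_role role;

    role.pair_id = rank / 2;
    role.is_sender = (rank % 2 == 0);
    role.partner = role.is_sender ? rank + 1 : rank - 1;

    /* Assumption: with an odd process count the last rank sits out. */
    if (role.partner >= nprocs) {
        role.partner = -1;
    }
    return role;
}
```

In an MPI program, `rank` and `nprocs` would come from `MPI_Comm_rank` and `MPI_Comm_size`, and each rank would then exchange messages only with `role.partner`.
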
hzhou added 6 commits August 18, 2025 14:21
Allocate a single global buffer. Benchmarks rotate through the
global buffer to discount the effect of memory caching for small
messages. Bandwidth measurements use multiple buffers within the
global buffer, which is a more correct usage pattern.
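
One way to picture the rotation is as a slot index that wraps around the global buffer. The sketch below is an assumption about how such an offset could be computed, not the patch's implementation; `rotate_offset` and its parameters are hypothetical names.

```c
#include <stddef.h>

/*
 * Hypothetical helper: byte offset into a global buffer of pool_size
 * bytes for the message used at iteration iter. Consecutive iterations
 * touch different slots, so a small message is less likely to be served
 * from cache on every iteration.
 */
static size_t rotate_offset(size_t pool_size, size_t msg_size, size_t iter)
{
    size_t nslots;

    /* Degenerate cases: nothing to rotate, always use the start. */
    if (msg_size == 0 || msg_size > pool_size) {
        return 0;
    }

    nslots = pool_size / msg_size;      /* whole messages that fit */
    return (iter % nslots) * msg_size;  /* wrap around at the end */
}
```

A send at iteration `i` would then use `(char *) global_buf + rotate_offset(pool_size, msg_size, i)` as its buffer, where `global_buf` is the hypothetical single allocation.
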
There is no need to batch a window for large-message bandwidth
measurements, because completing multiple large messages inside a single
progress poll is not realistic. Rather, having multiple concurrent
large messages in progress may reduce bandwidth if the underlying
algorithm, such as pipelining, overwhelms progress.

Reduce the window size to 1 for large messages. This also makes
the benchmark complete faster, so we can go up to a larger MAX_BUFSIZE.
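
The window-size adjustment amounts to a size-dependent choice like the one sketched below. The commit message does not state the cutoff or the default window, so `LARGE_MSG_THRESHOLD`, `DEFAULT_WINDOW`, and `window_size_for` are hypothetical names with placeholder values.

```c
#include <stddef.h>

/* Placeholder values, not taken from the patch. */
#define DEFAULT_WINDOW      64
#define LARGE_MSG_THRESHOLD ((size_t) 64 * 1024)

/*
 * Hypothetical helper: number of messages posted back to back before
 * waiting for completion. Small messages are batched in a window;
 * large messages are sent one at a time.
 */
static int window_size_for(size_t msg_size)
{
    return (msg_size >= LARGE_MSG_THRESHOLD) ? 1 : DEFAULT_WINDOW;
}
```

With a window of 1, each timed iteration moves exactly one message, so the iteration count for large sizes can stay small without batching effects skewing the result.
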
Similar to p2p_one and p2p_bw, add self_one and self_bw, which test
sending self messages within a single process. Self tests are useful
for measuring local memcpy bandwidth.
Add a fast path for the case where the memory is from a GPU.
Add more collective latency tests.
Rename bcast.def to coll_latency.def to add multiple collective latency
tests. Just add allreduce for now.