@rithwiktom
Contributor

Pull Request Description

This PR updates and adds descriptions for several MPICH CVARs.

Author Checklist

  • Provide Description
    Particularly focus on why, not what. Reference background, issues, test failures, xfail entries, etc.
  • Commits Follow Good Practice
    Commits are self-contained and do not do two things at once.
    Commit message is of the form: module: short description
    Commit message explains what's in the commit.
  • Passes All Tests
    Whitespace checker. Warnings test. Additional tests via comments.
  • Contribution Agreement
    For non-Argonne authors, check contribution agreement.
    If necessary, request an explicit comment from your company's PR approval manager.

@rithwiktom rithwiktom marked this pull request as ready for review August 6, 2025 18:18
`MPIR_CVAR_ENABLE_YAKSA_REDUCTION = 0`; this enables the fallback path
(host-based) for reduction.

* `MPIR_CVAR_YAKSA_REDUCTION_THRESHOLD`: This CVAR determines the threshold to
Contributor Author

Is this accurate? @hzhou

Contributor

If the message size is smaller than MPIR_CVAR_YAKSA_REDUCTION_THRESHOLD, reduction collectives pass the GPU data directly to the reduction algorithms, assuming the internal yaksa engine can operate on GPU data directly, potentially using a GPU kernel. If the message size is larger than MPIR_CVAR_YAKSA_REDUCTION_THRESHOLD, reduction collectives always pack the GPU data into host memory before passing it on to the reduction algorithms. The motivation for this CVAR is that the current reduction algorithms are optimized for host memory and underperform with large GPU messages.
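
The threshold rule described above can be sketched in C roughly as follows. This is only an illustration: the helper name `pack_to_host_before_reduce` and its byte-valued parameters are hypothetical and are not MPICH identifiers. The comment does not say what happens when the message size equals the threshold, so treating that case as the direct path is an assumption.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical helper (not an MPICH function).
 *
 * Returns true when the GPU buffer should be packed into host memory
 * before the reduction algorithm runs, and false when the GPU data can
 * be handed to the reduction algorithm directly.
 *
 * Assumptions: both sizes are in bytes, and a message exactly at the
 * threshold takes the direct path.
 */
static bool pack_to_host_before_reduce(size_t msg_bytes, size_t threshold_bytes)
{
    return msg_bytes > threshold_bytes;
}
```

With a threshold of 4096, for example, `pack_to_host_before_reduce(1024, 4096)` selects the direct path and `pack_to_host_before_reduce(65536, 4096)` selects the host-packing path.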


### 2.7. Fallback Behavior for Collective Algorithm

MPICH will fallback if the selected algorithm is not applicable to the
Contributor

Can we improve this description of when fallback occurs? I think we should state that fallback can occur for user-specified selections as well as for default selections from the .json configurations and CVAR overrides. I think it is slightly counter-intuitive that even if you force a particular algorithm with a CVAR, you may still fall back, and that should be clear here.
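
One way to picture the point being requested is the C sketch below. All names in it (`selection_source_t`, `algo_choice_t`, `resolve_algorithm`) are hypothetical and do not come from MPICH; the applicability check is reduced to a boolean because the actual restrictions are not listed in this conversation. It only illustrates that the origin of a selection does not exempt it from the applicability check.

```c
#include <stdbool.h>

/* Hypothetical types for illustration only; not MPICH identifiers. */
typedef enum {
    SRC_JSON_DEFAULT,   /* algorithm chosen by a .json configuration */
    SRC_CVAR_FORCED     /* algorithm forced by the user through a CVAR */
} selection_source_t;

typedef enum {
    ALGO_REQUESTED,     /* run the algorithm that was selected */
    ALGO_FALLBACK       /* run the fallback algorithm instead */
} algo_choice_t;

/*
 * Decide what actually runs. The source of the selection is deliberately
 * ignored: a forced selection falls back just like a default one when the
 * requested algorithm is not applicable to the call.
 */
static algo_choice_t resolve_algorithm(selection_source_t source,
                                       bool requested_is_applicable)
{
    (void) source;
    return requested_is_applicable ? ALGO_REQUESTED : ALGO_FALLBACK;
}
```

In this sketch, `resolve_algorithm(SRC_CVAR_FORCED, false)` still yields `ALGO_FALLBACK`, which is the counter-intuitive case the documentation should spell out.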
