hzhou commented May 16, 2025

When an unexpected message arrives before fi_av_insert, the av_index is not yet available and is set to FI_ADDR_NOTAVAIL. The message may then eventually be matched to the wrong receive entry. Store the raw IP address in this case and try to recover the av_index at match time.

This bug was triggered in recent MPICH CI testing after we revamped support for MPI Sessions. With the world model, there is a barrier at the end of MPI_Init, which prevents messages from arriving before all fi_av_insert calls complete. With sessions, there is no such barrier.
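
To illustrate the recovery idea described above, here is a minimal, self-contained sketch. It does not use the sockets provider's real data structures: `raw_addr`, `av_table`, `buffered_msg`, `av_insert`, `av_lookup`, `resolve_sender`, and the `SKETCH_ADDR_NOTAVAIL` constant are all hypothetical names invented for this example, and the address vector is modeled as a plain array.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for FI_ADDR_NOTAVAIL; not the libfabric definition. */
#define SKETCH_ADDR_NOTAVAIL UINT64_MAX
#define AV_CAPACITY 16

/* Hypothetical raw address: IPv4 address and port as plain integers. */
struct raw_addr {
    uint32_t ip;
    uint16_t port;
};

/* Hypothetical address vector: slot i holds the raw address for index i. */
struct av_table {
    struct raw_addr entries[AV_CAPACITY];
    size_t count;
};

/* Hypothetical record for a buffered (unexpected) message. */
struct buffered_msg {
    uint64_t av_index;   /* SKETCH_ADDR_NOTAVAIL if the sender was unknown on arrival */
    struct raw_addr src; /* raw source address saved on arrival */
};

/* Append an address and return its index, or SKETCH_ADDR_NOTAVAIL if full. */
static uint64_t av_insert(struct av_table *av, struct raw_addr addr)
{
    if (av->count >= AV_CAPACITY)
        return SKETCH_ADDR_NOTAVAIL;
    av->entries[av->count] = addr;
    return (uint64_t)av->count++;
}

/* Linear search for a raw address; returns its index or SKETCH_ADDR_NOTAVAIL. */
static uint64_t av_lookup(const struct av_table *av, struct raw_addr addr)
{
    for (size_t i = 0; i < av->count; i++) {
        if (av->entries[i].ip == addr.ip && av->entries[i].port == addr.port)
            return (uint64_t)i;
    }
    return SKETCH_ADDR_NOTAVAIL;
}

/* At match time, retry the lookup for messages that arrived before the insert.
 * The recovered index is cached in the message so later calls skip the search. */
static uint64_t resolve_sender(const struct av_table *av, struct buffered_msg *msg)
{
    if (msg->av_index == SKETCH_ADDR_NOTAVAIL)
        msg->av_index = av_lookup(av, msg->src);
    return msg->av_index;
}
```

In this sketch, a message buffered before its sender is inserted keeps resolving to `SKETCH_ADDR_NOTAVAIL`; once `av_insert` has been called for that sender, the next `resolve_sender` call returns the real index.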

When an unexpected message arrives before fi_av_insert, the av_index is
not yet available and is set to FI_ADDR_NOTAVAIL. The message may then
eventually be matched to the wrong recv entry. Store the raw IP address
in this case and try to recover the av_index at match time.

Signed-off-by: Hui Zhou <[email protected]>
}
rx_posted = sock_rx_get_entry(rx_ctx, rx_buffered->addr,
                              rx_buffered->tag,
                              rx_buffered->is_tagged);
Contributor Author

Question: do we allow FI_ADDR_NOTAVAIL to be matched? If so, what is the use case?

Contributor

It should be allowed. The posted recv may not care who the sender is (i.e. rx_entry->addr == FI_ADDR_UNSPEC).
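
As a rough illustration of the wildcard case mentioned above, the following sketch shows a match predicate in which a posted receive that does not care about the sender accepts a message from any source, including one whose index is still unresolved. The `posted_recv` struct, the `recv_matches` function, and the `SKETCH_ADDR_UNSPEC` constant are hypothetical names for this example, and the exact-equality tag comparison is a simplification, not the provider's actual tag-matching rule.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for FI_ADDR_UNSPEC; not the libfabric definition. */
#define SKETCH_ADDR_UNSPEC UINT64_MAX

/* Hypothetical posted receive entry. */
struct posted_recv {
    uint64_t addr;  /* expected sender index, or SKETCH_ADDR_UNSPEC for "any" */
    uint64_t tag;
    bool is_tagged;
};

/* Returns true if the posted receive accepts a message from src_addr.
 * Simplification: tags are compared for exact equality. */
static bool recv_matches(const struct posted_recv *rx, uint64_t src_addr,
                         uint64_t tag, bool is_tagged)
{
    if (rx->is_tagged != is_tagged)
        return false;
    if (is_tagged && rx->tag != tag)
        return false;
    /* A wildcard receive matches any sender, resolved or not. */
    return rx->addr == SKETCH_ADDR_UNSPEC || rx->addr == src_addr;
}
```

Under these assumptions, only receives that name a specific sender depend on the buffered message's index having been resolved.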
