Skip to content

Conversation

shijin-aws
Copy link
Contributor

@shijin-aws shijin-aws commented Sep 19, 2025

efa-direct offloads the rma request to rdma WQEs in efa device directly. But today efa device doesn't support > 1 sge per WQE. This patch makes efa_post_read and efa_post_write fail early with clear error message for this scenario to avoid unexpected crashing.

@shijin-aws shijin-aws requested a review from a team September 19, 2025 22:31

assert(msg->iov_count > 0 &&
msg->iov_count <= base_ep->domain->info->tx_attr->iov_limit);
if (OFI_UNLIKELY(msg->iov_count > 1)) {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is "max_sge_rd" inside efa_device.ibv_attr that should hold this value.
We should use it to be future compatible

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you! I will use it

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated

Copy link
Contributor Author

@shijin-aws shijin-aws Sep 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Think about it more, the current data path direct code is already hard coded to only support 1 sge https://github.com/ofiwg/libfabric/blob/main/prov/efa/src/efa_io_defs.h#L125-L129. Though it is the same code as rdma-core and the firmware code today, if rdma-core and firmware change in the future and make max_sge_rd returned by rdma-core increase in the future, we will need to accommodate in libfabric to follow up, otherwise we will get unexpected crash. In that regard maybe having a hard-coded 1 limit is better.

WDYT @amitrad-aws @mrgolin

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Per offline discussion, we will introduce a defined macro for libfabric internally and use the min of the macro and the rdma-core return value

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please do this only when using Libfabric datapath implementation, otherwise it should be transparent.

Copy link
Contributor Author

@shijin-aws shijin-aws Sep 30, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It will make non-datapath-direct not fail early with the same warning message, though rdma-core itself is checking it and return EINVAL, but it's not in libfabric's logging system and users won't know what happened immediately

@shijin-aws
Copy link
Contributor Author

shijin-aws commented Sep 24, 2025

The latest change broke unit test on platforms that don't support rdma, will update

@Zhenye-Na
Copy link

bot:aws:retest

1 similar comment
@shijin-aws
Copy link
Contributor Author

bot:aws:retest

Copy link
Contributor

@sunkuamzn sunkuamzn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fi_read can only provide a single buffer. fi_readv and fi_readmsg can provide multiple IOVs

So I think we should just implement our readv and readmsg to iterate over all IOV and post multiple read/write requests. Similarly for the write APIs.


assert(msg->iov_count > 0 &&
msg->iov_count <= base_ep->domain->info->tx_attr->iov_limit);
if (OFI_UNLIKELY(msg->iov_count > base_ep->domain->device->ibv_attr.max_sge_rd)) {
Copy link
Contributor

@sunkuamzn sunkuamzn Sep 26, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This check happens even for fi_read calls which are guaranteed to only provide a single IOV. If we don't want to iterate and want the check, it should be in efa_rma_readv, efa_rma_readmsg and the equivalent functions for write and RDM protocol paths

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

RDM protocol path doesn't have such issue, it currently only construct 1 WQE per IOV

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The RDM path looks messy

We have asserts like this

assert(ope->iov_count > 0 && ope->iov_count <= efa_rdm_ep_domain(ep)->info->tx_attr->iov_limit);

But we also copy the iov_cnt

msg.iov_count = msg_rma->iov_count;

txe->iov_count = msg->iov_count;

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But if you read the logic below in the while loop

write_once_len = MIN(ope->iov[iov_idx].iov_len - iov_offset,

It actually only run 1 iov per efa_rdm_pke_write, the code is really messy.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The txe and msg can have multi iov, it will finally aggregate all the pkes from it

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I saw that. Because we iterate and post a single IOV at a time, we should remove the asserts.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But the asserts you referred to are the ope/msg ones, which should still be restricted by the tx iov limit...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should they? I'm not sure why we should have asserts that are not enforced by the code that follows. That just leads to different behavior with debug and production builds

If we want to restrict the number of IOVs to tx_iov_limit, then we need a check in production builds too.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think they should be restricted as it is what we returned for the tx iov limit in fi_getinfo. We can add a check for sure

@shijin-aws shijin-aws force-pushed the rdma_iov_limit branch 2 times, most recently from da794db to 8c123b7 Compare September 29, 2025 21:52
efa-direct offloads the rma request to rdma wqes in efa device
directly. But today efa device doesn't support > 1 iov (queried as
ibv_device_attr.max_sge_rd). This
patch makes efa_post_read and efa_post_write fail early with
clear error message for this scenario to avoid unexpected crashing.

Signed-off-by: Shi Jin <[email protected]>
/* Add rma caps explicitly to ep->info to allow local test */
base_ep->info->caps |= FI_RMA;
/* Mock the max rdma sge to allow local test */
base_ep->domain->device->ibv_attr.max_sge_rd = EFA_DEVICE_MAX_RDMA_SGE;

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you really want to test your minimum you shuold put a larger value

if (err)
return err;

err = efa_rma_check_iov_count(base_ep, msg->iov_count);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should be called from efa_data_path_direct_wr_set_sge_list or adjacent to the calls to this function since 1) it's directly related to WQE format, 2) we should avoid enforcing Libfabric hardcoded limits when using rdma-core data path.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants