-
Notifications
You must be signed in to change notification settings - Fork 445
prov/efa: Fail early in rma when > 1 iov is passed #11429
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
prov/efa/src/efa_rma.c
Outdated
|
||
assert(msg->iov_count > 0 && | ||
msg->iov_count <= base_ep->domain->info->tx_attr->iov_limit); | ||
if (OFI_UNLIKELY(msg->iov_count > 1)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is "max_sge_rd" inside efa_device.ibv_attr that should hold this value.
We should use it to be future compatible
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you! I will use it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Think about it more, the current data path direct code is already hard coded to only support 1 sge https://github.com/ofiwg/libfabric/blob/main/prov/efa/src/efa_io_defs.h#L125-L129. Though it is the same code as rdma-core and the firmware code today, if rdma-core and firmware change in the future and make max_sge_rd returned by rdma-core increase in the future, we will need to accommodate in libfabric to follow up, otherwise we will get unexpected crash. In that regard maybe having a hard-coded 1 limit is better.
WDYT @amitrad-aws @mrgolin
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Per offline discussion, we will introduce a defined macro for libfabric internally and use the min of the macro and the rdma-core return value
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please do this only when using Libfabric datapath implementation, otherwise it should be transparent.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It will make non-datapath-direct not fail early with the same warning message, though rdma-core itself is checking it and return EINVAL, but it's not in libfabric's logging system and users won't know what happened immediately
8e25874
to
59e87bf
Compare
The latest change broke unit test on platforms that don't support rdma, will update |
59e87bf
to
2e1bab2
Compare
bot:aws:retest |
1 similar comment
bot:aws:retest |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fi_read
can only provide a single buffer. fi_readv
and fi_readmsg
can provide multiple IOVs
So I think we should just implement our readv
and readmsg
to iterate over all IOV and post multiple read/write requests. Similarly for the write APIs.
prov/efa/src/efa_rma.c
Outdated
|
||
assert(msg->iov_count > 0 && | ||
msg->iov_count <= base_ep->domain->info->tx_attr->iov_limit); | ||
if (OFI_UNLIKELY(msg->iov_count > base_ep->domain->device->ibv_attr.max_sge_rd)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This check happens even for fi_read
calls which are guaranteed to only provide a single IOV. If we don't want to iterate and want the check, it should be in efa_rma_readv
, efa_rma_readmsg
and the equivalent functions for write and RDM protocol paths
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
RDM protocol path doesn't have such issue, it currently only construct 1 WQE per IOV
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The RDM path looks messy
We have asserts like this
libfabric/prov/efa/src/rdm/efa_rdm_ope.c
Line 1478 in 223d0b6
assert(ope->iov_count > 0 && ope->iov_count <= efa_rdm_ep_domain(ep)->info->tx_attr->iov_limit); |
But we also copy the iov_cnt
libfabric/prov/efa/src/rdm/efa_rdm_rma.c
Line 66 in 223d0b6
msg.iov_count = msg_rma->iov_count; |
libfabric/prov/efa/src/rdm/efa_rdm_ope.c
Line 42 in 223d0b6
txe->iov_count = msg->iov_count; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But if you read the logic below in the while loop
libfabric/prov/efa/src/rdm/efa_rdm_ope.c
Line 1568 in 223d0b6
write_once_len = MIN(ope->iov[iov_idx].iov_len - iov_offset, |
It actually only run 1 iov per efa_rdm_pke_write, the code is really messy.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The txe and msg can have multi iov, it will finally aggregate all the pkes from it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I saw that. Because we iterate and post a single IOV at a time, we should remove the asserts.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But the asserts you referred to are the ope/msg ones, which should still be restricted by the tx iov limit...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should they? I'm not sure why we should have asserts that are not enforced by the code that follows. That just leads to different behavior with debug and production builds
If we want to restrict the number of IOVs to tx_iov_limit
, then we need a check in production builds too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think they should be restricted as it is what we returned for the tx iov limit in fi_getinfo. We can add a check for sure
2e1bab2
to
42f45ac
Compare
da794db
to
8c123b7
Compare
efa-direct offloads the rma request to rdma wqes in efa device directly. But today efa device doesn't support > 1 iov (queried as ibv_device_attr.max_sge_rd). This patch makes efa_post_read and efa_post_write fail early with clear error message for this scenario to avoid unexpected crashing. Signed-off-by: Shi Jin <[email protected]>
8c123b7
to
ddf8745
Compare
/* Add rma caps explicitly to ep->info to allow local test */ | ||
base_ep->info->caps |= FI_RMA; | ||
/* Mock the max rdma sge to allow local test */ | ||
base_ep->domain->device->ibv_attr.max_sge_rd = EFA_DEVICE_MAX_RDMA_SGE; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you really want to test your minimum you shuold put a larger value
if (err) | ||
return err; | ||
|
||
err = efa_rma_check_iov_count(base_ep, msg->iov_count); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this should be called from efa_data_path_direct_wr_set_sge_list
or adjacent to the calls to this function since 1) it's directly related to WQE format, 2) we should avoid enforcing Libfabric hardcoded limits when using rdma-core data path.
efa-direct offloads the rma request to rdma WQEs in efa device directly. But today efa device doesn't support > 1 sge per WQE. This patch makes efa_post_read and efa_post_write fail early with clear error message for this scenario to avoid unexpected crashing.