Skip to content

Conversation

hershys-aws
Copy link
Contributor

Description of changes:
Cherry-picking commits from #983 to continue C++ refactor. Most of the changes are the same but slight changes have been done to simplify code/changes.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

* non-zero on error
*/
virtual int regMr(nccl_net_ofi_comm_t *comm, nccl_ofi_mr_ckey_ref ckey, int type,
void **mhandle) = 0;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have this return a nccl_net_ofi_mr_handle_t, so we have some semblance of type checking?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would involve changing the function signature. I'm not sure if we also want to change that behavior in this commit? I think this should be done in a follow up PR.

nccl_net_ofi_recv_comm_t **recv_comm);
int (*close)(nccl_net_ofi_listen_comm_t *listen_comm);
};
class nccl_net_ofi_xfer_comm_t {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I follow why we want a xfer_comm_t base class, instead of separate send/recv base classes.

This has abstract methods that only make sense for one or the other, like send and recv.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The idea was that each subclass could implement whichever functions it needed to and that if a send class tried to call function it wasn't supposed to we would end up with a runtime error.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also prefer having separate send/recv base classes as opposed to having a single xfer_comm_t base class.

uint16_t rail_id = 0;
nccl_net_ofi_rdma_recv_comm_rail_t *comm_rail = rdma_recv_comm_get_rail(r_comm, rail_id);
auto *comm_rail = r_comm->rdma_recv_comm_get_rail(rail_id);
RET_IF_NULLPTR(comm_rail, "recv comm rail", -EINVAL);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can these just be asserts? Seems a bit heavy for something we never expect to happen, and an unnecessary runtime check in release builds.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you talking about just here or everywhere RET_IF_NULLPTR is called? Is there a place I can find out which places we expect a nullptr and which we don't. We have several spots in the plugin where we set pointer values to nullptr and I'm unsure where we would expect to see it and where we wouldn't.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general, barring bugs in the plugin, we only can have erroneous nullptrs as arguments coming from the NCCL net API (e.g., irecv). For something like comm_rail, which is initialized by us, we typically either omit null checks or make them assert. I guess I would lean toward not introducing extra checks like this, except as asserts (which are a bit more readable, and compiled out for release builds).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rauteric I have change the commit so that the points coming from the NET API would be checked and the internal ones would have an assert. Let me know if anything else needs to change.

AvivBenchorin and others added 5 commits October 14, 2025 16:21
Moves the MR functions (regMr, deregMr) from the base communicator types
to the base domain class. Also moves the stand-alone SENDRECV protocol
regMr and deregMr implementation functions to the SENDRECV protocol
domain class.

Signed-off-by: Aviv Benchorin <[email protected]>
Signed-off-by: Hershel Shah <[email protected]>
Changes the listen, receive, and send communicator structs into classes,
with the RDMA and SENDRECV transport communicators classes being derived
from a base listen comm and a base xfer comm (for recv and send comms).
Does not yet add constructor/destructor or convert function pointers to
virtual functions.

Signed-off-by: Aviv Benchorin <[email protected]>
Moves stand-alone member getter functions for RDMA and SENDRECV protocol
into inline member functions for communicator classes, e.g. getting the
communicator's endpoint member.

Signed-off-by: Aviv Benchorin <[email protected]>
To protect against static analysis null pointer dereference errors when
refactoring comm class function pointers into member functions, this
adds null pointer checking when trying to access recv and send comm
class members such as endpoint and comm rail pointers. Use assert() for
internal pointers (rails, comm_rail, ep) and RET_IF_NULLPTR() for
external input (r_comm, s_comm casts).

Signed-off-by: Aviv Benchorin <[email protected]>
Signed-off-by: Hershel Shah <[email protected]>
Refactors the base listen comm class function pointers into proper
virtual member functions that are overriden by the RDMA and SENDRECV
transport listen comm classes.

Signed-off-by: Aviv Benchorin <[email protected]>
@hershys-aws hershys-aws marked this pull request as ready for review October 14, 2025 16:22
@hershys-aws hershys-aws requested a review from a team as a code owner October 14, 2025 16:22
nccl_net_ofi_recv_comm_t **recv_comm);
int (*close)(nccl_net_ofi_listen_comm_t *listen_comm);
};
class nccl_net_ofi_xfer_comm_t {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also prefer having separate send/recv base classes as opposed to having a single xfer_comm_t base class.

return ep->get_domain();
}

enum nccl_net_ofi_comm_type_t type;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need this type here ?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants