-
Notifications
You must be signed in to change notification settings - Fork 74
Refactor communicator types to use C++ inheritance #983
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
b709d7f
to
3046067
Compare
bot:aws:retest |
3046067
to
c89dc96
Compare
inline nccl_net_ofi_rdma_ep_t *rdma_send_comm_get_ep() | ||
{ | ||
return reinterpret_cast<nccl_net_ofi_rdma_ep_t *>(ep); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note: I have been thinking about sane ways to abstract the get_ep
RDMA and SENDRECV communicator functions and the get_rail
RDMA send and recv comm functions so that they can be defined once per protocol. I experimented with have another protocol-specific comm base class (e.g. nccl_net_ofi_rdma_comm_t
) that just defined a generic <protocol>_comm_get_ep
function that could be used across the listen/send/recv comm derived classes, but I wasn't able to get it working.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we were probably too cute here and shouldn't have an type in the base class. If we need to get access to the base ep somewhere, we should just have a pure virtual get_base_ep() call in the base communicator class that everyone must implement (and pass their ep).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For moving the endpoint out of the base communicator class, were you envisioning having an endpoint ptr and get_base_ep
override in each of the specialized protocol-specific comms (e.g. separate nccl_net_ofi_rdma_ep_t * ep
in the RDMA listen, send, and recv comms)? Or adding a new "generic" protocol-specific comm with the endpoint ptr (e.g. a nccl_net_ofi_rdma_comm_t
class) that becomes a second parent for the specialized protocol-specific comm (e.g.:
class nccl_net_ofi_rdma_send_comm_t : public nccl_net_ofi_send_comm_t,
public nccl_net_ofi_rdma_comm_t {
).
src/nccl_ofi_rdma.cpp
Outdated
pthread_wrapper domain_lock(&domain->domain_lock); | ||
|
||
CHECK_DOMAIN_ACTIVE(domain, "reg_mr_send_comm"); | ||
CHECK_DOMAIN_ACTIVE(domain, "send regMr"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note: to differentiate between regMr function calls by send comms and recv comms, I passed in "<send/recv> regMr" as an arg to CHECK_DOMAIN_ACTIVE (even though it doesn't exactly match the function name like before) since the arg is only used for a print.
include/nccl_ofi.h
Outdated
uint64_t dest, uint64_t mr_key, nccl_net_ofi_req_t **req); | ||
int (*write_inline)(nccl_net_ofi_send_comm_t *, void* src, size_t size, | ||
uint64_t dest, uint64_t mr_key, nccl_net_ofi_req_t **request); | ||
virtual int write(void* src, size_t size, void* src_mhandle, uint64_t dest, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is write() different than send()? I'd just make it pure virtual like the others.
inline nccl_net_ofi_rdma_ep_t *rdma_send_comm_get_ep() | ||
{ | ||
return reinterpret_cast<nccl_net_ofi_rdma_ep_t *>(ep); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we were probably too cute here and shouldn't have an type in the base class. If we need to get access to the base ep somewhere, we should just have a pure virtual get_base_ep() call in the base communicator class that everyone must implement (and pass their ep).
c89dc96
to
110a7ec
Compare
src/nccl_ofi_rdma.cpp
Outdated
bool recv_completion_optional) | ||
{ | ||
nccl_net_ofi_rdma_ep_t *ep = (nccl_net_ofi_rdma_ep_t *)r_comm->base.base.ep; | ||
nccl_net_ofi_rdma_ep_t *ep = (nccl_net_ofi_rdma_ep_t *)r_comm->ep; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have lots of mixed C and C++ typecastings in this PR. For example, at this line, you keep the C typecasting. However, at line 2964, you upgrade C typecasting to C++ typecasting: from (nccl_net_ofi_rdma_mr_handle_t *)mhandle
to static_cast<nccl_net_ofi_rdma_mr_handle_t *>(mhandle)
.
Do we have a plan to upgrade all C typecastings to C++ typecastings?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thus far, my approach has been to avoid updating the typecast and NULL
--> nullptr
in functions I am not generally refactoring to keep my commits more self-contained. E.g. the code here is in the request type function insert_send_ctrl_req
, since I'm only updating the usage of r_comm
to reflect removing the base struct member variables I would wait on the typecast update until I properly refactor the request type and turn this function into a member function.
rdma_req_flush_data_t *flush_data = NULL; | ||
*ret_req = NULL; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: we can replace NULL
with nullptr
at these two lines
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See #983 (comment), since this code was in the request type function rdma_comm_alloc_flush_req
that I didn't generally refactor, I would wait on updating NULL
--> nullptr
until I refactor the request type and turn this function into a member function.
src/nccl_ofi_rdma.cpp
Outdated
|
||
free(l_comm); | ||
ret = ep->release_ep(false, false); | ||
free(this); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be delete this
?
Also, is it a good practice to free an object by itself in its own function? After this function is called, the object will become a dangling pointer and we need to make sure it's not used any more.
61aff9f
to
6442737
Compare
include/nccl_ofi.h
Outdated
class nccl_net_ofi_comm_t { | ||
public: | ||
virtual ~nccl_net_ofi_comm_t() = default; | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we expect anyone to ever instantiate an object of this class? If not, we should make this abstract by setting the destructor to be pure virtual.
include/nccl_ofi.h
Outdated
*/ | ||
struct nccl_net_ofi_listen_comm { | ||
nccl_net_ofi_comm_t base; | ||
class nccl_net_ofi_listen_comm_t : public nccl_net_ofi_comm_t { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This might be an understanding issue on my end or spin out into a bigger discussion, but I don't quite see benefits to having several abstract classes "chained" up like this. Why not just have 1 abstract class nccl_net_ofi_comm_t
which can be either. In the actual RDMA class where the implementation occurs, we could set the irrelevant functions to a no-op. This way we have a communicator "interface" and the actual implementation is left to the child classes (rdma etc). Seems cleaner, easier to understand and less code potentially. Not sure if there are other thoughts on this?
include/nccl_ofi_rdma.h
Outdated
* this struct. This allows casting between pointers of this | ||
* struct and its base struct. */ | ||
nccl_net_ofi_recv_comm_t base; | ||
class nccl_net_ofi_rdma_recv_comm_t : public nccl_net_ofi_recv_comm_t { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another point for discussion: do we have to stick with these lengthy names? Seems a great place for namespaces if we just call this recv_comm_t
under the nccl_net_ofi::rdma
namespace.
60adc31
to
d0d39cf
Compare
struct nccl_net_ofi_comm { | ||
class nccl_net_ofi_listen_comm_t { | ||
public: | ||
virtual ~nccl_net_ofi_listen_comm_t() = default; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As this is an abstract class this should be a pure virtual destructor.
enum nccl_net_ofi_comm_type_t type; | ||
nccl_net_ofi_ep_t *ep; | ||
int dev_id; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Out of curiosity, is there a reasoning these should remain public?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These should be made protected, I intended to wait on enforcing the access control until the follow-up comm refactor PR where all stand-alone functions associated with the comm classes are made members and can easily access these member variables without going through getters.
/* write and write_inline operation not support for SENDRECV protocol */ | ||
int write(void *src, size_t size, void *src_mhandle, uint64_t dest, | ||
uint64_t mr_key, nccl_net_ofi_req_t **req) override | ||
{ | ||
NCCL_OFI_WARN("Protocol does not support iwrite API function"); | ||
return -ENOTSUP; | ||
} | ||
|
||
int write_inline(void *src, size_t size, uint64_t dest, uint64_t mr_key, | ||
nccl_net_ofi_req_t **request) override | ||
{ | ||
NCCL_OFI_WARN("Protocol does not support iwrite_inline API function"); | ||
return -ENOTSUP; | ||
} | ||
|
||
/* recv/flush/read operations not supported in send communicator */ | ||
int recv(int n, void **data, size_t *sizes, int *tags, | ||
nccl_net_ofi_mr_handle_t **mhandles, nccl_net_ofi_req_t **req) override | ||
{ | ||
NCCL_OFI_WARN("Send comm does not support irecv API function"); | ||
return -EINVAL; | ||
} | ||
|
||
int flush(int n, void **data, int *sizes, | ||
nccl_net_ofi_mr_handle_t **mhandles, nccl_net_ofi_req_t **req) override | ||
{ | ||
NCCL_OFI_WARN("Send comm does not support iflush API function"); | ||
return -EINVAL; | ||
} | ||
int read(void *dest, size_t size, void *dest_mhandle, | ||
uint64_t src, uint64_t mr_key, nccl_net_ofi_req_t **req) override | ||
{ | ||
NCCL_OFI_WARN("Send comm does not support iread API function"); | ||
return -EINVAL; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe have a protected function in the based class called not supported which warns + returns -EINVAL
or -ENOTSUP
? I figure you will probably be writing this across all the child classes a good amount.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've added protected function log_unsupported_operation
to the nccl_net_ofi_xfer_comm_t
base class to implement this.
{ | ||
nccl_net_ofi_send_comm_t *send_comm = | ||
(nccl_net_ofi_send_comm_t *)sendComm; | ||
nccl_net_ofi_xfer_comm_t *send_comm = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any reason to stick with c-style castings here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I discuss this in #983 (comment), but thus far I've avoided updating the C --> C++ style casts in functions I am not generally updating to limit how much I'm touching code outside the scope of the datatype I'm refactoring. I would wait on refactoring these to C++-style casts until broadly refactoring the API functions.
delete this; | ||
ret = ep_ptr->release_ep(false, false); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IIUC calling delete on an object calls the destructors of its member variables provided they are not static....I guess I don't understand how this works and that we are sure the behavior is what we want.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since the communicator stores a raw pointer of the endpoint, the endpoint's destructor won't get automatically called by the comm's destructor. When changing the endpoint raw pointer to directly pointing the endpoint, however, you are right that this would not longer be valid. I will make a note to track cleaning up these deletes after migrating the transport datatype raw pointers to direct objects.
src/nccl_ofi_rdma.cpp
Outdated
if (r_comm == nullptr) { | ||
NCCL_OFI_WARN("Recv comm is null in post_rma_read"); | ||
return -EINVAL; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This or next might be a good PR to also add this as a macro and drop in places to reduce LOC of all the files.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added macros RET_IF_NULLPTR
and GOTO_IF_NULLPTR
to try and reduce the line count footprint of the added nullptr checks.
5ff90aa
to
72f2661
Compare
Moves the MR functions (regMr, deregMr) from the base communicator types to the base domain class. Also moves the stand-alone SENDRECV protocol regMr and deregMr implementation functions to the SENDRECV protocol domain class. Signed-off-by: Aviv Benchorin <[email protected]>
Changes the listen, receive, and send communicator structs into classes, with the RDMA and SENDRECV transport communicators classes being derived from a base listen comm and a base xfer comm (for recv and send comms). Does not yet add constructor/destructor or convert function pointers to virtual functions. Signed-off-by: Aviv Benchorin <[email protected]>
Moves stand-alone member getter functions for RDMA and SENDRECV protocol into inline member functions for communicator classes, e.g. getting the communicator's endpoint member. Signed-off-by: Aviv Benchorin <[email protected]>
To protect against static analysis null pointer dereference errors when refactoring comm class function pointers into member functions, this adds null pointer checking when trying to access recv and send comm class members such as endpoint and comm rail pointers. Signed-off-by: Aviv Benchorin <[email protected]>
Refactors the base listen comm class function pointers into proper virtual member functions that are overriden by the RDMA and SENDRECV transport listen comm classes. Signed-off-by: Aviv Benchorin <[email protected]>
Refactors the base xfer comm class function pointers into proper virtual member functions that are overriden by the RDMA and SENDRECV transport protocol send and recv comm classes. Signed-off-by: Aviv Benchorin <[email protected]>
72f2661
to
a3c883a
Compare
Currently, the communicators data types have an inheritance hierarchy where the base communicator type is inherited by base listen, receive, and send communciator types, and the specialized base comm types are inherited by the transport-specific comm types. This PR replaces the C-style inheritance of having the parent comm type as the first member of the derived comm types with C++ inheritance, and refactors all function pointers for the base listen, receive, and send communicator classes into virtual member functions that the protocol-specific communicator classes implement.
This PR only handle the direct translation of the C-style comm structs into C++ classes with inheritance, and turning function pointers into member functions. A future PR will need to:
protected
keyword for members that shouldn't be publicly accessible)(Continuing C++ refactor started with #851, #938, and #950, #955, and #971).
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.