-
Notifications
You must be signed in to change notification settings - Fork 74
enh: Select platform optimizations at runtime #995
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
ff81802
to
83b2db1
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I stopped after nccl_ofi_platform.cpp
, because there are some fairly significant architectural approaches to the problem.
The commit message needs a much more detailed description of the change and why you're making it.
83b2db1
to
00cb220
Compare
Split changes out into 2 PR's a refactor and the actual selection logic (this PR). Waiting on other PR to be merged but just posting this for review. Please review the latest commit ONLY as that is what is relevant to this PR. |
00cb220
to
22bfc33
Compare
22bfc33
to
c0e66a1
Compare
b13016c
to
540e588
Compare
41d7b69
to
63b0161
Compare
9c31d48
to
628554e
Compare
628554e
to
5e06486
Compare
5e06486
to
59f43f0
Compare
bot:aws:retest |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Didn't make it through the full PR, will try to finish today.
public: | ||
const char* get_name() const override { return "TestPlatform"; } | ||
int get_priority() override { return 10; } | ||
int get_priority() override { return 200; } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This does not seem like a good design pattern.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had to keep this as part of the priority value approach...not sure how else to modify/change this?
e842618
to
4124747
Compare
Replace compile-time AWS platform detection with runtime EFA device detection using hwloc. This allows a single binary to work on both AWS and non-AWS environments, automatically enabling optimizations when EFA hardware is present. Removes autotools platform checks and always builds AWS platform code. Signed-off-by: Hershel Shah <[email protected]>
4124747
to
62d3961
Compare
echo "* Platform-specific optimizations: ${NCCL_OFI_PLATFORM}" | ||
AS_IF([test "${NCCL_OFI_PLATFORM}" = "none"], | ||
[echo "* Platform Optimizations: DISABLED"], | ||
[echo "* Platform Optimizations: ${NCCL_OFI_PLATFORM}"]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This isn't quite what I asked for. "DISABLED" is really wrong, since default will still be built. Just build up a list of the enabled platforms (including the default plugin, of course).
/* | ||
* Override platform selection. Valid options: "AWS", "Default", or empty string for auto-detection. | ||
*/ | ||
OFI_NCCL_PARAM(std::string, platform_override, "PLATFORM_OVERRIDE", ""); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why does "override" make sense? It doesn't match all the other option we have (like OFI_NCCL_PROGRESS_MODEL above). This should just be OFI_NCCL_PLATFORM.
*/ | ||
int nccl_ofi_topo_write_nccl_topology(nccl_ofi_topo_t *topo, FILE *file); | ||
|
||
class TopologyManager { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I strongly object to half-refactoring this code. There's also no need for a topology manager object.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@bwbarrett The reasoning for this was that this file is going to be refactored as part of the c++ refactor in the near future. As a result I figured might be a good time to get started on it. I can revert it and just have a static unique pointer which would get the global topology as an alternative. Given the C++ refactor should I stick with this approach or use the static pointer alternative?
Description of changes:
Replace compile-time AWS platform detection with runtime EFA device detection using hwloc. This allows a single binary to work on both AWS and non-AWS environments, automatically enabling optimizations when EFA hardware is present. Removes autotools platform checks and always builds AWS platform code.
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.