
Contributor

@hershys-aws hershys-aws commented Sep 3, 2025

Description of changes:
Replace compile-time AWS platform detection with runtime EFA device detection using hwloc. This allows a single binary to work on both AWS and non-AWS environments, automatically enabling optimizations when EFA hardware is present. Removes autotools platform checks and always builds AWS platform code.
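As a rough illustration of the runtime check described above, the sketch below decides whether an EFA adapter is present from a list of PCI identifiers. It is not the PR's implementation: `PciId` and `has_efa_device` are hypothetical names, the vendor and device IDs are assumptions that should be verified, and in the plugin the device list would presumably be gathered from the hwloc topology rather than passed in by hand.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical PCI identity record. In the plugin this information would
// come from the hwloc topology, not from a hand-built vector.
struct PciId {
    std::uint16_t vendor;
    std::uint16_t device;
};

// Assumed identifiers: 0x1d0f as Amazon's PCI vendor ID and the values below
// as EFA device IDs. Verify both against the hardware before relying on them.
constexpr std::uint16_t kAmazonVendorId = 0x1d0f;
constexpr std::uint16_t kEfaDeviceIds[] = {0xefa0, 0xefa1, 0xefa2};

// Returns true when at least one device in the list looks like an EFA adapter.
inline bool has_efa_device(const std::vector<PciId> &devices)
{
    for (const PciId &dev : devices) {
        if (dev.vendor != kAmazonVendorId) {
            continue;
        }
        for (std::uint16_t id : kEfaDeviceIds) {
            if (dev.device == id) {
                return true;
            }
        }
    }
    return false;
}
```

Because the decision is made from data available at run time, the same binary can take the AWS path on an EFA-equipped host and fall back to the default path elsewhere.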

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@hershys-aws hershys-aws self-assigned this Sep 3, 2025
@hershys-aws hershys-aws requested review from bwbarrett and a team as code owners September 3, 2025 19:05
Contributor

@bwbarrett bwbarrett left a comment

I stopped after nccl_ofi_platform.cpp, because there are some fairly significant architectural issues with the approach to the problem.

The commit message needs a much more detailed description of the change and why you're making it.

@hershys-aws hershys-aws changed the title from "enh: Optimize for AWS platform at runtime" to "enh: Select platform optimizations at runtime" Sep 12, 2025
@hershys-aws
Contributor Author

Split the changes into two PRs: a refactor and the actual selection logic (this PR). I'm waiting on the other PR to be merged but am posting this one for review now. Please review ONLY the latest commit, as that is what is relevant to this PR.

@hershys-aws hershys-aws force-pushed the always-enable-aws branch 3 times, most recently from b13016c to 540e588 Compare September 24, 2025 20:51
@hershys-aws hershys-aws force-pushed the always-enable-aws branch 2 times, most recently from 41d7b69 to 63b0161 Compare September 25, 2025 17:25
@hershys-aws hershys-aws force-pushed the always-enable-aws branch 2 times, most recently from 9c31d48 to 628554e Compare September 25, 2025 21:06
rauteric previously approved these changes Sep 29, 2025
@hershys-aws
Contributor Author

bot:aws:retest

Contributor

@bwbarrett bwbarrett left a comment

Didn't make it through the full PR, will try to finish today.

  public:
  const char* get_name() const override { return "TestPlatform"; }
- int get_priority() override { return 10; }
+ int get_priority() override { return 200; }
Contributor

This does not seem like a good design pattern.

Contributor Author

I had to keep this as part of the priority-value approach. I'm not sure how else to change it?

@hershys-aws hershys-aws force-pushed the always-enable-aws branch 2 times, most recently from e842618 to 4124747 Compare October 6, 2025 22:24
Replace compile-time AWS platform detection with runtime EFA device
detection using hwloc. This allows a single binary to work on both
AWS and non-AWS environments, automatically enabling optimizations
when EFA hardware is present. Removes autotools platform checks and
always builds AWS platform code.

Signed-off-by: Hershel Shah <[email protected]>
- echo "* Platform-specific optimizations: ${NCCL_OFI_PLATFORM}"
+ AS_IF([test "${NCCL_OFI_PLATFORM}" = "none"],
+ [echo "* Platform Optimizations: DISABLED"],
+ [echo "* Platform Optimizations: ${NCCL_OFI_PLATFORM}"])
Contributor

This isn't quite what I asked for. "DISABLED" is really wrong, since default will still be built. Just build up a list of the enabled platforms (including the default plugin, of course).

/*
* Override platform selection. Valid options: "AWS", "Default", or empty string for auto-detection.
*/
OFI_NCCL_PARAM(std::string, platform_override, "PLATFORM_OVERRIDE", "");
Contributor

Why does "override" make sense? It doesn't match all the other options we have (like OFI_NCCL_PROGRESS_MODEL above). This should just be OFI_NCCL_PLATFORM.
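To make the proposed setting concrete, the sketch below maps the value of an `OFI_NCCL_PLATFORM` environment variable to a selection choice. The accepted strings ("AWS", "Default", or empty for auto-detection) come from the comment in the diff above; everything else is an assumption for illustration, including the `PlatformChoice` enum, the function names, the case-sensitive matching, and reading the variable with `std::getenv` instead of the project's `OFI_NCCL_PARAM` machinery.

```cpp
#include <cstdlib>
#include <string>

// Hypothetical result type for the platform setting.
enum class PlatformChoice { Auto, Aws, Default, Invalid };

// Maps the raw setting to a choice. Matching is case-sensitive here, which
// is an assumption; the PR may treat case differently.
inline PlatformChoice parse_platform_setting(const std::string &value)
{
    if (value.empty()) {
        return PlatformChoice::Auto;     // no setting: auto-detect at run time
    }
    if (value == "AWS") {
        return PlatformChoice::Aws;
    }
    if (value == "Default") {
        return PlatformChoice::Default;
    }
    return PlatformChoice::Invalid;      // caller decides how to report this
}

// Reads the variable name suggested in the review. An unset variable is
// treated the same as an empty one.
inline PlatformChoice platform_from_environment()
{
    const char *raw = std::getenv("OFI_NCCL_PLATFORM");
    return parse_platform_setting(raw != nullptr ? raw : "");
}
```

Keeping the parsing separate from the environment lookup makes the mapping easy to unit test without touching the process environment.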

*/
int nccl_ofi_topo_write_nccl_topology(nccl_ofi_topo_t *topo, FILE *file);

class TopologyManager {
Contributor

I strongly object to half-refactoring this code. There's also no need for a topology manager object.

Contributor Author

@bwbarrett The reasoning was that this file is going to be refactored as part of the C++ refactor in the near future, so I figured it might be a good time to get started on it. As an alternative, I can revert it and just have a static unique pointer that holds the global topology. Given the upcoming C++ refactor, should I stick with this approach or use the static-pointer alternative?
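A minimal sketch of the static-unique-pointer alternative mentioned here, under stated assumptions: `Topology` is a hypothetical stand-in for `nccl_ofi_topo_t`, and the accessor names are invented for illustration. The real type would likely need a custom deleter that calls the plugin's own topology free routine.

```cpp
#include <memory>
#include <utility>

// Hypothetical stand-in for the plugin's nccl_ofi_topo_t.
struct Topology {
    int num_nodes = 0;
};

// Owns the single global topology. The function-local static is created on
// first use and destroyed at program exit.
inline std::unique_ptr<Topology> &global_topology_slot()
{
    static std::unique_ptr<Topology> slot;
    return slot;
}

// Installs a topology, replacing and freeing any previously installed one.
inline void set_global_topology(std::unique_ptr<Topology> topo)
{
    global_topology_slot() = std::move(topo);
}

// Returns a non-owning pointer, or nullptr if nothing has been installed.
inline Topology *get_global_topology()
{
    return global_topology_slot().get();
}
```

This keeps ownership in one place without introducing a manager class; callers only ever see a raw, non-owning pointer.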


4 participants