Skip to content

Conversation

nmazzilli3
Copy link

Problem:

  • Users was submitting fabtests without the --do-dmabuf-reg-for-hmem flag

Solution:

  • Added hmem cuda logic changes based on nvidia manual [https://docs.nvidia.com/cuda/gpudirect-rdma/ here]
  • Added check_dmabuf to init cuda internals and get dmabuf support information
  • Added conftest to validate these checks if a user specifies cuda command in fabtests
  • Updated fabtest logger statements to include timestamp

Testing:

  • Validated tests with --do-dmabuf-reg-for-hmem flag and without it on p6-gb200 cluster
  • All tests skipped with cuda command
  • make -j install && python3 install/bin/runfabtests.py --expression "cuda" -vvv -p /home/nmazzill/libfabric/fabtests/install/bin/ --junit-xml ft_git branch --show-currentgit rev-parse HEADall_junit.xml --nworkers 16 -b efa 10.0.123.149 10.0.121.190 | tee ftgit branch --show-currentgit rev-parse HEAD_all_stdout
  • All tests passed with this cuda command
  • make -j install && python3 install/bin/runfabtests.py --expression "cuda" --do-dmabuf-reg-for-hmem -vvv -p /home/nmazzill/libfabric/fabtests/install/bin/ --junit-xml ft_git branch --show-currentgit rev-parse HEADall_junit.xml --nworkers 16 -b efa 10.0.123.149 10.0.121.190 | tee ftgit branch --show-currentgit rev-parse HEAD_all_stdout

Sim Issue:

  • N/A

@darrylabbate
Copy link
Member

Piggybacking on @jiaxiyan's comments here (#11437 (comment)): we don't necessarily need to split commits based on which parts of the codebase are touched, but we do want to keep commits as atomic as possible.

@shijin-aws
Copy link
Contributor

Can you rephrase your commit title to be consistent with the PR title

@nmazzilli3
Copy link
Author

Can you rephrase your commit title to be consistent with the PR title
To ensure I'm understanding this correctly, you'd like me to add

fabtests/efa: cuda dmabuf validation logic to the top of the commit message and then keep everything else below it?

@shijin-aws
Copy link
Contributor

Can you rephrase your commit title to be consistent with the PR title
To ensure I'm understanding this correctly, you'd like me to add

fabtests/efa: cuda dmabuf validation logic to the top of the commit message and then keep everything else below it?

yes

test: Adding cuda dmabuf validation logic

Problem:
  - Users was submitting fabtests without the --do-dmabuf-reg-for-hmem flag

Solution:
  - Added hmem cuda logic changes based on nvidia manual [https://docs.nvidia.com/cuda/gpudirect-rdma/ here]
  - Added check_dmabuf to init cuda internals and get dmabuf support information
  - Added conftest to validate these checks if a user specifies cuda command in fabtests

Testing:
  - Validated tests with --do-dmabuf-reg-for-hmem flag and without it on p6-gb200 cluster
  - All tests skipped with cuda command
  - make -j install && python3 install/bin/runfabtests.py --expression "cuda" -vvv -p /home/nmazzill/libfabric/fabtests/install/bin/ --junit-xml ft_`git branch --show-current`_`git rev-parse HEAD`_all_junit.xml --nworkers 16 -b efa 10.0.123.149 10.0.121.190 | tee ft_`git branch --show-current`_`git rev-parse HEAD`_all_stdout
  - All tests passed with this cuda command
  - make -j install && python3 install/bin/runfabtests.py --expression "cuda" --do-dmabuf-reg-for-hmem -vvv -p /home/nmazzill/libfabric/fabtests/install/bin/ --junit-xml ft_`git branch --show-current`_`git rev-parse HEAD`_all_junit.xml --nworkers 16 -b efa 10.0.123.149 10.0.121.190 | tee ft_`git branch --show-current`_`git rev-parse HEAD`_all_stdout

Sim Issue:
  - N/A

Signed-off-by: Nick Mazzilli <[email protected]>
@shijin-aws shijin-aws requested a review from j-xiong October 2, 2025 23:02
@shijin-aws
Copy link
Contributor

@j-xiong can u review the fabtests common code change, thanks

has_cuda_mark = any(mark.name == 'cuda_memory' for mark in request.node.iter_markers())

if has_cuda_mark:
print("Running CUDA validation")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you remove the print in this function? It will show up in all the test outputs, which is too verbose.

Copy link
Contributor

@j-xiong j-xiong left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In addition to the comments below, I would suggest breaking this into two commits -- one for the common code changes and one for the efa test changes.

Comment on lines +1 to +2
/* SPDX-License-Identifier: BSD-2-Clause OR GPL-2.0-only */
/* SPDX-FileCopyrightText: Copyright Amazon.com, Inc. or its affiliates. All rights reserved. */
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please follow the convention of multiline comments here.

@@ -0,0 +1,35 @@
/* SPDX-License-Identifier: BSD-2-Clause OR GPL-2.0-only */
/* SPDX-FileCopyrightText: Copyright Amazon.com, Inc. or its affiliates. All rights reserved. */
#include <stdio.h>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add an empty line before this.

Comment on lines +39 to +43
CUDA_MEMORY_SUPPORT__NOT_INITIALIZED = -1,
CUDA_MEMORY_SUPPORT__NOT_SUPPORTED = 0,
CUDA_MEMORY_SUPPORT__DMA_BUF_ONLY = 1,
CUDA_MEMORY_SUPPORT__GDR_ONLY= 2,
CUDA_MEMORY_SUPPORT__DMABUF_GDR_BOTH = 3,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alignment is off, most like due to space being used instead of tabs.

CUDA_MEMORY_SUPPORT__DMA_BUF_ONLY = 1,
CUDA_MEMORY_SUPPORT__GDR_ONLY= 2,
CUDA_MEMORY_SUPPORT__DMABUF_GDR_BOTH = 3,
} cuda_memory_support_e;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't use typedef here. Just define the enum, and use enum enum_name var_name to define variables.

I would use FT_CUDA_xxx instead of CUDA_xxx to avoid name space confusion. So for the lowercase names.

return CUDA_MEMORY_SUPPORT__NOT_SUPPORTED;
}

cuda_memory_support_e cuda_memory_support = dmabuf_viable_and_supported();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not call ft_cuda_memory_support() directly?

Comment on lines +238 to +246
if (!gdr_supported && !dmabuf_supported) {
cuda_memory_support = CUDA_MEMORY_SUPPORT__NOT_SUPPORTED;
} else if (gdr_supported && dmabuf_supported) {
cuda_memory_support = CUDA_MEMORY_SUPPORT__DMABUF_GDR_BOTH;
} else if (dmabuf_supported) {
cuda_memory_support = CUDA_MEMORY_SUPPORT__DMA_BUF_ONLY;
} else {
cuda_memory_support = CUDA_MEMORY_SUPPORT__GDR_ONLY;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove all the braces here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants