fabtests/efa: cuda dmabuf validation logic #11443

nmazzilli3 · 2025-09-25T17:45:10Z

Problem:

Users were submitting fabtests without the --do-dmabuf-reg-for-hmem flag

Solution:

Added hmem cuda logic changes based on the NVIDIA GPUDirect RDMA manual (https://docs.nvidia.com/cuda/gpudirect-rdma/)
Added check_dmabuf to init cuda internals and get dmabuf support information
Added conftest to validate these checks if a user specifies cuda command in fabtests
Updated fabtest logger statements to include timestamp

Testing:

Validated tests with --do-dmabuf-reg-for-hmem flag and without it on p6-gb200 cluster
All tests skipped with cuda command
make -j install && python3 install/bin/runfabtests.py --expression "cuda" -vvv -p /home/nmazzill/libfabric/fabtests/install/bin/ --junit-xml ft_`git branch --show-current`_`git rev-parse HEAD`_all_junit.xml --nworkers 16 -b efa 10.0.123.149 10.0.121.190 | tee ft_`git branch --show-current`_`git rev-parse HEAD`_all_stdout
All tests passed with this cuda command
make -j install && python3 install/bin/runfabtests.py --expression "cuda" --do-dmabuf-reg-for-hmem -vvv -p /home/nmazzill/libfabric/fabtests/install/bin/ --junit-xml ft_`git branch --show-current`_`git rev-parse HEAD`_all_junit.xml --nworkers 16 -b efa 10.0.123.149 10.0.121.190 | tee ft_`git branch --show-current`_`git rev-parse HEAD`_all_stdout

Sim Issue:

N/A

darrylabbate · 2025-09-25T18:20:27Z

Piggybacking on @jiaxiyan's comments here (#11437 (comment)): we don't necessarily need to split commits based on which parts of the codebase are touched, but we do want to keep commits as atomic as possible.

shijin-aws · 2025-10-01T17:51:13Z

Can you rephrase your commit title to be consistent with the PR title

nmazzilli3 · 2025-10-01T20:32:03Z

> Can you rephrase your commit title to be consistent with the PR title

To ensure I'm understanding this correctly, you'd like me to add "fabtests/efa: cuda dmabuf validation logic" to the top of the commit message and then keep everything else below it?

shijin-aws · 2025-10-01T21:09:07Z

> > Can you rephrase your commit title to be consistent with the PR title
>
> To ensure I'm understanding this correctly, you'd like me to add "fabtests/efa: cuda dmabuf validation logic" to the top of the commit message and then keep everything else below it?

yes

test: Adding cuda dmabuf validation logic

Problem:
- Users were submitting fabtests without the --do-dmabuf-reg-for-hmem flag

Solution:
- Added hmem cuda logic changes based on the NVIDIA GPUDirect RDMA manual (https://docs.nvidia.com/cuda/gpudirect-rdma/)
- Added check_dmabuf to init cuda internals and get dmabuf support information
- Added conftest to validate these checks if a user specifies cuda command in fabtests

Testing:
- Validated tests with --do-dmabuf-reg-for-hmem flag and without it on p6-gb200 cluster
- All tests skipped with cuda command
- make -j install && python3 install/bin/runfabtests.py --expression "cuda" -vvv -p /home/nmazzill/libfabric/fabtests/install/bin/ --junit-xml ft_`git branch --show-current`_`git rev-parse HEAD`_all_junit.xml --nworkers 16 -b efa 10.0.123.149 10.0.121.190 | tee ft_`git branch --show-current`_`git rev-parse HEAD`_all_stdout
- All tests passed with this cuda command
- make -j install && python3 install/bin/runfabtests.py --expression "cuda" --do-dmabuf-reg-for-hmem -vvv -p /home/nmazzill/libfabric/fabtests/install/bin/ --junit-xml ft_`git branch --show-current`_`git rev-parse HEAD`_all_junit.xml --nworkers 16 -b efa 10.0.123.149 10.0.121.190 | tee ft_`git branch --show-current`_`git rev-parse HEAD`_all_stdout

Sim Issue:
- N/A

Signed-off-by: Nick Mazzilli <[email protected]>

shijin-aws · 2025-10-02T23:02:29Z

@j-xiong can u review the fabtests common code change, thanks

jiaxiyan · 2025-10-02T23:18:58Z

fabtests/pytest/efa/conftest.py

+    has_cuda_mark = any(mark.name == 'cuda_memory' for mark in request.node.iter_markers())
+
+    if has_cuda_mark:
+        print("Running CUDA validation")


Can you remove the print in this function? It will show up in all the test outputs, which is too verbose.

j-xiong

In addition to the comments below, I would suggest breaking this into two commits -- one for the common code changes and one for the efa test changes.

j-xiong · 2025-10-02T23:27:41Z

fabtests/common/check_cuda_dmabuf.c

+/* SPDX-License-Identifier: BSD-2-Clause OR GPL-2.0-only */
+/* SPDX-FileCopyrightText: Copyright Amazon.com, Inc. or its affiliates. All rights reserved. */


Please follow the convention of multiline comments here.

j-xiong · 2025-10-02T23:28:19Z

fabtests/common/check_cuda_dmabuf.c

@@ -0,0 +1,35 @@
+/* SPDX-License-Identifier: BSD-2-Clause OR GPL-2.0-only */
+/* SPDX-FileCopyrightText: Copyright Amazon.com, Inc. or its affiliates. All rights reserved. */
+#include <stdio.h>


Add an empty line before this.
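For illustration, the two requests above (a multiline header comment and an empty line before the first include) could look roughly like the following. This is only one reading of the review comments; the exact layout used elsewhere in fabtests may differ, and `ft_check_name` is a hypothetical placeholder added so the snippet compiles on its own.

```c
/*
 * SPDX-License-Identifier: BSD-2-Clause OR GPL-2.0-only
 * SPDX-FileCopyrightText: Copyright Amazon.com, Inc. or its affiliates. All rights reserved.
 */

#include <stdio.h>

/* Hypothetical placeholder, not part of the PR: returns the tool name. */
static const char *ft_check_name(void)
{
	return "check_cuda_dmabuf";
}
```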

j-xiong · 2025-10-02T23:32:49Z

fabtests/include/hmem.h

+	CUDA_MEMORY_SUPPORT__NOT_INITIALIZED = -1,
+    CUDA_MEMORY_SUPPORT__NOT_SUPPORTED = 0,
+    CUDA_MEMORY_SUPPORT__DMA_BUF_ONLY = 1,
+    CUDA_MEMORY_SUPPORT__GDR_ONLY= 2, 
+    CUDA_MEMORY_SUPPORT__DMABUF_GDR_BOTH  = 3,


Alignment is off, most likely due to spaces being used instead of tabs.

j-xiong · 2025-10-02T23:36:14Z

fabtests/include/hmem.h

+    CUDA_MEMORY_SUPPORT__DMA_BUF_ONLY = 1,
+    CUDA_MEMORY_SUPPORT__GDR_ONLY= 2, 
+    CUDA_MEMORY_SUPPORT__DMABUF_GDR_BOTH  = 3,
+} cuda_memory_support_e;


Don't use typedef here. Just define the enum, and use `enum enum_name var_name` to define variables.

I would use FT_CUDA_xxx instead of CUDA_xxx to avoid namespace confusion. Same for the lowercase names.
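A minimal sketch of what this suggestion might look like, combined with the tab-alignment comment above. The enum tag `ft_cuda_memory_support`, the exact enumerator spellings, and the variable `ft_cuda_support_state` are assumptions made for illustration, not names taken from the PR.

```c
/* Plain enum, no typedef; the FT_ prefix avoids clashing with CUDA's own names. */
enum ft_cuda_memory_support {
	FT_CUDA_MEMORY_SUPPORT_NOT_INITIALIZED = -1,
	FT_CUDA_MEMORY_SUPPORT_NOT_SUPPORTED = 0,
	FT_CUDA_MEMORY_SUPPORT_DMABUF_ONLY = 1,
	FT_CUDA_MEMORY_SUPPORT_GDR_ONLY = 2,
	FT_CUDA_MEMORY_SUPPORT_DMABUF_GDR_BOTH = 3,
};

/* Variables are declared with the "enum" keyword spelled out. */
static enum ft_cuda_memory_support ft_cuda_support_state =
	FT_CUDA_MEMORY_SUPPORT_NOT_INITIALIZED;
```

Callers would then declare locals the same way, for example `enum ft_cuda_memory_support support;`.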

j-xiong · 2025-10-02T23:38:33Z

fabtests/common/check_cuda_dmabuf.c

+        return CUDA_MEMORY_SUPPORT__NOT_SUPPORTED;
+    }
+
+    cuda_memory_support_e cuda_memory_support = dmabuf_viable_and_supported();


Why not call ft_cuda_memory_support() directly?

j-xiong · 2025-10-02T23:42:03Z

fabtests/common/hmem_cuda.c

+    if (!gdr_supported && !dmabuf_supported) {
+        cuda_memory_support = CUDA_MEMORY_SUPPORT__NOT_SUPPORTED;
+    } else if (gdr_supported && dmabuf_supported) {
+        cuda_memory_support = CUDA_MEMORY_SUPPORT__DMABUF_GDR_BOTH;
+    } else if (dmabuf_supported) {
+        cuda_memory_support = CUDA_MEMORY_SUPPORT__DMA_BUF_ONLY;
+    } else {
+        cuda_memory_support = CUDA_MEMORY_SUPPORT__GDR_ONLY;
+    }


Remove all the braces here.
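As a sketch, the same branch chain without braces might read as follows. It is wrapped in a hypothetical helper, `ft_cuda_classify_support`, so it can stand alone; the enum and its FT_-prefixed names are likewise assumptions, and the real code in hmem_cuda.c assigns to an existing variable rather than returning from a helper.

```c
#include <stdbool.h>

/* Assumed enum, for illustration only. */
enum ft_cuda_memory_support {
	FT_CUDA_MEMORY_SUPPORT_NOT_SUPPORTED = 0,
	FT_CUDA_MEMORY_SUPPORT_DMABUF_ONLY = 1,
	FT_CUDA_MEMORY_SUPPORT_GDR_ONLY = 2,
	FT_CUDA_MEMORY_SUPPORT_DMABUF_GDR_BOTH = 3,
};

/* Hypothetical helper: maps the two capability flags to one enum value. */
static enum ft_cuda_memory_support
ft_cuda_classify_support(bool gdr_supported, bool dmabuf_supported)
{
	enum ft_cuda_memory_support cuda_memory_support;

	/* Single-statement branches, so no braces are needed. */
	if (!gdr_supported && !dmabuf_supported)
		cuda_memory_support = FT_CUDA_MEMORY_SUPPORT_NOT_SUPPORTED;
	else if (gdr_supported && dmabuf_supported)
		cuda_memory_support = FT_CUDA_MEMORY_SUPPORT_DMABUF_GDR_BOTH;
	else if (dmabuf_supported)
		cuda_memory_support = FT_CUDA_MEMORY_SUPPORT_DMABUF_ONLY;
	else
		cuda_memory_support = FT_CUDA_MEMORY_SUPPORT_GDR_ONLY;

	return cuda_memory_support;
}
```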

darrylabbate requested a review from a team September 25, 2025 17:47

nmazzilli3 mentioned this pull request Sep 25, 2025

fabtests/efa: cuda dmabuf validation logic #11437

Closed

darrylabbate reviewed Sep 25, 2025

darrylabbate reviewed Sep 25, 2025

nmazzilli3 force-pushed the SubspaceTT-1927 branch from c128ea1 to 14a4dea September 25, 2025 20:39

sunkuamzn requested changes Sep 26, 2025

nmazzilli3 force-pushed the SubspaceTT-1927 branch from 14a4dea to 394485f September 29, 2025 19:46

nmazzilli3 requested review from sunkuamzn and darrylabbate September 29, 2025 19:47

nmazzilli3 force-pushed the SubspaceTT-1927 branch from 394485f to 0ac312e September 29, 2025 19:51

shijin-aws reviewed Oct 1, 2025

nmazzilli3 force-pushed the SubspaceTT-1927 branch from 0ac312e to 2672cef October 2, 2025 15:27

shijin-aws approved these changes Oct 2, 2025

shijin-aws requested a review from j-xiong October 2, 2025 23:02

jiaxiyan reviewed Oct 2, 2025

j-xiong reviewed Oct 2, 2025
