@nmazzilli3 nmazzilli3 commented Sep 23, 2025

test: Adding cuda dmabuf validation logic

Problem:

  • Users were running fabtests without the --do-dmabuf-reg-for-hmem flag

Solution:

  • Added hmem cuda logic changes based on the NVIDIA GPUDirect RDMA manual (https://docs.nvidia.com/cuda/gpudirect-rdma/)
  • Added check_dmabuf to init cuda internals and get dmabuf support information
  • Added conftest logic to run these checks when a user specifies a cuda command in fabtests
  • Updated fabtest logger statements to include timestamp
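
The code itself is not shown in this thread, so the following is only a minimal sketch of how such a check could be structured. The function `cuda_skip_reason`, its parameters, the logger name, and the log format are all hypothetical. It assumes a CUDA test is skipped only when the device reports dmabuf support and the flag was not passed, which matches the skip behavior described under Testing.

```python
import logging
from typing import Optional

# Timestamped log format. The exact format used by the PR is not shown here,
# so this one is an assumption.
logging.basicConfig(
    format="%(asctime)s %(levelname)s %(message)s",
    level=logging.INFO,
)
logger = logging.getLogger("fabtests.sketch")  # hypothetical logger name


def cuda_skip_reason(
    test_cmd: str, dmabuf_supported: bool, dmabuf_flag_set: bool
) -> Optional[str]:
    """Return a skip reason for a CUDA test run without the dmabuf flag, else None."""
    if "cuda" not in test_cmd:
        return None  # non-CUDA tests are unaffected
    if not dmabuf_supported:
        return None  # assumption: nothing to enforce without dmabuf support
    if dmabuf_flag_set:
        return None  # flag was passed, so the test can run
    reason = "CUDA device supports dmabuf; rerun with --do-dmabuf-reg-for-hmem"
    logger.warning("skipping %r: %s", test_cmd, reason)
    return reason
```

In a pytest conftest, a non-None result from a helper like this could be handed to `pytest.skip`, so the affected tests are reported as skipped with a timestamped explanation in the log.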

Testing:

  • Validated tests with --do-dmabuf-reg-for-hmem flag and without it on p6-gb200 cluster
  • All tests were skipped with this cuda command (no --do-dmabuf-reg-for-hmem flag)
  • make -j install && python3 install/bin/runfabtests.py --expression "cuda" -vvv -p /home/nmazzill/libfabric/fabtests/install/bin/ --junit-xml ft_`git branch --show-current`_`git rev-parse HEAD`_all_junit.xml --nworkers 16 -b efa 10.0.123.149 10.0.121.190 | tee ft_`git branch --show-current`_`git rev-parse HEAD`_all_stdout
  • All tests passed with this cuda command (with the --do-dmabuf-reg-for-hmem flag)
  • make -j install && python3 install/bin/runfabtests.py --expression "cuda" --do-dmabuf-reg-for-hmem -vvv -p /home/nmazzill/libfabric/fabtests/install/bin/ --junit-xml ft_`git branch --show-current`_`git rev-parse HEAD`_all_junit.xml --nworkers 16 -b efa 10.0.123.149 10.0.121.190 | tee ft_`git branch --show-current`_`git rev-parse HEAD`_all_stdout
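
The output file names in the two commands are built from the current branch and commit via shell command substitution. As a rough sketch of the same naming scheme in Python, assuming it runs inside a git checkout, the helper names `git_output` and `artifact_names` below are hypothetical and not part of fabtests:

```python
import subprocess
from typing import Tuple


def git_output(*args: str) -> str:
    """Run a git subcommand and return its trimmed stdout."""
    return subprocess.check_output(["git", *args], text=True).strip()


def artifact_names(branch: str, commit: str) -> Tuple[str, str]:
    """Build the JUnit XML and stdout file names used in the commands above."""
    prefix = f"ft_{branch}_{commit}_all"
    return f"{prefix}_junit.xml", f"{prefix}_stdout"


if __name__ == "__main__":
    # Same values the shell commands obtain via backtick substitution.
    branch = git_output("branch", "--show-current")
    commit = git_output("rev-parse", "HEAD")
    print(artifact_names(branch, commit))
```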

@nmazzilli3 nmazzilli3 changed the title from "Adding cuda dmabuf validation logic" to "fabtests/cuda dmabuf validation logic" Sep 24, 2025
@nmazzilli3 nmazzilli3 changed the title from "fabtests/cuda dmabuf validation logic" to "fabtests/efa: cuda dmabuf validation logic" Sep 24, 2025
@nmazzilli3 nmazzilli3 force-pushed the SubspaceTT-1927 branch 2 times, most recently from f369265 to 7f1aaee Compare September 24, 2025 15:42
Problem:
  - Users were running fabtests without the --do-dmabuf-reg-for-hmem flag

Solution:
  - Added hmem cuda logic changes based on the NVIDIA GPUDirect RDMA manual (https://docs.nvidia.com/cuda/gpudirect-rdma/)
  - Added check_dmabuf to init cuda internals and get dmabuf support information
  - Added conftest to validate these checks if a user specifies cuda command in fabtests
  - Updated fabtest logger statements to include timestamp

Testing:
  - Validated tests with --do-dmabuf-reg-for-hmem flag and without it on p6-gb200 cluster
  - All tests skipped with cuda command
  - make -j install && python3 install/bin/runfabtests.py --expression "cuda" -vvv -p /home/nmazzill/libfabric/fabtests/install/bin/ --junit-xml ft_`git branch --show-current`_`git rev-parse HEAD`_all_junit.xml --nworkers 16 -b efa 10.0.123.149 10.0.121.190 | tee ft_`git branch --show-current`_`git rev-parse HEAD`_all_stdout
  - All tests passed with this cuda command
  - make -j install && python3 install/bin/runfabtests.py --expression "cuda" --do-dmabuf-reg-for-hmem -vvv -p /home/nmazzill/libfabric/fabtests/install/bin/ --junit-xml ft_`git branch --show-current`_`git rev-parse HEAD`_all_junit.xml --nworkers 16 -b efa 10.0.123.149 10.0.121.190 | tee ft_`git branch --show-current`_`git rev-parse HEAD`_all_stdout

Sim Issue:
  - https://t.corp.amazon.com/V1823415044

Signed-off-by: Nick Mazzilli <[email protected]>
Member

@darrylabbate darrylabbate left a comment

Synced with you offline already, but leaving a note here for others: make sure the internal Amazon link is removed from the commit message before merging.

@jiaxiyan
Contributor

We always separate the common code and EFA-specific code into different commits. This commit should be split into at least three commits based on your bullet points. Also, remove the testing section from the commit message.

@nmazzilli3 nmazzilli3 closed this Sep 25, 2025
@nmazzilli3
Author

> We always separate the common code and efa specific code into different commits. This commit should be split into at least three commits based on your bullet points. Also remove the testing section in the commit message.

I appreciate the feedback. However, I see this change as a single cohesive fix addressing one specific issue: helping users who run fabtests on CUDA-based instances (particularly gb200) without the required --do-dmabuf-reg-for-hmem flag. The changes work together to detect this situation and improve the logging for better troubleshooting. Could you help me understand which parts you feel should be separated into different commits, and whether this rule still applies when the change addresses a single issue?

@nmazzilli3
Author

> Synced with you offline already, but leaving a note here for others to make sure the internal Amazon link is removed from the commit message before merging

Resubmitted without sim ticket in conventional commit helper.

@nmazzilli3
Author

Reopening without the sim ticket information in #11443.
