-
Notifications
You must be signed in to change notification settings - Fork 444
fabtests/efa: cuda dmabuf validation logic #11437
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
bab0b40
to
7a3c2cf
Compare
f369265
to
7f1aaee
Compare
Problem: - Users was submitting fabtests without the --do-dmabuf-reg-for-hmem flag Solution: - Added hmem cuda logic changes based on nvidia manual [https://docs.nvidia.com/cuda/gpudirect-rdma/ here] - Added check_dmabuf to init cuda internals and get dmabuf support information - Added conftest to validate these checks if a user specifies cuda command in fabtests - Updated fabtest logger statements to include timestamp Testing: - Validated tests with --do-dmabuf-reg-for-hmem flag and without it on p6-gb200 cluster - All tests skipped with cuda command - make -j install && python3 install/bin/runfabtests.py --expression "cuda" -vvv -p /home/nmazzill/libfabric/fabtests/install/bin/ --junit-xml ft_`git branch --show-current`_`git rev-parse HEAD`_all_junit.xml --nworkers 16 -b efa 10.0.123.149 10.0.121.190 | tee ft_`git branch --show-current`_`git rev-parse HEAD`_all_stdout - All tests passed with this cuda command - make -j install && python3 install/bin/runfabtests.py --expression "cuda" --do-dmabuf-reg-for-hmem -vvv -p /home/nmazzill/libfabric/fabtests/install/bin/ --junit-xml ft_`git branch --show-current`_`git rev-parse HEAD`_all_junit.xml --nworkers 16 -b efa 10.0.123.149 10.0.121.190 | tee ft_`git branch --show-current`_`git rev-parse HEAD`_all_stdout Sim Issue: - https://t.corp.amazon.com/V1823415044 Signed-off-by: Nick Mazzilli <[email protected]>
7f1aaee
to
0c56fd5
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Synced with you offline already, but leaving a note here for others to make sure the internal Amazon link is removed from the commit message before merging
We always separate the common code and efa specific code into different commits. This commit should be split into at least three commits based on your bullet points. Also remove the testing section in the commit message. |
I appreciate the feedback. However, I see this change as a single cohesive fix addressing one specific issue: helping users who run fabtests on CUDA-based instances (particularly gb200) without the required --do-dmabuf-reg-for-hmem flag. The changes work together to detect this situation and improve the logging for better troubleshooting. Would you help me understand which aspects you feel would benefit from being separated into different commits and if this rule applies to a single issue. |
Resubmitted without sim ticket in conventional commit helper. |
Reopening in without sim ticket information |
test: Adding cuda dmabuf validation logic
Problem:
Solution:
Testing:
git branch --show-current
git rev-parse HEAD
all_junit.xml --nworkers 16 -b efa 10.0.123.149 10.0.121.190 | tee ftgit branch --show-current
git rev-parse HEAD
_all_stdoutgit branch --show-current
git rev-parse HEAD
all_junit.xml --nworkers 16 -b efa 10.0.123.149 10.0.121.190 | tee ftgit branch --show-current
git rev-parse HEAD
_all_stdout