prov/shm: Fix fi_av_insert() for FI_ADDR_STR address format #11336
|
CI failures. |
|
Thanks for the comment. I completely forgot about the tests. I'll fix it soon. |
|
bot:aws:retest |
|
A similar error occurred on AWS CI when running the OSU benchmarks on a single node (using shm). |
|
More CI failure logs to help you debug. These are from the oneCCL SYCL shm tests.
da1fb58 to 695a365
|
I fixed fabtests, but one of them currently fails with a timeout (even without my patch).
Which test script are you using? They should be configured to run correctly. The way most fabtests work is by doing Does it work if you run like this? |
|
@tatarintsevsv That looks like one of the negative tests which shm doesn't support because of the non-hardware based addressing. For shm, we exclude the negative tests which are expected to fail - for runfabtests.sh use the -N argument to skip these tests |
|
Just to mention, the AWS CI is still failing with the same segfault in the MPI test.
I'm running but the server side must be run as |
As far as I can see, the EFA provider uses SHM EPs for some tasks and also has to pass addresses to fi_av_insert() as (char**). |
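To make the (char**) point concrete, here is a minimal sketch of the two buffer layouts a provider might receive in the `addr` argument. It assumes that with FI_ADDR_STR the caller passes an array of string pointers, while other address formats use a flat buffer of fixed-size entries. The helper `nth_addr()`, its parameters, and the sample address strings are hypothetical and are not taken from libfabric or from this patch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper, not libfabric code: return a pointer to the i-th
 * address in a buffer handed to an fi_av_insert()-style call.
 *
 * addr_is_str != 0 models FI_ADDR_STR (assumption: addr is a char **).
 * addr_is_str == 0 models a flat buffer of entries, addrlen bytes each.
 */
static const char *nth_addr(const void *addr, size_t i,
                            int addr_is_str, size_t addrlen)
{
    assert(addr != NULL);

    if (addr_is_str) {
        /* Array of string pointers: index the pointer array. */
        return ((const char *const *) addr)[i];
    }

    /* Flat buffer: step over i fixed-size entries. */
    return (const char *) addr + i * addrlen;
}
```

Under this assumption a caller with `const char *names[] = { "fi_shm://peer_a", "fi_shm://peer_b" };` (made-up strings) would pass `names` itself, not `names[0]`. Reading a (char**) buffer as if it were flat would interpret pointer bytes as address characters, which is one way a segfault like the one discussed here could arise.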
Ok, I'll add this test to shm.exclude. |
Treat addr parameter as string array (char**)
Fix fabtests and fi_pingpong for FI_ADDR_STR address format
Signed-off-by: Sergey Tatarintsev <[email protected]>
695a365 to 16d0309
Run like this: [zdworkin@n1 bin]$ ./fi_rdm -p shm -s g00n13s
|
|
@tatarintsevsv It's not really possible to fix this case for shm, unfortunately. |
Thanks for explaining the meaning of the test. I have already added this test to shm.exclude.
Signed-off-by: Sergey Tatarintsev <[email protected]>
|
Added a commit to fix the prov/efa segfault on address insertion on shm EPs. |
|
The latest push still doesn't fix the crash in MPI runs.
Can you provide more details about this crash (a trace or something)? |
Treat addr parameter as string array (char**)
Fix fi_pingpong for FI_ADDR_STR address format