@tatarintsevsv

Treat addr parameter as string array (char**)
Fix fi_pingpong for FI_ADDR_STR address format
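
The calling convention named in the commit title can be illustrated with a small standalone sketch. It does not call libfabric: `count_str_addrs` and `insert_single` are hypothetical stand-ins, not functions from the patch. It only shows the idea that, for string addresses, `addr` is read as an array of C strings (`char **`), so even a single name is wrapped in a one-element array rather than passed as a bare `char *`.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for an insert routine that receives string
 * addresses. `addr` is interpreted as an array of `count` C strings
 * (char **), not as one flat character buffer.
 * Returns how many entries are non-NULL and non-empty. */
size_t count_str_addrs(const void *addr, size_t count)
{
    const char *const *names = addr;   /* view addr as an array of strings */
    size_t valid = 0;

    assert(addr != NULL);
    for (size_t i = 0; i < count; i++) {
        if (names[i] != NULL && names[i][0] != '\0')
            valid++;
    }
    return valid;
}

/* Caller side: wrap a single name in a one-element array so the
 * callee can index it as names[0]. */
size_t insert_single(const char *name)
{
    const char *one[1] = { name };
    return count_str_addrs(one, 1);
}
```

If a caller passed the string itself while the callee indexed `addr` as `char **`, the callee would read the characters as if they were a pointer, which is the kind of mismatch that can end in a segmentation fault.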

@zachdworkin
Contributor

CI failures.
Fabtests (with and without HMEM): Run the runfabtests.sh script with the shm provider to find the failures
Oneccl + shm (every benchmark): KILLED BY SIGNAL: 11 (Segmentation fault)

@tatarintsevsv
Author

Thanks for the comment. I completely forgot about the tests. I'll fix it soon.

@shijin-aws
Contributor

bot:aws:retest

@shijin-aws
Contributor

A similar error occurred on AWS CI when running the OSU benchmarks on a single node (using shm):

v2021.14-mpi/run/pt2pt/osu_bibw/node1-ppn2.txt
INFO     root:utils.py:621 mpirun output:
[0] MPI startup(): Intel(R) MPI Library, Version 2021.14  Build 20240911 (id: b3fc682)
[0] MPI startup(): Copyright (C) 2003-2024 Intel Corporation.  All rights reserved.
[0] MPI startup(): library kind: release
[0] MPI startup(): libfabric version: 2.3.0a1
[0] MPI startup(): libfabric provider: efa

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   RANK 0 PID 54430 RUNNING AT 172.31.20.148
=   KILLED BY SIGNAL: 11 (Segmentation fault)
===================================================================================

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   RANK 1 PID 54431 RUNNING AT 172.31.20.148
=   KILLED BY SIGNAL: 11 (Segmentation fault)
===================================================================================

@zachdworkin
Contributor

Here is more of the CI failure log to help you debug. This is from the oneccl sycl shm tests:
2025:08:26-20:19:03:(25367) |CCL_WARN| Endpoint name truncated from 64 to 23 bytes due to FI_NAME_MAX limit. This might cause collisions if names differ only in the truncated portion

@tatarintsevsv
Author

I fixed fabtests, but one of them currently fails with a timeout (even without my patch).
When running fi_rdm g00n13s -p "shm" I get:
libfabric:23660:1756441591::shm:av:smr_map_to_region():376<warn> shm_open error: name g00n13s:9228 errno 2
I think the server side must be started with the -s option set. Something like:
(fi_rdm -s g00n13s -p "shm" &) && timeout 120 fi_rdm g00n13s -p "shm"
But I'm not sure how to fix the test scripts.

@zachdworkin
Contributor

Which test script are you using? They should be configured to run correctly.

The way most fabtests work is by doing
server: executable -s nodename-interface
client: executable -s nodename-interface server_nodename-interface

Does it work if you run like this?

@aingerson
Contributor

@tatarintsevsv That looks like one of the negative tests, which shm doesn't support because of its non-hardware-based addressing. For shm, we exclude the negative tests that are expected to fail. For runfabtests.sh, use the -N argument to skip these tests.

@shijin-aws
Contributor

Just to mention, the AWS CI is still failing with the same segfault in the MPI test.

@tatarintsevsv
Author

> Which test script are you using? They should be configured to run correctly.
>
> The way most fabtests work is by doing
> server: executable -s nodename-interface
> client: executable -s nodename-interface server_nodename-interface
>
> Does it work if you run like this?

I ran ./runfabtests.sh shm, and fi_rdm was run as:

$ ps ax| grep fi_
  13454 pts/4    S      0:00 timeout 120 fi_rdm g00n13s -p shm
  13455 pts/4    R      0:09 fi_rdm g00n13s -p shm

but the server side must be run as fi_rdm -s g00n13s -p shm, so the test failed with a timeout

@tatarintsevsv
Author

tatarintsevsv commented Aug 31, 2025

> Just to mention, the AWS CI is still failing with the same segfault in the MPI test.

As far as I can see, the EFA provider uses SHM EPs for some tasks and also has to pass addresses to fi_av_insert() as (char**).
Can you retest EFA with the attached patch?
efa-shm-av_insert.patch
(I'm not sure, but I think this patch should be in a separate commit with a "prov/efa" comment)

@tatarintsevsv
Author

> @tatarintsevsv That looks like one of the negative tests, which shm doesn't support because of its non-hardware-based addressing. For shm, we exclude the negative tests that are expected to fail. For runfabtests.sh, use the -N argument to skip these tests.

OK, I'll add this test to shm.exclude.
But I think it would be better to fix the launch of this test for prov/shm. As I wrote above, the test will succeed if we pass the -s option to the server side.

Treat addr parameter as string array (char**)
Fix fabtests and fi_pingpong for FI_ADDR_STR address format

Signed-off-by: Sergey Tatarintsev <[email protected]>
@zachdworkin
Contributor

> I ran ./runfabtests.sh shm, and fi_rdm was run as:
>
> $ ps ax| grep fi_
>   13454 pts/4    S      0:00 timeout 120 fi_rdm g00n13s -p shm
>   13455 pts/4    R      0:09 fi_rdm g00n13s -p shm
>
> but the server side must be run as fi_rdm -s g00n13s -p shm, so the test failed with a timeout

Run like this:

[zdworkin@n1 bin]$ ./fi_rdm -p shm -s g00n13s
Waiting for message from client...
Data check OK
Received data from client: Hello from Client!

[zdworkin@n1 bin]$ ./fi_rdm -p shm -s g00n13s g00n13s
Sending message...
Send completion received

I think Alexia's comment will fix your issue with runfabtests failing since that test is part of the negative tests. However, if you want to run it by hand you will need to give the client command the server name as the non-flagged argument.

@aingerson
Contributor

@tatarintsevsv It's not really possible to fix this case for shm, unfortunately.
First of all, the negative test in runfabtests is a one-sided test (i.e., you don't have a client, just a server), and the whole point of the test is that it's supposed to fail. The problem is that shm can pass with that fake address, so the fact that you can make shm run is the issue.
The test you're talking about checks the provider's ability to detect a fake address. Because "g00n13s" is not a real address, we expect the server to see that address and fail to bind to it locally because it doesn't exist. The issue is that shm's addresses are all made-up strings, so shm treats every string as an address that could come up at any time. Instead of failing the address resolution, shm just keeps trying to connect to that address, thinking the peer just hasn't started yet.
There's not a great way to fix it and it's also just not really worth it because detecting an incorrect IP address isn't really something we care about for shm.
You can go ahead and add "g00n13s" to the exclude file though
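
The `errno 2` in the `shm_open error` warning quoted earlier in this thread is ENOENT: the named shared-memory object does not exist. The sketch below is not taken from prov/shm. `probe_shm_region` is a hypothetical helper that only illustrates why that result is ambiguous: a probe cannot tell a peer that has not started yet from a name that will never exist.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: try to open an existing POSIX shared-memory
 * object read-only. Returns 0 if the object exists, otherwise the
 * errno value reported by shm_open(). */
int probe_shm_region(const char *name)
{
    assert(name != NULL);

    int fd = shm_open(name, O_RDONLY, 0);
    if (fd < 0)
        return errno;   /* ENOENT (2) when nothing with that name exists */

    close(fd);
    return 0;
}
```

A caller that keeps retrying on ENOENT waits forever for a name nobody creates, which matches the timeout described above. Older glibc versions may need `-lrt` when linking.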

@tatarintsevsv
Author

Thanks for explaining the meaning of the test. I have already added this test to the exclude file.

@tatarintsevsv
Author

Added a commit to fix the prov/efa segfault on address insertion on shm EPs.

@shijin-aws
Contributor

The latest push still doesn't fix the crash in the MPI runs.

@tatarintsevsv
Author

Can you provide more details about this crash (a trace or something)?
Unfortunately, I can't test prov/efa on my own.
