prov/efa: allocate peer map entry pool during the rdm ep create #11468
Commit "4016fd5" moved the peer map entry pool from the AV to the endpoint but deferred the pool memory allocation until the first request for a remote peer. As a result, the actual pool memory allocation happened on the first request, adding latency overhead to the initial requests. This commit fixes the latency issue by eagerly allocating the pool during endpoint creation. Signed-off-by: Sunita Nadampalli <[email protected]>
|
The CI failure doesn't seem to be caused by the PR change. @shijin-aws, can you please check and comment? |
|
bot:aws:retest |
|
Something is wrong with the NCCL installation; I need to check further. |
|
bot:aws:retest |
I think all of the TX buffer pools are lazily initialized today. I am curious why this one is special.
|
Do you know why all those pools are lazily initialized? |
|
No, I don't know. But I can see that it is not good to lazily initialize pools that must be allocated anyway; otherwise the first transmission will be badly delayed. |
|
peer_map_entry_pool is also used during CQ read, so the delay could even be in the RX path |
|
The peer map entry pool is special because of the earlier commit 4016fd5. Before that commit, the peer entry pool lived in the AV and was grown on the first AV insertion, i.e. on the slow path. Commit 4016fd5 moved that growth into a fast-path call, so this change moves it back to the slow path, which I think is a reasonable revert. Merging. |
Commit "4016fd5" moved the peer map entry pool from the AV to the endpoint but deferred the pool memory allocation until the first request for a remote peer. As a result, the actual memory pool allocation happened during the connection establishment request, adding latency overhead to the communication.
This commit fixes the latency issue by eagerly allocating the pool during endpoint creation.
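The difference between lazy and eager pool allocation described above can be sketched in plain C. This is only an illustration under stated assumptions: `toy_pool`, `toy_pool_grow`, `toy_ep_create` and the sizes used are hypothetical names and values invented for this sketch. They are not the libfabric buffer pool API and not the code changed by this PR.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical toy pool, for illustration only (not libfabric code). */
struct toy_pool {
    size_t entry_size;   /* bytes per entry */
    size_t chunk_count;  /* entries in the single backing chunk */
    char *mem;           /* backing memory, NULL until the pool is grown */
    size_t used;         /* entries handed out so far */
    int grow_calls;      /* number of times backing memory was allocated */
};

/* Record the pool geometry without allocating any memory. */
void toy_pool_create(struct toy_pool *p, size_t entry_size, size_t chunk_count)
{
    assert(entry_size > 0 && chunk_count > 0);
    p->entry_size = entry_size;
    p->chunk_count = chunk_count;
    p->mem = NULL;
    p->used = 0;
    p->grow_calls = 0;
}

/* Allocate the backing memory if it is missing; 0 on success, -1 on failure. */
int toy_pool_grow(struct toy_pool *p)
{
    if (p->mem)
        return 0;
    p->mem = calloc(p->chunk_count, p->entry_size);
    if (!p->mem)
        return -1;
    p->grow_calls++;
    return 0;
}

/* Hand out one entry. If the pool was never grown, the allocation cost
 * is paid here, on what would be the fast path. */
void *toy_pool_alloc(struct toy_pool *p)
{
    if (!p->mem && toy_pool_grow(p) != 0)
        return NULL;
    if (p->used == p->chunk_count)
        return NULL; /* toy pool: a single chunk only */
    return p->mem + (p->used++ * p->entry_size);
}

/* Release the backing memory. */
void toy_pool_destroy(struct toy_pool *p)
{
    free(p->mem);
    p->mem = NULL;
}

/* Hypothetical endpoint setup: with eager != 0 the pool is grown here,
 * on the slow path, so the first toy_pool_alloc() does not allocate. */
int toy_ep_create(struct toy_pool *p, int eager)
{
    toy_pool_create(p, 64, 16); /* sizes are arbitrary assumptions */
    return eager ? toy_pool_grow(p) : 0;
}
```

In the lazy case, the first `toy_pool_alloc()` call triggers the `calloc`. In the eager case, the memory already exists when the first entry is requested, which mirrors the intent of moving the allocation into endpoint creation.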