Skip to content

Conversation

@snadampal
Copy link
Contributor

@snadampal snadampal commented Oct 4, 2025

the commit "4016fd5" has moved the peer map entry pool from av to endpoint but deferred the pool memory allocation till the remote requests for a peer. So, the acutal memory pool allocation was happening during the connection establishment request and hence adding latency overhead to the communication.

This commit is fixing the latency issue by eagerly allocating the pool during the endpoint create.

the commit "4016fd5" has moved the peer map entry pool from av to
endpoint but deferred the pool memory allocation till the remote
requests for a peer. So, the acutal memory pool allocation was
happening on the first request and hence adding latency overhead
to the intial requests.

This commit is fixing the letency issue by eagerly allocating the pool
during the endpoint create.

Signed-off-by: Sunita Nadampalli <[email protected]>
@snadampal
Copy link
Contributor Author

snadampal commented Oct 5, 2025

the CI failure doesn't seem to be due to the PR change, @shijin-aws , can you please check and comment.

@shijin-aws
Copy link
Contributor

bot:aws:retest

@shijin-aws
Copy link
Contributor

Something is wrong with nccl installation, need to check further

@a-szegel
Copy link
Contributor

a-szegel commented Oct 7, 2025

bot:aws:retest

Copy link
Contributor

@shijin-aws shijin-aws left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think all of tx buffer pools today have a lazy initialization. I am curious why this one is special

@snadampal
Copy link
Contributor Author

Do you know why all those pools are lazy initialized?
It makes sense to delay the pools that get used only during overflows, unexpected or error message handling but not for others.

@shijin-aws
Copy link
Contributor

No I don't know. But I can see it is not good to lazy initialize some pools that must be allocated, otherwise the first transmission will be bad delayed

@shijin-aws shijin-aws requested a review from sunkuamzn October 7, 2025 18:07
@sunkuamzn
Copy link
Contributor

peer_map_entry_pool is also used during CQ read, so the delay could even be in the RX path

@shijin-aws shijin-aws merged commit c8590e9 into ofiwg:main Oct 7, 2025
13 checks passed
@shijin-aws
Copy link
Contributor

The peer map entry pool is special because the earlier commit 4016fd5 changed the behavior

so the peer entry pool was in AV, and it will be growed in the first av insertion aka the slow path

commit 4016fd5 makes it to the fast path call

so this change is moving it back to slow path, which I think it's a reasonable revert IMO.

Merging

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants