Skip to content

Conversation

smb-z
Copy link

@smb-z smb-z commented Jun 2, 2025

Description

Although Agent.GatherCandidates() have a cancellable context, the turn.Client.Allocate() processing does not use it and can hang for about 7.8 seconds if the TURN server is unreachable.
This happens because Allocate() completes only after the ALLOCATE request retry limit is reached.
Ultimately, PeerConnection.Close() will block while waiting for unnecessary allocation to complete.

PR fixes this behavior by cancelling ALLOCATE phase if context was cancelled.

Caveat: If cancelled, Client.Allocate will nevertheless return error ("transaction closed") which is logged as warning here.

Copy link

codecov bot commented Jun 2, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 78.61%. Comparing base (753c2a0) to head (d9f9d1d).

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #781      +/-   ##
==========================================
+ Coverage   78.38%   78.61%   +0.22%     
==========================================
  Files          41       41              
  Lines        5441     5448       +7     
==========================================
+ Hits         4265     4283      +18     
+ Misses        945      934      -11     
  Partials      231      231              
Flag Coverage Δ
go 78.61% <100.00%> (+0.22%) ⬆️
wasm 27.05% <0.00%> (-0.04%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

This fixes waiting for TURN Allocation to complete when Agent is
closing/restarting.
@smb-z smb-z force-pushed the turn_allocate_cancel branch from 1357c1e to d9f9d1d Compare June 2, 2025 13:48
@smb-z smb-z marked this pull request as ready for review June 2, 2025 13:55
@smb-z
Copy link
Author

smb-z commented Jun 2, 2025

@JoeTurki Hi! Would you mind having a look at this PR?

@aalekseevx aalekseevx requested a review from JoeTurki June 2, 2025 14:11
@JoeTurki
Copy link
Member

JoeTurki commented Jun 3, 2025

@smb-z Hello, Thank you so much for making this change, it makes sense.
I only has a concern about a potential race condition. I'm not sure, I'll test it in a day, If it doesn't have issues, We'll merge it right away, Otherwise, I'll update your PR.

Thank you so much.

@JoeTurki JoeTurki self-assigned this Jun 3, 2025
@smb-z
Copy link
Author

smb-z commented Jun 3, 2025

Thanks, @JoeTurki!
It seems you're right, and cancelling client.Allocate() is more tricky than I first thought. Here are some race issues I've found:

  1. If the context is cancelled just after client.Allocate() successfully completes but before it returns, we'll end up in a state when err here is nil, but the client is closed by the goroutine. I wasn't able to find any significant problems with this, but it looks unreliable, as the code following the error check is expecting Allocate to complete successfully. Thanks, @aalekseevx for pointing this out!
  2. If the context is cancelled before client.Allocate(), it wouldn't have any effect, and we'll still have client.Allocate() blocked for 7.8s.

It looks like a better option is to pass the context to Allocate() and process context cancelling inside the TURN client itself.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

2 participants