-
Notifications
You must be signed in to change notification settings - Fork 395
srv_resolver: add 15s timeout to DNS lookups #19026
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: develop
Are you sure you want to change the base?
Conversation
try: | ||
answers, _, _ = await make_deferred_yieldable( | ||
self._dns_client.lookupService(service_name) | ||
self._dns_client.lookupService(service_name, timeout=(1, 1, 2, 4, 2)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should include all of the juicy context for why we're doing this in the comments. I've already written this out in #19026 (comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done, thank you very much! Do you want me to add you as a co-author?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I appreciate the offer! It's normal to help out contributors and Element pays me so we can skip.
May be reasonable to add @ShadowJonathan as a co-author as we cribbed their TimeoutError
change from matrix-org/synapse#9776
28b9258
to
5059cd5
Compare
# TODO: In the future, we could consider removing this extra | ||
# time if we don't see complaints. For comparison, The Windows |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
# TODO: In the future, we could consider removing this extra | |
# time if we don't see complaints. For comparison, The Windows | |
# TODO: In the future (after 2026-01-01), we could consider removing this extra | |
# time if we don't see complaints. For comparison, the Windows |
My own comment but noticed one typo.
And I think it would be good to add a date for anyone stumbling upon this and wondering when/if we can make the change.
Reduces the SRV DNS record lookup timeout to 15 seconds. | ||
This fixes issues when DNS lookups hang, as the default timeout of 60 seconds matches the timeout for the federation request itself, | ||
thus we see the HTTP request timeout and not the actual DNS error. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know how line wrapping works with the Towncrier changelog entries but I never wrap them so that's what I'd go with.
raise e | ||
except defer.TimeoutError as e: | ||
raise defer.TimeoutError( | ||
f"Could not resolve DNS for SRV record {service_name!r} due to timeout (50s total)" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
f"Could not resolve DNS for SRV record {service_name!r} due to timeout (50s total)" | |
f"Timed out while trying to resolve DNS for SRV record for {service_name!r} (timeout=15s)" |
Does this sound better?
Only real nit is this previously said 50s total
vs our new 15s
timeout. Ideally, we'd have a constant to use here but I'm not sure that moving timeout=(1, 3, 3, 3, 5)
to the top as a constant is better. The comments probably read better in place where it's used.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd say to make it a constant, then its easy to do the following:
f"Could not resolve DNS for SRV record {service_name!r} due to timeout (50s total)" | |
f"Timed out while trying to resolve DNS for SRV record for {service_name!r} (timeout={sum(LOOKUP_TIMEOUTS)}s)" |
return list(cache_entry) | ||
else: | ||
raise e | ||
except defer.TimeoutError as e: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be nice to have a test that stressed this part of the code. Especially since I'm unsure if defer.TimeoutError
is actually the exception type raised here.
Do you think you would be up for that? Probably involves reactor.advance(15 + 1)
to advance time past the timeout. Otherwise, I can take a stab at it.
Reduces the SRV DNS record lookup timeout to 15 seconds. | ||
This fixes issues when DNS lookups hang, as the default timeout of 60 seconds matches the timeout for the federation request itself, | ||
thus we see the HTTP request timeout and not the actual DNS error. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reduces the SRV DNS record lookup timeout to 15 seconds. | |
This fixes issues when DNS lookups hang, as the default timeout of 60 seconds matches the timeout for the federation request itself, | |
thus we see the HTTP request timeout and not the actual DNS error. | |
Shorten DNS resolver timeout/retry sequence from 60s to 15s to ensure DNS failures are visible before federation HTTP request timeouts. |
try: | ||
answers, _, _ = await make_deferred_yieldable( | ||
self._dns_client.lookupService(service_name) | ||
self._dns_client.lookupService(service_name, timeout=(1, 1, 2, 4, 2)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I appreciate the offer! It's normal to help out contributors and Element pays me so we can skip.
May be reasonable to add @ShadowJonathan as a co-author as we cribbed their TimeoutError
change from matrix-org/synapse#9776
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
15 seconds may be short, but I think that if a server isn't responding inbetween those multi-second retries, its having other issues anyways.
LGTM
synapse.http.federation.srv_resolver.SrvResolver.resolve_service
isn't able to "timeout" properly, and thus stalls federation #9774Problem
The default timeout when looking up SRV records for the dns client is 60 seconds, which is quite long and can be problematic as that's the value of the default http client timeout, thus the error surfaced is the http error and not the underlying DNS error.
Solution
I've added a more aggressive timeout of 15 seconds total (with retries after 1, 3, 3, 3, and 5 seconds) for the SRV lookup. I've also re-labeled the `defer.TimeoutError, as that was done in the prior synapse PR, and makes the error more clear to users.
Pull Request Checklist
EventStore
toEventWorkerStore
.".code blocks
.