@amaanq amaanq commented Oct 7, 2025

Problem

The DNS client's default timeout for SRV record lookups is 60 seconds. That is quite long, and it is also the default HTTP client timeout, so when a lookup hangs the error surfaced is the HTTP timeout rather than the underlying DNS error.

Solution

I've added a more aggressive timeout of 15 seconds total (with retries after 1, 3, 3, 3, and 5 seconds) for the SRV lookup. I've also re-labeled the `defer.TimeoutError`, as was done in the prior Synapse PR, which makes the error clearer to users.
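
To make the arithmetic concrete, here is a minimal sketch of how the per-attempt timeouts add up. The constant names and the `attempt_deadlines` helper are assumptions made for illustration and are not code from this PR.

```python
from itertools import accumulate

# Per-attempt timeouts in seconds, taken from the PR description.
# The constant name is an assumption made for this sketch.
SRV_LOOKUP_TIMEOUTS = (1, 3, 3, 3, 5)

# Default HTTP client timeout in seconds, as stated in the PR description.
HTTP_CLIENT_TIMEOUT = 60


def attempt_deadlines(timeouts):
    """Return the elapsed time (in seconds) at which each attempt gives up."""
    return list(accumulate(timeouts))


deadlines = attempt_deadlines(SRV_LOOKUP_TIMEOUTS)
total = deadlines[-1]

# The whole retry sequence ends before the HTTP timeout fires,
# so a hanging lookup can be reported as a DNS timeout.
print(deadlines, total, total < HTTP_CLIENT_TIMEOUT)
```

With these values the last attempt gives up 15 seconds in, leaving a wide margin before the 60-second HTTP timeout.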

Pull Request Checklist

  • Pull request is based on the develop branch
  • Pull request includes a changelog file. The entry should:
    • Be a short description of your change which makes sense to users. "Fixed a bug that prevented receiving messages from other servers." instead of "Moved X method from EventStore to EventWorkerStore.".
    • Use markdown where necessary, mostly for code blocks.
    • End with either a period (.) or an exclamation mark (!).
    • Start with a capital letter.
    • Feel free to credit yourself, by adding a sentence "Contributed by @github_username." or "Contributed by [Your Name]." to the end of the entry.
  • Code style is correct (run the linters)

@amaanq amaanq requested a review from a team as a code owner October 7, 2025 19:34

CLAassistant commented Oct 7, 2025

CLA assistant check
All committers have signed the CLA.

try:
answers, _, _ = await make_deferred_yieldable(
self._dns_client.lookupService(service_name)
self._dns_client.lookupService(service_name, timeout=(1, 1, 2, 4, 2))
Contributor

We should include all of the juicy context for why we're doing this in the comments. I've already written this out in #19026 (comment)

Author

Done, thank you very much! Do you want me to add you as a co-author?

Contributor

I appreciate the offer! It's normal to help out contributors and Element pays me so we can skip.

It may be reasonable to add @ShadowJonathan as a co-author, as we cribbed their TimeoutError change from matrix-org/synapse#9776.

@amaanq amaanq force-pushed the srv-timeout branch 2 times, most recently from 28b9258 to 5059cd5 Compare October 7, 2025 23:05
Comment on lines +173 to +174
# TODO: In the future, we could consider removing this extra
# time if we don't see complaints. For comparison, The Windows
Contributor

Suggested change
# TODO: In the future, we could consider removing this extra
# time if we don't see complaints. For comparison, The Windows
# TODO: In the future (after 2026-01-01), we could consider removing this extra
# time if we don't see complaints. For comparison, the Windows

My own comment but noticed one typo.

And I think it would be good to add a date for anyone stumbling upon this and wondering when/if we can make the change.

Comment on lines +1 to +3
Reduces the SRV DNS record lookup timeout to 15 seconds.
This fixes issues when DNS lookups hang, as the default timeout of 60 seconds matches the timeout for the federation request itself,
thus we see the HTTP request timeout and not the actual DNS error.
Contributor

I don't know how line wrapping works with the Towncrier changelog entries but I never wrap them so that's what I'd go with.

raise e
except defer.TimeoutError as e:
raise defer.TimeoutError(
f"Could not resolve DNS for SRV record {service_name!r} due to timeout (50s total)"
Contributor

Suggested change
f"Could not resolve DNS for SRV record {service_name!r} due to timeout (50s total)"
f"Timed out while trying to resolve DNS for SRV record for {service_name!r} (timeout=15s)"

Does this sound better?

Only real nit is this previously said 50s total vs our new 15s timeout. Ideally, we'd have a constant to use here but I'm not sure that moving timeout=(1, 3, 3, 3, 5) to the top as a constant is better. The comments probably read better in place where it's used.

Contributor

I'd say make it a constant; then it's easy to do the following:

Suggested change
f"Could not resolve DNS for SRV record {service_name!r} due to timeout (50s total)"
f"Timed out while trying to resolve DNS for SRV record for {service_name!r} (timeout={sum(LOOKUP_TIMEOUTS)}s)"
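
A minimal sketch of that constant-based approach, assuming a module-level `LOOKUP_TIMEOUTS` tuple as named in the suggestion. The `srv_timeout_message` helper is hypothetical and exists only so the message can be checked in isolation.

```python
# Assumed module-level constant; the values are the schedule discussed in this PR.
LOOKUP_TIMEOUTS = (1, 3, 3, 3, 5)


def srv_timeout_message(service_name: str) -> str:
    """Build the timeout message, deriving the total from the constant.

    Hypothetical helper for illustration; the PR builds the string inline.
    """
    return (
        "Timed out while trying to resolve DNS for SRV record for "
        f"{service_name!r} (timeout={sum(LOOKUP_TIMEOUTS)}s)"
    )


print(srv_timeout_message("_matrix._tcp.example.com"))
```

Deriving the total with `sum(LOOKUP_TIMEOUTS)` keeps the message in step with the schedule, so a stale figure like the earlier "50s total" can't reappear.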

return list(cache_entry)
else:
raise e
except defer.TimeoutError as e:
Contributor

It would be nice to have a test that stressed this part of the code. Especially since I'm unsure if defer.TimeoutError is actually the exception type raised here.

Do you think you would be up for that? Probably involves reactor.advance(15 + 1) to advance time past the timeout. Otherwise, I can take a stab at it.
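
For illustration, the rough shape of such a test might look like the sketch below. It uses a stdlib-only fake clock as a stand-in for the test reactor, and `FakeClock`, `SrvLookupTimeout`, and `start_hanging_lookup` are all hypothetical names. It does not exercise Twisted or Synapse, so it says nothing about whether `defer.TimeoutError` is the exception actually raised.

```python
# Assumed schedule, matching the timeouts discussed in this PR.
LOOKUP_TIMEOUTS = (1, 3, 3, 3, 5)


class FakeClock:
    """Stand-in for a test reactor: time only moves when advance() is called."""

    def __init__(self):
        self.now = 0.0
        self._pending = []  # list of (due_time, callback)

    def call_later(self, delay, callback):
        self._pending.append((self.now + delay, callback))

    def advance(self, seconds):
        self.now += seconds
        due = [entry for entry in self._pending if entry[0] <= self.now]
        self._pending = [entry for entry in self._pending if entry[0] > self.now]
        for _, callback in sorted(due, key=lambda entry: entry[0]):
            callback()


class SrvLookupTimeout(Exception):
    """Hypothetical stand-in for the timeout error raised by the resolver."""


def start_hanging_lookup(clock, service_name, outcome):
    """Simulate a lookup that never gets an answer and fails after the schedule."""

    def give_up():
        outcome["error"] = SrvLookupTimeout(
            "Timed out while trying to resolve DNS for SRV record for "
            f"{service_name!r} (timeout={sum(LOOKUP_TIMEOUTS)}s)"
        )

    clock.call_later(sum(LOOKUP_TIMEOUTS), give_up)


def test_lookup_times_out():
    clock = FakeClock()
    outcome = {}
    start_hanging_lookup(clock, "_matrix._tcp.example.com", outcome)

    # Just short of the total timeout: nothing has failed yet.
    clock.advance(sum(LOOKUP_TIMEOUTS) - 1)
    assert "error" not in outcome

    # Step past the timeout: the failure is now recorded.
    clock.advance(2)
    assert isinstance(outcome["error"], SrvLookupTimeout)


test_lookup_times_out()
```

A real test would drive the actual resolver with the project's test reactor and assert on the exception type it raises.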

Comment on lines +1 to +3
Reduces the SRV DNS record lookup timeout to 15 seconds.
This fixes issues when DNS lookups hang, as the default timeout of 60 seconds matches the timeout for the federation request itself,
thus we see the HTTP request timeout and not the actual DNS error.
Contributor

Suggested change
Reduces the SRV DNS record lookup timeout to 15 seconds.
This fixes issues when DNS lookups hang, as the default timeout of 60 seconds matches the timeout for the federation request itself,
thus we see the HTTP request timeout and not the actual DNS error.
Shorten DNS resolver timeout/retry sequence from 60s to 15s to ensure DNS failures are visible before federation HTTP request timeouts.

Contributor

@ShadowJonathan ShadowJonathan left a comment

15 seconds may be short, but I think that if a server isn't responding in between those multi-second retries, it's having other issues anyway.

LGTM

Successfully merging this pull request may close these issues.

synapse.http.federation.srv_resolver.SrvResolver.resolve_service isn't able to "timeout" properly, and thus stalls federation
4 participants