3 changes: 3 additions & 0 deletions changelog.d/19026.misc
@@ -0,0 +1,3 @@
Reduces the SRV DNS record lookup timeout to 15 seconds.
This fixes issues when DNS lookups hang, as the default timeout of 60 seconds matches the timeout for the federation request itself,
thus we see the HTTP request timeout and not the actual DNS error.
Comment on lines +1 to +3
Contributor
I don't know how line wrapping works with the Towncrier changelog entries but I never wrap them so that's what I'd go with.

Comment on lines +1 to +3
Contributor
Suggested change
Reduces the SRV DNS record lookup timeout to 15 seconds.
This fixes issues when DNS lookups hang, as the default timeout of 60 seconds matches the timeout for the federation request itself,
thus we see the HTTP request timeout and not the actual DNS error.
Shorten DNS resolver timeout/retry sequence from 60s to 15s to ensure DNS failures are visible before federation HTTP request timeouts.
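
To make the arithmetic behind "60s to 15s" concrete, here is a small sketch that adds up the two timeout sequences quoted in the diff below. The constant and function names are illustrative only and do not appear in the PR, and the sketch only sums the configured values; it does not model how Twisted schedules queries across multiple nameservers.

```python
from itertools import accumulate
from typing import List, Sequence

# Values quoted in the diff; the names are illustrative, not from the PR.
TWISTED_DEFAULT_TIMEOUTS = (1, 3, 11, 45)
NEW_TIMEOUTS = (1, 3, 3, 3, 5)
FEDERATION_REQUEST_TIMEOUT_S = 60  # as stated in the changelog entry


def elapsed_after_each_timeout(timeouts: Sequence[int]) -> List[int]:
    """Seconds elapsed when each timeout in the sequence expires.

    Every entry except the last marks a point where the query is reissued;
    the last entry is when the lookup is considered failed.
    """
    return list(accumulate(timeouts))


def total_timeout(timeouts: Sequence[int]) -> int:
    """Total seconds before the lookup gives up."""
    return sum(timeouts)


if __name__ == "__main__":
    for name, seq in (
        ("default", TWISTED_DEFAULT_TIMEOUTS),
        ("new", NEW_TIMEOUTS),
    ):
        print(name, elapsed_after_each_timeout(seq), total_timeout(seq))
```

With the default sequence the lookup gives up at the same moment as the 60s federation request, whereas the new sequence leaves room for the DNS error to surface first.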

37 changes: 36 additions & 1 deletion synapse/http/federation/srv_resolver.py
@@ -26,6 +26,7 @@

import attr

from twisted.internet import defer
from twisted.internet.error import ConnectError
from twisted.names import client, dns
from twisted.names.error import DNSNameError, DNSNotImplementedError, DomainError
@@ -145,7 +146,37 @@ async def resolve_service(self, service_name: bytes) -> List[Server]:

try:
answers, _, _ = await make_deferred_yieldable(
self._dns_client.lookupService(service_name)
self._dns_client.lookupService(
service_name,
# This is a sequence of ints that represent the "number of seconds
# after which to reissue the query. When the last timeout expires,
# the query is considered failed." The default value in Twisted is
# `timeout=(1, 3, 11, 45)` (60s total) which is an "arbitrary"
# exponential backoff sequence and is too long (see below).
#
# We want the total timeout to be below the overarching HTTP request
# timeout (60s for federation requests) that spurred on this lookup.
# This way, we can see the underlying DNS failure and move on
# instead of the user ending up with a generic HTTP request timeout.
#
# Since these DNS queries are done over UDP (unreliable transport),
# by its nature, it's bound to occasionally fail (dropped packets,
# etc). We want a list that starts small and re-issues DNS queries
# multiple times until we get a response or timeout.
timeout=(
1, # Quick retry for packet loss/scenarios
3, # Still reasonable for slow responders
3, # ...
3, # Already catching 99.9% of successful queries at 10s
# Final attempt for extreme edge cases.
#
# TODO: In the future, we could consider removing this extra
# time if we don't see complaints. For comparison, The Windows
Comment on lines +173 to +174
Contributor

Suggested change
# TODO: In the future, we could consider removing this extra
# time if we don't see complaints. For comparison, The Windows
# TODO: In the future (after 2026-01-01), we could consider removing this extra
# time if we don't see complaints. For comparison, the Windows

My own comment but noticed one typo.

And I think it would be good to add a date for anyone stumbling upon this and wondering when/if we can make the change.

# DNS resolver gives up after 10s using `(1, 1, 2, 4, 2)`, see
# https://learn.microsoft.com/en-us/troubleshoot/windows-server/networking/dns-client-resolution-timeouts
5,
),
)
)
except DNSNameError:
# TODO: cache this. We can get the SOA out of the exception, and use
@@ -165,6 +196,10 @@ async def resolve_service(self, service_name: bytes) -> List[Server]:
return list(cache_entry)
else:
raise e
except defer.TimeoutError as e:
Contributor

It would be nice to have a test that stressed this part of the code. Especially since I'm unsure if defer.TimeoutError is actually the exception type raised here.

Do you think you would be up for that? Probably involves reactor.advance(15 + 1) to advance time past the timeout. Otherwise, I can take a stab at it.
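
As a rough illustration of the shape such a test could take, here is a stdlib-only sketch. Everything in it is a stand-in: `FakeClock` loosely mimics a test reactor's `advance()`, `FakeLookup` mimics a lookup that reissues its query after each timeout, and `LookupTimedOut` stands in for whichever exception the real resolver raises. It does not exercise Twisted or `SrvResolver`, so it cannot answer the question about the actual exception type.

```python
from typing import Callable, List, Optional, Sequence, Tuple

LOOKUP_TIMEOUTS = (1, 3, 3, 3, 5)


class LookupTimedOut(Exception):
    """Stand-in for the timeout exception raised by the real resolver."""


class FakeClock:
    """Minimal manual clock (hypothetical; not Twisted's test reactor)."""

    def __init__(self) -> None:
        self.now = 0.0
        self._calls: List[Tuple[float, Callable[[], None]]] = []

    def call_later(self, delay: float, fn: Callable[[], None]) -> None:
        self._calls.append((self.now + delay, fn))

    def advance(self, seconds: float) -> None:
        """Run every call due within `seconds`, in time order."""
        deadline = self.now + seconds
        while True:
            due = [c for c in self._calls if c[0] <= deadline]
            if not due:
                break
            call = min(due, key=lambda c: c[0])
            self._calls.remove(call)
            self.now = call[0]
            call[1]()
        self.now = deadline


class FakeLookup:
    """Reissues a query after each timeout; fails when the last one expires."""

    def __init__(self, clock: FakeClock, timeouts: Sequence[int]) -> None:
        self.clock = clock
        self.total = sum(timeouts)
        self.remaining = list(timeouts)
        self.attempts = 0
        self.error: Optional[Exception] = None
        self._issue()

    def _issue(self) -> None:
        self.attempts += 1
        self.clock.call_later(self.remaining.pop(0), self._on_timeout)

    def _on_timeout(self) -> None:
        if self.remaining:
            self._issue()
        else:
            self.error = LookupTimedOut(f"timeout={self.total}s")


def run_timeout_scenario() -> FakeLookup:
    clock = FakeClock()
    lookup = FakeLookup(clock, LOOKUP_TIMEOUTS)
    clock.advance(14)  # just short of the 15s total: still pending
    assert lookup.error is None
    clock.advance(2)  # now past 15s: the lookup should have failed
    return lookup
```

A real test would replace the stand-ins with the project's test reactor and assert on the exception that `resolve_service` actually raises.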

raise defer.TimeoutError(
f"Could not resolve DNS for SRV record {service_name!r} due to timeout (50s total)"
Contributor

Suggested change
f"Could not resolve DNS for SRV record {service_name!r} due to timeout (50s total)"
f"Timed out while trying to resolve DNS for SRV record for {service_name!r} (timeout=15s)"

Does this sound better?

Only real nit is this previously said 50s total vs our new 15s timeout. Ideally, we'd have a constant to use here but I'm not sure that moving timeout=(1, 3, 3, 3, 5) to the top as a constant is better. The comments probably read better in place where it's used.

Contributor

I'd say to make it a constant, then it's easy to do the following:

Suggested change
f"Could not resolve DNS for SRV record {service_name!r} due to timeout (50s total)"
f"Timed out while trying to resolve DNS for SRV record for {service_name!r} (timeout={sum(LOOKUP_TIMEOUTS)}s)"
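
A minimal sketch of the constant-based approach, assuming `LOOKUP_TIMEOUTS` lives at module level. The helper `timeout_message` is hypothetical and only exists here to show that the number in the message is derived from the sequence instead of being hard-coded.

```python
from typing import Tuple

# Assumed module-level constant; the values are the sequence from this PR.
LOOKUP_TIMEOUTS: Tuple[int, ...] = (1, 3, 3, 3, 5)


def timeout_message(service_name: bytes) -> str:
    """Build the error text, deriving the total from LOOKUP_TIMEOUTS."""
    return (
        "Timed out while trying to resolve DNS for SRV record for "
        f"{service_name!r} (timeout={sum(LOOKUP_TIMEOUTS)}s)"
    )


if __name__ == "__main__":
    print(timeout_message(b"_matrix._tcp.example.com"))
```

The same constant could then be passed as `timeout=LOOKUP_TIMEOUTS` at the call site and referenced in the test assertions, so the message, the lookup, and the tests cannot drift apart.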

) from e

if (
len(answers) == 1
8 changes: 6 additions & 2 deletions tests/http/federation/test_srv_resolver.py
@@ -68,7 +68,9 @@ def do_lookup() -> Generator["Deferred[object]", object, List[Server]]:
test_d = do_lookup()
self.assertNoResult(test_d)

dns_client_mock.lookupService.assert_called_once_with(service_name)
dns_client_mock.lookupService.assert_called_once_with(
service_name, timeout=(1, 3, 3, 3, 5)
)

result_deferred.callback(([answer_srv], None, None))

@@ -98,7 +100,9 @@ def test_from_cache_expired_and_dns_fail(
servers: List[Server]
servers = yield defer.ensureDeferred(resolver.resolve_service(service_name)) # type: ignore[assignment]

dns_client_mock.lookupService.assert_called_once_with(service_name)
dns_client_mock.lookupService.assert_called_once_with(
service_name, timeout=(1, 3, 3, 3, 5)
)

self.assertEqual(len(servers), 1)
self.assertEqual(servers, cache[service_name])