MSC1708: .well-known support for server name resolution #1708

Merged
merged 12 commits into from
Jan 14, 2019
143 changes: 143 additions & 0 deletions proposals/1708-well-known-for-federation.md
@@ -0,0 +1,143 @@
# MSC1708: .well-known support for server name resolution

Currently, mapping from a server name to a hostname for federation is done via
`SRV` records. This presents two principal difficulties:

* SRV records are not widely used, and administrators may be unfamiliar with
them, and there may be other practical difficulties in their deployment such
as poor support from hosting providers. [^1]

* It is likely that we will soon require valid X.509 certificates on the
federation endpoint. It will then be necessary for the homeserver to present
a certificate which is valid for the server name. This presents difficulties
for hosted server offerings: BigCorp may be reluctant to hand over the
keys for `bigcorp.com` to the administrators of the `bigcorp.com` matrix
homeserver.

Here we propose to solve these problems by augmenting the current `SRV` record
lookup with a `.well-known` lookup.

## Proposal

For reference, the current [specification for resolving server
names](https://matrix.org/docs/spec/server_server/unstable.html#resolving-server-names)
is as follows:

* If the hostname is an IP literal, then that IP address should be used,
together with the given port number, or 8448 if no port is given.

* Otherwise, if the port is present, then an IP address is discovered by
looking up an AAAA or A record for the hostname, and the specified port is
used.

* If the hostname is not an IP literal and no port is given, the server is
discovered by first looking up a `_matrix._tcp` SRV record for the
hostname, which may give a hostname (to be looked up using AAAA or A queries)
and port. If the SRV record does not exist, then the server is discovered by
looking up an AAAA or A record on the hostname and taking the default
fallback port number of 8448.

Homeservers may use SRV records to load balance requests between multiple TLS
endpoints or to failover to another endpoint if an endpoint fails.

The first two points remain unchanged: if the server name is an IP literal, or
contains a port, then requests will be made directly as before.
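As a rough illustration of the first two rules, the check below decides whether a server name bypasses discovery entirely. It is a minimal sketch, not part of the proposal: the function name `connects_directly` is hypothetical, and it assumes IPv6 literals are always written in square brackets, as in the server name grammar.

```python
import ipaddress


def connects_directly(server_name: str) -> bool:
    """Return True if no SRV or .well-known lookup is needed (hypothetical helper)."""
    # Assumption: IPv6 literals are bracketed, e.g. "[::1]" or "[::1]:8448".
    if server_name.startswith("["):
        return True
    # Any remaining colon can only introduce an explicit port.
    if ":" in server_name:
        return True
    # No port given: check for a bare IPv4 literal.
    try:
        ipaddress.ip_address(server_name)
        return True
    except ValueError:
        return False
```

Only names for which this returns `False`, such as `example.com`, proceed to the SRV and `.well-known` steps described next.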

If the hostname is neither an IP literal, nor does it have an explicit port,
then the requesting server should continue to make an SRV lookup as before, and
use the result if one is found.

If *no* SRV result is found, the requesting server should make a `GET` request
to `https://<server_name>/.well-known/matrix/server`, with normal X.509
certificate validation. If the request fails in any way, then we fall back as
before to using port 8448 on the hostname.

Rationale: Falling back to port 8448 (rather than aborting the request) is
necessary to maintain compatibility with existing deployments, which may not
present valid certificates on port 443, or may return 4xx or 5xx errors.

If the GET request succeeds, it should result in a JSON response, with contents
structured as shown:

```json
{
"server": "<server>[:<port>]"
}
```

The `server` property should be a hostname or IP address, followed by an
optional port.

If the response cannot be parsed as JSON, or lacks a valid `server` property,
the request is considered to have failed, and no fallback to port 8448 takes
place.
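The difference between a failed request (fall back to port 8448) and a retrieved but unusable response (no fallback) can be sketched as follows. This is an illustrative outline only: `discover`, `WellKnownInvalid`, and the two injected callables are hypothetical names, and it assumes `fetch_well_known` raises `OSError` for any request failure (network error, certificate failure, or 4xx/5xx status).

```python
import json
from typing import Callable, Optional, Tuple

DEFAULT_PORT = 8448


class WellKnownInvalid(Exception):
    """The .well-known response was retrieved but is unusable; no fallback."""


def discover(
    server_name: str,
    srv_lookup: Callable[[str], Optional[Tuple[str, int]]],
    fetch_well_known: Callable[[str], str],
) -> Tuple[str, str]:
    """Return (source, "host[:port]") for a name with no port and no IP literal."""
    # An SRV result, if present, is used as before.
    srv = srv_lookup(server_name)
    if srv is not None:
        host, port = srv
        return "srv", f"{host}:{port}"
    # No SRV record: try .well-known; a failed request falls back to 8448.
    try:
        body = fetch_well_known(server_name)
    except OSError:
        return "fallback", f"{server_name}:{DEFAULT_PORT}"
    # The request succeeded: an unparseable body is an error, not a fallback.
    try:
        server = json.loads(body)["server"]
    except (ValueError, KeyError, TypeError):
        raise WellKnownInvalid("response is not JSON with a 'server' property") from None
    if not isinstance(server, str) or not server:
        raise WellKnownInvalid("'server' must be a non-empty string")
    return "well-known", server
```

Injecting the lookups as callables keeps the decision logic testable without DNS or HTTP access; a real implementation would supply its own resolver and HTTP client.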

Otherwise, the requesting server performs an `AAAA/A` lookup on the hostname
(if necessary), and connects to the resultant address and the specified
port. The port defaults to 8448, if unspecified.

(The formal grammar for the `server` property is identical to that of a [server
name](https://matrix.org/docs/spec/appendices.html#server-name).)
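For illustration, one way to extract the host and port from a `.well-known` response body is shown below. This is a simplified sketch under stated assumptions: `parse_well_known` and `split_host_port` are hypothetical names, IPv6 literals are assumed to be bracketed, and the host part is not checked against the full server name grammar.

```python
import json
from typing import Optional, Tuple

DEFAULT_PORT = 8448


def split_host_port(value: str) -> Tuple[str, int]:
    """Split "host[:port]" into (host, port), defaulting the port to 8448."""
    port_text: Optional[str] = None
    if value.startswith("["):
        # Bracketed IPv6 literal, e.g. "[::1]" or "[::1]:8448".
        end = value.find("]")
        if end == -1:
            raise ValueError("unterminated IPv6 literal")
        host, rest = value[: end + 1], value[end + 1 :]
        if rest:
            if not rest.startswith(":"):
                raise ValueError("unexpected text after IPv6 literal")
            port_text = rest[1:]
    else:
        host, sep, after = value.partition(":")
        if sep:
            port_text = after
    if not host:
        raise ValueError("empty host")
    if port_text is None:
        return host, DEFAULT_PORT
    if not port_text.isdigit() or not 0 < int(port_text) <= 65535:
        raise ValueError("invalid port")
    return host, int(port_text)


def parse_well_known(body: str) -> Tuple[str, int]:
    """Parse a response body; raises ValueError if it is unusable."""
    data = json.loads(body)  # json.JSONDecodeError is a subclass of ValueError
    if not isinstance(data, dict) or not isinstance(data.get("server"), str):
        raise ValueError("missing or non-string 'server' property")
    return split_host_port(data["server"])
```

For example, `parse_well_known('{"server": "matrix.example.com"}')` yields the host `matrix.example.com` with the default port 8448.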

### Caching

Servers should not look up the `.well-known` file for every request, as this
would impose an unacceptable overhead on both sides. Instead, the results of
the `.well-known` request should be cached according to the HTTP response
headers, as per [RFC7234](https://tools.ietf.org/html/rfc7234). If the response
does not include an explicit expiry time, the requesting server should use a
sensible default: 24 hours is suggested.

Because there is no way to request a revalidation, it is also recommended that
requesting servers cap the expiry time. 48 hours is suggested.

Similarly, a failure to retrieve the `.well-known` file should be cached for
a reasonable period. 24 hours is suggested again.
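The suggested lifetimes can be combined into a single calculation, sketched below. This is a deliberately simplified illustration rather than a full RFC 7234 implementation: it only honours `Cache-Control: max-age` (ignoring `Expires`, `no-store`, and similar directives), and `cache_lifetime` is a hypothetical name.

```python
import re
from typing import Mapping, Optional

DEFAULT_TTL = 24 * 3600  # used when the response gives no explicit expiry
MAX_TTL = 48 * 3600      # cap on any server-supplied expiry
FAILURE_TTL = 24 * 3600  # how long to remember a failed retrieval


def cache_lifetime(headers: Optional[Mapping[str, str]]) -> int:
    """Return the cache lifetime in seconds; pass None for a failed request."""
    if headers is None:
        return FAILURE_TTL
    # Header names are case-insensitive, so normalise before looking one up.
    cache_control = ""
    for name, value in headers.items():
        if name.lower() == "cache-control":
            cache_control = value
    match = re.search(r"\bmax-age\s*=\s*(\d+)", cache_control, re.IGNORECASE)
    ttl = int(match.group(1)) if match else DEFAULT_TTL
    return min(ttl, MAX_TTL)
```

With these constants, a response advertising `max-age=604800` (one week) is cached for only 48 hours, while a response with no caching headers is cached for 24 hours.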

### The future of SRV records

To be clear, this proposal maintains support for SRV records for the immediate
future; there are no current plans to deprecate them.

However, clearly a `.well-known` file can provide much of the functionality of
an SRV record, and having to support both may be undesirable. Accordingly, we
may consider sunsetting SRV record support at some point in the future.

### Outstanding questions

Should we follow 30x redirects for the `.well-known` file? On the one hand,
there is no obvious use case and they add complexity (for example: how do they
interact with caches?). On the other hand, we'll presumably be using an HTTP
client library to handle some of the caching, and redirects might prove useful.

## Problems

It will take a while for `.well-known` to be supported across the ecosystem;
until it is, it will be difficult to deploy homeservers which rely on it for
their routing: if Alice is using a current homeserver implementation, and Bob
deploys a new implementation which relies on `.well-known` for routing, then
Alice will be unable to send messages to Bob. (This is the same problem we have with
[SNI](https://github.com/matrix-org/synapse/issues/1491#issuecomment-415153428).)

The main defence against this seems to be to release support for `.well-known`
as soon as possible, to maximise uptake in the ecosystem. It is likely that, as
we approach Matrix 1.0, there will be sufficient other new features (such as
new Room versions) that upgrading will be necessary anyway.


## Security considerations

The `.well-known` file potentially broadens the attack surface for an attacker
wishing to intercept federation traffic to a particular server.

## Conclusion

This proposal adds a new mechanism, alongside the existing `SRV` record lookup,
for finding the server responsible for a particular Matrix server_name, which
will allow greater flexibility in deploying homeservers.


[^1]: For example, Cloudflare automatically "flattens" SRV record responses.