- Feature Name: TLS client certificate revocation
- Status: accepted
- Start Date: 2020-06-24
- Authors: knz, ben
- RFC PR: [#50602](https://github.com/cockroachdb/cockroach/pull/50602)
- Cockroach Issue: [#29641](https://github.com/cockroachdb/cockroach/issues/29641)

# Summary

This RFC proposes to introduce mechanisms to revoke TLS client
certificates when used by CockroachDB for authentication.

The motivation is to enable revoking certificates earlier than their
expiration date, in case a certificate was compromised.

This is also required for various compliance checklists.

The technical proposal has multiple components:

- it extends the TLS certificate validation code to support checking
  the cert against
  [OCSP](#online-certificate-status-protocol-ocsp) (explained below).

- it introduces code to fetch OCSP responses from the
  network and cache them.

- it introduces a cluster setting `security.ocsp.mode` which controls
  the strictness of the certificate validation.

- it reports on the status of OCSP responses in the
  `/_status/certificates` endpoint.

- it introduces a cluster setting to control the expiration of
  the cache: `server.certificate_revocation_cache.refresh_interval`.

- it introduces an API endpoint `/_status/certificates/cached_revocations`,
  which produces a report on all currently cached OCSP responses upon
  HTTP GET requests.

  (In a later phase, this will also be available via a SQL built-in
  function, pending
  [#51454](https://github.com/cockroachdb/cockroach/issues/51454) or
  similar work.)

- the same API endpoint also supports HTTP POST requests to manually
  force a refresh.

**Note: as of August 2020, [PR
#53218](https://github.com/cockroachdb/cockroach/pull/53218)
implements an MVP of the checking logic. However, it does not implement
caching as described in this RFC. The caching remains to be done.**

# Motivation

See the Summary above.

# Background

## Online Certificate Status Protocol (OCSP)

OCSP is a network protocol designed to facilitate online validation of TLS certificates.
It performs the same role as CRLs but is intended to be lightweight in comparison.

It works as follows:

- upon observing a TLS cert for validation, the service extracts the cert's serial number.

- the service sends the cert's serial number to an OCSP server. The
  server's URL is known ahead of time (configured separately) or
  embedded in the client/CA certs themselves under the
  `authorityInfoAccess` field.

- the OCSP server internally checks the cert's validity against CRLs etc.

- the OCSP server returns a response to the service with status
  `good`, `revoked` or `unknown`. The response itself is signed using
  a valid CA. The service must verify the signature in the OCSP
  response.

Validating a cert via OCSP is typically computationally cheaper than
doing so via CRLs. The OCSP server typically caches responses across
multiple requests.

Nevertheless, OCSP incurs a mandatory network round-trip upon every
verification. In CockroachDB it would be unreasonable to query OCSP
upon every incoming client connection. Therefore, for our purpose OCSP
does not obviate the need for a service-side cache.
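
To make the round trip concrete, here is a minimal sketch in Go of a
single OCSP query using the `golang.org/x/crypto/ocsp` package. The
`checkOCSP` helper, its error handling, and the use of the first listed
responder URL are illustrative assumptions, not the actual CockroachDB
code:

```go
package revocation

import (
	"bytes"
	"crypto/x509"
	"fmt"
	"io"
	"net/http"

	"golang.org/x/crypto/ocsp"
)

// checkOCSP asks the responder listed in the cert's authorityInfoAccess
// field for the revocation status of cert, then verifies the response
// signature against the issuing CA.
func checkOCSP(cert, issuer *x509.Certificate) (*ocsp.Response, error) {
	if len(cert.OCSPServer) == 0 {
		return nil, fmt.Errorf("no OCSP server in authorityInfoAccess")
	}
	// The request identifies the cert by (hashed) issuer and serial number.
	reqDER, err := ocsp.CreateRequest(cert, issuer, nil)
	if err != nil {
		return nil, err
	}
	httpResp, err := http.Post(
		cert.OCSPServer[0], "application/ocsp-request", bytes.NewReader(reqDER))
	if err != nil {
		return nil, err
	}
	defer httpResp.Body.Close()
	respDER, err := io.ReadAll(httpResp.Body)
	if err != nil {
		return nil, err
	}
	// ParseResponse checks the responder's signature using the issuer cert.
	return ocsp.ParseResponse(respDER, issuer)
}
```

The returned `ocsp.Response` has a `Status` of `ocsp.Good`,
`ocsp.Revoked` or `ocsp.Unknown`, matching the three statuses above.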

References:

- https://www.ssl.com/faqs/faq-digital-certificate-revocation/
- https://jamielinux.com/docs/openssl-certificate-authority/online-certificate-status-protocol.html
- https://en.wikipedia.org/wiki/Online_Certificate_Status_Protocol

# Guide-level explanation

After a cluster has been set up, a security event may occur that
requires the operator to revoke a client TLS certificate. This event
could be a compromise (i.e. the certificate lands in the wrong hands) or
the deployment of corporate IT / compliance processes that require a
revocation capability in all networked services.

For this, CockroachDB offers the ability to check OCSP servers.
Technically, OCSP is a network service that can validate certificates
remotely.

## Upon starting a CockroachDB node

No additional command-line flags are needed.

The TLS certificates are assumed to contain an `authorityInfoAccess`
field that points to their OCSP server.

## Manually revoking a certificate

To revoke a certificate, the operator should proceed as follows:

- add the revocation information to the OCSP database in the OCSP
  service configured via the `authorityInfoAccess` field in the certs.

- finally, force a refresh of the CockroachDB cache: invoke
  `crdb_internal.refresh_certificate_revocation()` in SQL or send an
  HTTP POST to `/_status/certificates/cached_revocations` (see the
  sketch below).
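
For illustration, the HTTP variant of the second step could be scripted
as follows; the node address is a placeholder, and TLS/authentication
plumbing is elided:

```go
package main

import (
	"fmt"
	"net/http"
)

func main() {
	// Placeholder address for a node's HTTP endpoint; authentication
	// and TLS configuration are omitted for brevity.
	const adminURL = "https://localhost:8080"
	resp, err := http.Post(
		adminURL+"/_status/certificates/cached_revocations",
		"application/json", nil)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Println("refresh request status:", resp.Status)
}
```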

The manual refresh is not required if the revocation is not urgent:
the cache is refreshed periodically in the background. The maximum
time between refreshes is configured via the cluster setting
`server.certificate_revocation_cache.refresh_interval`, which defaults
to 24 hours.

Remark from Ben:

> OCSP responses include an optional `NextUpdate` timestamp field which
> defines the validity period of the response. If this is set, we may
> want to use it to set the cache expiration time. We should ask the
> customer how long they want the cache to be valid for (24h seems
> long to me; I'd expect a default more like an hour since this is
> opt-in for users who care about revocation) and whether they use
> this field.

## Checking the status of revocations

To inspect a CockroachDB node's opinion of revocations, use the
`/_status/certificates/cached_revocations` API endpoint.

# Reference-level explanation

## Detailed design

Each node implements a certificate verification broker. Each
verification of a TLS cert for authn is changed to go through this new
broker component. An error from the broker causes the cert verification to fail.

The broker implements a cache internally. The cache is refreshed
periodically at the frequency set by the new setting
`server.certificate_revocation_cache.refresh_interval`. The refresh
protocol is explained further below.
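
A rough sketch of the broker's cache state, under the assumption that
each cached answer is keyed by a (responder URL, cert serial) pair; the
names below are illustrative, not the actual implementation:

```go
package revocation

import "sync"

type revocationStatus int

const (
	statusGood revocationStatus = iota
	statusRevoked
)

// cacheKey identifies one cached OCSP answer: which responder was
// asked about which certificate.
type cacheKey struct {
	ocspURL string
	serial  string // cert serial number, in string form
}

// certBroker is the per-node verification broker; all TLS cert
// verifications for authn go through it.
type certBroker struct {
	mu    sync.Mutex
	cache map[cacheKey]revocationStatus
}

func newCertBroker() *certBroker {
	return &certBroker{cache: make(map[cacheKey]revocationStatus)}
}

// lookup returns the cached status for (url, serial), if any.
func (b *certBroker) lookup(url, serial string) (revocationStatus, bool) {
	b.mu.Lock()
	defer b.mu.Unlock()
	s, ok := b.cache[cacheKey{url, serial}]
	return s, ok
}

// store records a definite answer (good or revoked) in the cache.
func (b *certBroker) store(url, serial string, s revocationStatus) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.cache[cacheKey{url, serial}] = s
}
```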

## Changes to the HTTP endpoints

The Admin endpoint `/_admin/v1/certificate_revocation` maps to an RPC which
supports both GET and POST requests.

The GET version produces a representation of the cache. This supports
a "node ids" list argument like we have for other RPCs. When provided
the "local" ID, it reports on the local cache only. When provided no
ID, it reports on the entire cluster. When provided a specific ID, it
reports on the cache of that node. This logic uses the node iteration
code `status.iterateNodes()` that is already implemented; see
`EnqueueRange()` for an example.

The POST request forces a cluster-wide cache refresh. This is
explained below.

## Changes to SQL

A new built-in function
`crdb_internal.refresh_certificate_revocations()` also forces a
cluster-wide refresh of the cache. See below for details.

The name of the built-in remains to be
refined later, pending further investigation of
[#51454](https://github.com/cockroachdb/cockroach/issues/51454).

## Cluster-wide trigger for cache refreshes

A refresh of the revocation cache can be triggered from any
node. However, it's possible that a refresh is triggered from one node
while another node is disconnected/down/partitioned away. We want
manual refresh requests not to get lost, especially when
`server.certificate_revocation_cache.refresh_interval` is configured
to a large interval (default 24 hours).

To achieve this, we define a new system config key `LastCertificateRevocationRefreshRequestTimestamp`.

Upon triggering the cache refresh, the node where the refresh was
triggered writes the current time to this key. Gossip then propagates
the update to all other nodes. Eventually all nodes learn of the
refresh request.

Concurrently, an async process on every node watches this config
key. Every time its value moves past the time of the last refresh on
that node, an extra refresh is triggered.
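
A sketch of that per-node watcher, with a channel standing in for
gossip updates of the config key (the real implementation would hook
into the system config gossip callbacks):

```go
package revocation

import "time"

// watchRefreshRequests triggers an extra cache refresh whenever the
// gossiped request timestamp moves past this node's last refresh.
// requests stands in for updates to the
// LastCertificateRevocationRefreshRequestTimestamp key.
func watchRefreshRequests(requests <-chan time.Time, refresh func()) {
	var lastRefresh time.Time
	for ts := range requests {
		if ts.After(lastRefresh) {
			refresh()
			lastRefresh = ts
		}
	}
}
```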

Ben's remark:

> If the cache is short-lived, we may be able to avoid creating a
> system to manually force a refresh (or make it testing-only and it
> doesn't need to handle disconnected nodes).

## Cache refresh process

For OCSP cache entries, all entries in the cache with status `good`
are flushed, so that a new OCSP request will be forced upon the next
use of the TLS certs. All entries with status `revoked` are
preserved: any already-revoked cert is considered to remain revoked.
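
Continuing the `certBroker` sketch above, the flush could look like
this (illustrative):

```go
// flushGoodEntries drops all `good` answers so the next verification
// re-queries the OCSP server; `revoked` answers are kept, since an
// already-revoked cert is considered to remain revoked.
func (b *certBroker) flushGoodEntries() {
	b.mu.Lock()
	defer b.mu.Unlock()
	for k, s := range b.cache {
		if s == statusGood {
			delete(b.cache, k)
		}
	}
}
```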

Refreshes and errors are logged.

## Cache queries during TLS cert verification

When a TLS cert is to be checked, the code asks the broker for confirmation.

The broker first checks that the certs are properly signed by their
CA. If that fails, the verification fails.

The broker then inspects the `security.ocsp.mode` cluster setting. If
set to `off`, then no further logic is applied.

Otherwise, the broker inspects the certificate and its CA chain
to collect all the OCSP reference URLs. For every cert in the chain
where one of the parent CAs has an OCSP URL, the code checks the OCSP
response cache for that (URL, cert serial) pair. If there is an entry
with `good` or `revoked` already in the cache, that is used directly.

If there is no cached entry yet, the OCSP server is queried. The response
from OCSP is then analyzed:

- if the OCSP response is badly signed, then the response is
  ignored. The cert validation fails.
- if the OCSP response is `revoked`, the cert validation
  fails.
- if the OCSP response is `good`, the cert validation succeeds.
- if the OCSP response is `unknown`, or there is an error, then the behavior
  depends on the new cluster setting `security.ocsp.mode`:

  - if `strict`, then cert validation fails.
  - if `lax`, then cert validation succeeds.

If the response was `revoked` or `good` (and properly signed), an
entry is added to the cache for the (URL, serial) pair.
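
In code, the decision could reduce to something like the following,
where `strict` reflects `security.ocsp.mode = strict` and a badly
signed response is assumed to have already been rejected (failing
validation outright, as described above). This is a sketch, not the
actual implementation:

```go
package revocation

import (
	"fmt"

	"golang.org/x/crypto/ocsp"
)

// verdict maps an OCSP status (or a query error) to a validation
// outcome; a nil return means the revocation check passes.
func verdict(status int, queryErr error, strict bool) error {
	switch {
	case queryErr != nil:
		// Could not reach the responder (or another query error).
		if strict {
			return fmt.Errorf("OCSP check failed: %v", queryErr)
		}
		return nil // lax mode tolerates an unavailable responder
	case status == ocsp.Good:
		return nil
	case status == ocsp.Revoked:
		return fmt.Errorf("certificate is revoked")
	default: // ocsp.Unknown
		if strict {
			return fmt.Errorf("OCSP status unknown")
		}
		return nil
	}
}
```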

## Drawbacks

None known.

## Rationale and Alternatives

The following designs have been considered:

### Using CRLs instead of OCSP

#### Background about CRLs

A CRL is a list of X509 certs that mark other certs as "revoked", i.e. invalid.

Technically, a *revocation cert* is a cert signed by a recognized CA,
which certifies that another cert, identified by serial
number/fingerprint, has been revoked as of a specific date.

The certificate validation code should obey the revocation lists and
refuse to validate/authenticate services using certs that have a
currently known revocation cert in a CRL.

In a service like CockroachDB, authn certs are presented by the client
upon establishing new connections; whereas CRLs are configured
server-side and fed into the server on a periodic basis.

In practice, CRLs are fed using two mechanisms:

- *external*: the operator has one or more revocation certs as
  separate files, or a combined file containing multiple certs. Then
  the operator "uploads" the CRL into the service. This should be done
  in at least two ways:

  - upon start-up, to load an initial CRL before the various other
    sub-systems in the service are fired up.

  - periodically, to refresh the CRL with new revocations while the
    service is running. This can be done by "pull" (the service uses
    e.g. HTTP to fetch a CRL over the network) or "push" (the operator
    invokes an API in the server to upload the CRL into it).

- *internal*: each CA certificate can contain a field called
  `CRLDistributionPoints`. This field is a list of URLs that point to
  CRLs related to that CA.

  Services that support `CRLDistributionPoints` should fetch the CRLs
  prior to validating certs signed by that CA.

A particular pitfall/challenge with this field is that there may be
multiple intermediate CAs, each with its own
`CRLDistributionPoints`. Some of the CA certificates may be provided
only during the TLS connection by the client, as part of the TLS
client cert payload. So the CRL distribution points cannot generally
be known "ahead of time" when a server starts up.
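
For reference, checking a leaf cert against a single CRL is
straightforward with Go's standard library `x509.RevocationList` API
(`ParseRevocationList` is Go 1.19+, `RevokedCertificateEntries` is Go
1.21+); a minimal sketch:

```go
package revocation

import "crypto/x509"

// isRevoked reports whether leaf appears in the given DER-encoded CRL.
// The CRL's own signature is verified against the issuing CA first.
func isRevoked(leaf, issuer *x509.Certificate, crlDER []byte) (bool, error) {
	crl, err := x509.ParseRevocationList(crlDER)
	if err != nil {
		return false, err
	}
	// A CRL is only trustworthy if signed by the CA that issued the certs.
	if err := crl.CheckSignatureFrom(issuer); err != nil {
		return false, err
	}
	for _, entry := range crl.RevokedCertificateEntries {
		if entry.SerialNumber.Cmp(leaf.SerialNumber) == 0 {
			return true, nil
		}
	}
	return false, nil
}
```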

References:

- https://www.pixelstech.net/article/1445498968-Generate-certificate-with-cRLDistributionPoints-extension-using-OpenSSL
- http://pkiglobe.org/crl_dist_points.html

### Solution outline

- There would be a command-line flag to read the initial CRL from a network
  service or the filesystem.

  The `--cert-revocation-list` flag is provided
  with a URL to a location that provides the revocation certs. This can
  either be a path to a local file containing the CRL, or an HTTP URL to
  an external service. This is optional if the CA certs are known to
  list their CRL URLs themselves.

  If the CRL is provided externally as a collection of discrete files,
  they can be combined together into a single file via an `openssl`
  command.

- To revoke a cert, an operator would add the revocation cert to the
  list of certs. This can be either a local file (when
  `--cert-revocation-list` points to a local file), or a CRL server
  (when `--cert-revocation-list` points to a URL, or when using the
  `CRLDistributionPoints` field in CA certs).

- There would be a new SQL built-in function which an operator can use
  to refresh the CRLs from the network or the configured CRL local
  file: `crdb_internal.refresh_certificate_revocation()`.

- The API endpoint `/_status/certificates/cached_revocations` would return
  cached CRL entries.

- The cert validation would work as follows.

  The broker first checks that the certs are properly signed by their
  CA. If that fails, the verification fails.

  The broker then inspects the certificate and its CA chain to collect
  all the CRL distribution URLs, and merges that with the URL configured
  for `--cert-revocation-list`.

  For each URL in this list it checks if it has an entry in the URL ->
  timestamp map. If it does not (URL not known yet), or if the timestamp
  is older than the configured refresh interval, it fetches that URL
  synchronously and updates the cache with the results. If there is an
  error, the URL -> timestamp map is not updated and the TLS cert
  validation fails.

  If a revocation cert is found in the CRL for either the leaf cert or
  any of its CA certs in the chain, TLS cert validation also fails.

- The background cache refresh process would work as follows: it would
  periodically re-load the file / URL configured via
  `--cert-revocation-list`. New revocation certs are added to the
  cache. Existing revocation certs are left alone.

  There is a separate CRL refresh timestamp maintained by the cache
  for each CRL URL (a map URL -> timestamp). For each URL the timestamp
  is bumped forward, but only if there was no error during the CRL
  refresh. If there was an error, the refresh timestamp for that URL is
  not updated, so that the async task that monitors refreshes tries that
  URL again soon.
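
A sketch of this per-URL bookkeeping; the `crlRefresher` name and the
`fetch` callback are illustrative assumptions:

```go
package revocation

import (
	"sync"
	"time"
)

// crlRefresher tracks, per CRL URL, when that URL was last fetched
// successfully. Errors intentionally leave the timestamp untouched so
// that the URL is retried soon.
type crlRefresher struct {
	mu          sync.Mutex
	lastRefresh map[string]time.Time // CRL URL -> last successful fetch
}

// maybeRefresh fetches url if it is unknown or stale. fetch is
// expected to merge new revocation certs into the cache.
func (r *crlRefresher) maybeRefresh(
	url string, interval time.Duration, fetch func(string) error,
) error {
	r.mu.Lock()
	last, known := r.lastRefresh[url]
	r.mu.Unlock()
	if known && time.Since(last) < interval {
		return nil // still fresh
	}
	if err := fetch(url); err != nil {
		// Do not bump the timestamp: the monitoring task retries soon.
		return err
	}
	r.mu.Lock()
	r.lastRefresh[url] = time.Now()
	r.mu.Unlock()
	return nil
}
```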

Rejected idea: store the CRL in a table and query it upon every authn request.

- The CRL becomes unavailable if the cluster is in a sad state.
- It causes KV lookup overhead and a hotspot upon connections.

## Unresolved questions

N/A