From 33539e03791f5b71f6ea51b42c402de129b91f8d Mon Sep 17 00:00:00 2001
From: Raphael 'kena' Poss
Date: Wed, 24 Jun 2020 14:46:07 +0200
Subject: [PATCH] rfcs: new RFC on certificate revocation

Release note: None
---
 docs/RFCS/20200624_cert_revocation.md | 383 ++++++++++++++++++++++++++
 1 file changed, 383 insertions(+)
 create mode 100644 docs/RFCS/20200624_cert_revocation.md

diff --git a/docs/RFCS/20200624_cert_revocation.md b/docs/RFCS/20200624_cert_revocation.md
new file mode 100644
index 000000000000..3f05314dc9aa
--- /dev/null
+++ b/docs/RFCS/20200624_cert_revocation.md
@@ -0,0 +1,383 @@
- Feature Name: TLS client certificate revocation
- Status: accepted
- Start Date: 2020-06-24
- Authors: knz, ben
- RFC PR: [#50602](https://github.com/cockroachdb/cockroach/pull/50602)
- Cockroach Issue: [#29641](https://github.com/cockroachdb/cockroach/issues/29641)

# Summary

This RFC proposes to introduce mechanisms to revoke TLS client
certificates when used by CockroachDB for authentication.

The motivation is to enable revoking certificates earlier than their
expiration date, in case a certificate was compromised.

This is also required for various compliance checklists.

The technical proposal has multiple components:

- it extends the TLS certificate validation code to support checking
  the cert against
  [OCSP](#online-certificate-status-protocol-ocsp) (explained below).

- it introduces code to fetch OCSP responses from the
  network and cache them.

- it introduces a cluster setting `security.ocsp.mode` which controls
  the strictness of the certificate validation.

- it reports on the status of OCSP responses in the
  `/_status/certificates` endpoint.

- it introduces a cluster setting to control the expiration of
  the cache: `server.certificate_revocation_cache.refresh_interval`.

- it introduces an API endpoint `/_status/certificates/cached_revocations`
  which, upon HTTP GET requests, produces a report on all currently
  cached OCSP responses.

  (In a later phase, this will also be available via a SQL built-in
  function, pending
  [#51454](https://github.com/cockroachdb/cockroach/issues/51454) or
  similar work.)

- the same API endpoint also supports HTTP POST requests to manually
  force a refresh.

**Note: as of August 2020, [PR
#53218](https://github.com/cockroachdb/cockroach/pull/53218)
implements an MVP of the checking logic. However, it does not implement
caching as described in this RFC. The caching remains to be done.**

# Motivation

See above.

# Background

## Online Certificate Status Protocol (OCSP)

OCSP is a network protocol designed to facilitate online validation of
TLS certificates. It performs the same role as CRLs but is intended to
be lightweight in comparison.

It works as follows:

- upon observing a TLS cert for validation, the service extracts the
  cert's serial number.

- the service sends the cert's serial number to an OCSP server. The
  server's URL is known ahead of time (configured separately) or
  embedded in the client/CA certs themselves under the
  `authorityInfoAccess` field.

- the OCSP server internally checks the cert's validity against CRLs etc.

- the OCSP server returns a response to the service with status either
  `good`, `revoked` or `unknown`. The response itself is signed using
  a valid CA. The service must verify the signature in the OCSP response.
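For illustration, a single such lookup can be sketched in Go with the
`golang.org/x/crypto/ocsp` package (Go's `x509.Certificate` exposes the
`authorityInfoAccess` OCSP URLs as `OCSPServer`). The helper name and
error handling below are illustrative only, not the proposed
implementation:

```go
// Minimal sketch of one OCSP lookup. Not CockroachDB code.
package ocspexample

import (
	"bytes"
	"crypto/x509"
	"fmt"
	"io/ioutil"
	"net/http"

	"golang.org/x/crypto/ocsp"
)

func queryOCSP(clientCert, issuer *x509.Certificate) (*ocsp.Response, error) {
	if len(clientCert.OCSPServer) == 0 {
		return nil, fmt.Errorf("certificate does not declare an OCSP responder")
	}
	// Build the OCSP request for this cert's serial number.
	reqBytes, err := ocsp.CreateRequest(clientCert, issuer, nil)
	if err != nil {
		return nil, err
	}
	// Send it to the responder; OCSP requests are POSTed as DER.
	httpResp, err := http.Post(clientCert.OCSPServer[0],
		"application/ocsp-request", bytes.NewReader(reqBytes))
	if err != nil {
		return nil, err
	}
	defer httpResp.Body.Close()
	body, err := ioutil.ReadAll(httpResp.Body)
	if err != nil {
		return nil, err
	}
	// ParseResponse also verifies the responder's signature against the issuer.
	return ocsp.ParseResponse(body, issuer)
}
```

The parsed `Response.Status` is one of `ocsp.Good`, `ocsp.Revoked` or
`ocsp.Unknown`, matching the three statuses listed above.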
The cost to validate a cert using OCSP is typically lower
computationally than using CRLs. The OCSP server typically caches
responses across multiple requests.

Nevertheless, OCSP incurs a mandatory network round-trip upon every
verification. In CockroachDB it would be unreasonable to query OCSP
upon every incoming client connection. Therefore, for our purposes OCSP
does not obviate the need for a service-side cache.

References:

https://www.ssl.com/faqs/faq-digital-certificate-revocation/

https://jamielinux.com/docs/openssl-certificate-authority/online-certificate-status-protocol.html

https://en.wikipedia.org/wiki/Online_Certificate_Status_Protocol

# Guide-level explanation

After a cluster has been set up, a security event may occur that
requires the operator to revoke a client TLS certificate. This event
could be a compromise (i.e. the certificate lands in the wrong hands) or
the deployment of corporate IT / compliance processes that require a
revocation capability in all networked services.

For this, CockroachDB offers the ability to check OCSP servers.
Technically, OCSP is a network service that can validate certificates
remotely.

## Upon starting a CockroachDB node

No additional command-line flags are needed.

The TLS certificates are assumed to contain an `authorityInfoAccess`
field that points to their OCSP server.

## Manually revoking a certificate

To revoke a certificate, the operator should proceed as follows:

- add the revocation information to the OCSP database in the OCSP
  service configured via the `authorityInfoAccess` field in the certs.

- then force a refresh of the CockroachDB cache: invoke
  `crdb_internal.refresh_certificate_revocation()` in SQL or send an
  HTTP POST to `/_status/certificates/cached_revocations`.

The manual refresh is not required if the revocation is not urgent:
the cache is refreshed periodically in the background. The maximum
time between refreshes is configured via the cluster setting
`server.certificate_revocation_cache.refresh_interval`, which defaults
to 24 hours.

Remark from Ben:

> OCSP responses include an optional `NextUpdate` timestamp field which
> defines the validity period of the response. If this is set, we may
> want to use it to set the cache expiration time. We should ask the
> customer how long they want the cache to be valid for (24h seems
> long to me; I'd expect a default more like an hour since this is
> opt-in for users who care about revocation) and whether they use
> this field.

## Checking the status of revocations

To inspect a CockroachDB node's view of revocations, use the
`/_status/certificates/cached_revocations` API endpoint.

# Reference-level explanation

## Detailed design

Each node implements a certificate verification broker. Every
verification of a TLS cert for authn is changed to go through this new
broker component. An error from the broker causes the cert
verification to fail.

The broker implements a cache internally. The cache is refreshed
periodically at the frequency set by the new setting
`server.certificate_revocation_cache.refresh_interval`. The refresh
protocol is explained further below.
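As an illustration, the broker's cache could be shaped roughly as
follows, keyed by (OCSP responder URL, certificate serial number). All
type and field names are hypothetical; `flushGood` anticipates the
refresh behavior described under "Cache refresh process" below:

```go
// Hypothetical sketch of the verification broker's in-memory cache.
// None of these names are the actual implementation.
package certbroker

import (
	"math/big"
	"sync"
	"time"
)

type revocationStatus int

const (
	statusGood revocationStatus = iota
	statusRevoked
)

// cacheKey identifies one OCSP verdict: which responder was asked
// about which certificate serial number.
type cacheKey struct {
	ocspURL string
	serial  string // decimal form of the certificate serial number
}

type cacheEntry struct {
	status    revocationStatus
	fetchedAt time.Time
}

type revocationCache struct {
	mu      sync.Mutex
	entries map[cacheKey]cacheEntry
}

// lookup returns the cached verdict for the (URL, serial) pair, if any.
func (c *revocationCache) lookup(url string, serial *big.Int) (cacheEntry, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	e, ok := c.entries[cacheKey{ocspURL: url, serial: serial.String()}]
	return e, ok
}

// flushGood drops all `good` entries during a periodic refresh;
// `revoked` entries are kept, since a revoked cert stays revoked.
func (c *revocationCache) flushGood() {
	c.mu.Lock()
	defer c.mu.Unlock()
	for k, e := range c.entries {
		if e.status == statusGood {
			delete(c.entries, k)
		}
	}
}
```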
## Changes to the HTTP endpoints

The Admin endpoint `/_admin/v1/certificate_revocation` maps to an RPC
which supports both Get and Post requests.

The Get version produces a representation of the cache. This supports
a "node ids" list argument like we have for other RPCs. When provided
the "local" ID, it reports on the local cache only. When provided no
ID, it reports on the entire cluster. When provided a specific ID, it
reports the cache on that node. This logic uses the node iteration
code `status.iterateNodes()` that is already implemented; see
`EnqueueRange()` for an example.

The Post request forces a cluster-wide cache refresh. This is
explained below.

## Changes to SQL

A new built-in function
`crdb_internal.refresh_certificate_revocations()` also forces a
cluster-wide refresh of the cache. See below for details.

The name of the built-in remains to be
refined later, pending further investigation of
[#51454](https://github.com/cockroachdb/cockroach/issues/51454).

## Cluster-wide trigger for cache refreshes

A refresh of the revocation cache can be triggered from any
node. However, it is possible for a refresh to be triggered from one
node while another node is disconnected/down/partitioned away. Manual
refresh requests must not get lost, especially when
`server.certificate_revocation_cache.refresh_interval` is configured
to a large interval (default 24 hours).

To achieve this, we define a new system config key
`LastCertificateRevocationRefreshRequestTimestamp`.

Upon triggering the cache refresh, the node where the refresh was
triggered writes the current time to this key. Gossip then propagates
the update to all other nodes. Eventually all nodes learn of the
refresh request.

Concurrently, an async process on every node watches this config
key. Every time its value moves past the time of the last refresh on
that node, an extra refresh is triggered.

Ben's remark:

> If the cache is short-lived, we may be able to avoid creating a
> system to manually force a refresh (or make it testing-only and it
> doesn't need to handle disconnected nodes).

## Cache refresh process

For OCSP cache entries, all entries in the cache with status `good`
are flushed, so that a new OCSP request will be forced upon the next
use of the TLS certs. All entries with status `revoked` are
preserved: any already-revoked cert is considered to remain revoked.

Refreshes and errors are logged.

## Cache queries during TLS cert verification

When a TLS cert is to be checked, the code asks the broker for
confirmation.

The broker first checks that the certs are properly signed by their
CA. If that fails, the verification fails.

The broker then inspects the `security.ocsp.mode` cluster setting. If
it is set to `off`, no further logic is applied.

Otherwise, the broker inspects the certificate and its CA chain
to collect all the OCSP reference URLs. For every cert in the chain
where one of the parent CAs has an OCSP URL, the code checks the OCSP
response cache for that (URL, cert serial) pair. If there is an entry
with `good` or `revoked` already in the cache, that is used directly.

If there is no cached entry yet, the OCSP server is queried. The
response from OCSP is then analyzed:

- if the OCSP response is badly signed, the response is ignored and
  the cert validation fails.
- if the OCSP response is `revoked`, the cert validation
  fails.
- if the OCSP response is `good`, the cert validation succeeds.
- if the OCSP response is `unknown`, or there is an error, the behavior
  depends on the new cluster setting `security.ocsp.mode`:

  - if `strict`, then cert validation fails.
  - if `lax`, then cert validation succeeds.

If the response was `revoked` or `good` (and properly signed), an
entry is added in the cache for the (URL, serial) pair.
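A minimal sketch of this decision logic, assuming the OCSP response has
already been fetched and its signature checked as in the earlier
example; the function and parameter names are illustrative only:

```go
// Sketch of the verification decision described above. The "strict"
// and "lax" strings mirror the proposed values of the
// `security.ocsp.mode` cluster setting; everything else is hypothetical.
package certbroker

import (
	"errors"

	"golang.org/x/crypto/ocsp"
)

// decideOCSP returns nil when the connection may proceed.
// badSignature is true when the OCSP response signature did not
// verify; queryErr is non-nil when the OCSP query itself failed.
func decideOCSP(mode string, status int, badSignature bool, queryErr error) error {
	switch {
	case badSignature:
		// A badly signed response is ignored and validation fails.
		return errors.New("OCSP response has an invalid signature")
	case queryErr == nil && status == ocsp.Revoked:
		return errors.New("certificate has been revoked")
	case queryErr == nil && status == ocsp.Good:
		return nil
	default:
		// `unknown` status or a query error: behavior depends on the mode.
		if mode == "strict" {
			return errors.New("cannot confirm certificate status (strict mode)")
		}
		return nil // `lax` tolerates an inconclusive answer
	}
}
```

A `good` or `revoked` verdict would additionally be recorded in the
cache, as described above.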
## Drawbacks

None known.

## Rationale and Alternatives

The following designs have been considered:

### Using CRLs instead of OCSP

#### Background about CRLs

A CRL is a list of X.509 certs that mark other certs as "revoked", i.e. invalid.

Technically, a *revocation cert* is a cert signed by a recognized CA,
which certifies that another cert, identified by serial
number/fingerprint, has been revoked as of a specific date.

The certificate validation code should obey the revocation lists and
refuse to validate/authenticate services using certs that have a
currently known revocation cert in a CRL.

In a service like CockroachDB, authn certs are presented by the client
upon establishing new connections, whereas CRLs are configured
server-side and fed into the server on a periodic basis.

In practice, CRLs are fed using two mechanisms:

- *external*: the operator has one or more revocation certs as
  separate files, or a combined file containing multiple certs. Then
  the operator "uploads" the CRL into the service. This upload should
  happen in at least two ways:

  - upon start-up, to load an initial CRL before the various other
    sub-systems in the service are fired up.

  - periodically, to refresh the CRL with new revocations while the
    service is running. This can be done by "pull" (the service uses
    e.g. HTTP to fetch a CRL over the network) or "push" (the operator
    invokes an API in the server to upload the CRL into it).

- *internal*: each CA certificate can contain a field called
  `CRLDistributionPoints`. This field is a list of URLs that point to
  CRLs related to that CA.

  Services that support `CRLDistributionPoints` should fetch the CRLs
  prior to validating certs signed by that CA.

  A particular pitfall/challenge with this field is that there may be
  multiple intermediate CAs, each with its own
  `CRLDistributionPoints`. Some of the CA certificates may be provided
  only during the TLS connection by the client, as part of the TLS
  client cert payload. So the CRL distribution points cannot generally
  be known "ahead of time" when a server starts up.

  References:

  https://www.pixelstech.net/article/1445498968-Generate-certificate-with-cRLDistributionPoints-extension-using-OpenSSL

  http://pkiglobe.org/crl_dist_points.html
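For reference, the check against a fetched CRL can be sketched with the
Go standard library as it existed when this RFC was written
(`x509.ParseCRL`, `Certificate.CheckCRLSignature`). The helper name is
illustrative:

```go
// Sketch of a CRL-based revocation check. Not CockroachDB code.
package crlexample

import (
	"crypto/x509"
	"errors"
	"time"
)

// certRevokedByCRL reports whether cert appears in the CRL published by
// issuer. crlBytes is the raw CRL, fetched from a file or from one of
// the issuer's CRLDistributionPoints URLs.
func certRevokedByCRL(cert, issuer *x509.Certificate, crlBytes []byte) (bool, error) {
	crl, err := x509.ParseCRL(crlBytes)
	if err != nil {
		return false, err
	}
	// The CRL must be signed by the CA that issued the cert.
	if err := issuer.CheckCRLSignature(crl); err != nil {
		return false, err
	}
	if crl.HasExpired(time.Now()) {
		return false, errors.New("CRL is stale")
	}
	for _, revoked := range crl.TBSCertList.RevokedCertificates {
		if revoked.SerialNumber.Cmp(cert.SerialNumber) == 0 {
			return true, nil
		}
	}
	return false, nil
}
```

A real implementation would also need to handle multiple distribution
points and intermediate CAs, as noted above.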
### Solution outline

- There would be a command-line flag to read the initial CRL from a network
  service or the filesystem.

  The `--cert-revocation-list` flag is provided
  with a URL to a location that provides the revocation certs. This can
  either use a path to a local file containing the CRL, or an HTTP URL to
  an external service. This is optional if the CA certs are known to
  list their CRL URLs themselves.

  If the CRL is provided externally as a collection of discrete files,
  they can be combined together into a single file via an `openssl`
  command.

- To revoke a cert, an operator would add the revocation cert to the
  list of certs. This can be either a local file (when
  `--cert-revocation-list` points to a local file), or a CRL server
  (when `--cert-revocation-list` points to a URL, or when using the
  `CRLDistributionPoints` field in CA certs).

- There would be a new SQL built-in function which an operator can use
  to refresh the CRLs from the network or the configured CRL local
  file: `crdb_internal.refresh_certificate_revocation()`.

- The API endpoint `/_status/certificates/cached_revocations` would return
  cached CRL entries.

- The cert validation would work as follows.

  The broker first checks that the certs are properly signed by their
  CA. If that fails, the verification fails.

  The broker then inspects the certificate and its CA chain to collect
  all the CRL distribution URLs, and merges that with the URL configured
  via `--cert-revocation-list`.

  For each URL in this list it checks whether it has an entry in the URL ->
  timestamp map. If it does not (URL not known yet), or if the timestamp
  is older than the configured refresh interval, it fetches that URL
  synchronously and updates the cache with the results. If there is an
  error, the URL -> timestamp map is not updated and the TLS cert
  validation fails.

  If a revocation cert is found in the CRL for either the leaf cert or
  any of its CA certs in the chain, TLS cert validation also fails.

- The background cache refresh process would work as follows: it would
  periodically re-load the file / URL configured via
  `--cert-revocation-list`. New revocation certs are added to the cache.
  Existing revocation certs are left alone.

  There is a separate CRL refresh timestamp maintained by the cache
  for each CRL URL (a map URL -> timestamp). For each URL the timestamp
  is bumped forward, but only if there was no error during the CRL
  refresh. If there was an error, the refresh timestamp for that URL is
  not updated, so that the async task that monitors refreshes tries that
  URL again soon.

Rejected idea: store the CRL in a table and query it upon every authn request.

  - The CRL becomes unavailable if the cluster is in a sad state.
  - It causes a KV lookup overhead and a hotspot upon connections.

## Unresolved questions

N/A