rfcs: new RFC on certificate revocation #50602

knz · 2020-06-24T12:46:45Z

The RFC is here https://github.com/knz/cockroach/blob/20200624-cert-revoke-rfc/docs/RFCS/20200624_cert_revocation.md

cockroach-teamcity · 2020-06-24T12:46:52Z

This change is

tbg · 2020-06-24T14:24:10Z

docs/RFCS/20200624_cert_revocation.md

+Note:
+
+- if the operator sets the `server.user_login.certificate_revocation_list` manually,
+  and then configures `...update_url`, their manual configuration will be overwritten


This seems like a footgun, no? If I specify it manually, it should stick forever (i.e. the data sources should merge).

Note that there isn't an issue if we say that you only get to provide a URL, but we have a custom scheme such as inline://revoke1,revoke2,...` (or even say "if it doesn't look like a URL but looks like a list, it's a list")?

The other option would be to produce an error to the operator if they try to set both and only data source to be active at a time. This might be less surprising in practice to the operator and prevent errors.

bdarnell

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @aaron-crl, @bdarnell, @DuskEagle, @joshimhoff, @knz, and @tbg)

docs/RFCS/20200624_cert_revocation.md, line 12 at r1 (raw file):

This RFC proposes to introduce a certificate revocation list (CRL) in
CockroachDB to validate against TLS client certificates used to
authenticate SQL client connections.

This shouldn't be limited to SQL client connections - we should support revocation for each of the CAs we use.

docs/RFCS/20200624_cert_revocation.md, line 16 at r1 (raw file):

The motivation is to enable revoking certificates earlier than their
expiration date, in case a certificate was compromised or the user
account is temporarily blocked from access (e.g. account suspension).

Certificate revocation is a poor fit for temporary blocking - it is in practice an irreversible decision and a new cert will need to be issued. We have a better solution for temporary blocking, the NOLOGIN user attribute.

docs/RFCS/20200624_cert_revocation.md, line 28 at r1 (raw file):

- introduce two additional cluster setting to refresh
  the primary setting above periodically:
  `server.user_login.certificate_revocation_list.update_url`,

The CA certificate may contain CRL URLs (called CRLDistributionPoints). We should probably use these instead of requiring separate configuration.

On the other hand, that would potentially mean we'd have to check CRLs for every CA, not only in the root ca.crt but also for intermediate certs that we may not have in our on-disk configuration but may only learn about at connection time. (one customer with whom we've recently discussed needs for revocation has a configuration with intermediate CAs; we should find out how they configure CRL distribution points. Of course, that customer also indicated a preference for OCSP instead of CRLs if possible).

The issue of intermediate CAs really complicates matters; if we need to support revocation at that level it may turn out to be simpler to use OSCP instead of CRLs.

docs/RFCS/20200624_cert_revocation.md, line 73 at r1 (raw file):

Previously, lunevalex wrote…

The other option would be to produce an error to the operator if they try to set both and only data source to be active at a time. This might be less surprising in practice to the operator and prevent errors.

Or we could store the automated-refresh CRLs and the manual CRL data separately and merge them. It would be bad if certificates got un-revoked because one of these sources clobbered the other.

docs/RFCS/20200624_cert_revocation.md, line 76 at r1 (raw file):

  the first time a CRL can be fetched successfully from the URL.

- if an URL fetch is unsuccessful, a message will be logged to indicate this fact,

At least one user has requested the option to hard-fail and disallow all logins if it's been too long since a successful CRL update.

We should have metrics (not just logging) for this (e.g. a metric for time of last successful CRL refresh).

docs/RFCS/20200624_cert_revocation.md, line 85 at r1 (raw file):

- when RESTOREing an entire cluster, the CRL stored in the backup will be restored.
  It is advised to ensure the CRL is refreshed at least once after a RESTORE

How would users ensure this?

docs/RFCS/20200624_cert_revocation.md, line 98 at r1 (raw file):

A cluster setting as the main store for the CRL answers two needs:

- the setting can be modified cluster-wide even if the KV layer is

Cluster settings are not propagated via gossip until the KV write commits. But I'm OK if CRL updates are impossible when KV is unavailable. Settings do give you read availability when KV is down.

docs/RFCS/20200624_cert_revocation.md, line 133 at r1 (raw file):

    node can become an attack vector because its CRL is outdated.

- Store the CRL on-disk in files, pass the directory via a command-line flag. Rejected:

Node-local caching of the CRL may be necessary for supporting revocation of node certificates, since we need to validate node certs before gossip and KV become available.

knz

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @aaron-crl, @bdarnell, @DuskEagle, @joshimhoff, @tbg, and @thtruo)

docs/RFCS/20200624_cert_revocation.md, line 12 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

This shouldn't be limited to SQL client connections - we should support revocation for each of the CAs we use.

This sounds weird to me. The configurable CA should not be revocable - instead you should just rotate the CA. The problem I see if that if the CA is revoked then everything breaks.

docs/RFCS/20200624_cert_revocation.md, line 16 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Certificate revocation is a poor fit for temporary blocking - it is in practice an irreversible decision and a new cert will need to be issued. We have a better solution for temporary blocking, the NOLOGIN user attribute.

Good point. Removed the second part of the sentence.

docs/RFCS/20200624_cert_revocation.md, line 28 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

The CA certificate may contain CRL URLs (called CRLDistributionPoints). We should probably use these instead of requiring separate configuration.

On the other hand, that would potentially mean we'd have to check CRLs for every CA, not only in the root ca.crt but also for intermediate certs that we may not have in our on-disk configuration but may only learn about at connection time. (one customer with whom we've recently discussed needs for revocation has a configuration with intermediate CAs; we should find out how they configure CRL distribution points. Of course, that customer also indicated a preference for OCSP instead of CRLs if possible).

The issue of intermediate CAs really complicates matters; if we need to support revocation at that level it may turn out to be simpler to use OSCP instead of CRLs.

These are very good point. I have largely rewritten the RFC in response to your comment.

The new text proposes to implement CRLs and OCSP side-by-side. The two mechanism are complementary IMHO so there is no conflict. We could choose to have the MVP implementation only implement one of the two, and then implement the other in a later phase. I'll take your and @thtruo 's suggestion about what is best.

docs/RFCS/20200624_cert_revocation.md, line 73 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Or we could store the automated-refresh CRLs and the manual CRL data separately and merge them. It would be bad if certificates got un-revoked because one of these sources clobbered the other.

I changed this completely. PTAL

docs/RFCS/20200624_cert_revocation.md, line 76 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

At least one user has requested the option to hard-fail and disallow all logins if it's been too long since a successful CRL update.

We should have metrics (not just logging) for this (e.g. a metric for time of last successful CRL refresh).

Can you look at the new text and reformulate your comment/suggestion in that light? thanks

docs/RFCS/20200624_cert_revocation.md, line 85 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

How would users ensure this?

The new text ensures that this verification is synchronous. PTAL

docs/RFCS/20200624_cert_revocation.md, line 98 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Cluster settings are not propagated via gossip until the KV write commits. But I'm OK if CRL updates are impossible when KV is unavailable. Settings do give you read availability when KV is down.

This has become moot.

docs/RFCS/20200624_cert_revocation.md, line 133 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Node-local caching of the CRL may be necessary for supporting revocation of node certificates, since we need to validate node certs before gossip and KV become available.

Agreed. Changed the text accordingly.

bdarnell

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @aaron-crl, @DuskEagle, @joshimhoff, @knz, @tbg, and @thtruo)

docs/RFCS/20200624_cert_revocation.md, line 12 at r1 (raw file):

Previously, knz (kena) wrote…

This sounds weird to me. The configurable CA should not be revocable - instead you should just rotate the CA. The problem I see if that if the CA is revoked then everything breaks.

I don't mean revocation of the CA cert itself (as you say, the remedy for that is to rotate it externally). I mean revocation of certs issued by those CAs (for example, revocation of node client/server certs).

docs/RFCS/20200624_cert_revocation.md, line 170 at r2 (raw file):

command.

When using OCSP: pass the `--cert-ocsp-server` flag and provide it

I would assume that the certificates contain their CRL/OCSP URLs, and wouldn't add the option to configure URLs separately unless we are told that there is a real customer need for this.

OTOH, even if the certificates include their own URLs, revocation checking should be optional so we need a three-valued option for each of CRL and OCSP (don't check, check and fail open, check and fail closed)

docs/RFCS/20200624_cert_revocation.md, line 195 at r2 (raw file):

the cache is refreshed periodically in the background. The maximum
time between refreshes is configured via the cluster setting
`server.certificate_revocation_cache.refresh_interval`, which defaults to 24 hours.

OCSP responses include an optional NextUpdate timestamp field which defines the validity period of the response. If this is set, we may want to use it to set the cache expiration time. We should ask the customer how long they want the cache to be valid for (24h seems long to me; I'd expect a default more like an hour since this is opt-in for users who care about revocation) and whether they use this field.

docs/RFCS/20200624_cert_revocation.md, line 223 at r2 (raw file):

- `--cert-ocsp-server`: supports an OCSP server URL.

## Changes to the HTTP endpoints

This information should be added to http://127.0.0.1:49456/#/reports/certificates/local

I'm not sure about the backend APIs used by this page, but consider adding the revocation cache to the existing endpoint instead of adding a new one.

docs/RFCS/20200624_cert_revocation.md, line 257 at r2 (raw file):

manual refresh requests do not get lost, especially when
`server.certificate_revocation_cache.refresh_interval` is configured
to a large interval (default 24 hours).

If the cache is short-lived, we may be able to avoid creating a system to manually force a refresh (or make it testing-only and it doesn't need to handle disconnected nodes).

docs/RFCS/20200624_cert_revocation.md, line 344 at r2 (raw file):

## Unresolved questions

Should we implement only CRLs? Only OCSP? Or both? Which one is more urgent+important? Which one is more common? What did customers tell us they want?

Based on my current estimates of the difficulty of each option, I think we should implement OCSP first, with CRLs coming later, if at all.

I have one customer data point, from a customer who would prefer OCSP but considers CRLs an acceptable alternative.

Release note: None

knz

I amended this RFC with the information from #53218 and a note that that PR does not implement caching.

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @aaron-crl, @bdarnell, @DuskEagle, @joshimhoff, @tbg, and @thtruo)

docs/RFCS/20200624_cert_revocation.md, line 12 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

I don't mean revocation of the CA cert itself (as you say, the remedy for that is to rotate it externally). I mean revocation of certs issued by those CAs (for example, revocation of node client/server certs).

Done.

docs/RFCS/20200624_cert_revocation.md, line 28 at r1 (raw file):

Previously, knz (kena) wrote…

These are very good point. I have largely rewritten the RFC in response to your comment.

The new text proposes to implement CRLs and OCSP side-by-side. The two mechanism are complementary IMHO so there is no conflict. We could choose to have the MVP implementation only implement one of the two, and then implement the other in a later phase. I'll take your and @thtruo 's suggestion about what is best.

Done.

docs/RFCS/20200624_cert_revocation.md, line 170 at r2 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

I would assume that the certificates contain their CRL/OCSP URLs, and wouldn't add the option to configure URLs separately unless we are told that there is a real customer need for this.

OTOH, even if the certificates include their own URLs, revocation checking should be optional so we need a three-valued option for each of CRL and OCSP (don't check, check and fail open, check and fail closed)

Done.

docs/RFCS/20200624_cert_revocation.md, line 195 at r2 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

OCSP responses include an optional NextUpdate timestamp field which defines the validity period of the response. If this is set, we may want to use it to set the cache expiration time. We should ask the customer how long they want the cache to be valid for (24h seems long to me; I'd expect a default more like an hour since this is opt-in for users who care about revocation) and whether they use this field.

Added your remark to the text.

docs/RFCS/20200624_cert_revocation.md, line 223 at r2 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

This information should be added to http://127.0.0.1:49456/#/reports/certificates/local

I'm not sure about the backend APIs used by this page, but consider adding the revocation cache to the existing endpoint instead of adding a new one.

Done.

docs/RFCS/20200624_cert_revocation.md, line 257 at r2 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

If the cache is short-lived, we may be able to avoid creating a system to manually force a refresh (or make it testing-only and it doesn't need to handle disconnected nodes).

Added your remark to the text.

docs/RFCS/20200624_cert_revocation.md, line 344 at r2 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Based on my current estimates of the difficulty of each option, I think we should implement OCSP first, with CRLs coming later, if at all.

I have one customer data point, from a customer who would prefer OCSP but considers CRLs an acceptable alternative.

Done.

knz · 2020-08-24T08:27:53Z

bors r=bdarnell

craig · 2020-08-24T09:41:59Z

Build failed:

GitHub CI (Cockroach)

knz · 2020-08-24T10:56:09Z

bors r=bdarnell

craig · 2020-08-24T11:27:40Z

Build succeeded:

GitHub CI (Cockroach)

thtruo · 2020-09-08T15:24:04Z

@knz circling back to our conversation offline: could you update the RFC to include a guide/steps on how a user can set up an OCSP server and complete an e2e journey around revoking certs with CRDB clusters? The intended audience is for a PM and/or docs writer who would need to do acceptance testing and to write a guide for docs, respectively.

cc @Amruta-Ranade for awareness

knz · 2020-09-08T15:25:58Z

thanks for the reminder. Filed a separate issue here: #54031

knz requested a review from a team as a code owner June 24, 2020 12:46

knz force-pushed the 20200624-cert-revoke-rfc branch from e3e57b5 to b1c3e3b Compare June 24, 2020 12:47

knz requested review from aaron-crl, bdarnell, DuskEagle and joshimhoff June 24, 2020 12:47

knz mentioned this pull request Jun 24, 2020

security: support certificate revocation #29641

Closed

knz force-pushed the 20200624-cert-revoke-rfc branch from b1c3e3b to 85c7b7f Compare June 24, 2020 13:00

tbg reviewed Jun 24, 2020

View reviewed changes

bdarnell reviewed Jun 29, 2020

View reviewed changes

knz force-pushed the 20200624-cert-revoke-rfc branch from 85c7b7f to ebfb9bb Compare July 21, 2020 11:33

knz commented Jul 21, 2020

View reviewed changes

knz force-pushed the 20200624-cert-revoke-rfc branch 7 times, most recently from ed42d48 to 0ee9d68 Compare July 21, 2020 12:04

bdarnell reviewed Jul 27, 2020

View reviewed changes

knz force-pushed the 20200624-cert-revoke-rfc branch from 0ee9d68 to f216618 Compare August 24, 2020 08:25

rfcs: new RFC on certificate revocation

33539e0

Release note: None

knz force-pushed the 20200624-cert-revoke-rfc branch from f216618 to 33539e0 Compare August 24, 2020 08:26

knz commented Aug 24, 2020

View reviewed changes

craig bot merged commit a337502 into cockroachdb:master Aug 24, 2020

knz deleted the 20200624-cert-revoke-rfc branch August 24, 2020 11:32

knz mentioned this pull request Sep 8, 2020

docs: provide more/better examples in cert revocation RFC #54031

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

rfcs: new RFC on certificate revocation #50602

rfcs: new RFC on certificate revocation #50602

knz commented Jun 24, 2020 •

edited

Loading

cockroach-teamcity commented Jun 24, 2020

tbg Jun 24, 2020

lunevalex Jun 24, 2020

bdarnell left a comment

knz left a comment

bdarnell left a comment

knz left a comment

knz commented Aug 24, 2020

craig bot commented Aug 24, 2020

knz commented Aug 24, 2020

craig bot commented Aug 24, 2020

thtruo commented Sep 8, 2020

knz commented Sep 8, 2020

rfcs: new RFC on certificate revocation #50602

rfcs: new RFC on certificate revocation #50602

Conversation

knz commented Jun 24, 2020 • edited Loading

cockroach-teamcity commented Jun 24, 2020

tbg Jun 24, 2020

Choose a reason for hiding this comment

lunevalex Jun 24, 2020

Choose a reason for hiding this comment

bdarnell left a comment

Choose a reason for hiding this comment

knz left a comment

Choose a reason for hiding this comment

bdarnell left a comment

Choose a reason for hiding this comment

knz left a comment

Choose a reason for hiding this comment

knz commented Aug 24, 2020

craig bot commented Aug 24, 2020

knz commented Aug 24, 2020

craig bot commented Aug 24, 2020

thtruo commented Sep 8, 2020

knz commented Sep 8, 2020

knz commented Jun 24, 2020 •

edited

Loading