Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rfcs: new RFC on certificate revocation #50602

Merged
merged 1 commit into from
Aug 24, 2020

Conversation

knz
Copy link
Contributor

@knz knz commented Jun 24, 2020

@knz knz requested a review from a team as a code owner June 24, 2020 12:46
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@knz knz force-pushed the 20200624-cert-revoke-rfc branch from e3e57b5 to b1c3e3b Compare June 24, 2020 12:47
@knz knz force-pushed the 20200624-cert-revoke-rfc branch from b1c3e3b to 85c7b7f Compare June 24, 2020 13:00
Note:

- if the operator sets the `server.user_login.certificate_revocation_list` manually,
and then configures `...update_url`, their manual configuration will be overwritten
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like a footgun, no? If I specify it manually, it should stick forever (i.e. the data sources should merge).

Note that there isn't an issue if we say that you only get to provide a URL, but we have a custom scheme such as inline://revoke1,revoke2,...` (or even say "if it doesn't look like a URL but looks like a list, it's a list")?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The other option would be to produce an error to the operator if they try to set both and only data source to be active at a time. This might be less surprising in practice to the operator and prevent errors.

Copy link
Contributor

@bdarnell bdarnell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aaron-crl, @bdarnell, @DuskEagle, @joshimhoff, @knz, and @tbg)


docs/RFCS/20200624_cert_revocation.md, line 12 at r1 (raw file):

This RFC proposes to introduce a certificate revocation list (CRL) in
CockroachDB to validate against TLS client certificates used to
authenticate SQL client connections.

This shouldn't be limited to SQL client connections - we should support revocation for each of the CAs we use.


docs/RFCS/20200624_cert_revocation.md, line 16 at r1 (raw file):

The motivation is to enable revoking certificates earlier than their
expiration date, in case a certificate was compromised or the user
account is temporarily blocked from access (e.g. account suspension).

Certificate revocation is a poor fit for temporary blocking - it is in practice an irreversible decision and a new cert will need to be issued. We have a better solution for temporary blocking, the NOLOGIN user attribute.


docs/RFCS/20200624_cert_revocation.md, line 28 at r1 (raw file):

- introduce two additional cluster setting to refresh
  the primary setting above periodically:
  `server.user_login.certificate_revocation_list.update_url`,

The CA certificate may contain CRL URLs (called CRLDistributionPoints). We should probably use these instead of requiring separate configuration.

On the other hand, that would potentially mean we'd have to check CRLs for every CA, not only in the root ca.crt but also for intermediate certs that we may not have in our on-disk configuration but may only learn about at connection time. (one customer with whom we've recently discussed needs for revocation has a configuration with intermediate CAs; we should find out how they configure CRL distribution points. Of course, that customer also indicated a preference for OCSP instead of CRLs if possible).

The issue of intermediate CAs really complicates matters; if we need to support revocation at that level it may turn out to be simpler to use OSCP instead of CRLs.


docs/RFCS/20200624_cert_revocation.md, line 73 at r1 (raw file):

Previously, lunevalex wrote…

The other option would be to produce an error to the operator if they try to set both and only data source to be active at a time. This might be less surprising in practice to the operator and prevent errors.

Or we could store the automated-refresh CRLs and the manual CRL data separately and merge them. It would be bad if certificates got un-revoked because one of these sources clobbered the other.


docs/RFCS/20200624_cert_revocation.md, line 76 at r1 (raw file):

  the first time a CRL can be fetched successfully from the URL.

- if an URL fetch is unsuccessful, a message will be logged to indicate this fact,

At least one user has requested the option to hard-fail and disallow all logins if it's been too long since a successful CRL update.

We should have metrics (not just logging) for this (e.g. a metric for time of last successful CRL refresh).


docs/RFCS/20200624_cert_revocation.md, line 85 at r1 (raw file):

- when RESTOREing an entire cluster, the CRL stored in the backup will be restored.
  It is advised to ensure the CRL is refreshed at least once after a RESTORE

How would users ensure this?


docs/RFCS/20200624_cert_revocation.md, line 98 at r1 (raw file):

A cluster setting as the main store for the CRL answers two needs:

- the setting can be modified cluster-wide even if the KV layer is

Cluster settings are not propagated via gossip until the KV write commits. But I'm OK if CRL updates are impossible when KV is unavailable. Settings do give you read availability when KV is down.


docs/RFCS/20200624_cert_revocation.md, line 133 at r1 (raw file):

    node can become an attack vector because its CRL is outdated.

- Store the CRL on-disk in files, pass the directory via a command-line flag. Rejected:

Node-local caching of the CRL may be necessary for supporting revocation of node certificates, since we need to validate node certs before gossip and KV become available.

@knz knz force-pushed the 20200624-cert-revoke-rfc branch from 85c7b7f to ebfb9bb Compare July 21, 2020 11:33
Copy link
Contributor Author

@knz knz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aaron-crl, @bdarnell, @DuskEagle, @joshimhoff, @tbg, and @thtruo)


docs/RFCS/20200624_cert_revocation.md, line 12 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

This shouldn't be limited to SQL client connections - we should support revocation for each of the CAs we use.

This sounds weird to me. The configurable CA should not be revocable - instead you should just rotate the CA. The problem I see if that if the CA is revoked then everything breaks.


docs/RFCS/20200624_cert_revocation.md, line 16 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Certificate revocation is a poor fit for temporary blocking - it is in practice an irreversible decision and a new cert will need to be issued. We have a better solution for temporary blocking, the NOLOGIN user attribute.

Good point. Removed the second part of the sentence.


docs/RFCS/20200624_cert_revocation.md, line 28 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

The CA certificate may contain CRL URLs (called CRLDistributionPoints). We should probably use these instead of requiring separate configuration.

On the other hand, that would potentially mean we'd have to check CRLs for every CA, not only in the root ca.crt but also for intermediate certs that we may not have in our on-disk configuration but may only learn about at connection time. (one customer with whom we've recently discussed needs for revocation has a configuration with intermediate CAs; we should find out how they configure CRL distribution points. Of course, that customer also indicated a preference for OCSP instead of CRLs if possible).

The issue of intermediate CAs really complicates matters; if we need to support revocation at that level it may turn out to be simpler to use OSCP instead of CRLs.

These are very good point. I have largely rewritten the RFC in response to your comment.

The new text proposes to implement CRLs and OCSP side-by-side. The two mechanism are complementary IMHO so there is no conflict. We could choose to have the MVP implementation only implement one of the two, and then implement the other in a later phase. I'll take your and @thtruo 's suggestion about what is best.


docs/RFCS/20200624_cert_revocation.md, line 73 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Or we could store the automated-refresh CRLs and the manual CRL data separately and merge them. It would be bad if certificates got un-revoked because one of these sources clobbered the other.

I changed this completely. PTAL


docs/RFCS/20200624_cert_revocation.md, line 76 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

At least one user has requested the option to hard-fail and disallow all logins if it's been too long since a successful CRL update.

We should have metrics (not just logging) for this (e.g. a metric for time of last successful CRL refresh).

Can you look at the new text and reformulate your comment/suggestion in that light? thanks


docs/RFCS/20200624_cert_revocation.md, line 85 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

How would users ensure this?

The new text ensures that this verification is synchronous. PTAL


docs/RFCS/20200624_cert_revocation.md, line 98 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Cluster settings are not propagated via gossip until the KV write commits. But I'm OK if CRL updates are impossible when KV is unavailable. Settings do give you read availability when KV is down.

This has become moot.


docs/RFCS/20200624_cert_revocation.md, line 133 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Node-local caching of the CRL may be necessary for supporting revocation of node certificates, since we need to validate node certs before gossip and KV become available.

Agreed. Changed the text accordingly.

@knz knz force-pushed the 20200624-cert-revoke-rfc branch 7 times, most recently from ed42d48 to 0ee9d68 Compare July 21, 2020 12:04
Copy link
Contributor

@bdarnell bdarnell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aaron-crl, @DuskEagle, @joshimhoff, @knz, @tbg, and @thtruo)


docs/RFCS/20200624_cert_revocation.md, line 12 at r1 (raw file):

Previously, knz (kena) wrote…

This sounds weird to me. The configurable CA should not be revocable - instead you should just rotate the CA. The problem I see if that if the CA is revoked then everything breaks.

I don't mean revocation of the CA cert itself (as you say, the remedy for that is to rotate it externally). I mean revocation of certs issued by those CAs (for example, revocation of node client/server certs).


docs/RFCS/20200624_cert_revocation.md, line 170 at r2 (raw file):

command.

When using OCSP: pass the `--cert-ocsp-server` flag and provide it

I would assume that the certificates contain their CRL/OCSP URLs, and wouldn't add the option to configure URLs separately unless we are told that there is a real customer need for this.

OTOH, even if the certificates include their own URLs, revocation checking should be optional so we need a three-valued option for each of CRL and OCSP (don't check, check and fail open, check and fail closed)


docs/RFCS/20200624_cert_revocation.md, line 195 at r2 (raw file):

the cache is refreshed periodically in the background. The maximum
time between refreshes is configured via the cluster setting
`server.certificate_revocation_cache.refresh_interval`, which defaults to 24 hours.

OCSP responses include an optional NextUpdate timestamp field which defines the validity period of the response. If this is set, we may want to use it to set the cache expiration time. We should ask the customer how long they want the cache to be valid for (24h seems long to me; I'd expect a default more like an hour since this is opt-in for users who care about revocation) and whether they use this field.


docs/RFCS/20200624_cert_revocation.md, line 223 at r2 (raw file):

- `--cert-ocsp-server`: supports an OCSP server URL.

## Changes to the HTTP endpoints

This information should be added to http://127.0.0.1:49456/#/reports/certificates/local

I'm not sure about the backend APIs used by this page, but consider adding the revocation cache to the existing endpoint instead of adding a new one.


docs/RFCS/20200624_cert_revocation.md, line 257 at r2 (raw file):

manual refresh requests do not get lost, especially when
`server.certificate_revocation_cache.refresh_interval` is configured
to a large interval (default 24 hours).

If the cache is short-lived, we may be able to avoid creating a system to manually force a refresh (or make it testing-only and it doesn't need to handle disconnected nodes).


docs/RFCS/20200624_cert_revocation.md, line 344 at r2 (raw file):

## Unresolved questions

Should we implement only CRLs? Only OCSP? Or both? Which one is more urgent+important? Which one is more common? What did customers tell us they want?

Based on my current estimates of the difficulty of each option, I think we should implement OCSP first, with CRLs coming later, if at all.

I have one customer data point, from a customer who would prefer OCSP but considers CRLs an acceptable alternative.

@knz knz force-pushed the 20200624-cert-revoke-rfc branch from 0ee9d68 to f216618 Compare August 24, 2020 08:25
@knz knz force-pushed the 20200624-cert-revoke-rfc branch from f216618 to 33539e0 Compare August 24, 2020 08:26
Copy link
Contributor Author

@knz knz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I amended this RFC with the information from #53218 and a note that that PR does not implement caching.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aaron-crl, @bdarnell, @DuskEagle, @joshimhoff, @tbg, and @thtruo)


docs/RFCS/20200624_cert_revocation.md, line 12 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

I don't mean revocation of the CA cert itself (as you say, the remedy for that is to rotate it externally). I mean revocation of certs issued by those CAs (for example, revocation of node client/server certs).

Done.


docs/RFCS/20200624_cert_revocation.md, line 28 at r1 (raw file):

Previously, knz (kena) wrote…

These are very good point. I have largely rewritten the RFC in response to your comment.

The new text proposes to implement CRLs and OCSP side-by-side. The two mechanism are complementary IMHO so there is no conflict. We could choose to have the MVP implementation only implement one of the two, and then implement the other in a later phase. I'll take your and @thtruo 's suggestion about what is best.

Done.


docs/RFCS/20200624_cert_revocation.md, line 170 at r2 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

I would assume that the certificates contain their CRL/OCSP URLs, and wouldn't add the option to configure URLs separately unless we are told that there is a real customer need for this.

OTOH, even if the certificates include their own URLs, revocation checking should be optional so we need a three-valued option for each of CRL and OCSP (don't check, check and fail open, check and fail closed)

Done.


docs/RFCS/20200624_cert_revocation.md, line 195 at r2 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

OCSP responses include an optional NextUpdate timestamp field which defines the validity period of the response. If this is set, we may want to use it to set the cache expiration time. We should ask the customer how long they want the cache to be valid for (24h seems long to me; I'd expect a default more like an hour since this is opt-in for users who care about revocation) and whether they use this field.

Added your remark to the text.


docs/RFCS/20200624_cert_revocation.md, line 223 at r2 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

This information should be added to http://127.0.0.1:49456/#/reports/certificates/local

I'm not sure about the backend APIs used by this page, but consider adding the revocation cache to the existing endpoint instead of adding a new one.

Done.


docs/RFCS/20200624_cert_revocation.md, line 257 at r2 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

If the cache is short-lived, we may be able to avoid creating a system to manually force a refresh (or make it testing-only and it doesn't need to handle disconnected nodes).

Added your remark to the text.


docs/RFCS/20200624_cert_revocation.md, line 344 at r2 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Based on my current estimates of the difficulty of each option, I think we should implement OCSP first, with CRLs coming later, if at all.

I have one customer data point, from a customer who would prefer OCSP but considers CRLs an acceptable alternative.

Done.

@knz
Copy link
Contributor Author

knz commented Aug 24, 2020

bors r=bdarnell

@craig
Copy link
Contributor

craig bot commented Aug 24, 2020

Build failed:

@knz
Copy link
Contributor Author

knz commented Aug 24, 2020

bors r=bdarnell

@craig
Copy link
Contributor

craig bot commented Aug 24, 2020

Build succeeded:

@craig craig bot merged commit a337502 into cockroachdb:master Aug 24, 2020
@knz knz deleted the 20200624-cert-revoke-rfc branch August 24, 2020 11:32
@thtruo
Copy link
Contributor

thtruo commented Sep 8, 2020

@knz circling back to our conversation offline: could you update the RFC to include a guide/steps on how a user can set up an OCSP server and complete an e2e journey around revoking certs with CRDB clusters? The intended audience is for a PM and/or docs writer who would need to do acceptance testing and to write a guide for docs, respectively.

cc @Amruta-Ranade for awareness

@knz
Copy link
Contributor Author

knz commented Sep 8, 2020

thanks for the reminder. Filed a separate issue here: #54031

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants