-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
rfcs: new RFC on certificate revocation #50602
Conversation
e3e57b5
to
b1c3e3b
Compare
b1c3e3b
to
85c7b7f
Compare
Note: | ||
|
||
- if the operator sets the `server.user_login.certificate_revocation_list` manually, | ||
and then configures `...update_url`, their manual configuration will be overwritten |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems like a footgun, no? If I specify it manually, it should stick forever (i.e. the data sources should merge).
Note that there isn't an issue if we say that you only get to provide a URL, but we have a custom scheme such as inline://revoke1,revoke2,...` (or even say "if it doesn't look like a URL but looks like a list, it's a list")?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The other option would be to produce an error to the operator if they try to set both and only data source to be active at a time. This might be less surprising in practice to the operator and prevent errors.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @aaron-crl, @bdarnell, @DuskEagle, @joshimhoff, @knz, and @tbg)
docs/RFCS/20200624_cert_revocation.md, line 12 at r1 (raw file):
This RFC proposes to introduce a certificate revocation list (CRL) in CockroachDB to validate against TLS client certificates used to authenticate SQL client connections.
This shouldn't be limited to SQL client connections - we should support revocation for each of the CAs we use.
docs/RFCS/20200624_cert_revocation.md, line 16 at r1 (raw file):
The motivation is to enable revoking certificates earlier than their expiration date, in case a certificate was compromised or the user account is temporarily blocked from access (e.g. account suspension).
Certificate revocation is a poor fit for temporary blocking - it is in practice an irreversible decision and a new cert will need to be issued. We have a better solution for temporary blocking, the NOLOGIN
user attribute.
docs/RFCS/20200624_cert_revocation.md, line 28 at r1 (raw file):
- introduce two additional cluster setting to refresh the primary setting above periodically: `server.user_login.certificate_revocation_list.update_url`,
The CA certificate may contain CRL URLs (called CRLDistributionPoints). We should probably use these instead of requiring separate configuration.
On the other hand, that would potentially mean we'd have to check CRLs for every CA, not only in the root ca.crt
but also for intermediate certs that we may not have in our on-disk configuration but may only learn about at connection time. (one customer with whom we've recently discussed needs for revocation has a configuration with intermediate CAs; we should find out how they configure CRL distribution points. Of course, that customer also indicated a preference for OCSP instead of CRLs if possible).
The issue of intermediate CAs really complicates matters; if we need to support revocation at that level it may turn out to be simpler to use OSCP instead of CRLs.
docs/RFCS/20200624_cert_revocation.md, line 73 at r1 (raw file):
Previously, lunevalex wrote…
The other option would be to produce an error to the operator if they try to set both and only data source to be active at a time. This might be less surprising in practice to the operator and prevent errors.
Or we could store the automated-refresh CRLs and the manual CRL data separately and merge them. It would be bad if certificates got un-revoked because one of these sources clobbered the other.
docs/RFCS/20200624_cert_revocation.md, line 76 at r1 (raw file):
the first time a CRL can be fetched successfully from the URL. - if an URL fetch is unsuccessful, a message will be logged to indicate this fact,
At least one user has requested the option to hard-fail and disallow all logins if it's been too long since a successful CRL update.
We should have metrics (not just logging) for this (e.g. a metric for time of last successful CRL refresh).
docs/RFCS/20200624_cert_revocation.md, line 85 at r1 (raw file):
- when RESTOREing an entire cluster, the CRL stored in the backup will be restored. It is advised to ensure the CRL is refreshed at least once after a RESTORE
How would users ensure this?
docs/RFCS/20200624_cert_revocation.md, line 98 at r1 (raw file):
A cluster setting as the main store for the CRL answers two needs: - the setting can be modified cluster-wide even if the KV layer is
Cluster settings are not propagated via gossip until the KV write commits. But I'm OK if CRL updates are impossible when KV is unavailable. Settings do give you read availability when KV is down.
docs/RFCS/20200624_cert_revocation.md, line 133 at r1 (raw file):
node can become an attack vector because its CRL is outdated. - Store the CRL on-disk in files, pass the directory via a command-line flag. Rejected:
Node-local caching of the CRL may be necessary for supporting revocation of node certificates, since we need to validate node certs before gossip and KV become available.
85c7b7f
to
ebfb9bb
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @aaron-crl, @bdarnell, @DuskEagle, @joshimhoff, @tbg, and @thtruo)
docs/RFCS/20200624_cert_revocation.md, line 12 at r1 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
This shouldn't be limited to SQL client connections - we should support revocation for each of the CAs we use.
This sounds weird to me. The configurable CA should not be revocable - instead you should just rotate the CA. The problem I see if that if the CA is revoked then everything breaks.
docs/RFCS/20200624_cert_revocation.md, line 16 at r1 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
Certificate revocation is a poor fit for temporary blocking - it is in practice an irreversible decision and a new cert will need to be issued. We have a better solution for temporary blocking, the
NOLOGIN
user attribute.
Good point. Removed the second part of the sentence.
docs/RFCS/20200624_cert_revocation.md, line 28 at r1 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
The CA certificate may contain CRL URLs (called CRLDistributionPoints). We should probably use these instead of requiring separate configuration.
On the other hand, that would potentially mean we'd have to check CRLs for every CA, not only in the root
ca.crt
but also for intermediate certs that we may not have in our on-disk configuration but may only learn about at connection time. (one customer with whom we've recently discussed needs for revocation has a configuration with intermediate CAs; we should find out how they configure CRL distribution points. Of course, that customer also indicated a preference for OCSP instead of CRLs if possible).The issue of intermediate CAs really complicates matters; if we need to support revocation at that level it may turn out to be simpler to use OSCP instead of CRLs.
These are very good point. I have largely rewritten the RFC in response to your comment.
The new text proposes to implement CRLs and OCSP side-by-side. The two mechanism are complementary IMHO so there is no conflict. We could choose to have the MVP implementation only implement one of the two, and then implement the other in a later phase. I'll take your and @thtruo 's suggestion about what is best.
docs/RFCS/20200624_cert_revocation.md, line 73 at r1 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
Or we could store the automated-refresh CRLs and the manual CRL data separately and merge them. It would be bad if certificates got un-revoked because one of these sources clobbered the other.
I changed this completely. PTAL
docs/RFCS/20200624_cert_revocation.md, line 76 at r1 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
At least one user has requested the option to hard-fail and disallow all logins if it's been too long since a successful CRL update.
We should have metrics (not just logging) for this (e.g. a metric for time of last successful CRL refresh).
Can you look at the new text and reformulate your comment/suggestion in that light? thanks
docs/RFCS/20200624_cert_revocation.md, line 85 at r1 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
How would users ensure this?
The new text ensures that this verification is synchronous. PTAL
docs/RFCS/20200624_cert_revocation.md, line 98 at r1 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
Cluster settings are not propagated via gossip until the KV write commits. But I'm OK if CRL updates are impossible when KV is unavailable. Settings do give you read availability when KV is down.
This has become moot.
docs/RFCS/20200624_cert_revocation.md, line 133 at r1 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
Node-local caching of the CRL may be necessary for supporting revocation of node certificates, since we need to validate node certs before gossip and KV become available.
Agreed. Changed the text accordingly.
ed42d48
to
0ee9d68
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @aaron-crl, @DuskEagle, @joshimhoff, @knz, @tbg, and @thtruo)
docs/RFCS/20200624_cert_revocation.md, line 12 at r1 (raw file):
Previously, knz (kena) wrote…
This sounds weird to me. The configurable CA should not be revocable - instead you should just rotate the CA. The problem I see if that if the CA is revoked then everything breaks.
I don't mean revocation of the CA cert itself (as you say, the remedy for that is to rotate it externally). I mean revocation of certs issued by those CAs (for example, revocation of node client/server certs).
docs/RFCS/20200624_cert_revocation.md, line 170 at r2 (raw file):
command. When using OCSP: pass the `--cert-ocsp-server` flag and provide it
I would assume that the certificates contain their CRL/OCSP URLs, and wouldn't add the option to configure URLs separately unless we are told that there is a real customer need for this.
OTOH, even if the certificates include their own URLs, revocation checking should be optional so we need a three-valued option for each of CRL and OCSP (don't check, check and fail open, check and fail closed)
docs/RFCS/20200624_cert_revocation.md, line 195 at r2 (raw file):
the cache is refreshed periodically in the background. The maximum time between refreshes is configured via the cluster setting `server.certificate_revocation_cache.refresh_interval`, which defaults to 24 hours.
OCSP responses include an optional NextUpdate timestamp field which defines the validity period of the response. If this is set, we may want to use it to set the cache expiration time. We should ask the customer how long they want the cache to be valid for (24h seems long to me; I'd expect a default more like an hour since this is opt-in for users who care about revocation) and whether they use this field.
docs/RFCS/20200624_cert_revocation.md, line 223 at r2 (raw file):
- `--cert-ocsp-server`: supports an OCSP server URL. ## Changes to the HTTP endpoints
This information should be added to http://127.0.0.1:49456/#/reports/certificates/local
I'm not sure about the backend APIs used by this page, but consider adding the revocation cache to the existing endpoint instead of adding a new one.
docs/RFCS/20200624_cert_revocation.md, line 257 at r2 (raw file):
manual refresh requests do not get lost, especially when `server.certificate_revocation_cache.refresh_interval` is configured to a large interval (default 24 hours).
If the cache is short-lived, we may be able to avoid creating a system to manually force a refresh (or make it testing-only and it doesn't need to handle disconnected nodes).
docs/RFCS/20200624_cert_revocation.md, line 344 at r2 (raw file):
## Unresolved questions Should we implement only CRLs? Only OCSP? Or both? Which one is more urgent+important? Which one is more common? What did customers tell us they want?
Based on my current estimates of the difficulty of each option, I think we should implement OCSP first, with CRLs coming later, if at all.
I have one customer data point, from a customer who would prefer OCSP but considers CRLs an acceptable alternative.
0ee9d68
to
f216618
Compare
Release note: None
f216618
to
33539e0
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I amended this RFC with the information from #53218 and a note that that PR does not implement caching.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @aaron-crl, @bdarnell, @DuskEagle, @joshimhoff, @tbg, and @thtruo)
docs/RFCS/20200624_cert_revocation.md, line 12 at r1 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
I don't mean revocation of the CA cert itself (as you say, the remedy for that is to rotate it externally). I mean revocation of certs issued by those CAs (for example, revocation of node client/server certs).
Done.
docs/RFCS/20200624_cert_revocation.md, line 28 at r1 (raw file):
Previously, knz (kena) wrote…
These are very good point. I have largely rewritten the RFC in response to your comment.
The new text proposes to implement CRLs and OCSP side-by-side. The two mechanism are complementary IMHO so there is no conflict. We could choose to have the MVP implementation only implement one of the two, and then implement the other in a later phase. I'll take your and @thtruo 's suggestion about what is best.
Done.
docs/RFCS/20200624_cert_revocation.md, line 170 at r2 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
I would assume that the certificates contain their CRL/OCSP URLs, and wouldn't add the option to configure URLs separately unless we are told that there is a real customer need for this.
OTOH, even if the certificates include their own URLs, revocation checking should be optional so we need a three-valued option for each of CRL and OCSP (don't check, check and fail open, check and fail closed)
Done.
docs/RFCS/20200624_cert_revocation.md, line 195 at r2 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
OCSP responses include an optional NextUpdate timestamp field which defines the validity period of the response. If this is set, we may want to use it to set the cache expiration time. We should ask the customer how long they want the cache to be valid for (24h seems long to me; I'd expect a default more like an hour since this is opt-in for users who care about revocation) and whether they use this field.
Added your remark to the text.
docs/RFCS/20200624_cert_revocation.md, line 223 at r2 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
This information should be added to http://127.0.0.1:49456/#/reports/certificates/local
I'm not sure about the backend APIs used by this page, but consider adding the revocation cache to the existing endpoint instead of adding a new one.
Done.
docs/RFCS/20200624_cert_revocation.md, line 257 at r2 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
If the cache is short-lived, we may be able to avoid creating a system to manually force a refresh (or make it testing-only and it doesn't need to handle disconnected nodes).
Added your remark to the text.
docs/RFCS/20200624_cert_revocation.md, line 344 at r2 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
Based on my current estimates of the difficulty of each option, I think we should implement OCSP first, with CRLs coming later, if at all.
I have one customer data point, from a customer who would prefer OCSP but considers CRLs an acceptable alternative.
Done.
bors r=bdarnell |
Build failed: |
bors r=bdarnell |
Build succeeded: |
@knz circling back to our conversation offline: could you update the RFC to include a guide/steps on how a user can set up an OCSP server and complete an e2e journey around revoking certs with CRDB clusters? The intended audience is for a PM and/or docs writer who would need to do acceptance testing and to write a guide for docs, respectively. cc @Amruta-Ranade for awareness |
thanks for the reminder. Filed a separate issue here: #54031 |
The RFC is here https://github.com/knz/cockroach/blob/20200624-cert-revoke-rfc/docs/RFCS/20200624_cert_revocation.md