Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

security: figure out cert/key rotation #1674

Closed
mberhault opened this issue Jul 14, 2015 · 6 comments
Closed

security: figure out cert/key rotation #1674

mberhault opened this issue Jul 14, 2015 · 6 comments
Assignees
Labels
C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)
Milestone

Comments

@mberhault
Copy link
Contributor

We need a good way to handle cert/key rotation without having to restart all the nodes.
Pretty low priority.

@mberhault mberhault self-assigned this Jul 14, 2015
@petermattis petermattis changed the title Figure out cert/key rotation security: figure out cert/key rotation Feb 12, 2016
@petermattis petermattis added the C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) label Feb 14, 2016
@petermattis petermattis modified the milestone: 1.0 Feb 14, 2016
@mberhault
Copy link
Contributor Author

Throwing a basic outline of requirements and future improvements:

Requirements:

  • be able to specify multiple CA certs and node cert/key pairs for a single node. needed to be able to read/validate "old" certs and "new" certs.
  • ability to reload certificates without restarting the process

Future improvements:

  • add support for cert revocation
  • warnings/metrics about certificate expiration time
  • customizable certificate lifetime in cert command

Specifying multiple CA certs or cert/key pairs can be done in a number of ways:

  • repeated flags (does not help if we do not know the names in advance)
  • overwrite existing files (if a node gets restarted before all nodes have the new certificate, it would effectively be disconnected from the cluster)
  • glob/naming scheme/directory for certs/keys. Potential security issue if someone else can write files with the name. Can partially avoid by checking ownership/mode.

Reloading certs is probably best done through issuing the SIGHUP signal, unless we want to watch the directory/files. The former would probably be less surprising.
Using new client/server certificates can be done by setting tls.Config.GetCertificate and tls.Config.GetClientCertificate. CA certs can just be added to the pool

detailed outline of cert rotation in go

@petermattis
Copy link
Collaborator

Cc @bdarnell

@bdarnell
Copy link
Contributor

Let's separate CA cert rotation from the other cases - it's significantly easier because no cockroach process needs to touch the CA private key (or should even have access to it), and it's trivial to use multiple CA certs at once. In fact, when we have multiple CA certs, I don't think we ever need to use multiple keypairs at any node or client. On the other hand, clients need the CA cert too, so we must consider clients in the key rotation story.

The rotation process (assuming all the certs are expiring at the same time, although pieces of this process also work on their own) would be:

  1. Deploy old+new CA cert .pem (or equivalent) files to all nodes (and clients). Reload with SIGHUP or equivalent.
  2. Once all nodes and clients are using the new CA configuration (need monitoring to make this visible - must report on at least the number of known CA certs; maybe a fingerprint to identify the exact config), begin the process of rolling out new node and client keypairs, signed by the new CA. Each node needs to load its new TLS config and start using it for all new connections.
  3. Once all nodes and clients have been updated (again, need monitoring), do another CA cert update to remove the old CA cert from all nodes and clients.

@mberhault
Copy link
Contributor Author

Yeah, I think we may want to decouple CA and server/client cert lifetimes, it's insane to have them expire at the same time, potentially forcing a refresh of everything.

Ultimately, we won't be able to monitor client CA expiration, the best we can do is figure out which CA was used to sign the client cert.

I propose we keep things reasonably simple for now. Specifically:

Cert expiration:

  • make default CA expiration much longer (eg: 10 years)
  • keep node/client cert expiration at 1 year
  • metrics for CA/node cert expirations
  • alert on CA expiration within 6 months
  • alert on node expiration within 1 month

The actual expiration time can be set at cert creation-time, but the defaults should differ for CA/server/client certs.
If we want to make these more configurable, we can have alerts based of fraction of lifetime remaining.
Alerts should be both in the admin UI and in publicly-available prometheus rules.

CA rotation:

  • allow an arbitrary number of CA certs. Any non-expired CA cert gets added to the pool (I'm pretty sure the tls lib does that for us anyway).

Admins can cleanup old CAs quickly, but don't have to.

Server cert rotation:

  • allow multiple node certs to be present
  • reload certs on SIGHUP
  • use cert with NotBefore < time.Now and latest NotAfter.

I don't think we want to go much more complex than this. Any sort of cluster-wide gating system will fail if certs expire and nodes can't talk to each other.
CA renewal too close to node cert renewal will always cause problems if clients aren't updated quickly enough.

Concurrent CA/Server cert rotation:

If this really needs to happen, there's very little we can do to make sure clients have the updated CA.

Even gating server cert rollout is tricky, any type of global switch requires functional communication between nodes. We can signal a node to load new certs but we can't control when a node restarts, picking up new certs in the process. The gating process is effectively sighup/restart after pushing new certs.

Loading multiple certs:

We really want to support multiple files, especially for CA certs, having more than one is perfectly reasonable and allows for slow rollout.

Multiple node cert files is debatable. If we always use the more recent valid one, we could overwrite the file, but if we want to allow not-yet-valid certs, we need more than one file.

How to specify multiple files is open for debate. Both globs and auto-detection within a directory are fragile in the face of odd filenames, but still doable. We could make things simpler by having our PEM contain both certificate and private key, we currently move both around together anyway.

Rotating certs for new connections:

Refreshing CAs and node certs are easy through the standard tls.Config:

  • add CA cert to the shared root/client CA pools (we need a wrapper to lock those properly)
  • set tls.Config.Get(Client)Certificate to return the latest valid certificate

We need to make sure that those are properly used by grpc. We may need to go through grpc's TranportCredentials and swap the entire tls.Config. Both are described in the Hitless TLS Certificate Rotation post listed above.

@mberhault mberhault self-assigned this Mar 18, 2017
@mberhault
Copy link
Contributor Author

Draft RFC for online certificate rotation can be found in #14254

@spencerkimball spencerkimball removed this from the 1.0 milestone Mar 28, 2017
mberhault pushed a commit that referenced this issue Mar 31, 2017
This is the first step towards #1674.

Introduces two new security objects:
* certificate_loader: scans certsDir for certs/keys
* certificate_manager: long-lived: builds tls.Config objects

The certificate_manager is currently only used by the `cert debug-list`
command and the current cert system remains in place.

A few small notes:
* embedded certs were renamed to the new naming scheme
* most test changes are to use the new embedded assets setter
mberhault pushed a commit that referenced this issue Apr 4, 2017
This is the first step towards #1674.

Introduces two new security objects:
* certificate_loader: scans certsDir for certs/keys
* certificate_manager: long-lived: builds tls.Config objects

The certificate_manager is currently only used by the `cert debug-list`
command and the current cert system remains in place.

A few small notes:
* embedded certs were renamed to the new naming scheme
* most test changes are to use the new embedded assets setter
@mberhault mberhault added this to the 1.0 milestone Apr 5, 2017
mberhault pushed a commit that referenced this issue Apr 16, 2017
Part of #1674.

The `--certs-dir` flag replaces all the certs flags, with a naming
scheme used to determine what type each file is.
The user (needed for `client.<user>.{crt,key}`) comes from
`base.Config.User`.
mberhault pushed a commit that referenced this issue Apr 17, 2017
Part of #1674.

The `--certs-dir` flag replaces all the certs flags, with a naming
scheme used to determine what type each file is.
The user (needed for `client.<user>.{crt,key}`) comes from
`base.Config.User`.
mberhault pushed a commit that referenced this issue Apr 17, 2017
Part of #1674

Watch for SIGHUP and rotate embedded server tls config.
mberhault pushed a commit that referenced this issue Apr 17, 2017
Part of #1674.

* use `--certs-dir` for all commands accepting certs
* accept old certs flags for `start` command only, but log deprecation warning
* rework certs creation logic
* make `--insecure` always default to false and independent of `--host`
mberhault pushed a commit that referenced this issue Apr 17, 2017
The certificate directory can now be re-read by issuing a SIGHUP signal
to the server.

Part of #1674.
@mberhault
Copy link
Contributor Author

This is merged.

A few issues remain but can mostly be addressed later.
#14985: figure out cert lifetimes
#15027: certs debug page and monitoring
#15028: other methods to trigger certificate reload

knz added a commit to knz/cockroach that referenced this issue Jul 6, 2022
Back in CockroachDB v1.1 (v17.2 in the new calver scheme), we
introduced a certificate rotation mechanism. To help
teach/troubleshoot that feature, we also provided a way for the
operator to view the certificate details in the DB Console (expiration
time, addresses, etc.)

This work was done in PR cockroachdb#16087, to solve issues cockroachdb#15027/cockroachdb#1674.

However, as part of that PR, the implementation of the back-end API
also included the *data* of the cert (including the cert signature
and the signature chain) in the response payload.

This additional payload was never used in a user-facing feature: the
DB Console does not display it nor does it contain a link to "download
the cert file". The back-end API is not public either, so we are not
expecting end-users to have legitimate uses for this feature.

Meanwhile, leaking cert data through an API runs dangerously close
to violating PCI guidelines (not quite, since keys are not exposed,
but still...).

So in order to avoid a remark on this during PCI review cycles, and to
remove the chance this will be misused, this patch removes the
data payload from the cert response.

The DB Console screen corresponding to the original work remains
unaffected.

Release note: None
knz added a commit to knz/cockroach that referenced this issue Jul 6, 2022
Back in CockroachDB v1.1 (v17.2 in the new calver scheme), we
introduced a certificate rotation mechanism. To help
teach/troubleshoot that feature, we also provided a way for the
operator to view the certificate details in the DB Console (expiration
time, addresses, etc.)

This work was done in PR cockroachdb#16087, to solve issues cockroachdb#15027/cockroachdb#1674.

However, as part of that PR, the implementation of the back-end API
also included the *data* of the cert (including the cert signature
and the signature chain) in the response payload.

This additional payload was never used in a user-facing feature: the
DB Console does not display it nor does it contain a link to "download
the cert file". The back-end API is not public either, so we are not
expecting end-users to have legitimate uses for this feature.

Meanwhile, leaking cert data through an API runs dangerously close
to violating PCI guidelines (not quite, since keys are not exposed,
but still...).

So in order to avoid a remark on this during PCI review cycles, and to
remove the chance this will be misused, this patch removes the
data payload from the cert response.

The DB Console screen corresponding to the original work remains
unaffected.

Release note: None
craig bot pushed a commit that referenced this issue Jul 6, 2022
83616: sql, sqlstats: create temporary stats container for all txns r=xinhaoz a=xinhaoz

# Commit 1
### sql: add txn_fingerprint_id to node_statement_statistics

This commit adds the `txn_fingerprint_id` column to
`crdb_internal.node_statement_statistics`.

Release note (sql): `txn_fingerprin_id` has been added to
`crdb_internal.node_statement_statistics`. The type of the
column is NULL or STRING.

------------------------------
# Commit 2
### sql, sqlstats: create temporary stats container for all txns

Fixes: #81470

Previously, the stats collector followed different procedures for stats
collection depending on whether or not the txn was explicit.

For explicit transactions, all the stmts in the txn must be recorded with
the same `transactionFingerprintID`, which is  only known after all stmts
in the txn have been executed. In order to record the correct
txnFingerprintID, a temporary stats container was created for stmts in the
current transaction. The `transactionFingerprintID` was then populated for
all stmts in the temp container, and the temp container was merged with
the parent.

For implict transactions, the assumption was there would only be a single
stmt in the txn, and so no temporary container was created, with stmts
being written directly to the application stats.

This assumption was incorrect, as it is possible for implicit txns to have
multiple stmts, such as stmts sent in a batch.

This commit ensures that stats are properly collected for implicit txns
with multiple stmts. The stats collector now follows the same procedure
for both explicit and implicit txns, creating a temporary container for
local txn stmts and merging on txn finish.

Release note (bug fix): Statement and transaction stats are now properly
recorded for implicit txns with multiple stmts.

83762: docs: add MVCC range tombstones tech note r=sumeerbhola,jbowens,nicktrav a=erikgrinaker

[Rendered version](https://github.com/erikgrinaker/cockroach/blob/mvcc-range-tombstones-tech-note/docs/tech-notes/mvcc-range-tombstones.md)

---

This describes the current state, but the details will likely change as
we iterate on the implementation.

Resolves #83406.

Release note: None

83873: tenantsettingswatcher: remove the version gate r=yuzefovich a=yuzefovich

This commit removes the version gate for the tenant settings.

Release note: None

83902: server: remove TLS cert data retrieval over HTTP r=catj-cockroach a=knz

Back in CockroachDB v1.1 (v17.2 in the new calver scheme), we
introduced a certificate rotation mechanism. To help
teach/troubleshoot that feature, we also provided a way for the
operator to view the certificate details in the DB Console (expiration
time, addresses, etc.)

This work was done in PR #16087, to solve issues #15027/#1674.

However, as part of that PR, the implementation of the back-end API
also included the *data* of the cert (including the cert signature
and the signature chain) in the response payload.

This additional payload was never used in a user-facing feature: the
DB Console does not display it nor does it contain a link to "download
the cert file". The back-end API is not public either, so we are not
expecting end-users to have legitimate uses for this feature.

Meanwhile, leaking cert data through an API runs dangerously close
to violating PCI guidelines (not quite, since keys are not exposed,
but still...).

So in order to avoid a remark on this during PCI review cycles, and to
remove the chance this will be misused, this patch removes the
data payload from the cert response.

The DB Console screen corresponding to the original work remains
unaffected.

For reference here's how the console screen looks: 
![image](https://user-images.githubusercontent.com/642886/177591040-f554fdf0-2b04-48f6-af36-0b94c0bcaf4c.png)


Co-authored-by: Xin Hao Zhang <[email protected]>
Co-authored-by: Erik Grinaker <[email protected]>
Co-authored-by: Yahor Yuzefovich <[email protected]>
Co-authored-by: Raphael 'kena' Poss <[email protected]>
knz added a commit to knz/cockroach that referenced this issue Jul 12, 2022
Back in CockroachDB v1.1 (v17.2 in the new calver scheme), we
introduced a certificate rotation mechanism. To help
teach/troubleshoot that feature, we also provided a way for the
operator to view the certificate details in the DB Console (expiration
time, addresses, etc.)

This work was done in PR cockroachdb#16087, to solve issues cockroachdb#15027/cockroachdb#1674.

However, as part of that PR, the implementation of the back-end API
also included the *data* of the cert (including the cert signature
and the signature chain) in the response payload.

This additional payload was never used in a user-facing feature: the
DB Console does not display it nor does it contain a link to "download
the cert file". The back-end API is not public either, so we are not
expecting end-users to have legitimate uses for this feature.

Meanwhile, leaking cert data through an API runs dangerously close
to violating PCI guidelines (not quite, since keys are not exposed,
but still...).

So in order to avoid a remark on this during PCI review cycles, and to
remove the chance this will be misused, this patch removes the
data payload from the cert response.

The DB Console screen corresponding to the original work remains
unaffected.

Release note: None
knz added a commit to knz/cockroach that referenced this issue Aug 3, 2022
Back in CockroachDB v1.1 (v17.2 in the new calver scheme), we
introduced a certificate rotation mechanism. To help
teach/troubleshoot that feature, we also provided a way for the
operator to view the certificate details in the DB Console (expiration
time, addresses, etc.)

This work was done in PR cockroachdb#16087, to solve issues cockroachdb#15027/cockroachdb#1674.

However, as part of that PR, the implementation of the back-end API
also included the *data* of the cert (including the cert signature
and the signature chain) in the response payload.

This additional payload was never used in a user-facing feature: the
DB Console does not display it nor does it contain a link to "download
the cert file". The back-end API is not public either, so we are not
expecting end-users to have legitimate uses for this feature.

Meanwhile, leaking cert data through an API runs dangerously close
to violating PCI guidelines (not quite, since keys are not exposed,
but still...).

So in order to avoid a remark on this during PCI review cycles, and to
remove the chance this will be misused, this patch removes the
data payload from the cert response.

The DB Console screen corresponding to the original work remains
unaffected.

Release note: None
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)
Projects
None yet
Development

No branches or pull requests

4 participants