Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

security: add a cache to store client cert expirations for users #103592

Merged
merged 1 commit into from
May 31, 2023

Conversation

cameronnunez
Copy link
Contributor

@cameronnunez cameronnunez commented May 18, 2023

Fixes #100699.

This change introduces a new node-level, FIFO cache for storing the minimum time-until-expiration of the set of client certs seen (per user).

Entries in the cache are username-gauge pairs, where the gauges contain the expiration values. Periodically, gauges with
expiration values that are not in the future will be cleared from the cache. The purging process triggers hourly.

The gauges in the cache are children of an aggregate gauge ClientExpiration, so the aggregated value for that metric
is not meaningful and only serves as a link to the children. ClientExpiration (security.certificate.expiration.client)
is exported as a metric and the children gauges are labeled by SQL user.

Release note (security update): There is a new server.client_cert_expiration_cache.capacity setting which, when set to a non-zero number, makes it so that the minimum time-until-expiration of the set of client certificates seen is stored (for every user). This setting can be used to ensure client cert expirations are exported as a metric (if set to zero, the metric security.certificate.expiration.client will have a value of zero).

@cameronnunez cameronnunez requested review from a team as code owners May 18, 2023 14:29
@cameronnunez cameronnunez requested a review from yuzefovich May 18, 2023 14:29
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@cameronnunez cameronnunez marked this pull request as draft May 18, 2023 14:30
@yuzefovich yuzefovich removed request for a team and yuzefovich May 18, 2023 14:30
@cameronnunez cameronnunez force-pushed the user-certs-cache branch 3 times, most recently from 334437d to b9aeb67 Compare May 18, 2023 21:10
Copy link
Contributor

@knz knz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One thing that stands out to me here is that if say, a cert expires on Jan 10, then a new cert is issued with a new expiry on March 10, and no other cert is ever used (i.e. capacity is never exceeded) the crdb cluster will retain Jan 10 as the minimum value.

This is undesirable UX - after the expiry date of a cert has been reached in real time (or shortly afterwards), we should not keep it in the gauge any more and instead report all the certs with an expiry in the future in the metric.

Reviewed 8 of 12 files at r1, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @cameronnunez)


pkg/security/cert_expiry_cache.go line 110 at r1 (raw file):

		gauge.Update(newExpiry)
		c.mu.cache.Add(key, gauge)
		if err := c.mu.acc.Grow(ctx, int64(unsafe.Sizeof(*gauge))); err != nil {

the purpose of Grow is to verify that the allocation is possible before it is effectively performed. This call should be pulled up to before the call to parentGauge.AddChild.

@cameronnunez cameronnunez marked this pull request as ready for review May 22, 2023 14:22
@cameronnunez cameronnunez changed the title [WIP] security: add a cache to store client cert expirations for users security: add a cache to store client cert expirations for users May 22, 2023
@cameronnunez cameronnunez requested a review from a team May 22, 2023 14:23
Copy link
Collaborator

@sean- sean- left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I share @knz's concern.

Instead of caching just the expiration value of the cert, can we cache a struct that has the value and a per-entry TTL? I dry-coded a hypothetical below for consideration.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @cameronnunez and @knz)


pkg/security/cert_expiry_cache.go line 27 at r1 (raw file):

// TODO(cameron): best to update this after understanding performance more
const CacheCapacityMax = 1500
const CacheEntryTTL time.Duration = 8 * time.Hour

pkg/security/cert_expiry_cache.go line 91 at r1 (raw file):

	if !ok {
		return 0, ok
	}
type ExpirableTTL stuct {
  CertTTL int64 // seconds until this cert expires
  ValueTTL int64 // seconds until this cached entry expires
}

// in Get(), add a check so that on-access we expire the cached values in advance of the FIFO queue being exhausted
if time.Now().After(time.Unix(value.CertTTL, 0)) {
  c.mu.cache.Del(key)
  return 0, false
}

Many environments will only have half-a-dozen entries in this cache.


pkg/security/cert_expiry_cache.go line 109 at r1 (raw file):

		gauge := parentGauge.AddChild(key)
		gauge.Update(newExpiry)
		c.mu.cache.Add(key, gauge)
expiringGauge := ExpirableTTL{
  CertTTL: time.Now().Add(CacheEntryTTL).Unix(),
  ValueTTL: gauge,
}
c.mu.cache.Add(key, expiringGauge)

pkg/security/cert_expiry_cache.go line 115 at r1 (raw file):

	} else {
		if gauge := value.(*aggmetric.Gauge); newExpiry < gauge.Value() || gauge.Value() == 0 {
			gauge.Update(newExpiry)

This also needs to be updated to have an ExpirableTTL value that matches the convention in Get() above.

Copy link
Contributor Author

@cameronnunez cameronnunez left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have opted for an approach that involves periodically cleaning up the cache. This approach also avoids any ambiguity surrounding the use of the term "expiration", but let me know if there is a preference for the approach Sean has pitched

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @knz and @sean-)


pkg/security/cert_expiry_cache.go line 110 at r1 (raw file):

Previously, knz (Raphael 'kena' Poss) wrote…

the purpose of Grow is to verify that the allocation is possible before it is effectively performed. This call should be pulled up to before the call to parentGauge.AddChild.

Good catch, fixed.

@cameronnunez cameronnunez requested review from sean- and knz May 24, 2023 16:36
Copy link
Collaborator

@sean- sean- left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like your approach more; it's memory efficient and works since we're only talking about <1500 values atm (though I suspect it'll only be ~20 in the common case, and ~1k in the case of dyn usernames).

Reviewed 1 of 12 files at r1, 7 of 7 files at r2, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @knz)


pkg/security/cert_expiry_cache.go line 31 at r2 (raw file):

// TODO(cameron): best to update this after understanding performance more
const CacheCapacityMax = 1500

Maybe just set this to something "obscenely" high, like 65K or 256K, but prevent it from being an unlimited value. Argument: if someone has 65K certs, they should expect to spend a bit of RAM on this. Environments that use dynamically generated TLS certs could, conceivably, use more than 1500 usernames in the normal course of operation.


pkg/security/cert_expiry_cache.go line 202 at r2 (raw file):

			case <-ctx.Done():
				return
			}

Can we throw in a:

const expirationSmearing = 5 * time.Second
time.Sleep(time.Duration(int(expirationSmearing) / CacheCapacityMax))

to smear the expiration over a 5s period of time (i.e., sleep 5ms between expiring values)?

Copy link
Contributor Author

@cameronnunez cameronnunez left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated the max capacity

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @knz)


pkg/security/cert_expiry_cache.go line 202 at r2 (raw file):

Previously, sean- (Sean Chittenden) wrote…

Can we throw in a:

const expirationSmearing = 5 * time.Second
time.Sleep(time.Duration(int(expirationSmearing) / CacheCapacityMax))

to smear the expiration over a 5s period of time (i.e., sleep 5ms between expiring values)?

Did you mean that we should add that amount to the current period here (an hour)? The purging process is set to fire every hour (after an initial jittered duration)

@cameronnunez cameronnunez force-pushed the user-certs-cache branch 2 times, most recently from 8710af2 to d6933ba Compare May 25, 2023 01:42
Copy link
Contributor

@knz knz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM with nits - but please update commit message + PR description with the new behavior.

Reviewed 1 of 12 files at r1, 3 of 7 files at r2, 4 of 4 files at r3, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @cameronnunez)


pkg/security/cert_expiry_cache.go line 91 at r1 (raw file):

Previously, sean- (Sean Chittenden) wrote…
type ExpirableTTL stuct {
  CertTTL int64 // seconds until this cert expires
  ValueTTL int64 // seconds until this cached entry expires
}

// in Get(), add a check so that on-access we expire the cached values in advance of the FIFO queue being exhausted
if time.Now().After(time.Unix(value.CertTTL, 0)) {
  c.mu.cache.Del(key)
  return 0, false
}

Many environments will only have half-a-dozen entries in this cache.

This change should be communicated in the commit message and the PR description as well.


pkg/security/cert_expiry_cache.go line 59 at r3 (raw file):

}

func NewClientCertExpirationCache(

nit: exported functions should get a docstring.


pkg/security/cert_expiry_cache.go line 214 at r3 (raw file):

	c.mu.cache.Do(func(entry *cache.Entry) {
		gauge := entry.Value.(*aggmetric.Gauge)
		if gauge.Value() <= c.timeNow() {

nit: compute c.timeNow() only once for the entire loop.

Copy link
Contributor Author

@cameronnunez cameronnunez left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @knz)


pkg/security/cert_expiry_cache.go line 59 at r3 (raw file):

Previously, knz (Raphael 'kena' Poss) wrote…

nit: exported functions should get a docstring.

Fixed.


pkg/security/cert_expiry_cache.go line 214 at r3 (raw file):

Previously, knz (Raphael 'kena' Poss) wrote…

nit: compute c.timeNow() only once for the entire loop.

Done.

@cameronnunez cameronnunez requested a review from knz May 30, 2023 15:29
Copy link
Contributor

@knz knz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 1 files at r4, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @cameronnunez)

This change introduces a new node-level, FIFO cache for storing
the minimum time-until-expiration of the set of client certs
seen (per user).

Entries in the cache are username-gauge pairs, where the gauges
contain the expiration values. Periodically, gauges with
expiration values that are not in the future will be cleared
from the cache. The purging process triggers hourly.

The gauges in the cache are children of an aggregate gauge
`ClientExpiration`, so the aggregated value for that metric
is not meaningful and only serves as a link to the children.
`ClientExpiration` (`security.certificate.expiration.client`)
is exported as a metric and the children gauges are labeled by
SQL user.

Release note (security update): There is a new
`server.client_cert_expiration_cache.capacity` setting which,
when set to a non-zero number, makes it so that the minimum
time-until-expiration of the set of client certificates seen
is stored (for every user). This setting can be used to
ensure client cert expirations are exported as a metric (if
set to zero, the metric `security.certificate.expiration.client`
will have a value of zero.
@cameronnunez
Copy link
Contributor Author

TFTRs!

@cameronnunez
Copy link
Contributor Author

bors r=sean-,knz

@craig
Copy link
Contributor

craig bot commented May 30, 2023

Build failed (retrying...):

@craig
Copy link
Contributor

craig bot commented May 30, 2023

Build failed (retrying...):

@craig
Copy link
Contributor

craig bot commented May 30, 2023

Build failed:

@cameronnunez
Copy link
Contributor Author

bors r-

@cameronnunez
Copy link
Contributor Author

bors r=sean-,knz

@craig
Copy link
Contributor

craig bot commented May 31, 2023

Build succeeded:

@cameronnunez
Copy link
Contributor Author

blathers backport 22.2

@blathers-crl
Copy link

blathers-crl bot commented May 31, 2023

Encountered an error creating backports. Some common things that can go wrong:

  1. The backport branch might have already existed.
  2. There was a merge conflict.
  3. The backport branch contained merge commits.

You might need to create your backport manually using the backport tool.


error creating merge commit from abdaefa to blathers/backport-release-22.2-103592: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 22.2 failed. See errors above.


🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport-23.1.x Flags PRs that need to be backported to 23.1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Surface a metric to enable users to alert on when a client's cert age is close to expiring
4 participants