-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
security: add a cache to store client cert expirations for users #103592
Conversation
334437d
to
b9aeb67
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One thing that stands out to me here is that if say, a cert expires on Jan 10, then a new cert is issued with a new expiry on March 10, and no other cert is ever used (i.e. capacity is never exceeded) the crdb cluster will retain Jan 10 as the minimum value.
This is undesirable UX - after the expiry date of a cert has been reached in real time (or shortly afterwards), we should not keep it in the gauge any more and instead report all the certs with an expiry in the future in the metric.
Reviewed 8 of 12 files at r1, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @cameronnunez)
pkg/security/cert_expiry_cache.go
line 110 at r1 (raw file):
gauge.Update(newExpiry) c.mu.cache.Add(key, gauge) if err := c.mu.acc.Grow(ctx, int64(unsafe.Sizeof(*gauge))); err != nil {
the purpose of Grow is to verify that the allocation is possible before it is effectively performed. This call should be pulled up to before the call to parentGauge.AddChild
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I share @knz's concern.
Instead of caching just the expiration value of the cert, can we cache a struct that has the value and a per-entry TTL? I dry-coded a hypothetical below for consideration.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @cameronnunez and @knz)
pkg/security/cert_expiry_cache.go
line 27 at r1 (raw file):
// TODO(cameron): best to update this after understanding performance more const CacheCapacityMax = 1500
const CacheEntryTTL time.Duration = 8 * time.Hour
pkg/security/cert_expiry_cache.go
line 91 at r1 (raw file):
if !ok { return 0, ok }
type ExpirableTTL stuct {
CertTTL int64 // seconds until this cert expires
ValueTTL int64 // seconds until this cached entry expires
}
// in Get(), add a check so that on-access we expire the cached values in advance of the FIFO queue being exhausted
if time.Now().After(time.Unix(value.CertTTL, 0)) {
c.mu.cache.Del(key)
return 0, false
}
Many environments will only have half-a-dozen entries in this cache.
pkg/security/cert_expiry_cache.go
line 109 at r1 (raw file):
gauge := parentGauge.AddChild(key) gauge.Update(newExpiry) c.mu.cache.Add(key, gauge)
expiringGauge := ExpirableTTL{
CertTTL: time.Now().Add(CacheEntryTTL).Unix(),
ValueTTL: gauge,
}
c.mu.cache.Add(key, expiringGauge)
pkg/security/cert_expiry_cache.go
line 115 at r1 (raw file):
} else { if gauge := value.(*aggmetric.Gauge); newExpiry < gauge.Value() || gauge.Value() == 0 { gauge.Update(newExpiry)
This also needs to be updated to have an ExpirableTTL
value that matches the convention in Get()
above.
b9aeb67
to
4341be6
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have opted for an approach that involves periodically cleaning up the cache. This approach also avoids any ambiguity surrounding the use of the term "expiration", but let me know if there is a preference for the approach Sean has pitched
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @knz and @sean-)
pkg/security/cert_expiry_cache.go
line 110 at r1 (raw file):
Previously, knz (Raphael 'kena' Poss) wrote…
the purpose of Grow is to verify that the allocation is possible before it is effectively performed. This call should be pulled up to before the call to
parentGauge.AddChild
.
Good catch, fixed.
4341be6
to
8851cae
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like your approach more; it's memory efficient and works since we're only talking about <1500 values atm (though I suspect it'll only be ~20 in the common case, and ~1k in the case of dyn usernames).
Reviewed 1 of 12 files at r1, 7 of 7 files at r2, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @knz)
pkg/security/cert_expiry_cache.go
line 31 at r2 (raw file):
// TODO(cameron): best to update this after understanding performance more const CacheCapacityMax = 1500
Maybe just set this to something "obscenely" high, like 65K or 256K, but prevent it from being an unlimited value. Argument: if someone has 65K certs, they should expect to spend a bit of RAM on this. Environments that use dynamically generated TLS certs could, conceivably, use more than 1500 usernames in the normal course of operation.
pkg/security/cert_expiry_cache.go
line 202 at r2 (raw file):
case <-ctx.Done(): return }
Can we throw in a:
const expirationSmearing = 5 * time.Second
time.Sleep(time.Duration(int(expirationSmearing) / CacheCapacityMax))
to smear the expiration over a 5s period of time (i.e., sleep 5ms between expiring values)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated the max capacity
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @knz)
pkg/security/cert_expiry_cache.go
line 202 at r2 (raw file):
Previously, sean- (Sean Chittenden) wrote…
Can we throw in a:
const expirationSmearing = 5 * time.Second time.Sleep(time.Duration(int(expirationSmearing) / CacheCapacityMax))
to smear the expiration over a 5s period of time (i.e., sleep 5ms between expiring values)?
Did you mean that we should add that amount to the current period here (an hour)? The purging process is set to fire every hour (after an initial jittered duration)
8710af2
to
d6933ba
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM with nits - but please update commit message + PR description with the new behavior.
Reviewed 1 of 12 files at r1, 3 of 7 files at r2, 4 of 4 files at r3, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @cameronnunez)
pkg/security/cert_expiry_cache.go
line 91 at r1 (raw file):
Previously, sean- (Sean Chittenden) wrote…
type ExpirableTTL stuct { CertTTL int64 // seconds until this cert expires ValueTTL int64 // seconds until this cached entry expires } // in Get(), add a check so that on-access we expire the cached values in advance of the FIFO queue being exhausted if time.Now().After(time.Unix(value.CertTTL, 0)) { c.mu.cache.Del(key) return 0, false }
Many environments will only have half-a-dozen entries in this cache.
This change should be communicated in the commit message and the PR description as well.
pkg/security/cert_expiry_cache.go
line 59 at r3 (raw file):
} func NewClientCertExpirationCache(
nit: exported functions should get a docstring.
pkg/security/cert_expiry_cache.go
line 214 at r3 (raw file):
c.mu.cache.Do(func(entry *cache.Entry) { gauge := entry.Value.(*aggmetric.Gauge) if gauge.Value() <= c.timeNow() {
nit: compute c.timeNow()
only once for the entire loop.
d6933ba
to
8e86266
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @knz)
pkg/security/cert_expiry_cache.go
line 59 at r3 (raw file):
Previously, knz (Raphael 'kena' Poss) wrote…
nit: exported functions should get a docstring.
Fixed.
pkg/security/cert_expiry_cache.go
line 214 at r3 (raw file):
Previously, knz (Raphael 'kena' Poss) wrote…
nit: compute
c.timeNow()
only once for the entire loop.
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 1 files at r4, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @cameronnunez)
8e86266
to
0283de7
Compare
This change introduces a new node-level, FIFO cache for storing the minimum time-until-expiration of the set of client certs seen (per user). Entries in the cache are username-gauge pairs, where the gauges contain the expiration values. Periodically, gauges with expiration values that are not in the future will be cleared from the cache. The purging process triggers hourly. The gauges in the cache are children of an aggregate gauge `ClientExpiration`, so the aggregated value for that metric is not meaningful and only serves as a link to the children. `ClientExpiration` (`security.certificate.expiration.client`) is exported as a metric and the children gauges are labeled by SQL user. Release note (security update): There is a new `server.client_cert_expiration_cache.capacity` setting which, when set to a non-zero number, makes it so that the minimum time-until-expiration of the set of client certificates seen is stored (for every user). This setting can be used to ensure client cert expirations are exported as a metric (if set to zero, the metric `security.certificate.expiration.client` will have a value of zero.
0283de7
to
abdaefa
Compare
TFTRs! |
bors r=sean-,knz |
Build failed (retrying...): |
Build failed (retrying...): |
Build failed: |
bors r- |
bors r=sean-,knz |
Build succeeded: |
blathers backport 22.2 |
Encountered an error creating backports. Some common things that can go wrong:
You might need to create your backport manually using the backport tool. error creating merge commit from abdaefa to blathers/backport-release-22.2-103592: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict [] you may need to manually resolve merge conflicts with the backport tool. Backport to branch 22.2 failed. See errors above. 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf. |
Fixes #100699.
This change introduces a new node-level, FIFO cache for storing the minimum time-until-expiration of the set of client certs seen (per user).
Entries in the cache are username-gauge pairs, where the gauges contain the expiration values. Periodically, gauges with
expiration values that are not in the future will be cleared from the cache. The purging process triggers hourly.
The gauges in the cache are children of an aggregate gauge
ClientExpiration
, so the aggregated value for that metricis not meaningful and only serves as a link to the children.
ClientExpiration
(security.certificate.expiration.client
)is exported as a metric and the children gauges are labeled by SQL user.
Release note (security update): There is a new
server.client_cert_expiration_cache.capacity
setting which, when set to a non-zero number, makes it so that the minimum time-until-expiration of the set of client certificates seen is stored (for every user). This setting can be used to ensure client cert expirations are exported as a metric (if set to zero, the metricsecurity.certificate.expiration.client
will have a value of zero).