
docs: new RFC on always-available users #44134

Open · knz wants to merge 1 commit into master from 20200119-users-rfc
Conversation

@knz (Contributor) commented Jan 20, 2020

This work enables the creation of multiple accounts besides `root` that have the same special properties: the ability to log in, monitor, and troubleshoot a cluster when the system ranges are unavailable.

Link to RFC

Release note: None

@knz requested a review from a team as a code owner on January 20, 2020 10:23

@knz force-pushed the 20200119-users-rfc branch 6 times, most recently from a931e7e to 374ba55 on January 20, 2020 10:45
@knz force-pushed the 20200119-users-rfc branch from 374ba55 to eb1b55d on January 20, 2020 10:48
@aaron-crl commented
I like this approach because it extracts functionality from the root account and prepares for more granular and accountable access for other customer defined accounts.

@knz (Contributor, Author) commented Jan 24, 2020

Remark from @tbg: de-emphasize the use of gossip. It's possible to achieve an eventually consistent RAM-only copy of a table by other means than gossip.

@tbg removed the request for review from a team on June 25, 2020 09:50
@bdarnell (Contributor) left a comment

Overall I like the approach here. There are a lot of reasons to get the KV ops out of the critical path, and I think gossip is the right alternative. I'm not concerned about the potential for propagation delays, although this is something we should validate with our more security-sensitive customers.

> de-emphasize the use of gossip. It's possible to achieve an eventually consistent RAM-only copy of a table by other means than gossip.

It's possible, but why? Gossip is a pretty good way to maintain this kind of cache, and it seems to be reasonably robust in practice. What advantages would the hypothetical alternative have?
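
To make the comparison concrete, here is a minimal Go sketch of the RAM-only cache under discussion. It does not use CockroachDB's actual gossip API; the point is that from the cache's perspective the transport only has to deliver update callbacks, whether that transport is gossip or something else:

```go
package usercache

import "sync"

// UserInfo is the denormalized record the RFC proposes to keep in RAM:
// enough to authenticate and authorize without touching the KV layer.
type UserInfo struct {
	Username       string
	HashedPassword []byte
	Roles          []string
}

// Cache is an eventually consistent, RAM-only copy of the always-available
// user records. The propagation layer only needs to deliver "this user
// changed" notifications; whether that is gossip or some other mechanism
// is invisible to the cache itself.
type Cache struct {
	mu    sync.RWMutex
	users map[string]UserInfo
}

func NewCache() *Cache {
	return &Cache{users: make(map[string]UserInfo)}
}

// Apply is the callback invoked by the propagation layer (e.g. a gossip
// callback) whenever a cached user record is added or updated.
func (c *Cache) Apply(u UserInfo) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.users[u.Username] = u
}

// Lookup serves authentication without any KV read; the result may be
// slightly stale, which is the trade-off discussed in this thread.
func (c *Cache) Lookup(username string) (UserInfo, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	u, ok := c.users[username]
	return u, ok
}
```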



docs/RFCS/20200119_always_available_users.md, line 45 at r1 (raw file):

> authorize a regular user includes:
>
> - for authentication: `system.namespace`, `system.users`,

The system.namespace table is not addressed in this document (it used to be gossiped, but now it has been moved out of the system config span).

Do we need to add the role_options table here to bring it up to date with current CRDB?


docs/RFCS/20200119_always_available_users.md, line 83 at r1 (raw file):

> Previously, only `root` could access and monitor an
> unavailable cluster. With this change enabled, all users
> with the `admin` role can access and monitor when the cluster is

With the admin role being broken out into multiple permissions, I don't think the always-available capability should be tied to the admin role specifically. A read-only user with enough permissions to view the console would be a good candidate for this treatment. Maybe this should be a permission that could be granted to any user (and would be included in the bundle of permissions for the admin role).


docs/RFCS/20200119_always_available_users.md, line 95 at r1 (raw file):

> The RFC also works under the following assumptions:
>
> - regular user accounts can be numerous (tens of thousands).

There could be huge numbers of users, but this is rare. For users with more modest needs, I think it could be desirable to mark all their users as cached. Maybe we should facilitate this by marking the first N users created as cached (although that sets up a bit of a delayed surprise).

Note that to support truly unbounded numbers of users we'd need to move the system.users table out of the system config span.


docs/RFCS/20200119_always_available_users.md, line 98 at r1 (raw file):

> - only few (hundreds or less) of these accounts are meant
>   to access and monitor/operate an unavailable cluster.
> - small propagation delays in updates to user accounts and

One of the reasons we do a KV read to the system.users table on every login is because of an assumption when this was implemented that small delays would not be tolerated - dropping a user or revoking a grant needed to be effectively synchronous.

I've always been skeptical of this assumption, but we should validate which of these assumptions is more valid. And if propagation delays are tolerated, we should be able to do more caching of user lookups even for users that are not covered by this always-available mechanism to reduce load on the system ranges.

Another possible answer is that some propagation delay can be tolerated, but there should be an upper bound. (I've seen users get nervous about the lack of formal guarantees around gossip, even though it's usually quite fast). This would argue for some sort of time-limited lease, although the lease duration would need to be quite large to be useful for debugging struggling clusters.
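
A rough sketch of the time-limited-lease variant (the type names and the explicit `now` parameter are illustrative assumptions, not a concrete proposal): each cached credential carries an expiry, so a dropped user or revoked grant stops working no later than the lease duration, even while the node cannot reach the system ranges.

```go
package usercache

import (
	"sync"
	"time"
)

// leasedEntry is a cached credential that is trusted only until its lease
// expires; after that the node must re-read system.users (or refuse login).
type leasedEntry struct {
	hashedPassword []byte
	leaseExpiry    time.Time
}

// LeasedCache bounds how stale a cached credential can get: dropping a user
// or revoking a grant takes effect no later than the lease duration.
type LeasedCache struct {
	mu            sync.RWMutex
	entries       map[string]leasedEntry
	leaseDuration time.Duration // would need to be large enough to debug a struggling cluster
}

func NewLeasedCache(leaseDuration time.Duration) *LeasedCache {
	return &LeasedCache{
		entries:       make(map[string]leasedEntry),
		leaseDuration: leaseDuration,
	}
}

// Refresh extends the lease; it is called whenever the node successfully
// reads the user from the system ranges.
func (c *LeasedCache) Refresh(username string, hashedPassword []byte) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.entries[username] = leasedEntry{
		hashedPassword: hashedPassword,
		leaseExpiry:    time.Now().Add(c.leaseDuration),
	}
}

// Lookup returns the cached credential only while the lease is still valid.
func (c *LeasedCache) Lookup(username string, now time.Time) ([]byte, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	e, ok := c.entries[username]
	if !ok || now.After(e.leaseExpiry) {
		return nil, false
	}
	return e.hashedPassword, true
}
```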


docs/RFCS/20200119_always_available_users.md, line 113 at r1 (raw file):

>   username           STRING NOT NULL,
>   hashedPassword     STRING,
>   roles              []STRING NOT NULL,

Note that the system.users table is already gossiped via the system config range. So really what we're doing here is denormalizing the role membership into a users table for a subset of users.

Should we just do that into the existing users table for everyone? This would be more expensive for clusters with many users, but the fact that the users table is already gossiped implies an effective limit on the number of users already.


docs/RFCS/20200119_always_available_users.md, line 165 at r1 (raw file):

> would remain blocked.
>
> The RFC is considering either of the following three solutions.

Another option would be to encode the necessary contents of the web_sessions record into the cookie itself (signed and possibly encrypted) so it would be self-contained.

Again, the original reason for the KV read on every access was to ensure that revocation of a session token would be immediate. As long as we're OK with a possible delay in the effect of revocation, a self-contained session cookie would let us handle this by gossiping only revoked tokens, or just failing open if the KV read times out.
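
For illustration, a self-contained signed cookie could look roughly like the sketch below. The payload fields and the HMAC scheme are assumptions for the sketch, not what CockroachDB implements today; validation needs no KV read, and revocation before expiry would have to come from elsewhere, e.g. a gossiped set of revoked session IDs.

```go
package session

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/base64"
	"encoding/json"
	"errors"
	"strings"
	"time"
)

// Payload is the session state that would otherwise live in system.web_sessions.
type Payload struct {
	SessionID int64     `json:"sid"`
	Username  string    `json:"user"`
	ExpiresAt time.Time `json:"exp"`
}

func sign(secret, msg []byte) string {
	mac := hmac.New(sha256.New, secret)
	mac.Write(msg)
	return base64.RawURLEncoding.EncodeToString(mac.Sum(nil))
}

// Encode produces a self-contained cookie value: base64(payload) + "." + HMAC.
func Encode(secret []byte, p Payload) (string, error) {
	body, err := json.Marshal(p)
	if err != nil {
		return "", err
	}
	encoded := base64.RawURLEncoding.EncodeToString(body)
	return encoded + "." + sign(secret, []byte(encoded)), nil
}

// Decode verifies the signature and expiry without any KV read. Revoking a
// session before it expires would need an extra check against, e.g., a
// gossiped set of revoked session IDs.
func Decode(secret []byte, cookie string, now time.Time) (Payload, error) {
	var p Payload
	parts := strings.SplitN(cookie, ".", 2)
	if len(parts) != 2 {
		return p, errors.New("malformed cookie")
	}
	if !hmac.Equal([]byte(sign(secret, []byte(parts[0]))), []byte(parts[1])) {
		return p, errors.New("bad signature")
	}
	body, err := base64.RawURLEncoding.DecodeString(parts[0])
	if err != nil {
		return p, err
	}
	if err := json.Unmarshal(body, &p); err != nil {
		return p, err
	}
	if now.After(p.ExpiresAt) {
		return p, errors.New("session expired")
	}
	return p, nil
}
```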


docs/RFCS/20200119_always_available_users.md, line 187 at r1 (raw file):

> before the stored table.
>
> The difference between 1.1 and 1.2 is that 1.1 would make the Admin UI

1.2 also doesn't have any cache invalidation - without updates via gossip, a revoked session token would still be usable until the cache times out.


docs/RFCS/20200119_always_available_users.md, line 204 at r1 (raw file):

> on the admin UI (or active within the same one-hour period).
>
> Observing that web tokens are only valid for one hour each, we can

Where does this one hour value come from? server.web_session_timeout defaults to one week. I would not assume that we can afford to cache the entire web sessions table.


docs/RFCS/20200119_always_available_users.md, line 245 at r1 (raw file):

> table nor RAM caches.
>
> However, any deployment of TLS certificates needs a *revocation

We now support revocation of TLS certificates via OCSP. So this is an option for sophisticated users who run their own CA and OCSP server.

We don't provide such tools out of the box, but I'm also not convinced that they're necessary - if "any deployment of TLS certs" needed it, we shouldn't be using client certs at all without providing it. I think we need to solve the authorization problem for TLS certs to be a possible solution here, but I'd be OK with the status quo for revocation.

From a UX perspective, though, I don't think it's a good idea to have TLS certs as the emergency-access path for clusters that primarily operate by passwords. So I agree with the overall decision to focus on a more highly-available users table instead of moving things out of band via certs.

@knz (Contributor, Author) left a comment


docs/RFCS/20200119_always_available_users.md, line 95 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

> There could be huge numbers of users, but this is rare. For users with more modest needs, I think it could be desirable to mark all their users as cached. Maybe we should facilitate this by marking the first N users created as cached (although that sets up a bit of a delayed surprise).
>
> Note that to support truly unbounded numbers of users we'd need to move the system.users table out of the system config span.

I have thought a little more about this, and I'd like to propose enabling this feature per account or role via a new role option, CACHED. A background process on each node would then periodically refresh the cache from system.users by picking up the entries that have this attribute set.
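
A minimal sketch of such a per-node refresh loop (the `CACHED` role option does not exist today, and the query against `system.role_options` is illustrative only):

```go
package usercache

import (
	"context"
	"database/sql"
	"time"
)

// refreshCachedUsers periodically re-reads the accounts that carry the
// (hypothetical) CACHED role option and hands the fresh snapshot to apply(),
// which would swap it into the node's in-RAM cache.
func refreshCachedUsers(ctx context.Context, db *sql.DB, interval time.Duration,
	apply func(map[string][]byte)) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			// Hypothetical query: the CACHED role option and the join against
			// system.role_options are illustrative, not existing syntax.
			rows, err := db.QueryContext(ctx, `
				SELECT u.username, u."hashedPassword"
				FROM system.users AS u
				JOIN system.role_options AS o ON o.username = u.username
				WHERE o.option = 'CACHED'`)
			if err != nil {
				// The system ranges may be unavailable; keep serving the
				// previous snapshot rather than failing.
				continue
			}
			fresh := make(map[string][]byte)
			for rows.Next() {
				var username string
				var hashed []byte
				if err := rows.Scan(&username, &hashed); err == nil {
					fresh[username] = hashed
				}
			}
			rows.Close()
			apply(fresh)
		}
	}
}
```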


docs/RFCS/20200119_always_available_users.md, line 98 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

> One of the reasons we do a KV read to the system.users table on every login is because of an assumption when this was implemented that small delays would not be tolerated - dropping a user or revoking a grant needed to be effectively synchronous.
>
> I've always been skeptical of this assumption, but we should validate which of these assumptions is more valid. And if propagation delays are tolerated, we should be able to do more caching of user lookups even for users that are not covered by this always-available mechanism to reduce load on the system ranges.
>
> Another possible answer is that some propagation delay can be tolerated, but there should be an upper bound. (I've seen users get nervous about the lack of formal guarantees around gossip, even though it's usually quite fast). This would argue for some sort of time-limited lease, although the lease duration would need to be quite large to be useful for debugging struggling clusters.

I generally agree with the idea of a cache + expiration lease.

However, even with that, we still need to mark some privileged accounts as non-expirable for the purpose of troubleshooting. This is the primary goal of this RFC. So I suggest we keep the distinction between "general users, who may have cached credentials with an expiry" and "always-available users, whose cached credentials never expire".
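
Concretely, the split could be as small as an optional expiry on the cached entry (purely illustrative; names are made up for the sketch):

```go
package usercache

import "time"

// cachedCredential distinguishes the two tiers discussed here: regular users
// whose cached entry expires, and always-available users whose entry does not.
type cachedCredential struct {
	hashedPassword  []byte
	leaseExpiry     time.Time // ignored when alwaysAvailable is set
	alwaysAvailable bool
}

// usable reports whether the cached entry may still be trusted at time now.
func (c cachedCredential) usable(now time.Time) bool {
	return c.alwaysAvailable || now.Before(c.leaseExpiry)
}
```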


docs/RFCS/20200119_always_available_users.md, line 113 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

> Note that the system.users table is already gossiped via the system config range. So really what we're doing here is denormalizing the role membership into a users table for a subset of users.
>
> Should we just do that into the existing users table for everyone? This would be more expensive for clusters with many users, but the fact that the users table is already gossiped implies an effective limit on the number of users already.

1. I was assuming that system.users is not gossiped. If it is, that is a bug and it needs to be corrected.

2. This RFC proposes to have a cache for certain users, and since we are bound to support multiple data sources for user credentials, I can see this proposal becoming a cache for any user across all sources.

@rafiss (Collaborator) commented Jul 26, 2022

I'm linking this issue here as it seems related: #82916. In this case, the loss of an internal (timeseries) range rendered the entire cluster inoperable, so even the root user could not log in.

Labels: X-noremind