
pgwire: verifying user passwords is too aggressive #36160

Closed

andreimatei opened this issue Mar 26, 2019 · 31 comments
Labels
A-authentication (Pertains to authn subsystems) · A-cc-enablement (Pertains to current CC production issues or short-term projects) · A-security · A-sql-pgwire (pgwire protocol issues) · C-enhancement (Solution expected to add code/behavior + preserve backward-compat; pg compat issues are an exception) · C-performance (Perf of queries or internals; solution not expected to change functional behavior) · T-server-and-security (DB Server & Security)

Comments

andreimatei (Contributor) commented Mar 26, 2019

Every time a non-root pgwire conn is opened, we check the password's hash (or lack of a password) by reading the users table. This seems very wasteful. A customer has seen these queries sometimes get stuck (although that was probably a symptom of other badness), and they've caused goroutine leaks (addressed in #35776), so they're definitely generating some noise. Also, generally, we should be controlling traffic to system ranges as much as possible. And our tests generally use root connections, which don't do that query, so this is a source of difference between our benchmarks and client apps.

Checking these damn passwords over and over seems silly. We could cache the results per user per node for some time, and so decrease the cost of new connection storms considerably.
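
Roughly, something like this (just a sketch; the names and the TTL are made up, not actual CockroachDB code, and a plain TTL sidesteps the invalidation question):

package authcache

import (
	"sync"
	"time"
)

// entry caches the outcome of one users-table lookup.
type entry struct {
	hashedPassword []byte
	expiresAt      time.Time
}

// Cache is a hypothetical per-node cache of password lookups, so that a
// connection storm hits the system.users range once per TTL per user
// instead of once per connection.
type Cache struct {
	mu      sync.Mutex
	ttl     time.Duration
	entries map[string]entry
}

func New(ttl time.Duration) *Cache {
	return &Cache{ttl: ttl, entries: make(map[string]entry)}
}

// GetHashedPassword returns the cached hash if it is still fresh, and
// otherwise consults lookup (e.g. the query against system.users) and
// caches the result.
func (c *Cache) GetHashedPassword(
	user string, lookup func(string) ([]byte, error),
) ([]byte, error) {
	c.mu.Lock()
	e, ok := c.entries[user]
	c.mu.Unlock()
	if ok && time.Now().Before(e.expiresAt) {
		return e.hashedPassword, nil
	}
	hash, err := lookup(user)
	if err != nil {
		return nil, err
	}
	c.mu.Lock()
	c.entries[user] = entry{hashedPassword: hash, expiresAt: time.Now().Add(c.ttl)}
	c.mu.Unlock()
	return hash, nil
}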

Don't really know who this falls under. @jordanlewis, you it?

@andreimatei andreimatei added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) C-performance Perf of queries or internals. Solution not expected to change functional behavior. A-sql-pgwire pgwire protocol issues. S-2 Medium-high impact: many users impacted, risks of availability and difficult-to-fix data errors labels Mar 26, 2019
andreimatei (Contributor, Author)

cc @awoods187

@jordanlewis jordanlewis added S-3 Medium-low impact: incurs increased costs for some users (incl lower avail, recoverable bad data) and removed S-2 Medium-high impact: many users impacted, risks of availability and difficult-to-fix data errors labels Apr 2, 2019
jordanlewis (Member)

I discussed this with @mberhault last week. We think there are some steps we could take going forward to mitigate the cost of the hashing, like cheap scheduler-directed sleeps combined with an in-memory cache of valid password hashes. But there are complications, like how to invalidate the cache.

I'm downgrading this from S-2 to S-3 because there's a clear workaround, which is to use certs. We should recommend that customers with unusual demands around frequently recycled connections use certs instead of password hashing. Also, it seems likely to me that the root cause problems here were the ones you mentioned, not necessarily the password hashing itself.

andreimatei (Contributor, Author)

I'm downgrading this from S-2 to S-3 because there's a clear workaround, which is to use certs.

Do certs help in any way?
The user existence check (and the read of the pwd hash) seems unconditional:

exists, hashedPassword, err := sql.GetUserHashedPassword(

So I was wrong before when I said anything about the root user being special in this regard. Which makes the issue even worse - if you lose access to the users range, you can't open new conns?
Bumping back to S-2.

@andreimatei andreimatei added S-2 Medium-high impact: many users impacted, risks of availability and difficult-to-fix data errors and removed S-3 Medium-low impact: incurs increased costs for some users (incl lower avail, recoverable bad data) labels Apr 2, 2019
mberhault (Contributor)

Passwords are stored using bcrypt and verification is intentionally expensive (in time and CPU). Certs will be orders of magnitude faster.
I think it's worth at least having a metric about it so we can easily tell when this is happening. #36280 was filed for that. Further mitigation of password-checking cost is probably not worth doing given that this is 1) bad behavior and 2) can easily be avoided.

andreimatei (Contributor, Author)

Just in case we're talking past each other: I'm complaining that a KV read (in the form of a SQL query, no less) is made for every new connection. I didn't even know what pushback against brute-force attempts we may have.
So I'm not sure what you mean by either "bad behavior" or "can easily be avoided". I'm not asking to make password checks in particular less expensive, in case that was the impression.

jordanlewis (Member)

OK, my mistake - I'll look into this more.

jordanlewis (Member)

I still don't think this is S-2 because it's not clear to me that it's really causing cluster instability. We haven't seen any problems like this without the other underlying problems, right? Would these issues have happened without the stuck query and leaked goroutines issues?

andreimatei (Contributor, Author)

Would these issues have happened without the stuck query and leaked goroutines issues?

There are no particular "issues"; this is mostly a sanity / risk / performance thing. Not sure what S label it warrants; S-2 sounded like what I'd expect the "default" to be, but maybe it's not. Feel free to change.
https://cockroachlabs.atlassian.net/wiki/spaces/TS/pages/65503575/Issue+metadata is... not too much help :)

andreimatei (Contributor, Author)

I guess maybe I don't want an S-label, but a P-label? Except there's no P-2...

@jordanlewis jordanlewis removed the S-2 Medium-high impact: many users impacted, risks of availability and difficult-to-fix data errors label Apr 4, 2019
jordanlewis (Member)

Thanks - the default should be no S or P labels! So I've removed the severity label. Thanks for filing this.

bdarnell (Contributor)

I'm downgrading this from S-2 to S-3 because there's a clear workaround, which is to use certs.

Certs don't support revocation (#29641), and the workaround for that is to use passwords :)

There are two sides to this: authorized applications sometimes create a storm of connections, and it's very easy for unauthorized users to DoS a cockroach server if they have network access to it. The latter means that firewalls are extremely important (they're a good idea anyway, but this is one of the biggest downsides to exposing a cockroach server to an untrusted network).

We need to implement some sort of rate limiting of password guesses. We also need to be sure that we don't do anything to speed up correct password attempts in a way that makes things easier for a brute-force attacker.
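
Schematically, per-key throttling could look like this (a sketch with made-up numbers, using golang.org/x/time/rate; where the key comes from is an open question, not a proposal):

package authlimit

import (
	"sync"

	"golang.org/x/time/rate"
)

// Limiters tracks one token-bucket limiter per key (e.g. username+IP).
type Limiters struct {
	mu sync.Mutex
	m  map[string]*rate.Limiter
}

func New() *Limiters { return &Limiters{m: make(map[string]*rate.Limiter)} }

// Allow reports whether another password attempt for key may proceed:
// here, at most 1 attempt/sec with a burst of 5.
func (l *Limiters) Allow(key string) bool {
	l.mu.Lock()
	lim, ok := l.m[key]
	if !ok {
		lim = rate.NewLimiter(rate.Limit(1), 5)
		l.m[key] = lim
	}
	l.mu.Unlock()
	return lim.Allow()
}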

One thing that I think would be safe would be to scale the bcrypt cost factor with the length of the password. If you're using a 32-character random password, brute force is infeasible even without the deliberate slowness of bcrypt. And since database passwords (at least the ones where authentication cost is a concern) are typically machine-to-machine, there's no need for passwords to be short enough for humans to memorize, and good long passwords are more common.
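
For illustration, the scaling could be as simple as this (the thresholds and costs are made up, not a vetted policy):

package main

import (
	"fmt"

	"golang.org/x/crypto/bcrypt"
)

// costForLength returns a hypothetical bcrypt cost that decreases as the
// password gets longer, on the theory that a long random password already
// makes brute force infeasible.
func costForLength(n int) int {
	switch {
	case n >= 32:
		return bcrypt.MinCost // 4: length alone carries the security
	case n >= 20:
		return 7
	default:
		return bcrypt.DefaultCost // 10: short passwords keep the full cost
	}
}

func main() {
	for _, pw := range []string{
		"hunter2",
		"a-twenty-char-pass!!",
		"0123456789abcdef0123456789abcdef",
	} {
		fmt.Printf("len=%2d -> cost=%d\n", len(pw), costForLength(len(pw)))
	}
}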

jordanlewis (Member)

I believe the limitation here isn't actually the bcrypt, but the database lookups that occur to check the hash. We need to do something about that and re-measure afterward to see if bcrypt really is a problem (I suspect not).

bdarnell (Contributor)

The database lookups can be cached fairly easily. When that's done, maybe bcrypt won't be a problem any more for authorized connection storms. But it will definitely still be a DoS vector.

tim-o (Contributor) commented Oct 24, 2019

Adding my $0.02: I think we should prioritize this pre-20.1 if we can. We've run into several issues over the past couple of months where connection churn either made a cluster unavailable or contributed heavily to latency due to queries retrieving the hashed password. This seems like a footgun that we should remove, and I could easily see it coming up in the Cockroach Cloud beta.

One clarifying question: would this connection string require the CPU-heavy password lookup or rely on the cert? If the former, this'll be more important for Cloud:

postgres://user:password@cloudUrl:26257/defaultdb?sslmode=verify-full&sslrootcert=certs_dir/cluster-ca.crt

mberhault (Contributor)

The connection string you pasted uses a password with an SSL connection and verification of the server certificate. This will trigger the bcrypt password verification that is being discussed here.

For info: we use the default bcrypt (Go implementation) cost of 10. This means that checking a single password takes approximately 50-100ms.
The computation time depends on the cost (not the password length), with the algorithm performing 2^cost iterations. A cost of 4 takes about 1ms per password.

Here's a simple benchmark running on a GCP n1-standard-2:

package main

import (
  "fmt"
  "testing"

  "golang.org/x/crypto/bcrypt"
)

func BenchmarkBcrypt(b *testing.B) {
	password := []byte("1234567890abcdefghijklmnopqrztuvwxyz")
	for cost := 4; cost <= 10; cost += 2 {
		// Hash once up front; the benchmark measures verification only.
		pass, _ := bcrypt.GenerateFromPassword(password, cost)

		b.Run(fmt.Sprintf("cost=%d", cost),
			func(b *testing.B) {
				for i := 0; i < b.N; i++ {
					bcrypt.CompareHashAndPassword(pass, password)
				}
			})
	}
}

Output:

$ go test -v --bench=.
goos: linux
goarch: amd64
BenchmarkBcrypt/cost=4-2         	    1000	   1360523 ns/op
BenchmarkBcrypt/cost=6-2         	     300	   5261548 ns/op
BenchmarkBcrypt/cost=8-2         	     100	  20799518 ns/op
BenchmarkBcrypt/cost=10-2        	      20	  83481737 ns/op

Ben's proposal is to scale down the bcrypt cost as the length of the password increases. This makes bcrypt cheaper, but it makes the search space of a brute-force attack much larger due to the larger number of characters. If we assume 6 bits of entropy per character, a 12-character password has a search space of 2^72, multiplied by the 2^10 work factor of bcrypt cost 10. If you double the number of characters, you massively increase the search space to 2^144, at which point you can drop the bcrypt cost considerably without making brute-force searches cheaper.

Of course, the caveat with those numbers is that they only hold when searching the entire set of possible passwords (assuming you know the length, since you got into the DB). Many searches start with dictionary attacks and lists of known passwords (from various leaks). We would probably want to be very conservative when lowering the bcrypt cost; I don't think 1ms per verification is reasonable, especially without rate-limiting password authentication attempts.
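
To make the arithmetic concrete, here's a toy calculation using the 6-bits-per-character assumption from above (the two parameter pairs are just the examples already discussed):

package main

import "fmt"

// Back-of-the-envelope model: brute-force work is roughly
// 2^(entropy bits) * 2^(bcrypt cost) bcrypt-equivalent guesses.
func main() {
	for _, p := range []struct{ chars, cost int }{
		{12, 10}, // short password, full bcrypt cost: ~2^82
		{24, 4},  // double the length, minimum bcrypt cost: ~2^148
	} {
		fmt.Printf("%2d chars @ cost %2d: ~2^%d guesses\n",
			p.chars, p.cost, 6*p.chars+p.cost)
	}
}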

andreimatei (Contributor, Author)

I don't know/understand the CPU cost of hashing, but my problem with what's going on here is that we make a query to a system table on every (non-root) connection open. I think we should find a solution such that we don't do that query over and over. Right? Or am I missing something?

@rafiss rafiss self-assigned this Apr 7, 2021
ajwerner (Contributor) commented Apr 7, 2021

On some level isn't #42519 the answer here?

bdarnell (Contributor) commented Apr 7, 2021

SCRAM (#42519) is half of it. As in Marc's last message, there are two problems: the password hashing (solved by SCRAM) and the lookups on the users table (which need some sort of caching, raising questions around what level of consistency is required when revoking a user's access; related to discussions we're having around availability, such as in #44134).

rafiss (Collaborator) commented Apr 8, 2021

When I assigned this to myself/SQL Experience I was imagining that we'd do some type of caching as described in #58869

Another thing this sounds related to is allowing system tables such as users or role_options to be GLOBAL tables -- if we're fine making them slower on writes in order to make reads faster. I bring up this idea because there would be other benefits to allowing system data to be GLOBAL. It would be great for the read performance of pg_catalog queries in multiregion clusters, which we've also seen be slow and cause issues.

bdarnell (Contributor) commented Apr 8, 2021

Making system tables GLOBAL for production use sounds like a good idea (I'm not sure how easy that would be given the way we bootstrap access to these tables). However, it would be very bad for many test/CI scenarios that need to run through their setup/teardown processes as quickly as possible.

ajwerner (Contributor) commented Apr 8, 2021

Making system tables GLOBAL for production use sounds like a good idea (I'm not sure how easy that would be given the way we bootstrap access to these tables). However, it would be very bad for many test/CI scenarios that need to run through their setup/teardown processes as quickly as possible.

For the in-memory test cluster we could set the clock uncertainty to 0.

ajwerner (Contributor) commented Apr 8, 2021

Just learned that external users of cockroach-go have been doing this unwittingly: https://github.com/cockroachdb/cockroach-go/blob/968ed42ffdd3d25b88cdaca11c0df7fdf7b71d4c/testserver/testserver.go#L279

ajwerner (Contributor) commented Apr 9, 2021

I put #63365 up to centralize the broader discussion of LOCALITY GLOBAL system tables.

mlazowik

FYI: even when using cert auth, we're seeing 9 requests sent to a remote node on each connection: 5 for get-hashed-pwd and 4 for get-login-dependencies. That's an order of magnitude above the node-to-node latency, which seems like a high cost.

rafiss (Collaborator) commented May 12, 2021

@mlazowik That's a bit more than I would have expected. Could you share how you uncovered those numbers?

Also, which version are you using? There are some improvements to this that landed in v20.2.4 (https://www.cockroachlabs.com/docs/releases/v20.2.4.html#performance-improvements).

mlazowik commented May 12, 2021

I used /debug/logspy?count=1000&duration=10s&grep=<client ip>; we're on v20.2.8. FWIW, the remote region is ~150ms away.

@rafiss rafiss added the T-sql-foundations SQL Foundations Team (formerly SQL Schema + SQL Sessions) label May 12, 2021
@jlinder jlinder added the T-server-and-security DB Server & Security label Jun 16, 2021
@knz knz added A-authentication Pertains to authn subsystems A-cc-enablement Pertains to current CC production issues or short-term projects labels Jul 29, 2021
rafiss (Collaborator) commented Jul 29, 2021

Update: #58869 has been closed, so v21.2 will include an improvement that avoids the cost of looking up the password from the KV layer.

The remaining work is to avoid hashing using SCRAM; see #42519. (Hashing wasn't the original complaint in this issue, but based on all the comments it looks like it has become an equal focus here).

I'm removing my assignment and taking this off the SQL Experience board since the SCRAM work is being done by the Server team.

@rafiss rafiss removed their assignment Jul 29, 2021
@rafiss rafiss removed the T-sql-foundations SQL Foundations Team (formerly SQL Schema + SQL Sessions) label Jul 29, 2021
@ajstorm ajstorm removed the A-multitenancy Related to multi-tenancy label Jan 19, 2022
knz (Contributor) commented Feb 15, 2022

Work finished.

@knz knz closed this as completed Feb 15, 2022
andreimatei (Contributor, Author)

Do we no longer read the users table?

andreimatei (Contributor, Author)

Oh, I see that d4516d5 says we now cache the passwords. Nice!
