
pgwire: verifying user passwords is too aggressive #36160

Closed

andreimatei opened this issue Mar 26, 2019 · 31 comments
Labels
A-authentication (Pertains to authn subsystems) · A-cc-enablement (Pertains to current CC production issues or short-term projects) · A-security · A-sql-pgwire (pgwire protocol issues) · C-enhancement (Solution expected to add code/behavior + preserve backward-compat; pg compat issues are an exception) · C-performance (Perf of queries or internals; solution not expected to change functional behavior) · T-server-and-security (DB Server & Security)

Comments

andreimatei (Contributor) commented Mar 26, 2019

Every time a non-root pgwire conn is opened, we check the password's hash (or lack of a password) by reading the users table. This seems very wasteful. A customer has seen these queries sometimes get stuck (although that was probably a symptom of other badness), and they've caused goroutine leaks (addressed in #35776), so they're definitely generating some noise. Also, generally, we should be controlling traffic to system ranges as much as possible. And our tests generally use root connections, which don't do that query, so this is a source of difference between our benchmarks and client apps.

Checking these damn passwords over and over seems silly. We could cache the results per user per node for some time, and so decrease the cost of new connection storms considerably.
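
Roughly, something like this (just a sketch; the names and the TTL are made up, not actual CockroachDB code, and a plain TTL sidesteps the invalidation question):

package authcache

import (
	"sync"
	"time"
)

// entry caches the outcome of one users-table lookup.
type entry struct {
	hashedPassword []byte
	expiresAt      time.Time
}

// Cache is a hypothetical per-node cache of password lookups, so that a
// connection storm hits the system.users range once per TTL per user
// instead of once per connection.
type Cache struct {
	mu      sync.Mutex
	ttl     time.Duration
	entries map[string]entry
}

func New(ttl time.Duration) *Cache {
	return &Cache{ttl: ttl, entries: make(map[string]entry)}
}

// GetHashedPassword returns the cached hash if it is still fresh, and
// otherwise consults lookup (e.g. the query against system.users) and
// caches the result.
func (c *Cache) GetHashedPassword(
	user string, lookup func(string) ([]byte, error),
) ([]byte, error) {
	c.mu.Lock()
	e, ok := c.entries[user]
	c.mu.Unlock()
	if ok && time.Now().Before(e.expiresAt) {
		return e.hashedPassword, nil
	}
	hash, err := lookup(user)
	if err != nil {
		return nil, err
	}
	c.mu.Lock()
	c.entries[user] = entry{hashedPassword: hash, expiresAt: time.Now().Add(c.ttl)}
	c.mu.Unlock()
	return hash, nil
}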

Don't really know who this falls under. @jordanlewis, you it?

@andreimatei andreimatei added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) C-performance Perf of queries or internals. Solution not expected to change functional behavior. A-sql-pgwire pgwire protocol issues. S-2 Medium-high impact: many users impacted, risks of availability and difficult-to-fix data errors labels Mar 26, 2019
andreimatei (Contributor, Author)

cc @awoods187

@jordanlewis jordanlewis added S-3 Medium-low impact: incurs increased costs for some users (incl lower avail, recoverable bad data) and removed S-2 Medium-high impact: many users impacted, risks of availability and difficult-to-fix data errors labels Apr 2, 2019
jordanlewis (Member)

I discussed this with @mberhault last week. We think there are some steps we could take going forward to mitigate the cost of the hashing, like cheap scheduler-directed sleeps combined with an in-memory cache of valid password hashes. But there are complications, like how to invalidate the cache.

I'm downgrading this from S-2 to S-3 because there's a clear workaround, which is to use certs. We should recommend that customers with unusual demands around frequently recycled connections use certs instead of password hashing. Also, it seems likely to me that the root cause problems here were the ones you mentioned, not necessarily the password hashing itself.

andreimatei (Contributor, Author)

I'm downgrading this from S-2 to S-3 because there's a clear workaround, which is to use certs.

Do certs help in any way?
The user existence check (and the read of the pwd hash) seems unconditional:

exists, hashedPassword, err := sql.GetUserHashedPassword(

So I was wrong before when I said anything about the root user being special in this regard. Which makes the issue even worse - if you lose access to the users range, you can't open new conns?
Bumping back to S-2.

@andreimatei andreimatei added S-2 Medium-high impact: many users impacted, risks of availability and difficult-to-fix data errors and removed S-3 Medium-low impact: incurs increased costs for some users (incl lower avail, recoverable bad data) labels Apr 2, 2019
mberhault (Contributor)

Passwords are stored using bcrypt and verification is intentionally expensive (in time and CPU). Certs will be orders of magnitude faster.
I think it's worth at least having a metric about it so we can easily tell when this is happening. #36280 was filed for that. Further mitigation of password-checking cost is probably not worth doing given that this is 1) bad behavior and 2) can easily be avoided.

andreimatei (Contributor, Author)

Just in case we're talking past each other: I'm complaining that a KV read (in the form of a SQL query, no less) is made for every new connection. I didn't even know what pushback against brute-force attempts we may have.
So I'm not sure what you mean by either "bad behavior" or "can easily be avoided". I'm not asking to make password checks in particular less expensive, in case that was the impression.

jordanlewis (Member)

OK, my mistake - I'll look into this more.

jordanlewis (Member)

I still don't think this is S-2 because it's not clear to me that it's really causing cluster instability. We haven't seen any problems like this without the other underlying problems, right? Would these issues have happened without the stuck query and leaked goroutines issues?

andreimatei (Contributor, Author)

Would these issues have happened without the stuck query and leaked goroutines issues?

There are no particular "issues"; this is mostly a sanity / risk / performance thing. Not sure what S label it warrants; S-2 sounded like what I'd expect the "default" to be, but maybe it's not. Feel free to change.
https://cockroachlabs.atlassian.net/wiki/spaces/TS/pages/65503575/Issue+metadata is... not too much help :)

andreimatei (Contributor, Author)

I guess maybe I don't want an S-label, but a P-label? Except there's no P-2...

@jordanlewis jordanlewis removed the S-2 Medium-high impact: many users impacted, risks of availability and difficult-to-fix data errors label Apr 4, 2019
jordanlewis (Member)

Thanks - the default should be no S or P labels! So I've removed the severity label. Thanks for filing this.

bdarnell (Contributor)

I'm downgrading this from S-2 to S-3 because there's a clear workaround, which is to use certs.

Certs don't support revocation (#29641), and the workaround for that is to use passwords :)

There are two sides to this: authorized applications sometimes create a storm of connections, and it's very easy for unauthorized users to DoS a cockroach server if they have network access to it. The latter means that firewalls are extremely important (they're a good idea anyway, but this is one of the biggest downsides to exposing a cockroach server to an untrusted network).

We need to implement some sort of rate limiting of password guesses. We also need to be sure that we don't do anything to speed up correct password attempts in a way that makes things easier for a brute-force attacker.
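
Schematically, per-key throttling could look like this (a sketch with made-up numbers, using golang.org/x/time/rate; where the key comes from is an open question, not a proposal):

package authlimit

import (
	"sync"

	"golang.org/x/time/rate"
)

// Limiters tracks one token-bucket limiter per key (e.g. username+IP).
type Limiters struct {
	mu sync.Mutex
	m  map[string]*rate.Limiter
}

func New() *Limiters { return &Limiters{m: make(map[string]*rate.Limiter)} }

// Allow reports whether another password attempt for key may proceed:
// here, at most 1 attempt/sec with a burst of 5.
func (l *Limiters) Allow(key string) bool {
	l.mu.Lock()
	lim, ok := l.m[key]
	if !ok {
		lim = rate.NewLimiter(rate.Limit(1), 5)
		l.m[key] = lim
	}
	l.mu.Unlock()
	return lim.Allow()
}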

One thing that I think would be safe would be to scale the bcrypt cost factor with the length of the password. If you're using a 32-character random password, brute force is infeasible even without the deliberate slowness of bcrypt. And since database passwords (at least the ones where authentication cost is a concern) are typically machine-to-machine, there's no need for passwords to be short enough for humans to memorize, and good long passwords are more common.
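
For illustration, the scaling could be as simple as this (the thresholds and costs are made up, not a vetted policy):

package main

import (
	"fmt"

	"golang.org/x/crypto/bcrypt"
)

// costForLength returns a hypothetical bcrypt cost that decreases as the
// password gets longer, on the theory that a long random password already
// makes brute force infeasible.
func costForLength(n int) int {
	switch {
	case n >= 32:
		return bcrypt.MinCost // 4: length alone carries the security
	case n >= 20:
		return 7
	default:
		return bcrypt.DefaultCost // 10: short passwords keep the full cost
	}
}

func main() {
	for _, pw := range []string{
		"hunter2",
		"a-twenty-char-pass!!",
		"0123456789abcdef0123456789abcdef",
	} {
		fmt.Printf("len=%2d -> cost=%d\n", len(pw), costForLength(len(pw)))
	}
}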

jordanlewis (Member)

I believe the limitation here isn't actually the bcrypt, but the database lookups that occur to check the hash. We need to do something about that and re-measure afterward to see if bcrypt really is a problem (I suspect not).

bdarnell (Contributor)

The database lookups can be cached fairly easily. When that's done, maybe bcrypt won't be a problem any more for authorized connection storms. But it will definitely still be a DoS vector.

tim-o (Contributor) commented Oct 24, 2019

Adding my $0.02: I think we should prioritize this pre-20.1 if we can. We've run into several issues over the past couple of months where connection churn either made a cluster unavailable or contributed heavily to latency due to queries retrieving the hashed password. This seems like a footgun that we should remove, and I could easily see it coming up in the Cockroach Cloud beta.

One clarifying question: would this connection string require the CPU-heavy password lookup or rely on the cert? If the former, this'll be more important for Cloud:

postgres://user:password@cloudUrl:26257/defaultdb?sslmode=verify-full&sslrootcert=certs_dir/cluster-ca.crt

mberhault (Contributor)

The connection string you pasted uses a password with an SSL connection and verification of the server certificate. This will trigger the bcrypt password verification that is being discussed here.

For info: we use the default bcrypt (Go implementation) cost of 10. This means that checking a single password takes approximately 50-100ms.
The computation time depends on the cost (not the password length), with the algorithm performing 2^cost iterations. A cost of 4 takes about 1ms per password.

Here's a simple benchmark running on a GCP n1-standard-2:

package main

import (
  "fmt"
  "testing"

  "golang.org/x/crypto/bcrypt"
)

func BenchmarkBcrypt(b *testing.B) {
	password := []byte("1234567890abcdefghijklmnopqrztuvwxyz")
	for cost := 4; cost <= 10; cost += 2 {
		// Hash once up front; the benchmark measures verification only.
		pass, _ := bcrypt.GenerateFromPassword(password, cost)

		b.Run(fmt.Sprintf("cost=%d", cost),
			func(b *testing.B) {
				for i := 0; i < b.N; i++ {
					bcrypt.CompareHashAndPassword(pass, password)
				}
			})
	}
}

Output:

$ go test -v --bench=.
goos: linux
goarch: amd64
BenchmarkBcrypt/cost=4-2         	    1000	   1360523 ns/op
BenchmarkBcrypt/cost=6-2         	     300	   5261548 ns/op
BenchmarkBcrypt/cost=8-2         	     100	  20799518 ns/op
BenchmarkBcrypt/cost=10-2        	      20	  83481737 ns/op

Ben's proposal is to scale down the bcrypt cost as the length of the password increases. This makes bcrypt cheaper, but it makes the search space of a brute-force attack much larger due to the larger number of characters. If we assume 6 bits of entropy per character, a 12-character password has a search space of 2^72, multiplied by the 2^10 work factor of bcrypt cost 10. If you double the number of characters, you massively increase the search space to 2^144, at which point you can drop the bcrypt cost considerably without making brute-force searches cheaper.

Of course, the caveat with those numbers is that they only hold when searching the entire set of possible passwords (assuming you know the length, since you got into the DB). Many searches start with dictionary attacks and lists of known passwords (from various leaks). We would probably want to be very conservative when lowering the bcrypt cost; I don't think 1ms per verification is reasonable, especially without rate-limiting password authentication attempts.
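
To make the arithmetic concrete, here's a toy calculation using the 6-bits-per-character assumption from above (the two parameter pairs are just the examples already discussed):

package main

import "fmt"

// Back-of-the-envelope model: brute-force work is roughly
// 2^(entropy bits) * 2^(bcrypt cost) bcrypt-equivalent guesses.
func main() {
	for _, p := range []struct{ chars, cost int }{
		{12, 10}, // short password, full bcrypt cost: ~2^82
		{24, 4},  // double the length, minimum bcrypt cost: ~2^148
	} {
		fmt.Printf("%2d chars @ cost %2d: ~2^%d guesses\n",
			p.chars, p.cost, 6*p.chars+p.cost)
	}
}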

andreimatei (Contributor, Author)

I don't know/understand the CPU cost of hashing, but my problem with what's going on here is that we make a query to a system table on every (non-root) connection open. I think we should find a solution such that we don't do that query over and over. Right? Or am I missing something?

@rafiss rafiss self-assigned this Apr 7, 2021
ajwerner (Contributor) commented Apr 7, 2021

On some level isn't #42519 the answer here?

bdarnell (Contributor) commented Apr 7, 2021

SCRAM (#42519) is half of it. As in Marc's last message, there are two problems: the password hashing (solved by SCRAM) and the lookups on the users table (which need some sort of caching, raising questions around what level of consistency is required when revoking a user's access; related to discussions we're having around availability, such as in #44134).

rafiss (Collaborator) commented Apr 8, 2021

When I assigned this to myself/SQL Experience I was imagining that we'd do some type of caching as described in #58869

Another thing this sounds related to is allowing system tables such as users or role_options to be GLOBAL tables -- if we're fine making them slower on writes in order to make reads faster. I bring up this idea because there would be other benefits to allowing system data to be GLOBAL. It would be great for the read performance of pg_catalog queries in multiregion clusters, which we've also seen be slow and cause issues.

bdarnell (Contributor) commented Apr 8, 2021

Making system tables GLOBAL for production use sounds like a good idea (I'm not sure how easy that would be given the way we bootstrap access to these tables). However, it would be very bad for many test/CI scenarios that need to run through their setup/teardown processes as quickly as possible.

ajwerner (Contributor) commented Apr 8, 2021

Making system tables GLOBAL for production use sounds like a good idea (I'm not sure how easy that would be given the way we bootstrap access to these tables). However, it would be very bad for many test/CI scenarios that need to run through their setup/teardown processes as quickly as possible.

For the in-memory test cluster we could set the clock uncertainty to 0.

ajwerner (Contributor) commented Apr 8, 2021

Just learned that external users of cockroach-go have been doing this unwittingly: https://github.com/cockroachdb/cockroach-go/blob/968ed42ffdd3d25b88cdaca11c0df7fdf7b71d4c/testserver/testserver.go#L279

ajwerner (Contributor) commented Apr 9, 2021

I put #63365 up to centralize the broader discussion of LOCALITY GLOBAL system tables.

mlazowik

FYI: even when using cert auth, we're seeing 9 requests sent to a remote node on each connection: 5 for get-hashed-pwd and 4 for get-login-dependencies. That's an order of magnitude above the node-to-node latency, which seems like a high cost.

rafiss (Collaborator) commented May 12, 2021

@mlazowik That's a bit more than I would have expected. Could you share how you uncovered those numbers?

Also, which version are you using? There are some improvements to this that landed in v20.2.4 (https://www.cockroachlabs.com/docs/releases/v20.2.4.html#performance-improvements).

mlazowik commented May 12, 2021

I used /debug/logspy?count=1000&duration=10s&grep=<client ip>; we're on v20.2.8. FWIW, the remote region is ~150ms away.

@rafiss rafiss added the T-sql-foundations SQL Foundations Team (formerly SQL Schema + SQL Sessions) label May 12, 2021
@jlinder jlinder added the T-server-and-security DB Server & Security label Jun 16, 2021
@knz knz added A-authentication Pertains to authn subsystems A-cc-enablement Pertains to current CC production issues or short-term projects labels Jul 29, 2021
rafiss (Collaborator) commented Jul 29, 2021

Update: #58869 has been closed, so v21.2 will include an improvement that avoids the cost of looking up the password from the KV layer.

The remaining work is to avoid hashing using SCRAM; see #42519. (Hashing wasn't the original complaint in this issue, but based on all the comments it looks like it has become an equal focus here).

I'm removing my assignment and taking this off the SQL Experience board since the SCRAM work is being done by the Server team.

@rafiss rafiss removed their assignment Jul 29, 2021
@rafiss rafiss removed the T-sql-foundations SQL Foundations Team (formerly SQL Schema + SQL Sessions) label Jul 29, 2021
@ajstorm ajstorm removed the A-multitenancy Related to multi-tenancy label Jan 19, 2022
knz (Contributor) commented Feb 15, 2022

Work finished.

@knz knz closed this as completed Feb 15, 2022
andreimatei (Contributor, Author)

Do we no longer read the users table?

andreimatei (Contributor, Author)

Oh, I see that d4516d5 says we now cache the passwords. Nice!
