pgwire: verifying user passwords is too aggressive #36160
Comments
cc @awoods187
I discussed this with @mberhault last week. We think there are some steps we could take going forward to mitigate the cost of the hashing, like cheap scheduler-directed sleeps combined with an in-memory cache of valid password hashes. But there are complications, such as how to invalidate the cache. I'm downgrading this from S-2 to S-3 because there's a clear workaround, which is to use certs. We should recommend that customers with unusual demands around frequently recycled connections use certs instead of password hashing. Also, it seems likely to me that the root-cause problems here were the ones you mentioned, not necessarily the password hashing itself.
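For concreteness, here is a minimal sketch of the kind of per-node hash cache being discussed (not CockroachDB's actual implementation; the `lookupHash` KV read and the TTL-based invalidation are assumptions made only for illustration):

```go
package pgwirecache

import (
	"sync"
	"time"

	"golang.org/x/crypto/bcrypt"
)

// entry remembers a hash recently fetched from the system.users table.
type entry struct {
	hashedPassword []byte
	fetchedAt      time.Time
}

// Cache avoids re-reading system.users on every new connection. Invalidation
// is the hard part: here entries simply expire after a TTL, so a revoked
// password could remain usable for up to ttl.
type Cache struct {
	mu      sync.Mutex
	ttl     time.Duration
	entries map[string]entry
	// lookupHash stands in for the expensive KV read this cache avoids
	// (hypothetical signature, not the real CockroachDB code path).
	lookupHash func(username string) ([]byte, error)
}

func NewCache(ttl time.Duration, lookup func(string) ([]byte, error)) *Cache {
	return &Cache{ttl: ttl, entries: map[string]entry{}, lookupHash: lookup}
}

// VerifyPassword checks a cleartext password, consulting the KV layer only
// when the cached hash is missing or stale.
func (c *Cache) VerifyPassword(username, password string) (bool, error) {
	c.mu.Lock()
	e, ok := c.entries[username]
	c.mu.Unlock()

	if !ok || time.Since(e.fetchedAt) > c.ttl {
		hash, err := c.lookupHash(username)
		if err != nil {
			return false, err
		}
		e = entry{hashedPassword: hash, fetchedAt: time.Now()}
		c.mu.Lock()
		c.entries[username] = e
		c.mu.Unlock()
	}

	// The bcrypt comparison is still paid on every attempt; only the KV read is saved.
	return bcrypt.CompareHashAndPassword(e.hashedPassword, []byte(password)) == nil, nil
}
```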
Do certs help in any way? (See cockroach/pkg/sql/pgwire/conn.go, line 1531 at 104cf10.)
So I was wrong before when I said anything about the
Passwords are stored using bcrypt and verification is intentionally expensive (in time and CPU). Certs will be orders of magnitude faster.
Just in case we're talking past each other: I'm complaining that a KV read (in the form of a SQL query, no less) is made for every new connection. I didn't even know about whatever protection against brute-force attacks we may have.
OK, my mistake - I'll look into this more.
I still don't think this is S-2 because it's not clear to me that it's really causing cluster instability. We haven't seen any problems like this without the other underlying problems, right? Would these issues have happened without the stuck-query and leaked-goroutine issues?
There are no particular "issues"; this is mostly a sanity/risk/performance thing. Not sure what S label it warrants; S-2 sounded like what I'd expect the "default" to be, but maybe it's not. Feel free to change.
I guess maybe I don't want an S-label, but a P-label? Except there's no P-2...
Thanks - the default should be no S or P labels! So I've removed the severity label. Thanks for filing this.
Certs don't support revocation (#29641), and the workaround for that is to use passwords :)

There are two sides to this: authorized applications sometimes create a storm of connections, and it's very easy for unauthorized users to DoS a cockroach server if they have network access to it. The latter means that firewalls are extremely important (they're a good idea anyway, but this is one of the biggest downsides to exposing a cockroach server to an untrusted network). We need to implement some sort of rate limiting of password guesses. We also need to be sure that we don't do anything to speed up correct password attempts in a way that makes things easier for a brute-force attacker.

One thing that I think would be safe would be to scale the bcrypt cost factor with the length of the password; see the sketch below. If you're using a 32-character random password, brute force is infeasible even without the deliberate slowness of bcrypt. And since database passwords (at least the ones where authentication cost is a concern) are typically machine-to-machine, there's no need for them to be short enough for humans to memorize, so long, high-entropy passwords are practical.
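A rough sketch of that length-scaled cost idea (the thresholds are illustrative assumptions, not anything CockroachDB actually ships):

```go
package authcost

import "golang.org/x/crypto/bcrypt"

// bcryptCostForLength sketches the proposal above: lower the bcrypt work
// factor as the password's own entropy grows, keeping the total cost of a
// brute-force attack roughly constant. The thresholds below are made up
// for illustration.
func bcryptCostForLength(passwordLen int) int {
	switch {
	case passwordLen >= 32:
		// ~192 bits of search space at 6 bits/char; bcrypt adds little here.
		return bcrypt.MinCost // 4
	case passwordLen >= 20:
		return 7
	default:
		// Short, human-memorable passwords keep the full work factor.
		return bcrypt.DefaultCost // 10
	}
}
```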
I believe the limitation here isn't actually the bcrypt, but the database lookups that occur to check the hash. We need to do something about that and re-measure afterward to see if bcrypt really is a problem (I suspect not).
The database lookups can be cached fairly easily. When that's done, maybe bcrypt won't be a problem any more for authorized connection storms. But it will definitely still be a DoS vector.
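To make the earlier point about rate limiting password guesses concrete, here is a minimal sketch of per-address throttling using golang.org/x/time/rate (illustrative only; the package and type names are hypothetical, not CockroachDB's implementation):

```go
package authlimit

import (
	"sync"

	"golang.org/x/time/rate"
)

// Limiter throttles password attempts per remote address so that a
// network-level attacker cannot burn CPU on unlimited bcrypt checks.
type Limiter struct {
	mu      sync.Mutex
	perAddr map[string]*rate.Limiter
	rps     rate.Limit
	burst   int
}

func NewLimiter(attemptsPerSecond float64, burst int) *Limiter {
	return &Limiter{
		perAddr: map[string]*rate.Limiter{},
		rps:     rate.Limit(attemptsPerSecond),
		burst:   burst,
	}
}

// AllowAttempt reports whether a password check should even be attempted
// for this remote address right now.
func (l *Limiter) AllowAttempt(remoteAddr string) bool {
	l.mu.Lock()
	lim, ok := l.perAddr[remoteAddr]
	if !ok {
		lim = rate.NewLimiter(l.rps, l.burst)
		l.perAddr[remoteAddr] = lim
	}
	l.mu.Unlock()
	return lim.Allow()
}
```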
Adding $0.02, I think we should prioritize this pre 20.1 if we can. We've run into several issues over the past couple of months where connection churn either made a cluster unavailable or contributed heavily to latency due to queries retrieving the hashed password. This seems like a footgun that we should remove, and I could easily see it coming up with the cockroach cloud beta. One clarifying question: would this connection string require the CPU-heavy password lookup or rely on the cert? If the former, this'll be more important for Cloud:
The connection string you pasted uses a password with an SSL connection and verification of the server certificate. This will trigger the bcrypt password verification that is being discussed here. For info: we use the default bcrypt (golang version) cost of 10. This means that checking a single password takes somewhere between 50 and 100ms (approximately). Here's a simple benchmark running on a GCP n1-standard-2:
Output:
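As a rough stand-in for that kind of measurement (a minimal sketch, not the benchmark or output referenced above), timing repeated bcrypt verifications at the default cost with golang.org/x/crypto/bcrypt might look like:

```go
package main

import (
	"fmt"
	"time"

	"golang.org/x/crypto/bcrypt"
)

func main() {
	// Hash once at the default cost (10), then time repeated verifications,
	// which is the work a server does for each password-authenticated connection.
	hash, err := bcrypt.GenerateFromPassword([]byte("hunter2"), bcrypt.DefaultCost)
	if err != nil {
		panic(err)
	}

	const iters = 20
	start := time.Now()
	for i := 0; i < iters; i++ {
		if err := bcrypt.CompareHashAndPassword(hash, []byte("hunter2")); err != nil {
			panic(err)
		}
	}
	fmt.Printf("avg per verification: %s\n", time.Since(start)/iters)
}
```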
Ben's proposal is to scale down the bcrypt cost as the length of the password increases. This makes bcrypt cheaper, but the search space of a brute-force attack much larger due to the larger number of characters. If we assume 6 bits of entropy per character, a 12-character password has a search space of 2^72 multiplied by a bcrypt cost of 10: 2^10. If you double the number of characters, you massively increase the search space to 2^144, at which point you can drop the bcrypt cost considerably without making brute-force searches cheaper. Of course, the caveat with those numbers is that they're only nice and easy when searching the entire set of possible passwords (assume you know the length, since you got into the DB). Many searches start with dictionary attacks and lists of known passwords (from various leaks). We would probably want to be very conservative when lowering the bcrypt cost; I don't think 1ms per verification is reasonable, especially without rate-limiting password authentication attempts.
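As a back-of-the-envelope check (assuming 6 bits of entropy per character and treating a bcrypt cost of $c$ as a $2^c$ work multiplier, as in the numbers above):

```latex
\text{brute-force work} \approx 2^{6L} \cdot 2^{c} = 2^{6L + c},
\qquad
2^{6 \cdot 12 + 10} = 2^{82}
\quad\text{vs.}\quad
2^{6 \cdot 24 + 4} = 2^{148} \gg 2^{82}.
```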
I don't know/understand the CPU cost of hashing, but my problem with what's going on here is that we do a query to a system table on every (non-root) connection open. I think we should find a solution such that we don't do that query over and over. Right? Or am I missing something?
On some level isn't #42519 the answer here?
SCRAM (#42519) is half of it. As in Marc's last message, there are two problems: the password hashing (solved by SCRAM) and the lookups on the users table, which need some sort of caching. Caching raises questions about what level of consistency is required when revoking a user's access (related to discussions we're having around availability, such as in #44134).
When I assigned this to myself/SQL Experience, I was imagining that we'd do some type of caching as described in #58869. Another thing this sounds related to is allowing system tables such as
Making system tables GLOBAL for production use sounds like a good idea (I'm not sure how easy that would be given the way we bootstrap access to these tables). However, it would be very bad for many test/CI scenarios that need to run through their setup/teardown processes as quickly as possible.
For the in-memory test cluster we could set the clock uncertainty to 0.
Just learned that external users of cockroach-go have been doing this unwittingly: https://github.com/cockroachdb/cockroach-go/blob/968ed42ffdd3d25b88cdaca11c0df7fdf7b71d4c/testserver/testserver.go#L279
I put #63365 up to centralize the broader discussion of |
FYI even when using cert auth we're seeing 9 requests sent to a remote node on each connection: 5 for |
@mlazowik That's a bit more than I would have expected. Could you share how you uncovered those numbers? Also, which version are you using? There are some improvements to this that landed in v20.2.4 (https://www.cockroachlabs.com/docs/releases/v20.2.4.html#performance-improvements).
I used |
Update: #58869 has been closed, so v21.2 will include an improvement that avoids the cost of looking up the password from the KV layer. The remaining work is to avoid hashing using SCRAM; see #42519. (Hashing wasn't the original complaint in this issue, but based on all the comments it looks like it has become an equal focus here.) I'm removing my assignment and taking this off the SQL Experience board since the SCRAM work is being done by the Server team.
work finished. |
Do we no longer read the |
Oh, I see that d4516d5 says we now cache the passwords. Nice!
Every time a non-root pgwire conn is opened, we check the password's hash (or lack of password) by reading the users table. This seems very wasteful. A customer has seen these queries sometimes get stuck (although that was probably a symptom of other badness), and they've caused goroutine leaks (addressed in #35776), so they're definitely generating some noise. More generally, we should be controlling traffic to system ranges as much as possible. Also, our tests generally use root requests, which don't do that query, so this is a source of difference between our benchmarks and client apps.
Checking these damn passwords over and over seems silly. We could cache the results per user per node for some time, and so decrease the cost of new connection storms considerably.
Don't really know who this falls under. @jordanlewis, you it?