Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kvserver: add generalized circuit-breaking to catch, e.g., mutex deadlocks #77366

Open
tbg opened this issue Mar 3, 2022 · 4 comments
Open
Labels
C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) T-kv KV Team

Comments

@tbg
Copy link
Member

tbg commented Mar 3, 2022

Is your feature request related to a problem? Please describe.

The work in #33007 has given us good blast radius mitigations should a replica be unable to serve requests as a result of a loss of quorum. However, a replica can also become unavailable for other reasons, the most drastic of them being an inability to acquire a given mutex (e.g. a deadlock), but there could be others too.

Describe the solution you'd like

We could add a circuit breaker at the top of Replica.Send and trip it appropriately when the replica is "stuck", which we would also need to add ways to detect.

Additional context

#72092 could be helpful to determine when to trip.
Also, I want to point out that if any replica mutex actually deadlocks, this will likely deadlock the entire store and then the node, so doing this project to specifically address deadlocks is likely not a good use of our time. However, the circuit breaker proposed here could play a part in #75944.

Jira issue: CRDB-13552

@tbg tbg added the C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) label Mar 3, 2022
@blathers-crl blathers-crl bot added the T-kv KV Team label Mar 3, 2022
@tbg
Copy link
Member Author

tbg commented Mar 3, 2022

cc @mwang1026 heads up. We originally attempted to address this problem with the circuit breakers too, but have refocused for loss of quorum because that allows us to do a lot better for follower reads (not blocking them when replication is down).
Having typed out this issue it also doesn't seem obvious what the solution for the general problem of moving traffic off a "stuck replica" is, at least deadlock mitigation seems really thorny and wouldn't be handled satisfactorily by a circuit breaker.

@nvanbenschoten
Copy link
Member

The current behavior of a mutex deadlock swelling to deadlock processing across an entire node or even an entire cluster, all while being entirely opaque to a would-be debugger, is terrible. And yet, gracefully living with/cordoning off mutex deadlocks seems like a very hard problem. Depending on which mutex hits a deadlock, it's difficult to understand the full scope of the operations that will also transitively get caught up in the deadlock. It's also not clear what the best recourse is for each of these operations to recover — meaning that it would be a lot of work to generalize this and the solution would still likely be limited to specific mutexes.

Have we considered a less graceful means of detecting and limiting the blast radius of mutex deadlocks? For instance, assuming we could detect mutex deadlocks without false positives, crashing the node that hit the deadlock (with sufficient debug information to help engineers diagnose the situation after the fact) would be a step in the right direction.

@tbg
Copy link
Member Author

tbg commented Mar 16, 2022

#66765 comes to mind. If we tracked our mutexes and checked every so often that they can be acquired with ~reasonable delay (for example, 10s) we would get very close.

I share your concerns about a generalized solution via the circuit breaker.

@erikgrinaker
Copy link
Contributor

Instead of requiring active cooperation of a faulty node, the DistSender and lease protocol should instead be robust to faulty replicas. This will be handled by expiration-based leases and DistSender lease detection and request redirection (#105168).

I'll leave this open in case that doesn't pan out, or we need this for other reasons.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) T-kv KV Team
Projects
None yet
Development

No branches or pull requests

3 participants