kvserver: add generalized circuit-breaking to catch, e.g., mutex deadlocks #77366

tbg · 2022-03-03T22:42:23Z

Is your feature request related to a problem? Please describe.

The work in #33007 has given us good blast radius mitigations should a replica be unable to serve requests as a result of a loss of quorum. However, a replica can also become unavailable for other reasons, the most drastic of them being an inability to acquire a given mutex (e.g. a deadlock), but there could be others too.

Describe the solution you'd like

We could add a circuit breaker at the top of Replica.Send and trip it when the replica is "stuck". We would also need to add ways to detect that state.

Additional context

#72092 could be helpful to determine when to trip.
Also, I want to point out that if any replica mutex actually deadlocks, this will likely deadlock the entire store and then the node, so doing this project to specifically address deadlocks is likely not a good use of our time. However, the circuit breaker proposed here could play a part in #75944.

Jira issue: CRDB-13552

tbg · 2022-03-03T22:44:18Z

cc @mwang1026 heads up. We originally attempted to address this problem with the circuit breakers too, but have refocused for loss of quorum because that allows us to do a lot better for follower reads (not blocking them when replication is down).
Having typed out this issue, it also doesn't seem obvious what the solution is for the general problem of moving traffic off a "stuck replica". Deadlock mitigation in particular seems really thorny and wouldn't be handled satisfactorily by a circuit breaker.

nvanbenschoten · 2022-03-08T17:57:54Z

The current behavior of a mutex deadlock swelling to deadlock processing across an entire node or even an entire cluster, all while being entirely opaque to a would-be debugger, is terrible. And yet, gracefully living with/cordoning off mutex deadlocks seems like a very hard problem. Depending on which mutex hits a deadlock, it's difficult to understand the full scope of the operations that will also transitively get caught up in the deadlock. It's also not clear what the best recourse is for each of these operations to recover — meaning that it would be a lot of work to generalize this and the solution would still likely be limited to specific mutexes.

Have we considered a less graceful means of detecting and limiting the blast radius of mutex deadlocks? For instance, assuming we could detect mutex deadlocks without false positives, crashing the node that hit the deadlock (with sufficient debug information to help engineers diagnose the situation after the fact) would be a step in the right direction.

tbg · 2022-03-16T10:25:08Z

#66765 comes to mind. If we tracked our mutexes and checked every so often that they can be acquired with ~reasonable delay (for example, 10s) we would get very close.

I share your concerns about a generalized solution via the circuit breaker.

erikgrinaker · 2023-08-08T11:37:44Z

Instead of requiring active cooperation of a faulty node, the DistSender and lease protocol should be robust to faulty replicas. This will be handled by expiration-based leases and DistSender lease detection and request redirection (#105168).

I'll leave this open in case that doesn't pan out, or we need this for other reasons.

tbg added the C-enhancement label (Solution expected to add code/behavior + preserve backward-compat; pg compat issues are exception) Mar 3, 2022

blathers-crl bot added the T-kv label (KV Team) Mar 3, 2022

erikgrinaker mentioned this issue Mar 17, 2022

kvserver: reduce blast radius of Raft application errors #75944

Open

erikgrinaker mentioned this issue May 19, 2022

kvserver: disk stall prevents lease transfer #81100

Closed
