
kvprober: find single range issues by repeatedly probing problem ranges with issues after randomly finding a candidate problem range #74407

Closed
joshimhoff opened this issue Jan 4, 2022 · 11 comments · Fixed by #87436
Assignees
Labels
A-kv-observability C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) O-sre For issues SRE opened or otherwise cares about tracking.

Comments

@joshimhoff
Collaborator

joshimhoff commented Jan 4, 2022

Is your feature request related to a problem? Please describe.
kvprober probes all ranges. Single range issues happen. kvprober will detect such issues but the resulting error rate will be extremely low (1 / number of ranges in the cluster). This makes alerting on such an issue hard.

Describe the solution you'd like
kvprober could "remember" when it probes a range and doesn't get back a successful (or fast) response. kvprober could then probe that range regularly, in a separate goroutine from the one in which it is probing all ranges. kvprober could generate metrics on the error rate & latency profile of the candidate problem range. SRE could alert on this. Basically, when kvprober discovers a candidate problem range, it focuses on producing data about that range.

Describe alternatives you've considered
An alternative is to not do this, but instead write a long time-window, log-based alert on multiple errors in a row for RPCs to a specific range. This might be workable, though the time to page would be much longer than with the above. It would also need #74405.

Additional context
N/A

@tbg & @andreimatei: Wdyt about this idea?

Jira issue: CRDB-12069

@joshimhoff joshimhoff added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) O-sre For issues SRE opened or otherwise cares about tracking. labels Jan 4, 2022
@joshimhoff joshimhoff changed the title kvprober: find single range issues by repeatedly probing ranges with issues after randomly finding one kvprober: find single range issues by repeatedly probing problem ranges with issues after randomly finding a candidate problem range Jan 4, 2022
@andreimatei
Contributor

I think it'd be nice to keep an eye on ranges with problems. I think kvprober should integrate with the circuit breaking that I believe Tobi is introducing in #71806.

@joshimhoff
Collaborator Author

What do you think about what @andreimatei is saying, @tbg? Do you see benefits to kvprober integrating with circuit breaking? What would the integration look like concretely?

If you do, I still think there is an argument for something like this:

kvprober could "remember" when it probes a range and doesn't get back a successful (or fast) response. kvprober could then probe that range regularly, in a separate goroutine from the one in which it is probing all ranges. kvprober could generate metrics on the error rate & latency profile of the candidate problem range. SRE could alert on this. Basically, when kvprober discovers a candidate problem range, it focuses on producing data about that range.

The argument is that having two mechanisms for finding single range issues may provide a lower false negative rate. This is only true if there are some issues that circuit breaking may miss but kvprober would (eventually) find. My instinct is that such a thing is possible, e.g. perhaps the storj deadlocks could be an example? But I'm wondering what you both think.

@tbg
Member

tbg commented Jan 17, 2022

Hi Josh,

Just double checking what problem we're trying to solve: I assume in your alerting you set relatively conservative error rates, since you don't want to be paged on short-lived issues. This means that you'll catch widespread problems after (say) 5 minutes of alerting. But now you're worried about single-range failures, where one in 10000 ranges is persistently down, but since it will be visited by the prober only every so often it will never generate the signal you need to get paged. Is this about right?

I think the approach you mention does make sense to solve this problem. I expect the per-Replica circuit breakers (#33007) will be pretty good at shouting when replicas are unavailable, and this should lend itself nicely to use in alerting rules. The big value add I see in kvprober is that it exercises the stack above that, notably the DistSender routing layer, where we have two relatively common unavailability scenarios: routing problems, where we somehow never reach out to the right replica but keep spinning in circles, and complete range loss, where there is no reachable replica left to route to (in which case DistSender also retries indefinitely).

There's a case that we should also have per-Range circuit breakers (sitting at DistSender), but until we have those, kvprober is our coverage for these problems, so it makes sense to let it home in on ranges it has observed problems with. Starting from the top here probably makes sense: imagine we have "something". What metrics and settings would it expose? How would the system behave in a few toy scenarios (the most obvious being a single range that for some reason does not respond to any probes indefinitely)?

One interesting way this could be implemented is as a second kvprober, which instead of drawing ranges from the meta2 entries, pulls from the first kvprober. In other words, the primary kvprober can push ranges into a holding area, which the secondary kvprober consumes from. Ranges get removed from the holding area only if the secondary kvprober has cleared them. I'm not confident enough saying that this is how it should be built, maybe this doesn't make sense in light of a need to distinguish these probers by metrics and cluster settings, etc, just pointing out this potential approach.
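
As a hedged sketch of that shape only (illustrative types, not a design commitment), the holding area could look like:

```go
// Illustrative holding area shared by the primary prober (producer) and
// a secondary prober (consumer). All names here are hypothetical.
package kvprober

import "sync"

type rangeID int64

type holdingArea struct {
	mu      sync.Mutex
	suspect []rangeID // FIFO of ranges the primary prober flagged
}

// push is called by the primary prober when a probe fails or is slow.
func (h *holdingArea) push(id rangeID) {
	h.mu.Lock()
	defer h.mu.Unlock()
	h.suspect = append(h.suspect, id)
}

// next hands the secondary prober a range to re-probe; ok is false when
// the holding area is empty.
func (h *holdingArea) next() (id rangeID, ok bool) {
	h.mu.Lock()
	defer h.mu.Unlock()
	if len(h.suspect) == 0 {
		return 0, false
	}
	return h.suspect[0], true
}

// clear removes a range only once the secondary prober has seen it healthy.
func (h *holdingArea) clear(id rangeID) {
	h.mu.Lock()
	defer h.mu.Unlock()
	for i, s := range h.suspect {
		if s == id {
			h.suspect = append(h.suspect[:i], h.suspect[i+1:]...)
			return
		}
	}
}
```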

Do you see benefits to kvprober integrating with circuit breaking? What would the integration look like concretely?

There would be some value in kvprober being notified specifically about tripped ranges. However, the information flow from Replica to some higher-level black-box prober is not obvious (the simple solution is to gossip the down replicas).

@joshimhoff
Collaborator Author

joshimhoff commented Jan 18, 2022

Hi hi!

Is this about right?

Yes, that's about right.

The big value add I see in kvprober is that it exercises the stack above that, notably the DistSender routing layer, where we have two relatively common unavailability scenarios..

Agreed! Thanks for describing these two concrete scenarios especially.

There's a case here that we should also have per-Range circuit breakers (sitting at DistSender)

Could you very roughly estimate the chances of this being built by 22.1, 22.2, & 23.1 for my planning?

One interesting way this could be implemented is as a second kvprober, which instead of drawing ranges from the meta2 entries, pulls from the first kvprober.

Yes, this is roughly how I imagine it, at least at a conceptual level. I don't think we need to figure out exactly how this will work at a code level here, but I imagine we will export the following metrics:

  • Read candidate problem range counter
  • Read candidate problem range error counter
  • Read candidate problem range latency histogram
  • Write candidate problem range counter
  • Write candidate problem range error counter
  • Write candidate problem range latency histogram

In addition to what we already have (a hedged sketch of the metric shape follows these lists):

  • Read random range counter
  • Read random range error counter
  • Read random range latency histogram
  • Write random range counter
  • Write random range error counter
  • Write random range latency histogram
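
A hedged sketch of the read-side trio, using prometheus/client_golang purely as a stand-in (kvprober actually uses CockroachDB's internal metric package, and these metric names are made up); the write side would mirror it:

```go
package kvprober

import "github.com/prometheus/client_golang/prometheus"

// Made-up metric names for illustration; the write-side trio mirrors these.
var (
	readProblemRangeAttempts = prometheus.NewCounter(prometheus.CounterOpts{
		Name: "kvprober_problem_range_read_attempts",
		Help: "Read probes sent to the candidate problem range.",
	})
	readProblemRangeFailures = prometheus.NewCounter(prometheus.CounterOpts{
		Name: "kvprober_problem_range_read_failures",
		Help: "Read probes to the candidate problem range that failed.",
	})
	readProblemRangeLatency = prometheus.NewHistogram(prometheus.HistogramOpts{
		Name:    "kvprober_problem_range_read_latency_seconds",
		Help:    "Latency of read probes to the candidate problem range.",
		Buckets: prometheus.DefBuckets,
	})
)

func init() {
	prometheus.MustRegister(
		readProblemRangeAttempts, readProblemRangeFailures, readProblemRangeLatency)
}
```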

The candidate problem range is a range that kvprober thinks is behaving poorly based on a recent "random" probe. One thing I wonder about is persisting the current problem range so it is not lost in case of a restart, though this does add complexity.

One nice thing is that we can also detect single-range latency issues, which might be hard to do with the breaker: if the latency of some "random" read / write is high, make that range the candidate problem range. I'm not sure if this will be useful in practice, but I wonder if it would be. Perhaps it would be more useful if we could say with medium confidence how fast a range should be to probe, as has been discussed before.

There would be some value in kvprober being notified specifically about tripped ranges.

What would be the value? I don't actually see it, since the breaker will already be screaming at us.

@jmcarp
Collaborator

jmcarp commented Feb 2, 2022

@joshimhoff suggested this as something I might be able to pick up and I have a few questions:

  • How large should the holding area for problem ranges be? I had initially guessed it should hold all ranges with failed probes, or possibly all ranges with recently failed probes, but @joshimhoff pointed out that we already have good coverage for multi-range problems, so we might just want to hold a single problem range at a time and probe until it's healthy again.
  • When should a range leave the holding area? After a single successful probe, or several? If we cap the size of the holding area (e.g. at one item, per @joshimhoff), should new problem ranges detected by the random kvprober displace ranges in the holding area?

@joshimhoff
Collaborator Author

joshimhoff commented Feb 2, 2022

How large should the holding area for problem ranges be?

I think I vote we store N=3 ranges, to which probes recently failed or were high latency, in a FIFO queue. The loop that probes them probes one entry in the queue per tick of the timer.

Part of the reason I suggest ^^ is that having a FIFO queue of N=3 ranges doesn't seem much more complicated than holding a single range, but tell me if you think that is wrong, Josh!

When should a range leave the holding area? After a single successful probe, or several? If we cap the size of the holding area (e.g. at one item per @joshimhoff), should new problem ranges detected by the random kvprober displace ranges in the holding area?

I'd leave them in the list unless a failed or slow probe to a range not in the list knocks them out. There is no harm in probing a healthy range.
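
A rough sketch of that policy, with hypothetical names and N as a parameter rather than a special case:

```go
package kvprober

import "sync"

type rangeID int64

// problemQueue is a hypothetical fixed-size FIFO of problem ranges.
// Entries stay until a new failing range displaces the oldest one; the
// probe loop takes one entry per tick of its timer.
type problemQueue struct {
	mu sync.Mutex
	n  int // capacity, e.g. N=3
	q  []rangeID
}

func newProblemQueue(n int) *problemQueue { return &problemQueue{n: n} }

// add records a range whose probe recently failed or was slow.
func (pq *problemQueue) add(id rangeID) {
	pq.mu.Lock()
	defer pq.mu.Unlock()
	for _, existing := range pq.q {
		if existing == id {
			return // already tracked
		}
	}
	if len(pq.q) == pq.n {
		pq.q = pq.q[1:] // knock out the oldest entry
	}
	pq.q = append(pq.q, id)
}

// next returns the entry to probe this tick, rotating through the queue.
func (pq *problemQueue) next() (rangeID, bool) {
	pq.mu.Lock()
	defer pq.mu.Unlock()
	if len(pq.q) == 0 {
		return 0, false
	}
	id := pq.q[0]
	pq.q = append(pq.q[1:], id)
	return id, true
}
```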

We should also inversely scale the rate at which kvprober probes problem ranges with the number of nodes in the cluster. Otherwise, kvprober running in a really big cluster might hammer problem ranges, since all KV nodes would eventually be expected to classify the same range as a problem range.
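
A hedged sketch of that scaling, assuming a hypothetical helper (not real kvprober code):

```go
package kvprober

import "time"

// quarantineProbeInterval is a hypothetical helper: if every one of
// numNodes nodes eventually tracks the same problem range, stretching
// each node's interval by numNodes keeps the aggregate probe rate against
// that range at roughly one probe per base interval.
func quarantineProbeInterval(base time.Duration, numNodes int) time.Duration {
	if numNodes < 1 {
		numNodes = 1
	}
	return base * time.Duration(numNodes)
}
```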

Wdyt about all that, Josh & KV folks?

@tbg
Member

tbg commented Feb 2, 2022

Sounds good.

Part of the reason I suggest ^^ is that having a FIFO queue of N=3 ranges doesn't seem much more complicated than holding a single range, but tell me if you think that is wrong, Josh!

The implementation should be the same for any N, so it probably makes sense to pick the N later. That said, N=3 is probably fine as a starting point but if the impl special-cased on that, I would be concerned.

We should also inversely scale the rate kvprober probes problem ranges by number of nodes in the cluster. Else kvprober running in a really big cluster might hammer problem ranges, if all KV nodes classify the same range as a problem range (as would be eventually expected).

Yeah, that generally makes sense, though I don't think it would be a problem with cluster sizes we can expect in the next couple of years, given that kvprober isn't probing very aggressively in the first place.

jmcarp added a commit to jmcarp/cockroach that referenced this issue Feb 4, 2022
The kvprober provides good coverage of issues that affect many ranges,
but has a lower probability of detecting individual bad ranges. To
improve coverage of the latter case, remember failed ranges from the
existing prober and probe them more frequently in a second pair of probe
loops.

Resolves cockroachdb#74407.

Release note: None
@joshimhoff
Collaborator Author

joshimhoff commented Jul 21, 2022

I've been talking with @nkodali about this recently. I think we should return to this issue, especially since the larger alerting deliverable captured in #71169 will probably slip to 23.1. We've seen some KV outages recently that involved a single range becoming hard-unavailable. In one recent case, the outage happened to be on the range holding the system.lease data. This was unlucky for the customer, as it meant a full-cluster-down outage, but "lucky" (lol lol) for the CC alerting stack, as it meant an outage that sqlprober promptly caught (which was good for the customer, as it meant CRL fixed it quickly & without customer involvement). Regardless of which range is affected, I think we want to catch a range becoming hard-unavailable, at very high confidence.

Quick recap:

I think if we do this, we should backport it. kvprober is only on in CC; the cluster settings are private.

@Santamaura
Contributor

Hi @joshimhoff, I saw this on the KV obs board, and after reading through the comments here (and on that draft PR) I was wondering if that draft PR is salvageable for the current idea of quarantining ranges? It sounded like the sentiment was that there was complexity being added to the prober that was maybe undesirable/unnecessary?

@joshimhoff
Collaborator Author

joshimhoff commented Aug 29, 2022

Hi @Santamaura!

I think the complexity is warranted, and I think I've talked with @tbg enough about this to convince him of that. In fact, just last week we saw an outage in CC that didn't lead to an SRE page since only a small subset of ranges was affected! I think the draft PR provides a guide to how this might be implemented, though perhaps that implementation is not quite right in terms of code organization, etc. I also think the comment from Tobias below is useful in thinking about how to organize the code. Lastly, I'm happy to 1:1 about the implementation with whoever takes this on, if that is useful.

Maybe we want a "quarantine" system. If you fail a probe, you get quarantined (if there's room, otherwise ignore quarantine). Ranges that are in quarantine get probed separately (to avoid slowing down the main prober) at an interval. We track the time of entrance into the quarantine pool and report the duration the oldest member has been there as a metric. Then you can alert on that (instead of on any metric about write failures), without a temporary blip messing up the alerting. With that setup, we don't have to rely on failing probes to determine whether there is an outage that's long enough to page someone. Instead, we get to look directly at the quantity we care about, namely how long it's been since we know of the outage. It would work like an exact mirror image of the "uptime" metric.
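
A minimal sketch of that quarantine idea, with hypothetical types (the real implementation landed in #87436):

```go
package kvprober

import (
	"sync"
	"time"
)

type rangeID int64

// quarantinePool is a hypothetical pool of ranges that failed a probe,
// tracking when each one entered quarantine.
type quarantinePool struct {
	mu      sync.Mutex
	entered map[rangeID]time.Time
}

func newQuarantinePool() *quarantinePool {
	return &quarantinePool{entered: make(map[rangeID]time.Time)}
}

// add quarantines a range on a failed probe (capping the pool size, as
// described above, is elided here for brevity).
func (p *quarantinePool) add(id rangeID) {
	p.mu.Lock()
	defer p.mu.Unlock()
	if _, ok := p.entered[id]; !ok {
		p.entered[id] = time.Now()
	}
}

// remove releases a range after a successful probe.
func (p *quarantinePool) remove(id rangeID) {
	p.mu.Lock()
	defer p.mu.Unlock()
	delete(p.entered, id)
}

// oldestDuration is the quantity to alert on: how long the longest-tenured
// member has been quarantined (zero when the pool is empty).
func (p *quarantinePool) oldestDuration() time.Duration {
	p.mu.Lock()
	defer p.mu.Unlock()
	var oldest time.Time
	for _, t := range p.entered {
		if oldest.IsZero() || t.Before(oldest) {
			oldest = t
		}
	}
	if oldest.IsZero() {
		return 0
	}
	return time.Since(oldest)
}
```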

Santamaura added a commit to Santamaura/cockroach that referenced this issue Sep 8, 2022
These changes update the kvprober to add ranges that
fail probing into a quarantine pool where they are
continuously probed. A metric which indicates the
duration of the longest tenured range has also been
added.

Resolves cockroachdb#74407

Release justification: low risk, high benefit changes to
existing functionality.

Release note: None
craig bot pushed a commit that referenced this issue Sep 23, 2022
87436: kv: update kvprober with quarantine pool r=Santamaura a=Santamaura

These changes update the kvprober to add ranges that
fail probing into a quarantine pool where they are
continuously probed. A metric which indicates the
duration of the longest tenured range has also been
added.

Resolves #74407

Release justification: low risk, high benefit changes to
existing functionality.

Release note: None

Co-authored-by: Santamaura <[email protected]>
Co-authored-by: Tobias Grieger <[email protected]>
@craig craig bot closed this as completed in 01034a4 Sep 23, 2022
blathers-crl bot pushed a commit that referenced this issue Sep 23, 2022
@joshimhoff
Collaborator Author

💯

As we roll this out to CC prod, I will keep you in the loop about prod issues it finds!

Santamaura added a commit that referenced this issue Oct 10, 2022