
kvprober: find single range issues by repeatedly probing problem ranges with issues after randomly finding a candidate problem range #74407

Closed
joshimhoff opened this issue Jan 4, 2022 · 11 comments · Fixed by #87436
Assignees
Labels
A-kv-observability C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) O-sre For issues SRE opened or otherwise cares about tracking.

Comments

@joshimhoff
Collaborator

joshimhoff commented Jan 4, 2022

Is your feature request related to a problem? Please describe.
kvprober probes all ranges. Single range issues happen. kvprober will detect such issues but the resulting error rate will be extremely low (1 / number of ranges in the cluster). This makes alerting on such an issue hard.

Describe the solution you'd like
kvprober could "remember" when it probes a range and doesn't get back a successful (or fast) response. kvprober could then probe that range regularly, in a separate goroutine from the one in which it is probing all ranges. kvprober could generate metrics on the error rate & latency profile of the candidate problem range. SRE could alert on this. Basically, when kvprober discovers a candidate problem range, it focuses on producing data about that range.

Describe alternatives you've considered
An alternative is to not do this, but instead write a long time-window, log-based alert on multiple errors in a row for RPCs to a specific range. This might be workable, though the time to page would be much longer than with the above. It would also need #74405.

Additional context
N/A

@tbg & @andreimatei: Wdyt about this idea?

Jira issue: CRDB-12069

@joshimhoff joshimhoff added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) O-sre For issues SRE opened or otherwise cares about tracking. labels Jan 4, 2022
@joshimhoff joshimhoff changed the title kvprober: find single range issues by repeatedly probing ranges with issues after randomly finding one kvprober: find single range issues by repeatedly probing problem ranges with issues after randomly finding a candidate problem range Jan 4, 2022
@andreimatei
Contributor

I think it'd be nice to keep an eye on ranges with problems. I think kvprober should integrate with the circuit breaking that I believe Tobi is introducing in #71806.

@joshimhoff
Collaborator Author

What do you think about what @andreimatei is saying, @tbg? Do you see benefits to kvprober integrating with circuit breaking? What would the integration look like concretely?

If you do, I still think there is an argument for something like this:

kvprober could "remember" when it probes a range and doesn't get back a successful (or fast) response. kvprober could then probe that range regularly, in a separate goroutine from the one in which it is probing all ranges. kvprober could generate metrics on the error rate & latency profile of the candidate problem range. SRE could alert on this. Basically, when kvprober discovers a candidate problem range, it focuses on producing data about that range.

The argument is that having two mechanisms for finding single range issues may provide a lower false negative rate. This is only true if there are some issues that circuit breaking may miss but kvprober would (eventually) find. My instinct is that such a thing is possible, e.g. perhaps the storj deadlocks could be an example? But I'm wondering what you both think.

@tbg
Member

tbg commented Jan 17, 2022

Hi Josh,

Just double checking what problem we're trying to solve: I assume in your alerting you set relatively conservative error rates, since you don't want to be paged on short-lived issues. This means that you'll catch widespread problems after (say) 5 minutes of alerting. But now you're worried about single-range failures, where one in 10000 ranges is persistently down, but since it will be visited by the prober only every so often it will never generate the signal you need to get paged. Is this about right?

I think the approach you mention does make sense to solve this problem. I expect the per-Replica circuit breakers (#33007) will be pretty good at shouting when replicas are unavailable, and this should lend itself nicely to use in alerting rules. The big value add I see in kvprober is that it exercises the stack above that, notably the DistSender routing layer, where we have two relatively common unavailability scenarios: routing problems, where we somehow never reach out to the right replica but keep spinning in circles, and complete range loss, where there is no reachable replica left to route to (in which case DistSender also retries indefinitely).

There's a case that we should also have per-Range circuit breakers (sitting at DistSender), but until we have those, kvprober is our coverage for these problems, so it makes sense to let it home in on ranges it has observed problems with. Starting from the top here probably makes sense: imagine we have "something". What metrics and settings would it expose? How would the system behave in a few toy scenarios (the most obvious being a single range that for some reason does not respond to any probes indefinitely)?

One interesting way this could be implemented is as a second kvprober, which instead of drawing ranges from the meta2 entries, pulls from the first kvprober. In other words, the primary kvprober can push ranges into a holding area, which the secondary kvprober consumes from. Ranges get removed from the holding area only if the secondary kvprober has cleared them. I'm not confident enough saying that this is how it should be built, maybe this doesn't make sense in light of a need to distinguish these probers by metrics and cluster settings, etc, just pointing out this potential approach.
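
As a hedged sketch of that shape only (illustrative types, not a design commitment), the holding area could look like:

```go
// Illustrative holding area shared by the primary prober (producer) and
// a secondary prober (consumer). All names here are hypothetical.
package kvprober

import "sync"

type rangeID int64

type holdingArea struct {
	mu      sync.Mutex
	suspect []rangeID // FIFO of ranges the primary prober flagged
}

// push is called by the primary prober when a probe fails or is slow.
func (h *holdingArea) push(id rangeID) {
	h.mu.Lock()
	defer h.mu.Unlock()
	h.suspect = append(h.suspect, id)
}

// next hands the secondary prober a range to re-probe; ok is false when
// the holding area is empty.
func (h *holdingArea) next() (id rangeID, ok bool) {
	h.mu.Lock()
	defer h.mu.Unlock()
	if len(h.suspect) == 0 {
		return 0, false
	}
	return h.suspect[0], true
}

// clear removes a range only once the secondary prober has seen it healthy.
func (h *holdingArea) clear(id rangeID) {
	h.mu.Lock()
	defer h.mu.Unlock()
	for i, s := range h.suspect {
		if s == id {
			h.suspect = append(h.suspect[:i], h.suspect[i+1:]...)
			return
		}
	}
}
```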

Do you see benefits to kvprober integrating with circuit breaking? What would the integration look like concretely?

There would be some value in kvprober being notified specifically about tripped ranges. However, the information flow from Replica to some higher-level black-box prober is not obvious (the simple solution is to gossip the down replicas).

@joshimhoff
Collaborator Author

joshimhoff commented Jan 18, 2022

Hi hi!

Is this about right?

Yes, that's about right.

The big value add I see in kvprober is that it exercises the stack above that, notably the DistSender routing layer, where we have two relatively common unavailability scenarios..

Agreed! Thanks for describing these two concrete scenarios especially.

There's a case here that we should also have per-Range circuit breakers (sitting at DistSender)

Could you very roughly estimate the chances of this being built by 22.1, 22.2, & 23.1 for my planning?

One interesting way this could be implemented is as a second kvprober, which instead of drawing ranges from the meta2 entries, pulls from the first kvprober.

Yes, this is roughly how I imagine it, at least at a conceptual level. I don't think we need to figure out exactly how this will work at a code level here, but I imagine we will export the following metrics:

  • Read candidate problem range counter
  • Read candidate problem range error counter
  • Read candidate problem range latency histogram
  • Write candidate problem range counter
  • Write candidate problem range error counter
  • Write candidate problem range latency histogram

In addition to what we already have (a hedged sketch of the metric shape follows these lists):

  • Read random range counter
  • Read random range error counter
  • Read random range latency histogram
  • Write random range counter
  • Write random range error counter
  • Write random range latency histogram
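
A hedged sketch of the read-side trio, using prometheus/client_golang purely as a stand-in (kvprober actually uses CockroachDB's internal metric package, and these metric names are made up); the write side would mirror it:

```go
package kvprober

import "github.com/prometheus/client_golang/prometheus"

// Made-up metric names for illustration; the write-side trio mirrors these.
var (
	readProblemRangeAttempts = prometheus.NewCounter(prometheus.CounterOpts{
		Name: "kvprober_problem_range_read_attempts",
		Help: "Read probes sent to the candidate problem range.",
	})
	readProblemRangeFailures = prometheus.NewCounter(prometheus.CounterOpts{
		Name: "kvprober_problem_range_read_failures",
		Help: "Read probes to the candidate problem range that failed.",
	})
	readProblemRangeLatency = prometheus.NewHistogram(prometheus.HistogramOpts{
		Name:    "kvprober_problem_range_read_latency_seconds",
		Help:    "Latency of read probes to the candidate problem range.",
		Buckets: prometheus.DefBuckets,
	})
)

func init() {
	prometheus.MustRegister(
		readProblemRangeAttempts, readProblemRangeFailures, readProblemRangeLatency)
}
```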

The candidate problem range is a range that kvprober thinks is behaving poorly based on a recent "random" probe. One thing I wonder about is persisting the current problem range so it is not lost in case of a restart, though this does add complexity.

One nice thing is that we can also detect single-range latency issues, which might be hard to do with the breaker: if the latency of some "random" read / write is high, make that range the candidate problem range. I'm not sure if this will be useful in practice, but I wonder if it would be. Perhaps it would be more useful if we could say with medium confidence how fast a range should be to probe, as has been discussed before.

There would be some value in kvprober being notified specifically about tripped ranges.

What would be the value? I don't actually see it, since the breaker will already be screaming at us.

@jmcarp
Collaborator

jmcarp commented Feb 2, 2022

@joshimhoff suggested this as something I might be able to pick up and I have a few questions:

  • How large should the holding area for problem ranges be? I had initially guessed it should hold all ranges with failed probes, or possibly all ranges with recently failed probes, but @joshimhoff pointed out that we already have good coverage for multi-range problems, so we might just want to hold a single problem range at a time and probe until it's healthy again.
  • When should a range leave the holding area? After a single successful probe, or several? If we cap the size of the holding area (e.g. at one item, per @joshimhoff), should new problem ranges detected by the random kvprober displace ranges in the holding area?

@joshimhoff
Collaborator Author

joshimhoff commented Feb 2, 2022

How large should the holding area for problem ranges be?

I think I vote we store N=3 ranges, to which probes recently failed or were high latency, in a FIFO queue. The loop that probes them probes one entry in the queue per tick of the timer.

Part of the reason I suggest ^^ is that having a FIFO queue of N=3 ranges doesn't seem much more complicated than holding a single range, but tell me if you think that is wrong, Josh!

When should a range leave the holding area? After a single successful probe, or several? If we cap the size of the holding area (e.g. at one item per @joshimhoff), should new problem ranges detected by the random kvprober displace ranges in the holding area?

I'd leave them in the list unless a failed or slow probe to a range not in the list knocks them out. There is no harm in probing a healthy range.
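
A rough sketch of that policy, with hypothetical names and N as a parameter rather than a special case:

```go
package kvprober

import "sync"

type rangeID int64

// problemQueue is a hypothetical fixed-size FIFO of problem ranges.
// Entries stay until a new failing range displaces the oldest one; the
// probe loop takes one entry per tick of its timer.
type problemQueue struct {
	mu sync.Mutex
	n  int // capacity, e.g. N=3
	q  []rangeID
}

func newProblemQueue(n int) *problemQueue { return &problemQueue{n: n} }

// add records a range whose probe recently failed or was slow.
func (pq *problemQueue) add(id rangeID) {
	pq.mu.Lock()
	defer pq.mu.Unlock()
	for _, existing := range pq.q {
		if existing == id {
			return // already tracked
		}
	}
	if len(pq.q) == pq.n {
		pq.q = pq.q[1:] // knock out the oldest entry
	}
	pq.q = append(pq.q, id)
}

// next returns the entry to probe this tick, rotating through the queue.
func (pq *problemQueue) next() (rangeID, bool) {
	pq.mu.Lock()
	defer pq.mu.Unlock()
	if len(pq.q) == 0 {
		return 0, false
	}
	id := pq.q[0]
	pq.q = append(pq.q[1:], id)
	return id, true
}
```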

We should also inversely scale the rate at which kvprober probes problem ranges with the number of nodes in the cluster. Otherwise, kvprober running in a really big cluster might hammer problem ranges, since all KV nodes would eventually be expected to classify the same range as a problem range.
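
A hedged sketch of that scaling, assuming a hypothetical helper (not real kvprober code):

```go
package kvprober

import "time"

// quarantineProbeInterval is a hypothetical helper: if every one of
// numNodes nodes eventually tracks the same problem range, stretching
// each node's interval by numNodes keeps the aggregate probe rate against
// that range at roughly one probe per base interval.
func quarantineProbeInterval(base time.Duration, numNodes int) time.Duration {
	if numNodes < 1 {
		numNodes = 1
	}
	return base * time.Duration(numNodes)
}
```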

Wdyt about all that, Josh & KV folks?

@tbg
Member

tbg commented Feb 2, 2022

Sounds good.

Part of the reason I suggest ^^ is that having a FIFO queue of N=3 ranges doesn't seem much more complicated than holding a single range, but tell me if you think that is wrong, Josh!

The implementation should be the same for any N, so it probably makes sense to pick the N later. That said, N=3 is probably fine as a starting point but if the impl special-cased on that, I would be concerned.

We should also inversely scale the rate kvprober probes problem ranges by number of nodes in the cluster. Else kvprober running in a really big cluster might hammer problem ranges, if all KV nodes classify the same range as a problem range (as would be eventually expected).

Yeah, that generally makes sense, though I don't think it would be a problem with cluster sizes we can expect in the next couple of years, given that kvprober isn't probing very aggressively in the first place.

jmcarp added a commit to jmcarp/cockroach that referenced this issue Feb 4, 2022
The kvprober provides good coverage of issues that affect many ranges,
but has a lower probability of detecting individual bad ranges. To
improve coverage of the latter case, remember failed ranges from the
existing prober and probe them more frequently in a second pair of probe
loops.

Resolves cockroachdb#74407.

Release note: None
@joshimhoff
Collaborator Author

joshimhoff commented Jul 21, 2022

I've been talking with @nkodali about this recently. I think we should return to this issue, especially since the larger alerting deliverable captured in #71169 will probably slip to 23.1. We've seen some KV outages recently that involved a single range becoming hard-unavailable. In one recent case, the outage happened to be on the range holding the system.lease data. This was unlucky for the customer, as it meant a full-cluster-down outage, but "lucky" (lol lol) for the CC alerting stack, as it meant an outage that sqlprober promptly caught (which was good for the customer, as it meant CRL fixed it quickly & without customer involvement). Regardless of which range is affected, I think we want to catch a range becoming hard-unavailable, at very high confidence.

Quick recap:

I think if we do this, we should backport it. kvprober is only on in CC; the cluster settings are private.

@Santamaura
Contributor

Hi @joshimhoff, I saw this on the KV obs board, and after reading through the comments here (and on that draft PR) I was wondering if that draft PR is salvageable for the current idea of quarantining ranges? It sounded like the sentiment was that there was complexity being added to the prober that was maybe undesirable/unnecessary?

@joshimhoff
Collaborator Author

joshimhoff commented Aug 29, 2022

Hi @Santamaura!

I think the complexity is warranted, and I think I've talked with @tbg enough about this to convince him of that. In fact, just last week we saw an outage in CC that didn't lead to an SRE page since only a small subset of ranges was affected! I think the draft PR provides a guide to how this might be implemented, though perhaps that implementation is not quite right in terms of code organization, etc. I also think the comment from Tobias below is useful in thinking about how to organize the code. Lastly, I'm happy to 1:1 about the implementation with whoever takes this on, if that is useful.

Maybe we want a "quarantine" system. If you fail a probe, you get quarantined (if there's room, otherwise ignore quarantine). Ranges that are in quarantine get probed separately (to avoid slowing down the main prober) at an interval. We track the time of entrance into the quarantine pool and report the duration the oldest member has been there as a metric. Then you can alert on that (instead of on any metric about write failures), without a temporary blip messing up the alerting. With that setup, we don't have to rely on failing probes to determine whether there is an outage that's long enough to page someone. Instead, we get to look directly at the quantity we care about, namely how long it's been since we know of the outage. It would work like an exact mirror image of the "uptime" metric.
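
A minimal sketch of that quarantine idea, with hypothetical types (the real implementation landed in #87436):

```go
package kvprober

import (
	"sync"
	"time"
)

type rangeID int64

// quarantinePool is a hypothetical pool of ranges that failed a probe,
// tracking when each one entered quarantine.
type quarantinePool struct {
	mu      sync.Mutex
	entered map[rangeID]time.Time
}

func newQuarantinePool() *quarantinePool {
	return &quarantinePool{entered: make(map[rangeID]time.Time)}
}

// add quarantines a range on a failed probe (capping the pool size, as
// described above, is elided here for brevity).
func (p *quarantinePool) add(id rangeID) {
	p.mu.Lock()
	defer p.mu.Unlock()
	if _, ok := p.entered[id]; !ok {
		p.entered[id] = time.Now()
	}
}

// remove releases a range after a successful probe.
func (p *quarantinePool) remove(id rangeID) {
	p.mu.Lock()
	defer p.mu.Unlock()
	delete(p.entered, id)
}

// oldestDuration is the quantity to alert on: how long the longest-tenured
// member has been quarantined (zero when the pool is empty).
func (p *quarantinePool) oldestDuration() time.Duration {
	p.mu.Lock()
	defer p.mu.Unlock()
	var oldest time.Time
	for _, t := range p.entered {
		if oldest.IsZero() || t.Before(oldest) {
			oldest = t
		}
	}
	if oldest.IsZero() {
		return 0
	}
	return time.Since(oldest)
}
```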

Santamaura added a commit to Santamaura/cockroach that referenced this issue Sep 8, 2022
These changes update the kvprober to add ranges that
fail probing into a quarantine pool where they are
continuously probed. A metric which indicates the
duration of the longest tenured range has also been
added.

Resolves cockroachdb#74407

Release justification: low risk, high benefit changes to
existing functionality.

Release note: None
craig bot pushed a commit that referenced this issue Sep 23, 2022
87436: kv: update kvprober with quarantine pool r=Santamaura a=Santamaura

These changes update the kvprober to add ranges that
fail probing into a quarantine pool where they are
continuously probed. A metric which indicates the
duration of the longest tenured range has also been
added.

Resolves #74407

Release justification: low risk, high benefit changes to
existing functionality.

Release note: None

Co-authored-by: Santamaura <[email protected]>
Co-authored-by: Tobias Grieger <[email protected]>
@craig craig bot closed this as completed in 01034a4 Sep 23, 2022
blathers-crl bot pushed a commit that referenced this issue Sep 23, 2022
@joshimhoff
Collaborator Author

💯

As we roll this out to CC prod, I will keep you in the loop about prod issues it finds!

Santamaura added a commit that referenced this issue Oct 10, 2022