kvprober: find single range issues by repeatedly probing problem ranges with issues after randomly finding a candidate problem range #74407
Comments
I think it'd be nice to keep an eye on ranges with problems. I think the |
What do you think about what @andreimatei is saying, @tbg? Do you see benefits to If you do, I still think there is an argument for something like this:
The argument is that having two mechanisms for finding single range issues may provide a lower false negative rate. This is only true if there are some issues that circuit breaking may miss but |
Hi Josh,
Just double checking what problem we're trying to solve: I assume in your alerting you set relatively conservative error rates, since you don't want to be paged on short-lived issues. This means that you'll catch widespread problems after (say) 5 minutes of alerting. But now you're worried about single-range failures, where one in 10000 ranges is persistently down, but since it will be visited by the prober only every so often, it will never generate the signal you need to get paged. Is this about right?
I think the approach you mention does make sense to solve this problem. I expect the per-Replica circuit breakers (#33007) will be pretty good at shouting when replicas are unavailable, and this should lend itself nicely to use in alerting rules. The big value add I see in
One interesting way this could be implemented is as a second kvprober, which instead of drawing ranges from the meta2 entries, pulls from the first kvprober. In other words, the primary kvprober can push ranges into a holding area, which the secondary kvprober consumes from. Ranges get removed from the holding area only if the secondary kvprober has cleared them. I'm not confident enough to say that this is how it should be built. Maybe it doesn't make sense in light of a need to distinguish these probers by metrics and cluster settings, etc. I'm just pointing out this potential approach.
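A minimal sketch of the holding-area idea above, in Python for brevity (kvprober itself is Go). Everything here is hypothetical and for illustration only: `HoldingArea`, `primary_pass`, `secondary_pass`, and the `probe` callback are invented names, not kvprober APIs, and a real implementation would run the two probers concurrently on timers.

```python
from collections import OrderedDict
from typing import Callable, Iterable, List


class HoldingArea:
    """Ranges flagged by the primary prober (hypothetical structure)."""

    def __init__(self) -> None:
        # OrderedDict used as an insertion-ordered set of range IDs.
        self._ranges: "OrderedDict[int, None]" = OrderedDict()

    def push(self, range_id: int) -> None:
        # Re-flagging a range that is already held is a no-op.
        self._ranges.setdefault(range_id, None)

    def clear(self, range_id: int) -> None:
        self._ranges.pop(range_id, None)

    def snapshot(self) -> List[int]:
        return list(self._ranges)


def primary_pass(range_ids: Iterable[int], probe: Callable[[int], bool],
                 holding: HoldingArea) -> None:
    """Primary prober: visits every range and pushes failures into the holding area."""
    for rid in range_ids:
        if not probe(rid):
            holding.push(rid)


def secondary_pass(probe: Callable[[int], bool], holding: HoldingArea) -> None:
    """Secondary prober: only visits held ranges; a passing probe clears the range."""
    for rid in holding.snapshot():
        if probe(rid):
            holding.clear(rid)
```

The point of the split is that only the secondary prober ever removes a range, so a range that is still failing keeps getting probed no matter what the primary prober does next.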
There would be some value in kvprober being notified specifically about tripped ranges. However, the information flow from Replica to some higher-level black-box prober is not obvious (the simple solution is to gossip the down replicas). |
Hi hi!
Yes, that's about right.
Agreed! Thanks for describing these two concrete scenarios especially.
Could you very roughly estimate the chances of this being built by 22.1, 22.2, & 23.1 for my planning?
Yes, this is roughly how I imagine it, at least at a conceptual level. I don't think we need to figure out exactly how this will work at a code level here, but I imagine we will export the following metrics
In addition to what we already have:
The candidate problem range is a range that One nice thing is we can also detect single range latency issues, which might be hard to do with the breaker. If latency of some "random" read / write is high, make it the candidate problem range. I'm not sure if this will be useful in practice but I wonder if it would be. Perhaps it would be more useful if we could say at medium confidence how fast a range should be to probe, as has been discussed before.
What would be the value? I don't actually see it, since the breaker will already be screaming at us. |
@joshimhoff suggested this as something I might be able to pick up and I have a few questions:
I think I vote we store N=3 ranges to which probes recently failed or were high latency in a FIFO queue. The loop that probes them probes one entry in the queue per tick of the timer. Part of the reason I suggest ^^ is that having a FIFO queue of N=3 ranges doesn't seem much more complicated than holding a single range, but tell me if you think that is wrong, Josh!
I'd leave em in the list unless a failed or slow probe to a range not in the list knocks em out. There is no harm in probing a healthy range. We should also inversely scale the rate Wdyt about all that, Josh & KV folks? |
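The queue behavior proposed in the last two comments can be sketched roughly as follows, in Python for brevity (kvprober itself is Go). `ProblemRangeQueue` and its methods are hypothetical names for illustration, and the sketch assumes a healthy probe never removes an entry: only a newly reported failing or slow range evicts the oldest one.

```python
from typing import List, Optional


class ProblemRangeQueue:
    """Fixed-size FIFO of candidate problem ranges (hypothetical sketch)."""

    def __init__(self, capacity: int = 3) -> None:
        self._capacity = capacity
        self._ranges: List[int] = []  # oldest entry first
        self._cursor = 0              # index of the next entry to probe

    def report(self, range_id: int) -> None:
        """Record a failed or slow probe from the main probe loop."""
        if range_id in self._ranges:
            return
        if len(self._ranges) == self._capacity:
            # Queue is full: the new range knocks out the oldest entry.
            self._ranges.pop(0)
            self._cursor = max(self._cursor - 1, 0)
        self._ranges.append(range_id)

    def next_to_probe(self) -> Optional[int]:
        """Return one entry per timer tick, cycling through the queue."""
        if not self._ranges:
            return None
        self._cursor %= len(self._ranges)
        rid = self._ranges[self._cursor]
        self._cursor += 1
        return rid

    def contents(self) -> List[int]:
        return list(self._ranges)
```

With N=3, reporting ranges 1, 2, 3 and then 4 leaves 2, 3, 4 in the queue; successful probes of any of them change nothing.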
Sounds good.
The implementation should be the same for any
Yeah, that generally makes sense, though I don't think it would be a problem with cluster sizes we can expect in the next couple of years, given that kvprober isn't probing very aggressively in the first place. |
The kvprober provides good coverage of issues that affect many ranges, but has a lower probability of detecting individual bad ranges. To improve coverage of the latter case, remember failed ranges from the existing prober and probe them more frequently in a second pair of probe loops. Resolves cockroachdb#74407. Release note: None
I've been talking with @nkodali about this recently. I think we should return to this issue, especially since the larger alerting deliverable captured in #71169 will prob slip in 23.1. We've seen some KV outages recently that involved a single range becoming hard unavailable. In one recent case, the outage happened to be on the range holding the Quick recap:
I think if we do this, we should backport it. |
Hi @joshimhoff, I saw this on the kv obs board, and after reading through the comments here (and on that draft PR) I was wondering if that draft PR is salvageable for the current idea of quarantining ranges? It sounded like the sentiment was that there was complexity being added to the prober that was maybe undesirable/unnecessary? |
Hi @Santamaura! I think the complexity is warranted, and I think I've talked with @tbg enough about that to convince him of that. In fact, just last week we saw an outage in CC that didn't lead to an SRE page since only a small subset of ranges were affected! I think the draft PR provides a guide on how this might be implemented, but perhaps that implementation is not quite right in terms of code organization, etc. I also think the below comment from Tobias is useful in thinking about how to organize the code. Lastly I'm up to 1:1 about implementation with whoever might be taking this on, if that is useful.
These changes update the kvprober to add ranges that fail probing into a quarantine pool where they are continuously probed. A metric which indicates the duration of the longest tenured range has also been added. Resolves cockroachdb#74407 Release justification: low risk, high benefit changes to existing functionality. Release note: None
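As a rough illustration of what a quarantine pool with a longest-tenure metric could look like, here is a Python sketch (the merged change is in Go and may differ in every detail). `QuarantinePool`, `add`, `probe_all`, and `longest_tenure_seconds` are hypothetical names, and removing a range as soon as a probe passes is an assumption, not something the commit message states.

```python
import time
from typing import Callable, Dict


class QuarantinePool:
    """Ranges that failed probing, with the time each one entered (hypothetical)."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._entered: Dict[int, float] = {}  # range ID -> time it was quarantined

    def add(self, range_id: int) -> None:
        # Keep the original entry time if the range is already quarantined.
        self._entered.setdefault(range_id, self._clock())

    def probe_all(self, probe: Callable[[int], bool]) -> None:
        """Probe every quarantined range; assumed: a passing probe releases it."""
        for rid in list(self._entered):
            if probe(rid):
                del self._entered[rid]

    def longest_tenure_seconds(self) -> float:
        """Metric value: how long the oldest quarantined range has been held."""
        if not self._entered:
            return 0.0
        return self._clock() - min(self._entered.values())
```

A metric shaped like this is easy to alert on: it stays at zero while the pool is empty and grows steadily for as long as any one range keeps failing.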
87436: kv: update kvprober with quarantine pool r=Santamaura a=Santamaura These changes update the kvprober to add ranges that fail probing into a quarantine pool where they are continuously probed. A metric which indicates the duration of the longest tenured range has also been added. Resolves #74407 Release justification: low risk, high benefit changes to existing functionality. Release note: None Co-authored-by: Santamaura <[email protected]> Co-authored-by: Tobias Grieger <[email protected]>
💯 As we roll this out to CC prod, I will keep you in the loop about prod issues it finds! |
Is your feature request related to a problem? Please describe.
kvprober probes all ranges. Single range issues happen. kvprober will detect such issues but the resulting error rate will be extremely low (1 / number of ranges in the cluster). This makes alerting on such an issue hard.
Describe the solution you'd like
kvprober could "remember" when it probes a range and doesn't get back a successful (or fast) response. kvprober could then probe that range regularly, in a separate goroutine from the one in which it is probing all ranges. kvprober could generate metrics on the error rate & latency profile of the candidate problem range. SRE could alert on this. Basically, when kvprober discovers a candidate problem range, it focuses on producing data about that range.
Describe alternatives you've considered
An alternative is to not do this, but write a long time-window log-based alert on multiple errors in a row for RPCs to a specific range. This might be workable, tho the time to page would be much longer than with the approach above. Also would need #74405.
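To put rough numbers on why a single bad range is hard to alert on, here is a small Python calculation. It assumes the prober picks ranges uniformly; the 10,000-range cluster size echoes the example used in the comments, while the rate of one probe per second is an illustrative assumption, not a kvprober default.

```python
def expected_error_rate(bad_ranges: int, total_ranges: int) -> float:
    """Fraction of probes expected to fail when ranges are probed uniformly."""
    return bad_ranges / total_ranges


def expected_seconds_between_hits(total_ranges: int, probes_per_second: float,
                                  bad_ranges: int = 1) -> float:
    """Average time between probes that land on a bad range."""
    return total_ranges / (bad_ranges * probes_per_second)


# One persistently down range out of 10,000, probed at an assumed 1 probe/second.
rate = expected_error_rate(1, 10_000)
gap = expected_seconds_between_hits(10_000, 1.0)
print(f"error rate: {rate:.4%}, seconds between failing probes: {gap:.0f}")
```

Under these assumptions the bad range is hit only about once every 10,000 seconds, so an error-rate threshold tuned to ignore short-lived blips would likely never fire.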
Additional context
N/A
@tbg & @andreimatei: Wdyt about this idea?
Jira issue: CRDB-12069