Provide E2E SQL and/or KV latency & error rate time-series metrics that indicate issues with CRDB rather than poor workload given cluster provisioning #71169
Comments
Some convo with @sumeerbhola, @JuanLeon1, and @tbg: Sumeer says:
Josh says:
Sumeer says:
Josh says:
Josh talks to Juan.
I think it's going to be quite hard to "exclude" the right latencies to make a clear signal here. SQL is such a complicated language that it makes it very difficult to know when a query is operating at expected latencies or not. The closest thing that we have to this idea is the coster, but the coster isn't designed to have units that are closely related to latency - costs will not nicely compare with latencies using some linear scale or anything like that. From SQL's perspective, if "SQL processing" itself is ever slower than some constant factor of the number of input rows, that's probably unexpected, modulo disk spilling, which is expected and makes everything many times slower (but is desirable nonetheless). I'd say that the most fruitful thing we could do would be to measure the KV request latencies and pay attention to whether they're slower than expected. KV is our simple interface. Maybe we ought to consider enhancing KV request latencies or tracing, rather than try to do it for all of SQL, which would be extremely complex and error-prone. Basically what I'm trying to say is: if KV requests are slow, SQL queries will be slow. If KV requests are fast, SQL queries will probably be fast, modulo things like disk spilling or inter-datacenter latencies.
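As an illustration of the "measure KV request latencies" direction, here is a minimal sketch that times each KV round-trip and records it in a histogram. It uses the Prometheus Go client rather than CRDB's internal metric package, and the metric name, buckets, and `send` callback are all invented for the example:

```go
package kvmetrics

import (
	"context"
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// kvRequestLatency is an illustrative histogram of KV batch round-trip
// latencies; the name and buckets are placeholders, not CRDB's real metrics.
var kvRequestLatency = prometheus.NewHistogram(prometheus.HistogramOpts{
	Name:    "kv_request_latency_seconds",
	Help:    "End-to-end latency of KV batch requests as seen by the SQL layer.",
	Buckets: prometheus.ExponentialBuckets(0.0005, 2, 16), // 0.5ms .. ~16s
})

func init() {
	prometheus.MustRegister(kvRequestLatency)
}

// TimedKVSend wraps an arbitrary KV send function and records its latency.
// The send func stands in for whatever client interface issues the request.
func TimedKVSend(ctx context.Context, send func(context.Context) error) error {
	start := time.Now()
	err := send(ctx)
	kvRequestLatency.Observe(time.Since(start).Seconds())
	return err
}
```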
Ya I agree with this idea. I like both of these ideas as ways to work around the difficulty of costing SQL:
I think I like 2 more than 1 tho 2 won't work for dedicated or on-prem obvi. Also 1 is prob needed to do 2 anyway.
Mhm. The coster does need to cost AHEAD of running the query, right? For the use case in this ticket, we can cost AFTER running the query. A bit easier, no? Still hard; I hear you.
At the KV level, what I would like to achieve is to be able to account for the time spent in each "stage" via metrics. Anything that is recorded there can (for the most part) also be returned via structured trace payloads (for requests for which this is desired). What is not covered here are tight SLAs, but clearly you can't enforce an SLA when you don't even measure the underlying quantities. However, structured metadata are a natural stepping stone towards SLAs as they allow a programmatic construction of an expectation that can then be compared against the timings of the actual operation.
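To make the per-stage idea concrete, here is a rough sketch (again with the Prometheus Go client, and with made-up stage names) of accumulating per-request stage durations, exporting them as per-stage histograms, and keeping the same breakdown in a struct that could be attached to a trace as a structured payload:

```go
package kvtiming

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// Stage names are illustrative, not the real KV pipeline stages.
const (
	StageLatchWait   = "latch_wait"
	StageLockWait    = "lock_wait"
	StageRaftRepl    = "raft_replication"
	StageStorageRead = "storage_read"
)

// stageLatency has one histogram per stage, labeled by stage name.
var stageLatency = prometheus.NewHistogramVec(prometheus.HistogramOpts{
	Name:    "kv_request_stage_latency_seconds",
	Help:    "Time spent by a KV request in each processing stage.",
	Buckets: prometheus.ExponentialBuckets(0.0001, 2, 18),
}, []string{"stage"})

func init() { prometheus.MustRegister(stageLatency) }

// StageTimings is the structured breakdown for a single request. The same
// struct could be attached to the request's trace as a structured payload.
type StageTimings struct {
	Durations map[string]time.Duration
}

func NewStageTimings() *StageTimings {
	return &StageTimings{Durations: map[string]time.Duration{}}
}

// Record notes time spent in one stage.
func (s *StageTimings) Record(stage string, d time.Duration) {
	s.Durations[stage] += d
}

// Emit flushes the per-request breakdown into the per-stage histograms.
func (s *StageTimings) Emit() {
	for stage, d := range s.Durations {
		stageLatency.WithLabelValues(stage).Observe(d.Seconds())
	}
}
```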
Could some of the work done for the "Predictability of foreground performance" initiative help build out metrics for when foreground latency starts to tick up, as a data point to correlate with SQL queries?
This prototype is an exploration of [this comment]. Our server-side RPC timing metrics are not great and in particular they don't allow us to see which fraction of the end-to-end latency is spent in contention (which is generally not something indicative of the health of the KV layer, as users can easily generate as much contention as they like). [this comment]: cockroachdb#71169 (comment)
Returning to this ahead of 22.2 planning. The idea is to measure E2E latency and error rate while excluding latency and errors caused by workload choices. For example, latency in replication should be tracked, but latency waiting for latches should not be tracked. Metrics of this kind would allow for some great SRE-facing alerts!! Metrics of this kind would also make CRDB more observable. The discussion here suggests doing this at the KV layer rather than the SQL layer, mainly because costing SQL queries is hard given the complexity of the interface. I buy this. I think we should consider executing on this ticket (at the KV layer) during the 22.2 cycle. @tbg, I never saw https://github.com/cockroachdb/cockroach/pull/72092/files. It's so cool!!! Do you have a sense of scope for this ticket given your explorations? CC @jeffswenson as we have been talking about KV alerting in the context of a recent serverless incident.
There are a few things that make good performance monitoring extra valuable to Serverless:
@andy-kimball makes a great point that we don't need to do this perfectly for it to be very useful, def for cloud & prob for on-prem.
What do you think about this, @jordanlewis? This "choose an important subset" idea is an answer to your point that costing SQL queries is hard. It is hard to do in the general case! But it's not so bad, if we only "cost" SQL queries that are easy to cost.
One thing that comes up as a desire from time to time is the ability to have time series metrics labeled by app. This is cost prohibitive in general because people can put arbitrary app names, and you can't have a system that generates arbitrarily many metrics. But what if we implemented this with an allowlist that permitted configuration of which apps to emit bucketed metrics for? Then you could allowlist queries that are emitted by systems that we understand, like the probers. We'd have metrics that we already understand getting exported, but scoped just to these apps. This doesn't accomplish the goal of automatically finding simple queries from user workloads, but maybe it would be helpful anyway?

I feel like getting metrics just for simple queries is possibly interesting. There are definitely some challenges with it, because some of the work that the system does happens before we know whether a query is "simple" or not (the whole optimizer has to run), and the optimizer needs to call the leasing system, which can do lookups and stuff. Then by the time we know the query is simple, we also know that the only real work to be done is starting.

Speaking of leasing though, are we already collecting metrics on the leasing / namespace / descriptor table retrieval times? If not, that would be a very low-hanging-fruit thing to add, and I do think it'd help SRE understand whether the system is working well or not. cc @postamar @ajwerner
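A sketch of what the allowlist idea could look like, using a labeled Prometheus histogram. The metric name, the hard-coded allowlist, and the `MaybeRecord` hook are all placeholders; in practice the allowlist would presumably live in a cluster setting:

```go
package appmetrics

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// queryLatencyByApp is labeled by application name, but only for apps that
// have been explicitly allowlisted, so the label cardinality stays bounded
// even though users can set arbitrary app names.
var queryLatencyByApp = prometheus.NewHistogramVec(prometheus.HistogramOpts{
	Name:    "sql_query_latency_by_app_seconds",
	Help:    "SQL query latency for allowlisted application names only.",
	Buckets: prometheus.ExponentialBuckets(0.0005, 2, 16),
}, []string{"app"})

func init() { prometheus.MustRegister(queryLatencyByApp) }

// allowedApps would in practice come from configuration; hard-coded here.
var allowedApps = map[string]bool{
	"sqlprober": true,
	"kvprober":  true,
}

// MaybeRecord records the latency only if the app name is allowlisted, so
// arbitrary user-chosen app names never create new metric series.
func MaybeRecord(app string, latency time.Duration) {
	if allowedApps[app] {
		queryLatencyByApp.WithLabelValues(app).Observe(latency.Seconds())
	}
}
```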
I think it could be useful! OTOH I don't see how it would help with SRE alerting.
I'm not seeing where the challenge is yet. We can just always measure latency. But we write the measured latency to metrics only if the query is simple (e.g. a single row read / write). We need to know whether to write to metrics when the query finishes executing but not before. So it's fine that the optimizer needs to run first. Can you explain a bit more if I'm missing the point you are making?
Ya good idea. But I really think we should pay the cost to get some nice E2E measurement of system performance that mostly doesn't depend on workload choices made by the customer. It's a really powerful thing to have in my experience.
Oh, yeah, that's a good point! I was confused and misunderstanding something. I like this idea. We could have a latency metric that gets recorded for user-emitted queries under a certain cost. It seems reasonable that we could define an SLO on that. I don't know how we'd identify what that cost would be. @rytaft what do you think about this idea?
I'll summarize for @rytaft since this ticket is long. We want latency metrics that track what our customers are experiencing E2E, as E2E behavior is what users actually care about (KV is an implementation detail of CRDB). In a cloud context, we want those metrics to indicate problems when CRDB is broken and/or needs CRL eng including SRE to take action but NOT when a customer simply needs to change their workload. Here is a strawman metric that has these properties:
Via something like the above, we have a measure of single row read & write latency that should be stable, regardless of what workload a customer throws at a cluster. A quick footnote!
This is an assumption! Bugs in CRDB can lead to contention and hardware overload. Automatic alerting that faces SREs must make such assumptions. The idea is that we don't track sources of latency that most often are caused by workload choices rather than CRDB misbehavior, as we need our alerting false positive rate to be very very low.
Not really; for example, if the customer has a long-running txn open, it's really their fault, but it can completely block single-row operations. So we need to measure only some components of these single-row operations, notably excluding contention time, which I started exploring a building block for in #72092.
Apologies, you point out that exact thing, not sure how I skimmed past that.
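Putting the strawman and the contention caveat together, a minimal sketch of the recording step might look like the following. It assumes the executor can supply total wall time, contention wait, and a "simple query" classification after execution; none of these names correspond to real CRDB code, and this is not how #72092 does it:

```go
package slometrics

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// simpleQueryServiceLatency tracks E2E latency of "simple" queries with
// contention time removed, so workload-induced blocking does not pollute it.
var simpleQueryServiceLatency = prometheus.NewHistogram(prometheus.HistogramOpts{
	Name:    "sql_simple_query_service_latency_seconds",
	Help:    "Latency of single-row reads/writes, excluding contention wait.",
	Buckets: prometheus.ExponentialBuckets(0.0005, 2, 16),
})

func init() { prometheus.MustRegister(simpleQueryServiceLatency) }

// QueryStats is what we assume the executor can hand us after a query
// finishes: total wall time, time spent waiting on locks/latches, and a
// flag saying whether the query was classified as simple (single row).
type QueryStats struct {
	Total          time.Duration
	ContentionWait time.Duration
	IsSimple       bool
}

// RecordIfSimple is called after execution; by then we know whether the
// query was simple, which is all the classification we need.
func RecordIfSimple(s QueryStats) {
	if !s.IsSimple {
		return
	}
	service := s.Total - s.ContentionWait
	if service < 0 {
		service = 0
	}
	simpleQueryServiceLatency.Observe(service.Seconds())
}
```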
On the SQL side, it's actually not that hard to know if a query is "simple" (e.g., affecting a single row). We already do such a thing to determine whether a bounded staleness read is allowed: see cockroach/pkg/sql/opt/exec/execbuilder/relational.go, lines 601 to 623 (at c518952).
I would not recommend using the optimizer cost for this. It is not designed to be meaningful outside the context of a single query. Assuming the query is "simple" according to our definition, we could potentially enable the use of @tbg's |
It's been a while since I typed the
Anyone have a rough idea of how much eng work this one is? Obviously scope is not decided yet, and so we have some flexibility. Just looking for a rough rough idea.
Probably a few weeks of work for one engineer on SQL Queries + whatever is needed from kv. (This is assuming that we can reuse the
cc @vy-ton
I filed #82200 to cover the KV portion of this work (or at least get us to a place where it's not a big lift to integrate with whatever is happening "above KV" to do this issue). If my understanding is correct (cc @nkodali), KV Observability is going to work on #82200 this cycle (and I will advise). I'll add that to me it is pretty unclear what the commitment to this issue (E2E latency) is. #82200 is independently valuable for the KV area, so we'd like to do it regardless. I don't see anything scheduled in SQL, so my expectation is that this issue will sit dormant, but hopefully #82200 can independently be of use for the SREs.
A much smaller work item related to this ticket is tracked at #98305. I hope to implement the linked ticket in the next six months as part of work on disaggregated storage. Understanding what read latencies we achieve on CC will help us deliver that project.
TLDR: Skip to the "Describe the solution you'd like" section!!!
Is your feature request related to a problem? Please describe.
In the case of our cloud product:
The need for the above is clear in the case of our cloud product, but on-prem would benefit too. Being able to quickly tell issues with CRDB apart from issues with a customer workload makes DB eng, SRE, TSE, and customers more efficient.
Describe alternatives you've considered
The alternative is the current CC monitoring stack. We alert on various signals of definite trouble. For example:
- If `sqlprober` can't `SELECT` a test row, page SRE.
- If `kvprober` can't read & shadow-write a key dedicated to probing on a "randomly" chosen range, page SRE.

This is fine. If the above signals indicate trouble, there is trouble, as the above signals don't fire on contention or other workload issues (in the case of the probers, the workload is controlled by CRL eng and thus not problematic). That is, the false positive rate is low [1].
[1] Modulo that, until admission control rolls out, we will be paged when a dedicated cluster is out of CPU.
What's the problem then? The false negative rate is too high! This means we miss outages in production. It also means that when no alerts are firing, we can't be confident any remaining problem is a workload issue rather than CRDB. The latter possibly matters more to operational efficiency than the former, tho both are important.
Examples of outages the above alerting config might miss:
- Outages that `sqlprober` misses (e.g. ones affecting only certain tables).
- Outages that `kvprober` misses (e.g. only writes slowed down).
- `kvserver` deadlocks (could be caught by `kvprober` but only slowly, given randomized range selection).

Describe the solution you'd like
What matters to users is that SQL is available and served at low latencies. Many many many different concrete production issues lead to high SQL latency or error rates. Why don't we alert on SQL latency & error rate then? Doing this would lead to a low false negative rate: if an issue actually matters to a customer, it translates into SQL being either not available or served at high latencies. Alerting on the symptoms the user cares about is great. Why alert on anything else? If we are not providing users what they care about, of course we should page CRL eng including SRE!!!
The obvious issue is that doing this naively will lead to alerting on poor workload given how the cluster is provisioned, which will push ops load way way up & burn us all out. For example:
The crux is: How can we measure E2E SQL latency & error rate via time-series metrics in a way that doesn't include the above workload issues? For now let's just consider latency.
For 1 & 2, we need to measure latency while excluding the portion of latency contributed by certain components:
3 is harder! One idea is bucketing queries by their expected cost and setting different alerting thresholds on the buckets (e.g. full table scans wouldn't have an SLA but single row reads would). Another idea is normalizing the measured latency by expected cost, e.g. you can imagine dividing each full table scan latency by the number of rows read or similar. These are half ideas only. To the best of my knowledge, the bucketing approach has worked well on a Google time-series database. That database does not provide an interface as flexible (and thus as easy to foot-gun yourself with) as SQL. But still I contend that 3 is possible for CRDB too.
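For illustration only, here is one way the two half ideas could look in metric form: put each statement into a coarse bucket by estimated rows touched, and also record a per-row-normalized latency for scans. The metric names, bucket thresholds, and the `Record` hook are all invented, not real CRDB metrics:

```go
package costbuckets

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// latencyByCostBucket lets different alert thresholds apply per bucket:
// e.g. a tight SLO on "single_row", none on "full_scan".
var latencyByCostBucket = prometheus.NewHistogramVec(prometheus.HistogramOpts{
	Name:    "sql_query_latency_by_cost_bucket_seconds",
	Help:    "SQL latency grouped by a coarse expected-cost bucket.",
	Buckets: prometheus.ExponentialBuckets(0.0005, 2, 18),
}, []string{"bucket"})

// latencyPerRow is the normalization idea: latency divided by rows read,
// so a large scan is judged on per-row throughput rather than total time.
var latencyPerRow = prometheus.NewHistogram(prometheus.HistogramOpts{
	Name:    "sql_scan_latency_per_row_seconds",
	Help:    "Scan latency normalized by the number of rows read.",
	Buckets: prometheus.ExponentialBuckets(0.000001, 2, 20),
})

func init() { prometheus.MustRegister(latencyByCostBucket, latencyPerRow) }

// bucketFor picks a coarse bucket from estimated rows; thresholds are made up.
func bucketFor(estimatedRows int64) string {
	switch {
	case estimatedRows <= 1:
		return "single_row"
	case estimatedRows <= 1000:
		return "small"
	default:
		return "full_scan"
	}
}

// Record applies both half ideas to one finished statement.
func Record(estimatedRows, rowsRead int64, latency time.Duration) {
	latencyByCostBucket.WithLabelValues(bucketFor(estimatedRows)).Observe(latency.Seconds())
	if rowsRead > 0 {
		latencyPerRow.Observe(latency.Seconds() / float64(rowsRead))
	}
}
```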
CC @JuanLeon1 @bdarnell @tbg @andreimatei @jordanlewis @jreut @udnay @sumeerbhola for thoughts!
Additional context
If anyone wants links to a CC specific design doc about monitoring as extra context, plz say so.
Jira issue: CRDB-10421
Epic CRDB-32145