
metric: periodic gaps in histogram metrics #98266

Closed
kvoli opened this issue Mar 9, 2023 · 13 comments · Fixed by #104088
Assignees
Labels
A-observability-inf C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.

Comments

@kvoli
Collaborator

kvoli commented Mar 9, 2023

Describe the problem

When recording to a Prometheus histogram every 10s, gaps appear in the DB Console metric every HistogramOptions.Duration interval.

[screenshot: periodic gaps in the histogram timeseries]

Screenshot taken with these histogram settings:

metric.HistogramOptions{
    Duration: 2 * histogramWindow, /* 120s */
    Buckets:  metric.IOLatencyBuckets,
    Mode:     metric.HistogramModePrometheus,
}

To Reproduce

  1. Build this commit 3515d1d
  2. Run a workload e.g. ./cockroach workload run kv --read-percent=95 --drop --zipfian
  3. Inspect output of rebalancing.replicas.cpunanospersecond

Jira issue: CRDB-25158

@kvoli kvoli added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. T-observability-inf labels Mar 9, 2023
@aadityasondhi aadityasondhi self-assigned this Mar 9, 2023
@aadityasondhi
Collaborator

Ran some experiments. The problem seems to be that the replica stats are stored every 10s and we scrape for tsdb at a similar cadence (I think it is also every 10s). So when we scrape for the first time after a tick, we get an empty histogram from which we are calculating percentiles. This was documented as a limitation of our current windowed design:

// TODO(obs-inf): the way we implement windowed histograms is not great. If

Some solutions I can think of:

  1. Store stats for twice the rotation duration so that we have a combined view of the cur + prev bucket. This way we avoid reporting 0 after a fresh tick (seems like a simple solution)
  2. Synchronize ticks with tsdb scrapes so that we always scrape before a tick. (seems complex and could have side effects since our rotation duration is configurable).
  3. Record windowed stats in tsdb of the previous bucket instead of the current. We trade off having the most recent data but will always have a complete view of the rotation duration.

I think 3 is a good option if we are willing to make the tradeoff, otherwise, 1 would be my preference if we need to report the most recent bucket. 2 seems like the most accurate way to do it but I am hesitant to do that considering we are so late into the release and wouldn't want to break histograms.
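
To make the failure mode concrete: quantiles are interpolated from the window's bucket counts, so a zero-sample window has no meaningful quantile, which is what surfaces as the gaps above. A simplified, illustrative version of that calculation (not the actual implementation, which interpolates within buckets):

import "math"

// Simplified sketch of quantile-from-buckets (illustrative only):
// counts holds per-bucket sample counts for the current window. With
// zero samples in the window, there is no meaningful quantile to report.
func quantileFromBuckets(q float64, upperBounds []float64, counts []uint64) (float64, bool) {
    var total uint64
    for _, c := range counts {
        total += c
    }
    if total == 0 {
        return 0, false // empty window: the source of the gaps above
    }
    rank := uint64(math.Ceil(q * float64(total)))
    var seen uint64
    for i, c := range counts {
        seen += c
        if seen >= rank {
            return upperBounds[i], true // coarse: report the bucket's upper bound
        }
    }
    return upperBounds[len(upperBounds)-1], true
}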

cc @dhartunian

[screenshot: timeseries with no gaps, taken after adjusting the interval for replica stats to 500ms instead of 10s]

@andreimatei
Contributor

  2. Synchronize ticks with tsdb scrapes so that we always scrape before a tick. (seems complex and could have side effects since our rotation duration is configurable).

To me, this one seems the most appealing. Unless we tie the rotation frequency to the tsdb poll, the problem you quote applies to option 3 as well, doesn't it?

@aadityasondhi
Collaborator

Yeah, with 3 we will always report stats that are one rotation interval old, which on second thought seems less than ideal. I can take a look into how we can implement 2.

What are your thoughts on 1? When we scrape/calculate quantiles, we can use a merged view of current and previous windows.

@andreimatei
Contributor

Truth be told, I don't understand 1 and 3 very well. Our problem is not simply that we want to avoid reporting zeroes; it's that we want to report data that makes sense (i.e. is calculated over the past 10s, not more than that).
For Option 1), let's consider this scenario:

|---------|---- ------|-----
a          b   c      d     e

Let's say that our windowed histogram ticks at time a, then b, then d. And let's say that our poller comes around at time c. If I understand correctly, you're proposing that we record the data for (a,c), right?
Then, let's say that the poller comes again at time e. Now we're recording (b,e), right? So we're double-counting the data for (b,c). And each point on the quantile graphs will represent not 10s, not 20s, but some duration in between.
In the case where the window ticking interval is smaller than the polling interval, we'll also be missing some data (like we also do today).
Similarly, if the two intervals are the same and they actually align (i.e. we start the timers at ~the same time), the ticking and the polling are racing: sometimes you'll look at two buckets of which one is empty, other times you'll look at two full buckets.
I find it a bit hard to reason about this scheme (assuming I got it right), because it's complicated.

So, it seems to me that we should synchronize the polling and the ticking, one way or another. It can be done explicitly, like you propose in 2), or maybe I can suggest
4) Lift the windowing into the poller: Get rid of any notion of windows in the histogram implementation; maintain only the data for one cumulative histogram. And have the poller capture a snapshot of the histograms it polls and record the delta from the previous snapshot.
In contrast to 2), this would keep histograms able to be polled by different pollers.
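
A rough sketch of what 4) could look like (the types and names here are hypothetical, not CockroachDB's actual API): the histogram stays purely cumulative, and each poller keeps its own previous snapshot, so the recorded delta covers exactly one poll interval.

// Hypothetical sketch of option 4: the poller snapshots each cumulative
// histogram on every scrape and records the delta from its previous
// snapshot, so each tsdb point covers exactly one poll interval.
type snapshot struct {
    buckets []uint64 // cumulative per-bucket counts
    count   uint64
    sum     float64
}

type poller struct {
    prev map[string]snapshot // last snapshot, keyed by histogram name
}

func (p *poller) delta(name string, cur snapshot) snapshot {
    last := p.prev[name] // zero-valued on the first poll
    d := snapshot{buckets: make([]uint64, len(cur.buckets))}
    for i := range cur.buckets {
        d.buckets[i] = cur.buckets[i]
        if i < len(last.buckets) {
            d.buckets[i] -= last.buckets[i] // cumulative counts only grow
        }
    }
    d.count = cur.count - last.count
    d.sum = cur.sum - last.sum
    p.prev[name] = cur
    return d // the point recorded at time T is computed from this delta
}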


The problem seems to be that the replica stats are stored every 10s and we scrape for tsdb at a similar cadence (I think it is also every 10s).

I'm confused, because the window size seems to be 60s (from here). Which, if true, would make no sense to me... Is it true?

@aadityasondhi
Collaborator

I see your point of view. Thanks for the detailed explanation. I don't think it will be double counting, since we use these to store averages and quantiles, which are a function of the number of samples rather than of elapsed time. But I do agree that having an ambiguous time window for what the data represents is not the best way to do it.

I like the idea of moving the window into the poller. Just want to clarify that this would mean that we remove the windowing config option entirely and always have a 10s windowing (or whatever we decide we want our polling interval to be).

I'm confused, because the window size seems to be 60s (from here). Which, if true, would make no sense to me... Is it true?

The window size is indeed 60s by default. The problem here is that when we poll for the first time after a tick, we are polling an empty histogram since the histogram in question only reports values every 10s.

@andreimatei
Contributor

I don't think it will be double counting since we use these to store averages and quantiles which are a factor of the number of samples and not time that has passed.

By "double counting", I mean some measurements will affect multiple tsdb records, and others will affect only one (e.g. (c,d) if we continue the above example).

I like the idea of moving the window into the poller. Just want to clarify that this would mean that we remove the windowing config option entirely and always have a 10s windowing (or whatever we decide we want our polling interval to be).

I had no idea about the 60s window until today. So every 10s we have a point that represents the p99 over the past 60s? Or actually, even worse - the first point is p99[10s], the next is p99[20s],...p99[60s], p99[10s],...
It strikes me as bizarre; I think people expect every point to be the p99 over the past 10s. Otherwise, it gets really confusing when aggregating these percentiles over longer durations.
So, yeah, I'd get rid of it. Perhaps you could dig into the history a little bit and see if there was some reason for doing it this way.
If we really want the quantiles to be computed over a different duration, the poller can maintain multiple snapshots per histogram.

The window size is indeed 60s by default. The problem here is that when we poll for the first time after a tick, we are polling an empty histogram since the histogram in question only reports values every 10s.

Ack. I had misread the issue as complaining about gaps every 10s.

@aadityasondhi
Collaborator

aadityasondhi commented Mar 14, 2023

I had no idea about the 60s window until today. So every 10s we have a point that represents the p99 over the past 60s? Or actually, even worse - the first point is p99[10s], the next is p99[20s],...p99[60s], p99[10s],...

Yeah, I think we are on the same page now.

Perhaps you could dig into the history a little bit and see if there was some reason for doing it this way.

@tbg might have a better idea why this is the case. I believe the reasoning behind this is the same as the reason we have issues with this histogram: by making window sizes configurable (and separate from polling), we let users (for example, the KV team) decide what they want their histogram to represent. A p99 over a 10s window might be odd for histograms such as the one in question, where we take a measurement every 10s: it will on average have 1 sample per tsdb record, and depending on the timing of the read it might even have 0, which can paint the wrong picture for the reader of the metric. The windows give us the ability to provide moving averages/percentiles over a larger window so that we don't have sudden drops like the ones seen here.

I think the implementation had limitations (as noted in the comments), but the underlying idea seems reasonable: the user can define the window over which their histogram reports moving averages/percentiles. Maybe we should maintain a queue of 10s chunks and always serve the last window's worth of data (i.e. the last 60s), as sketched below. This would keep windows configurable while also ensuring consistent windowing behavior.
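
For illustration, the chunk queue could look roughly like this (a hypothetical sketch, assuming a fixed tick interval and per-bucket counts; not a proposed implementation):

// Hypothetical sketch of the chunk queue: rotate a fresh chunk every
// tick (e.g. 10s) and answer reads with the merge of all retained
// chunks, so every read covers the same trailing window (e.g. 60s).
type chunkedHistogram struct {
    chunks [][]uint64 // ring buffer: one bucket-count slice per tick
    head   int        // index of the chunk currently being written
}

func newChunkedHistogram(numChunks, numBuckets int) *chunkedHistogram {
    h := &chunkedHistogram{chunks: make([][]uint64, numChunks)}
    for i := range h.chunks {
        h.chunks[i] = make([]uint64, numBuckets)
    }
    return h
}

func (h *chunkedHistogram) tick() {
    h.head = (h.head + 1) % len(h.chunks)
    for i := range h.chunks[h.head] {
        h.chunks[h.head][i] = 0 // evict the oldest chunk, start fresh
    }
}

func (h *chunkedHistogram) record(bucket int) { h.chunks[h.head][bucket]++ }

func (h *chunkedHistogram) read() []uint64 {
    merged := make([]uint64, len(h.chunks[0]))
    for _, c := range h.chunks { // sum the trailing window's chunks
        for i, v := range c {
            merged[i] += v
        }
    }
    return merged
}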

@tbg
Member

tbg commented Mar 14, 2023

I don't think there was ever a desire to vary the window size by histogram. Rather, it's all a pile-up of unhappy historical accidents. I agree with Andrei's suggestion to more tightly couple the TSDB loop with the windowed histogram. I think so far we have tried to avoid this, but it's leading to the windowed histograms representing fairly unclear actual time windows. The only thing that really makes sense is that the datapoint at time T uses a windowed histogram that contains measurements that happened in the interval between the last scrape (so ~T-10s) and T. We were likely worried that there would be multiple scrapers sometimes (which, I think, can happen if you also scrape prometheus) so we should "simply" make sure that there is only one kind of scrape that actually affects the windowed histogram, and that this one scrape really only happens every 10s.

@aadityasondhi
Collaborator

That makes sense. Thanks for the clarification.

If we make this change, what happens to the histogram mentioned in this issue? Should we discourage people from using histograms in cases where they only record data every 10s or less often? Using your example, a histogram with a low sample cadence might have 0 data points within the [T-10s, T] window. This would show up as 0 on the graphs and paint the wrong picture of the underlying metric.

@kvoli
Collaborator Author

kvoli commented Mar 14, 2023

From reading the thread, I'm thinking it would be easiest (for the use case at the top) to implement some histogram helpers for periodic recording, where all values we wish to record for a time window are known upfront. An example I found is in the client_go pkg.

To shed some more light on the use case:

I want to view, for each 10s interval, what the 99th, 95th, 50th, etc. percentiles of replica load are, in order to identify outliers and compare against other intervals and stores.

To implement this, I piggybacked on existing logic where, every 10s, the store iterates over all its replicas and grabs each replica's load stats. With the load stats, I record the replica's value to the histogram.

@aadityasondhi
Collaborator

Based on an in-person discussion with Andrei, I propose the following:

  1. Lift the windowing logic into the poller. This makes sure that each value that we store into tsdb is of the window [T-t, T], where t is the last poll time and T is the current time. This will work for histograms that are of cumulative nature.
  2. For histograms such as the one above, I would suggest using ManualWindowHistogram. I recognize that the current interface is cumbersome, and I will rework it to make it easier to use. When you create this histogram, you will be required to provide a promised update interval. The poller will report the last known value of these histograms as long as T-t < update interval. To avoid race conditions such as the one in this issue, we will allow for 1 missed update but start reporting 0 if the poller misses twice in a row (see the sketch below).

I will prioritize 2 over 1 to unblock the PR. Let me know if this sounds good as a resolution.
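
A sketch of the staleness rule in 2), with hypothetical fields (not the final interface): the poller re-reports the last known value while it is less than two promised update intervals old, and reports zero after that.

import "time"

// Hypothetical sketch of the staleness rule: tolerate one missed
// update, then report zeros so a stalled recorder doesn't look healthy.
type manualWindow struct {
    updateInterval time.Duration // promised at construction time
    lastUpdate     time.Time     // when the histogram was last updated
    lastBuckets    []uint64      // bucket counts from the most recent update
}

func (m *manualWindow) read(now time.Time) []uint64 {
    if now.Sub(m.lastUpdate) < 2*m.updateInterval {
        return m.lastBuckets // at most one update missed: re-report the last value
    }
    return make([]uint64, len(m.lastBuckets)) // two misses: report zeros
}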

cc @andreimatei @kvoli

@kvoli
Collaborator Author

kvoli commented Mar 14, 2023

For histograms such as the one above, I would suggest using ManualWindowHistogram. I recognize that the current interface is cumbersome, and I will rework it to make it easier to use. When you create this histogram, you will be required to provide a promised update interval. The poller will report the last known value of these histograms as long as T-t < update interval. To avoid race conditions such as the one in this issue, we will allow for 1 missed update but start reporting 0 if the poller misses twice in a row.

That seems perfect for my use case. Would the interface for recording be Observe(float64), called for every value?

@aadityasondhi
Collaborator

Would the interface for recording be Observe(float64), called for every value?

I think the interface will require you to maintain a local histogram and call Update(Histogram) whenever you want to insert the new values and overwrite the previous histogram. This will mostly mimic what the batch histogram in Prometheus does: https://github.com/prometheus/client_golang/blob/3d2cf0b338e19b0eaf277496058fc77bd3add440/prometheus/go_collector_latest.go#L454-L456.
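
Concretely, usage might look something like the sketch below (the Update interface and metric name are stand-ins mirroring the proposal above, not a finalized API):

import "github.com/prometheus/client_golang/prometheus"

// Hypothetical usage sketch: build a local histogram for the interval,
// then hand it to the manual histogram wholesale, overwriting the
// previous window (mirroring Prometheus's batch-histogram pattern).
func reportInterval(manualHist interface{ Update(prometheus.Histogram) }, samples []float64) {
    local := prometheus.NewHistogram(prometheus.HistogramOpts{
        Name:    "replica_load", // illustrative name
        Buckets: prometheus.ExponentialBuckets(1e3, 2, 20),
    })
    for _, v := range samples { // all values for the window, known upfront
        local.Observe(v)
    }
    manualHist.Update(local) // replace the previous window's contents
}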

kvoli added a commit to kvoli/cockroach that referenced this issue Mar 24, 2023
This commit extends the ManualWindowHistogram to support RecordValue and
Rotate. Previously, it was necessary to maintain duplicate cumulative
histograms in order to batch update the manual histogram. This update
adds a quality of life feature, enabling recording to the
ManualWindowHistogram, then once finished, rotating the batch of
recorded values into the current window for the internal tsdb to query.

Touches: cockroachdb#98266

Release note: None
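
In practice, the flow this commit enables is roughly the following (a sketch; the interface below is a stand-in for the metric package's actual API):

// Sketch of the RecordValue/Rotate flow described by this commit:
// record a batch of values directly, then rotate the batch into the
// window that the internal tsdb reads.
type batchRecorder interface {
    RecordValue(v int64)
    Rotate()
}

func flush(hist batchRecorder, loadValues []int64) {
    for _, v := range loadValues { // e.g. one load value per replica
        hist.RecordValue(v)
    }
    hist.Rotate() // expose the recorded batch to the next tsdb read
}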
aadityasondhi added a commit to aadityasondhi/cockroach that referenced this issue May 23, 2023
Prior to this change, the metric.Histogram maintained two separate
histograms under the hood: cumulative and windowed. The reason for the
windowed histogram was to be able to export the most recent recordings
to TSDB. However, the windowed histogram's ticks were not synchronized
with the polling of metrics done by TSDB. This caused problems where the
TSDB polling would read an empty window and report gaps in metrics as
seen in: cockroachdb#98266.

This patch separates the windowed logic from pkg/metric and makes TSDB
polling keep track of deltas using a singular cumulative histogram.

Fixes cockroachdb#98266
Fixes cockroachdb#98621

Release note: None
ericharmeling added a commit to ericharmeling/cockroach that referenced this issue Jun 1, 2023
Fixes cockroachdb#103814.
Fixes cockroachdb#98266.

This commit updates the windowed histogram merging approach
to add the previous window's histogram bucket counts and sample
count to those of the current one. As a result of this change,
the histograms will no longer report under-sampled quantile values,
and timeseries metrics-derived charts (e.g., the quantile-based
SQL service latency charts on the DB console's Metrics page) will
more accurately display metrics.

Release note (bug fix): Updated the histogram window merge calculation
to more accurately interpolate quantile values. This change will result
in smoother, more accurate Metrics charts on the DB Console.

Co-authored-by: Aaditya Sondhi <[email protected]>
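
The merge this fix describes is conceptually a per-bucket sum, roughly as below (illustrative only, assuming both windows share the same bucket boundaries; not the exact CockroachDB code):

// Illustrative sketch of the fix: compute quantiles over the sum of the
// previous and current windows' bucket and sample counts, so a freshly
// rotated (near-empty) current window no longer under-samples quantiles.
func mergeWindows(cur, prev []uint64, curCount, prevCount uint64) ([]uint64, uint64) {
    merged := make([]uint64, len(cur)) // assumes identical bucket boundaries
    for i := range cur {
        merged[i] = cur[i] + prev[i]
    }
    return merged, curCount + prevCount
}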
craig bot pushed a commit that referenced this issue Jun 13, 2023
103729: log: add headers, compression config to http servers r=healthy-pod a=dhartunian

This commit adds support for custom headers and gzip compression on requests made to `http-server` outputs in CRDB's logging configuration. Custom headers enable the inclusion of API keys for 3rd party sinks and gzip compression reduces network resource consumption.

Resolves #103477

Release note (ops change): `http-defaults` and `http-servers` sections of the log config will now accept a `headers` field containing a map of key-value string pairs, which will comprise custom HTTP headers appended to every request. Additionally, a `compression` value is supported, which can be set to `gzip` or `none` to select a compression method for the HTTP request body. By default `gzip` is selected. This is a change from previous functionality, which did not compress by default.

103963: kvclient: add x-region, x-zone metrics to DistSender r=healthy-pod a=wenyihu6

Previously, there were no metrics to observe cross-region, cross-zone traffic in
batch requests / responses at DistSender.

To address this, this commit introduces six new DistSender metrics:
```
"distsender.batch_requests.replica_addressed.bytes"
"distsender.batch_responses.replica_addressed.bytes"
"distsender.batch_requests.cross_region.bytes"
"distsender.batch_responses.cross_region.bytes"
"distsender.batch_requests.cross_zone.bytes"
"distsender.batch_responses.cross_zone.bytes"
```

The first two metrics track the total byte count of batch requests processed and
batch responses received at DistSender. Additionally, there are four metrics to
track the aggregate counts processed and received across different regions and
zones. Note that these metrics only track the sender node and not the receiver
node as DistSender resides on the gateway node receiving SQL queries.

Part of: #103983

Release note (ops change): Six new metrics -
"distsender.batch_requests.replica_addressed.bytes",
"distsender.batch_responses.replica_addressed.bytes",
"distsender.batch_requests.cross_region.bytes",
"distsender.batch_responses.cross_region.bytes",
"distsender.batch_requests.cross_zone.bytes",
"distsender.batch_responses.cross_zone.bytes"- are now added to DistSender 
metrics.

For accurate metrics, follow these assumptions:
- Configure region and zone tier keys consistently across nodes.
- Within a node locality, ensure unique region and zone tier keys.
- Maintain consistent configuration of region and zone tiers across nodes.

104088: metrics: fix windowed histogram merging approach r=ericharmeling a=ericharmeling

Fixes #103814.
Fixes #98266.

This commit updates the windowed histogram merging approach to add the previous window's histogram bucket counts and sample count to those of the current one. As a result of this change, the histograms will no longer report under-sampled quantile values, and timeseries metrics-derived charts (e.g., the quantile-based SQL service latency charts on the DB console's Metrics page) will more accurately display metrics.

Release note (bug fix): Updated the histogram window merge calculation to more accurately interpolate quantile values. This change will result in smoother, more accurate Metrics charts on the DB Console.

104376: sql: CREATEROLE now includes ability to grant non-admin roles r=rafiss a=rafiss

fixes #104371

This matches the PostgreSQL behavior.

Release note (security update): Users who have the CREATEROLE role
option can now grant and revoke role membership in any non-admin role.

This change also removes the sql.auth.createrole_allows_grant_role_membership.enabled
cluster setting, which was added in v23.1. Now, the cluster setting is
effectively always true.

104802: kvserver: prevent split at invalid tenant prefix keys r=arulajmani a=tbg

Closes #104796.

Epic: None
Release note (bug fix): prevents invalid splits that can crash (and prevent
restarts) of nodes that hold a replica for the right-hand side.


Co-authored-by: David Hartunian <[email protected]>
Co-authored-by: Wenyi <[email protected]>
Co-authored-by: Eric Harmeling <[email protected]>
Co-authored-by: Rafi Shamim <[email protected]>
Co-authored-by: Tobias Grieger <[email protected]>
@craig craig bot closed this as completed in bae5045 Jun 13, 2023