
metric: periodic gaps in histogram metrics #98266

Closed
kvoli opened this issue Mar 9, 2023 · 13 comments · Fixed by #104088
Assignees
Labels
A-observability-inf C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.

Comments

@kvoli
Collaborator

kvoli commented Mar 9, 2023

Describe the problem

When recording to a Prometheus histogram every 10s, gaps appear in the DB Console metric every HistogramOptions.Duration interval.

[screenshot: periodic gaps in the histogram timeseries]

Screenshot taken with these histogram settings:

metric.HistogramOptions{
    Duration: 2 * histogramWindow, /* 120s */
    Buckets:  metric.IOLatencyBuckets,
    Mode:     metric.HistogramModePrometheus,
}

To Reproduce

  1. Build this commit 3515d1d
  2. Run a workload e.g. ./cockroach workload run kv --read-percent=95 --drop --zipfian
  3. Inspect output of rebalancing.replicas.cpunanospersecond

Jira issue: CRDB-25158

@kvoli kvoli added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. T-observability-inf labels Mar 9, 2023
@aadityasondhi aadityasondhi self-assigned this Mar 9, 2023
@aadityasondhi
Collaborator

Ran some experiments. The problem seems to be that the replica stats are stored every 10s and we scrape for tsdb at a similar cadence (I think it is also every 10s). So when we scrape for the first time after a tick, we get an empty histogram from which we are calculating percentiles. This was documented as a limitation of our current windowed design:

// TODO(obs-inf): the way we implement windowed histograms is not great. If

Some solutions I can think of:

  1. Store stats for twice the rotation duration so that we have a combined view of the cur + prev bucket. This way we avoid reporting 0 after a fresh tick (seems like a simple solution)
  2. Synchronize ticks with tsdb scrapes so that we always scrape before a tick. (seems complex and could have side effects since our rotation duration is configurable).
  3. Record windowed stats in tsdb of the previous bucket instead of the current. We trade off having the most recent data but will always have a complete view of the rotation duration.

I think 3 is a good option if we are willing to make the tradeoff, otherwise, 1 would be my preference if we need to report the most recent bucket. 2 seems like the most accurate way to do it but I am hesitant to do that considering we are so late into the release and wouldn't want to break histograms.
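
To make the failure mode concrete: quantiles are interpolated from the window's bucket counts, so a zero-sample window has no meaningful quantile, which is what surfaces as the gaps above. A simplified, illustrative version of that calculation (not the actual implementation, which interpolates within buckets):

import "math"

// Simplified sketch of quantile-from-buckets (illustrative only):
// counts holds per-bucket sample counts for the current window. With
// zero samples in the window, there is no meaningful quantile to report.
func quantileFromBuckets(q float64, upperBounds []float64, counts []uint64) (float64, bool) {
    var total uint64
    for _, c := range counts {
        total += c
    }
    if total == 0 {
        return 0, false // empty window: the source of the gaps above
    }
    rank := uint64(math.Ceil(q * float64(total)))
    var seen uint64
    for i, c := range counts {
        seen += c
        if seen >= rank {
            return upperBounds[i], true // coarse: report the bucket's upper bound
        }
    }
    return upperBounds[len(upperBounds)-1], true
}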

cc @dhartunian

[screenshot: timeseries with no gaps, taken after adjusting the interval for replica stats to 500ms instead of 10s]

@andreimatei
Contributor

  2. Synchronize ticks with tsdb scrapes so that we always scrape before a tick. (seems complex and could have side effects since our rotation duration is configurable).

To me, this one seems the most appealing. Unless we tie the rotation frequency to the tsdb poll, the problem you quote applies to option 3 as well, doesn't it?

@aadityasondhi
Collaborator

Yeah, with 3 we will always report stats that are one rotation interval old, which on second thought seems less than ideal. I can take a look into how we can implement 2.

What are your thoughts on 1? When we scrape/calculate quantiles, we can use a merged view of current and previous windows.

@andreimatei
Contributor

Truth be told, I don't understand 1 and 3 very well. Our problem is not simply that we want to avoid reporting zeroes; it's that we want to report data that makes sense (i.e. is calculated over the past 10s, not more than that).
For Option 1), let's consider this scenario:

|---------|---- ------|-----
a          b   c      d     e

Let's say that our windowed histogram ticks at time a, then b, then d. And let's say that our poller comes around at time c. If I understand correctly, you're proposing that we record the data for (a,c), right?
Then, let's say that the poller comes again at time e. Now we're recording (b,e), right? So we're double-counting the data for (b,c). And each point on the quantile graphs will represent not 10s, not 20s, but some duration in between.
In the case where the window ticking interval is smaller than the polling interval, we'll also be missing some data (like we also do today).
Similarly, if the two intervals are the same and they actually align (i.e. we start the timers at ~the same time), the ticking and the polling are racing: sometimes you'll look at two buckets of which one is empty, other times you'll look at two full buckets.
I find it a bit hard to reason about this scheme (assuming I got it right), because it's complicated.

So, it seems to me that we should synchronize the polling and the ticking, one way or another. It can be done explicitly, like you propose in 2), or maybe I can suggest
4) Lift the windowing into the poller: Get rid of any notion of windows in the histogram implementation; maintain only the data for one cumulative histogram. And have the poller capture a snapshot of the histograms it polls and record the delta from the previous snapshot.
In contrast to 2), this would keep histograms able to be polled by different pollers.
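
A rough sketch of what 4) could look like (the types and names here are hypothetical, not CockroachDB's actual API): the histogram stays purely cumulative, and each poller keeps its own previous snapshot, so the recorded delta covers exactly one poll interval.

// Hypothetical sketch of option 4: the poller snapshots each cumulative
// histogram on every scrape and records the delta from its previous
// snapshot, so each tsdb point covers exactly one poll interval.
type snapshot struct {
    buckets []uint64 // cumulative per-bucket counts
    count   uint64
    sum     float64
}

type poller struct {
    prev map[string]snapshot // last snapshot, keyed by histogram name
}

func (p *poller) delta(name string, cur snapshot) snapshot {
    last := p.prev[name] // zero-valued on the first poll
    d := snapshot{buckets: make([]uint64, len(cur.buckets))}
    for i := range cur.buckets {
        d.buckets[i] = cur.buckets[i]
        if i < len(last.buckets) {
            d.buckets[i] -= last.buckets[i] // cumulative counts only grow
        }
    }
    d.count = cur.count - last.count
    d.sum = cur.sum - last.sum
    p.prev[name] = cur
    return d // the point recorded at time T is computed from this delta
}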


The problem seems to be that the replica stats are stored every 10s and we scrape for tsdb at a similar cadence (I think it is also every 10s).

I'm confused, because the window size seems to be 60s (from here). Which, if true, would make no sense to me... Is it true?

@aadityasondhi
Collaborator

I see your point of view. Thanks for the detailed explanation. I don't think it will be double counting, since we use these to store averages and quantiles, which are a function of the number of samples rather than of elapsed time. But I do agree that having an ambiguous time window for what the data represents is not the best way to do it.

I like the idea of moving the window into the poller. Just want to clarify that this would mean that we remove the windowing config option entirely and always have a 10s windowing (or whatever we decide we want our polling interval to be).

I'm confused, because the window size seems to be 60s (from here). Which, if true, would make no sense to me... Is it true?

The window size is indeed 60s by default. The problem here is that when we poll for the first time after a tick, we are polling an empty histogram since the histogram in question only reports values every 10s.

@andreimatei
Contributor

I don't think it will be double counting since we use these to store averages and quantiles which are a factor of the number of samples and not time that has passed.

By "double counting", I mean some measurements will affect multiple tsdb records, and others will affect only one (e.g. (c,d) if we continue the above example).

I like the idea of moving the window into the poller. Just want to clarify that this would mean that we remove the windowing config option entirely and always have a 10s windowing (or whatever we decide we want our polling interval to be).

I had no idea about the 60s window until today. So every 10s we have a point that represents the p99 over the past 60s? Or actually, even worse - the first point is p99[10s], the next is p99[20s],...p99[60s], p99[10s],...
It strikes me as bizarre; I think people expect every point to be the p99 over the past 10s. Otherwise, it gets really confusing when aggregating these percentiles over longer durations.
So, yeah, I'd get rid of it. Perhaps you could dig into the history a little bit and see if there was some reason for doing it this way.
If we really want the quantiles to be computed over a different duration, the poller can maintain multiple snapshots per histogram.

The window size is indeed 60s by default. The problem here is that when we poll for the first time after a tick, we are polling an empty histogram since the histogram in question only reports values every 10s.

Ack. I had misread the issue as complaining about gaps every 10s.

@aadityasondhi
Collaborator

aadityasondhi commented Mar 14, 2023

I had no idea about the 60s window until today. So every 10s we have a point that represents the p99 over the past 60s? Or actually, even worse - the first point is p99[10s], the next is p99[20s],...p99[60s], p99[10s],...

Yeah, I think we are on the same page now.

Perhaps you could dig into the history a little bit and see if there was some reason for doing it this way.

@tbg might have a better idea why this is the case. I believe the reasoning behind this is the same as the reason we have issues with this histogram: by making window sizes configurable (and separate from polling), we let users (for example, the KV team) decide what they want their histogram to represent. A p99 over a 10s window might be odd for histograms such as the one in question, where we take a measurement every 10s: it will on average have 1 sample per tsdb record, and depending on the timing of the read it might even have 0, which can paint the wrong picture for the reader of the metric. The windows give us the ability to provide moving averages/percentiles over a larger window so that we don't have sudden drops like the ones seen here.

I think the implementation had limitations (as noted in the comments), but the underlying idea seems reasonable: the user can define the window over which their histogram reports moving averages/percentiles. Maybe we should maintain a queue of 10s chunks and always serve the last window's worth of data (i.e. the last 60s), as sketched below. This would keep windows configurable while also ensuring consistent windowing behavior.
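
For illustration, the chunk queue could look roughly like this (a hypothetical sketch, assuming a fixed tick interval and per-bucket counts; not a proposed implementation):

// Hypothetical sketch of the chunk queue: rotate a fresh chunk every
// tick (e.g. 10s) and answer reads with the merge of all retained
// chunks, so every read covers the same trailing window (e.g. 60s).
type chunkedHistogram struct {
    chunks [][]uint64 // ring buffer: one bucket-count slice per tick
    head   int        // index of the chunk currently being written
}

func newChunkedHistogram(numChunks, numBuckets int) *chunkedHistogram {
    h := &chunkedHistogram{chunks: make([][]uint64, numChunks)}
    for i := range h.chunks {
        h.chunks[i] = make([]uint64, numBuckets)
    }
    return h
}

func (h *chunkedHistogram) tick() {
    h.head = (h.head + 1) % len(h.chunks)
    for i := range h.chunks[h.head] {
        h.chunks[h.head][i] = 0 // evict the oldest chunk, start fresh
    }
}

func (h *chunkedHistogram) record(bucket int) { h.chunks[h.head][bucket]++ }

func (h *chunkedHistogram) read() []uint64 {
    merged := make([]uint64, len(h.chunks[0]))
    for _, c := range h.chunks { // sum the trailing window's chunks
        for i, v := range c {
            merged[i] += v
        }
    }
    return merged
}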

@tbg
Member

tbg commented Mar 14, 2023

I don't think there was ever a desire to vary the window size by histogram. Rather, it's all a pile-up of unhappy historical accidents. I agree with Andrei's suggestion to more tightly couple the TSDB loop with the windowed histogram. I think so far we have tried to avoid this, but it's leading to the windowed histograms representing fairly unclear actual time windows. The only thing that really makes sense is that the datapoint at time T uses a windowed histogram that contains measurements that happened in the interval between the last scrape (so ~T-10s) and T. We were likely worried that there would be multiple scrapers sometimes (which, I think, can happen if you also scrape prometheus) so we should "simply" make sure that there is only one kind of scrape that actually affects the windowed histogram, and that this one scrape really only happens every 10s.

@aadityasondhi
Collaborator

That makes sense. Thanks for the clarification.

If we make this change, what happens to the histogram mentioned in this issue? Should we discourage people from using histograms in cases where they only record data every 10s or less often? Using your example, a histogram with a low sample cadence might have 0 data points within the [T-10s, T] window. This would show up as 0 on the graphs and paint the wrong picture of the underlying metric.

@kvoli
Collaborator Author

kvoli commented Mar 14, 2023

From reading the thread, I'm thinking it would be easiest (for the use case at the top) to implement some histogram helpers for periodic recording, where all values we wish to record for a time window are known upfront. An example I found is in the client_go pkg.

To shed some more light on the use case:

I want to view, for each 10s interval, what the 99th, 95th, 50th, etc. percentiles of replica load are, in order to identify outliers and compare against other intervals and stores.

To implement this, I piggybacked on existing logic where, every 10s, the store iterates over all its replicas and grabs each replica's load stats. With the load stats, I record the replica's value to the histogram.

@aadityasondhi
Collaborator

Based on an in-person discussion with Andrei, I propose the following:

  1. Lift the windowing logic into the poller. This makes sure that each value that we store into tsdb is of the window [T-t, T], where t is the last poll time and T is the current time. This will work for histograms that are of cumulative nature.
  2. For histograms such as the one above, I would suggest using ManualWindowHistogram. I recognize that the current interface is cumbersome, and I will rework it to make it easier to use. When you create this histogram, you will be required to provide a promised update interval. The poller will report the last known value of these histograms as long as T-t < update interval. To avoid race conditions such as the one in this issue, we will allow for 1 missed update but start reporting 0 if the poller misses twice in a row (see the sketch below).

I will prioritize 2 over 1 to unblock the PR. Let me know if this sounds good as a resolution.
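
A sketch of the staleness rule in 2), with hypothetical fields (not the final interface): the poller re-reports the last known value while it is less than two promised update intervals old, and reports zero after that.

import "time"

// Hypothetical sketch of the staleness rule: tolerate one missed
// update, then report zeros so a stalled recorder doesn't look healthy.
type manualWindow struct {
    updateInterval time.Duration // promised at construction time
    lastUpdate     time.Time     // when the histogram was last updated
    lastBuckets    []uint64      // bucket counts from the most recent update
}

func (m *manualWindow) read(now time.Time) []uint64 {
    if now.Sub(m.lastUpdate) < 2*m.updateInterval {
        return m.lastBuckets // at most one update missed: re-report the last value
    }
    return make([]uint64, len(m.lastBuckets)) // two misses: report zeros
}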

cc @andreimatei @kvoli

@kvoli
Collaborator Author

kvoli commented Mar 14, 2023

For histograms such as the one above, I would suggest using ManualWindowHistogram. I recognize that the current interface is cumbersome, and I will rework it to make it easier to use. When you create this histogram, you will be required to provide a promised update interval. The poller will report the last known value of these histograms as long as T-t < update interval. To avoid race conditions such as the one in this issue, we will allow for 1 missed update but start reporting 0 if the poller misses twice in a row.

That seems perfect for my use case. Would the interface for recording be Observe(float64), called for every value?

@aadityasondhi
Collaborator

Would the interface for recording be Observe(float64), called for every value?

I think the interface will require you to maintain a local histogram and call Update(Histogram) whenever you want to insert the new values and overwrite the previous histogram. This will mostly mimic what the batch histogram in Prometheus does: https://github.com/prometheus/client_golang/blob/3d2cf0b338e19b0eaf277496058fc77bd3add440/prometheus/go_collector_latest.go#L454-L456.
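
Concretely, usage might look something like the sketch below (the Update interface and metric name are stand-ins mirroring the proposal above, not a finalized API):

import "github.com/prometheus/client_golang/prometheus"

// Hypothetical usage sketch: build a local histogram for the interval,
// then hand it to the manual histogram wholesale, overwriting the
// previous window (mirroring Prometheus's batch-histogram pattern).
func reportInterval(manualHist interface{ Update(prometheus.Histogram) }, samples []float64) {
    local := prometheus.NewHistogram(prometheus.HistogramOpts{
        Name:    "replica_load", // illustrative name
        Buckets: prometheus.ExponentialBuckets(1e3, 2, 20),
    })
    for _, v := range samples { // all values for the window, known upfront
        local.Observe(v)
    }
    manualHist.Update(local) // replace the previous window's contents
}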

kvoli added a commit to kvoli/cockroach that referenced this issue Mar 24, 2023
This commit extends the ManualWindowHistogram to support RecordValue and
Rotate. Previously, it was necessary to maintain duplicate cumulative
histograms in order to batch update the manual histogram. This update
adds a quality of life feature, enabling recording to the
ManualWindowHistogram, then once finished, rotating the batch of
recorded values into the current window for the internal tsdb to query.

Touches: cockroachdb#98266

Release note: None
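
In practice, the flow this commit enables is roughly the following (a sketch; the interface below is a stand-in for the metric package's actual API):

// Sketch of the RecordValue/Rotate flow described by this commit:
// record a batch of values directly, then rotate the batch into the
// window that the internal tsdb reads.
type batchRecorder interface {
    RecordValue(v int64)
    Rotate()
}

func flush(hist batchRecorder, loadValues []int64) {
    for _, v := range loadValues { // e.g. one load value per replica
        hist.RecordValue(v)
    }
    hist.Rotate() // expose the recorded batch to the next tsdb read
}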
aadityasondhi added a commit to aadityasondhi/cockroach that referenced this issue May 23, 2023
Prior to this change, the metric.Histogram maintained two separate
histograms under the hood: cumulative and windowed. The reason for the
windowed histogram was to be able to export the most recent recordings
to TSDB. However, the windowed histogram's ticks were not synchronized
with the polling of metrics done by TSDB. This caused problems where the
TSDB polling would read an empty window and report gaps in metrics as
seen in: cockroachdb#98266.

This patch separates the windowed logic from pkg/metric and makes TSDB
polling keep track of deltas using a singular cumulative histogram.

Fixes cockroachdb#98266
Fixes cockroachdb#98621

Release note: None
ericharmeling added a commit to ericharmeling/cockroach that referenced this issue Jun 1, 2023
Fixes cockroachdb#103814.
Fixes cockroachdb#98266.

This commit updates the windowed histogram merging approach
to add the previous window's histogram bucket counts and sample
count to those of the current one. As a result of this change,
the histograms will no longer report under-sampled quantile values,
and timeseries metrics-derived charts (e.g., the quantile-based
SQL service latency charts on the DB console's Metrics page) will
more accurately display metrics.

Release note (bug fix): Updated the histogram window merge calculation
to more accurately interpolate quantile values. This change will result
in smoother, more accurate Metrics charts on the DB Console.

Co-authored-by: Aaditya Sondhi <[email protected]>
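
The merge this fix describes is conceptually a per-bucket sum, roughly as below (illustrative only, assuming both windows share the same bucket boundaries; not the exact CockroachDB code):

// Illustrative sketch of the fix: compute quantiles over the sum of the
// previous and current windows' bucket and sample counts, so a freshly
// rotated (near-empty) current window no longer under-samples quantiles.
func mergeWindows(cur, prev []uint64, curCount, prevCount uint64) ([]uint64, uint64) {
    merged := make([]uint64, len(cur)) // assumes identical bucket boundaries
    for i := range cur {
        merged[i] = cur[i] + prev[i]
    }
    return merged, curCount + prevCount
}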
craig bot pushed a commit that referenced this issue Jun 13, 2023
103729: log: add headers, compression config to http servers r=healthy-pod a=dhartunian

This commit adds support for custom headers and gzip compression on requests made to `http-server` outputs in CRDB's logging configuration. Custom headers enable the inclusion of API keys for 3rd party sinks and gzip compression reduces network resource consumption.

Resolves #103477

Release note (ops change): `http-defaults` and `http-servers` sections of the log config will now accept a `headers` field containing a map of key-value string pairs, which will comprise custom HTTP headers appended to every request. Additionally, a `compression` value is supported, which can be set to `gzip` or `none` to select a compression method for the HTTP request body. By default `gzip` is selected. This is a change from previous functionality, which did not compress by default.

103963: kvclient: add x-region, x-zone metrics to DistSender r=healthy-pod a=wenyihu6

Previously, there were no metrics to observe cross-region, cross-zone traffic in
batch requests / responses at DistSender.

To address this, this commit introduces six new DistSender metrics:
```
"distsender.batch_requests.replica_addressed.bytes"
"distsender.batch_responses.replica_addressed.bytes"
"distsender.batch_requests.cross_region.bytes"
"distsender.batch_responses.cross_region.bytes"
"distsender.batch_requests.cross_zone.bytes"
"distsender.batch_responses.cross_zone.bytes"
```

The first two metrics track the total byte count of batch requests processed and
batch responses received at DistSender. Additionally, there are four metrics to
track the aggregate counts processed and received across different regions and
zones. Note that these metrics only track the sender node and not the receiver
node as DistSender resides on the gateway node receiving SQL queries.

Part of: #103983

Release note (ops change): Six new metrics -
"distsender.batch_requests.replica_addressed.bytes",
"distsender.batch_responses.replica_addressed.bytes",
"distsender.batch_requests.cross_region.bytes",
"distsender.batch_responses.cross_region.bytes",
"distsender.batch_requests.cross_zone.bytes",
"distsender.batch_responses.cross_zone.bytes"- are now added to DistSender 
metrics.

For accurate metrics, follow these assumptions:
- Configure region and zone tier keys consistently across nodes.
- Within a node locality, ensure unique region and zone tier keys.
- Maintain consistent configuration of region and zone tiers across nodes.

104088: metrics: fix windowed histogram merging approach r=ericharmeling a=ericharmeling

Fixes #103814.
Fixes #98266.

This commit updates the windowed histogram merging approach to add the previous window's histogram bucket counts and sample count to those of the current one. As a result of this change, the histograms will no longer report under-sampled quantile values, and timeseries metrics-derived charts (e.g., the quantile-based SQL service latency charts on the DB console's Metrics page) will more accurately display metrics.

Release note (bug fix): Updated the histogram window merge calculation to more accurately interpolate quantile values. This change will result in smoother, more accurate Metrics charts on the DB Console.

104376: sql: CREATEROLE now includes ability to grant non-admin roles r=rafiss a=rafiss

fixes #104371

This matches the PostgreSQL behavior.

Release note (security update): Users who have the CREATEROLE role
option can now grant and revoke role membership in any non-admin role.

This change also removes the sql.auth.createrole_allows_grant_role_membership.enabled
cluster setting, which was added in v23.1. Now, the cluster setting is
effectively always true.

104802: kvserver: prevent split at invalid tenant prefix keys r=arulajmani a=tbg

Closes #104796.

Epic: None
Release note (bug fix): prevents invalid splits that can crash (and prevent
restarts) of nodes that hold a replica for the right-hand side.


Co-authored-by: David Hartunian <[email protected]>
Co-authored-by: Wenyi <[email protected]>
Co-authored-by: Eric Harmeling <[email protected]>
Co-authored-by: Rafi Shamim <[email protected]>
Co-authored-by: Tobias Grieger <[email protected]>
@craig craig bot closed this as completed in bae5045 Jun 13, 2023