metrics: -count metrics in internal tsdb are broken/not useful #98745

tbg · 2023-03-16T08:58:55Z

Look at these:

The average and count are taken from the windowed histogram. This is fine for the average, though it's not clear how there would be a zero average ever - we heartbeat every <5s so there should be a nonzero average at all times (we collect at 10s interval and rotate at an even slower interval). The count just doesn't help anyone - it's the windowed count (and you can see it drop to zero periodically, which explains the zero average and this is unexpected too).

In the internal tsdb, we only export a number of quantiles and these "windowed" avg/counts, so we don't have an ability to plot a rate of events affecting a histogram. This has been annoying in a few escalations over the years.

We should export the cumulative count instead of the windowed count. The windowed count is so useless that I doubt that anyone relies on it and we could just change the semantics (this doesn't affect prometheus endpoint).

The windowed average seems to be at least somewhat weird. When I plot -avg for a metric that should see "more regular" data recorded to it the jitter goes away some, but it's still noticeable:

so there's probably a data quality issue here. I suspect whatever @aadityasondhi will end up doing for #98266 will fix this problem here too.

Jira issue: CRDB-25494

Previously, we were writing the windowed count of a histogram into tsdb. This meant that on each tick, the count reset. This is useful for calculating averages and quantiles using windowed histograms, but makes little sense to record for `count`. This patch now uses the cumulative count in tsdb. Informs cockroachdb#98745 Release note (bug fix): Timeseries metric counts will now show cumulative counts for a histogram rather than the windowed count.

101909: metric: use cumulative count instead of windowed count in tsdb r=aadityasondhi a=aadityasondhi Previously, we were writing the windowed count of a histogram into tsdb. This meant that on each tick, the count reset. This is useful for calculating averages and quantiles using windowed histograms, but makes little sense to record for `count`. This patch now uses the cumulative count in tsdb. This patch also adds a `-sum` field to maintain a record of the cumulative sum along with the cumulative counts. Fixes #98745 Release note (bug fix): Timeseries metric counts will now show cumulative counts for a histogram rather than the windowed count. A `-sum` timeseries is also exported to keep track of the cumulative sum of all samples in the histogram. 103330: roachtest: fix cdc test timestamp logging r=aliher1911 a=aliher1911 This commit fixes format args of logging message. Current logging: ``` 8.224822939s was spent validating this resolved timestamp: %!s(MISSING) ``` Fixed logging ``` 207.587058ms was spent validating this resolved timestamp: 1684170215.036471445,0 ``` Epic: none Release note: None 103342: kvserver: deflake `TestRestoreReplicas` r=erikgrinaker a=erikgrinaker We recently began eagerly initializing replicas on startup when using expiration-based leases. This can cause elections on startup such that the Raft leadership moves around for a bit before settling down at the leaseholder. This test expected the first store to be able to acquire Raft leadership and the lease, but that wouldn't hold if the second replica had already acquired leadership. This patch changes the test to keep looking until a leaseholder is established. Resolves #103251. Epic: none Release note: None 103346: roachtest: disable more expiration lease metamorphism r=erikgrinaker a=erikgrinaker Resolves #103347. Resolves #103340. Touches #98799. Touches #99087. Epic: none Release note: None Co-authored-by: Aaditya Sondhi <[email protected]> Co-authored-by: Oleg Afanasyev <[email protected]> Co-authored-by: Erik Grinaker <[email protected]>

Previously, we were writing the windowed count of a histogram into tsdb. This meant that on each tick, the count reset. This is useful for calculating averages and quantiles using windowed histograms, but makes little sense to record for `count`. This patch now uses the cumulative count in tsdb. This patch also adds a `-sum` field to maintain a record of the cumulative sum along with the cumulative counts. Fixes cockroachdb#98745 Release note (bug fix): Timeseries metric counts will now show cumulative counts for a histogram rather than the windowed count. A `-sum` timeseries is also exported to keep track of the cumulative sum of all samples in the histogram.

tbg added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. T-observability-inf labels Mar 16, 2023

blathers-crl bot added the A-observability-inf label Mar 16, 2023

aadityasondhi self-assigned this Mar 16, 2023

aadityasondhi mentioned this issue Apr 20, 2023

metric: use cumulative count instead of windowed count in tsdb #101909

Merged

craig bot closed this as completed in b5bac44 May 15, 2023

aadityasondhi mentioned this issue May 16, 2023

release-23.1: metric: use cumulative count instead of windowed count in tsdb #103444

Merged

aadityasondhi mentioned this issue May 16, 2023

release-22.2: metric: use cumulative count instead of windowed count in tsdb #103447

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

metrics: -count metrics in internal tsdb are broken/not useful #98745

metrics: -count metrics in internal tsdb are broken/not useful #98745

tbg commented Mar 16, 2023 •

edited by cockroach-jira-scripts

Loading

metrics: -count metrics in internal tsdb are broken/not useful #98745

metrics: -count metrics in internal tsdb are broken/not useful #98745

Comments

tbg commented Mar 16, 2023 • edited by cockroach-jira-scripts Loading

tbg commented Mar 16, 2023 •

edited by cockroach-jira-scripts

Loading