Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

metrics: -count metrics in internal tsdb are broken/not useful #98745

Closed
tbg opened this issue Mar 16, 2023 · 0 comments · Fixed by #101909
Closed

metrics: -count metrics in internal tsdb are broken/not useful #98745

tbg opened this issue Mar 16, 2023 · 0 comments · Fixed by #101909
Assignees
Labels
A-observability-inf C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.

Comments

@tbg
Copy link
Member

tbg commented Mar 16, 2023

Look at these:

image

The average and count are taken from the windowed histogram. This is fine for the average, though it's not clear how there would be a zero average ever - we heartbeat every <5s so there should be a nonzero average at all times (we collect at 10s interval and rotate at an even slower interval). The count just doesn't help anyone - it's the windowed count (and you can see it drop to zero periodically, which explains the zero average and this is unexpected too).

In the internal tsdb, we only export a number of quantiles and these "windowed" avg/counts, so we don't have an ability to plot a rate of events affecting a histogram. This has been annoying in a few escalations over the years.

We should export the cumulative count instead of the windowed count. The windowed count is so useless that I doubt that anyone relies on it and we could just change the semantics (this doesn't affect prometheus endpoint).

The windowed average seems to be at least somewhat weird. When I plot -avg for a metric that should see "more regular" data recorded to it the jitter goes away some, but it's still noticeable:

image

so there's probably a data quality issue here. I suspect whatever @aadityasondhi will end up doing for #98266 will fix this problem here too.

Jira issue: CRDB-25494

@tbg tbg added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. T-observability-inf labels Mar 16, 2023
@aadityasondhi aadityasondhi self-assigned this Mar 16, 2023
aadityasondhi added a commit to aadityasondhi/cockroach that referenced this issue Apr 20, 2023
Previously, we were writing the windowed count of a histogram into tsdb.
This meant that on each tick, the count reset. This is useful for
calculating averages and quantiles using windowed histograms, but makes
little sense to record for `count`.

This patch now uses the cumulative count in tsdb.

Informs cockroachdb#98745

Release note (bug fix): Timeseries metric counts will now show
cumulative counts for a histogram rather than the windowed count.
aadityasondhi added a commit to aadityasondhi/cockroach that referenced this issue Apr 27, 2023
Previously, we were writing the windowed count of a histogram into tsdb.
This meant that on each tick, the count reset. This is useful for
calculating averages and quantiles using windowed histograms, but makes
little sense to record for `count`.

This patch now uses the cumulative count in tsdb.

Informs cockroachdb#98745

Release note (bug fix): Timeseries metric counts will now show
cumulative counts for a histogram rather than the windowed count.
craig bot pushed a commit that referenced this issue May 15, 2023
101909: metric: use cumulative count instead of windowed count in tsdb r=aadityasondhi a=aadityasondhi

Previously, we were writing the windowed count of a histogram into tsdb.
This meant that on each tick, the count reset. This is useful for
calculating averages and quantiles using windowed histograms, but makes
little sense to record for `count`.

This patch now uses the cumulative count in tsdb.

This patch also adds a `-sum` field to maintain a record of the
cumulative sum along with the cumulative counts.

Fixes #98745

Release note (bug fix): Timeseries metric counts will now show
cumulative counts for a histogram rather than the windowed count. A
`-sum` timeseries is also exported to keep track of the cumulative sum
of all samples in the histogram.

103330: roachtest: fix cdc test timestamp logging r=aliher1911 a=aliher1911

This commit fixes format args of logging message.

Current logging:
```
8.224822939s was spent validating this resolved timestamp: %!s(MISSING)
```

Fixed logging
```
207.587058ms was spent validating this resolved timestamp: 1684170215.036471445,0
```
Epic: none

Release note: None

103342: kvserver: deflake `TestRestoreReplicas` r=erikgrinaker a=erikgrinaker

We recently began eagerly initializing replicas on startup when using expiration-based leases. This can cause elections on startup such that the Raft leadership moves around for a bit before settling down at the leaseholder. This test expected the first store to be able to acquire Raft leadership and the lease, but that wouldn't hold if the second replica had already acquired leadership.

This patch changes the test to keep looking until a leaseholder is established.

Resolves #103251.
Epic: none

Release note: None

103346: roachtest: disable more expiration lease metamorphism r=erikgrinaker a=erikgrinaker

Resolves #103347.
Resolves #103340.
Touches #98799.
Touches #99087.

Epic: none
Release note: None

Co-authored-by: Aaditya Sondhi <[email protected]>
Co-authored-by: Oleg Afanasyev <[email protected]>
Co-authored-by: Erik Grinaker <[email protected]>
@craig craig bot closed this as completed in b5bac44 May 15, 2023
aadityasondhi added a commit to aadityasondhi/cockroach that referenced this issue May 16, 2023
Previously, we were writing the windowed count of a histogram into tsdb.
This meant that on each tick, the count reset. This is useful for
calculating averages and quantiles using windowed histograms, but makes
little sense to record for `count`.

This patch now uses the cumulative count in tsdb.

This patch also adds a `-sum` field to maintain a record of the
cumulative sum along with the cumulative counts.

Fixes cockroachdb#98745

Release note (bug fix): Timeseries metric counts will now show
cumulative counts for a histogram rather than the windowed count. A
`-sum` timeseries is also exported to keep track of the cumulative sum
of all samples in the histogram.
aadityasondhi added a commit to aadityasondhi/cockroach that referenced this issue May 16, 2023
Previously, we were writing the windowed count of a histogram into tsdb.
This meant that on each tick, the count reset. This is useful for
calculating averages and quantiles using windowed histograms, but makes
little sense to record for `count`.

This patch now uses the cumulative count in tsdb.

This patch also adds a `-sum` field to maintain a record of the
cumulative sum along with the cumulative counts.

Fixes cockroachdb#98745

Release note (bug fix): Timeseries metric counts will now show
cumulative counts for a histogram rather than the windowed count. A
`-sum` timeseries is also exported to keep track of the cumulative sum
of all samples in the histogram.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-observability-inf C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants