-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
metrics: -count metrics in internal tsdb are broken/not useful #98745
Labels
A-observability-inf
C-bug
Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
Comments
tbg
added
C-bug
Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
T-observability-inf
labels
Mar 16, 2023
aadityasondhi
added a commit
to aadityasondhi/cockroach
that referenced
this issue
Apr 20, 2023
Previously, we were writing the windowed count of a histogram into tsdb. This meant that on each tick, the count reset. This is useful for calculating averages and quantiles using windowed histograms, but makes little sense to record for `count`. This patch now uses the cumulative count in tsdb. Informs cockroachdb#98745 Release note (bug fix): Timeseries metric counts will now show cumulative counts for a histogram rather than the windowed count.
aadityasondhi
added a commit
to aadityasondhi/cockroach
that referenced
this issue
Apr 27, 2023
Previously, we were writing the windowed count of a histogram into tsdb. This meant that on each tick, the count reset. This is useful for calculating averages and quantiles using windowed histograms, but makes little sense to record for `count`. This patch now uses the cumulative count in tsdb. Informs cockroachdb#98745 Release note (bug fix): Timeseries metric counts will now show cumulative counts for a histogram rather than the windowed count.
craig bot
pushed a commit
that referenced
this issue
May 15, 2023
101909: metric: use cumulative count instead of windowed count in tsdb r=aadityasondhi a=aadityasondhi Previously, we were writing the windowed count of a histogram into tsdb. This meant that on each tick, the count reset. This is useful for calculating averages and quantiles using windowed histograms, but makes little sense to record for `count`. This patch now uses the cumulative count in tsdb. This patch also adds a `-sum` field to maintain a record of the cumulative sum along with the cumulative counts. Fixes #98745 Release note (bug fix): Timeseries metric counts will now show cumulative counts for a histogram rather than the windowed count. A `-sum` timeseries is also exported to keep track of the cumulative sum of all samples in the histogram. 103330: roachtest: fix cdc test timestamp logging r=aliher1911 a=aliher1911 This commit fixes format args of logging message. Current logging: ``` 8.224822939s was spent validating this resolved timestamp: %!s(MISSING) ``` Fixed logging ``` 207.587058ms was spent validating this resolved timestamp: 1684170215.036471445,0 ``` Epic: none Release note: None 103342: kvserver: deflake `TestRestoreReplicas` r=erikgrinaker a=erikgrinaker We recently began eagerly initializing replicas on startup when using expiration-based leases. This can cause elections on startup such that the Raft leadership moves around for a bit before settling down at the leaseholder. This test expected the first store to be able to acquire Raft leadership and the lease, but that wouldn't hold if the second replica had already acquired leadership. This patch changes the test to keep looking until a leaseholder is established. Resolves #103251. Epic: none Release note: None 103346: roachtest: disable more expiration lease metamorphism r=erikgrinaker a=erikgrinaker Resolves #103347. Resolves #103340. Touches #98799. Touches #99087. Epic: none Release note: None Co-authored-by: Aaditya Sondhi <[email protected]> Co-authored-by: Oleg Afanasyev <[email protected]> Co-authored-by: Erik Grinaker <[email protected]>
aadityasondhi
added a commit
to aadityasondhi/cockroach
that referenced
this issue
May 16, 2023
Previously, we were writing the windowed count of a histogram into tsdb. This meant that on each tick, the count reset. This is useful for calculating averages and quantiles using windowed histograms, but makes little sense to record for `count`. This patch now uses the cumulative count in tsdb. This patch also adds a `-sum` field to maintain a record of the cumulative sum along with the cumulative counts. Fixes cockroachdb#98745 Release note (bug fix): Timeseries metric counts will now show cumulative counts for a histogram rather than the windowed count. A `-sum` timeseries is also exported to keep track of the cumulative sum of all samples in the histogram.
aadityasondhi
added a commit
to aadityasondhi/cockroach
that referenced
this issue
May 16, 2023
Previously, we were writing the windowed count of a histogram into tsdb. This meant that on each tick, the count reset. This is useful for calculating averages and quantiles using windowed histograms, but makes little sense to record for `count`. This patch now uses the cumulative count in tsdb. This patch also adds a `-sum` field to maintain a record of the cumulative sum along with the cumulative counts. Fixes cockroachdb#98745 Release note (bug fix): Timeseries metric counts will now show cumulative counts for a histogram rather than the windowed count. A `-sum` timeseries is also exported to keep track of the cumulative sum of all samples in the histogram.
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Labels
A-observability-inf
C-bug
Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
Look at these:
The average and count are taken from the windowed histogram. This is fine for the average, though it's not clear how there would be a zero average ever - we heartbeat every <5s so there should be a nonzero average at all times (we collect at 10s interval and rotate at an even slower interval). The count just doesn't help anyone - it's the windowed count (and you can see it drop to zero periodically, which explains the zero average and this is unexpected too).
In the internal tsdb, we only export a number of quantiles and these "windowed" avg/counts, so we don't have an ability to plot a rate of events affecting a histogram. This has been annoying in a few escalations over the years.
We should export the cumulative count instead of the windowed count. The windowed count is so useless that I doubt that anyone relies on it and we could just change the semantics (this doesn't affect prometheus endpoint).
The windowed average seems to be at least somewhat weird. When I plot -avg for a metric that should see "more regular" data recorded to it the jitter goes away some, but it's still noticeable:
so there's probably a data quality issue here. I suspect whatever @aadityasondhi will end up doing for #98266 will fix this problem here too.
Jira issue: CRDB-25494
The text was updated successfully, but these errors were encountered: