-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
metrics: fix windowed histogram merging approach #104088
metrics: fix windowed histogram merging approach #104088
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for doing this! The core logic looks good, just need to extend it to all implementations of the windowed histogram implementation.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ericharmeling)
-- commits
line 15 at r1:
Please add the issues this fixes:
- metrics: histogram calculation results in reporting service latency spikes #103814
- metric: periodic gaps in histogram metrics #98266
pkg/util/metric/metric.go
line 388 at r1 (raw file):
// Merging the sample count is necessary to correctly calculate the value at // a given quantile with the merged bucket counts. func (h *Histogram) ToPrometheusMetricWindowedMerged() *prometheusgo.Metric {
This should be implemented for all implementations of windowed histograms. Consider making it part of the WindowedHistogram Interface.
pkg/util/metric/metric.go
line 453 at r1 (raw file):
// 2. Since the prometheus client library ensures buckets are in a strictly // increasing order at creation, we do not sort them. func (h *Histogram) ValueAtQuantileWindowed(q float64) float64 {
Similarly, every implementation of this function should be updated for every implementation of windowed histogram.
f7b0f47
to
eb2872c
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TFTR!
I've updated the PR following our discussion offline. I'm a little unsure about the ManualWindowHistogram
merging implementation. PTAL.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @aadityasondhi)
Previously, aadityasondhi (Aaditya Sondhi) wrote…
Please add the issues this fixes:
Done.
pkg/util/metric/metric.go
line 388 at r1 (raw file):
Previously, aadityasondhi (Aaditya Sondhi) wrote…
This should be implemented for all implementations of windowed histograms. Consider making it part of the WindowedHistogram Interface.
After our discussion offline, I believe that we decided to not add ToPrometheusMetricWindowed
/ToPrometheusMetricWindowedMerged
to the WindowedHistogram
interface, and instead rely on ValueAtQuantileWindowed
implementations to merge prev and cur windows. I've added a note in the interface definition.
pkg/util/metric/metric.go
line 453 at r1 (raw file):
Previously, aadityasondhi (Aaditya Sondhi) wrote…
Similarly, every implementation of this function should be updated for every implementation of windowed histogram.
Done.
pkg/util/metric/metric.go
line 609 at r3 (raw file):
func (mwh *ManualWindowHistogram) ToPrometheusMetricWindowed() *prometheusgo. Metric { cur := &prometheusgo.Metric{}
@aadityasondhi
Does this look right to you?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall looks great! Just a few minor changes are needed here.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ericharmeling)
pkg/util/metric/metric.go
line 609 at r3 (raw file):
Previously, ericharmeling (Eric Harmeling) wrote…
@aadityasondhi
Does this look right to you?
Yes, looks good!
pkg/util/metric/metric.go
line 274 at r4 (raw file):
h.windowed.tickHelper = &tickHelper{ nextT: now(), tickInterval: windowDuration,
You want to divide the windowDuration
by 2 since we have 2 windows here. This is similar to what happens in newHdrHistogram
with the histWrapNum
.
pkg/util/metric/metric_test.go
line 337 at r4 (raw file):
} // Tick. This rotates the histogram. setNow(time.Duration(i+1) * 10 * time.Second)
I don't think you want 10 seconds here if you have set Duration
to time.minute
. You would want to set time to the next tickInterval + 1
.
pkg/util/metric/metric_test.go
line 341 at r4 (raw file):
} act := *h.ToPrometheusMetric().Histogram
Don't you want to call the ToPrometheusMetricWindowed()
function here? Right now, this test actually tests the cumulative histogram working correctly, which is not the point. If you call ToPrometheusMetricWindowed()
, I think you will get a test that will actually just show the prev+cur histogram merged view rather than the view of all 4 histograms you have created here. The oldest two histogram views would be removed from the windowed histogram (as designed).
eb2872c
to
16408b7
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TFTR! I've made some changes following your feedback, with some commentary. I might want to pair offline once more to clarify some things.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @aadityasondhi)
pkg/util/metric/metric.go
line 274 at r4 (raw file):
Previously, aadityasondhi (Aaditya Sondhi) wrote…
You want to divide the
windowDuration
by 2 since we have 2 windows here. This is similar to what happens innewHdrHistogram
with thehistWrapNum
.
I don't fully understand why this is the case, but done. Perhaps we chat offline a bit?
pkg/util/metric/metric_test.go
line 337 at r4 (raw file):
I don't think you want 10 seconds here if you have set
Duration
totime.minute
.
I meant to set the Duration
to 10 seconds. Fixed that. (Although I guess it's arbitrary, since we are manually setting the now
time)
You would want to set time to the next
tickInterval + 1
.
I think I want to set this to time.Duration(i+1) * tickInterval
, right? If we are incrementing the nextTick
value by the tickInterval
at each tick call, we should be setting the now
time to exactly when the maybeTick
evaluates nextTick().Before(now())
.
pkg/util/metric/metric_test.go
line 341 at r4 (raw file):
Previously, aadityasondhi (Aaditya Sondhi) wrote…
Don't you want to call the
ToPrometheusMetricWindowed()
function here? Right now, this test actually tests the cumulative histogram working correctly, which is not the point. If you callToPrometheusMetricWindowed()
, I think you will get a test that will actually just show the prev+cur histogram merged view rather than the view of all 4 histograms you have created here. The oldest two histogram views would be removed from the windowed histogram (as designed).
Oh you're totally right. We should be looking at the windows and not the cumulative histogram.
But I don't think we need to call ToPrometheusMetricWindowed
here. The ToPrometheusMetric
erroneously used here is an implementation of the PrometheusExportable
interface, which I don't think we want to include ToPrometheusMetricWindowed
, as this interface is implemented by metrics other than histograms (right?).
ValueAtQuantileWindowed
calls ToPrometheusMetricWindowed()
when we start asserting values. I think the trick here is checking the "windowed" functions before and after observations within each window interval. This way we can make sure that the merge is happening and the quantile values are capturing the previous window early on in the current window's interval (i.e., before any observations) and later on (i.e., after all observations).
Does that make sense?
16408b7
to
13beab0
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for all the work, this looks good to me!
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @ericharmeling)
pkg/util/metric/metric.go
line 274 at r4 (raw file):
Previously, ericharmeling (Eric Harmeling) wrote…
I don't fully understand why this is the case, but done. Perhaps we chat offline a bit?
Spoke offline about this,
pkg/util/metric/metric_test.go
line 337 at r4 (raw file):
Previously, ericharmeling (Eric Harmeling) wrote…
I don't think you want 10 seconds here if you have set
Duration
totime.minute
.I meant to set the
Duration
to 10 seconds. Fixed that. (Although I guess it's arbitrary, since we are manually setting thenow
time)You would want to set time to the next
tickInterval + 1
.I think I want to set this to
time.Duration(i+1) * tickInterval
, right? If we are incrementing thenextTick
value by thetickInterval
at each tick call, we should be setting thenow
time to exactly when themaybeTick
evaluatesnextTick().Before(now())
.
Yes that seems right to me.
pkg/util/metric/metric_test.go
line 341 at r4 (raw file):
Previously, ericharmeling (Eric Harmeling) wrote…
Oh you're totally right. We should be looking at the windows and not the cumulative histogram.
But I don't think we need to call
ToPrometheusMetricWindowed
here. TheToPrometheusMetric
erroneously used here is an implementation of thePrometheusExportable
interface, which I don't think we want to includeToPrometheusMetricWindowed
, as this interface is implemented by metrics other than histograms (right?).
ValueAtQuantileWindowed
callsToPrometheusMetricWindowed()
when we start asserting values. I think the trick here is checking the "windowed" functions before and after observations within each window interval. This way we can make sure that the merge is happening and the quantile values are capturing the previous window early on in the current window's interval (i.e., before any observations) and later on (i.e., after all observations).Does that make sense?
Spoke offline, seems right.
Note: this should probably be backported to 22.2 and 23.1. But I leave that decision to the obs-inf team |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@aadityasondhi Thank you for all of the help on this!
this should probably be backported to 22.2 and 23.1. But I leave that decision to the obs-inf team
@dhartunian Any objections to backporting?
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @aadityasondhi and @dhartunian)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, this should be backported to both since both versions have the new histogram by default.
Reviewed 1 of 3 files at r6, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @aadityasondhi and @ericharmeling)
pkg/util/metric/metric.go
line 274 at r4 (raw file):
Previously, aadityasondhi (Aaditya Sondhi) wrote…
Spoke offline about this,
@ericharmeling can you document the explanation in a comment right above this field setting?
pkg/util/metric/metric.go
line 315 at r6 (raw file):
// but that is now how we have set it up right now. It should be doable // though, since there is only one consumer of windowed histograms - our // internal timeseries system.
Is this comment no longer necessary? It seems like we're implemented some version of this.
Code quote:
// TODO(obs-inf): the way we implement windowed histograms is not great. If
// the windowed histogram is pulled right after a tick, it will be mostly
// empty. We could add a third bucket and represent the merged view of the two
// most recent buckets to avoid that. Or we could "just" double the rotation
// interval (so that the histogram really collects for 20s when we expect to
// persist the contents every 10s). Really it would make more sense to
// explicitly rotate the histogram atomically with collecting its contents,
// but that is now how we have set it up right now. It should be doable
// though, since there is only one consumer of windowed histograms - our
// internal timeseries system.
pkg/util/metric/metric.go
line 391 at r6 (raw file):
panic(err) } if h.windowed.prev != nil {
why do we need to check for nil here? it seems like h.windowed.cur
and h.windowed.prev
are both set in the onTick
field of tickHelper
. Why check one for nil and not the other?
pkg/util/metric/metric.go
line 616 at r6 (raw file):
// ToPrometheusMetricWindowed returns a filled-in prometheus metric of the // right type. func (mwh *ManualWindowHistogram) ToPrometheusMetricWindowed() *prometheusgo.
nit: maybe suffix method with Locked
to clarify that lock should be taken prior.
pkg/util/metric/metric.go
line 617 at r6 (raw file):
// right type. func (mwh *ManualWindowHistogram) ToPrometheusMetricWindowed() *prometheusgo. Metric {
nit: why is Metric {
on a separate line?
pkg/util/metric/metric.go
line 639 at r6 (raw file):
func (mwh *ManualWindowHistogram) Total() (int64, float64) { mwh.mu.RLock() defer mwh.mu.RUnlock()
ToPrometheusMetric
locks this mutex already
pkg/util/metric/metric.go
line 653 at r6 (raw file):
func (mwh *ManualWindowHistogram) Mean() float64 { mwh.mu.RLock() defer mwh.mu.RUnlock()
ToPrometheusMetric
locks this mutex already
pkg/util/metric/metric.go
line 927 at r6 (raw file):
// histogram. // NB: Buckets on each histogram must be the same func MergeWindowedHistogram(cur *prometheusgo.Histogram, prev *prometheusgo.Histogram) {
Can you add some unit tests for this merge function?
f7006be
to
cd5e718
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TFTR @dhartunian ! I think I got to all of your comments.
Yes, this should be backported to both since both versions have the new histogram by default.
Added the backport labels.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @aadityasondhi, @dhartunian, and @tbg)
pkg/util/metric/metric.go
line 274 at r4 (raw file):
Previously, dhartunian (David Hartunian) wrote…
@ericharmeling can you document the explanation in a comment right above this field setting?
Done.
pkg/util/metric/metric.go
line 315 at r6 (raw file):
I believe @tbg wrote this in eb82a43.
If the windowed histogram is pulled right after a tick, it will be mostly empty.
This is no longer true. However, the comments after this might still apply for future improvements.
We could add a third bucket and represent the merged view of the two most recent buckets to avoid that.
We are kind of doing this now. But we're really just calculating this "third bucket" every time we call TotalWindowed
, MeanWindowed
, and ValueAtQuantileWindowed
.
@tbg any objections to removing this comment (or folding it into a tracking issue)?
pkg/util/metric/metric.go
line 391 at r6 (raw file):
We get a runtime error without this check at cluster start-up:
* ERROR: a panic has occurred!
* runtime error: invalid memory address or nil pointer dereference
* (1) attached stack trace
* -- stack trace:
* | runtime.gopanic
* | GOROOT/src/runtime/panic.go:884
* | runtime.panicmem
* | GOROOT/src/runtime/panic.go:260
* | runtime.sigpanic
* | GOROOT/src/runtime/signal_unix.go:835
* | github.com/cockroachdb/cockroach/pkg/util/metric.(*Histogram).ToPrometheusMetricWindowed
* | github.com/cockroachdb/cockroach/pkg/util/metric/pkg/util/metric/metric.go:394
* | github.com/cockroachdb/cockroach/pkg/util/metric.(*Histogram).MeanWindowed
* | github.com/cockroachdb/cockroach/pkg/util/metric/pkg/util/metric/metric.go:435
* | github.com/cockroachdb/cockroach/pkg/util/metric/aggmetric.(*AggHistogram).MeanWindowed
* | github.com/cockroachdb/cockroach/pkg/util/metric/aggmetric/histogram.go:76
* | github.com/cockroachdb/cockroach/pkg/server/status.extractValue
* | github.com/cockroachdb/cockroach/pkg/server/status/recorder.go:617
* | github.com/cockroachdb/cockroach/pkg/server/status.eachRecordableValue.func1
* | github.com/cockroachdb/cockroach/pkg/server/status/recorder.go:646
* | github.com/cockroachdb/cockroach/pkg/util/metric.(*Registry).Each.func1
* | github.com/cockroachdb/cockroach/pkg/util/metric/pkg/util/metric/registry.go:171
* | github.com/cockroachdb/cockroach/pkg/util/metric/aggmetric.(*AggHistogram).Inspect
* | github.com/cockroachdb/cockroach/pkg/util/metric/aggmetric/histogram.go:62
* | github.com/cockroachdb/cockroach/pkg/util/metric.(*Registry).Each
* | github.com/cockroachdb/cockroach/pkg/util/metric/pkg/util/metric/registry.go:170
* | github.com/cockroachdb/cockroach/pkg/server/status.eachRecordableValue
* | github.com/cockroachdb/cockroach/pkg/server/status/recorder.go:645
* | github.com/cockroachdb/cockroach/pkg/server/status.(*MetricsRecorder).GenerateNodeStatus
* | github.com/cockroachdb/cockroach/pkg/server/status/recorder.go:509
* | github.com/cockroachdb/cockroach/pkg/server.(*Node).writeNodeStatus.func1
* | github.com/cockroachdb/cockroach/pkg/server/node.go:1056
* | github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunTask
* | github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:319
* | github.com/cockroachdb/cockroach/pkg/server.(*Node).writeNodeStatus
* | github.com/cockroachdb/cockroach/pkg/server/node.go:1055
* | github.com/cockroachdb/cockroach/pkg/server.(*Node).startWriteNodeStatus.func1
* | github.com/cockroachdb/cockroach/pkg/server/node.go:1013
* | github.com/cockroachdb/cockroach/pkg/util/startup.RunIdempotentWithRetry.func1
* | github.com/cockroachdb/cockroach/pkg/util/startup/retry.go:121
* | github.com/cockroachdb/cockroach/pkg/util/startup.RunIdempotentWithRetryEx[...]
* | github.com/cockroachdb/cockroach/pkg/util/startup/retry.go:142
* | github.com/cockroachdb/cockroach/pkg/util/startup.RunIdempotentWithRetry
* | github.com/cockroachdb/cockroach/pkg/util/startup/retry.go:120
* | github.com/cockroachdb/cockroach/pkg/server.(*Node).startWriteNodeStatus
* | github.com/cockroachdb/cockroach/pkg/server/node.go:1010
* | github.com/cockroachdb/cockroach/pkg/server.(*Server).PreStart
* | github.com/cockroachdb/cockroach/pkg/server/server.go:1945
* | github.com/cockroachdb/cockroach/pkg/cli.createAndStartServerAsync.func1.2
* | github.com/cockroachdb/cockroach/pkg/cli/start.go:805
* | github.com/cockroachdb/cockroach/pkg/cli.createAndStartServerAsync.func1
* | github.com/cockroachdb/cockroach/pkg/cli/start.go:853
* | runtime.goexit
* | GOROOT/src/runtime/asm_arm64.s:1172
* Wraps: (2) runtime error: invalid memory address or nil pointer dereference
* Error types: (1) *withstack.withStack (2) runtime.errorString
Checking the prev
nil also helps us avoid evaluating the merge when one isn't possible (i.e., when there is no previous window).
Why check one for nil and not the other?
I believe h.windowed.cur
starts as nil
, so when it's rotated on the first call to newHistogram
, that nil
value is assigned to h.windowed.prev
(and a new histogram is assigned to h.windowed.prev
).
So... should we leave this check here, or explicitly initialize both prev
and cur
(and check the merge on an empty prev
instead)?
pkg/util/metric/metric.go
line 616 at r6 (raw file):
Previously, dhartunian (David Hartunian) wrote…
nit: maybe suffix method with
Locked
to clarify that lock should be taken prior.
Done.
pkg/util/metric/metric.go
line 617 at r6 (raw file):
Previously, dhartunian (David Hartunian) wrote…
nit: why is
Metric {
on a separate line?
Linter thing - fixed.
pkg/util/metric/metric.go
line 639 at r6 (raw file):
Previously, dhartunian (David Hartunian) wrote…
ToPrometheusMetric
locks this mutex already
Fixed.
pkg/util/metric/metric.go
line 653 at r6 (raw file):
Previously, dhartunian (David Hartunian) wrote…
ToPrometheusMetric
locks this mutex already
Fixed.
pkg/util/metric/metric.go
line 927 at r6 (raw file):
Previously, dhartunian (David Hartunian) wrote…
Can you add some unit tests for this merge function?
Done.
cd5e718
to
8dc28e2
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 2 files at r7.
Reviewable status: complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @aadityasondhi, @ericharmeling, and @tbg)
pkg/util/metric/metric.go
line 391 at r6 (raw file):
Previously, ericharmeling (Eric Harmeling) wrote…
We get a runtime error without this check at cluster start-up:
* ERROR: a panic has occurred! * runtime error: invalid memory address or nil pointer dereference * (1) attached stack trace * -- stack trace: * | runtime.gopanic * | GOROOT/src/runtime/panic.go:884 * | runtime.panicmem * | GOROOT/src/runtime/panic.go:260 * | runtime.sigpanic * | GOROOT/src/runtime/signal_unix.go:835 * | github.com/cockroachdb/cockroach/pkg/util/metric.(*Histogram).ToPrometheusMetricWindowed * | github.com/cockroachdb/cockroach/pkg/util/metric/pkg/util/metric/metric.go:394 * | github.com/cockroachdb/cockroach/pkg/util/metric.(*Histogram).MeanWindowed * | github.com/cockroachdb/cockroach/pkg/util/metric/pkg/util/metric/metric.go:435 * | github.com/cockroachdb/cockroach/pkg/util/metric/aggmetric.(*AggHistogram).MeanWindowed * | github.com/cockroachdb/cockroach/pkg/util/metric/aggmetric/histogram.go:76 * | github.com/cockroachdb/cockroach/pkg/server/status.extractValue * | github.com/cockroachdb/cockroach/pkg/server/status/recorder.go:617 * | github.com/cockroachdb/cockroach/pkg/server/status.eachRecordableValue.func1 * | github.com/cockroachdb/cockroach/pkg/server/status/recorder.go:646 * | github.com/cockroachdb/cockroach/pkg/util/metric.(*Registry).Each.func1 * | github.com/cockroachdb/cockroach/pkg/util/metric/pkg/util/metric/registry.go:171 * | github.com/cockroachdb/cockroach/pkg/util/metric/aggmetric.(*AggHistogram).Inspect * | github.com/cockroachdb/cockroach/pkg/util/metric/aggmetric/histogram.go:62 * | github.com/cockroachdb/cockroach/pkg/util/metric.(*Registry).Each * | github.com/cockroachdb/cockroach/pkg/util/metric/pkg/util/metric/registry.go:170 * | github.com/cockroachdb/cockroach/pkg/server/status.eachRecordableValue * | github.com/cockroachdb/cockroach/pkg/server/status/recorder.go:645 * | github.com/cockroachdb/cockroach/pkg/server/status.(*MetricsRecorder).GenerateNodeStatus * | github.com/cockroachdb/cockroach/pkg/server/status/recorder.go:509 * | github.com/cockroachdb/cockroach/pkg/server.(*Node).writeNodeStatus.func1 * | github.com/cockroachdb/cockroach/pkg/server/node.go:1056 * | github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunTask * | github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:319 * | github.com/cockroachdb/cockroach/pkg/server.(*Node).writeNodeStatus * | github.com/cockroachdb/cockroach/pkg/server/node.go:1055 * | github.com/cockroachdb/cockroach/pkg/server.(*Node).startWriteNodeStatus.func1 * | github.com/cockroachdb/cockroach/pkg/server/node.go:1013 * | github.com/cockroachdb/cockroach/pkg/util/startup.RunIdempotentWithRetry.func1 * | github.com/cockroachdb/cockroach/pkg/util/startup/retry.go:121 * | github.com/cockroachdb/cockroach/pkg/util/startup.RunIdempotentWithRetryEx[...] * | github.com/cockroachdb/cockroach/pkg/util/startup/retry.go:142 * | github.com/cockroachdb/cockroach/pkg/util/startup.RunIdempotentWithRetry * | github.com/cockroachdb/cockroach/pkg/util/startup/retry.go:120 * | github.com/cockroachdb/cockroach/pkg/server.(*Node).startWriteNodeStatus * | github.com/cockroachdb/cockroach/pkg/server/node.go:1010 * | github.com/cockroachdb/cockroach/pkg/server.(*Server).PreStart * | github.com/cockroachdb/cockroach/pkg/server/server.go:1945 * | github.com/cockroachdb/cockroach/pkg/cli.createAndStartServerAsync.func1.2 * | github.com/cockroachdb/cockroach/pkg/cli/start.go:805 * | github.com/cockroachdb/cockroach/pkg/cli.createAndStartServerAsync.func1 * | github.com/cockroachdb/cockroach/pkg/cli/start.go:853 * | runtime.goexit * | GOROOT/src/runtime/asm_arm64.s:1172 * Wraps: (2) runtime error: invalid memory address or nil pointer dereference * Error types: (1) *withstack.withStack (2) runtime.errorString
Checking the
prev
nil also helps us avoid evaluating the merge when one isn't possible (i.e., when there is no previous window).Why check one for nil and not the other?
I believe
h.windowed.cur
starts asnil
, so when it's rotated on the first call tonewHistogram
, thatnil
value is assigned toh.windowed.prev
(and a new histogram is assigned toh.windowed.prev
).So... should we leave this check here, or explicitly initialize both
prev
andcur
(and check the merge on an emptyprev
instead)?
ah makes sense. thanks for explaining in-depth.
pkg/util/metric/metric_test.go
line 437 at r7 (raw file):
// sample count, and per-bucket cumulative count values, // plus those of the previous histogram. require.Equal(t, *cur.Histogram.SampleCount, uint64(2))
nit: keep the order consistent between "expected" and "actual" args in the require
call.
Fixes cockroachdb#103814. Fixes cockroachdb#98266. This commit updates the windowed histogram merging approach to add the previous window's histogram bucket counts and sample count to those of the current one. As a result of this change, the histograms will no longer report under-sampled quantile values, and timeseries metrics-derived charts (e.g., the quantile-based SQL service latency charts on the DB console's Metrics page) will more accurately display metrics. Release note (bug fix): Updated the histogram window merge calculation to more accurately interpolate quantile values. This change will result in smoother, more accurate Metrics charts on the DB Console. Co-authored-by: Aaditya Sondhi <[email protected]>
8dc28e2
to
bae5045
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TFTR!
Reviewable status: complete! 0 of 0 LGTMs obtained (and 2 stale) (waiting on @aadityasondhi, @dhartunian, and @tbg)
pkg/util/metric/metric_test.go
line 437 at r7 (raw file):
Previously, dhartunian (David Hartunian) wrote…
nit: keep the order consistent between "expected" and "actual" args in the
require
call.
Done.
bors r+ |
Build succeeded: |
Encountered an error creating backports. Some common things that can go wrong:
You might need to create your backport manually using the backport tool. error creating merge commit from bae5045 to blathers/backport-release-22.2-104088: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict [] you may need to manually resolve merge conflicts with the backport tool. Backport to branch 22.2.x failed. See errors above. 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf. |
Fixes #103814.
Fixes #98266.
This commit updates the windowed histogram merging approach to add the previous window's histogram bucket counts and sample count to those of the current one. As a result of this change, the histograms will no longer report under-sampled quantile values, and timeseries metrics-derived charts (e.g., the quantile-based SQL service latency charts on the DB console's Metrics page) will more accurately display metrics.
Release note (bug fix): Updated the histogram window merge calculation to more accurately interpolate quantile values. This change will result in smoother, more accurate Metrics charts on the DB Console.