Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

metrics: fix windowed histogram merging approach #104088

Merged

Conversation

ericharmeling
Copy link
Contributor

@ericharmeling ericharmeling commented May 30, 2023

Fixes #103814.
Fixes #98266.

This commit updates the windowed histogram merging approach to add the previous window's histogram bucket counts and sample count to those of the current one. As a result of this change, the histograms will no longer report under-sampled quantile values, and timeseries metrics-derived charts (e.g., the quantile-based SQL service latency charts on the DB console's Metrics page) will more accurately display metrics.

Release note (bug fix): Updated the histogram window merge calculation to more accurately interpolate quantile values. This change will result in smoother, more accurate Metrics charts on the DB Console.

@ericharmeling ericharmeling requested review from a team and removed request for a team May 30, 2023 15:28
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@ericharmeling
Copy link
Contributor Author

Here's a screenshot of the DB Console custom metrics and a prometheus metrics query for the same quantile:

image

Copy link
Collaborator

@aadityasondhi aadityasondhi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for doing this! The core logic looks good, just need to extend it to all implementations of the windowed histogram implementation.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ericharmeling)


-- commits line 15 at r1:
Please add the issues this fixes:


pkg/util/metric/metric.go line 388 at r1 (raw file):

// Merging the sample count is necessary to correctly calculate the value at
// a given quantile with the merged bucket counts.
func (h *Histogram) ToPrometheusMetricWindowedMerged() *prometheusgo.Metric {

This should be implemented for all implementations of windowed histograms. Consider making it part of the WindowedHistogram Interface.


pkg/util/metric/metric.go line 453 at r1 (raw file):

//  2. Since the prometheus client library ensures buckets are in a strictly
//     increasing order at creation, we do not sort them.
func (h *Histogram) ValueAtQuantileWindowed(q float64) float64 {

Similarly, every implementation of this function should be updated for every implementation of windowed histogram.

@ericharmeling ericharmeling force-pushed the fix-windowed-prometheus-histograms branch 3 times, most recently from f7b0f47 to eb2872c Compare June 1, 2023 14:06
Copy link
Contributor Author

@ericharmeling ericharmeling left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TFTR!

I've updated the PR following our discussion offline. I'm a little unsure about the ManualWindowHistogram merging implementation. PTAL.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aadityasondhi)


-- commits line 15 at r1:

Previously, aadityasondhi (Aaditya Sondhi) wrote…

Please add the issues this fixes:

Done.


pkg/util/metric/metric.go line 388 at r1 (raw file):

Previously, aadityasondhi (Aaditya Sondhi) wrote…

This should be implemented for all implementations of windowed histograms. Consider making it part of the WindowedHistogram Interface.

After our discussion offline, I believe that we decided to not add ToPrometheusMetricWindowed/ToPrometheusMetricWindowedMerged to the WindowedHistogram interface, and instead rely on ValueAtQuantileWindowed implementations to merge prev and cur windows. I've added a note in the interface definition.


pkg/util/metric/metric.go line 453 at r1 (raw file):

Previously, aadityasondhi (Aaditya Sondhi) wrote…

Similarly, every implementation of this function should be updated for every implementation of windowed histogram.

Done.


pkg/util/metric/metric.go line 609 at r3 (raw file):

func (mwh *ManualWindowHistogram) ToPrometheusMetricWindowed() *prometheusgo.
	Metric {
	cur := &prometheusgo.Metric{}

@aadityasondhi
Does this look right to you?

Copy link
Collaborator

@aadityasondhi aadityasondhi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks great! Just a few minor changes are needed here.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ericharmeling)


pkg/util/metric/metric.go line 609 at r3 (raw file):

Previously, ericharmeling (Eric Harmeling) wrote…

@aadityasondhi
Does this look right to you?

Yes, looks good!


pkg/util/metric/metric.go line 274 at r4 (raw file):

	h.windowed.tickHelper = &tickHelper{
		nextT:        now(),
		tickInterval: windowDuration,

You want to divide the windowDuration by 2 since we have 2 windows here. This is similar to what happens in newHdrHistogram with the histWrapNum.


pkg/util/metric/metric_test.go line 337 at r4 (raw file):

		}
		// Tick. This rotates the histogram.
		setNow(time.Duration(i+1) * 10 * time.Second)

I don't think you want 10 seconds here if you have set Duration to time.minute. You would want to set time to the next tickInterval + 1.


pkg/util/metric/metric_test.go line 341 at r4 (raw file):

	}

	act := *h.ToPrometheusMetric().Histogram

Don't you want to call the ToPrometheusMetricWindowed() function here? Right now, this test actually tests the cumulative histogram working correctly, which is not the point. If you call ToPrometheusMetricWindowed(), I think you will get a test that will actually just show the prev+cur histogram merged view rather than the view of all 4 histograms you have created here. The oldest two histogram views would be removed from the windowed histogram (as designed).

@ericharmeling ericharmeling force-pushed the fix-windowed-prometheus-histograms branch from eb2872c to 16408b7 Compare June 2, 2023 19:23
Copy link
Contributor Author

@ericharmeling ericharmeling left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TFTR! I've made some changes following your feedback, with some commentary. I might want to pair offline once more to clarify some things.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aadityasondhi)


pkg/util/metric/metric.go line 274 at r4 (raw file):

Previously, aadityasondhi (Aaditya Sondhi) wrote…

You want to divide the windowDuration by 2 since we have 2 windows here. This is similar to what happens in newHdrHistogram with the histWrapNum.

I don't fully understand why this is the case, but done. Perhaps we chat offline a bit?


pkg/util/metric/metric_test.go line 337 at r4 (raw file):

I don't think you want 10 seconds here if you have set Duration to time.minute.

I meant to set the Duration to 10 seconds. Fixed that. (Although I guess it's arbitrary, since we are manually setting the now time)

You would want to set time to the next tickInterval + 1.

I think I want to set this to time.Duration(i+1) * tickInterval, right? If we are incrementing the nextTick value by the tickInterval at each tick call, we should be setting the now time to exactly when the maybeTick evaluates nextTick().Before(now()).


pkg/util/metric/metric_test.go line 341 at r4 (raw file):

Previously, aadityasondhi (Aaditya Sondhi) wrote…

Don't you want to call the ToPrometheusMetricWindowed() function here? Right now, this test actually tests the cumulative histogram working correctly, which is not the point. If you call ToPrometheusMetricWindowed(), I think you will get a test that will actually just show the prev+cur histogram merged view rather than the view of all 4 histograms you have created here. The oldest two histogram views would be removed from the windowed histogram (as designed).

Oh you're totally right. We should be looking at the windows and not the cumulative histogram.

But I don't think we need to call ToPrometheusMetricWindowed here. The ToPrometheusMetric erroneously used here is an implementation of the PrometheusExportable interface, which I don't think we want to include ToPrometheusMetricWindowed, as this interface is implemented by metrics other than histograms (right?).

ValueAtQuantileWindowed calls ToPrometheusMetricWindowed() when we start asserting values. I think the trick here is checking the "windowed" functions before and after observations within each window interval. This way we can make sure that the merge is happening and the quantile values are capturing the previous window early on in the current window's interval (i.e., before any observations) and later on (i.e., after all observations).

Does that make sense?

@ericharmeling ericharmeling force-pushed the fix-windowed-prometheus-histograms branch from 16408b7 to 13beab0 Compare June 5, 2023 19:03
Copy link
Collaborator

@aadityasondhi aadityasondhi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for all the work, this looks good to me!
:lgtm:

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @ericharmeling)


pkg/util/metric/metric.go line 274 at r4 (raw file):

Previously, ericharmeling (Eric Harmeling) wrote…

I don't fully understand why this is the case, but done. Perhaps we chat offline a bit?

Spoke offline about this,


pkg/util/metric/metric_test.go line 337 at r4 (raw file):

Previously, ericharmeling (Eric Harmeling) wrote…

I don't think you want 10 seconds here if you have set Duration to time.minute.

I meant to set the Duration to 10 seconds. Fixed that. (Although I guess it's arbitrary, since we are manually setting the now time)

You would want to set time to the next tickInterval + 1.

I think I want to set this to time.Duration(i+1) * tickInterval, right? If we are incrementing the nextTick value by the tickInterval at each tick call, we should be setting the now time to exactly when the maybeTick evaluates nextTick().Before(now()).

Yes that seems right to me.


pkg/util/metric/metric_test.go line 341 at r4 (raw file):

Previously, ericharmeling (Eric Harmeling) wrote…

Oh you're totally right. We should be looking at the windows and not the cumulative histogram.

But I don't think we need to call ToPrometheusMetricWindowed here. The ToPrometheusMetric erroneously used here is an implementation of the PrometheusExportable interface, which I don't think we want to include ToPrometheusMetricWindowed, as this interface is implemented by metrics other than histograms (right?).

ValueAtQuantileWindowed calls ToPrometheusMetricWindowed() when we start asserting values. I think the trick here is checking the "windowed" functions before and after observations within each window interval. This way we can make sure that the merge is happening and the quantile values are capturing the previous window early on in the current window's interval (i.e., before any observations) and later on (i.e., after all observations).

Does that make sense?

Spoke offline, seems right.

@aadityasondhi
Copy link
Collaborator

Note: this should probably be backported to 22.2 and 23.1. But I leave that decision to the obs-inf team

@ericharmeling ericharmeling requested a review from dhartunian June 6, 2023 20:38
Copy link
Contributor Author

@ericharmeling ericharmeling left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@aadityasondhi Thank you for all of the help on this!

this should probably be backported to 22.2 and 23.1. But I leave that decision to the obs-inf team

@dhartunian Any objections to backporting?

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @aadityasondhi and @dhartunian)

@ericharmeling ericharmeling requested a review from a team June 7, 2023 13:43
Copy link
Collaborator

@dhartunian dhartunian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this should be backported to both since both versions have the new histogram by default.

Reviewed 1 of 3 files at r6, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @aadityasondhi and @ericharmeling)


pkg/util/metric/metric.go line 274 at r4 (raw file):

Previously, aadityasondhi (Aaditya Sondhi) wrote…

Spoke offline about this,

@ericharmeling can you document the explanation in a comment right above this field setting?


pkg/util/metric/metric.go line 315 at r6 (raw file):

	// but that is now how we have set it up right now. It should be doable
	// though, since there is only one consumer of windowed histograms - our
	// internal timeseries system.

Is this comment no longer necessary? It seems like we're implemented some version of this.

Code quote:

	// TODO(obs-inf): the way we implement windowed histograms is not great. If
	// the windowed histogram is pulled right after a tick, it will be mostly
	// empty. We could add a third bucket and represent the merged view of the two
	// most recent buckets to avoid that. Or we could "just" double the rotation
	// interval (so that the histogram really collects for 20s when we expect to
	// persist the contents every 10s). Really it would make more sense to
	// explicitly rotate the histogram atomically with collecting its contents,
	// but that is now how we have set it up right now. It should be doable
	// though, since there is only one consumer of windowed histograms - our
	// internal timeseries system.

pkg/util/metric/metric.go line 391 at r6 (raw file):

		panic(err)
	}
	if h.windowed.prev != nil {

why do we need to check for nil here? it seems like h.windowed.cur and h.windowed.prev are both set in the onTick field of tickHelper. Why check one for nil and not the other?


pkg/util/metric/metric.go line 616 at r6 (raw file):

// ToPrometheusMetricWindowed returns a filled-in prometheus metric of the
// right type.
func (mwh *ManualWindowHistogram) ToPrometheusMetricWindowed() *prometheusgo.

nit: maybe suffix method with Locked to clarify that lock should be taken prior.


pkg/util/metric/metric.go line 617 at r6 (raw file):

// right type.
func (mwh *ManualWindowHistogram) ToPrometheusMetricWindowed() *prometheusgo.
	Metric {

nit: why is Metric { on a separate line?


pkg/util/metric/metric.go line 639 at r6 (raw file):

func (mwh *ManualWindowHistogram) Total() (int64, float64) {
	mwh.mu.RLock()
	defer mwh.mu.RUnlock()

ToPrometheusMetric locks this mutex already


pkg/util/metric/metric.go line 653 at r6 (raw file):

func (mwh *ManualWindowHistogram) Mean() float64 {
	mwh.mu.RLock()
	defer mwh.mu.RUnlock()

ToPrometheusMetric locks this mutex already


pkg/util/metric/metric.go line 927 at r6 (raw file):

// histogram.
// NB: Buckets on each histogram must be the same
func MergeWindowedHistogram(cur *prometheusgo.Histogram, prev *prometheusgo.Histogram) {

Can you add some unit tests for this merge function?

@ericharmeling ericharmeling force-pushed the fix-windowed-prometheus-histograms branch 2 times, most recently from f7006be to cd5e718 Compare June 9, 2023 18:07
@ericharmeling ericharmeling added backport-23.1.x Flags PRs that need to be backported to 23.1 backport-22.2.x labels Jun 9, 2023
Copy link
Contributor Author

@ericharmeling ericharmeling left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TFTR @dhartunian ! I think I got to all of your comments.

Yes, this should be backported to both since both versions have the new histogram by default.

Added the backport labels.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @aadityasondhi, @dhartunian, and @tbg)


pkg/util/metric/metric.go line 274 at r4 (raw file):

Previously, dhartunian (David Hartunian) wrote…

@ericharmeling can you document the explanation in a comment right above this field setting?

Done.


pkg/util/metric/metric.go line 315 at r6 (raw file):
I believe @tbg wrote this in eb82a43.

If the windowed histogram is pulled right after a tick, it will be mostly empty.

This is no longer true. However, the comments after this might still apply for future improvements.

We could add a third bucket and represent the merged view of the two most recent buckets to avoid that.

We are kind of doing this now. But we're really just calculating this "third bucket" every time we call TotalWindowed, MeanWindowed, and ValueAtQuantileWindowed.

@tbg any objections to removing this comment (or folding it into a tracking issue)?


pkg/util/metric/metric.go line 391 at r6 (raw file):
We get a runtime error without this check at cluster start-up:

* ERROR: a panic has occurred!
* runtime error: invalid memory address or nil pointer dereference
* (1) attached stack trace
*   -- stack trace:
*   | runtime.gopanic
*   | 	GOROOT/src/runtime/panic.go:884
*   | runtime.panicmem
*   | 	GOROOT/src/runtime/panic.go:260
*   | runtime.sigpanic
*   | 	GOROOT/src/runtime/signal_unix.go:835
*   | github.com/cockroachdb/cockroach/pkg/util/metric.(*Histogram).ToPrometheusMetricWindowed
*   | 	github.com/cockroachdb/cockroach/pkg/util/metric/pkg/util/metric/metric.go:394
*   | github.com/cockroachdb/cockroach/pkg/util/metric.(*Histogram).MeanWindowed
*   | 	github.com/cockroachdb/cockroach/pkg/util/metric/pkg/util/metric/metric.go:435
*   | github.com/cockroachdb/cockroach/pkg/util/metric/aggmetric.(*AggHistogram).MeanWindowed
*   | 	github.com/cockroachdb/cockroach/pkg/util/metric/aggmetric/histogram.go:76
*   | github.com/cockroachdb/cockroach/pkg/server/status.extractValue
*   | 	github.com/cockroachdb/cockroach/pkg/server/status/recorder.go:617
*   | github.com/cockroachdb/cockroach/pkg/server/status.eachRecordableValue.func1
*   | 	github.com/cockroachdb/cockroach/pkg/server/status/recorder.go:646
*   | github.com/cockroachdb/cockroach/pkg/util/metric.(*Registry).Each.func1
*   | 	github.com/cockroachdb/cockroach/pkg/util/metric/pkg/util/metric/registry.go:171
*   | github.com/cockroachdb/cockroach/pkg/util/metric/aggmetric.(*AggHistogram).Inspect
*   | 	github.com/cockroachdb/cockroach/pkg/util/metric/aggmetric/histogram.go:62
*   | github.com/cockroachdb/cockroach/pkg/util/metric.(*Registry).Each
*   | 	github.com/cockroachdb/cockroach/pkg/util/metric/pkg/util/metric/registry.go:170
*   | github.com/cockroachdb/cockroach/pkg/server/status.eachRecordableValue
*   | 	github.com/cockroachdb/cockroach/pkg/server/status/recorder.go:645
*   | github.com/cockroachdb/cockroach/pkg/server/status.(*MetricsRecorder).GenerateNodeStatus
*   | 	github.com/cockroachdb/cockroach/pkg/server/status/recorder.go:509
*   | github.com/cockroachdb/cockroach/pkg/server.(*Node).writeNodeStatus.func1
*   | 	github.com/cockroachdb/cockroach/pkg/server/node.go:1056
*   | github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunTask
*   | 	github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:319
*   | github.com/cockroachdb/cockroach/pkg/server.(*Node).writeNodeStatus
*   | 	github.com/cockroachdb/cockroach/pkg/server/node.go:1055
*   | github.com/cockroachdb/cockroach/pkg/server.(*Node).startWriteNodeStatus.func1
*   | 	github.com/cockroachdb/cockroach/pkg/server/node.go:1013
*   | github.com/cockroachdb/cockroach/pkg/util/startup.RunIdempotentWithRetry.func1
*   | 	github.com/cockroachdb/cockroach/pkg/util/startup/retry.go:121
*   | github.com/cockroachdb/cockroach/pkg/util/startup.RunIdempotentWithRetryEx[...]
*   | 	github.com/cockroachdb/cockroach/pkg/util/startup/retry.go:142
*   | github.com/cockroachdb/cockroach/pkg/util/startup.RunIdempotentWithRetry
*   | 	github.com/cockroachdb/cockroach/pkg/util/startup/retry.go:120
*   | github.com/cockroachdb/cockroach/pkg/server.(*Node).startWriteNodeStatus
*   | 	github.com/cockroachdb/cockroach/pkg/server/node.go:1010
*   | github.com/cockroachdb/cockroach/pkg/server.(*Server).PreStart
*   | 	github.com/cockroachdb/cockroach/pkg/server/server.go:1945
*   | github.com/cockroachdb/cockroach/pkg/cli.createAndStartServerAsync.func1.2
*   | 	github.com/cockroachdb/cockroach/pkg/cli/start.go:805
*   | github.com/cockroachdb/cockroach/pkg/cli.createAndStartServerAsync.func1
*   | 	github.com/cockroachdb/cockroach/pkg/cli/start.go:853
*   | runtime.goexit
*   | 	GOROOT/src/runtime/asm_arm64.s:1172
* Wraps: (2) runtime error: invalid memory address or nil pointer dereference
* Error types: (1) *withstack.withStack (2) runtime.errorString

Checking the prev nil also helps us avoid evaluating the merge when one isn't possible (i.e., when there is no previous window).

Why check one for nil and not the other?

I believe h.windowed.cur starts as nil, so when it's rotated on the first call to newHistogram, that nil value is assigned to h.windowed.prev (and a new histogram is assigned to h.windowed.prev).

So... should we leave this check here, or explicitly initialize both prev and cur (and check the merge on an empty prev instead)?


pkg/util/metric/metric.go line 616 at r6 (raw file):

Previously, dhartunian (David Hartunian) wrote…

nit: maybe suffix method with Locked to clarify that lock should be taken prior.

Done.


pkg/util/metric/metric.go line 617 at r6 (raw file):

Previously, dhartunian (David Hartunian) wrote…

nit: why is Metric { on a separate line?

Linter thing - fixed.


pkg/util/metric/metric.go line 639 at r6 (raw file):

Previously, dhartunian (David Hartunian) wrote…

ToPrometheusMetric locks this mutex already

Fixed.


pkg/util/metric/metric.go line 653 at r6 (raw file):

Previously, dhartunian (David Hartunian) wrote…

ToPrometheusMetric locks this mutex already

Fixed.


pkg/util/metric/metric.go line 927 at r6 (raw file):

Previously, dhartunian (David Hartunian) wrote…

Can you add some unit tests for this merge function?

Done.

@ericharmeling ericharmeling force-pushed the fix-windowed-prometheus-histograms branch from cd5e718 to 8dc28e2 Compare June 9, 2023 18:23
@ericharmeling ericharmeling requested a review from dhartunian June 12, 2023 13:26
Copy link
Collaborator

@dhartunian dhartunian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 1 of 2 files at r7.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @aadityasondhi, @ericharmeling, and @tbg)


pkg/util/metric/metric.go line 391 at r6 (raw file):

Previously, ericharmeling (Eric Harmeling) wrote…

We get a runtime error without this check at cluster start-up:

* ERROR: a panic has occurred!
* runtime error: invalid memory address or nil pointer dereference
* (1) attached stack trace
*   -- stack trace:
*   | runtime.gopanic
*   | 	GOROOT/src/runtime/panic.go:884
*   | runtime.panicmem
*   | 	GOROOT/src/runtime/panic.go:260
*   | runtime.sigpanic
*   | 	GOROOT/src/runtime/signal_unix.go:835
*   | github.com/cockroachdb/cockroach/pkg/util/metric.(*Histogram).ToPrometheusMetricWindowed
*   | 	github.com/cockroachdb/cockroach/pkg/util/metric/pkg/util/metric/metric.go:394
*   | github.com/cockroachdb/cockroach/pkg/util/metric.(*Histogram).MeanWindowed
*   | 	github.com/cockroachdb/cockroach/pkg/util/metric/pkg/util/metric/metric.go:435
*   | github.com/cockroachdb/cockroach/pkg/util/metric/aggmetric.(*AggHistogram).MeanWindowed
*   | 	github.com/cockroachdb/cockroach/pkg/util/metric/aggmetric/histogram.go:76
*   | github.com/cockroachdb/cockroach/pkg/server/status.extractValue
*   | 	github.com/cockroachdb/cockroach/pkg/server/status/recorder.go:617
*   | github.com/cockroachdb/cockroach/pkg/server/status.eachRecordableValue.func1
*   | 	github.com/cockroachdb/cockroach/pkg/server/status/recorder.go:646
*   | github.com/cockroachdb/cockroach/pkg/util/metric.(*Registry).Each.func1
*   | 	github.com/cockroachdb/cockroach/pkg/util/metric/pkg/util/metric/registry.go:171
*   | github.com/cockroachdb/cockroach/pkg/util/metric/aggmetric.(*AggHistogram).Inspect
*   | 	github.com/cockroachdb/cockroach/pkg/util/metric/aggmetric/histogram.go:62
*   | github.com/cockroachdb/cockroach/pkg/util/metric.(*Registry).Each
*   | 	github.com/cockroachdb/cockroach/pkg/util/metric/pkg/util/metric/registry.go:170
*   | github.com/cockroachdb/cockroach/pkg/server/status.eachRecordableValue
*   | 	github.com/cockroachdb/cockroach/pkg/server/status/recorder.go:645
*   | github.com/cockroachdb/cockroach/pkg/server/status.(*MetricsRecorder).GenerateNodeStatus
*   | 	github.com/cockroachdb/cockroach/pkg/server/status/recorder.go:509
*   | github.com/cockroachdb/cockroach/pkg/server.(*Node).writeNodeStatus.func1
*   | 	github.com/cockroachdb/cockroach/pkg/server/node.go:1056
*   | github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunTask
*   | 	github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:319
*   | github.com/cockroachdb/cockroach/pkg/server.(*Node).writeNodeStatus
*   | 	github.com/cockroachdb/cockroach/pkg/server/node.go:1055
*   | github.com/cockroachdb/cockroach/pkg/server.(*Node).startWriteNodeStatus.func1
*   | 	github.com/cockroachdb/cockroach/pkg/server/node.go:1013
*   | github.com/cockroachdb/cockroach/pkg/util/startup.RunIdempotentWithRetry.func1
*   | 	github.com/cockroachdb/cockroach/pkg/util/startup/retry.go:121
*   | github.com/cockroachdb/cockroach/pkg/util/startup.RunIdempotentWithRetryEx[...]
*   | 	github.com/cockroachdb/cockroach/pkg/util/startup/retry.go:142
*   | github.com/cockroachdb/cockroach/pkg/util/startup.RunIdempotentWithRetry
*   | 	github.com/cockroachdb/cockroach/pkg/util/startup/retry.go:120
*   | github.com/cockroachdb/cockroach/pkg/server.(*Node).startWriteNodeStatus
*   | 	github.com/cockroachdb/cockroach/pkg/server/node.go:1010
*   | github.com/cockroachdb/cockroach/pkg/server.(*Server).PreStart
*   | 	github.com/cockroachdb/cockroach/pkg/server/server.go:1945
*   | github.com/cockroachdb/cockroach/pkg/cli.createAndStartServerAsync.func1.2
*   | 	github.com/cockroachdb/cockroach/pkg/cli/start.go:805
*   | github.com/cockroachdb/cockroach/pkg/cli.createAndStartServerAsync.func1
*   | 	github.com/cockroachdb/cockroach/pkg/cli/start.go:853
*   | runtime.goexit
*   | 	GOROOT/src/runtime/asm_arm64.s:1172
* Wraps: (2) runtime error: invalid memory address or nil pointer dereference
* Error types: (1) *withstack.withStack (2) runtime.errorString

Checking the prev nil also helps us avoid evaluating the merge when one isn't possible (i.e., when there is no previous window).

Why check one for nil and not the other?

I believe h.windowed.cur starts as nil, so when it's rotated on the first call to newHistogram, that nil value is assigned to h.windowed.prev (and a new histogram is assigned to h.windowed.prev).

So... should we leave this check here, or explicitly initialize both prev and cur (and check the merge on an empty prev instead)?

ah makes sense. thanks for explaining in-depth.


pkg/util/metric/metric_test.go line 437 at r7 (raw file):

	// sample count, and per-bucket cumulative count values,
	// plus those of the previous histogram.
	require.Equal(t, *cur.Histogram.SampleCount, uint64(2))

nit: keep the order consistent between "expected" and "actual" args in the require call.

Fixes cockroachdb#103814.
Fixes cockroachdb#98266.

This commit updates the windowed histogram merging approach
to add the previous window's histogram bucket counts and sample
count to those of the current one. As a result of this change,
the histograms will no longer report under-sampled quantile values,
and timeseries metrics-derived charts (e.g., the quantile-based
SQL service latency charts on the DB console's Metrics page) will
more accurately display metrics.

Release note (bug fix): Updated the histogram window merge calculation
to more accurately interpolate quantile values. This change will result
in smoother, more accurate Metrics charts on the DB Console.

Co-authored-by: Aaditya Sondhi <[email protected]>
@ericharmeling ericharmeling force-pushed the fix-windowed-prometheus-histograms branch from 8dc28e2 to bae5045 Compare June 13, 2023 18:57
Copy link
Contributor Author

@ericharmeling ericharmeling left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TFTR!

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 2 stale) (waiting on @aadityasondhi, @dhartunian, and @tbg)


pkg/util/metric/metric_test.go line 437 at r7 (raw file):

Previously, dhartunian (David Hartunian) wrote…

nit: keep the order consistent between "expected" and "actual" args in the require call.

Done.

@ericharmeling
Copy link
Contributor Author

bors r+

@craig
Copy link
Contributor

craig bot commented Jun 13, 2023

Build succeeded:

@craig craig bot merged commit 99f92ee into cockroachdb:master Jun 13, 2023
@blathers-crl
Copy link

blathers-crl bot commented Jun 13, 2023

Encountered an error creating backports. Some common things that can go wrong:

  1. The backport branch might have already existed.
  2. There was a merge conflict.
  3. The backport branch contained merge commits.

You might need to create your backport manually using the backport tool.


error creating merge commit from bae5045 to blathers/backport-release-22.2-104088: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 22.2.x failed. See errors above.


🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport-23.1.x Flags PRs that need to be backported to 23.1
Projects
None yet
4 participants