-
Notifications
You must be signed in to change notification settings - Fork 39.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Start drafting weighted and timing histograms #109277
Start drafting weighted and timing histograms #109277
Conversation
) | ||
|
||
// WeightedObserver | ||
type WeightedObserver interface { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@MikeSpreitzer - I think I'm not following the usecase of having weight here.
Can you point me to the usecase for it? Why is timingHistogram not enough for us?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@caibirdme gave a use case at prometheus/client_golang#796 (comment)
cc @logicalhan |
/retest |
/remove-sig api-machinery |
type GaugeOps interface { | ||
// Set sets the Gauge to the given value. | ||
Set(float64) | ||
// Add(1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: comments should be prefixed by the method they are commenting by convention.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah - let's maybe just copy comments from
https://pkg.go.dev/github.com/prometheus/client_golang/prometheus#Gauge
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some of the comments here are just placeholders, yes they need to follow conventions and be useful.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To make the relationship utterly clear, in my latest push I made the comments on the methods of GaugeOps state their equivalence to the corresponding method of Gauge.
Set(float64) | ||
// Add(1) | ||
Inc() | ||
// Sub(1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same
Set(float64) | ||
// Add(1) | ||
Inc() | ||
// Sub(1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same
} | ||
|
||
func (th *timingHistogram) SetToCurrentTime() { | ||
th.update(func(oldValue float64) float64 { return th.clock.Since(time.Unix(0, 0)).Seconds() }) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm really confused by this method, are you intending to set the gauge value to the unix timestamp or are you intending to set the th.lastSetTime
to the current timestamp?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this method will just set the current time as the new value it will not override the lastSetTime
. At least that would be the behavior of the normal SetToCurrentTime
method of a gauge. I find it also a bit confusing, but I guess it was a shortcut made in client_golang to simplify the use of gauge with timestamp metrics.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is strange to me too, but effectively this method is coming from the original prometheus interface:
https://pkg.go.dev/github.com/prometheus/client_golang/prometheus#Gauge
// SetToCurrentTime sets the Gauge to the current Unix time in seconds.
SetToCurrentTime()
So it works as designed :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I created GaugeOps
from the existing Gauge
interface, just removing the inclusion of Metric and Collector. Frankly, prometheus should have this interface (analogous to the way it has Observer = Histogram - {Metric, Collector}
).
"github.com/prometheus/client_golang/prometheus" | ||
|
||
dto "github.com/prometheus/client_model/go" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
these should be grouped
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
v0 := value0 | ||
t1 := t0.Add(time.Nanosecond) | ||
var v1 float64 = 0.75 | ||
clk.SetTime(t1) | ||
th.Set(v1) | ||
t2 := t1.Add(time.Microsecond) | ||
var d2 float64 = 0.5 | ||
v2 := v1 + d2 | ||
clk.SetTime(t2) | ||
th.Add(d2) | ||
t3 := t2 | ||
for i := 0; i < 1000000; i++ { | ||
t3 = t3.Add(time.Nanosecond) | ||
clk.SetTime(t3) | ||
th.Set(v2) | ||
} | ||
var d3 float64 = -0.6 | ||
v3 := v2 + d3 | ||
th.Add(d3) | ||
t4 := t3.Add(time.Second) | ||
clk.SetTime(t4) | ||
|
||
metch := make(chan prometheus.Metric) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this test setup is really hard to read and it's not obvious to me what you are doing...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In my latest push I added a func comment hopefully explaining it.
if want, got := uint64(t4.Sub(t0)), wroteHist.GetSampleCount(); want != got { | ||
t.Errorf("Wanted %v but got %v", want, got) | ||
} | ||
if want, got := float64(t1.Sub(t0))*v0+float64(t2.Sub(t1))*v1+float64(t3.Sub(t2))*v2+float64(t4.Sub(t3))*v3, wroteHist.GetSampleSum(); want != got { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe worth adding a private method which does the Sub and casts to a float for readability just for test readability.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
) | ||
|
||
// WeightedObserver | ||
type WeightedObserver interface { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we have tests for weighted histogram too?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
/triage accepted |
New changes look good to me. I am fine with moving forward with this PR once wojtek's comments are addressed. |
@dgrisonnet, I have three concerns with making the histogram code add text about the type into the help line:
|
/retest |
I understand your concerns, but sadly we can't tweak the To avoid any confusion for the users, maybe just pretending the actual HELP text by |
301a655
to
5f8b53e
Compare
The force-push to 5f8b53ef38e makes the suggested addition to the |
The following investigation occurred during development. Add TimingHistogram impl that shares lock with WeightedHistogram Benchmarking and profiling shows that two layers of locking is noticeably more expensive than one. After adding this new alternative, I now get the following benchmark results. ``` (base) mspreitz@mjs12 kubernetes % go test -benchmem -run=^$ -bench ^BenchmarkTimingHistogram$ k8s.io/component-base/metrics/prometheusextension goos: darwin goarch: amd64 pkg: k8s.io/component-base/metrics/prometheusextension cpu: Intel(R) Core(TM) i9-9880H CPU @ 2.30GHz BenchmarkTimingHistogram-16 22232037 52.79 ns/op 0 B/op 0 allocs/op PASS ok k8s.io/component-base/metrics/prometheusextension 1.404s (base) mspreitz@mjs12 kubernetes % go test -benchmem -run=^$ -bench ^BenchmarkTimingHistogram$ k8s.io/component-base/metrics/prometheusextension goos: darwin goarch: amd64 pkg: k8s.io/component-base/metrics/prometheusextension cpu: Intel(R) Core(TM) i9-9880H CPU @ 2.30GHz BenchmarkTimingHistogram-16 22190997 54.50 ns/op 0 B/op 0 allocs/op PASS ok k8s.io/component-base/metrics/prometheusextension 1.435s ``` and ``` (base) mspreitz@mjs12 kubernetes % go test -benchmem -run=^$ -bench ^BenchmarkTimingHistogramDirect$ k8s.io/component-base/metrics/prometheusextension goos: darwin goarch: amd64 pkg: k8s.io/component-base/metrics/prometheusextension cpu: Intel(R) Core(TM) i9-9880H CPU @ 2.30GHz BenchmarkTimingHistogramDirect-16 28863244 40.99 ns/op 0 B/op 0 allocs/op PASS ok k8s.io/component-base/metrics/prometheusextension 1.890s (base) mspreitz@mjs12 kubernetes % (base) mspreitz@mjs12 kubernetes % (base) mspreitz@mjs12 kubernetes % go test -benchmem -run=^$ -bench ^BenchmarkTimingHistogramDirect$ k8s.io/component-base/metrics/prometheusextension goos: darwin goarch: amd64 pkg: k8s.io/component-base/metrics/prometheusextension cpu: Intel(R) Core(TM) i9-9880H CPU @ 2.30GHz BenchmarkTimingHistogramDirect-16 27994173 40.37 ns/op 0 B/op 0 allocs/op PASS ok k8s.io/component-base/metrics/prometheusextension 1.384s ``` So the new implementation is roughly 20% faster than the original. Add overlooked exception, rename timingHistogram to timingHistogramLayered Use the direct (one mutex) style of TimingHistogram impl This is about a 20% gain in CPU speed on my development machine, in benchmarks without lock contention. Following are two consecutive trials. (base) mspreitz@mjs12 prometheusextension % go test -benchmem -run=^$ -bench Histogram . goos: darwin goarch: amd64 pkg: k8s.io/component-base/metrics/prometheusextension cpu: Intel(R) Core(TM) i9-9880H CPU @ 2.30GHz BenchmarkTimingHistogramLayered-16 21650905 51.91 ns/op 0 B/op 0 allocs/op BenchmarkTimingHistogramDirect-16 29876860 39.33 ns/op 0 B/op 0 allocs/op BenchmarkWeightedHistogram-16 49227044 24.13 ns/op 0 B/op 0 allocs/op BenchmarkHistogram-16 41063907 28.82 ns/op 0 B/op 0 allocs/op PASS ok k8s.io/component-base/metrics/prometheusextension 5.432s (base) mspreitz@mjs12 prometheusextension % go test -benchmem -run=^$ -bench Histogram . goos: darwin goarch: amd64 pkg: k8s.io/component-base/metrics/prometheusextension cpu: Intel(R) Core(TM) i9-9880H CPU @ 2.30GHz BenchmarkTimingHistogramLayered-16 22483816 51.72 ns/op 0 B/op 0 allocs/op BenchmarkTimingHistogramDirect-16 29697291 39.39 ns/op 0 B/op 0 allocs/op BenchmarkWeightedHistogram-16 48919845 24.03 ns/op 0 B/op 0 allocs/op BenchmarkHistogram-16 41153044 29.26 ns/op 0 B/op 0 allocs/op PASS ok k8s.io/component-base/metrics/prometheusextension 5.044s Remove layered implementation of TimingHistogram
5f8b53e
to
b4a40cd
Compare
And the force-push to b4a40cd adds the |
This LGTM. Based on the comments above from @dgrisonnet I'm going to approve it, but I will leave stamping as lgtm to @dgrisonnet or @logicalhan @dgrisonnet - would you mind taking a look? /approve |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: MikeSpreitzer, wojtek-t The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Great job @MikeSpreitzer! /lgtm |
Note that this is only an extension of the proto spec. Both generators and consumers of the protobuf still need changes to make use of these changes. Gauge histograms measure current distributions. For one, they are inspired by the GaugeHistogram type introducted by OpenMetrics, see https://github.com/OpenObservability/OpenMetrics/blob/main/specification/OpenMetrics.md#gaugehistogram They are also handled in the same way as OpenMetrics does it, by using a new MetricType enum field GAUGE_HISTOGRAM, but not changing anything else, i.e. for both regular and gauge histograms, the same Histogram message type is used. The other reason why we need gauge histograms comes from PromQL: If you `rate` a histogram (which is possible with the new sparse histograms as 1st class data type), the result is a gauge histogram. A rate'd histogram can be created by a recording rule and then stored in the TSDB. From there, it can be exposed by federation, so we need to be able to represent it in the exposition format. Float histograms are histograms where all counts (count of observations, counts in each bucket, zero bucket count) are floating point numbers rather than integer numbers. They are rarely needed for direct instrumentation. Use cases are weighted histograms or timing histograms, see kubernetes/kubernetes#109277 for a real-world example. However, float histograms happen all the time as results of PromQL expressions. Following the same line of argument as above, those float histograms can end up in the TSDB via recording rules, which means they can be exposed via federation. Note that float histograms are implicitly supported by the original Prometheus text format, as this format simply uses floating point numbers for all sample values. OpenMetrics has avoided this ambiguity and has specified integers for bucket counts and the count of observations in a histogram, which means it needs to be extended to support float histograms, similar to how this commit extends the original Prometheus protobuf format. Signed-off-by: beorn7 <[email protected]>
@MikeSpreitzer the |
I am not sure I understand the question. This particular PR has no release note. It is part of a larger campaign that does have release notes and that I hope to finish in 1.25. |
The bit of text immediately after the https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG/CHANGELOG-1.25.md#:~:text=tbd |
Oh, I see. Yes, that is not what I want. How can I fix that? |
I'm not sure if changing it after the fact will effect release notes going forward, or if it'll take someone from sig-release fix? |
What type of PR is this?
/kind cleanup
What this PR does / why we need it:
This PR introduces two new kinds of histograms, weighted histograms and timing histograms. A timing histogram has the interface of a gauge, but keeps track of the time that the variable spent in each of the ranges defined by the bucket boundaries. A timing histogram is built on a weighted histogram, which is like a regular histogram but its Observe method takes a weight as well as a value. Due to the limitations of the current model (in OpenMetrics) for histograms, the weight has to be an unsigned integer. The timing histograms are intended to replace the existing sample-and-watermark histograms in k/apiserver/pkg/util/flowcontrol with something less complex to consume and less costly at runtime to update.
These new histograms are hoped to eventually migrate into prometheus. Until then, they resides in k/component-base/metrics/prometheusextension .
Which issue(s) this PR fixes:
Fixes #
Special notes for your reviewer:
This is like #109094 but:
This is part of addressing #108272 .
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:
@kubernetes/sig-api-machinery-misc
/cc @wojtek-t
@beorn7
@dgrisonnet
@logicalhan