The histogram aggregator plugin was added. #2387
Conversation
Thank you for the contribution, @vlamug. We also already have a PR for adding basic aggregator stats to Telegraf: #2167. I will be working on adding quantiles either to that one or to a separate plugin. One major problem with merging your plugin is that it re-implements measurement filtering, which Telegraf already does, and plugin authors should not be re-implementing this. For these reasons I'm going to close this PR.
Hi, @sparrc!
Is there another way to avoid applying aggregations to all metrics?
Quantiles calculated on the client are not the same as histograms: they are precalculated on the client, so you're not able to calculate cross-server quantiles at all.
ah, sorry I thought that the "buckets" were quantiles. What is a "histogram" stat? I think you might be using a definition from a specific product/workflow, can you link to some documentation on what this is exactly?
@sparrc thanks for the suggestion! About quantiles: there are two ways you can calculate them.
Which modifications should be made to move this pull request forward (besides the filtering)?
ah, I see. Apologies for the misunderstanding, I'll reopen; please ping me when you've implemented it to use built-in filtering and I'll review.
# ## The example of config to aggregate histogram for all fields of specified metric.
# [[aggregators.histogram.config]]
#   ## The set of buckets.
#   buckets = [0.0, 15.6, 34.5, 49.1, 71.5, 80.5, 94.5, 100.0]
each bucket is an "upper inclusive bound", correct? you should document that
you're right. Sure
Done. Added to documentation.
### Tags:
All measurements have tag `le`. This tag has the border value of bucket. |
what does `le` mean exactly? Also document that it's the "upper inclusive bound"
It is the right border of the bucket, which means that the value of the metric is less than the value of this bucket.
For example, say we have the value "9" for the metric "usage_user" and the following buckets:
[0.0, 10.0, 30.0, 50.0, 80.0, 100.0]
Then the tag "le" will have the value "10", and this tag will be added for the field "usage_user".
So we will get output like this:
...
cpu,host=localhost,cpu=cpu1,le=10.0 usage_user_bucket=1
...
It is the right border of the bucket, which means that the value of the metric is less than the value of this bucket.
I believe it should actually be "less than or equal"? Do you have a unit test to verify this?
I believe it should actually be "less than or equal"?
Yes, you are right, sorry for my mistake.
Do you have a unit test to verify this?
Yes, a unit test checks this case.
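To make the agreed "less than or equal" semantics concrete, here is a minimal Go sketch (assuming the plugin keeps using `sort.SearchFloat64s`, as shown later in this review) of how a value maps to its `le` bucket:

```go
package main

import (
	"fmt"
	"sort"
)

func main() {
	buckets := []float64{0.0, 10.0, 30.0, 50.0, 80.0, 100.0}
	// SearchFloat64s returns the smallest index i with buckets[i] >= value,
	// which is exactly the "upper inclusive bound" semantics of the le tag.
	fmt.Println(buckets[sort.SearchFloat64s(buckets, 9.0)])  // 10 -> le=10
	fmt.Println(buckets[sort.SearchFloat64s(buckets, 10.0)]) // 10 -> a value equal to the border still lands in le=10
}
```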
```
cpu_usage_system_bucket{cpu="cpu-total",host="local",le="0"} 0
cpu_usage_system_bucket{cpu="cpu-total",host="local",le="10"} 24
```
please provide output in influx-format (you can get this easily using the --test flag)
also, does the prometheus output properly put the correct types on these?
prometheus works well - we already tested full integration: from metrics collection to visualisation in Grafana
this is not negotiable, every other plugin provides it in influx format, so you need to do the same
Ok, we see, we will provide the example in influx-format.
Done
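For reference, the influx-format equivalent of the Prometheus snippet above would look roughly like this (tag and field names inferred from the line-protocol examples later in this thread):

```
cpu,cpu=cpu-total,host=local,le=0.0 usage_system_bucket=0i
cpu,cpu=cpu-total,host=local,le=10.0 usage_system_bucket=24i
```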
if value, ok := convert(value); ok {
	index := sort.SearchFloat64s(buckets, value)
	if index < len(agr.histogramCollection[field]) {
why is this check necessary?
This algorithm was taken from the Go Prometheus client, and it was needed to prevent incrementing the "+Inf" bucket twice. But we changed the algorithm a bit, and this condition will be removed after our changes.
So it is not necessary here.
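For context, here is a small Go sketch of the overflow case the removed guard was protecting against; the extra "+Inf" slot at the end of the counts slice is an assumption about the revised layout, not confirmed code:

```go
package main

import (
	"fmt"
	"sort"
)

func main() {
	buckets := []float64{0.0, 10.0, 30.0}
	// For a value above the largest bucket, SearchFloat64s returns
	// len(buckets), so a counts slice with one extra slot can use that
	// index directly as the implicit "+Inf" bucket.
	counts := make([]int64, len(buckets)+1)
	counts[sort.SearchFloat64s(buckets, 42.0)]++ // increments counts[3], the "+Inf" slot
	fmt.Println(counts)                          // [0 0 0 1]
}
```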
Additional checks removed. Done
// Push returns histogram values for metrics
func (h *HistogramAggregator) Push(acc telegraf.Accumulator) {
	var isResetNeeded = false
what is this exactly? can you explain why this is necessary?
As you noticed, the method Reset() is empty. We did this in order to be able to accumulate hits in the buckets. For example:
If we reset the cache in the method Reset() and the "period" value is short (30 seconds, for example), then hits will not accumulate enough, i.e. the counts in the buckets will be so small that we will not get quite the correct distribution. Therefore the method Reset() is empty.
However, after some time with an empty Reset(), the hit counts would exceed what a uint64 can hold. So we decided to add the variable isResetNeeded for resetting the hits cache.
In line 153 you can find the condition which defines when the cache is reset. We use the value math.MaxInt64. It is quite suitable for a sufficient accumulation of hits in order to have a correct distribution.
MaxInt64 seems like overkill to get a "correct distribution"
If you want this merged into the main repo you're going to have to make it reset when Reset() is called.
I'm not 100% sure I agree that it should be managing the int64 rollovers itself either, but I'll consider it if you really don't have a way of predicting when an int64 rollover might occur.
Thanks for your observation.
We agree that MaxInt64 is really overkill. Furthermore, our bucket counts have the uint64 type:
type counts []uint64
Therefore, we can remove the cache reset and just accumulate hits in the buckets.
So, we propose removing the variable 'isResetNeeded' and the method 'resetCache()'. The method 'Reset()' will remain empty.
What do you think about it?
yes, that works, although you should change the counts to be int64 instead of uint64, the reason being that influxdb won't accept numbers over int64 (I believe prometheus may accept numbers over this, but you will begin to get odd rounding behavior)
Ok, we will change from uint64 to int64.
Done
EDIT: see below. You should also add a "Histogram" type to the telegraf metric definition here: https://github.com/influxdata/telegraf/blob/master/metric.go, and then add an "AddHistogram" function here: https://github.com/influxdata/telegraf/blob/master/agent/accumulator.go. This is all so that the prometheus_client exporter will expose these metrics as proper "histogram" metrics rather than untyped. You'll also need to update the prometheus exporter here: https://github.com/influxdata/telegraf/blob/master/plugins/outputs/prometheus_client/prometheus_client.go#L131
EDIT: on the above comment, this might be quite a bit more complicated than I realized, but I'd also like to hear from you guys, who are presumably using these histograms with Prometheus. Do you run into problems with these metrics being "Untyped" rather than "Histogram" types? How do you deal with that and how does it change the workflow?
@sparrc it doesn't make sense to add a histogram type in telegraf: in the end it's just a counter with an `le` tag. Also, we think that sending a histogram type will confuse others, because they will think that histograms make sense in outputs. In addition, we've tried to implement the aggregator without changes to the telegraf core.
As for us, we prefer histograms instead of precalculated quantiles in most cases.
We have no problems with the "Untyped" type. Prometheus doesn't use these type comments at all, as I already mentioned.
We already use our fork with histograms, and it works well. We've built histograms in Grafana based on the collected metrics. For example, we made graphs for CPU usage on a cluster. There are no changes in the workflow.
@sparrc we added all the changes from the code review. Please check it when you have free time. Thanks.
Hi, @sparrc!
Hello @sparrc, we updated the CHANGELOG.md file.
Looks great, just a few things to make it fit style-wise. There is also a bunch of grammar stuff in the README, but we can fix that up in a second pass. Also, I'd like to get a review from @goller to make sure the schema plays nice with Chronograf.
#### Explanation
The field `metric_fields` is the list of metric parameters. For example, the metric `cpu` has the following parameters: |
To avoid confusion we should use Telegraf terminology ("field" instead of "parameter"). There are several other places this occurs.
// sampleConfig is config sample of histogram aggregation plugin
var sampleConfig = `
# # Configuration for aggregate histogram metrics
# [[aggregators.histogram]]
Don't comment `sampleConfig`; also remove this section header and the comment above it, as it gets included outside of the sample config. If you generate the config you will see what I mean.
// Add adds new hit to the buckets
func (h *HistogramAggregator) Add(in telegraf.Metric) {
	id := in.HashID()
	agr, ok := h.cache[id]
`h.cache` will have an item for every series. To avoid this becoming large, users should be directed in the docs to exclude any metrics not being aggregated. I think we should also check the measurement name first and return early based on the configuration; that way it should remain fast even if filters are not set up correctly.
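A minimal sketch of the suggested early return, assuming a hypothetical `isConfigured` helper over the plugin's config (the real field and method names may differ):

```go
// Add adds a new hit to the buckets, skipping measurements that are not
// configured so that h.cache never stores series we will not aggregate.
func (h *HistogramAggregator) Add(in telegraf.Metric) {
	if !h.isConfigured(in.Name()) { // hypothetical: looks up metric_name entries from the config
		return
	}
	id := in.HashID()
	h.addToCache(id, in) // hypothetical: the existing per-series aggregation logic
}
```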
for index, bucket := range buckets {
	count += counts[index]
	addFields(acc, aggregate, field, strconv.FormatFloat(bucket, 'f', 1, 64), count)
If the buckets are less than 0.1 apart we won't be able to emit them accurately. Perhaps we need a precision option?
It seems to me that having such an option would be difficult to understand. Maybe we just increase the precision, for example to 3? Then we note this limitation in the docs.
I agree it would be confusing. What do you think about specifying the buckets as a list of strings? A little more annoying to type in, but then we could report them exactly as input.
But in any case, we would need to convert the strings to float64. What if we determine the precision dynamically? I mean, we take the bucket and determine the number of digits after the decimal point. It would look something like this:
strconv.FormatFloat(bucket, 'f', determinePrecision(bucket), 64)
If we do the list-of-strings idea, you will need to keep both the string and float64 versions of the array around, and it would complicate the sorting of buckets. So, definitely some drawbacks there.
The `determinePrecision` idea is fine from a usage point of view, so long as it works.
Ok, I will try to implement the `determinePrecision` function. I will add changes soon.
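A possible sketch of the `determinePrecision` helper discussed above, deriving the precision from the shortest exact formatting of the bucket; this exact implementation is an assumption, not the code that was eventually merged:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// determinePrecision returns the number of digits after the decimal point
// in the shortest round-trip representation of bucket, with a floor of 1
// so that e.g. 10 is still rendered as "10.0".
func determinePrecision(bucket float64) int {
	s := strconv.FormatFloat(bucket, 'f', -1, 64)
	if i := strings.IndexByte(s, '.'); i >= 0 {
		return len(s) - i - 1
	}
	return 1
}

func main() {
	for _, b := range []float64{0.0, 15.6, 34.55, 100.0} {
		fmt.Println(strconv.FormatFloat(b, 'f', determinePrecision(b), 64))
	}
	// Output: 0.0, 15.6, 34.55, 100.0 (each bucket reported exactly as configured)
}
```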
}

// checkAndGetBuckets checks the order of buckets and returns them.
func (h *HistogramAggregator) getBuckets(metric string, field string) []float64 {
Rename function or update the comment.
for key, val := range agr.tags {
	tags[key] = val
}
tags[bucketTag] = bucketTagVal
Why do we need to make a copy here?
I can't remember what the problem was. But now it works correctly without any copying. I will remove this piece of code.
I just remembered! When we pass the tags map without copying, it is passed by reference. As a result, all metrics get the tag "le" with the value "+Inf". This was discovered by the tests.
The debug output of the tags map without copying:
map[tag_name:tag_value le:+Inf] map[tag_name:tag_value le:+Inf] map[tag_name:tag_value le:+Inf] map[le:+Inf tag_name:tag_value] map[tag_name:tag_value le:+Inf] map[tag_name:tag_value le:+Inf]
With copying:
map[tag_name:tag_value le:0.000] map[tag_name:tag_value le:10.000] map[tag_name:tag_value le:20.000] map[tag_name:tag_value le:30.000] map[tag_name:tag_value le:40.000] map[tag_name:tag_value le:+Inf]
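A minimal Go sketch of the aliasing behaviour described above: maps are reference types, so without a copy every emitted metric ends up seeing the last value written to the shared "le" key.

```go
package main

import "fmt"

func main() {
	tags := map[string]string{"tag_name": "tag_value"}

	// Without copying: both names refer to the same underlying map, so
	// the final write to "le" is visible through every reference.
	alias := tags
	alias["le"] = "10.000"
	tags["le"] = "+Inf"
	fmt.Println(alias["le"]) // "+Inf": the earlier value was overwritten

	// With copying (as in the plugin's fix): each bucket gets its own map.
	copied := make(map[string]string, len(tags))
	for k, v := range tags {
		copied[k] = v
	}
	copied["le"] = "10.000"
	fmt.Println(tags["le"], copied["le"]) // +Inf 10.000
}
```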
count := int64(0)

for index, bucket := range buckets {
	count += counts[index]
This creates a cumulative histogram. We should mention this somewhere in the docs.
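To illustrate the cumulative behaviour, here is a small Go sketch under the assumption that `counts[i]` holds only the hits that fell into bucket `i`; the running total then matches the Prometheus `le` convention where each bucket also counts everything below it:

```go
package main

import "fmt"

func main() {
	buckets := []float64{0.0, 10.0, 20.0}
	counts := []int64{0, 1, 1} // per-bucket hits for the observations 5 and 15

	count := int64(0)
	for index, bucket := range buckets {
		count += counts[index] // running total: le=20 also includes the hit counted under le=10
		fmt.Printf("le=%.1f -> %d\n", bucket, count)
	}
	// Output:
	// le=0.0 -> 0
	// le=10.0 -> 1
	// le=20.0 -> 2
}
```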
// Push returns histogram values for metrics
func (h *HistogramAggregator) Push(acc telegraf.Accumulator) {
	for _, aggregate := range h.cache {
		for field, counts := range aggregate.histogramCollection {
We should push one metric per series.
So instead of this:
cpu,cpu=cpu1,host=localhost,le=0.0 usage_idle_bucket=0i 1486998330000000000
cpu,cpu=cpu1,host=localhost,le=0.0 usage_system_bucket=0i 1486998330000000000
There should just be one metric:
cpu,cpu=cpu1,host=localhost,le=0.0 usage_idle_bucket=0i,usage_system_bucket=0i 1486998330000000000
But the number of hits per bucket will be different for the fields "usage_system" and "usage_idle".
Not sure I understand, the two lines are equivalent and each field could have different values:
cpu,cpu=cpu1,host=localhost,le=0.0 usage_idle_bucket=5i,usage_system_bucket=42i 1486998330000000000
When the tagset is different then it should be separate:
cpu,cpu=cpu1,host=localhost,le=0.0 usage_idle_bucket=5i,usage_system_bucket=42i 1486998330000000000
cpu,cpu=cpu1,host=localhost,le=5.0 usage_idle_bucket=3i,usage_system_bucket=5i 1486998330000000000
But for this we need to perform an additional step: grouping the fields by bucket. I think it is overhead. Why not leave it as it is?
The reasons for it are that it would reduce the size of the output, and also that it would match the rest of the metric creation in Telegraf, so it will work as expected with regard to the metric buffers.
Ok, thanks. I will make these changes soon.
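A rough sketch of the requested grouping, with hypothetical shapes standing in for the plugin's real types: all bucket counts that share a tagset are collected into one fields map and pushed with a single `AddFields` call.

```go
// pushSeries emits one metric per series instead of one metric per field,
// so every field for a given tagset lands on the same line.
// (hypothetical helper; the plugin's actual structures differ)
func pushSeries(acc telegraf.Accumulator, name string, tags map[string]string, bucketCounts map[string]int64) {
	fields := make(map[string]interface{}, len(bucketCounts))
	for field, count := range bucketCounts {
		fields[field+"_bucket"] = count
	}
	// One AddFields call produces one line, e.g.:
	// cpu,cpu=cpu1,host=localhost,le=0.0 usage_idle_bucket=5i,usage_system_bucket=42i
	acc.AddFields(name, fields, tags)
}
```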
to the value of this tag. For example, let's assume that we have the metric value 10 and the following buckets:
[5, 10, 30, 70, 100]. Then the tag `le` will have the value 10, because the metric value falls into the bucket with
right border value `10`.
We should add some InfluxQL sample queries.
You want us to add some InfluxQL queries, right? Do you suggest doing this for better understanding? If so, could you point us to a plugin or aggregator where we can find example queries?
Actually, don't worry, I'll add them in later on.
#### Description
The histogram aggregator plugin aggregates values of specified metric`s parameters. The metric is emitted every
Use `metric's`. Also, could you wrap this file at 78 chars?
This would be amazing, thanks for putting in the effort to get it started!
@danielnelson I added the necessary changes. Please check them. Also, we need to settle the question about the bucket precision. I am awaiting your response.
Also, regarding the failed checks: as I understand it, the problem is in the environment running the checks. The error:
Can anybody restart the checks again after a while?
@sebito91 thank you)
Rebasing will fix the zookeeper issue.
b198963 to 5a5fd6a
The refactoring of the "Add" method. The "metric" attribute of the structure "metricHistogramCollection" was renamed to "name". The value of the var "sampleConfig" was changed. Some changes were added to the README file.
Refactoring of the method getBuckets(). Using the sortBuckets method to sort buckets, if needed.
Now the precision is fully used.
70250bf to 36f356c
@vlamug Sorry, I had forgotten how it worked. I found an issue in the Telegraf startup that could cause metrics collected on startup not to be sent to the aggregator. I made this PR to address it: #3050. With that change applied, I still see one thing I'm not expecting. With this config and output, it appears to me that some metrics are not grouping how we wanted:
[[aggregators.histogram]]
  namepass = ["test"]
  period = "20s"
  drop_original = false
  [[aggregators.histogram.config]]
    buckets = [0.0, 1.0, 1.5, 2.0]
    metric_name = "test"
    metric_fields = ["foo", "bar"]
Here is what I expected the output to be:
Histogram aggregator now groups metric fields by bucket
@danielnelson I've implemented the grouping, which works as described. Could you please check it?
Looks good, I think it's ready! Could you just remove the edits to
@danielnelson thanks for your code review. We removed the edits from
Required for all PRs:
The ability to collect histogram metrics was added.