-
Notifications
You must be signed in to change notification settings - Fork 8.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Lens][Visualize] Enables histogram for histogram fields #131379
base: main
Are you sure you want to change the base?
Conversation
Pinging @elastic/kibana-vis-editors @elastic/kibana-vis-editors-external (Team:VisEditors) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
app services code LGTM
It's a tough one from UX perspective. Imagine you have a chart like this (using the example data you mentioned): and then you can add intervals operation that will immediately throw error. I feel like we should either block the intervals in this case, or convert Unique count of my text to Count of records. What do you think @ghudgins @flash1293? Maybe we should discuss it during weekly? |
I get your point Marta but changing to count seems weird to me. Disabling it for non count can be better but what happens if the user configured first the bucket and then the metric? I was thinking that it is how histogram works on histogram fields and it makes sense to me but I would be happy to discuss it all together. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @stratoula I don't want to be the usual party pooper, but I have some doubts.
The ES docs about histogram says:
When using a histogram as part of an aggregation, the accuracy of the results will depend on how the histogram was constructed. It is important to consider the percentiles aggregation mode that will be used to build it. Some possibilities include:
For the T-Digest mode, the values array represents the mean centroid positions and the counts array represents the number of values that are attributed to each centroid. If the algorithm has already started to approximate the percentiles, this inaccuracy is carried over in the histogram.
For the High Dynamic Range (HDR) histogram mode, the values array represents fixed upper limits of each bucket interval, and the counts array represents the number of values that are attributed to each interval. This implementation maintains a fixed worse-case percentage error (specified as a number of significant digits), therefore the value used when generating the histogram would be the maximum accuracy you can achieve at aggregation time.
The histogram field is "algorithm agnostic" and does not store data specific to either T-Digest or HDRHistogram. While this means the field can technically be aggregated with either algorithm, in practice the user should chose one algorithm and index data in that manner (e.g. centroids for T-Digest or intervals for HDRHistogram) to ensure best accuracy.
This to me means that the values
array can represent buckets in different ways than the way we are using buckets today (the bucket key represents the beginning of the bucket).
Here, the ES description clearly defines that the value array can represent not only the initial bucket, but a mean centroid, or the upper limit of each interval.
The way the histogram field is built is essential to represent it correctly. By default we/elastic-charts consider the data passed to it as a set of histogram bins where the value represent the beginning of the bin, but we don't automatically (and I think this is not possible anyway) determine what method was used to build such histogram and we can fall into untruthful representation of data.
Your example here is the proof. I think it is using an histogram of this like:
[0.1, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.5]
An histogram divides the range into intervals without leaving gaps between them. But the way it is constructired is fundamental. With a simple set of value like that we can describe the following histograms:
- each value represent the beginning of an equal width bin
[0.1 - 0.15], [0.15 - 0.2] (missing), [0.2- 0.25]
- each value represent the beginning of unequal width bins
[0.1 -0.2],[0.2 - 0.5]
- each value represent the centroid (based on some data)
[0 - 0.18],[0.18 - 0.22],[0.22 - 0.29]
note the first element can start at zero for example - each value represent the end of equal unequal width bins
[0 - 0.1], [0.1 - 0.2] [0.2 - 0.25]
note the first element can start at zero for example
These are just examples but each one shows a different representation and is interpreted differently. We can make this works for 1 case (each value represents the beginning of the bin) but I don't think this covers the various use cases, and in particular, this doesn't cover the two use cases specified in the docs.
If, instead, we are applying our histogram aggregation on top of histogram fields, your work is fine, but the ES implementation logic blows my mind because it looks like not even considering what was defined in the ES documentation itself.
… by bucket is given
@markov00 I agree that the user has to know what the numbers in the histogram actually mean to be able to use it in a meaningful way in an aggregation. However, this is a true statement for all kinds of data and not special to the histogram field type (although I agree it's extra weird). The docs you cited just specify that - what a histogram actually means is up to how the data was indexed which is fully controlled by the user and Elasticsearch can't store meta information about it because it doesn't know. It calls out a special case about percentiles (right now we don't support hdr percentiles in Lens at all - a known limitation ), but that's kind of separate to the histogram aggregation as there is no way I know off to have the histogram aggregation interpret the histogram field in a "centroid way" and attribute the counts to multiple buckets if they are closer to the edge or something like this - seems like Elasticsearch itself is lacking this functionality and you probably shouldn't use the histogram agg on histogram fields indexed this way (or be aware of inaccuracies). The mental model (and prime use case from my understanding) is indeed the first case you are calling out - it's also the example made in the docs for this situation: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-histogram-aggregation.html#search-aggregations-bucket-histogram-aggregation-histogram-fields All of that being said, it's probably worth adding an in-product documentation popover for this, as it's definitely something a user can misinterpret if they are not aware of what their data means - however IMHO it's not a blocker for adding this functionality because it's inherent to the level of meta information about the data structure Lens is working with. To make a current example - it's possible to use the "Counter rate" function in a misleading way because the time series has to be broken down by all dimension fields. We don't know the dimension fields so we can't guide the user with this. If a dimension field is missed, the chart will still show something, but it won't be correct. In the same way it would be meaningless to do a percentile calculation on a counter metric field, but we can't prevent or guide the user doing this at the moment. This case seems similar to me. |
@flash1293 that example to me is completely wrong: the histogram aggregation on the histogram field is not even considering the "histogram" nature of the field, but it is taking each single
The fact that the nature of the bin (a range between two values) is completely removed from the calculation, is probably one of the most misleading features we can have in ES... happy to discuss in a separate session, but to me ES is doing it wrong and we are just releasing a feature of something that works in the wrong way. |
@markov00 Sorry, I should have clarified - the way I interpreted this example is that the buckets of the histogram field are 1ms each, but you are right that it's not called out explicitly.
I definitely understand your reservations here, let's discuss offline. It's a pretty "raw" feature but it definitely has it's use cases. |
@MichaelMarcialis we decided to proceed with this feature (enable support of histogram on histogram fields) but we would like to add an in-app documentation to explain to the users how intervals work with histogram fields in order to ensure that there will be no confusion. I wonder how this in-app documentation should work. Atm we have the date_histogram documentation I could do the same for the intervals, but do we want this to be rendered always (even if the field is not of histogram type)? Do we want this to appear as long as the user has selected the field? What do you think? |
@stratoula: My first instinct is to say that any in-app documentation we include for the quick functions would be ever-present and not conditional based on the field selected (especially since the field selection can happen after function selection if not using drag-and-drop). Instead, could we include descriptions within the popover that explain what will happen with both histogram type fields as well as other field types? |
Yes @MichaelMarcialis this makes sense, thanx! |
💚 Build SucceededMetrics [docs]Async chunks
Page load bundle
History
To update your PR or re-run it, just comment with: |
We met and decided to hold on this PR. If we can find a more tangible business case for adding this to Lens (someone who needs it more explicitly) then we can re-open this PR. One thing we identified that would help this feature be useful for experts is the ability to set the interval width with a more detailed control instead of the slider UI. This would allow the expert user to match the histogram bucket size in their histogram fields. |
Summary
Closes #112391
Closes #112390
Adds support for histogram and range aggregations on histogram fields.
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-histogram-aggregation.html
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-range-aggregation.html
Lens
Visualize editor
How to test
Create an index with an histogram field. I used the example here. Create an agg-based visualization or a Lens visualization on this dataview.
Important note
From ES documentation:
Checklist