query: auto-downsampling causes inaccurate output of metrics (inflated values) #922
Comments
Hi, hmm, can you try this in the Thanos query UI directly? There is a select box where you can choose exactly what level of downsampling to display. So try explicitly raw data and the 5m resolution, possibly even the 1h.
@FUSAKLA the same thing happens if I specify 'max 5m resolution' in the Thanos query UI. The shape of the graph looks largely the same, except that it's 10x the value. The 1h resolution results in a pretty broken graph, it seems.
Thanks for the report, I have seen a similar problem on our prod as well, but fortunately we can always fall back to raw data. I am pretty sure this path is not well tested, so some bug in choosing blocks might be happening. High priority IMO.
Essentially there are a couple of things we need to take a closer look at:
cc @SuperQ as I think I misled you on Slack. Max 5m, Max 1h should fall back to higher resolutions, but it seems not to work. cc @mjd95
Thanks @bwplotka. I'll have a look into this - will first add the "Only 1h" and "Only 5m" features in the UI; agree this would be helpful for debugging.
@ottoyiu please compare the result:
@ivmaks the graph looks identical with or without the
So we found and fixed an issue in the algorithm that chooses which blocks to use: #1146. So if you have some blocks compacted and some not compacted, it could incorrectly just give 0 results. This will fix:
But the still-incorrect result is a different story. I am fairly sure the reason is that you use
@bwplotka awesome to see #1146, will definitely try out v0.5.0-rc.0! Thank you for all the hard work, and to everyone involved. I'm applying an instant sum because we have 3 load balancers and we want to see, at a given time, what the aggregate bytes per second is per service. I'll try to replicate this bug in v0.5.0 and see if something was changed that fixed it.

Edit: I can still replicate it on v0.5.0-rc.0, with and without the multiplier (times 8). I tried using sum_over_time instead (I don't know how to use sum_over_time and still group by the two fields, so maybe this is not what you're looking for):
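For reference, a minimal sketch of the two query shapes being discussed here — the metric and label names (`service`, `instance`) are assumptions for illustration, not taken from the dashboards above:

```
# Hypothetical metric and label names, used purely for illustration.
# Instant aggregation across the load balancers: bits per second, per service
# (the "* 8" converts bytes to bits, like the multiplier mentioned above):
sum by (service) (rate(node_network_transmit_bytes_total[5m]) * 8)

# sum_over_time sums the samples of each individual series over a window;
# grouping by labels still works by wrapping it in an outer sum:
sum by (service, instance) (sum_over_time(some_gauge_metric[5m]))
```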
We are also experiencing this bug. Our Prometheus instances are all set to scrape every 30 seconds. We noticed that with Max 5 Minute downsampling, we get a 10x increase in the value. With Max 1 Hour, we get a 12x increase over the 5 Minute downsampling, and ~120x over the raw data. All of these values can be explained by the number of scrapes in the downsample window: we have 10 raw samples in a 5 minute window and 120 samples in a 1 hour window. Is the store trying to "smear" the values across the 5m/1h time frame to fill in the gaps, causing aggregations like sum to output the wrong result?
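For what it's worth, the reported factors line up exactly with the sample counts per downsampling window, which would be consistent with the "smearing" suspicion above — a worked example with made-up values, not a confirmed diagnosis:

```
# Scrape interval 30s, constant raw value v:
#   5m window -> 5m / 30s = 10 raw samples  -> stored aggregates: sum = 10*v, count = 10, avg = v
#   1h window -> 1h / 30s = 120 raw samples -> sum = 120*v (12x the 5m sum, 120x the raw value)
# If the querier served the downsampled "sum" aggregate where the raw value
# (or the "avg" aggregate) was expected, results would be inflated by exactly
# these factors.
```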
Same thing here, expression
Guys, is it reproducible with
@bwplotka
We fixed a nasty bug (: thanks to this: #1146. Thanks to everyone involved.
@bwplotka, @ryansheppard's graphs are from 0.5.0-rc.0. We can try the v0.5.0 release today, but I don't see anything besides documentation changes.
I have the same issue with v0.5.0
I think there are many things here. One was the bug with choosing the downsampled resolution: that one works. An additional one is with particular non
I don't think there's any bug in this, but there might be some confusion caused by the way Grafana and the Prometheus console work, that is, the graphs are sampled: if samples arrive more often than the drawing interval (the query step), not every sample ends up being drawn.

In Thanos, down-sampled series are actually aggregated values over the interval. When you request a value from one such series, you get either an average over the down-sampling interval, or some aggregated value if there's some function involved (min/max/sum/count). IMHO this sheds some light on what the querier returns. This explains why

However, this kind of "bucketing" combined with the sampling in Grafana can yield surprising results, so care should be taken when building the queries.
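As a concrete illustration of the above, with made-up numbers:

```
# Raw samples, one per minute, in a single 5m window: 10, 20, 30, 40, 50
# The corresponding downsampled 5m point carries aggregates, not the samples:
#   count = 5, sum = 150, min = 10, max = 50, avg = sum / count = 30
# A plain read of the downsampled series returns the average (30), which equals
# none of the raw samples -- the intra-window resolution is simply gone.
```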
@vladvasiliu I have the same issue in the Thanos Query UI, not just in Grafana.
That's because the Thanos Query UI works the same. See below for examples. It's a pretty generic tool that draws values for whatever time series you throw at it. It doesn't always make sense to group the values when "zooming out", for example if you have a time series for some status, like

In my opinion there should be broader documentation in Thanos about how this works and how it interacts with graphing tools. I think the most surprising things happen when the graphing tool has a resolution in between downsampling intervals, say 20 minutes in the case of Thanos. If you sum your values, you'll get a partial sum for that period, which is weird to me.

The way to look at this is that downsampling loses resolution. Instead of five values, one every 5 minutes, you only get one, which isn't equal to any of them. (You actually get several: min, max, sum, count - see #813 - which allows retaining some idea of what the data distribution was.) I think what's a bit confusing is that asking for just one sample gives an average, so the value isn't clearly wrong when compared to raw data (but it should be noted that if the raw data is somewhat random, they don't match!).

The graphs below have samples scraped every minute. This is the Thanos Query UI v0.5.0.

Sampling and missing data: same series, different "zoom". The second is over the last two days. Notice the maximum barely hits 30. There's information missing (focus on the graph that's present; the series was only created yesterday).

Same series, 1 scrape per minute in raw data, downsamples are to 5 minutes. See how the shape of the curve changes with what is displayed.
Up until this point, all values are roughly the same as the original raw data. The next one is the confusing part. Note there's just one series, so if using

The difference is the way those charts are read. The last is read "in the interval between one point and the other, there were this many requests". That's an aggregation. All the others read "at some time between the last two scrapes there were this many requests per unit of time". You'll have to check the series to know what that unit is, and it's always the same. If the range is 5 minutes, you probably don't care. If it's a day, and you only had 2000 requests, it makes a big difference to know whether those were for the whole day or just during one second.
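In PromQL terms, the two readings described above roughly correspond to the following (the metric name is hypothetical):

```
# "at this point, there were N requests per second" -- a per-unit-of-time reading:
rate(http_requests_total[5m])

# "in this interval, there were N requests" -- an aggregation over the interval:
increase(http_requests_total[5m])
```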
So yeah, you've demonstrated Spike Erosion quite well here. Definitely a well-understood side effect when you downsample by averaging, or when your graph display toolkit uses weighted averages to dynamically resize the graph. Given that we have min, max, sum, count, and (therefore) average for each downsampled data point, I bet that we are using the

Given that Spike Erosion is usually controlled by controlling the downsampling aggregation function, do we need to expose how to select the min, max, sum, or count when working with downsampled data? Another way to handle Spike Erosion is by using and aggregating histograms to build a quantile estimation. That too is going to require
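As a side note, the usual query-level workaround for Spike Erosion is to graph the maximum over the display window instead of the average — a sketch with a hypothetical metric name, separate from the question of exposing Thanos' internal downsampling aggregates:

```
# Highest 5m rate seen within each hour (subqueries require Prometheus >= 2.7):
max_over_time(rate(http_requests_total[5m])[1h:5m])
```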
I believe this has been fixed in
Just rolled out
Thanos, Prometheus and Golang version used
and
I was able to replicate it on both versions.
Prometheus 2.7.1
What happened
When `--query.auto-downsampling` is enabled on the query component, metrics beyond two days get ballooned to multiples of the actual result. In this case, we've seen the metric values go 10x.

PromQL:
Auto-downsampling enabled (grafana v5.3.4):
![Screenshot-2019-3-14 Grafana - (k8s) BalanceD Service At A Glance(1)](https://user-images.githubusercontent.com/2016437/54339527-53b0ab00-45f2-11e9-9070-1d5cb2ca7d34.png)
Auto-downsampling disabled (grafana v5.3.4) - these metrics are accurate:
![Screenshot-2019-3-14 Grafana - (k8s) BalanceD Service At A Glance](https://user-images.githubusercontent.com/2016437/54339322-bbb2c180-45f1-11e9-869c-f77581363a17.png)
Another one with auto-downsampling enabled (grafana v6.0.1):
![Screenshot-2019-3-14 New dashboard - Grafana](https://user-images.githubusercontent.com/2016437/54340028-b22a5900-45f3-11e9-8c38-fb7427c0f586.png)
What you expected to happen
Metrics to be accurate regardless of whether auto-downsampling is enabled or not.
How to reproduce it (as minimally and precisely as possible):
on compactor
Disable auto-downsampling, observe any metrics with 30-day windows in Grafana. Metrics are accurate.
Full logs to relevant components
`thanos bucket inspect` output

Anything else we need to know
Using Grafana v5.3.4 and v6.0.1. Could this be a Grafana bug?