query: auto-downsampling causes inaccurate output of metrics (inflated values) #922
Comments
Hi, hmm, can you try this in the Thanos query UI directly? There is a select box where you can choose exactly what level of downsampling to display. So try explicitly raw data and the 5m resolution, possibly even the 1h.
@FUSAKLA the same thing happens if I specify 'max 5m resolution' in the Thanos query UI. The shape of the graph looks largely the same, except that it's 10x the value. The 1h resolution results in a pretty broken graph, it seems.
Thanks for the report, I have seen a similar problem on our prod as well, but fortunately we can always fall back to raw data. I am pretty sure this path is not well tested, so some bug in choosing blocks might be happening. High priority IMO.
Essentially there are a couple of things we need to take a closer look at:
cc @SuperQ as I think I misled you on Slack. Max 5m, Max 1h should fall back to higher resolutions, but it seems not to work. cc @mjd95
Thanks @bwplotka. I'll have a look into this - will first add the "Only 1h" and "Only 5m" features in the UI; agree this would be helpful for debugging.
@ottoyiu please compare the result:
@ivmaks the graph looks identical with or without the
So we found and fixed an issue in the algorithm that chooses which blocks to use: #1146. So if you have some blocks compacted and some not compacted, it could incorrectly just give 0 results. This will fix:
But the still-incorrect result is a different story. I am fairly sure the reason is that you use
@bwplotka awesome to see #1146, will definitely try out v0.5.0-rc.0! Thank you for all the hard work, and to everyone involved. I'm applying an instant sum because we have 3 load balancers and we want to see, at a given time, what the aggregate bytes per second is per service. I'll try to replicate this bug in v0.5.0 and see if something was changed that fixed it.

Edit: I can still replicate it on v0.5.0-rc.0, with and without the multiplier (times 8). I tried using sum_over_time instead (I don't know how to use sum_over_time and still group by the two fields, so maybe this is not what you're looking for):
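For reference, a minimal sketch of the two query shapes being discussed here — the metric and label names (`service`, `instance`) are assumptions for illustration, not taken from the dashboards above:

```
# Hypothetical metric and label names, used purely for illustration.
# Instant aggregation across the load balancers: bits per second, per service
# (the "* 8" converts bytes to bits, like the multiplier mentioned above):
sum by (service) (rate(node_network_transmit_bytes_total[5m]) * 8)

# sum_over_time sums the samples of each individual series over a window;
# grouping by labels still works by wrapping it in an outer sum:
sum by (service, instance) (sum_over_time(some_gauge_metric[5m]))
```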
We are also experiencing this bug. Our Prometheus instances are all set to scrape every 30 seconds. We noticed that with Max 5 Minute downsampling, we get a 10x increase in the value. With Max 1 Hour, we get a 12x increase over the 5 Minute downsampling, and ~120x over the raw data. All of these values can be explained by the number of scrapes in the downsample window: we have 10 raw samples in a 5 minute window and 120 samples in a 1 hour window. Is the store trying to "smear" the values across the 5m/1h time frame to fill in the gaps, causing aggregations like sum to output the wrong result?
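For what it's worth, the reported factors line up exactly with the sample counts per downsampling window, which would be consistent with the "smearing" suspicion above — a worked example with made-up values, not a confirmed diagnosis:

```
# Scrape interval 30s, constant raw value v:
#   5m window -> 5m / 30s = 10 raw samples  -> stored aggregates: sum = 10*v, count = 10, avg = v
#   1h window -> 1h / 30s = 120 raw samples -> sum = 120*v (12x the 5m sum, 120x the raw value)
# If the querier served the downsampled "sum" aggregate where the raw value
# (or the "avg" aggregate) was expected, results would be inflated by exactly
# these factors.
```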
Same thing here, expression
Guys, is it reproducible with
@bwplotka
We fixed a nasty bug (: thanks to this: #1146. Thanks to everyone involved.
@bwplotka, @ryansheppard's graphs are from 0.5.0-rc.0. We can try the v0.5.0 release today, but I don't see anything besides documentation changes.
I have the same issue with v0.5.0
I think there are many things here. One was the bug with choosing the downsampled resolution: that one works. An additional one is with particular non
I don't think there's any bug in this, but there might be some confusion caused by the way Grafana and the Prometheus console work, that is, the graphs are sampled: if samples arrive more often than the drawing interval (the query step), not every sample ends up being drawn.

In Thanos, down-sampled series are actually aggregated values over the interval. When you request a value from one such series, you get either an average over the down-sampling interval, or some aggregated value if there's some function involved (min/max/sum/count). IMHO this sheds some light on what the querier returns. This explains why

However, this kind of "bucketing" combined with the sampling in Grafana can yield surprising results, so care should be taken when building the queries.
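As a concrete illustration of the above, with made-up numbers:

```
# Raw samples, one per minute, in a single 5m window: 10, 20, 30, 40, 50
# The corresponding downsampled 5m point carries aggregates, not the samples:
#   count = 5, sum = 150, min = 10, max = 50, avg = sum / count = 30
# A plain read of the downsampled series returns the average (30), which equals
# none of the raw samples -- the intra-window resolution is simply gone.
```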
@vladvasiliu I have the same issue in the Thanos Query UI, not just in Grafana.
That's because the Thanos Query UI works the same. See below for examples. It's a pretty generic tool that draws values for whatever time series you throw at it. It doesn't always make sense to group the values when "zooming out", for example if you have a time series for some status, like

In my opinion there should be broader documentation in Thanos about how this works and how it interacts with graphing tools. I think the most surprising things happen when the graphing tool has a resolution in between downsampling intervals, say 20 minutes in the case of Thanos. If you sum your values, you'll get a partial sum for that period, which is weird to me.

The way to look at this is that downsampling loses resolution. Instead of five values, one every 5 minutes, you only get one, which isn't equal to any of them. (You actually get several: min, max, sum, count - see #813 - which allows retaining some idea of what the data distribution was.) I think what's a bit confusing is that asking for just one sample gives an average, so the value isn't clearly wrong when compared to raw data (but it should be noted that if the raw data is somewhat random, they don't match!).

The graphs below have samples scraped every minute. This is the Thanos Query UI v0.5.0.

Sampling and missing data: same series, different "zoom". The second is over the last two days. Notice the maximum barely hits 30. There's information missing (focus on the graph that's present; the series was only created yesterday).

Same series, 1 scrape per minute in raw data, downsamples are to 5 minutes. See how the shape of the curve changes with what is displayed.
Up until this point, all values are roughly the same as the original raw data. The next one is the confusing part. Note there's just one series, so if using

The difference is the way those charts are read. The last is read "in the interval between one point and the other, there were this many requests". That's an aggregation. All the others read "at some time between the last two scrapes there were this many requests per unit of time". You'll have to check the series to know what that unit is, and it's always the same. If the range is 5 minutes, you probably don't care. If it's a day, and you only had 2000 requests, it makes a big difference to know whether those were for the whole day or just during one second.
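In PromQL terms, the two readings described above roughly correspond to the following (the metric name is hypothetical):

```
# "at this point, there were N requests per second" -- a per-unit-of-time reading:
rate(http_requests_total[5m])

# "in this interval, there were N requests" -- an aggregation over the interval:
increase(http_requests_total[5m])
```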
So yeah, you've demonstrated Spike Erosion quite well here. Definitely a well-understood side effect when you downsample by averaging, or when your graph display toolkit uses weighted averages to dynamically resize the graph. Given that we have min, max, sum, count, and (therefore) average for each downsampled data point, I bet that we are using the

Given that Spike Erosion is usually controlled by controlling the downsampling aggregation function, do we need to expose how to select the min, max, sum, or count when working with downsampled data? Another way to handle Spike Erosion is by using and aggregating histograms to build a quantile estimation. That too is going to require
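As a side note, the usual query-level workaround for Spike Erosion is to graph the maximum over the display window instead of the average — a sketch with a hypothetical metric name, separate from the question of exposing Thanos' internal downsampling aggregates:

```
# Highest 5m rate seen within each hour (subqueries require Prometheus >= 2.7):
max_over_time(rate(http_requests_total[5m])[1h:5m])
```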
I believe this has been fixed in
Just rolled out
Thanos, Prometheus and Golang version used
and
I was able to replicate it on both versions.
Prometheus 2.7.1
What happened
When `--query.auto-downsampling` is enabled on the query component, metrics beyond two days get ballooned to multiples of the actual result. In this case, we've seen the metric values go 10x.

PromQL:
Auto-downsampling enabled (grafana v5.3.4):
![Screenshot-2019-3-14 Grafana - (k8s) BalanceD Service At A Glance(1)](https://user-images.githubusercontent.com/2016437/54339527-53b0ab00-45f2-11e9-9070-1d5cb2ca7d34.png)
Auto-downsampling disabled (grafana v5.3.4) - these metrics are accurate:
![Screenshot-2019-3-14 Grafana - (k8s) BalanceD Service At A Glance](https://user-images.githubusercontent.com/2016437/54339322-bbb2c180-45f1-11e9-869c-f77581363a17.png)
Another one with auto-downsampling enabled (grafana v6.0.1):
![Screenshot-2019-3-14 New dashboard - Grafana](https://user-images.githubusercontent.com/2016437/54340028-b22a5900-45f3-11e9-8c38-fb7427c0f586.png)
What you expected to happen
Metrics to be accurate regardless of whether auto-downsampling is enabled or not.
How to reproduce it (as minimally and precisely as possible):
on compactor
Disable auto-downsampling, observe any metrics with 30-day windows in Grafana. Metrics are accurate.
Full logs to relevant components
`thanos bucket inspect` output

Anything else we need to know
Using Grafana v5.3.4 and v6.0.1. Could this be a Grafana bug?