Latency aggregates produce confusing values for percentiles #1989

Closed
vladzcloudius opened this issue May 24, 2023 · 17 comments · Fixed by #2022

@vladzcloudius
Contributor

vladzcloudius commented May 24, 2023

Installation details
Panel Name: Latencies
Dashboard Name: Detailed
Scylla-Monitoring Version: 4.3.4
Scylla-Version: 2021.1.12 (this is most likely irrelevant)

Description
The rlatencypXX/wlatencypXX aggregates sometimes produce very confusing results due to internal calculation errors.
As a result, the values are sometimes shifted and sometimes dramatically different from what a histogram_quantile function over the raw metric returns:

These are the aggregate values:
[screenshot]

And here are the values calculated by histogram_quantile alongside the values above (notice the shift):

[screenshot]

Here is a different example, where the percentile values were significantly lower with the aggregates than with the raw metric (which gives the correct value). Notice the difference of a few orders of magnitude!

(aggregates only)
[screenshot]

(aggregates + raw value):
[screenshot]

These "artifacts" make the debugging using Monitoring much harder than needed.
It's impossible to rely on graphs given such huge errors.

I suggest to stop using those aggregates and get back to using histogram_quantile directly.

The reason we started using those aggregates was mainly due to a huge load histogram_quantile were creating while calculating values for internal scheduling groups. Since we are filtering those out now there is a good chance we don't need to aggregate on Prometheus anymore.
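For reference, this is roughly what "using histogram_quantile directly" looks like next to the aggregate, as a PromQL sketch (the bucket metric name and the rlatencyp95 rule name below are assumptions for illustration; adjust them to the actual names in your setup):

# Quantile computed from the raw histogram buckets (metric name is an assumption):
histogram_quantile(0.95, sum(rate(scylla_storage_proxy_coordinator_read_latency_bucket[5m])) by (le, instance))

# Precomputed aggregate the dashboards read instead (rule name is illustrative):
rlatencyp95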

@vladzcloudius vladzcloudius added the bug Something isn't working right label May 24, 2023
@vladzcloudius
Contributor Author

cc @mykaul

@vladzcloudius
Contributor Author

cc @tomer-sandler @harel-z @gcarmin

@mykaul
Contributor

mykaul commented Jun 13, 2023

@amnonh - what's the next step here?

@amnonh
Collaborator

amnonh commented Jun 13, 2023

Wait for 5.3 and 2023.1, which should change how latency is calculated, and then decide.

@vladzcloudius
Contributor Author

vladzcloudius commented Jun 13, 2023

Wait for 5.3 and 2023.1, which should change how latency is calculated, and then decide.

We need a solution for 2022.x, @amnonh.
We can't wait for 2023.1.

Wouldn't filtering out the "trash scheduling classes" histograms at the Monitoring level give us enough headroom to avoid having to use these aggregates?

@amnonh
Collaborator

amnonh commented Jun 13, 2023

@vladzcloudius let me try to explain. Calculating quantiles from histograms, over multiple histograms and a long duration, is enough to crash Prometheus. We've seen that in the past.
To overcome this, we use recording rules to calculate smaller chunks incrementally.

That moved the problem to the recording-rules calculation part.
For 2022.x we address it in two ways: one, reduce the recording-rules calculation interval; and two, in 4.4.x, drop the internal latency histograms from the calculation.

In 5.3 and 2023.1 most of those calculations will be moved to Scylla, so I expect a great improvement, but we will need to see it live to be sure.
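As a rough sketch of what "calculate smaller chunks incrementally" means here (the rule and metric names below are illustrative, not the actual rules shipped with scylla-monitoring): a recording rule evaluates the quantile over a short window at every evaluation_interval, and the dashboards then read the precomputed series instead of recomputing histogram_quantile over a long range on every refresh.

groups:
  - name: scylla_latency_rules
    rules:
      - record: wlatencyp95    # illustrative rule name
        expr: histogram_quantile(0.95, sum(rate(scylla_storage_proxy_coordinator_write_latency_bucket[1m])) by (le, instance))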

Right now, if your cluster doesn't have millions of metrics (you can verify that by looking at http://{ip}:9090/tsdb-status) and the recording-rules calculation does not take too long (you can verify that by looking at http://{ip}:9090/rules), you can reduce the evaluation_interval to 20s; that will make the graphs look nicer.
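If it is more convenient, the same checks can be scripted against the Prometheus HTTP API (assuming jq is installed; {ip} is the Prometheus host):

# Head-block series count and other TSDB stats:
curl -s http://{ip}:9090/api/v1/status/tsdb | jq '.data.headStats'

# Per-group rule evaluation times:
curl -s http://{ip}:9090/api/v1/rules | jq '.data.groups[] | {name, evaluationTime}'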

There are two ways of changing that. One (I think the simpler) is to edit prometheus/prometheus.yml.template and change evaluation_interval to 20s.

Alternatively, you can pass a command-line option to start-all.sh: --evaluation-interval 20s

That would, of course, increase the load on the Prometheus server, so if you do it, the server should be monitored to make sure it can handle the load.

I would suggest dropping cas and cdc if possible; see the sketch below.
And of course, I'm available to try it together.
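One way to drop those at scrape time is a metric_relabel_configs entry in the Scylla scrape job (a sketch only; the metric-name regex is a placeholder, so check the actual CAS/CDC latency metric names before using it):

scrape_configs:
  - job_name: scylla
    metric_relabel_configs:
      - source_labels: [__name__]
        regex: 'scylla_.*(cas|cdc).*latency.*'
        action: drop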

@vladzcloudius
Contributor Author

@amnonh there must be a typo in your patch or something.
I don't see how #2022 fixes this issue. Reopening.

@vladzcloudius vladzcloudius reopened this Jul 21, 2023
@amnonh
Collaborator

amnonh commented Oct 26, 2023

@vladzcloudius Can you verify that the issue is solved with ScyllaDB > 2023.1 and the recording-rule calculation interval set to 20s?

@amnonh
Collaborator

amnonh commented Feb 8, 2024

@vladzcloudius ping

@amnonh
Collaborator

amnonh commented Feb 20, 2024

@vladzcloudius ping

@vladzcloudius
Contributor Author

@vladzcloudius Can you verify that the issue is solved with ScyllaDB > 2023.1 and the recording-rule calculation interval set to 20s?

Why do we need to increase the recording-rule calculation interval to 20s with 2023.1, @amnonh?

@amnonh
Collaborator

amnonh commented Feb 21, 2024

@vladzcloudius It's a decrease. If it's lower than 20s, it's fine, but before 2023.1 it used to be 1m, and that is the root cause of the issue.

Post 2023.1, with a low recording-rule calculation interval, I expect the issue will be solved.

@vladzcloudius
Contributor Author

@vladzcloudius It's a decrease. If it's lower than 20s, it's fine, but before 2023.1 it used to be 1m, and that is the root cause of the issue.

Post 2023.1, with a low recording-rule calculation interval, I expect the issue will be solved.

So there is no need to change the recording-rule calculation interval with 2023.1, right?

@amnonh
Collaborator

amnonh commented Feb 21, 2024

You need to change evaluation_interval to 20s. One way is to edit prometheus/prometheus.yml.template:

$ head -4 prometheus/prometheus.yml.template
global:
  scrape_interval: 20s # By default, scrape targets every 20 seconds.
  scrape_timeout: 15s # Timeout before trying to scrape a target again
  evaluation_interval: 20s # <--- This used to be 60s; it should be 20s

or use the --evaluation-interval command-line parameter to start-all.sh.

The next scylla-monitoring release will have it set to 20s by default.
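For completeness, one way to make and then verify the change (the sed pattern assumes the template keeps the two-space indentation shown above; jq is assumed to be installed):

sed -i 's/^  evaluation_interval:.*/  evaluation_interval: 20s/' prometheus/prometheus.yml.template

# After restarting the stack, confirm the running Prometheus picked it up:
curl -s http://{ip}:9090/api/v1/status/config | jq -r '.data.yaml' | grep evaluation_interval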

@amnonh
Collaborator

amnonh commented Mar 6, 2024

@vladzcloudius ping

@amnonh
Collaborator

amnonh commented May 12, 2024

@vladzcloudius ping, can you see if things are as expected with 2024.1 and the latest monitoring release?

@amnonh
Collaborator

amnonh commented May 15, 2024

@vladzcloudius I'm closing this as completed; please re-open if there's still an issue.

@amnonh amnonh closed this as completed May 15, 2024