Latency aggregates produce confusing values for percentiles #1989

Closed
vladzcloudius opened this issue May 24, 2023 · 17 comments · Fixed by #2022

@vladzcloudius
Contributor

vladzcloudius commented May 24, 2023

Installation details
Panel Name: Latencies
Dashboard Name: Detailed
Scylla-Monitoring Version: 4.3.4
Scylla-Version: 2021.1.12 (this is most likely irrelevant)

Description
The rlatencypXX/wlatencypXX aggregates sometimes produce very confusing results due to internal calculation errors.
As a result, the values are sometimes shifted and sometimes dramatically different from what a histogram_quantile function over the raw metric returns:

These are the aggregate values:
[screenshot]

And here are the values calculated by histogram_quantile alongside the values above (notice the shift):

[screenshot]

Here is a different example, where the percentile values were significantly lower with the aggregates than with the raw metric (which gives the correct value). Notice the difference of a few orders of magnitude!

(aggregates only)
[screenshot]

(aggregates + raw value):
[screenshot]

These "artifacts" make the debugging using Monitoring much harder than needed.
It's impossible to rely on graphs given such huge errors.

I suggest to stop using those aggregates and get back to using histogram_quantile directly.

The reason we started using those aggregates was mainly due to a huge load histogram_quantile were creating while calculating values for internal scheduling groups. Since we are filtering those out now there is a good chance we don't need to aggregate on Prometheus anymore.
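For reference, this is roughly what "using histogram_quantile directly" looks like next to the aggregate, as a PromQL sketch (the bucket metric name and the rlatencyp95 rule name below are assumptions for illustration; adjust them to the actual names in your setup):

# Quantile computed from the raw histogram buckets (metric name is an assumption):
histogram_quantile(0.95, sum(rate(scylla_storage_proxy_coordinator_read_latency_bucket[5m])) by (le, instance))

# Precomputed aggregate the dashboards read instead (rule name is illustrative):
rlatencyp95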

@vladzcloudius vladzcloudius added the bug Something isn't working right label May 24, 2023
@vladzcloudius
Contributor Author

cc @mykaul

@vladzcloudius
Contributor Author

cc @tomer-sandler @harel-z @gcarmin

@mykaul
Contributor

mykaul commented Jun 13, 2023

@amnonh - what's the next step here?

@amnonh
Collaborator

amnonh commented Jun 13, 2023

Wait for 5.3 and 2023.1, which should change how latency is calculated, and then decide.

@vladzcloudius
Contributor Author

vladzcloudius commented Jun 13, 2023

Wait for 5.3 and 2023.1, which should change how latency is calculated, and then decide.

We need a solution for 2022.x, @amnonh.
We can't wait for 2023.1.

Wouldn't filtering out the "trash scheduling classes" histograms at the Monitoring level give us enough headroom to avoid having to use these aggregates?

@amnonh
Collaborator

amnonh commented Jun 13, 2023

@vladzcloudius let me try to explain. Calculating quantiles from histograms, over multiple histograms and a long duration, is enough to crash Prometheus. We've seen that in the past.
To overcome this, we use recording rules to calculate smaller chunks incrementally.

That moved the problem to the recording-rules calculation part.
For 2022.x we address it in two ways: one, reduce the recording-rules calculation interval; and two, in 4.4.x, drop the internal latency histograms from the calculation.

In 5.3 and 2023.1 most of those calculations will be moved to Scylla, so I expect a great improvement, but we will need to see it live to be sure.
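As a rough sketch of what "calculate smaller chunks incrementally" means here (the rule and metric names below are illustrative, not the actual rules shipped with scylla-monitoring): a recording rule evaluates the quantile over a short window at every evaluation_interval, and the dashboards then read the precomputed series instead of recomputing histogram_quantile over a long range on every refresh.

groups:
  - name: scylla_latency_rules
    rules:
      - record: wlatencyp95    # illustrative rule name
        expr: histogram_quantile(0.95, sum(rate(scylla_storage_proxy_coordinator_write_latency_bucket[1m])) by (le, instance))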

Right now, if your cluster doesn't have millions of metrics (you can verify that by looking at http://{ip}:9090/tsdb-status) and the recording-rules calculation does not take too long (you can verify that by looking at http://{ip}:9090/rules), you can reduce the evaluation_interval to 20s; that will make the graphs look nicer.
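If it is more convenient, the same checks can be scripted against the Prometheus HTTP API (assuming jq is installed; {ip} is the Prometheus host):

# Head-block series count and other TSDB stats:
curl -s http://{ip}:9090/api/v1/status/tsdb | jq '.data.headStats'

# Per-group rule evaluation times:
curl -s http://{ip}:9090/api/v1/rules | jq '.data.groups[] | {name, evaluationTime}'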

There are two ways of changing that. One (I think the simpler) is to edit prometheus/prometheus.yml.template and change evaluation_interval to 20s.

Alternatively, you can pass a command-line option to start-all.sh: --evaluation-interval 20s

That would, of course, increase the load on the Prometheus server, so if you do it, the server should be monitored to make sure it can handle the load.

I would suggest dropping cas and cdc if possible; see the sketch below.
And of course, I'm available to try it together.
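One way to drop those at scrape time is a metric_relabel_configs entry in the Scylla scrape job (a sketch only; the metric-name regex is a placeholder, so check the actual CAS/CDC latency metric names before using it):

scrape_configs:
  - job_name: scylla
    metric_relabel_configs:
      - source_labels: [__name__]
        regex: 'scylla_.*(cas|cdc).*latency.*'
        action: drop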

@vladzcloudius
Contributor Author

@amnonh there must be a typo in your patch or something.
I don't see how #2022 fixes this issue. Reopening.

@vladzcloudius vladzcloudius reopened this Jul 21, 2023
@amnonh
Collaborator

amnonh commented Oct 26, 2023

@vladzcloudius Can you verify that the issue is solved with ScyllaDB > 2023.1 and the recording-rule calculation interval set to 20s?

@amnonh
Collaborator

amnonh commented Feb 8, 2024

@vladzcloudius ping

@amnonh
Collaborator

amnonh commented Feb 20, 2024

@vladzcloudius ping

@vladzcloudius
Contributor Author

@vladzcloudius Can you verify that the issue is solved with ScyllaDB > 2023.1 and the recording-rule calculation interval set to 20s?

Why do we need to increase the recording-rule calculation interval to 20s with 2023.1, @amnonh?

@amnonh
Collaborator

amnonh commented Feb 21, 2024

@vladzcloudius It's a decrease. If it's lower than 20s, it's fine, but before 2023.1 it used to be 1m, and that is the root cause of the issue.

Post 2023.1, with a low recording-rule calculation interval, I expect the issue will be solved.

@vladzcloudius
Contributor Author

@vladzcloudius It's a decrease. If it's lower than 20s, it's fine, but before 2023.1 it used to be 1m, and that is the root cause of the issue.

Post 2023.1, with a low recording-rule calculation interval, I expect the issue will be solved.

So there is no need to change the recording-rule calculation interval with 2023.1, right?

@amnonh
Collaborator

amnonh commented Feb 21, 2024

You need to change evaluation_interval to 20s. One way is to edit prometheus/prometheus.yml.template:

$ head -4 prometheus/prometheus.yml.template
global:
  scrape_interval: 20s # By default, scrape targets every 20 seconds.
  scrape_timeout: 15s # Timeout before trying to scrape a target again
  evaluation_interval: 20s # <--- This used to be 60s; it should be 20s

or use the --evaluation-interval command-line parameter to start-all.sh.

The next scylla-monitoring release will have it set to 20s by default.
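For completeness, one way to make and then verify the change (the sed pattern assumes the template keeps the two-space indentation shown above; jq is assumed to be installed):

sed -i 's/^  evaluation_interval:.*/  evaluation_interval: 20s/' prometheus/prometheus.yml.template

# After restarting the stack, confirm the running Prometheus picked it up:
curl -s http://{ip}:9090/api/v1/status/config | jq -r '.data.yaml' | grep evaluation_interval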

@amnonh
Collaborator

amnonh commented Mar 6, 2024

@vladzcloudius ping

@amnonh
Collaborator

amnonh commented May 12, 2024

@vladzcloudius ping, can you see if things are as expected with 2024.1 and the latest monitoring release?

@amnonh
Collaborator

amnonh commented May 15, 2024

@vladzcloudius I'm closing this as completed; please re-open if there's still an issue.

@amnonh amnonh closed this as completed May 15, 2024