Latency aggregates produce confusing values for percentiles #1989
Comments
cc @mykaul
@amnonh - what's the next step here?
Wait for 5.3 and 2023.1, which should change how latency is calculated, and then decide.
We need a solution for 2022.x, @amnonh. Wouldn't filtering out the "trash scheduling classes" histograms at the Monitoring level give us enough headroom to not have to use these aggregates?
@vladzcloudius let me try to explain. Calculating quantiles from histograms, across multiple histograms and over a long duration, is enough to crash Prometheus; we've seen that in the past. That moved the problem to the recording rules calculation part. In 5.3 and 2023.1 most of those calculations will be moved to Scylla, so I expect a great improvement, but we will need to see it live to be sure.
Right now, if your cluster doesn't have millions of metrics (you can verify that by looking at http://{ip}:9090/tsdb-status) and the recording rules calculation does not take too long (you can verify that by looking at http://{ip}:9090/rules), you can reduce the evaluation_interval to 20s, which would make the graphs look nicer. There are two ways of changing that: the one I think is simpler is to edit prometheus/prometheus.yml.template; alternatively, you can use a command-line option to start-all.sh.
That would, of course, increase the load on the Prometheus server, so if you do it, the server should be monitored to make sure it can handle the load. I would suggest dropping cas and cdc if possible.
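For reference, a minimal sketch of the first option, assuming the standard Prometheus global configuration block; everything else in prometheus/prometheus.yml.template is left as shipped and omitted here:

```yaml
# Sketch only: all other settings are omitted and left as shipped
# by scylla-monitoring.
global:
  evaluation_interval: 20s  # how often the rlatencypXX/wlatencypXX recording rules are re-evaluated
```

The template change takes effect on the next restart of the stack (assuming start-all.sh regenerates prometheus.yml from the template).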
@vladzcloudius Can you verify that the issue is solved with ScyllaDB > 2023.1 and the recording rule calculation interval set to 20s?
@vladzcloudius ping
Why do we need to increase the recording rules calculation interval to 20s with 2023.1, @amnonh?
@vladzcloudius It's decreasing. If it's lower than 20s, it's fine, but before 2023.1 it used to be 1m, and that is the root cause of the issue. Post 2023.1, with a low recording rule calculation interval, I expect the issue will be solved.
So no need to change the evaluation_interval?
You need to change evaluation_interval to 20s, either by editing prometheus/prometheus.yml.template or by using the command-line option to start-all.sh. The next scylla-monitoring release will have it set to 20s by default.
@vladzcloudius ping
@vladzcloudius ping, can you see if things work as expected with 2024.1 and the latest monitoring release?
@vladzcloudius I'm closing this as completed; please re-open if there's still an issue.
Installation details
Panel Name: Latencies
Dashboard Name: Detailed
Scylla-Monitoring Version: 4.3.4
Scylla-Version: 2021.1.12 (this is most likely irrelevant)
Description
rlatencypXX/wlatencypXX aggregates sometimes produce very confusing results due to internal calculation errors.
As a result, the values are sometimes shifted and sometimes dramatically different from what a histogram_quantile function over the raw metric returns.

These are the aggregate values:

[screenshot: aggregate values]

And here are the values calculated by histogram_quantile alongside the values above (notice the shift):

[screenshot: aggregate and histogram_quantile values]

Here is a different example, where the percentile values were significantly lower with the aggregates than with the raw metric (which is the correct value). Notice the few orders of magnitude difference!

[screenshot: aggregates only]

[screenshot: aggregates + raw value]
These "artifacts" make debugging with Monitoring much harder than it needs to be.
It's impossible to rely on the graphs given such large errors.
I suggest we stop using those aggregates and go back to using histogram_quantile directly. The reason we started using those aggregates in the first place was mainly the huge load histogram_quantile was creating while calculating values for the internal scheduling groups. Since we are filtering those out now, there is a good chance we don't need to aggregate on Prometheus anymore.