Intermittent DataSourceErrors via Grafana Alerts - 'monarch::220 Cancelled due to the number of queries...' #885
Comments
Thanks @lyanco! Much appreciated! It definitely looks like it's calming down! Before I close this as resolved: is there anything I can/should do in the future should this recur? As a fellow OSS contributor (hence the heavily detailed issue description 😆), this is more of a question of "how do we not give you too much noise to deal with". Is a GitHub issue the best place? Thanks my friend!!
Letting us know here like you just did is the best thing you can do (and could have done)! If any of our internal wackiness leaks out to the public, that's something we need to address ASAP, and GH issues come directly to us. Thanks again!
Hi, we have also been experiencing this problem. Besides the monarch-related errors, we are also receiving the following error in Grafana:
Also, in the frontend instances, we are seeing the following error:
Now, we are using a different region (
Please note we increased the timeout for the Prometheus DS in Grafana, but we are still seeing monarch errors frequently. Given that the timelines somehow align across different regions, could this be caused by a new monarch release? @lyanco, can you let us know how we can obtain the metrics you included here?
The metrics we posted above are internal and not customer-specific. How many queries per day are you running? A very small but non-zero amount of query failures is expected from any system, ours included. On a normal day, we have something like 99.9995% availability, so maybe 1 in 200,000 queries is expected to fail. Our published SLA is a 99.95% monthly uptime percentage, so we're running well ahead of that. FYI, you can get rid of the frontend container now :-). We have a new auth path that is significantly more reliable than the frontend container; see the docs for the datasource syncer here: https://cloud.google.com/stackdriver/docs/managed-prometheus/query#grafana-oauth
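For anyone making that switch, the datasource that the syncer manages can be sketched roughly as below. This is a minimal illustration assuming Grafana's standard datasource provisioning format and the Managed Service for Prometheus query endpoint; MY_PROJECT_ID and the datasource name are placeholders, and in practice the syncer is what keeps the short-lived token refreshed (follow the linked docs for the real setup).

```yaml
apiVersion: 1
datasources:
  - name: Managed Prometheus          # placeholder name
    type: prometheus
    access: proxy
    # Managed Service for Prometheus query endpoint; MY_PROJECT_ID is a placeholder
    url: https://monitoring.googleapis.com/v1/projects/MY_PROJECT_ID/location/global/prometheus
    jsonData:
      httpHeaderName1: Authorization
    secureJsonData:
      # Short-lived OAuth token; the datasource syncer rotates this for you
      httpHeaderValue1: Bearer <token>
```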
I still see an error a couple of times a day, but nothing I can't tolerate now. Chiming in to recommend the datasource syncer, @ovelascoh! Works great, and removing that extra network hop has made my Grafana a much better experience.
Looking back, the error had been present, but at low numbers, which is why it went under the radar. The server degradation event brought the problem to our attention and triggered an investigation. I will give the datasource syncer a try.
We have been using Managed Prometheus and have seen the following issue almost 8-10 times a day for the last month:

[sse.dataQueryError] failed to execute query [Message Size]: resource_exhausted: expanding series: generic::aborted: invalid status monarch::220: Cancelled due to the number of queries whose evaluation is blocked waiting for memory is 500, which is equal to or greater than the limit of 500.

We are not even able to tell which queries are running (we have very few) or which are the slow or frequent ones.
Following up here too since @devansh-hitwicket posted with a similar experience: it's anecdotal, I haven't run the numbers, but we're seeing this more often too, causing errors from our self-managed Grafana. I wonder if the underlying architecture treats non-GCP-based queries (such as from Grafana) as second class and thus drops them.
Hey @devansh-hitwicket @seth-acuitymd - thanks for reporting. It would be helpful to have a little more concrete data around performance. Since you're using Grafana, you should be able to ingest its performance metrics and graph them. If you're using managed collection, consider a PodMonitoring resource like the following:

```yaml
apiVersion: monitoring.googleapis.com/v1
kind: PodMonitoring
metadata:
  name: grafana
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: grafana
  endpoints:
  - port: http
    interval: 30s
```

Then, assuming you've configured your Grafana datasource and given it a name, you can graph, for example (illustrative queries are sketched below):

- Rate of different error codes
- Error rate
- p95 latency by status code

We do not have any notion of "second class" queries. Requests from Grafana are treated like any other 🙂. Hope that helps.
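For illustration, queries along these lines would surface those three signals. The metric and label names (grafana_datasource_request_total, grafana_datasource_request_duration_seconds_bucket, code) come from Grafana's own instrumentation rather than from this thread, so verify them against your Grafana /metrics endpoint before relying on them.

```promql
# Rate of non-2xx datasource responses, broken out by status code
sum by (code) (rate(grafana_datasource_request_total{code!~"2.."}[5m]))

# Overall error rate: failing datasource requests as a fraction of all requests
sum(rate(grafana_datasource_request_total{code!~"2.."}[5m]))
/
sum(rate(grafana_datasource_request_total[5m]))

# p95 datasource request latency by status code
histogram_quantile(0.95,
  sum by (le, code) (rate(grafana_datasource_request_duration_seconds_bucket[5m])))
```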
This issue is almost invariably caused by having a rule evaluator running a ton of really inefficient rules, the most egregious example being looking for a rate over 4 weeks and running that rule every 1 minute. In this case the query takes more than 1 minute to run, causing a backup of queries executed from your project, which then gets throttled by monarch. The solution is to loosen the period of your rules so that they're not running every 1 minute. For a 4-week rate, it should be sufficient to run that query every 1 hour. Once there's no more bottleneck of inefficient long-running queries, this issue should go away.
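If the expensive rule is managed through this operator, a minimal sketch of that change might look like the following, assuming the Rules CRD (kind: Rules); the group name, metric, and recorded series below are placeholders, not names from this thread.

```yaml
apiVersion: monitoring.googleapis.com/v1
kind: Rules
metadata:
  name: long-range-rates   # placeholder
  namespace: monitoring
spec:
  groups:
    - name: long-range-rates
      # Evaluate the expensive 4-week rate hourly instead of every minute
      interval: 1h
      rules:
        - record: job:my_requests:rate4w      # placeholder recorded series
          expr: rate(my_requests_total[4w])   # placeholder metric
```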
Awesome! Thanks @pintohutch and @lyanco!! Much appreciated!!
Hey GCP Prom team!
Background
I've been running GCP Managed Prometheus in 4 of my GKE Autopilot Clusters, and have configured the Data Source Syncer to allow an open source Grafana deployment to auth to GCP Prom.
This is all working swimmingly! The DS Syncer triggers and updates Grafana, I can query metrics and even consume pre-made Grafana Dashboards for some open source services I run.
Issue
I have configured a few alerts via Grafana's native Alertmanager, and as time passes they seem to want to fire off with DataSourceErrors a few times a day. When I expand the error message, I see a couple of different messages:
or
My query for an alert that recently fired this error is
kube_deployment_status_replicas_available{deployment="external-secrets", namespace="external-secrets"}
(this metric is from GCP's implementation of kube-state-metrics), and the alert condition is
The alert is evaluated every 2 minutes and will sit pending for 6 minutes before firing.
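For reference, expressed as a plain Prometheus-style alerting rule, that setup would look roughly like the sketch below; the `< 1` threshold is only a placeholder since the actual condition isn't shown here, while the 2m/6m timings mirror the settings just described.

```yaml
groups:
  - name: external-secrets-alerts
    interval: 2m                     # matches the 2-minute evaluation interval
    rules:
      - alert: ExternalSecretsReplicasUnavailable
        # Placeholder threshold; the real condition lives in the Grafana alert rule
        expr: kube_deployment_status_replicas_available{deployment="external-secrets", namespace="external-secrets"} < 1
        for: 6m                      # matches the 6-minute pending period
```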
Questions
Is this just Monarch under the hood being stressed?
Is it the eval window? These are low-priority alerts, so they don't need to be polled every second.
Would adding a Grafana replica help? (I assume "no" if this is a Monarch issue)
Thanks y'all!