
Intermittent DataSourceErrors via Grafana Alerts - 'monarch::220 Cancelled due to the number of queries...' #885

Closed
seth-acuitymd opened this issue Mar 8, 2024 · 12 comments

@seth-acuitymd

Hey GCP Prom team!

Background

I've been running GCP Managed Prometheus in 4 of my GKE Autopilot Clusters, and have configured the Data Source Syncer to allow an open source Grafana deployment to auth to GCP Prom.

This is all working swimmingly! The DS Syncer triggers and updates Grafana, I can query metrics and even consume pre-made Grafana Dashboards for some open source services I run.

Issue

I have configured a few alerts via Grafana's native Alertmanager, and as time passes they seem to fire off with DataSourceErrors a few times a day.

When I expand the error message, I see a couple different messages:

- Error = [sse.dataQueryError] failed to execute query [available]: internal: expanding series: generic::aborted: 
invalid status monarch::220: Cancelled due to the number of queries whose evaluation is blocked waiting for 
memory is 501, which is equal to or greater than the limit of 500.

or

- Error = [sse.dataQueryError] failed to execute query [ALERTNAME]: internal: expanding series: generic::aborted: 
invalid status monarch::219: Rejecting query because user /UNSPECIFIED:cloud-monitoring-query/UNSPECIFIED:gcm-
api/CONSUMER_RESOURCE_CONTAINER:0 has requested 5787MiB of memory for processing queries on one Monarch 
node, above or equal to the limit 5787MiB allowed for a single user.

My query for an alert that recently fired this error is:

kube_deployment_status_replicas_available{deployment="external-secrets", namespace="external-secrets"}

(this metric comes from GCP's implementation of kube-state-metrics), and the alert condition is:

[screenshot: alert condition]

The alert is evaluated every 2 minutes and sits pending for 6 minutes before firing.
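
For reference, a rough PromQL-only version of that condition (just a sketch; the actual threshold lives in the screenshot above, and "fewer than 1 available replica" is assumed here purely for illustration) would be:

kube_deployment_status_replicas_available{deployment="external-secrets", namespace="external-secrets"} < 1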

Questions

Is this just Monarch under the hood being stressed?

Is it the eval window? These are low priority alerts so they don't need to be polled every second

Would adding a Grafana replica help? (I assume "no" if this is a Monarch issue)

Thanks y'all!

@lyanco
Collaborator

lyanco commented Mar 11, 2024

Heya Seth,

Thank you so much for the very detailed and well written bug report!

Indeed, there was a Monarch query server degradation in us-east1 that was affecting a small but noticeable number of queries, especially for users on the East coast. As of 6am PT last Friday the issue was mostly resolved and the error rate looks to have dropped off considerably:

[screenshot: internal error-rate graph]

Let us know if you are still seeing this today; we now know the root cause so we can address this more quickly if it arises again.

@seth-acuitymd
Author

seth-acuitymd commented Mar 11, 2024

Thanks @lyanco! Much appreciated!

It definitely looks like it's calming down!
March 7th - 29 occurrences
March 8th - 9 occurrences
March 9th - 9 occurrences
March 10th (yesterday) - 7 occurrences
March 11th (today) - 2 occurrences

Before I close this as resolved: is there anything I can/should do in the future should this recur? As a fellow OSS contributor (hence the heavily detailed issue description 😆), this is more a question of "how do we avoid giving you too much noise to deal with"; is a GitHub issue the best place?

Thanks my friend!!

@lyanco
Collaborator

lyanco commented Mar 11, 2024

Letting us know here like you just did is the best thing you can do (and could have done)! If any of our internal wackiness leaks out to the public, that's something we need to address ASAP, and GH issues come directly to us.

Thanks again!

@ovelascoh

ovelascoh commented Mar 19, 2024

Hi,

We have also been experiencing this problem. Besides the Monarch-related errors, we are also receiving the following error in Grafana:

Error: [sse.dataQueryError] failed to execute query [A]: Post "http://{{}}/api/v1/query_range": context deadline exceeded (Client.Timeout exceeded while awaiting headers)

Also, in the frontend instances, we are seeing the following error:

level=warn ts=2024-03-06T01:34:04.599622614Z caller=main.go:205 msg="requesting GCM failed" err="Post \"https://monitoring.googleapis.com/v1/projects/{{}}/location/global/prometheus/api/v1/query_range\": context canceled"

Now, we are using a different region (europe-west1), but we also see a similar pattern (data obtained from the frontend logs):

March 05th - 300 occurrences
March 06th - 206  occurrences
March 11th - 3 occurrences
March 12th - 17 occurrences
March 13th - 6 occurrences  <- timeout increased
March 19th - 4 occurrences

Please note we increased the timeout for the Prometheus DS in Grafana, but we are still seeing monarch errors frequently.
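
For reference, that timeout bump was made in the Prometheus data source settings; a provisioning-file sketch (the data source name, frontend URL, and 120-second value are all illustrative; the jsonData timeout field is Grafana's generic HTTP request timeout in seconds):

apiVersion: 1
datasources:
  - name: gmp-frontend                          # illustrative name
    type: prometheus
    access: proxy
    url: http://frontend.monitoring.svc:9090    # in-cluster GMP frontend; adjust namespace/port to your setup
    jsonData:
      timeout: 120                              # HTTP request timeout in seconds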

Given that the timelines somehow align across different regions, could this be caused by a new Monarch release?

@lyanco, can you let us know how we can obtain the metrics you included here?

@lyanco
Collaborator

lyanco commented Mar 19, 2024

The metrics we posted above are internal and not customer-specific. How many queries per day are you running? A very small but non-zero number of query failures is expected from any system, ours included. On a normal day we have something like 99.9995% availability, so roughly 1 in 200,000 queries is expected to fail. Our published SLA is a 99.95% monthly uptime percentage, so we're running well ahead of that.

FYI you can get rid of the frontend container now :-). We have a new auth path that is significantly more reliable than the frontend container, see the docs for the datasource syncer here: https://cloud.google.com/stackdriver/docs/managed-prometheus/query#grafana-oauth

@seth-acuitymd
Author

I still see an error a couple of times a day, but nothing I can't tolerate now. Chiming in to recommend the datasource syncer, @ovelascoh! It works great, and removing that extra network hop has made my Grafana experience much better.

@ovelascoh

Looking back, the error had been present, but with low numbers, which is why it went under the radar. The server degradation event brought the problem to our attention and triggered an investigation. I will give the datasource syncer a try; thanks for the feedback @lyanco, @seth-acuitymd!

@devansh-hitwicket

devansh-hitwicket commented Oct 3, 2024

We have been using Managed Prometheus and we see the following issue almost 8-10 times a day for the last month:

[sse.dataQueryError] failed to execute query [Message Size]: resource_exhausted: expanding series: generic::aborted: invalid status monarch::220: Cancelled due to the number of queries whose evaluation is blocked waiting for memory is 500, which is equal to or greater than the limit of 500.

We are not even able to tell which queries are running (we have very few), or which ones are slow or frequent.

@seth-acuitymd
Author

Following up here too since @devansh-hitwicket posted with a similar experience:

It's anecdotal (I haven't run the numbers), but we're seeing this more often too, causing errors from our self-managed Grafana. I wonder if the underlying architecture treats non-GCP based queries (such as from Grafana) as second class and thus drops them.

@pintohutch
Collaborator

pintohutch commented Oct 3, 2024

Hey @devansh-hitwicket @seth-acuitymd - thanks for reporting.

It would be helpful to have a little more concrete data around performance. Since you're using Grafana, you should be able to ingest performance metrics and graph them. If you're using managed collection, consider a PodMonitoring like the following to ingest Grafana self-metrics into GMP:

apiVersion: monitoring.googleapis.com/v1
kind: PodMonitoring
metadata:
  name: grafana
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: grafana
  endpoints:
  - port: http
    interval: 30s
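
One note on the above: the port named http assumes the Grafana pod exposes a container port with that name, which can vary by install method; adjust the port, namespace, and labels to match your deployment. Then apply it, e.g.:

kubectl apply -f grafana-podmonitoring.yaml   # hypothetical filename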

Then, assuming you've configured your Grafana data source and named it "gmp", you should be able to use the grafana_datasource_request_duration_seconds histogram to visualize query performance, e.g.

Request rate by status code:

sum by (code) (rate(grafana_datasource_request_duration_seconds_count{datasource="gmp"}[5m]))

Error rate:

sum(rate(grafana_datasource_request_duration_seconds_count{datasource="gmp",code!="200"}[5m]))
/ 
sum(rate(grafana_datasource_request_duration_seconds_count{datasource="gmp"}[5m]))

p95 latency by status code:

histogram_quantile(0.95, sum by(le, code) (rate(grafana_datasource_request_duration_seconds_bucket{datasource="gmp"}[5m])))
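
If you want to alert on that error rate directly, a rough sketch (the 5% threshold is arbitrary; tune it to your traffic):

sum(rate(grafana_datasource_request_duration_seconds_count{datasource="gmp",code!="200"}[5m]))
  /
sum(rate(grafana_datasource_request_duration_seconds_count{datasource="gmp"}[5m]))
> 0.05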

> I wonder if the underlying architecture treats non-GCP based queries (such as from Grafana) as second class and thus drops them

We do not have any notion of "second class" queries. Requests from Grafana are treated like any other 🙂 .

Hope that helps.

@lyanco
Collaborator

lyanco commented Oct 3, 2024

This issue is almost invariably caused by having a rule evaluator running a ton of really inefficient rules, the most egregious example being looking for a rate over 4 weeks and running that rule every 1 minute. In this case the query takes more than 1 minute to run, causing a backup of queries executed from your project, which then gets throttled by monarch.

The solution is to loosen the period of your rules so that they're not running every 1 minute. For a 4-week rate, it should be sufficient to run that query every 1 hour. Once there's no more bottleneck of inefficient long-running queries, this issue should go away.
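
To make that concrete, here's a sketch of a managed-collection Rules resource where a 4-week rate is evaluated hourly rather than every minute (the group, rule, and metric names are illustrative):

apiVersion: monitoring.googleapis.com/v1
kind: Rules
metadata:
  name: long-range-rates
  namespace: monitoring
spec:
  groups:
  - name: expensive-rates
    # Evaluate hourly; a 4-week rate doesn't need minute-level freshness.
    interval: 1h
    rules:
    - record: job:http_requests:rate4w
      expr: sum by (job) (rate(http_requests_total[4w]))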

@seth-acuitymd
Author

Awesome! Thanks @pintohutch and @lyanco !! Much appreciated!!
