Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ruler evaluation producing status 500 errors and inconsistent alerts #14441

Closed
LukoJy3D opened this issue Oct 9, 2024 · 8 comments · Fixed by #14481
Closed

Ruler evaluation producing status 500 errors and inconsistent alerts #14441

LukoJy3D opened this issue Oct 9, 2024 · 8 comments · Fixed by #14481
Labels
3.0 area/helm component/ruler type/bug Somehing is not working as expected

Comments

@LukoJy3D
Copy link
Contributor

LukoJy3D commented Oct 9, 2024

Describe the bug
When alerts are deployed to Loki, it produces errors with status=500, and alerts are completely inconsistent compared to what is returned manually in grafana.

"log": "level=info ts=2024-10-09T19:46:01.141151674Z caller=engine.go:248 component=ruler evaluation_mode=local org_id=fake msg=\"executing query\" type=instant query=\"(sum(count_over_time({namespace=\\\"ai\\\"} |= \\\"check completed\\\"[10m])) < 30)\" query_hash=3077565986"

"log": "level=info ts=2024-10-09T19:46:01.228229623Z caller=metrics.go:217 component=ruler evaluation_mode=local org_id=fake latency=fast query=\"(sum(count_over_time({namespace=\\\"ai\\\"} |= \\\"check completed\\\"[10m])) < 30)\" query_hash=3077565986 query_type=metric range_type=instant length=0s start_delta=694.286852ms end_delta=694.286992ms step=0s duration=87.000739ms status=500 limit=0 returned_lines=0 throughput=4.1GB total_bytes=357MB total_bytes_structured_metadata=715kB lines_per_second=847349 total_lines=73720 post_filter_lines=48 total_entries=1 store_chunks_download_time=0s queue_time=0s splits=0 shards=0 query_referenced_structured_metadata=false pipeline_wrapper_filtered_lines=0 chunk_refs_fetch_time=155.03µs cache_chunk_req=0 cache_chunk_hit=0 cache_chunk_bytes_stored=0 cache_chunk_bytes_fetched=0 cache_chunk_download_time=0s cache_index_req=0 cache_index_hit=0 cache_index_download_time=0s cache_stats_results_req=0 cache_stats_results_hit=0 cache_stats_results_download_time=0s cache_volume_results_req=0 cache_volume_results_hit=0 cache_volume_results_download_time=0s cache_result_req=0 cache_result_hit=0 cache_result_download_time=0s cache_result_query_length_served=0s ingester_chunk_refs=0 ingester_chunk_downloaded=0 ingester_chunk_matches=73 ingester_requests=8 ingester_chunk_head_bytes=2.5MB ingester_chunk_compressed_bytes=41MB ingester_chunk_decompressed_bytes=354MB ingester_post_filter_lines=48 congestion_control_latency=0s index_total_chunks=0 index_post_bloom_filter_chunks=0 index_bloom_filter_ratio=0.00 index_shard_resolver_duration=0s disable_pipeline_wrappers=false"

Ingesters do not seem to have terrible latency:
Image

Alerts keep going into a pending state and even firing sometimes even though on manual query, it never goes below threshold:
Image

To Reproduce
Steps to reproduce the behavior:

  1. Running Loki Chart 6.16.0 in SSD
  2. deploy alert to Loki using https://github.com/grafana/cortex-rules-action
groups:
  - name: AiErrorAlerts
    rules:
      - alert: LowCheckCount
        expr: sum(count_over_time({namespace="ai"} |= `check completed` [10m])) < 30
        for: 10m
        labels:
          route: alerts-ai
          severity: medium
          source: loki
        annotations:
          summary: Low check completion rate.
          details: Checks completed per 10 minutes - `{{ $value }}`. Expected rate - `30`.
  1. Ruler config:
rulerConfig:
    alertmanager_url: "https://<mimir_am>:8080/alertmanager/"
    enable_alertmanager_v2: true
    enable_sharding: true
    remote_write:
      enabled: true
      clients:
        local:
          url: "http://haproxy.prometheus.svc.cluster.local:8080/api/v1/push"
    storage:
      gcs:
        bucket_name: hostinger-loki-ruler
    wal:
      dir: /var/loki/wal

Expected behavior
Alerts are evaluated properly without randomly going into a pending/firing state.

Environment:

  • Infrastructure: Kubernetes
  • Deployment tool: helm
@LukoJy3D LukoJy3D changed the title Ruler evaluation producing status 500 errors and inconsitant alerts Ruler evaluation producing status 500 errors and inconsistent alerts Oct 9, 2024
@yapishu
Copy link

yapishu commented Oct 28, 2024

I'm also experiencing frustrating issues with ruler in the 6.18.0 chart (app version 3.2.0); the querier seems to evaluate correctly (e.g. printing the number of lines evaluated, matches, throughput etc in the debug logs) for queries that come from the ruler, but they always print status 500, even for queries that don't hit the logs (like vector(1) > 0), and alerting rules never seem to trigger, which I presume means that the queries aren't returning results. This only applies to queries that come from the ruler, and only seems to apply to alerting rules -- if I make a recording rule with the same query, it works fine; identical queries that come through the frontend (ie curls) work perfectly. This has been driving me crazy for weeks.

OP seemed to indicate the switching from local to remote evaluation fixed his problem, but it did not for me. I switched my deployment to distributed and spent a couple of days playing with settings trying to figure out where the breakdown was, but the querier never stopped printing 500 status codes, even when it seemed to be working normally. There was a single point in time that my queries started populating alerts -- some combination of local rule evaluation and distributed mode -- and I tried to preserve these settings in SimpleScalable deployment, but it stopped working immediately.

@JStickler JStickler added component/ruler 3.0 area/helm type/bug Somehing is not working as expected labels Oct 30, 2024
@gyoza
Copy link

gyoza commented Nov 1, 2024

Working on migrating from loki-distributed to this chart 6.18.0 and app version 3.2.0 in distributed mode have also produced status=500 ruler alerts using the following config.

rulerConfig:
    alertmanager_url: http://${ALTERMANAGER_URL}
    external_url: https://${EXTERNAL_URL}
    enable_alertmanager_v2: true
    storage:
      local:
        directory: /etc/loki/rules
      type: local

For the record logcli against the new loki 3.2.0 gateway produces the desired results.

@LukoJy3D
Copy link
Contributor Author

LukoJy3D commented Nov 4, 2024

Most of my problems disappeared after changing the ruler evaluation mode to remote.

evaluation:
      mode: remote
      query_frontend:
        address: loki-read.loki.svc.cluster.local:9095

Sadly, I needed to dig docs quite a lot to find this, as you would expect it to be under ruler/alerting docs. That's why I opened a PR linked to this issue so that this kind of info would be more easily accessible in the alerting docs.

@JStickler I would appreciate it if you could review it 🙇

@JStickler
Copy link
Contributor

@LukoJy3D I have taken a look at your PR, the problem is I'm still learning about alerts and the ruler. And I'm not sure how the switch from cortex-tool to lokitool in Loki 3.1 affects that particular action, but since you're on a very recent version of the Loki Helm charts, I will take your word for it that the cortex rules action works with your Loki version.

@gyoza
Copy link

gyoza commented Nov 4, 2024

For the record I am running in distributed mode in the latest helm chart 6.18.0 and coincidentally I came across this fix today and made this change.

The ruler is now running and not returning 500 status on alerts but the query-front-end is returning total_lines=0 where source = ruler as logcli is getting total_lines=92220 using the same query as the alert directly against the gateway.

Any idea how to debug this? Nothing seems to be complaining about anything in any of the logs in question.

@LukoJy3D
Copy link
Contributor Author

LukoJy3D commented Nov 5, 2024

@LukoJy3D I have taken a look at your PR, the problem is I'm still learning about alerts and the ruler. And I'm not sure how the switch from cortex-tool to lokitool in Loki 3.1 affects that particular action, but since you're on a very recent version of the Loki Helm charts, I will take your word for it that the cortex rules action works with your Loki version.

Cortex CI action was just something we found very useful when deploying rules, which blended very nicely with existing prometheus alert deployment to Mimir using mimirtool in a similar CI fashion. So, I just wanted to share with others. It's not necessarily should be part of that PR.

The ruler is now running and not returning 500 status on alerts but the query-front-end is returning total_lines=0 where source = ruler as logcli is getting total_lines=92220 using the same query as the alert directly against the gateway.

Is there any chance your alerts are misconfigured and not hitting any results? This can happen when queries get sharded and subqueries do not return any data. Overall, I see total_lines=0 in Loki logs, yet it does not seem like it signals something not working.

@gyoza
Copy link

gyoza commented Nov 5, 2024

Is there any chance your alerts are misconfigured and not hitting any results? This can happen when queries get sharded and subqueries do not return any data. Overall, I see total_lines=0 in Loki logs, yet it does not seem like it signals something not working.

They are the same query but maybe I am missing something! I will focus some effort here! Thanks Luko!

@lpowalka
Copy link

lpowalka commented Nov 29, 2024

@gyoza Did you manage to fix this?
I'm having the same problem right now, despite having the evaluation mode set to remote.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3.0 area/helm component/ruler type/bug Somehing is not working as expected
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants