Ruler evaluation producing status 500 errors and inconsistent alerts #14441

LukoJy3D · 2024-10-09T20:47:35Z

Describe the bug
When alerts are deployed to Loki, it produces errors with status=500, and alerts are completely inconsistent compared to what is returned manually in grafana.

"log": "level=info ts=2024-10-09T19:46:01.141151674Z caller=engine.go:248 component=ruler evaluation_mode=local org_id=fake msg=\"executing query\" type=instant query=\"(sum(count_over_time({namespace=\\\"ai\\\"} |= \\\"check completed\\\"[10m])) < 30)\" query_hash=3077565986"

"log": "level=info ts=2024-10-09T19:46:01.228229623Z caller=metrics.go:217 component=ruler evaluation_mode=local org_id=fake latency=fast query=\"(sum(count_over_time({namespace=\\\"ai\\\"} |= \\\"check completed\\\"[10m])) < 30)\" query_hash=3077565986 query_type=metric range_type=instant length=0s start_delta=694.286852ms end_delta=694.286992ms step=0s duration=87.000739ms status=500 limit=0 returned_lines=0 throughput=4.1GB total_bytes=357MB total_bytes_structured_metadata=715kB lines_per_second=847349 total_lines=73720 post_filter_lines=48 total_entries=1 store_chunks_download_time=0s queue_time=0s splits=0 shards=0 query_referenced_structured_metadata=false pipeline_wrapper_filtered_lines=0 chunk_refs_fetch_time=155.03µs cache_chunk_req=0 cache_chunk_hit=0 cache_chunk_bytes_stored=0 cache_chunk_bytes_fetched=0 cache_chunk_download_time=0s cache_index_req=0 cache_index_hit=0 cache_index_download_time=0s cache_stats_results_req=0 cache_stats_results_hit=0 cache_stats_results_download_time=0s cache_volume_results_req=0 cache_volume_results_hit=0 cache_volume_results_download_time=0s cache_result_req=0 cache_result_hit=0 cache_result_download_time=0s cache_result_query_length_served=0s ingester_chunk_refs=0 ingester_chunk_downloaded=0 ingester_chunk_matches=73 ingester_requests=8 ingester_chunk_head_bytes=2.5MB ingester_chunk_compressed_bytes=41MB ingester_chunk_decompressed_bytes=354MB ingester_post_filter_lines=48 congestion_control_latency=0s index_total_chunks=0 index_post_bloom_filter_chunks=0 index_bloom_filter_ratio=0.00 index_shard_resolver_duration=0s disable_pipeline_wrappers=false"

Ingesters do not seem to have terrible latency:

Alerts keep going into a pending state and even firing sometimes even though on manual query, it never goes below threshold:

To Reproduce
Steps to reproduce the behavior:

Running Loki Chart 6.16.0 in SSD
deploy alert to Loki using https://github.com/grafana/cortex-rules-action

groups:
  - name: AiErrorAlerts
    rules:
      - alert: LowCheckCount
        expr: sum(count_over_time({namespace="ai"} |= `check completed` [10m])) < 30
        for: 10m
        labels:
          route: alerts-ai
          severity: medium
          source: loki
        annotations:
          summary: Low check completion rate.
          details: Checks completed per 10 minutes - `{{ $value }}`. Expected rate - `30`.

Ruler config:

rulerConfig:
    alertmanager_url: "https://<mimir_am>:8080/alertmanager/"
    enable_alertmanager_v2: true
    enable_sharding: true
    remote_write:
      enabled: true
      clients:
        local:
          url: "http://haproxy.prometheus.svc.cluster.local:8080/api/v1/push"
    storage:
      gcs:
        bucket_name: hostinger-loki-ruler
    wal:
      dir: /var/loki/wal

Expected behavior
Alerts are evaluated properly without randomly going into a pending/firing state.

Environment:

Infrastructure: Kubernetes
Deployment tool: helm

The text was updated successfully, but these errors were encountered:

yapishu · 2024-10-28T13:49:38Z

I'm also experiencing frustrating issues with ruler in the 6.18.0 chart (app version 3.2.0); the querier seems to evaluate correctly (e.g. printing the number of lines evaluated, matches, throughput etc in the debug logs) for queries that come from the ruler, but they always print status 500, even for queries that don't hit the logs (like vector(1) > 0), and alerting rules never seem to trigger, which I presume means that the queries aren't returning results. This only applies to queries that come from the ruler, and only seems to apply to alerting rules -- if I make a recording rule with the same query, it works fine; identical queries that come through the frontend (ie curls) work perfectly. This has been driving me crazy for weeks.

OP seemed to indicate the switching from local to remote evaluation fixed his problem, but it did not for me. I switched my deployment to distributed and spent a couple of days playing with settings trying to figure out where the breakdown was, but the querier never stopped printing 500 status codes, even when it seemed to be working normally. There was a single point in time that my queries started populating alerts -- some combination of local rule evaluation and distributed mode -- and I tried to preserve these settings in SimpleScalable deployment, but it stopped working immediately.

gyoza · 2024-11-01T20:05:10Z

Working on migrating from loki-distributed to this chart 6.18.0 and app version 3.2.0 in distributed mode have also produced status=500 ruler alerts using the following config.

rulerConfig:
    alertmanager_url: http://${ALTERMANAGER_URL}
    external_url: https://${EXTERNAL_URL}
    enable_alertmanager_v2: true
    storage:
      local:
        directory: /etc/loki/rules
      type: local

For the record logcli against the new loki 3.2.0 gateway produces the desired results.

LukoJy3D · 2024-11-04T16:22:08Z

Most of my problems disappeared after changing the ruler evaluation mode to remote.

evaluation:
      mode: remote
      query_frontend:
        address: loki-read.loki.svc.cluster.local:9095

Sadly, I needed to dig docs quite a lot to find this, as you would expect it to be under ruler/alerting docs. That's why I opened a PR linked to this issue so that this kind of info would be more easily accessible in the alerting docs.

@JStickler I would appreciate it if you could review it 🙇

JStickler · 2024-11-04T16:47:07Z

@LukoJy3D I have taken a look at your PR, the problem is I'm still learning about alerts and the ruler. And I'm not sure how the switch from cortex-tool to lokitool in Loki 3.1 affects that particular action, but since you're on a very recent version of the Loki Helm charts, I will take your word for it that the cortex rules action works with your Loki version.

gyoza · 2024-11-04T21:18:12Z

For the record I am running in distributed mode in the latest helm chart 6.18.0 and coincidentally I came across this fix today and made this change.

The ruler is now running and not returning 500 status on alerts but the query-front-end is returning total_lines=0 where source = ruler as logcli is getting total_lines=92220 using the same query as the alert directly against the gateway.

Any idea how to debug this? Nothing seems to be complaining about anything in any of the logs in question.

LukoJy3D · 2024-11-05T09:17:15Z

@LukoJy3D I have taken a look at your PR, the problem is I'm still learning about alerts and the ruler. And I'm not sure how the switch from cortex-tool to lokitool in Loki 3.1 affects that particular action, but since you're on a very recent version of the Loki Helm charts, I will take your word for it that the cortex rules action works with your Loki version.

Cortex CI action was just something we found very useful when deploying rules, which blended very nicely with existing prometheus alert deployment to Mimir using mimirtool in a similar CI fashion. So, I just wanted to share with others. It's not necessarily should be part of that PR.

The ruler is now running and not returning 500 status on alerts but the query-front-end is returning total_lines=0 where source = ruler as logcli is getting total_lines=92220 using the same query as the alert directly against the gateway.

Is there any chance your alerts are misconfigured and not hitting any results? This can happen when queries get sharded and subqueries do not return any data. Overall, I see total_lines=0 in Loki logs, yet it does not seem like it signals something not working.

gyoza · 2024-11-05T15:10:34Z

Is there any chance your alerts are misconfigured and not hitting any results? This can happen when queries get sharded and subqueries do not return any data. Overall, I see total_lines=0 in Loki logs, yet it does not seem like it signals something not working.

They are the same query but maybe I am missing something! I will focus some effort here! Thanks Luko!

lpowalka · 2024-11-29T11:04:07Z

@gyoza Did you manage to fix this?
I'm having the same problem right now, despite having the evaluation mode set to remote.

LukoJy3D changed the title ~~Ruler evaluation producing status 500 errors and inconsitant alerts~~ Ruler evaluation producing status 500 errors and inconsistent alerts Oct 9, 2024

LukoJy3D mentioned this issue Oct 12, 2024

docs(alert): clarify remote rule evaluation #14481

Merged

6 tasks

JStickler added component/ruler 3.0 area/helm type/bug Somehing is not working as expected labels Oct 30, 2024

JStickler closed this as completed in #14481 Nov 5, 2024

This was referenced Nov 5, 2024

docs(alert): clarify remote rule evaluation (backport release-3.1.x) #14777

Merged

docs(alert): clarify remote rule evaluation (backport release-3.2.x) #14778

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Ruler evaluation producing status 500 errors and inconsistent alerts #14441

Ruler evaluation producing status 500 errors and inconsistent alerts #14441

LukoJy3D commented Oct 9, 2024 •

edited

Loading

yapishu commented Oct 28, 2024 •

edited

Loading

gyoza commented Nov 1, 2024

LukoJy3D commented Nov 4, 2024

JStickler commented Nov 4, 2024

gyoza commented Nov 4, 2024 •

edited

Loading

LukoJy3D commented Nov 5, 2024

gyoza commented Nov 5, 2024

lpowalka commented Nov 29, 2024 •

edited

Loading

Ruler evaluation producing status 500 errors and inconsistent alerts #14441

Ruler evaluation producing status 500 errors and inconsistent alerts #14441

Comments

LukoJy3D commented Oct 9, 2024 • edited Loading

yapishu commented Oct 28, 2024 • edited Loading

gyoza commented Nov 1, 2024

LukoJy3D commented Nov 4, 2024

JStickler commented Nov 4, 2024

gyoza commented Nov 4, 2024 • edited Loading

LukoJy3D commented Nov 5, 2024

gyoza commented Nov 5, 2024

lpowalka commented Nov 29, 2024 • edited Loading

LukoJy3D commented Oct 9, 2024 •

edited

Loading

yapishu commented Oct 28, 2024 •

edited

Loading

gyoza commented Nov 4, 2024 •

edited

Loading

lpowalka commented Nov 29, 2024 •

edited

Loading