Ruler evaluation producing status 500 errors and inconsistent alerts #14441
Comments
I'm also experiencing frustrating issues with the ruler. The OP seemed to indicate that switching from local to remote evaluation fixed his problem, but it did not for me. I switched my deployment to distributed and spent a couple of days playing with settings trying to figure out where the breakdown was, but the querier never stopped printing 500 status codes, even when it seemed to be working normally. There was a single point in time when my queries started populating alerts -- some combination of
Working on migrating from
For the record, logcli against the new Loki 3.2.0 gateway produces the desired results.
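For anyone wanting to repeat that check, a minimal sketch of running the rule expression directly against the gateway with logcli; the gateway address and tenant ID are placeholders, and the query is the one from the ruler logs further down:

```bash
# Placeholder gateway address and tenant ID; adjust for your deployment.
export LOKI_ADDR=http://loki-gateway.loki.svc.cluster.local:80
export LOKI_ORG_ID=fake

# Evaluate the rule expression as an instant metric query, like the ruler does.
logcli instant-query \
  'sum(count_over_time({namespace="ai"} |= "check completed" [10m]))'
```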
Most of my problems disappeared after changing the ruler evaluation mode to remote.
Sadly, I needed to dig through the docs quite a lot to find this, as you would expect it to be under the ruler/alerting docs. That's why I opened a PR linked to this issue, so that this kind of info would be more easily accessible in the alerting docs. @JStickler I would appreciate it if you could review it 🙇
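For readers landing here from search, a minimal sketch of the config change being described, assuming a distributed/microservices deployment; the query-frontend address below is a placeholder:

```yaml
ruler:
  evaluation:
    # Default is "local"; "remote" sends rule queries through the query-frontend
    # so they get the same splitting/sharding as interactive queries.
    mode: remote
    query_frontend:
      # Placeholder address; point this at your query-frontend gRPC endpoint.
      address: dns:///loki-query-frontend.loki.svc.cluster.local:9095
```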
@LukoJy3D I have taken a look at your PR; the problem is that I'm still learning about alerts and the ruler. I'm not sure how the switch from cortex-tool to lokitool in Loki 3.1 affects that particular action, but since you're on a very recent version of the Loki Helm charts, I will take your word for it that the cortex rules action works with your Loki version.
For the record, I am running in distributed mode on the latest Helm chart. The ruler is now running and not returning 500 status on alerts, but the query-frontend is still returning errors. Any idea how to debug this? Nothing seems to be complaining about anything in any of the logs in question.
The Cortex CI action was just something we found very useful when deploying rules; it blended very nicely with our existing Prometheus alert deployment to Mimir using mimirtool in a similar CI fashion. So I just wanted to share it with others. It doesn't necessarily need to be part of that PR.
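As an illustration of that CI-style workflow, a hedged sketch using lokitool, assuming it mirrors cortextool's rules subcommands; the address, tenant ID, and rule file path are placeholders:

```bash
# Push local rule files to the Loki ruler for a given tenant.
# Address, tenant ID, and rule file path are placeholders.
lokitool rules sync \
  --address=http://loki-gateway.loki.svc.cluster.local:80 \
  --id=fake \
  ./rules/loki-alerts.yaml
```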
Is there any chance your alerts are misconfigured and not hitting any results? This can happen when queries get sharded and subqueries do not return any data. Overall, I see
They are the same query, but maybe I am missing something! I will focus some effort here. Thanks, Luko!
@gyoza Did you manage to fix this? |
Describe the bug
When alert rules are deployed to Loki, rule evaluation produces errors with status=500, and the resulting alerts are completely inconsistent with what the same query returns when run manually in Grafana.
"log": "level=info ts=2024-10-09T19:46:01.141151674Z caller=engine.go:248 component=ruler evaluation_mode=local org_id=fake msg=\"executing query\" type=instant query=\"(sum(count_over_time({namespace=\\\"ai\\\"} |= \\\"check completed\\\"[10m])) < 30)\" query_hash=3077565986"
"log": "level=info ts=2024-10-09T19:46:01.228229623Z caller=metrics.go:217 component=ruler evaluation_mode=local org_id=fake latency=fast query=\"(sum(count_over_time({namespace=\\\"ai\\\"} |= \\\"check completed\\\"[10m])) < 30)\" query_hash=3077565986 query_type=metric range_type=instant length=0s start_delta=694.286852ms end_delta=694.286992ms step=0s duration=87.000739ms status=500 limit=0 returned_lines=0 throughput=4.1GB total_bytes=357MB total_bytes_structured_metadata=715kB lines_per_second=847349 total_lines=73720 post_filter_lines=48 total_entries=1 store_chunks_download_time=0s queue_time=0s splits=0 shards=0 query_referenced_structured_metadata=false pipeline_wrapper_filtered_lines=0 chunk_refs_fetch_time=155.03µs cache_chunk_req=0 cache_chunk_hit=0 cache_chunk_bytes_stored=0 cache_chunk_bytes_fetched=0 cache_chunk_download_time=0s cache_index_req=0 cache_index_hit=0 cache_index_download_time=0s cache_stats_results_req=0 cache_stats_results_hit=0 cache_stats_results_download_time=0s cache_volume_results_req=0 cache_volume_results_hit=0 cache_volume_results_download_time=0s cache_result_req=0 cache_result_hit=0 cache_result_download_time=0s cache_result_query_length_served=0s ingester_chunk_refs=0 ingester_chunk_downloaded=0 ingester_chunk_matches=73 ingester_requests=8 ingester_chunk_head_bytes=2.5MB ingester_chunk_compressed_bytes=41MB ingester_chunk_decompressed_bytes=354MB ingester_post_filter_lines=48 congestion_control_latency=0s index_total_chunks=0 index_post_bloom_filter_chunks=0 index_bloom_filter_ratio=0.00 index_shard_resolver_duration=0s disable_pipeline_wrappers=false"
Ingesters do not seem to have terrible latency:
Alerts keep going into a pending state, and sometimes even firing, even though the manual query never goes below the threshold:
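One way to cross-check what the ruler itself thinks the alert state is (a sketch, assuming the ruler exposes its Prometheus-compatible API on port 3100; the service name and tenant are placeholders):

```bash
# Rule groups and their last evaluation state, as reported by the ruler.
curl -s -H 'X-Scope-OrgID: fake' \
  http://loki-ruler.loki.svc.cluster.local:3100/prometheus/api/v1/rules

# Alerts currently pending or firing, as seen by the ruler.
curl -s -H 'X-Scope-OrgID: fake' \
  http://loki-ruler.loki.svc.cluster.local:3100/prometheus/api/v1/alerts
```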
To Reproduce
Steps to reproduce the behavior:
Expected behavior
Alerts are evaluated properly without randomly going into a pending/firing state.
Environment: