Merge pull request grafana/cortex-jsonnet#334 from grafana/playbooks-for-compactor-alerts

Improve compactor alerts and playbooks
pracucci authored Jun 21, 2021
2 parents a4b9505 + 7b96c22 commit e676f7c
Showing 2 changed files with 25 additions and 29 deletions.
30 changes: 14 additions & 16 deletions jsonnet/mimir-mixin/alerts/compactor.libsonnet
@@ -47,6 +47,19 @@
message: 'Cortex Compactor {{ $labels.namespace }}/{{ $labels.instance }} has not run compaction in the last 24 hours.',
},
},
{
// Alert if compactor failed to run 2 consecutive compactions.
alert: 'CortexCompactorHasNotSuccessfullyRunCompaction',
expr: |||
increase(cortex_compactor_runs_failed_total[2h]) >= 2
|||,
labels: {
severity: 'critical',
},
annotations: {
message: 'Cortex Compactor {{ $labels.namespace }}/{{ $labels.instance }} failed to run 2 consecutive compactions.',
},
},
{
// Alert if the compactor has not uploaded anything in the last 24h.
alert: 'CortexCompactorHasNotUploadedBlocks',
@@ -65,7 +78,7 @@
},
{
// Alert if the compactor has not uploaded anything since its start.
alert: 'CortexCompactorHasNotUploadedBlocksSinceStart',
alert: 'CortexCompactorHasNotUploadedBlocks',
'for': '24h',
expr: |||
thanos_objstore_bucket_last_successful_upload_time{job=~".+/%(compactor)s"} == 0
@@ -77,21 +90,6 @@
message: 'Cortex Compactor {{ $labels.namespace }}/{{ $labels.instance }} has not uploaded any block in the last 24 hours.',
},
},
{
// Alert if compactor fails.
alert: 'CortexCompactorRunFailed',
expr: |||
increase(cortex_compactor_runs_failed_total[2h]) >= 2
|||,
labels: {
severity: 'critical',
},
annotations: {
message: |||
{{ $labels.job }}/{{ $labels.instance }} failed to run compaction.
|||,
},
},
],
},
],
24 changes: 11 additions & 13 deletions jsonnet/mimir-mixin/docs/playbooks.md
@@ -272,11 +272,21 @@ Same as [`CortexCompactorHasNotSuccessfullyCleanedUpBlocks`](#CortexCompactorHas
This alert fires when a Cortex compactor has not uploaded any compacted blocks to the storage for a long time.

How to **investigate**:
- If the alert `CortexCompactorHasNotSuccessfullyRun` or `CortexCompactorHasNotSuccessfullyRunSinceStart` has fired as well, then investigate that issue first
- If the alert `CortexCompactorHasNotSuccessfullyRunCompaction` has fired as well, then investigate that issue first
- If the alert `CortexIngesterHasNotShippedBlocks` or `CortexIngesterHasNotShippedBlocksSinceStart` has fired as well, then investigate that issue first
- Ensure ingesters are successfully shipping blocks to the storage
- Look for any error in the compactor logs

### CortexCompactorHasNotSuccessfullyRunCompaction

This alert fires if the compactor is not able to successfully compact all discovered compactable blocks (across all tenants).

When this alert fires, the compactor may still have successfully compacted some blocks but, for some reason, the compaction of other blocks is consistently failing. A common case is when the compactor is trying to compact a corrupted block for a single tenant: in this case the compaction of blocks for other tenants is still working, but compaction for the affected tenant is blocked by the corrupted block.

How to **investigate**:
- Look for any error in the compactor logs
- Corruption: [`not healthy index found`](#compactor-is-failing-because-of-not-healthy-index-found)
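
To make the alert condition concrete, here is a minimal Python sketch of how the expression `increase(cortex_compactor_runs_failed_total[2h]) >= 2` can be reasoned about. It is a simplification: Prometheus's `increase` also extrapolates to the window boundaries, which this sketch omits, and the names `counter_increase` and `should_fire` are hypothetical helpers, not part of Cortex or Prometheus.

```python
from typing import List, Tuple

# (unix timestamp in seconds, counter value); an illustrative type, not a Prometheus API
Sample = Tuple[float, float]


def counter_increase(samples: List[Sample], now: float, window_s: float) -> float:
    """Approximate the increase of a counter over the last `window_s` seconds.

    Counter resets are handled, but no extrapolation is applied.
    """
    in_window = [v for (t, v) in samples if now - window_s <= t <= now]
    total = 0.0
    for prev, cur in zip(in_window, in_window[1:]):
        # A drop in value means the counter was reset (e.g. the process
        # restarted), so the whole current value counts as new increase.
        total += cur - prev if cur >= prev else cur
    return total


def should_fire(samples: List[Sample], now: float,
                window_s: float = 2 * 3600, threshold: float = 2) -> bool:
    """Mirror the alert condition: at least `threshold` failed runs in the window."""
    return counter_increase(samples, now, window_s) >= threshold


# Two failed runs within the last 2 hours: the condition holds.
samples = [(0, 5), (1800, 5), (3600, 6), (5400, 6), (7200, 7)]
print(should_fire(samples, now=7200))  # True
```

Because the condition counts failed runs per compactor instance, it can hold even while the same compactor is successfully compacting blocks for other tenants.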

#### Compactor is failing because of `not healthy index found`

The compactor may fail to compact blocks due to a corrupted block index found in one of the source blocks:
@@ -301,18 +311,6 @@ To rename a block stored on GCS you can use the `gsutil` CLI:
gsutil mv gs://BUCKET/TENANT/BLOCK gs://BUCKET/TENANT/corrupted-BLOCK
```
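
When several blocks need renaming, the command above can be generated rather than typed by hand. The following is a small sketch that only builds the argument list for the same `gsutil mv` invocation; the function name `corrupted_block_rename_cmd` is hypothetical, and the bucket, tenant and block values are placeholders.

```python
from typing import List


def corrupted_block_rename_cmd(bucket: str, tenant: str, block: str) -> List[str]:
    """Build the `gsutil mv` argument list that renames a block
    by adding the `corrupted-` prefix, as in the command above."""
    src = f"gs://{bucket}/{tenant}/{block}"
    dst = f"gs://{bucket}/{tenant}/corrupted-{block}"
    return ["gsutil", "mv", src, dst]


# Print the command for review before running it manually.
print(" ".join(corrupted_block_rename_cmd("BUCKET", "TENANT", "BLOCK")))
```

Printing the command instead of executing it keeps the rename a deliberate, reviewed step.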

### CortexCompactorHasNotUploadedBlocksSinceStart

Same as [`CortexCompactorHasNotUploadedBlocks`](#CortexCompactorHasNotUploadedBlocks).

### CortexCompactorHasNotSuccessfullyRunCompaction

_TODO: this playbook has not been written yet._

### CortexCompactorRunFailed

_TODO: this playbook has not been written yet._

### CortexBucketIndexNotUpdated

This alert fires when the bucket index, for a given tenant, has not been updated for a long time. The bucket index is expected to be periodically updated by the compactor and is used by queriers and store-gateways to get an almost up-to-date view of the bucket store.