LCAM 1650 Setup Grafana Dashboard and Alerting for LAA Crime Evidence Service #28462

Merged · 3 commits · Dec 18, 2024
@@ -0,0 +1,134 @@
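# Context note: the Prometheus Operator only loads PrometheusRule objects that
# match its rule selector; on the MOJ Cloud Platform that is the
# role: alert-rules label set in metadata below.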
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  namespace: laa-crime-evidence-prod
  labels:
    role: alert-rules
  name: prometheus-custom-rules-laa-crime-evidence
spec:
  groups:
    - name: application-rules
      rules:
        - alert: ClientResponseErrors
          expr: sum(increase(http_server_requests_seconds_count{outcome="CLIENT_ERROR", namespace="laa-crime-evidence-prod"}[10m])) > 1
          for: 1m
          labels:
            severity: laa-crime-evidence-alerts-prod
            namespace: laa-crime-evidence-prod
          annotations:
            message: There has been an increase in client error responses from the Crime Evidence Service running on production in the past 10 minutes. This may indicate a problem with clients calling the application, including intrusion attempts.
            dashboard_url: https://grafana.live.cloud-platform.service.justice.gov.uk/d/6ab2e7a7933f4d42bd5871e8df8ff6b3703625e2/laa-crime-evidence?var-namespace=laa-crime-evidence-prod
        - alert: ServerResponseErrors
          expr: sum(increase(http_server_requests_seconds_count{outcome="SERVER_ERROR", namespace="laa-crime-evidence-prod"}[10m])) > 1
          for: 1m
          labels:
            severity: laa-crime-evidence-alerts-prod
            namespace: laa-crime-evidence-prod
          annotations:
            message: There has been an increase in server error responses from the Crime Evidence Service running on production in the past 10 minutes. This may indicate a problem with the server processing client requests, likely a bug.
            dashboard_url: https://grafana.live.cloud-platform.service.justice.gov.uk/d/6ab2e7a7933f4d42bd5871e8df8ff6b3703625e2/laa-crime-evidence?var-namespace=laa-crime-evidence-prod
        - alert: JvmCpuUsage
          expr: (process_cpu_usage{namespace="laa-crime-evidence-prod"} * 100) > 95
          for: 1m
          labels:
            severity: laa-crime-evidence-alerts-prod
            namespace: laa-crime-evidence-prod
          annotations:
            message: The crime-evidence application running on a production pod has been using over 95% of the CPU for a minute. This may indicate that the pods running this application require more CPU resources.
            runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md
            dashboard_url: https://grafana.live.cloud-platform.service.justice.gov.uk/d/6ab2e7a7933f4d42bd5871e8df8ff6b3703625e2/laa-crime-evidence?var-namespace=laa-crime-evidence-prod
        - alert: SystemCpuUsage
          expr: (system_cpu_usage{namespace="laa-crime-evidence-prod"} * 100) > 95
          for: 1m
          labels:
            severity: laa-crime-evidence-alerts-prod
            namespace: laa-crime-evidence-prod
          annotations:
            message: A production pod running crime-evidence has been using over 95% of the pod's CPU for a minute. This may indicate that a process other than our application is consuming the pod's CPU resources and warrants further investigation.
            runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md
            dashboard_url: https://grafana.live.cloud-platform.service.justice.gov.uk/d/6ab2e7a7933f4d42bd5871e8df8ff6b3703625e2/laa-crime-evidence?var-namespace=laa-crime-evidence-prod
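        # Context note: the three heap rules below watch the generational spaces
        # exposed by Micrometer (jvm_memory_used_bytes / jvm_memory_max_bytes);
        # the id values ("Tenured Gen", "Survivor Space", "Eden Space") are the
        # pool names the JVM reports when the serial collector is in use.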
        - alert: TenuredGenHeapUsage
          expr: ((jvm_memory_used_bytes{area="heap", id="Tenured Gen", namespace="laa-crime-evidence-prod"}/jvm_memory_max_bytes{area="heap", id="Tenured Gen", namespace="laa-crime-evidence-prod"})*100) > 95
          for: 1m
          labels:
            severity: laa-crime-evidence-alerts-prod
            namespace: laa-crime-evidence-prod
          annotations:
            message: Over 95% of the "Tenured Gen" object heap memory on a production pod has been used. This may indicate that our application needs more object heap memory or that we have a memory leak in our application.
            runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md
            dashboard_url: https://grafana.live.cloud-platform.service.justice.gov.uk/d/6ab2e7a7933f4d42bd5871e8df8ff6b3703625e2/laa-crime-evidence?var-namespace=laa-crime-evidence-prod
        - alert: SurvivorSpaceHeapUsage
          expr: ((jvm_memory_used_bytes{area="heap", id="Survivor Space", namespace="laa-crime-evidence-prod"}/jvm_memory_max_bytes{area="heap", id="Survivor Space", namespace="laa-crime-evidence-prod"})*100) > 95
          for: 1m
          labels:
            severity: laa-crime-evidence-alerts-prod
            namespace: laa-crime-evidence-prod
          annotations:
            message: Over 95% of the "Survivor Space" object heap memory on a production pod has been used. This may indicate that our application needs more object heap memory or that we have a memory leak in our application.
            runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md
            dashboard_url: https://grafana.live.cloud-platform.service.justice.gov.uk/d/6ab2e7a7933f4d42bd5871e8df8ff6b3703625e2/laa-crime-evidence?var-namespace=laa-crime-evidence-prod
        - alert: EdenSpaceHeapUsage
          expr: ((jvm_memory_used_bytes{area="heap", id="Eden Space", namespace="laa-crime-evidence-prod"}/jvm_memory_max_bytes{area="heap", id="Eden Space", namespace="laa-crime-evidence-prod"})*100) > 95
          for: 1m
          labels:
            severity: laa-crime-evidence-alerts-prod
            namespace: laa-crime-evidence-prod
          annotations:
            message: Over 95% of the "Eden Space" object heap memory on a production pod has been used. This may indicate that our application needs more object heap memory or that we have a memory leak in our application.
            runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md
            dashboard_url: https://grafana.live.cloud-platform.service.justice.gov.uk/d/6ab2e7a7933f4d42bd5871e8df8ff6b3703625e2/laa-crime-evidence?var-namespace=laa-crime-evidence-prod
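        # Context note: the expression below has no rate() window, so it compares
        # the average latency over the whole lifetime of the process, not over a
        # recent interval; a short latency spike on a long-lived pod may not fire it.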
        - alert: ResponseTimeExcessive
          expr: (sum(http_server_requests_seconds_sum{outcome="SUCCESS", namespace="laa-crime-evidence-prod"})/sum(http_server_requests_seconds_count{outcome="SUCCESS", namespace="laa-crime-evidence-prod"})) > 5
          for: 1m
          labels:
            severity: laa-crime-evidence-alerts-prod
            namespace: laa-crime-evidence-prod
          annotations:
            message: Successful responses on production are taking over five seconds on average. This indicates that responses to our clients are taking too long.
            runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md
            dashboard_url: https://grafana.live.cloud-platform.service.justice.gov.uk/d/6ab2e7a7933f4d42bd5871e8df8ff6b3703625e2/laa-crime-evidence?var-namespace=laa-crime-evidence-prod
        - alert: LoggingError
          expr: sum(increase(logback_events_total{level="error", namespace="laa-crime-evidence-prod"}[10m])) > 1
          for: 1m
          labels:
            severity: laa-crime-evidence-alerts-prod
            namespace: laa-crime-evidence-prod
          annotations:
            message: There has been an error in the logs of the Crime Evidence Service in the past 10 minutes. This indicates that there may be a bug in the application.
            runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md
            dashboard_url: https://grafana.live.cloud-platform.service.justice.gov.uk/d/6ab2e7a7933f4d42bd5871e8df8ff6b3703625e2/laa-crime-evidence?var-namespace=laa-crime-evidence-prod
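        # Context note: absent(up{...}) only returns 1 when no matching series
        # exists at all, so this fires when every scrape target in the namespace
        # has disappeared, not when a single pod is down.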
        - alert: InstanceDown
          expr: absent(up{namespace="laa-crime-evidence-prod"}) == 1
          for: 1m
          labels:
            severity: laa-crime-evidence-alerts-prod
            namespace: laa-crime-evidence-prod
          annotations:
            message: The production instance of the Crime Evidence Service has been down for more than one minute.
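        # Context note: the expression below divides the "used" kube_resourcequota
        # series by the matching "hard" series (ignoring the labels that differ
        # between them) to get percentage usage, and fires when any resource
        # passes 90% of its hard quota.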
        - alert: QuotaExceeded
          expr: 100 * kube_resourcequota{job="kube-state-metrics",type="used",namespace="laa-crime-evidence-prod"} / ignoring(instance, job, type) (kube_resourcequota{job="kube-state-metrics",type="hard",namespace="laa-crime-evidence-prod"} > 0) > 90
          for: 1m
          labels:
            severity: laa-crime-evidence-alerts-prod
            namespace: laa-crime-evidence-prod
          annotations:
            message: Namespace {{ $labels.namespace }} is using {{ printf "%0.0f" $value }}% of its {{ $labels.resource }} quota.
            runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubequotaexceeded
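        # Context note: rate(...[10m]) gives restarts per second, so * 60 * 10
        # estimates restarts per ten minutes; more than one restart in that
        # window, sustained for five minutes, triggers the alert.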
        - alert: KubePodCrashLooping
          expr: round(rate(kube_pod_container_status_restarts_total{job="kube-state-metrics", namespace="laa-crime-evidence-prod"}[10m]) * 60 * 10) > 1
          for: 5m
          labels:
            severity: laa-crime-evidence-alerts-prod
            namespace: laa-crime-evidence-prod
          annotations:
            message: Pod {{ $labels.namespace }}/{{ $labels.pod }} ({{ $labels.container }}) is restarting excessively.
            runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubepodcrashlooping
        - alert: IncreaseIn403BlockedRequests
          expr: sum(increase(nginx_ingress_controller_requests{exported_namespace="laa-crime-evidence-prod", ingress="laa-crime-evidence", status="403"}[5m])) > 1
          for: 1m
          labels:
            severity: laa-crime-evidence-alerts-prod
            namespace: laa-crime-evidence-prod
          annotations:
            message: The rate of requests blocked by the internal ingress (HTTP 403) has been increasing over the past 5 minutes.
            dashboard_url: https://grafana.live.cloud-platform.service.justice.gov.uk/d/6ab2e7a7933f4d42bd5871e8df8ff6b3703625e2/laa-crime-evidence?var-namespace=laa-crime-evidence-prod
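
For reference, nothing in this file creates the notification route itself: Alertmanager matches on the severity label used above. A minimal sketch of a matching Alertmanager route, assuming a Slack receiver (the receiver name and channel below are hypothetical):

route:
  routes:
    - match:
        severity: laa-crime-evidence-alerts-prod
      receiver: slack-laa-crime-evidence-prod # hypothetical receiver name
receivers:
  - name: slack-laa-crime-evidence-prod
    slack_configs:
      - channel: "#laa-crime-evidence-alerts" # hypothetical channel
        send_resolved: true

On the Cloud Platform this routing is typically managed separately from the application namespace; each rule here only needs the severity label to match the configured receiver.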