---
apiVersion: indicatorprotocol.io/v1
kind: IndicatorDocument

metadata:
  labels:
    deployment: <%= spec.deployment %>
    component: locket

spec:
  product:
    name: diego
    version: latest

  indicators:
  - name: locket_active_locks
    promql: max_over_time(ActiveLocks{source_id="locket"}[5m])
    documentation:
      title: Locket - Active Locks
      description: |
        Total number of locks held by system components.

        Use: If the ActiveLocks count is not equal to the expected value, there is likely a problem with Diego.

        Origin: Firehose
        Type: Gauge
        Frequency: 60s
      recommended_alert_thresholds: Dynamic
      recommended_response: |
        1. Run monit status to inspect for failing processes.
        2. If there are no failing processes, then review the logs for the components using the Locket service: BBS, Auctioneer, TPS Watcher, Routing API, Clock Global (Cloud Controller clock), and others. Look for indications that only one of each component is active at a time.
        3. Focus triage on the BBS first:
           A healthy BBS shows obvious activity around starting or claiming LRPs.
           An unhealthy BBS leads to the Auctioneer showing minimal or no activity. The BBS sends work to the Auctioneer.
           Reference the BBS-level Locket metric Locks Held by BBS. A value of 0 indicates Locket issues at the BBS level.
        4. If the BBS appears healthy, then check the Auctioneer to ensure it is processing auction payloads.
           Recent logs for the Auctioneer should show that all but one of its instances are currently waiting on locks, and the active Auctioneer should show a record of when it last attempted to execute work. This attempt should correspond to app development activity, such as cf push.
           Reference the Auctioneer-level Locket metric Locks Held by Auctioneer. A value of 0 indicates Locket issues at the Auctioneer level.
        5. The TPS Watcher is primarily active when app instances crash. Therefore, if the TPS Watcher is suspected, review the most recent logs.
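
  # Example only, not part of the upstream file. Because the recommended alert
  # threshold above is "Dynamic", an operator might pin it once the expected
  # lock count for a deployment is known. This sketch assumes the
  # indicatorprotocol.io/v1 threshold fields (level, operator, value) and a
  # hypothetical expected count of 4. Check both against your deployment and
  # the Indicator Protocol reference before uncommenting it under the
  # locket_active_locks indicator.
  #
  #   thresholds:
  #   - level: critical
  #     operator: neq
  #     value: 4   # hypothetical expected ActiveLocks count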

  - name: locket_active_presences
    promql: max_over_time(ActivePresences{source_id="locket"}[15m])
    documentation:
      title: Locket - Active Presences
      description: |
        Total count of active presences. Presences are defined as the registration records that the cells maintain to advertise themselves to the platform.

        Use: If the Active Presences count is far from the expected value, there might be a problem with Diego.
        The number of active presences varies according to the number of cells deployed. Therefore, during purposeful scale adjustments, this alerting threshold should be adjusted.
        Establish an initial threshold by observing the historical trends for the deployment over a brief period of time. Increase the threshold as more cells are deployed. During a rolling deploy, this metric shows variance during the BOSH lifecycle when cells are evacuated and restarted. Tolerable variance is within the bounds of the BOSH max inflight range for the instance group.

        Origin: Firehose
        Type: Gauge
        Frequency: 60s
      recommended_alert_thresholds: Dynamic
      recommended_response: |
        1. Ensure that the variance is not the result of an active rolling deploy. Also ensure that the alert threshold is appropriate to the number of cells in the current deployment.
        2. Run monit status to inspect for failing processes.
        3. If there are no failing processes, then review the logs for the components using the Locket service itself on Diego BBS instances.
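
  # Example only, not part of the upstream file. One way to turn the guidance
  # above into a lower bound, assuming a hypothetical deployment of 20 cells
  # with a BOSH max_in_flight of 2 for the cell instance group: during a
  # rolling deploy up to 2 cells may be absent, so 20 - 2 = 18 presences is
  # still tolerable and only lower values should alert. The threshold fields
  # (level, operator, value) are assumed from indicatorprotocol.io/v1. Check
  # them against the Indicator Protocol reference and recompute the value
  # whenever the cell count changes.
  #
  #   thresholds:
  #   - level: warning
  #     operator: lt
  #     value: 18   # hypothetical: 20 cells minus max_in_flight of 2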