-
Notifications
You must be signed in to change notification settings - Fork 2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Raft and state store indexes as metrics #5841
Changes from 3 commits
9ab2207
2282fe1
425bd4f
3adb751
aba8d42
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -15,6 +15,7 @@ import ( | |
"sync/atomic" | ||
"time" | ||
|
||
"github.com/armon/go-metrics" | ||
"github.com/hashicorp/consul/agent/consul/autopilot" | ||
consulapi "github.com/hashicorp/consul/api" | ||
"github.com/hashicorp/consul/lib" | ||
|
@@ -410,6 +411,9 @@ func NewServer(config *Config, consulCatalog consul.CatalogAPI) (*Server, error) | |
// Emit metrics | ||
go s.heartbeatStats() | ||
|
||
// Emit raft and state store metrics | ||
go s.EmitRaftStats(time.Second, s.shutdownCh) | ||
|
||
// Start enterprise background workers | ||
s.startEnterpriseBackground() | ||
|
||
|
@@ -1450,6 +1454,27 @@ func (s *Server) Stats() map[string]map[string]string { | |
return stats | ||
} | ||
|
||
// EmitRaftStats is used to export metrics about raft indexes and state store snapshot index | ||
func (s *Server) EmitRaftStats(period time.Duration, stopCh <-chan struct{}) { | ||
for { | ||
select { | ||
case <-time.After(period): | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. very nit-picky - |
||
lastIndex := s.raft.LastIndex() | ||
metrics.SetGauge([]string{"raft", "lastIndex"}, float32(lastIndex)) | ||
appliedIndex := s.raft.AppliedIndex() | ||
metrics.SetGauge([]string{"raft", "appliedIndex"}, float32(appliedIndex)) | ||
stateStoreSnapshotIndex, err := s.State().LatestIndex() | ||
if err != nil { | ||
s.logger.Warn("Unable to read snapshot index from statestore, metric will not be emitted", "error", err) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Curious - what conditions would result into an error getting the latest index? Also, is it meant to be recoverable? I'd be concerned about spurious logging if it happens somewhat frequently. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This should be rare and indicates state store corruption (either the index table is missing or contains invalid data). A lot of Nomad operations would be broken if it gets to that state so its a good thing to be warning about here. |
||
} else { | ||
metrics.SetGauge([]string{"state", "snapshotIndex"}, float32(stateStoreSnapshotIndex)) | ||
} | ||
case <-stopCh: | ||
return | ||
} | ||
} | ||
} | ||
|
||
// Region returns the region of the server | ||
func (s *Server) Region() string { | ||
return s.config.Region | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -109,6 +109,18 @@ when retrieving metrics using the above described signals. | |
<td>Raft transactions / `interval`</td> | ||
<td>Counter</td> | ||
</tr> | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @langmartin @schmichael Can i get a review on these docs before I merge? |
||
<tr> | ||
<td>`nomad.raft.lastIndex`</td> | ||
<td>Index of the last log</td> | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. it's the index of the last log or snapshot, I assume there are cases where the snapshot can lead the logs. Since snapshot is just a view of the logs, this may not be a meaningful distinction |
||
<td>Sequence number</td> | ||
<td>Gauge</td> | ||
</tr> | ||
<tr> | ||
<td>`nomad.raft.appliedIndex`</td> | ||
<td>Index of the last applied log</td> | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. ~ Is it worth adding some context and/or the caveat in https://godoc.org/github.com/hashicorp/raft#Raft.AppliedIndex :
|
||
<td>Sequence number</td> | ||
<td>Gauge</td> | ||
</tr> | ||
<tr> | ||
<td>`nomad.raft.replication.appendEntries`</td> | ||
<td>Raft transaction commit time</td> | ||
|
@@ -167,6 +179,12 @@ when retrieving metrics using the above described signals. | |
<td>ms / Plan Evaluation</td> | ||
<td>Timer</td> | ||
</tr> | ||
<tr> | ||
<td>`nomad.state.snapshotIndex`</td> | ||
<td>Latest index in the server's in memory state store</td> | ||
<td>Sequence number</td> | ||
<td>Gauge</td> | ||
</tr> | ||
<tr> | ||
<td>`nomad.worker.invoke_scheduler.<type>`</td> | ||
<td>Time to run the scheduler of the given type</td> | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see how this matches other stats period, but I suspect that this is too frequent. Statsd agent commonly only flush data every few seconds seconds (e.g. 10 seconds for datadog [1] and statsite [2]), and with gauges, all values in the flush interval are dropped except for the last one [3].
Nothing to change now but raising awareness of potentially wasteful operations here and we can investigate/act later.
[1] https://docs.datadoghq.com/developers/dogstatsd/#how-it-works
[2] http://statsite.github.io/statsite/
[3] https://github.com/statsd/statsd/blob/master/docs/metric_types.md#gauges
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be more reasonable to use 10seconds here? The one potentially expensive call is
state.LatestIndex
which scans multiple memdb tables to find the max index. I'd rather fix this now to be more conservative about emitting stats.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I'd go with 10s or 5s, and then we can reexamine other stats periods.