Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Raft and state store indexes as metrics #5841

Merged
merged 5 commits into from
Jun 19, 2019
Merged
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
25 changes: 25 additions & 0 deletions nomad/server.go
Original file line number Diff line number Diff line change
Expand Up @@ -15,6 +15,7 @@ import (
"sync/atomic"
"time"

"github.com/armon/go-metrics"
"github.com/hashicorp/consul/agent/consul/autopilot"
consulapi "github.com/hashicorp/consul/api"
"github.com/hashicorp/consul/lib"
Expand Down Expand Up @@ -410,6 +411,9 @@ func NewServer(config *Config, consulCatalog consul.CatalogAPI) (*Server, error)
// Emit metrics
go s.heartbeatStats()

// Emit raft and state store metrics
go s.EmitRaftStats(time.Second, s.shutdownCh)
Copy link
Contributor

@notnoop notnoop Jun 18, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see how this matches other stats period, but I suspect that this is too frequent. Statsd agent commonly only flush data every few seconds seconds (e.g. 10 seconds for datadog [1] and statsite [2]), and with gauges, all values in the flush interval are dropped except for the last one [3].

Nothing to change now but raising awareness of potentially wasteful operations here and we can investigate/act later.

[1] https://docs.datadoghq.com/developers/dogstatsd/#how-it-works
[2] http://statsite.github.io/statsite/
[3] https://github.com/statsd/statsd/blob/master/docs/metric_types.md#gauges

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be more reasonable to use 10seconds here? The one potentially expensive call is state.LatestIndex which scans multiple memdb tables to find the max index. I'd rather fix this now to be more conservative about emitting stats.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I'd go with 10s or 5s, and then we can reexamine other stats periods.


// Start enterprise background workers
s.startEnterpriseBackground()

Expand Down Expand Up @@ -1450,6 +1454,27 @@ func (s *Server) Stats() map[string]map[string]string {
return stats
}

// EmitRaftStats is used to export metrics about the blocked eval tracker while enabled
func (s *Server) EmitRaftStats(period time.Duration, stopCh <-chan struct{}) {
for {
select {
case <-time.After(period):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

very nit-picky - timer.Ticker is slightly nicer for using in loops, to reuse channels and timer mechanisms.

commitIndex := s.raft.LastIndex()
metrics.SetGauge([]string{"raft", "commitIndex"}, float32(commitIndex))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be logIndex instead of commitIndex? At least the comments in https://github.com/hashicorp/raft/blob/master/state.go#L66 suggest (because of "cache") that we're getting a possibly stale index from the raft max(log, snapshot). This may be fine, I'm still a little hazy on which ids from raft may move backwards.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks to me like logIndex and commitIndex are distinct, and probably both helpful. The only place the commit index is returned from the raft api looks like its in the Stats call, https://github.com/hashicorp/raft/blob/master/api.go#L943

appliedIndex := s.raft.AppliedIndex()
metrics.SetGauge([]string{"raft", "appliedIndex"}, float32(appliedIndex))
stateStoreSnapshotIndex, err := s.State().LatestIndex()
if err != nil {
s.logger.Warn("Unable to read snapshot index from statestore, metric will not be emitted", "error", err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Curious - what conditions would result into an error getting the latest index? Also, is it meant to be recoverable? I'd be concerned about spurious logging if it happens somewhat frequently.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be rare and indicates state store corruption (either the index table is missing or contains invalid data). A lot of Nomad operations would be broken if it gets to that state so its a good thing to be warning about here.

} else {
metrics.SetGauge([]string{"state", "snapshotIndex"}, float32(stateStoreSnapshotIndex))
}
case <-stopCh:
return
}
}
}

// Region returns the region of the server
func (s *Server) Region() string {
return s.config.Region
Expand Down