Raft and state store indexes as metrics #5841

preetapan · 2019-06-14T21:32:37Z

This PR makes every server emit the Raft commit and apply indexes, and the state store's latest index, as metrics.

schmichael

Don't forget docs and changelog. I think it's worth a quick test that spins up a TestServer and reads the metrics, if for no other reason than to ensure we don't accidentally break it in the future.

langmartin · 2019-06-17T15:53:37Z

nomad/server.go

+		select {
+		case <-time.After(period):
+			commitIndex := s.raft.LastIndex()
+			metrics.SetGauge([]string{"raft", "commitIndex"}, float32(commitIndex))


Should this be logIndex instead of commitIndex? At least the comments in https://github.com/hashicorp/raft/blob/master/state.go#L66 suggest (because of "cache") that we're getting a possibly stale index from the raft max(log, snapshot). This may be fine, I'm still a little hazy on which ids from raft may move backwards.

It looks to me like logIndex and commitIndex are distinct, and probably both helpful. The only place the raft API returns the commit index appears to be the Stats call, https://github.com/hashicorp/raft/blob/master/api.go#L943
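To make the distinction concrete, here is a minimal sketch of pulling separate indexes out of a stats map. It assumes the map has the shape returned by the raft library's Stats call (string keys and string values) and that the keys are named `last_log_index`, `commit_index`, and `applied_index`; verify both against the vendored raft version. The helper `indexFromStats` and the sample values are hypothetical and not part of this PR.

```go
package main

import (
	"fmt"
	"strconv"
)

// indexFromStats parses one numeric entry from a stats map. The map shape
// (map[string]string) is assumed to match what the raft Stats call returns.
func indexFromStats(stats map[string]string, key string) (uint64, error) {
	raw, ok := stats[key]
	if !ok {
		return 0, fmt.Errorf("stats key %q not present", key)
	}
	return strconv.ParseUint(raw, 10, 64)
}

func main() {
	// Sample values standing in for a live Stats() call on a server.
	stats := map[string]string{
		"last_log_index": "120",
		"commit_index":   "118",
		"applied_index":  "117",
	}

	// Report each index separately so a log index is never labeled
	// as a commit index.
	for _, key := range []string{"last_log_index", "commit_index", "applied_index"} {
		idx, err := indexFromStats(stats, key)
		if err != nil {
			fmt.Println("skipping", key, "error:", err)
			continue
		}
		fmt.Println(key, idx)
	}
}
```

Parsing from the stats map costs a string conversion per read, so it only makes sense if the commit index is wanted as its own gauge.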

preetapan · 2019-06-17T21:10:28Z

website/source/docs/telemetry/index.html.md

@@ -109,6 +109,18 @@ when retrieving metrics using the above described signals.
    <td>Raft transactions / `interval`</td>
    <td>Counter</td>
  </tr>


@langmartin @schmichael Can I get a review on these docs before I merge?

notnoop

LGTM. I had a few comments, but they aren't a big deal.

notnoop · 2019-06-18T01:43:17Z

nomad/server.go

+func (s *Server) EmitRaftStats(period time.Duration, stopCh <-chan struct{}) {
+	for {
+		select {
+		case <-time.After(period):


Very nit-picky: time.Ticker is slightly nicer for use in loops, since it reuses the same channel and timer instead of creating new ones on every iteration.
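A minimal sketch of the ticker-based loop suggested here, for comparison with the `time.After` version in the diff. The names `emitEvery` and `countEmits` are hypothetical; the real method would emit the gauges where this sketch calls `emit`.

```go
package main

import (
	"fmt"
	"time"
)

// emitEvery calls emit once per period until stopCh is closed.
// One Ticker is created up front and reused, whereas time.After
// allocates a fresh timer on every pass through the loop.
func emitEvery(period time.Duration, stopCh <-chan struct{}, emit func()) {
	ticker := time.NewTicker(period)
	defer ticker.Stop() // release the ticker's resources on shutdown

	for {
		select {
		case <-ticker.C:
			emit()
		case <-stopCh:
			return
		}
	}
}

// countEmits runs emitEvery with a short period, closes the stop channel
// after n emissions, and returns how many emissions were counted.
func countEmits(n int) int {
	if n <= 0 {
		return 0
	}
	stopCh := make(chan struct{})
	done := make(chan struct{})
	count := 0

	go func() {
		defer close(done)
		emitEvery(time.Millisecond, stopCh, func() {
			// A tick may still be selected after stopCh is closed,
			// so ignore anything past the requested count.
			if count >= n {
				return
			}
			count++
			if count == n {
				close(stopCh)
			}
		})
	}()

	<-done
	return count
}

func main() {
	fmt.Println("emissions:", countEmits(3))
}
```

The behavior is the same as the `time.After` loop for a fixed period; the difference is only in how many timers are allocated over the life of the server.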

notnoop · 2019-06-18T01:46:01Z

nomad/server.go

+			metrics.SetGauge([]string{"raft", "appliedIndex"}, float32(appliedIndex))
+			stateStoreSnapshotIndex, err := s.State().LatestIndex()
+			if err != nil {
+				s.logger.Warn("Unable to read snapshot index from statestore, metric will not be emitted", "error", err)


Curious: what conditions would result in an error when getting the latest index? Also, is it meant to be recoverable? I'd be concerned about spurious logging if it happens somewhat frequently.

This should be rare and indicates state store corruption (either the index table is missing or it contains invalid data). A lot of Nomad operations would be broken if it gets to that state, so it's a good thing to be warning about here.

notnoop · 2019-06-18T01:47:50Z

website/source/docs/telemetry/index.html.md

+  </tr>
+  <tr>
+    <td>`nomad.raft.appliedIndex`</td>
+    <td>Index of the last applied log</td>


Is it worth adding some context and/or the caveat from https://godoc.org/github.com/hashicorp/raft#Raft.AppliedIndex :

This is generally lagging behind the last index, especially for indexes that are persisted but have not yet been considered committed by the leader. NOTE - this reflects the last index that was sent to the application's FSM over the apply channel but DOES NOT mean that the application's FSM has yet consumed it and applied it to its internal state. Thus, the application's state may lag behind this index.

notnoop · 2019-06-18T01:59:18Z

nomad/server.go

@@ -410,6 +411,9 @@ func NewServer(config *Config, consulCatalog consul.CatalogAPI) (*Server, error)
 	// Emit metrics
 	go s.heartbeatStats()

+	// Emit raft and state store metrics
+	go s.EmitRaftStats(time.Second, s.shutdownCh)


I see how this matches the period of the other stats, but I suspect it is too frequent. StatsD agents commonly flush data only every few seconds (e.g. 10 seconds for Datadog [1] and statsite [2]), and with gauges, every value in a flush interval is dropped except the last one [3].

Nothing to change now but raising awareness of potentially wasteful operations here and we can investigate/act later.

[1] https://docs.datadoghq.com/developers/dogstatsd/#how-it-works
[2] http://statsite.github.io/statsite/
[3] https://github.com/statsd/statsd/blob/master/docs/metric_types.md#gauges

Would it be more reasonable to use 10 seconds here? The one potentially expensive call is state.LatestIndex, which scans multiple memdb tables to find the max index. I'd rather fix this now and be more conservative about emitting stats.

Yeah, I'd go with 10s or 5s, and then we can reexamine other stats periods.
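To illustrate why a one-second period is mostly wasted work for gauges, here is a small simulation of the last-write-wins behavior described above. The `gaugeBuffer` type is hypothetical and only models an agent's flush window; it is not the go-metrics or StatsD implementation.

```go
package main

import "fmt"

// gaugeBuffer models how a StatsD-style agent treats gauges between
// flushes: each Set overwrites the previous value stored for that name.
type gaugeBuffer struct {
	values map[string]float32
	writes int
}

func newGaugeBuffer() *gaugeBuffer {
	return &gaugeBuffer{values: map[string]float32{}}
}

// Set records a gauge value, replacing any earlier value in this window.
func (g *gaugeBuffer) Set(name string, v float32) {
	g.values[name] = v
	g.writes++
}

// Flush returns the surviving values and the number of writes that were
// overwritten during the window, then resets the buffer.
func (g *gaugeBuffer) Flush() (map[string]float32, int) {
	out := g.values
	dropped := g.writes - len(out)
	g.values = map[string]float32{}
	g.writes = 0
	return out, dropped
}

func main() {
	buf := newGaugeBuffer()

	// Ten one-second emissions landing inside one ten-second flush window.
	for i := 0; i < 10; i++ {
		buf.Set("nomad.raft.appliedIndex", float32(100+i))
	}

	flushed, dropped := buf.Flush()
	fmt.Println("reported:", flushed["nomad.raft.appliedIndex"], "overwritten:", dropped)
}
```

With a ten-second flush interval, nine of every ten emissions are overwritten before they are reported, which is the argument for moving the collection period to 5s or 10s.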

langmartin · 2019-06-18T16:09:41Z

website/source/docs/telemetry/index.html.md

@@ -109,6 +109,18 @@ when retrieving metrics using the above described signals.
    <td>Raft transactions / `interval`</td>
    <td>Counter</td>
  </tr>
+  <tr>
+    <td>`nomad.raft.lastIndex`</td>
+    <td>Index of the last log</td>


It's the index of the last log or snapshot. I assume there are cases where the snapshot can lead the logs. Since a snapshot is just a view of the logs, this may not be a meaningful distinction.

github-actions · 2023-02-08T02:17:12Z

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

Emit metrics with raft commit and apply index and statestore latest index

9ab2207

preetapan requested review from langmartin, notnoop and endocrimes June 14, 2019 21:32

schmichael approved these changes Jun 17, 2019

langmartin approved these changes Jun 17, 2019

Preetha Appan added 2 commits June 17, 2019 15:51

Changed name of metric

2282fe1

docs for new metrics

425bd4f

preetapan commented Jun 17, 2019

notnoop approved these changes Jun 18, 2019

langmartin reviewed Jun 18, 2019

Preetha Appan added 2 commits June 19, 2019 11:58

Change interval of raft stats collection to 10s

3adb751

Add links to godoc for raft related metrics

aba8d42

preetapan merged commit c135f03 into master Jun 19, 2019

notnoop deleted the f-raft-snapshot-metrics branch June 22, 2019 06:11

github-actions bot locked as resolved and limited conversation to collaborators Feb 8, 2023
