Added metrics to track task/alloc start/restarts/dead events #3061
Conversation
Looks good!
client/alloc_runner.go (Outdated)
}
case structs.TaskStateDead:
	// Capture the finished time. If it has never started there is no finish
	// time
	if !taskState.StartedAt.IsZero() {
		taskState.FinishedAt = time.Now().UTC()
		metrics.IncrCounter([]string{"client", "allocs", r.alloc.Job.Name, r.alloc.TaskGroup, taskName, "dead"}, 1)
Is it worth splitting into "complete" and "failed" based on the taskState.Failed variable?
Yes! :)
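For illustration, a minimal sketch of what that split could look like inside the TaskStateDead case shown above, reusing the key layout from the diff; this is not the exact code from the PR:

	if taskState.Failed {
		// Task went dead because it failed.
		metrics.IncrCounter([]string{"client", "allocs", r.alloc.Job.Name, r.alloc.TaskGroup, taskName, "failed"}, 1)
	} else {
		// Task went dead after completing normally.
		metrics.IncrCounter([]string{"client", "allocs", r.alloc.Job.Name, r.alloc.TaskGroup, taskName, "complete"}, 1)
	}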
Also I think we should increment the counter outside this if block. We should still record it as dead even if it never started.
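In other words, something along these lines, with the counter moved out of the StartedAt check (a sketch of the suggestion, assuming the surrounding case from the diff above):

	case structs.TaskStateDead:
		// Capture the finished time only if the task ever started.
		if !taskState.StartedAt.IsZero() {
			taskState.FinishedAt = time.Now().UTC()
		}

		// Record the task as dead even if it never started.
		metrics.IncrCounter([]string{"client", "allocs", r.alloc.Job.Name, r.alloc.TaskGroup, taskName, "dead"}, 1)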
@dadgar @schmichael While writing the PR, I was wondering: wouldn't it be cleaner if we let alloc runners kill the other task runners in a group when a sibling fails? Feels like the TR should just bubble the event up to its supervisor to make that decision.
@diptanu It already does that if you annotate a task as a leader.
Thanks @diptanu! Left some nitpicks, but the functionality seems good.
client/alloc_runner.go (Outdated)
@@ -744,6 +748,9 @@ func (r *AllocRunner) Run() {
	defer close(r.waitCh)
	go r.dirtySyncState()

	// Incr alloc runner start counter
	metrics.IncrCounter([]string{"client", "allocs", r.alloc.Job.Name, r.alloc.TaskGroup, "start"}, 1)
This metric is going to be a bit confusing to people not familiar with Nomad's internals because it gets incremented even on the Restore path, when the actual tasks are still running. It just signifies that an AllocRunner struct has been Run, whereas people unfamiliar with Nomad would likely read it as "tasks for this task group are being executed."
Unfortunately we don't document our metrics outside of code, so there's no good place to specify what this metric means. Perhaps at least expand the comment with something like:
// Increment alloc runner start counter. Incr'd even when restoring existing tasks so 1 start != 1 task execution
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@schmichael @diptanu Not quite true. We should add documentation here: https://www.nomadproject.io/docs/agent/telemetry.html
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oops! Thanks Alex!
client/alloc_runner.go (Outdated)
@@ -926,6 +933,9 @@ func (r *AllocRunner) handleDestroy() {
	// state as we wait for a destroy.
	alloc := r.Alloc()

	// Incr the alloc destroy counter
Like above, I don't think we ever really document or expose what "destroy" means, so this metric seems confusing to end users. Perhaps expand the comment with something like:
// Increment the destroy count for this alloc runner since this allocation is being removed from this client.
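For illustration, this is roughly how that could look in handleDestroy; the metric key ending in "destroy" is a guess here, not necessarily the one the PR used:

	// Increment the destroy count for this alloc runner since this
	// allocation is being removed from this client.
	metrics.IncrCounter([]string{"client", "allocs", r.alloc.Job.Name, r.alloc.TaskGroup, "destroy"}, 1)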
@dadgar Updated the PR based on comments.
Can you add the metrics to the metrics documentation?
client/alloc_runner.go (Outdated)
}
case structs.TaskStateDead:
	// Capture the finished time. If it has never started there is no finish
	// time
	metrics.IncrCounter([]string{"client", "allocs", r.alloc.Job.Name, r.alloc.TaskGroup, taskName, "dead"}, 1)
Can we remove this given the more fine-grained metrics below?
client/alloc_runner.go (Outdated)
@@ -744,6 +754,9 @@ func (r *AllocRunner) Run() {
	defer close(r.waitCh)
	go r.dirtySyncState()

	// Increment alloc runner start counter. Incr'd even when restoring existing tasks so 1 start != 1 task execution
	metrics.IncrCounter([]string{"client", "allocs", r.alloc.Job.Name, r.alloc.TaskGroup, "start"}, 1)
I would move this below the terminal status check. We don't want to emit it if the alloc_runner is just waiting for a destroy.
This should be updated to reflect that we now have tagged metrics.
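For context, go-metrics also exposes labeled counters, so the tagged form could look roughly like the sketch below; the key layout and label names are illustrative assumptions, not the ones the PR settled on:

	// Emit the start counter with labels instead of embedding the job and
	// task group names in the metric key itself.
	metrics.IncrCounterWithLabels([]string{"client", "allocs", "start"}, 1,
		[]metrics.Label{
			{Name: "job", Value: r.alloc.Job.Name},
			{Name: "task_group", Value: r.alloc.TaskGroup},
		})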
@dadgar done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, only thing missing is documentation on the new metrics
Add to changelog please
Fixes #3060
There could be better places to track the task state changes and increment counters. The task_runner and allocation runner code has changed a lot since I last touched them, so please suggest the best place to do this.
I didn't add the allocation ID to the metrics here since alloc IDs are unique, and I think these metrics are more interesting for debugging cluster-wide problems than for seeing what's happening with an individual allocation. People can always group the metrics by node ID to get a sense of node-level outliers for restarts across racks and clusters.
cc/ @dadgar @schmichael