
Added metrics to track task/alloc start/restarts/dead events #3061

Merged
diptanu merged 7 commits into master from f-add-client-metrics on Nov 2, 2017

Conversation

@diptanu (Contributor) commented Aug 19, 2017

Fixes #3060

There could be better places to track the task state changes and increment the counters. The task_runner and alloc_runner code has changed a lot since I last touched it, so please suggest the best place to do this.

I didn't add the allocation ID to the metrics here since alloc IDs are unique, and I think these metrics are more interesting for debugging cluster-wide problems than for seeing what's happening with an individual allocation. People can always group the metrics by node ID to get a sense of node-level outliers for restarts across racks and clusters.

cc/ @dadgar @schmichael

@diptanu diptanu requested review from schmichael and dadgar August 19, 2017 08:39
@dadgar (Contributor) left a comment

Looks good!

		}
	case structs.TaskStateDead:
		// Capture the finished time. If it has never started there is no finish
		// time
		if !taskState.StartedAt.IsZero() {
			taskState.FinishedAt = time.Now().UTC()
			metrics.IncrCounter([]string{"client", "allocs", r.alloc.Job.Name, r.alloc.TaskGroup, taskName, "dead"}, 1)
Contributor
Is it worth splitting into "complete" and "failed" based on the taskState.Failed variable?

Contributor
Yes! :)
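
For illustration, a minimal sketch of what that split might look like, reusing the identifiers from the diff above (taskState, r.alloc, taskName); not necessarily the exact code that landed:

	if taskState.Failed {
		// Task reached a terminal state because it failed.
		metrics.IncrCounter([]string{"client", "allocs", r.alloc.Job.Name, r.alloc.TaskGroup, taskName, "failed"}, 1)
	} else {
		// Task reached a terminal state after completing normally.
		metrics.IncrCounter([]string{"client", "allocs", r.alloc.Job.Name, r.alloc.TaskGroup, taskName, "complete"}, 1)
	}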

Member
Also I think we should increment the counter outside this if block. We should still record it as dead even if it never started.
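
Concretely, that would hoist the counter out of the StartedAt check, roughly like this (a sketch using the same identifiers as the diff above):

	case structs.TaskStateDead:
		// Capture the finished time if the task ever started.
		if !taskState.StartedAt.IsZero() {
			taskState.FinishedAt = time.Now().UTC()
		}
		// Record the task as dead whether or not it ever started.
		metrics.IncrCounter([]string{"client", "allocs", r.alloc.Job.Name, r.alloc.TaskGroup, taskName, "dead"}, 1)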

Contributor Author
@dadgar @schmichael While writing the PR, I was wondering: wouldn't it be cleaner if we let alloc runners kill the other task runners in a group when a sibling fails? It feels like the TR should just bubble the event up to its supervisor to make that decision.

Contributor
@diptanu It already does that if you annotate a task as a leader.
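
For reference, that leader behavior is driven by a flag on the task itself; a rough sketch of a group with a designated leader, using the structs package for illustration (the group and task names here are made up):

	package main

	import "github.com/hashicorp/nomad/nomad/structs"

	// When the leader task exits, the client stops the remaining tasks in the group.
	var exampleGroup = &structs.TaskGroup{
		Name: "example",
		Tasks: []*structs.Task{
			{Name: "main", Driver: "exec", Leader: true},
			{Name: "sidecar", Driver: "exec"},
		},
	}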

@schmichael (Member) left a comment
Thanks @diptanu! Left some nitpicks, but the functionality seems good.


@@ -744,6 +748,9 @@ func (r *AllocRunner) Run() {
	defer close(r.waitCh)
	go r.dirtySyncState()

	// Incr alloc runner start counter
	metrics.IncrCounter([]string{"client", "allocs", r.alloc.Job.Name, r.alloc.TaskGroup, "start"}, 1)
Member
This metric is going to be a bit confusing to people not familiar with Nomad's internals because it gets incremented even on the Restore path, where the actual tasks are already running. It just signifies that an AllocRunner struct has had Run called, while people unfamiliar with Nomad would likely read it as "tasks for this task group are being executed."

Unfortunately we don't document our metrics outside of code, so there's no good place to specify what this metric means. Perhaps at least expand the comment with something like:

// Increment alloc runner start counter. Incr'd even when restoring existing tasks so 1 start != 1 task execution

Contributor
@schmichael @diptanu Not quite true. We should add documentation here: https://www.nomadproject.io/docs/agent/telemetry.html

Member
Oops! Thanks Alex!

@@ -926,6 +933,9 @@ func (r *AllocRunner) handleDestroy() {
	// state as we wait for a destroy.
	alloc := r.Alloc()

	// Incr the alloc destroy counter
Member
Like above, I don't think we ever really document or expose what "destroy" means, so this metric seems confusing to end users. Perhaps expand the comment with something like:

// Increment the destroy count for this alloc runner since this allocation is being removed from this client.

@diptanu (Contributor Author) commented Aug 30, 2017

@dadgar Updated the PR based on comments.

@dadgar (Contributor) left a comment
Can you add the metrics to the metrics documentation?

		}
	case structs.TaskStateDead:
		// Capture the finished time. If it has never started there is no finish
		// time
		metrics.IncrCounter([]string{"client", "allocs", r.alloc.Job.Name, r.alloc.TaskGroup, taskName, "dead"}, 1)
Contributor
Can we remove this given the more fine-grained metrics below?

@@ -744,6 +754,9 @@ func (r *AllocRunner) Run() {
	defer close(r.waitCh)
	go r.dirtySyncState()

	// Increment alloc runner start counter. Incr'd even when restoring existing tasks so 1 start != 1 task execution
	metrics.IncrCounter([]string{"client", "allocs", r.alloc.Job.Name, r.alloc.TaskGroup, "start"}, 1)
Contributor
I would move this below the terminal status check. We don't want to emit it if the alloc_runner is just waiting for a destroy.
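
A rough sketch of the suggested ordering, assuming Run() bails out early for terminal allocations (the surrounding code in alloc_runner.go is paraphrased, not quoted):

	func (r *AllocRunner) Run() {
		defer close(r.waitCh)
		go r.dirtySyncState()

		// If the allocation is already terminal, just wait for the destroy
		// and skip the "start" metric entirely.
		alloc := r.Alloc()
		if alloc.TerminalStatus() {
			r.handleDestroy()
			return
		}

		// Increment alloc runner start counter. Incr'd even when restoring
		// existing tasks so 1 start != 1 task execution.
		metrics.IncrCounter([]string{"client", "allocs", alloc.Job.Name, alloc.TaskGroup, "start"}, 1)
		// ...
	}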

@dadgar (Contributor) commented Sep 7, 2017

This should be updated to reflect that we now have tagged metrics.
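
For context, "tagged metrics" means emitting the job, task group, and task names as labels on a fixed metric key rather than embedding them in the key path. With the go-metrics library Nomad uses, that looks roughly like the sketch below (the label names are illustrative, not necessarily what the final commits used):

	metrics.IncrCounterWithLabels([]string{"client", "allocs", "start"}, 1,
		[]metrics.Label{
			{Name: "job", Value: r.alloc.Job.Name},
			{Name: "task_group", Value: r.alloc.TaskGroup},
		})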

@diptanu diptanu force-pushed the f-add-client-metrics branch from 2a97a9b to b87528a on November 2, 2017 17:06
@diptanu diptanu force-pushed the f-add-client-metrics branch from b87528a to 103ff55 on November 2, 2017 17:08
@diptanu (Contributor Author) commented Nov 2, 2017

@dadgar done

@dadgar (Contributor) left a comment
Looks good; the only thing missing is documentation on the new metrics.

@dadgar (Contributor) commented Nov 2, 2017

Add to changelog please

@diptanu diptanu merged commit 533a0f1 into master Nov 2, 2017
@diptanu diptanu deleted the f-add-client-metrics branch November 2, 2017 20:45
@github-actions (bot)

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions bot locked as resolved and limited conversation to collaborators Mar 19, 2023