Added metrics to track task/alloc start/restarts/dead events #3061
Conversation
Looks good!
client/alloc_runner.go (Outdated)
}
case structs.TaskStateDead:
	// Capture the finished time. If it has never started there is no finish
	// time
	if !taskState.StartedAt.IsZero() {
		taskState.FinishedAt = time.Now().UTC()
		metrics.IncrCounter([]string{"client", "allocs", r.alloc.Job.Name, r.alloc.TaskGroup, taskName, "dead"}, 1)
Is it worth splitting into "complete" and "failed" based on the taskState.Failed variable?
Yes! :)
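For illustration, a minimal sketch of what that split could look like inside the TaskStateDead case shown above, reusing the key layout from the diff; this is not the exact code from the PR:

	if taskState.Failed {
		// Task went dead because it failed.
		metrics.IncrCounter([]string{"client", "allocs", r.alloc.Job.Name, r.alloc.TaskGroup, taskName, "failed"}, 1)
	} else {
		// Task went dead after completing normally.
		metrics.IncrCounter([]string{"client", "allocs", r.alloc.Job.Name, r.alloc.TaskGroup, taskName, "complete"}, 1)
	}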
Also I think we should increment the counter outside this if block. We should still record it as dead even if it never started.
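In other words, something along these lines, with the counter moved out of the StartedAt check (a sketch of the suggestion, assuming the surrounding case from the diff above):

	case structs.TaskStateDead:
		// Capture the finished time only if the task ever started.
		if !taskState.StartedAt.IsZero() {
			taskState.FinishedAt = time.Now().UTC()
		}

		// Record the task as dead even if it never started.
		metrics.IncrCounter([]string{"client", "allocs", r.alloc.Job.Name, r.alloc.TaskGroup, taskName, "dead"}, 1)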
@dadgar @schmichael While writing the PR, I was wondering: wouldn't it be cleaner if we let alloc runners kill the other task runners in a group when a sibling fails? Feels like the TR should just bubble the event up to its supervisor to make that decision.
@diptanu It already does that if you annotate a task as a leader.
Thanks @diptanu! Left some nitpicks, but the functionality seems good.
client/alloc_runner.go (Outdated)
@@ -744,6 +748,9 @@ func (r *AllocRunner) Run() {
	defer close(r.waitCh)
	go r.dirtySyncState()

	// Incr alloc runner start counter
	metrics.IncrCounter([]string{"client", "allocs", r.alloc.Job.Name, r.alloc.TaskGroup, "start"}, 1)
This metric is going to be a bit confusing to people not familiar with Nomad's internals because it gets incremented even on the Restore path, when the actual tasks are still running. It just signifies that an AllocRunner struct has been Run, whereas people unfamiliar with Nomad would likely read it as "tasks for this task group are being executed."
Unfortunately we don't document our metrics outside of code, so there's no good place to specify what this metric means. Perhaps at least expand the comment with something like:
// Increment alloc runner start counter. Incr'd even when restoring existing tasks so 1 start != 1 task execution
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@schmichael @diptanu Not quite true. We should add documentation here: https://www.nomadproject.io/docs/agent/telemetry.html
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oops! Thanks Alex!
client/alloc_runner.go (Outdated)
@@ -926,6 +933,9 @@ func (r *AllocRunner) handleDestroy() {
	// state as we wait for a destroy.
	alloc := r.Alloc()

	// Incr the alloc destroy counter
Like above, I don't think we ever really document or expose what "destroy" means, so this metric seems confusing to end users. Perhaps expand the comment with something like:
// Increment the destroy count for this alloc runner since this allocation is being removed from this client.
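For illustration, this is roughly how that could look in handleDestroy; the metric key ending in "destroy" is a guess here, not necessarily the one the PR used:

	// Increment the destroy count for this alloc runner since this
	// allocation is being removed from this client.
	metrics.IncrCounter([]string{"client", "allocs", r.alloc.Job.Name, r.alloc.TaskGroup, "destroy"}, 1)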
@dadgar Updated the PR based on comments.
Can you add the metrics to the metrics documentation?
client/alloc_runner.go (Outdated)
}
case structs.TaskStateDead:
	// Capture the finished time. If it has never started there is no finish
	// time
	metrics.IncrCounter([]string{"client", "allocs", r.alloc.Job.Name, r.alloc.TaskGroup, taskName, "dead"}, 1)
Can we remove this given the more fine-grained metrics below?
client/alloc_runner.go (Outdated)
@@ -744,6 +754,9 @@ func (r *AllocRunner) Run() {
	defer close(r.waitCh)
	go r.dirtySyncState()

	// Increment alloc runner start counter. Incr'd even when restoring existing tasks so 1 start != 1 task execution
	metrics.IncrCounter([]string{"client", "allocs", r.alloc.Job.Name, r.alloc.TaskGroup, "start"}, 1)
I would move this below the terminal status check. We don't want to emit it if the alloc_runner is just waiting for a destroy.
This should be updated to reflect that we now have tagged metrics.
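For context, go-metrics also exposes labeled counters, so the tagged form could look roughly like the sketch below; the key layout and label names are illustrative assumptions, not the ones the PR settled on:

	// Emit the start counter with labels instead of embedding the job and
	// task group names in the metric key itself.
	metrics.IncrCounterWithLabels([]string{"client", "allocs", "start"}, 1,
		[]metrics.Label{
			{Name: "job", Value: r.alloc.Job.Name},
			{Name: "task_group", Value: r.alloc.TaskGroup},
		})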
@dadgar done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, only thing missing is documentation on the new metrics
Add to changelog please
Fixes #3060
There could be better places to track the task state changes and increment counters. The task_runner and allocation runner code has changed a lot since I last touched them, so please suggest the best place to do this.
I didn't add the allocation ID to the metrics here since alloc IDs are unique, and I think these metrics are more interesting for debugging cluster-wide problems than for seeing what's happening with an individual allocation. People can always group the metrics by node ID to get a sense of node-level outliers for restarts across racks and clusters.
cc/ @dadgar @schmichael