-
Notifications
You must be signed in to change notification settings - Fork 2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add driver.docker counter metric for OOM Killer events #4185
Add driver.docker counter metric for OOM Killer events #4185
Conversation
@@ -17,12 +17,13 @@ import ( | |||
"time" | |||
|
|||
"github.com/armon/circbuf" | |||
docker "github.com/fsouza/go-dockerclient" | |||
"github.com/fsouza/go-dockerclient" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Goimports suggested this change. Please let me know if you want me to rollback this line.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
its because the imported package in that folder is already called docker
:)
@@ -1924,6 +1925,21 @@ func (h *DockerHandle) run() { | |||
h.logger.Printf("[ERR] driver.docker: failed to inspect container %s: %v", h.containerID, ierr) | |||
} else if container.State.OOMKilled { | |||
werr = fmt.Errorf("OOM Killed") | |||
labels := []metrics.Label{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for me the interesting labels would be job, group and task :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes for me too, but i don't know how to access them from here. I'm happy to implement a solution but I'd need some help.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jesusvazquez you could add those fields to the DockerHandle
struct, and then populate them from the DriverContext (I'm preparing a PR so you can have those available).
Lines 895 to 909 in fb7e0c1
h := &DockerHandle{ | |
client: client, | |
waitClient: waitClient, | |
executor: exec, | |
pluginClient: pluginClient, | |
logger: d.logger, | |
Image: d.driverConfig.ImageName, | |
ImageID: d.imageID, | |
containerID: container.ID, | |
version: d.config.Version.VersionNumber(), | |
killTimeout: GetKillTimeout(task.KillTimeout, maxKill), | |
maxKillTimeout: maxKill, | |
doneCh: make(chan bool), | |
waitCh: make(chan *dstructs.WaitResult, 1), | |
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See: #4196
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh really nice @jvrplmlmn Ill update this PR as soon as I get back from my vacations.
Adding this fields to the DriverContext object, will allow us to pass them to the drivers. An use case for this, will be to emit tagged metrics in the drivers, which contain all relevant information: - Job - TaskGroup - Task - ... Ref: hashicorp#4185
client/driver/docker.go
Outdated
Value: h.ImageID, | ||
}, | ||
{ | ||
Name: "ContainerID", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this a useful metric? Seems like the combination of image ID + job information captures all you would need and be a little less noisy.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes now that #4196 introduces job information Ill update the labels.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
6f35d36
to
54e1788
Compare
client/driver/docker.go
Outdated
jobName: d.DriverContext.jobName, | ||
taskGroupName: d.DriverContext.taskGroupName, | ||
taskName: d.DriverContext.taskName, | ||
allocID: d.DriverContext.allocID, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Two notes on this:
-
AllocID: I've added the allocation id because then we have the full context from the job to the allocation. This is useful to provide user oriented alerts.
-
We're initializing DockerHandle with 4 out of 8 fields of the DriverContext struct. Then I question myself: Would it make sense to pass the whole struct to DockerHandle?.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Need to add to ID and Open method: https://github.com/hashicorp/nomad/blob/master/client/driver/docker.go#L1769-L1798
client/driver/docker.go
Outdated
Value: h.taskGroupName, | ||
}, | ||
{ | ||
Name: "TaskName", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jippi this is what you meant right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah - looks great!
I personally don't care for the allocation id below, its a very high cardinality field, making influxdb & friends unhappy fairly fast
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understand that from the database and query point of view it probably has no value but from the alerting point of view the user can quickly run nomad alloc-status <alloc-id>
and see what is happening.
What would you say here? Should we keep it or drop it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
my personal preference would be to drop it - sounds like an alert in your logging rather than telemetry if you ask me :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would drop it. As @jippi points out it will make many telemetry systems unhappy
client/driver/docker.go
Outdated
Value: h.taskGroupName, | ||
}, | ||
{ | ||
Name: "TaskName", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would drop it. As @jippi points out it will make many telemetry systems unhappy
client/driver/docker.go
Outdated
@@ -1924,6 +1933,25 @@ func (h *DockerHandle) run() { | |||
h.logger.Printf("[ERR] driver.docker: failed to inspect container %s: %v", h.containerID, ierr) | |||
} else if container.State.OOMKilled { | |||
werr = fmt.Errorf("OOM Killed") | |||
labels := []metrics.Label{ | |||
{ | |||
Name: "JobName", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use consistent names: https://github.com/hashicorp/nomad/blob/master/client/task_runner.go#L274: job, task_group, task
client/driver/docker.go
Outdated
jobName: d.DriverContext.jobName, | ||
taskGroupName: d.DriverContext.taskGroupName, | ||
taskName: d.DriverContext.taskName, | ||
allocID: d.DriverContext.allocID, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Need to add to ID and Open method: https://github.com/hashicorp/nomad/blob/master/client/driver/docker.go#L1769-L1798
Ping on comments |
I checked the ID method but I see that the dockerPID struct does not have job, task and task_group values, should I create them? Or maybe I got it wrong, could you give me more detail? |
Ping @dadgar ⬆️ |
I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions. |
Hi!
This PR is a proposal to add a new counter metric for
OOM Killer
events.The current behaviour is that when a container is killed by the
OOM Killer
nomad logs the event. The changes I made emit a counter metric after that with some labels that I'd consider useful for later analysis.I'm starting to learn golang so if I got something wrong give me some guidances.
Regards