docker: periodically reconcile containers #6325
Conversation
Overall this LGTM. I left a few questions/comments.
Great work! This should help a lot of Docker users!
Add a test from at least the TaskRunner level to ensure there's no undesirable interaction between this and TaskRunner's (indirect) container management.
Not a blocker, but I'd prefer this be an independent struct like the image coordinator to help manage its dependencies and scope.
drivers/docker/reconciler.go (outdated):

func (d *Driver) isNomadContainer(c docker.APIContainers) bool {
	if _, ok := c.Labels["com.hashicorp.nomad.alloc_id"]; ok {
Let's use a const for this.
Actually where is this set? I didn't think we automatically set any labels?
Very observant :). This isn't being set anywhere yet; I intend to resurrect #5153 to set these labels, and will use a const for the key there.
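A minimal sketch of what using a const might look like, assuming the label key stays `com.hashicorp.nomad.alloc_id` (the const and helper names below are illustrative, not the final #5153 code):

```go
package docker

import (
	docker "github.com/fsouza/go-dockerclient"
)

// dockerLabelAllocID is the label key Nomad would set on containers it starts
// (assumed value; the canonical const would come from resurrecting #5153).
const dockerLabelAllocID = "com.hashicorp.nomad.alloc_id"

// hasNomadAllocLabel reports whether a listed container carries the Nomad
// alloc ID label.
func hasNomadAllocLabel(c docker.APIContainers) bool {
	_, ok := c.Labels[dockerLabelAllocID]
	return ok
}
```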
drivers/docker/reconciler.go (outdated):

// pre-0.10 containers aren't tagged or labeled in any way,
// so use cheap heauristic based on mount paths
Suggested change:
- // so use cheap heauristic based on mount paths
+ // so use cheap heuristic based on mount paths
	return false
}

func (d *Driver) trackedContainers() map[string]bool {
Instead of building a point-in-time snapshot of tracked containers, I think this should call `d.tasks.Get(...)` in the `untrackedContainers` main loop instead of checking the map. The scenario I think this avoids is:
- `tracked` map built
- container list request to dockerd sent
- 4m pass because of load
- container list returned
- `cutoff` is set
- a number of 0.9 containers exist, so InspectContainer is called against a slow dockerd and 1m passes

At this point the `tracked` map is >5m old, so any containers created since are treated as untracked and eligible for GC.
Removing the InspectContainer call may be sufficient to fix this scenario, but I don't see a reason to build a copy of tracked containers vs doing individual gets.
Good catch - I'd be in favor of re-ordering operations so the cutoff is taken before any lookups. I think it's a much easier system if we can reduce mutating data.
I find it much easier to reason and test around time-snapshotted data (and mutating container state), as opposed to changing container lists, changing handles, and changing containers. If we want to use this reconciler to detect containers that exited without us noticing and to kill containers, having both lists mutate underneath us makes it tricky IMO.
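A rough sketch of that re-ordering, computing the cutoff from a snapshot taken before any dockerd round-trips. The function shape mirrors the PR, but it reuses the package-level `client` and `isNomadContainer` as assumptions, and the helper name is illustrative:

```go
// untrackedContainersSketch takes the time snapshot first, so a slow
// ListContainers or inspect call cannot age the cutoff relative to the
// tracked snapshot.
func (d *Driver) untrackedContainersSketch(tracked map[string]bool, grace time.Duration) ([]string, error) {
	cutoff := time.Now().Add(-grace).Unix() // snapshot time before talking to dockerd

	cc, err := client.ListContainers(docker.ListContainersOptions{})
	if err != nil {
		return nil, fmt.Errorf("failed to list containers: %v", err)
	}

	var untracked []string
	for _, c := range cc {
		if tracked[c.ID] {
			continue // still owned by a live task handle
		}
		if c.Created >= cutoff {
			continue // inside the creation grace period, may not be registered yet
		}
		if !d.isNomadContainer(c) {
			continue // not started by Nomad, leave it alone
		}
		untracked = append(untracked, c.ID)
	}
	return untracked, nil
}
```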
drivers/docker/config.go (outdated):

if err != nil {
	return fmt.Errorf("failed to parse 'container_delay' duration: %v", err)
}
d.config.GC.DanglingContainers.creationTimeout = dur
Validate that this is >0 and probably greater than 10s or 1m or some conservative value to ensure pauses in the Go runtime don't cause us to make the wrong decision (eg a pause between starting a container and tracking it).
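A hedged sketch of that validation as a standalone helper; the 1-minute floor is only an example value, not a decided constant:

```go
// parseCreationGrace validates the configured grace period so runtime pauses
// between starting a container and tracking it cannot trip the reconciler.
func parseCreationGrace(raw string) (time.Duration, error) {
	dur, err := time.ParseDuration(raw)
	if err != nil {
		return 0, fmt.Errorf("failed to parse duration: %v", err)
	}
	// example floor; anything comfortably above expected runtime/GC pauses works
	if dur < time.Minute {
		return 0, fmt.Errorf("duration must be at least 1m, got %v", dur)
	}
	return dur, nil
}
```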
drivers/docker/reconciler.go (outdated):

for _, id := range untracked {
	d.logger.Info("removing untracked container", "container_id", id)
Move to after removal has succeeded.
d.logger.Info("removing untracked container", "container_id", id) | |
d.logger.Info("removed untracked container", "container_id", id) |
drivers/docker/reconciler.go (outdated):

ctx, cancel := context.WithTimeout(d.ctx, 20*time.Second)
defer cancel()

ci, err := client.InspectContainerWithContext(c.ID, ctx)
I think it's safer to skip this check. If a container running Nomad has those 3 directories and a Nomad-esque name, I think we can remove it.
Indeed, I was being conservative here, as a false positive might be troublesome. But it does add extra side effects; I'll remove it and document the change.
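For reference, a hedged sketch of what the mount-path heuristic alone might look like once the inspect/env check is dropped. The three destinations are the standard Nomad task mounts, but the helper name is illustrative:

```go
// looksLikeNomadMounts checks the cheap pre-0.10 heuristic: a Nomad task
// container binds the alloc, local, and secrets directories.
func looksLikeNomadMounts(c docker.APIContainers) bool {
	expected := map[string]bool{
		"/alloc":   false,
		"/local":   false,
		"/secrets": false,
	}
	for _, m := range c.Mounts {
		if _, ok := expected[m.Destination]; ok {
			expected[m.Destination] = true
		}
	}
	for _, seen := range expected {
		if !seen {
			return false
		}
	}
	return true
}
```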
drivers/docker/reconciler_test.go (outdated):

func TestDanglingContainerRemoval(t *testing.T) {
	if !tu.IsCI() {
Is this necessary anymore?
Not sure - this is pretty much cargo-culted from other tests without fully understanding the context.
drivers/docker/reconciler.go (outdated):

func (d *Driver) untrackedContainers(tracked map[string]bool, creationTimeout time.Duration) ([]string, error) {
	result := []string{}

	cc, err := client.ListContainers(docker.ListContainersOptions{})
I believe only running containers are listed by default, which means we won't GC stopped containers. GCing stopped containers introduces two other things to consider:
- We should only GC containers with `taskHandle.removeContainerOnExit` set.
- There is a race between this goroutine and the task runner stopping and removing the container itself. We may need a stop grace period similar to the create grace period to avoid spurious error logs when this race is hit and the container is removed twice. Although maybe it's not worth the complexity?

Regardless of the desired behavior around GCing stopped containers, we should add a test with a stopped container (see the listing sketch below).
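As referenced above, a small sketch of listing stopped containers as well; `All` is the standard go-dockerclient option (whether stopped containers should then actually be GC'd is the open question above):

```go
// include exited/stopped containers, not just running ones, so they can at
// least be considered (and covered by a test) in the reconciler
cc, err := client.ListContainers(docker.ListContainersOptions{
	All: true,
})
if err != nil {
	return nil, fmt.Errorf("failed to list containers: %v", err)
}
```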
		result = append(result, c.ID)
	}
Out of scope for this PR, but I wonder if we shouldn't loop over `tracked` and compare against what's actually running (`cc`). I feel like we've had a report of a container exiting and Nomad "not noticing", but I can't find it now (and maybe it got fixed?).
I'm not sure we even have a mechanism to properly propagate that exit back up to the TaskRunner, but perhaps there's a way to force-kill the dangling task handle such that the TR will notice?
Anyway, a problem for another PR, if ever.
drivers/docker/config.go (outdated):

		hclspec.NewLiteral(`"5m"`),
	),
	"creation_timeout": hclspec.NewDefault(
		hclspec.NewAttr("creation_timeout", "string", false),
`grace` might be a clearer name, as we use it in the check_restart stanza. The "timeout" in `creation_timeout` just makes me think this has to do with API timeouts, not a grace period.
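Concretely, the spec entry could be renamed along these lines, using the same hclspec helpers as the surrounding block (the default value is just carried over):

```go
"creation_grace": hclspec.NewDefault(
	hclspec.NewAttr("creation_grace", "string", false),
	hclspec.NewLiteral(`"5m"`),
),
```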
drivers/docker/reconciler.go
Outdated
case <-timer.C: | ||
if d.previouslyDetected() && d.fingerprintSuccessful() { | ||
err := d.removeDanglingContainersIteration() | ||
if err != nil && succeeded { |
Would you mind adding a comment here that this `succeeded` check is to deduplicate logs? Maybe rename it to `lastIterSucceeded` or something more descriptive.
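Roughly what that comment and rename could look like in the loop; the timer wiring outside this snippet is from the PR, and `period` here stands in for the configured interval:

```go
// lastIterSucceeded is only used to deduplicate logs: warn on the first
// failure after a success, then stay quiet until an iteration succeeds again.
lastIterSucceeded := true
for {
	select {
	case <-d.ctx.Done():
		return
	case <-timer.C:
		if d.previouslyDetected() && d.fingerprintSuccessful() {
			err := d.removeDanglingContainersIteration()
			if err != nil && lastIterSucceeded {
				d.logger.Warn("failed to remove dangling containers", "error", err)
			}
			lastIterSucceeded = err == nil
		}
		timer.Reset(period)
	}
}
```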
drivers/docker/reconciler.go (outdated):

// untrackedContainers returns the ids of containers that suspected
// to have been started by Nomad but aren't tracked by this driver
func (d *Driver) untrackedContainers(tracked map[string]bool, creationTimeout time.Duration) ([]string, error) {
Do we need to pass tracked in here? Why not just get it directly from the driver store?
Having the function take the tracked containers as an argument makes the logic simpler and easier to test IMO; `tracked` is computed from the driver store directly.
Also, given that the driver store maps task IDs to handles with container IDs, building the tracked containers map is O(n) in total, while scanning the store each time to check container presence would result in O(n^2) work - assuming the question is why we don't look up presence while we loop through containers.
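For context, a hedged sketch of how the tracked map gets built once per iteration from the store; the exact store and handle accessors (`d.tasks.List()`, `h.containerID`) are assumptions about the driver internals:

```go
// trackedContainers walks the task store once and returns the set of
// container IDs owned by live task handles.
func (d *Driver) trackedContainers() map[string]bool {
	result := map[string]bool{}
	for _, h := range d.tasks.List() { // single O(n) pass over handles
		result[h.containerID] = true
	}
	return result
}
```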
drivers/docker/reconciler.go
Outdated
return nil, fmt.Errorf("failed to list containers: %v", err) | ||
} | ||
|
||
cutoff := time.Now().Add(-creationTimeout).Unix() |
👍 for using the term grace instead of timeout. Took a min to figure out why we're subtracting a timeout here 😅
When implementing and using Docker labels on Nomad-managed containers, we should consider netns pause containers and #6385. It would be nice if we could handle them with generic reconciliation logic, but if that's not possible we should just ensure this reconciler won't interact poorly with pause containers.
Ensure we wait for some grace period before killing docker containers that may have been launched in an earlier nomad restore.
I have updated this PR and it is ready for re-review.
Remaining issues are fairly small. Feel free to merge whenever!
drivers/docker/driver.go (outdated):

config.Labels = map[string]string{
	dockerLabelTaskID:   task.ID,
	dockerLabelTaskName: task.Name,
	dockerLabelAllocID:  task.AllocID,
Were we just going to ship this label by default initially? I think we can ask users to pay the cost of at least 1 label, but I wasn't sure where we landed on more expanded labels.
done - kept dockerLabelAllocID and removed others.
driver.SetConfig is not appropriate for starting up the reconciler goroutine. Some ephemeral driver instances are created just for validating config, and we ought not to start side-effecting goroutines for those. We currently lack a lifecycle hook to inject these, so I picked the `Fingerprinter` function for now; the reconciler should only run after fingerprinting has started. Use `sync.Once` to ensure that we only start the reconciler loop once.
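A minimal sketch of that `sync.Once` guard, assuming a small reconciler type with a Start method (names are illustrative, not the PR's exact structure):

```go
// containerReconciler runs the dangling-container cleanup loop; Start may be
// called from every Fingerprint invocation but only launches one goroutine.
type containerReconciler struct {
	once sync.Once
}

func (r *containerReconciler) Start(loop func()) {
	r.once.Do(func() {
		go loop()
	})
}
```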
Other labels aren't strictly necessary here, and we may follow up with a better way to customize.
I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
When running at scale, it's possible that the Docker Engine starts
containers successfully but gets wedged in a way where the API call fails.
The Docker Engine may remain unavailable for an arbitrarily long time.

Here, we introduce a periodic reconciliation process that ensures that any
container started by nomad is tracked, and killed if it is running
unexpectedly.

Basically, the periodic job inspects any container that isn't tracked in
its handlers. A creation grace period is used to prevent killing newly
created containers that aren't registered yet.

Also, we aim to avoid killing unrelated containers started by the host or
through the raw_exec driver. The logic is to pattern-match against container
environment variables and mounts to infer whether they are alloc docker
containers.

Lastly, the periodic job can be disabled to avoid any interference if
need be.
On client restart, the creation grace period and the reconciliation period give the client time to restore container handles before any containers are killed.
Some background
Nomad 0.8.7 was brittle, judging by the code in [1]: if a container started but we failed to inspect it, or the docker engine became unavailable, we would leak the container without stopping it, for example.
Nomad 0.9.0 tried to ensure that we remove the container on start failures [2], though it doesn't account for failed creation and doesn't retry: if the engine becomes unavailable at start time, it may be a while until it's available again, so a single removal call isn't sufficient.
[1] https://github.com/hashicorp/nomad/blob/v0.8.7/client/driver/docker.go#L899-L935
[2] https://github.com/hashicorp/nomad/blob/v0.9.0/drivers/docker/driver.go#L279-L284