Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

docker: periodically reconcile containers #6325

Merged
merged 9 commits into from
Oct 18, 2019

Conversation

notnoop
Copy link
Contributor

@notnoop notnoop commented Sep 13, 2019

When running at scale, it's possible that Docker Engine starts
containers successfully but gets wedged in a way where API call fails.
The Docker Engine may remain unavailable for arbitrary long time.

Here, we introduce a periodic reconciliation process that ensures that any
container started by nomad is tracked, and killed if is running
unexpectedly.

Basically, the periodic job inspects any container that isn't tracked in
its handlers. A creation grace period is used to prevent killing newly
created containers that aren't registered yet.

Also, we aim to avoid killing unrelated containers started by host or
through raw_exec drivers. The logic is to pattern against containers
environment variables and mounts to infer if they are an alloc docker
container.

Lastly, the periodic job can be disabled to avoid any interference if
need be.

On client restart, the grace period and period allows client for restorations before killing containers.

Some background

Nomad 0.8.7 was brittle judging by code in [1], If a container started but we failed to inspect it or docker engine became unavailable, we will leak container without stopping it for example.

Nomad 0.9.0 tried to ensure that we remove container on start failures. Though it doesn't account for failed creation and doesn't retry: if engine becomes unavailable at start time, it may be awhile until it's available again, so a single removal call isn't sufficient.

[1] https://github.com/hashicorp/nomad/blob/v0.8.7/client/driver/docker.go#L899-L935
[2] https://github.com/hashicorp/nomad/blob/v0.9.0/drivers/docker/driver.go#L279-L284

@notnoop notnoop added this to the 0.10.0 milestone Sep 13, 2019
Copy link
Member

@tgross tgross left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall this LGTM. I left a few questions/comments.

drivers/docker/reconciler.go Outdated Show resolved Hide resolved
drivers/docker/reconciler.go Outdated Show resolved Hide resolved
drivers/docker/driver.go Outdated Show resolved Hide resolved
drivers/docker/reconciler.go Outdated Show resolved Hide resolved
drivers/docker/reconciler.go Outdated Show resolved Hide resolved
drivers/docker/reconciler_test.go Outdated Show resolved Hide resolved
@schmichael schmichael modified the milestones: 0.10.0, 0.10.1 Sep 17, 2019
Copy link
Member

@schmichael schmichael left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work! This should help a lot of Docker users!

Add a test from at least the TaskRunner level to ensure there's no undesirable interaction between this and TaskRunner's (indirect) container management.

Not a blocker, but I'd prefer this be an independent struct like the image coordinator to help manage its dependencies and scope.

}

func (d *Driver) isNomadContainer(c docker.APIContainers) bool {
if _, ok := c.Labels["com.hashicorp.nomad.alloc_id"]; ok {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's use a const for this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually where is this set? I didn't think we automatically set any labels?

Copy link
Contributor Author

@notnoop notnoop Sep 18, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very observant :). This isn't being set anywhere yet; I intend to resurrect #5153 to set them and then use a const there for setting it.

}

// pre-0.10 containers aren't tagged or labeled in any way,
// so use cheap heauristic based on mount paths
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// so use cheap heauristic based on mount paths
// so use cheap heuristic based on mount paths

return false
}

func (d *Driver) trackedContainers() map[string]bool {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of building a point-in-time-snapshot of tracked containers, I think this should call d.tasks.Get(...) in untrackedContainers main loop instead of checking the map. The scenario I think this avoids is:

  1. tracked map built
  2. container list request to dockerd sent
  3. 4m pass because of load
  4. container list returned
  5. cutoff is set
  6. a number of 0.9 containers exist so InspectContainer is called against a slow dockerd and 1m passes

At this point the tracked map is >5m old so any containers created since are treated as untracked and eligible for GC.

Removing the InspectContainer call may be sufficient to fix this scenario, but I don't see a reason to build a copy of tracked containers vs doing individual gets.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch - I'd be in favor of re-ording operations so cutoff is taken before any lookups. I think it's much an easier system if we can reduce mutating data.

I find it much easier to reason and test around time-snapshotted data (and mutating container state), as opposed to changing container lists, changing handler, and changing containers. If we want to use this reconciler so we detect undetected-exited containers and kill containers, having both lists being mutated makes it tricky imo.

if err != nil {
return fmt.Errorf("failed to parse 'container_delay' duration: %v", err)
}
d.config.GC.DanglingContainers.creationTimeout = dur
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Validate that this is >0 and probably greater than 10s or 1m or some conservative value to ensure pauses in the Go runtime don't cause us to make the wrong decision (eg a pause between starting a container and tracking it).

drivers/docker/config.go Show resolved Hide resolved
}

for _, id := range untracked {
d.logger.Info("removing untracked container", "container_id", id)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Move to after removal has succeeded.

Suggested change
d.logger.Info("removing untracked container", "container_id", id)
d.logger.Info("removed untracked container", "container_id", id)

ctx, cancel := context.WithTimeout(d.ctx, 20*time.Second)
defer cancel()

ci, err := client.InspectContainerWithContext(c.ID, ctx)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's safer to skip this check. If a container running Nomad has those 3 directories and a Nomad-esque name, I think we can remove it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, I was being conservative here, as false positive might be troublesome. But it does indeed add additional side effects; I'll remove it and document it.

drivers/docker/reconciler_test.go Show resolved Hide resolved
}

func TestDanglingContainerRemoval(t *testing.T) {
if !tu.IsCI() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this necessary anymore?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure - this is pretty much cargo-culting from other tests without fully understanding context.

func (d *Driver) untrackedContainers(tracked map[string]bool, creationTimeout time.Duration) ([]string, error) {
result := []string{}

cc, err := client.ListContainers(docker.ListContainersOptions{})
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe only running containers are listed by default which means we won't GC stopped containers.

Although GCing stopped containers introduces 2 other things to consider:

  1. We should only GC containers with taskHandle.removeContainerOnExit set.
  2. There is a race between this goroutine and task runner stopping and removing the container itself. We may need a stop grace period similar to the create grace period to avoid spurious error logs when this race is hit and the container is removed twice. Although maybe it's not worth the complexity?

Regardless of desired behavior around GCing stopped containers, we should add a test with a stopped container.


result = append(result, c.ID)
}

Copy link
Member

@schmichael schmichael Sep 17, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Out of scope for this PR, but I wonder if we shouldn't loop over tracked and compare against what's actually running (cc). I feel like we've had a report of a container exiting and Nomad "not noticing", but I can't find it now (and maybe it got fixed?).

I'm not sure we even have a mechanism to properly propagate that exit back up to the TaskRunner, but perhaps there's a way to force kill the dangling task handle such that TR will notice?

Anyway, a problem for another PR if ever.

hclspec.NewLiteral(`"5m"`),
),
"creation_timeout": hclspec.NewDefault(
hclspec.NewAttr("creation_timeout", "string", false),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

grace might be a more clear name as we use it in the check restart stanza

The "timeout" in creation_timeout just makes me think this has to do with API timeouts, not a grace period.

case <-timer.C:
if d.previouslyDetected() && d.fingerprintSuccessful() {
err := d.removeDanglingContainersIteration()
if err != nil && succeeded {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would you mind adding a comment here that this succeeded check is to deduplicate logs. Maybe renaming it to lastIterSucceeded or something more descriptive.


// untrackedContainers returns the ids of containers that suspected
// to have been started by Nomad but aren't tracked by this driver
func (d *Driver) untrackedContainers(tracked map[string]bool, creationTimeout time.Duration) ([]string, error) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to pass tracked in here? Why not just get it directly for the driver store?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having function taking tracked containers as an argument makes logic simpler and easier to test IMO. tracked is computed from driver store directly.

Also, given that driver store maps task id to handlers with container id, looking up tracked containers map is O(n) in total, while scanning each time to check container presence in map will result into O(n^2) logic, assuming the question being why we don't lookup presence while we loop through containers.

return nil, fmt.Errorf("failed to list containers: %v", err)
}

cutoff := time.Now().Add(-creationTimeout).Unix()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 for using the term grace instead of timeout. Took a min to figure our why we're subtracting a timeout here 😅

@schmichael
Copy link
Member

When implementing and using Docker labels on Nomad managed containers we should consider netns pause containers and #6385. Would be nice if we could handle them with generic reconciliation logic, but if that's not possible we should just ensure this reconciler won't interact poorly with pause containers.

Mahmood Ali added 3 commits October 17, 2019 08:36
When running at scale, it's possible that Docker Engine starts
containers successfully but gets wedged in a way where API call fails.
The Docker Engine may remain unavailable for arbitrary long time.

Here, we introduce a periodic reconcilation process that ensures that any
container started by nomad is tracked, and killed if is running
unexpectedly.

Basically, the periodic job inspects any container that isn't tracked in
its handlers.  A creation grace period is used to prevent killing newly
created containers that aren't registered yet.

Also, we aim to avoid killing unrelated containters started by host or
through raw_exec drivers.  The logic is to pattern against containers
environment variables and mounts to infer if they are an alloc docker
container.

Lastly, the periodic job can be disabled to avoid any interference if
need be.
Ensure we wait for some grace period before killing docker containers
that may have launched in earlier nomad restore.
@notnoop notnoop force-pushed the b-docker-reconcile-periodically branch from dcf9bcb to 97f0875 Compare October 17, 2019 12:37
@notnoop notnoop force-pushed the b-docker-reconcile-periodically branch from 97f0875 to 95bc9b3 Compare October 17, 2019 14:29
@notnoop notnoop requested a review from schmichael October 17, 2019 14:32
@notnoop notnoop force-pushed the b-docker-reconcile-periodically branch from 95bc9b3 to 8c3136a Compare October 17, 2019 14:45
@notnoop
Copy link
Contributor Author

notnoop commented Oct 17, 2019

I have updated this PR and is ready for re-review:

  • Refactor reconciler to be a separate struct and a bit more generic so we can extract it if we want in future
  • added docker label tags and use const as appropriate
  • Added more tests to cover stopped container, and better coverage for labels and grace period handling

Copy link
Member

@schmichael schmichael left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remaining issues are fairly small. Feel free to merge whenever!

config.Labels = map[string]string{
dockerLabelTaskID: task.ID,
dockerLabelTaskName: task.Name,
dockerLabelAllocID: task.AllocID,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Were we just going to ship this label by default initially? I think we can ask users to pay the cost of at least 1 label, but I wasn't sure where we landed on more expanded labels.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done - kept dockerLabelAllocID and removed others.

drivers/docker/driver.go Outdated Show resolved Hide resolved
drivers/docker/reconciler.go Outdated Show resolved Hide resolved
Mahmood Ali added 3 commits October 18, 2019 14:43
driver.SetConfig is not appropriate for starting up reconciler
goroutine.  Some ephemeral driver instances are created for validating
config and we ought not to side-effecting goroutines for those.

We currently lack a lifecycle hook to inject these, so I picked the
`Fingerprinter` function for now, and reconciler should only run after
fingerprinter started.

Use `sync.Once` to ensure that we only start reconciler loop once.
Other labels aren't strictly necessary here, and we may follow up with a
better way to customize.
@notnoop notnoop force-pushed the b-docker-reconcile-periodically branch from 4114138 to c64647c Compare October 18, 2019 19:31
@notnoop notnoop merged commit 75acbcc into master Oct 18, 2019
@notnoop notnoop deleted the b-docker-reconcile-periodically branch October 18, 2019 19:53
@github-actions
Copy link

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jan 28, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants