fix failed log collection from short-lived containers (docklog) #10910
base: main
Conversation
hey hashicorp! can i help anyone review this code? i have a suspicion that a large cohort of nomad users run docker containers. the current docker log collection may have several faults which many users can run into. i would love to get this merged and/or at least a discussion around it, given it could impact many, many users (and obviously impacts us!) :D
Any chance this can be reviewed? I've been playing around with short-lived containers and am noticing that logs are being dropped on the floor, and I suspect this is the primary suspect.
Short-lived containers (especially those < 1 second) often do not have their logs sent to Nomad.

This PR adjusts the nomad docker driver and docker logging driver to:

1. Enable docklog to run after a container has stopped (to some grace period limit)
2. Collect logs from stopped containers up until the grace period

This fixes the current issues:

1. docklog is killed by the handle as soon as the task finishes, which means fast running containers can never have their logs scraped
2. docklog quits streaming logs in its event loop if the container has stopped

In order to do this, we need to know _whether_ we have read logs for the current container in order to apply a grace period. We add a copier to the fifo streams which sets an atomic flag, letting us know whether we need to retry reading the logs and use a grace period or if we can quit early.

Fixes hashicorp#2475, hashicorp#6931.

Always wait to read from logs before exiting
Store number of bytes read vs a simple counter
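For illustration only, not the PR's actual code: a minimal Go sketch of the grace-period idea described above. The `logCollector` type, the `bytesRead`/`kill` names, and the 50 ms poll interval are assumptions; the real change lives in the driver handle and the docklog plugin.

```go
package logsketch

import (
	"sync/atomic"
	"time"
)

// logCollector is a hypothetical stand-in for the driver handle / docklog pair.
type logCollector struct {
	bytesRead int64  // incremented atomically by the fifo copier
	kill      func() // stops the logging plugin, e.g. dloggerPluginClient.Kill()
}

// stopWithGrace waits until either some log bytes have been read or the grace
// period expires before stopping the collector, so a container that exits in
// well under a second still gets its output shipped.
func (c *logCollector) stopWithGrace(grace time.Duration) {
	deadline := time.After(grace)
	tick := time.NewTicker(50 * time.Millisecond)
	defer tick.Stop()
	for {
		select {
		case <-deadline:
			c.kill() // grace period exhausted; stop regardless
			return
		case <-tick.C:
			if atomic.LoadInt64(&c.bytesRead) > 0 {
				c.kill() // logs observed; safe to stop without truncation
				return
			}
		}
	}
}
```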
Thanks for the submission and sorry it has languished so long.
I'm inclined to accept it because we have had so many corroborating reports; however, I still can't manage to reproduce the bug myself!
I've been trying jobspecs like this and then asserting that the full date can be read via `nomad alloc logs`, and I have not been able to reproduce the bug.
Is there something I'm missing?
job "hello" {
datacenters = ["dc1"]
type = "batch"
group "hello" {
count = 20
task "hello" {
driver = "docker"
config {
image = "busybox"
command = "date"
}
}
}
}
Obviously after all of this time this PR needs rebasing, a changelog entry, and website/docs update, but I can handle that if you'd like!
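For reference, the kind of check being described (standard Nomad CLI commands; the job file name and allocation ID below are placeholders):

```sh
# Run the jobspec above, then spot-check a completed allocation's output.
nomad job run hello.nomad
nomad job status hello        # note a completed allocation ID
nomad alloc logs <alloc-id>   # should print the output of `date`
```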
}
h.dloggerPluginClient.Kill()
go func() {
    h.logger.Info("sending stop signal to logger", "job_id", h.task.JobID, "container_id", h.containerID)
Does this need to be info level?
h.logger.Info("sending stop signal to logger", "job_id", h.task.JobID, "container_id", h.containerID) | |
h.logger.Debug("sending stop signal to logger", "job_id", h.task.JobID, "container_id", h.containerID) |
// read indicates whether we have read anything from the logs. This is manipulated
// using the sync package via multiple goroutines.
read int64
AFAICT `read` serves 2 purposes:
- If we've ever read anything, we can exit immediately if the container has exited and be safe from the truncation bug. (L160)
- If we've received at least 1 read within 1s of container exit, we can exit safely. (L110)
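A minimal Go sketch of the copier idea those two checks rely on (hypothetical names; the actual change wires this into the task's fifo streams): an io.Writer wrapper that atomically records how many bytes have passed through, so other goroutines can tell "nothing has arrived yet" apart from "logs were read and the stream ended".

```go
package logsketch

import (
	"io"
	"sync/atomic"
)

// countingWriter wraps the destination of a fifo copy and atomically records
// how many bytes have flowed through it.
type countingWriter struct {
	dst io.Writer
	n   *int64 // shared with the goroutine deciding when it is safe to exit
}

func (w countingWriter) Write(p []byte) (int, error) {
	n, err := w.dst.Write(p)
	atomic.AddInt64(w.n, int64(n))
	return n, err
}

// copyLogs streams container output into the task's log fifo while recording
// progress in read.
func copyLogs(dst io.Writer, src io.Reader, read *int64) (int64, error) {
	return io.Copy(countingWriter{dst: dst, n: read}, src)
}
```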
Hm, on further testing with the … This is arguably better than dropping the logs, though.

Bash script I used when testing with that jobspec:

#!/bin/bash
completed=$(nomad job status -all-allocs hello9000 | rg 'complete' | cut -f1 -d' ')
for alloc in $completed; do
  echo "$alloc" $(nomad alloc logs $alloc)
done
I'm no longer working on this particular logging problem and don't have the time to investigate dupes or carry this; apologies. The framework and explanations of the three issues are hopefully helpful, though!
Short-lived containers (especially those < 1 second) often do not have their logs sent to Nomad.

This PR adjusts the nomad docker driver and docker logging driver to:

1. Enable docklog to run after a container has stopped (to some grace period limit)
2. Collect logs from stopped containers up until the grace period

This fixes the current issues:

1. docklog is killed by the handle as soon as the task finishes, which means fast running containers can never have their logs scraped
2. docklog quits streaming logs in its event loop if the container has stopped
3. … destroys them after creating, if the container is stopped, so the logs can never be streamed to nomad

In order to do this, we need to know whether we have read logs for the current container in order to apply a grace period. We add a copier to the fifo streams which sets an atomic flag, letting us know whether we need to retry reading the logs and use a grace period or if we can quit early.

Fixes #2457, #6931.

Notes: I'm not sure on the coding style and am not necessarily a fan of the two goroutines within Start(). This fixes the issues (with tests), but I can change it or it can be carried by someone who knows your style a little more :)
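For point 2 above, a rough sketch of the adjusted event-loop behaviour (hypothetical helper signatures; the real docklog drives the Docker client's log streaming): instead of returning as soon as the container is stopped, keep retrying until output has been read or the grace period elapses.

```go
package logsketch

import (
	"sync/atomic"
	"time"
)

// streamUntilRead keeps attempting to stream a stopped container's logs until
// either some bytes have been read or the grace period has elapsed, instead of
// bailing out of the loop the moment the container is no longer running.
func streamUntilRead(stream func() error, stopped func() bool, read *int64, grace time.Duration) error {
	deadline := time.Now().Add(grace)
	for {
		if err := stream(); err != nil {
			return err
		}
		if !stopped() {
			continue // container still running; keep streaming
		}
		// Container has stopped: exit once output has been observed, or once
		// the grace period for late-arriving logs is over.
		if atomic.LoadInt64(read) > 0 || time.Now().After(deadline) {
			return nil
		}
		time.Sleep(100 * time.Millisecond)
	}
}
```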