Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix failed log collection from short-lived containers (docklog) #10910

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

tonyhb
Copy link

@tonyhb tonyhb commented Jul 19, 2021

Short-lived containers (especially those < 1 second) often do not have
thier logs sent to Nomad.

This PR adjusts the nomad docker driver and docker logging driver to:

  1. Enable docklog to run after a container has stopped (to some grace
    period limit)
  2. Collect logs from stopped containers up until the grace period

This fixes the current issues:

  1. docklog is killed by the handle as soon as the task finishes, which
    means fast running containers can never have their logs scraped
  2. docklog quits streaming logs in its event loop if the container has
    stopped
  3. docklog closes the fifo streams as soon as the container quits (or
    destroys them after creating, if the container is stopped), so the logs
    can never be streamed to nomad.

In order to do this, we need to know whether we have read logs for the
current container in order to apply a grace period. We add a copier to
the fifo streams which sets an atomic flag, letting us know whether we
need to retry reading the logs and use a grace period or if we can quit
early.

Fixes #2457, #6931.

Notes; I'm not sure on the coding style and am not necessarily a fan
of the two goroutines within Start(). This fixes the issues (with tests)
but I can change it or it can be carried by someone who knows your style
a little more :)

@tonyhb
Copy link
Author

tonyhb commented Aug 2, 2021

hey hashicorp! can i help anyone review this code? i have a suspicion that a large cohort of nomad users run docker containers. the current docker log collection may have several faults which many users can run into.

i would love to get this merged and/or at least a discussion around it, given it could impact many, many users (and obviously impacts us!) :D

@hashicorp-cla
Copy link

hashicorp-cla commented Mar 12, 2022

CLA assistant check
All committers have signed the CLA.

@shoenig shoenig self-requested a review March 16, 2022 13:52
@josegonzalez
Copy link
Contributor

Any chance this can be reviewed? I've been playing around with short-lived containers and am noticing that logs are being dropped on the floor, and I suspect this is the primary suspect.

Short-lived containers (especially those < 1 second) often do not have
thier logs sent to Nomad.

This PR adjusts the nomad docker driver and docker logging driver to:

1. Enable docklog to run after a container has stopped (to some grace
   period limit)
2. Collect logs from stopped containers up until the grace period

This fixes the current issues:

1. docklog is killed by the handlea as soon as the task finishes, which
   means fast running containers can never have their logs scraped
2. docklog quits streaming logs in its event loop if the container has
   stopped

In order to do this, we need to know _whether_ we have read logs for the
current container in order to apply a grace period.  We add a copier to
the fifo streams which sets an atomic flag, letting us know whether we
need to retry reading the logs and use a grace period or if we can quit
early.

Fixes hashicorp#2475, hashicorp#6931.

Always wait to read from logs before exiting

Store number of bytes read vs a simple counter
Copy link
Member

@schmichael schmichael left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the submission and sorry it has languished so long.

I'm inclined to accept it because we have had so many corroborating, however I still can't manage to reproduce the bug myself!

I've been trying jobspecs like this and then asserting that the full date can be read via nomad alloc logs, and I have not been able to reproduce the bug.

Is there something I'm missing?

job "hello" {
  datacenters = ["dc1"]

  type = "batch"

  group "hello" {
    count = 20
    task "hello" {
      driver = "docker"
      config {
        image   = "busybox"
        command = "date"
      }
    }
  }
}

Obviously after all of this time this PR needs rebasing, a changelog entry, and website/docs update, but I can handle that if you'd like!

}
h.dloggerPluginClient.Kill()
go func() {
h.logger.Info("sending stop signal to logger", "job_id", h.task.JobID, "container_id", h.containerID)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this need to be info level?

Suggested change
h.logger.Info("sending stop signal to logger", "job_id", h.task.JobID, "container_id", h.containerID)
h.logger.Debug("sending stop signal to logger", "job_id", h.task.JobID, "container_id", h.containerID)


// read indicates whether we have read anything from the logs. This is manipulated
// using the sync package via multiple goroutines.
read int64
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAICT read serves 2 purposes:

  1. If we've ever read anything, we can exit immediately if the container has exited and be safe from the truncation bug. (L160)
  2. If we've received at least 1 read within 1s of container exit, we can exit safely. (L110)

@schmichael
Copy link
Member

Hm, on future testing with the hello9000.nomad jobspec I'm seeing logs getting emitted 2 or 3 times. I think because docker_logger's main for loop now continues after the container has exited and the unix timestamp in the Since field is insufficient to prevent reading the last seconds logs more than once.

This is arguably better than dropping the logs the though.

Bash script I used when testing with that jobspec:

#!/bin/bash

completed=$(nomad job status  -all-allocs  hello9000 | rg 'complete' | cut -f1 -d' ')

for alloc in $completed; do
    echo "$alloc" $(nomad alloc logs $alloc)
done

@tonyhb
Copy link
Author

tonyhb commented Mar 22, 2023

I'm no longer working on this particular logging problem and don't have the time to investigate dupes or carry this; apologies. Framework and explanations of the three issues are hopefully helpful, though!

@tgross tgross added the stage/needs-rebase This PR needs to be rebased on main before it can be backported to pick up new BPA workflows label May 17, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
stage/needs-rebase This PR needs to be rebased on main before it can be backported to pick up new BPA workflows stage/waiting-reply
Projects
Development

Successfully merging this pull request may close these issues.

Nomad Logs doesn't return logs on short running containers.
5 participants