fix failed log collection from short-lived containers (docklog) #10910
base: main
Conversation
hey hashicorp! can i help anyone review this code? i have a suspicion that a large cohort of nomad users run docker containers. the current docker log collection may have several faults which many users can run into. i would love to get this merged and/or at least a discussion around it, given it could impact many, many users (and obviously impacts us!) :D
Any chance this can be reviewed? I've been playing around with short-lived containers and am noticing that logs are being dropped on the floor, and I suspect this is the primary suspect.
Short-lived containers (especially those < 1 second) often do not have their logs sent to Nomad.

This PR adjusts the nomad docker driver and docker logging driver to:

1. Enable docklog to run after a container has stopped (to some grace period limit)
2. Collect logs from stopped containers up until the grace period

This fixes the current issues:

1. docklog is killed by the handle as soon as the task finishes, which means fast running containers can never have their logs scraped
2. docklog quits streaming logs in its event loop if the container has stopped

In order to do this, we need to know _whether_ we have read logs for the current container in order to apply a grace period. We add a copier to the fifo streams which sets an atomic flag, letting us know whether we need to retry reading the logs and use a grace period or if we can quit early.

Fixes hashicorp#2475, hashicorp#6931.

Always wait to read from logs before exiting
Store number of bytes read vs a simple counter
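For illustration only, not the PR's actual code: a minimal Go sketch of the grace-period idea described above. The `logCollector` type, the `bytesRead`/`kill` names, and the 50 ms poll interval are assumptions; the real change lives in the driver handle and the docklog plugin.

```go
package logsketch

import (
	"sync/atomic"
	"time"
)

// logCollector is a hypothetical stand-in for the driver handle / docklog pair.
type logCollector struct {
	bytesRead int64  // incremented atomically by the fifo copier
	kill      func() // stops the logging plugin, e.g. dloggerPluginClient.Kill()
}

// stopWithGrace waits until either some log bytes have been read or the grace
// period expires before stopping the collector, so a container that exits in
// well under a second still gets its output shipped.
func (c *logCollector) stopWithGrace(grace time.Duration) {
	deadline := time.After(grace)
	tick := time.NewTicker(50 * time.Millisecond)
	defer tick.Stop()
	for {
		select {
		case <-deadline:
			c.kill() // grace period exhausted; stop regardless
			return
		case <-tick.C:
			if atomic.LoadInt64(&c.bytesRead) > 0 {
				c.kill() // logs observed; safe to stop without truncation
				return
			}
		}
	}
}
```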
Thanks for the submission and sorry it has languished so long.
I'm inclined to accept it because we have had so many corroborating reports; however, I still can't manage to reproduce the bug myself!
I've been trying jobspecs like this and then asserting that the full date can be read via `nomad alloc logs`, and I have not been able to reproduce the bug.
Is there something I'm missing?
job "hello" {
datacenters = ["dc1"]
type = "batch"
group "hello" {
count = 20
task "hello" {
driver = "docker"
config {
image = "busybox"
command = "date"
}
}
}
}
Obviously after all of this time this PR needs rebasing, a changelog entry, and website/docs update, but I can handle that if you'd like!
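For reference, the kind of check being described (standard Nomad CLI commands; the job file name and allocation ID below are placeholders):

```sh
# Run the jobspec above, then spot-check a completed allocation's output.
nomad job run hello.nomad
nomad job status hello        # note a completed allocation ID
nomad alloc logs <alloc-id>   # should print the output of `date`
```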
}
h.dloggerPluginClient.Kill()
go func() {
    h.logger.Info("sending stop signal to logger", "job_id", h.task.JobID, "container_id", h.containerID)
Does this need to be info level?
h.logger.Info("sending stop signal to logger", "job_id", h.task.JobID, "container_id", h.containerID) | |
h.logger.Debug("sending stop signal to logger", "job_id", h.task.JobID, "container_id", h.containerID) |
// read indicates whether we have read anything from the logs. This is manipulated
// using the sync package via multiple goroutines.
read int64
AFAICT `read` serves 2 purposes:
- If we've ever read anything, we can exit immediately if the container has exited and be safe from the truncation bug. (L160)
- If we've received at least 1 read within 1s of container exit, we can exit safely. (L110)
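A minimal Go sketch of the copier idea those two checks rely on (hypothetical names; the actual change wires this into the task's fifo streams): an io.Writer wrapper that atomically records how many bytes have passed through, so other goroutines can tell "nothing has arrived yet" apart from "logs were read and the stream ended".

```go
package logsketch

import (
	"io"
	"sync/atomic"
)

// countingWriter wraps the destination of a fifo copy and atomically records
// how many bytes have flowed through it.
type countingWriter struct {
	dst io.Writer
	n   *int64 // shared with the goroutine deciding when it is safe to exit
}

func (w countingWriter) Write(p []byte) (int, error) {
	n, err := w.dst.Write(p)
	atomic.AddInt64(w.n, int64(n))
	return n, err
}

// copyLogs streams container output into the task's log fifo while recording
// progress in read.
func copyLogs(dst io.Writer, src io.Reader, read *int64) (int64, error) {
	return io.Copy(countingWriter{dst: dst, n: read}, src)
}
```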
Hm, on further testing with the … This is arguably better than dropping the logs, though.

Bash script I used when testing with that jobspec:

#!/bin/bash
completed=$(nomad job status -all-allocs hello9000 | rg 'complete' | cut -f1 -d' ')
for alloc in $completed; do
  echo "$alloc" $(nomad alloc logs $alloc)
done
I'm no longer working on this particular logging problem and don't have the time to investigate dupes or carry this; apologies. The framework and explanations of the three issues are hopefully helpful, though!
Short-lived containers (especially those < 1 second) often do not have their logs sent to Nomad.

This PR adjusts the nomad docker driver and docker logging driver to:

1. Enable docklog to run after a container has stopped (to some grace period limit)
2. Collect logs from stopped containers up until the grace period

This fixes the current issues:

1. docklog is killed by the handle as soon as the task finishes, which means fast running containers can never have their logs scraped
2. docklog quits streaming logs in its event loop if the container has stopped
3. … destroys them after creating, if the container is stopped, so the logs can never be streamed to nomad

In order to do this, we need to know whether we have read logs for the current container in order to apply a grace period. We add a copier to the fifo streams which sets an atomic flag, letting us know whether we need to retry reading the logs and use a grace period or if we can quit early.

Fixes #2457, #6931.

Notes: I'm not sure on the coding style and am not necessarily a fan of the two goroutines within Start(). This fixes the issues (with tests), but I can change it or it can be carried by someone who knows your style a little more :)
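For point 2 above, a rough sketch of the adjusted event-loop behaviour (hypothetical helper signatures; the real docklog drives the Docker client's log streaming): instead of returning as soon as the container is stopped, keep retrying until output has been read or the grace period elapses.

```go
package logsketch

import (
	"sync/atomic"
	"time"
)

// streamUntilRead keeps attempting to stream a stopped container's logs until
// either some bytes have been read or the grace period has elapsed, instead of
// bailing out of the loop the moment the container is no longer running.
func streamUntilRead(stream func() error, stopped func() bool, read *int64, grace time.Duration) error {
	deadline := time.Now().Add(grace)
	for {
		if err := stream(); err != nil {
			return err
		}
		if !stopped() {
			continue // container still running; keep streaming
		}
		// Container has stopped: exit once output has been observed, or once
		// the grace period for late-arriving logs is over.
		if atomic.LoadInt64(read) > 0 || time.Now().After(deadline) {
			return nil
		}
		time.Sleep(100 * time.Millisecond)
	}
}
```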