
journald logger: fix race condition #10431

Merged
merged 1 commit into containers:master on May 26, 2021

Conversation

vrothberg
Member

Fix a race in the journald driver. Following the logs implies streaming
until the container is dead. Streaming happened in one goroutine, while
waiting for the container to exit/die and signaling that event happened
in another goroutine.

Having two goroutines run simultaneously is pretty much the core of the
race condition: when the streaming goroutine received the signal that the
container had exited, it may not yet have read and written all of the
container's logs.

Fix this race by reading both the logs and the events of the container,
and stop streaming once the died/exited event has been read. The died
event is guaranteed to come after all logs in the journal, which guarantees
not only consistency but also deterministic behavior.

Note that the journald log driver now requires the journald event
backend to be set.

Fixes: #10323
Signed-off-by: Valentin Rothberg [email protected]
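
To make the single-reader approach concrete, below is a minimal, hypothetical Go sketch using the go-systemd sdjournal bindings: it reads the container's log entries and its container events from the same journal cursor and stops once a died/exited event appears. The journal field names (CONTAINER_ID_FULL, PODMAN_ID, PODMAN_EVENT) and the function names are illustrative assumptions, not Podman's actual implementation.

```go
package main

import (
	"fmt"
	"os"
	"time"

	"github.com/coreos/go-systemd/v22/sdjournal"
)

// followContainerLogs streams log lines for the given container until its
// "died"/"exited" event shows up in the journal. Reading logs and events
// through the same journal cursor removes the race: the died event is
// appended after the final log line, so streaming never stops early.
//
// NOTE: sketch only. The field names below (CONTAINER_ID_FULL, PODMAN_ID,
// PODMAN_EVENT) are assumptions and may not match Podman's journald driver.
func followContainerLogs(containerID string, write func(string)) error {
	j, err := sdjournal.NewJournal()
	if err != nil {
		return err
	}
	defer j.Close()

	// Match the container's log entries ...
	if err := j.AddMatch("CONTAINER_ID_FULL=" + containerID); err != nil {
		return err
	}
	// ... OR its container events (hypothetical field name).
	if err := j.AddDisjunction(); err != nil {
		return err
	}
	if err := j.AddMatch("PODMAN_ID=" + containerID); err != nil {
		return err
	}

	for {
		n, err := j.Next()
		if err != nil {
			return err
		}
		if n == 0 {
			// No new entry yet: block until the journal changes.
			j.Wait(time.Second)
			continue
		}
		entry, err := j.GetEntry()
		if err != nil {
			return err
		}
		if ev, ok := entry.Fields["PODMAN_EVENT"]; ok {
			if ev == "died" || ev == "exited" {
				return nil // the died event follows all log entries
			}
			continue // ignore other events (start, attach, ...)
		}
		write(entry.Fields["MESSAGE"])
	}
}

func main() {
	if len(os.Args) != 2 {
		fmt.Fprintln(os.Stderr, "usage: follow <full-container-id>")
		os.Exit(1)
	}
	if err := followContainerLogs(os.Args[1], func(line string) { fmt.Println(line) }); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```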

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress and approved labels May 21, 2021
@vrothberg
Member Author

WIP for now, as I still need to add some error checks to enforce the journald log/event requirement.

I also want to go through the tests and run a bunch of tests with the journald logger.
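
As a rough sketch of the kind of check meant here (hypothetical names, not Podman's actual code): a container that uses the journald log driver must also have the journald events backend configured, otherwise following the logs could never see the died event.

```go
package main

import "fmt"

// validateJournaldLogging is a hypothetical sketch of the enforcement
// described above; the parameter names are illustrative, not Podman's API.
func validateJournaldLogging(logDriver, eventsBackend string) error {
	if logDriver == "journald" && eventsBackend != "journald" {
		return fmt.Errorf("--log-driver=journald requires the journald events backend (events_logger in containers.conf)")
	}
	return nil
}

func main() {
	fmt.Println(validateJournaldLogging("journald", "file"))     // error
	fmt.Println(validateJournaldLogging("journald", "journald")) // <nil>
}
```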

@vrothberg
Member Author

@mheon @edsantiago PTAL if you find time. Early feedback is good feedback.

The reproducer from #10323 is running well on my machine.

@edsantiago
Member

I can't look closely right now, sorry. Can you try clearing this and re-pushing?

# Force the k8s-file driver until #10323 is fixed.
run_podman run --log-driver=k8s-file -d $IMAGE sh -c \

@vrothberg
Member Author

vrothberg commented May 25, 2021

The origami continues :^) It turns out --events-backend is ignored (after validation) and is always set to the default from containers.conf.

Will spin up another PR to address that first.

EDIT: My bad; the events-backend is not stored in the container.

@vrothberg vrothberg force-pushed the journald-logs branch 2 times, most recently from 31576ae to 5a38974 Compare May 26, 2021 08:38
@vrothberg
Member Author

@edsantiago, can you take a look at the system-test changes?

@vrothberg vrothberg changed the title WIP - journald logger: fix race condition journald logger: fix race condition May 26, 2021
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress label May 26, 2021
@vrothberg
Member Author

@containers/podman-maintainers PTAL

Member

@Luap99 Luap99 left a comment


LGTM

Member

@edsantiago edsantiago left a comment


Teeny fixes requested

test/system/035-logs.bats: three review comment threads (two outdated, all resolved)
@openshift-ci
Contributor

openshift-ci bot commented May 26, 2021

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: edsantiago, vrothberg

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [edsantiago,vrothberg]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@vrothberg
Member Author

The flake is stubborn

@edsantiago
Member

Six flake restarts later, tests are green. Tests LGTM.

@rhatdan
Member

rhatdan commented May 26, 2021

/lgtm

@openshift-ci openshift-ci bot added the lgtm label May 26, 2021
@openshift-merge-robot openshift-merge-robot merged commit 5b4ffc7 into containers:master May 26, 2021
@vrothberg vrothberg deleted the journald-logs branch May 27, 2021 06:23
@github-actions github-actions bot added the locked - please file new issue/PR label Sep 23, 2023
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Sep 23, 2023
Linked issue: race condition in logs -f with journald driver (#10323)