Add a backoff and retries to retrieving exited event #11681

Merged
1 commit merged into containers:main on Sep 22, 2021

Member

@mheon mheon commented Sep 21, 2021

There's a potential race around extremely short-running containers and events with journald. Events may not be written for some time (small, but appreciable) after they are received, so we can fail to retrieve one if there is a sufficiently short time between us writing the event and trying to read it.

Work around this by just retrying, with a 0.25 second delay between retries, up to 4 times.

Fixes #11633
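
For readers who want to see the shape of the workaround, here is a minimal Go sketch of a fixed-delay retry loop. It is not the code from this PR: `retryLookup`, `errEventNotFound`, and the simulated lookup are hypothetical names invented for illustration, and the sketch reads "up to 4 times" as four attempts in total, which is an assumption.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errEventNotFound is a hypothetical sentinel error standing in for
// "the exited event is not in the journal yet".
var errEventNotFound = errors.New("unable to find event")

// retryLookup calls lookup up to attempts times, sleeping for delay
// before every attempt after the first. Only errEventNotFound is
// retried; any other error is returned immediately.
func retryLookup(attempts int, delay time.Duration, lookup func() (int, error)) (int, error) {
	if attempts < 1 {
		attempts = 1
	}
	var lastErr error
	for i := 0; i < attempts; i++ {
		if i > 0 {
			time.Sleep(delay)
		}
		code, err := lookup()
		if err == nil {
			return code, nil
		}
		if !errors.Is(err, errEventNotFound) {
			return 0, err
		}
		lastErr = err
	}
	return 0, fmt.Errorf("after %d attempts: %w", attempts, lastErr)
}

func main() {
	calls := 0
	// Simulated lookup: the event only becomes visible on the third try.
	lookup := func() (int, error) {
		calls++
		if calls < 3 {
			return 0, errEventNotFound
		}
		return 0, nil
	}

	// 0.25 second delay, four attempts, matching the numbers in the description.
	code, err := retryLookup(4, 250*time.Millisecond, lookup)
	fmt.Println(code, err, calls)
}
```

A fixed delay keeps the worst-case added latency small and predictable (three sleeps of 0.25 seconds under the four-attempt reading), which suits a race that is expected to resolve quickly.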

@openshift-ci
Contributor

openshift-ci bot commented Sep 21, 2021

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mheon

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci bot added the "approved" label (Indicates a PR has been approved by an approver from all required OWNERS files.) on Sep 21, 2021
@mheon force-pushed the retry_event_lookup branch from 3b7246c to 567faed on September 21, 2021 19:17
There's a potential race around extremely short-running
containers and events with journald. Events may not be written
for some time (small, but appreciable) after they are received,
and as such we can fail to retrieve one if there is a sufficiently
short time between us writing the event and trying to read it.

Work around this by just retrying, with a 0.25 second delay
between retries, up to 4 times.

[NO TESTS NEEDED] because I have no idea how to reproduce this
race in CI.

Fixes containers#11633

Signed-off-by: Matthew Heon <[email protected]>
@mheon force-pushed the retry_event_lookup branch from 567faed to 4ecbc7c on September 21, 2021 19:32
@rhatdan
Member

rhatdan commented Sep 21, 2021

LGTM
But tests are very angry.

@TomSweeneyRedHat
Member

LGTM
once tests are happy

@mheon
Member Author

mheon commented Sep 21, 2021

[+0194s] Error: initializing source docker://registry.fedoraproject.org/f32/fedora-toolbox:latest: pinging container registry registry.fedoraproject.org: Get "https://registry.fedoraproject.org/v2/": dial tcp 38.145.60.20:443: i/o timeout

Looks like a registry died

Member

@vrothberg vrothberg left a comment

/lgtm
/hold

@openshift-ci bot added the "do-not-merge/hold" label (Indicates that a PR should not merge because someone has issued a /hold command.) on Sep 22, 2021
@openshift-ci bot added the "lgtm" label (Indicates that a PR is ready to be merged.) on Sep 22, 2021
@vrothberg
Member

/hold cancel

@openshift-ci bot removed the "do-not-merge/hold" label (Indicates that a PR should not merge because someone has issued a /hold command.) on Sep 22, 2021
@openshift-merge-robot openshift-merge-robot merged commit e9214ce into containers:main Sep 22, 2021
@github-actions bot added the "locked - please file new issue/PR" label (Assist humans wanting to comment on an old issue or PR with locked comments.) on Sep 22, 2023
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Sep 22, 2023
Labels
approved: Indicates a PR has been approved by an approver from all required OWNERS files.
lgtm: Indicates that a PR is ready to be merged.
locked - please file new issue/PR: Assist humans wanting to comment on an old issue or PR with locked comments.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

race condition yielding "Cannot get exit code: died not found: unable to find event"
5 participants