Followon to #14559 #14685
Conversation
The commit message using "gitlab:" as the prefix makes no sense. Can you please drop it?
Got inside a CI VM to test. I was only able to reproduce an actual timeout once, and it was on
ID'd some potential issues in the DB. Have not had a chance to test, but I'll let CI bake overnight.
Hot damn it's actually passing (minus a flake). I'll clean this up in the morning, then we can get final review + merge done. |
two non-blocking nits, otherwise LGTM
We should just silently fall through. The log was flooding the system-service logs when running Gitlab runner.

Signed-off-by: Valentin Rothberg <[email protected]>
This commit addresses three intertwined bugs to fix an issue when using Gitlab runner on Podman. The three bug fixes are not split into separate commits as tests won't pass otherwise; avoidable noise when bisecting future issues.

1) Podman conflated states: even when asked to wait for the `exited` state, Podman returned as soon as a container transitioned to `stopped`. The issue surfaced as failing Gitlab tests [1], since `conmon`'s buffers had not (yet) been emptied when attaching to a container right after a wait. The race window was extremely narrow, and I only managed to reproduce it with the Gitlab runner [1] unit tests.

2) The clearer separation between `exited` and `stopped` revealed a race condition predating the changes. If a container is configured for autoremoval (e.g., via `run --rm`), the "run" process competes with the "cleanup" process running in the background. The window of the race condition was sufficiently large that the "cleanup" process had already removed the container and storage before the "run" process could read the exit code, and the "run" process hence waited indefinitely.

Address the exit-code race condition by recording exit codes in the main libpod database. Exit codes can now be read from the database. When waiting for a container to exit, Podman first waits for the container to transition to `exited` and will then query the database for its exit code. Outdated exit codes are pruned during cleanup (i.e., non-performance critical) and when refreshing the database after a reboot. An exit code is considered outdated when it is older than 5 minutes.

While the race condition predates this change, the waiting process has apparently always been fast enough in catching the exit code due to issue 1): `exited` and `stopped` were conflated. The waiting process hence caught the exit code after the container transitioned to `stopped` but before it `exited` and got removed.

3) With 1) and 2), Podman now waits for a container to properly transition to the `exited` state. Some tests did not pass after 1) and 2), which revealed the third bug: `conmon` was executed with its working directory pointing to the OCI runtime bundle of the container. The changed working directory broke resolving relative paths in the "cleanup" process. The "cleanup" process errored before actually cleaning up the container, and the waiting "main" process ran indefinitely, or until hitting a timeout. Fix the issue by executing `conmon` with the same working directory as Podman.

Note that fixing 3) *may* address a number of issues we have seen in the past where for *some* reason cleanup processes did not fire.

[1] https://gitlab.com/gitlab-org/gitlab-runner/-/issues/27119#note_970712864

Signed-off-by: Valentin Rothberg <[email protected]>

[MH: Minor reword of commit message]

Signed-off-by: Matthew Heon <[email protected]>
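To make the exit-code handling in 2) concrete, here is a minimal, self-contained Go sketch of the idea: the cleanup process records the exit code (with a timestamp) in a store that outlives the container, the waiting process blocks on the `exited` transition and then reads the code from that store, and codes older than five minutes are pruned. All names here (`exitStore`, `RecordExit`, `WaitForExit`, `Prune`) are hypothetical and are not the actual libpod API; this only illustrates the approach described in the commit message.

```go
// Hypothetical sketch of "record exit code, wait for exited, then read".
// Not libpod code; the real implementation persists codes in the state database.
package main

import (
	"fmt"
	"sync"
	"time"
)

type exitRecord struct {
	code     int
	recorded time.Time
}

// exitStore stands in for the database: exit codes keyed by container ID,
// surviving removal of the container itself.
type exitStore struct {
	mu      sync.Mutex
	records map[string]exitRecord
	exited  map[string]chan struct{} // closed once a container reaches "exited"
}

func newExitStore() *exitStore {
	return &exitStore{
		records: make(map[string]exitRecord),
		exited:  make(map[string]chan struct{}),
	}
}

// RecordExit is what the cleanup process would do: store the exit code (with a
// timestamp) before the container and its storage are removed. Assumes it is
// called at most once per container ID.
func (s *exitStore) RecordExit(id string, code int) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.records[id] = exitRecord{code: code, recorded: time.Now()}
	ch, ok := s.exited[id]
	if !ok {
		ch = make(chan struct{})
		s.exited[id] = ch
	}
	close(ch) // wake up any waiters
}

// WaitForExit is what the "run"/"wait" side would do: block until the
// container reaches "exited", then read the code from the store rather than
// from the (possibly already removed) container.
func (s *exitStore) WaitForExit(id string, timeout time.Duration) (int, error) {
	s.mu.Lock()
	ch, ok := s.exited[id]
	if !ok {
		ch = make(chan struct{})
		s.exited[id] = ch
	}
	s.mu.Unlock()

	select {
	case <-ch:
	case <-time.After(timeout):
		return -1, fmt.Errorf("timed out waiting for container %s to exit", id)
	}

	s.mu.Lock()
	defer s.mu.Unlock()
	rec, ok := s.records[id]
	if !ok {
		return -1, fmt.Errorf("no exit code recorded for container %s", id)
	}
	return rec.code, nil
}

// Prune drops exit codes older than five minutes; meant for the
// non-performance-critical cleanup path.
func (s *exitStore) Prune() {
	s.mu.Lock()
	defer s.mu.Unlock()
	cutoff := time.Now().Add(-5 * time.Minute)
	for id, rec := range s.records {
		if rec.recorded.Before(cutoff) {
			delete(s.records, id)
		}
	}
}

func main() {
	store := newExitStore()
	go func() {
		time.Sleep(10 * time.Millisecond)
		store.RecordExit("abc123", 0) // cleanup process records the code
	}()
	fmt.Println(store.WaitForExit("abc123", time.Second))
	store.Prune()
}
```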
Force-pushed, should be ready now.
Build is failing.
Firstly: don't prune exit codes after a refresh - instead, clear the table entirely. We are guaranteed that all containers are gone after a refresh, so we should not worry about exit codes given this.

Secondly: alter the way pruning was done. We were updating the DB by calling Update from within an existing View, and stacking an RW transaction on top of an existing RO one seems dodgy; further, modifying a bucket while iterating over it with ForEach is undefined behavior. Hopefully this will resolve our CI issues.

Signed-off-by: Matthew Heon <[email protected]>
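For illustration, here is a minimal sketch of the two-pass pruning pattern this commit describes, written directly against bbolt (`go.etcd.io/bbolt`): a single `Update` transaction (never opened from inside an existing `View`), with stale keys collected during `ForEach` and deleted only after the iteration finishes. The bucket names (`exit-codes`, `exit-code-times`) and the value layout are assumptions for the example, not libpod's actual schema.

```go
// Sketch of pruning stale entries in bbolt without mutating a bucket
// while iterating it. Bucket names and value encoding are hypothetical.
package main

import (
	"encoding/binary"
	"log"
	"time"

	bolt "go.etcd.io/bbolt"
)

const pruneAfter = 5 * time.Minute

func pruneExitCodes(db *bolt.DB) error {
	// One read-write transaction for the whole operation.
	return db.Update(func(tx *bolt.Tx) error {
		times := tx.Bucket([]byte("exit-code-times"))
		codes := tx.Bucket([]byte("exit-codes"))
		if times == nil || codes == nil {
			return nil // nothing recorded yet
		}

		cutoff := time.Now().Add(-pruneAfter).Unix()

		// First pass: collect stale keys. Deleting inside ForEach is
		// undefined behavior, so only read here.
		var stale [][]byte
		err := times.ForEach(func(k, v []byte) error {
			if len(v) != 8 {
				return nil // unexpected encoding; skip
			}
			if int64(binary.BigEndian.Uint64(v)) < cutoff {
				key := make([]byte, len(k))
				copy(key, k) // copy: bolt slices may be invalidated by later writes
				stale = append(stale, key)
			}
			return nil
		})
		if err != nil {
			return err
		}

		// Second pass: delete outside of the iteration.
		for _, k := range stale {
			if err := times.Delete(k); err != nil {
				return err
			}
			if err := codes.Delete(k); err != nil {
				return err
			}
		}
		return nil
	})
}

func main() {
	db, err := bolt.Open("state.db", 0o600, nil)
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()
	if err := pruneExitCodes(db); err != nil {
		log.Fatal(err)
	}
}
```

Collecting the keys first keeps the iteration read-only, which sidesteps the undefined behavior of mutating a bucket inside ForEach while still doing all the work in one transaction.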
Oops, fixed.
Ready to merge |
LGTM
Thanks for kicking it over the finish line, @mheon. Great work!
@giuseppe PTAL
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: giuseppe, mheon

The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
As an outsider: Good job, everyone! :) Looking forward to (eventually) using Podman in Gitlab.
Taking over #14559 from @vrothberg and trying to get CI to pass.
No changes (besides a rebase) yet; I just want to verify we have the same symptoms.