Followon to #14559 #14685
Conversation
The commit message using "gitlab:" as the prefix makes no sense. Can you please drop it?
Got inside a CI VM to test. I was only able to reproduce an actual timeout once, and it was on
ID'd some potential issues in the DB. Have not had a chance to test, but I'll let CI bake overnight.
Hot damn it's actually passing (minus a flake). I'll clean this up in the morning, then we can get final review + merge done. |
two non-blocking nits, otherwise LGTM
We should just silently fall through. The log was flooding the system-service logs when running Gitlab runner.

Signed-off-by: Valentin Rothberg <[email protected]>
This commit addresses three intertwined bugs to fix an issue when using Gitlab runner on Podman. The three bug fixes are not split into separate commits as tests won't pass otherwise; avoidable noise when bisecting future issues.

1) Podman conflated states: even when asked to wait for the `exited` state, Podman returned as soon as a container transitioned to `stopped`. The issue surfaced as failing Gitlab tests [1], since `conmon`'s buffers had not (yet) been emptied when attaching to a container right after a wait. The race window was extremely narrow, and I only managed to reproduce it with the Gitlab runner [1] unit tests.

2) The clearer separation between `exited` and `stopped` revealed a race condition predating the changes. If a container is configured for autoremoval (e.g., via `run --rm`), the "run" process competes with the "cleanup" process running in the background. The window of the race condition was sufficiently large that the "cleanup" process had already removed the container and storage before the "run" process could read the exit code, and the "run" process hence waited indefinitely.

Address the exit-code race condition by recording exit codes in the main libpod database. Exit codes can now be read from the database. When waiting for a container to exit, Podman first waits for the container to transition to `exited` and will then query the database for its exit code. Outdated exit codes are pruned during cleanup (i.e., non-performance critical) and when refreshing the database after a reboot. An exit code is considered outdated when it is older than 5 minutes.

While the race condition predates this change, the waiting process has apparently always been fast enough in catching the exit code due to issue 1): `exited` and `stopped` were conflated. The waiting process hence caught the exit code after the container transitioned to `stopped` but before it `exited` and got removed.

3) With 1) and 2), Podman now waits for a container to properly transition to the `exited` state. Some tests did not pass after 1) and 2), which revealed the third bug: `conmon` was executed with its working directory pointing to the OCI runtime bundle of the container. The changed working directory broke resolving relative paths in the "cleanup" process. The "cleanup" process errored before actually cleaning up the container, and the waiting "main" process ran indefinitely, or until hitting a timeout. Fix the issue by executing `conmon` with the same working directory as Podman.

Note that fixing 3) *may* address a number of issues we have seen in the past where for *some* reason cleanup processes did not fire.

[1] https://gitlab.com/gitlab-org/gitlab-runner/-/issues/27119#note_970712864

Signed-off-by: Valentin Rothberg <[email protected]>

[MH: Minor reword of commit message]

Signed-off-by: Matthew Heon <[email protected]>
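To make the exit-code handling in 2) concrete, here is a minimal, self-contained Go sketch of the idea: the cleanup process records the exit code (with a timestamp) in a store that outlives the container, the waiting process blocks on the `exited` transition and then reads the code from that store, and codes older than five minutes are pruned. All names here (`exitStore`, `RecordExit`, `WaitForExit`, `Prune`) are hypothetical and are not the actual libpod API; this only illustrates the approach described in the commit message.

```go
// Hypothetical sketch of "record exit code, wait for exited, then read".
// Not libpod code; the real implementation persists codes in the state database.
package main

import (
	"fmt"
	"sync"
	"time"
)

type exitRecord struct {
	code     int
	recorded time.Time
}

// exitStore stands in for the database: exit codes keyed by container ID,
// surviving removal of the container itself.
type exitStore struct {
	mu      sync.Mutex
	records map[string]exitRecord
	exited  map[string]chan struct{} // closed once a container reaches "exited"
}

func newExitStore() *exitStore {
	return &exitStore{
		records: make(map[string]exitRecord),
		exited:  make(map[string]chan struct{}),
	}
}

// RecordExit is what the cleanup process would do: store the exit code (with a
// timestamp) before the container and its storage are removed. Assumes it is
// called at most once per container ID.
func (s *exitStore) RecordExit(id string, code int) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.records[id] = exitRecord{code: code, recorded: time.Now()}
	ch, ok := s.exited[id]
	if !ok {
		ch = make(chan struct{})
		s.exited[id] = ch
	}
	close(ch) // wake up any waiters
}

// WaitForExit is what the "run"/"wait" side would do: block until the
// container reaches "exited", then read the code from the store rather than
// from the (possibly already removed) container.
func (s *exitStore) WaitForExit(id string, timeout time.Duration) (int, error) {
	s.mu.Lock()
	ch, ok := s.exited[id]
	if !ok {
		ch = make(chan struct{})
		s.exited[id] = ch
	}
	s.mu.Unlock()

	select {
	case <-ch:
	case <-time.After(timeout):
		return -1, fmt.Errorf("timed out waiting for container %s to exit", id)
	}

	s.mu.Lock()
	defer s.mu.Unlock()
	rec, ok := s.records[id]
	if !ok {
		return -1, fmt.Errorf("no exit code recorded for container %s", id)
	}
	return rec.code, nil
}

// Prune drops exit codes older than five minutes; meant for the
// non-performance-critical cleanup path.
func (s *exitStore) Prune() {
	s.mu.Lock()
	defer s.mu.Unlock()
	cutoff := time.Now().Add(-5 * time.Minute)
	for id, rec := range s.records {
		if rec.recorded.Before(cutoff) {
			delete(s.records, id)
		}
	}
}

func main() {
	store := newExitStore()
	go func() {
		time.Sleep(10 * time.Millisecond)
		store.RecordExit("abc123", 0) // cleanup process records the code
	}()
	fmt.Println(store.WaitForExit("abc123", time.Second))
	store.Prune()
}
```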
Force-pushed, should be ready now.
Build is failing.
Firstly: don't prune exit codes after a refresh - instead, clear the table entirely. We are guaranteed that all containers are gone after a refresh, so we should not worry about exit codes given this.

Secondly: alter the way pruning was done. We were updating the DB by calling Update from within an existing View, and stacking an RW transaction on top of an existing RO one seems dodgy; further, modifying a bucket while iterating over it with ForEach is undefined behavior. Hopefully this will resolve our CI issues.

Signed-off-by: Matthew Heon <[email protected]>
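For illustration, here is a minimal sketch of the two-pass pruning pattern this commit describes, written directly against bbolt (`go.etcd.io/bbolt`): a single `Update` transaction (never opened from inside an existing `View`), with stale keys collected during `ForEach` and deleted only after the iteration finishes. The bucket names (`exit-codes`, `exit-code-times`) and the value layout are assumptions for the example, not libpod's actual schema.

```go
// Sketch of pruning stale entries in bbolt without mutating a bucket
// while iterating it. Bucket names and value encoding are hypothetical.
package main

import (
	"encoding/binary"
	"log"
	"time"

	bolt "go.etcd.io/bbolt"
)

const pruneAfter = 5 * time.Minute

func pruneExitCodes(db *bolt.DB) error {
	// One read-write transaction for the whole operation.
	return db.Update(func(tx *bolt.Tx) error {
		times := tx.Bucket([]byte("exit-code-times"))
		codes := tx.Bucket([]byte("exit-codes"))
		if times == nil || codes == nil {
			return nil // nothing recorded yet
		}

		cutoff := time.Now().Add(-pruneAfter).Unix()

		// First pass: collect stale keys. Deleting inside ForEach is
		// undefined behavior, so only read here.
		var stale [][]byte
		err := times.ForEach(func(k, v []byte) error {
			if len(v) != 8 {
				return nil // unexpected encoding; skip
			}
			if int64(binary.BigEndian.Uint64(v)) < cutoff {
				key := make([]byte, len(k))
				copy(key, k) // copy: bolt slices may be invalidated by later writes
				stale = append(stale, key)
			}
			return nil
		})
		if err != nil {
			return err
		}

		// Second pass: delete outside of the iteration.
		for _, k := range stale {
			if err := times.Delete(k); err != nil {
				return err
			}
			if err := codes.Delete(k); err != nil {
				return err
			}
		}
		return nil
	})
}

func main() {
	db, err := bolt.Open("state.db", 0o600, nil)
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()
	if err := pruneExitCodes(db); err != nil {
		log.Fatal(err)
	}
}
```

Collecting the keys first keeps the iteration read-only, which sidesteps the undefined behavior of mutating a bucket inside ForEach while still doing all the work in one transaction.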
Oops, fixed.
Ready to merge |
LGTM
Thanks for kicking it over the finish line, @mheon. Great work!
@giuseppe PTAL
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: giuseppe, mheon

The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
As an outsider: Good job, everyone! :) Looking forward to (eventually) using Podman in Gitlab.
Taking over #14559 from @vrothberg and trying to get CI to pass.
No changes (besides a rebase) yet; I just want to verify we have the same symptoms.