sdnotify play kube policies: podman container wait, hangs #16076
Comments
Will take a look, thanks!
Possibly related to #16062?
I recently attempted to fix some issues down there. @edsantiago, do you know when the flake happened for the first time?
Just saw it at the bottom: "Looks like Sept 9 is the first logged instance."
Start listening for the READY messages on the sdnotify proxies before starting the Pod. Otherwise, we may be missing messages. [NO NEW TESTS NEEDED] as it's hard to test this very narrow race. Related to but may not be fixing containers#16076. Signed-off-by: Valentin Rothberg <[email protected]>
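For illustration only, a minimal Go sketch of that ordering, assuming a unixgram notify socket; `waitReadyThenStart`, `socketPath`, and the `startPod` callback are hypothetical placeholders, not podman's actual API:

```go
package notifyproxy

import (
	"context"
	"net"
	"strings"
)

// waitReadyThenStart opens and drains the notify socket *before* the pod is
// started, so a READY=1 sent immediately after startup cannot be missed.
func waitReadyThenStart(ctx context.Context, socketPath string, startPod func(context.Context) error) error {
	conn, err := net.ListenUnixgram("unixgram", &net.UnixAddr{Name: socketPath, Net: "unixgram"})
	if err != nil {
		return err
	}
	defer conn.Close()

	readyCh := make(chan struct{})
	go func() {
		buf := make([]byte, 4096)
		for {
			n, _, err := conn.ReadFromUnix(buf)
			if err != nil {
				return
			}
			if strings.Contains(string(buf[:n]), "READY=1") {
				close(readyCh)
				return
			}
		}
	}()

	// Only now start the pod; the listener above is already receiving messages.
	if err := startPod(ctx); err != nil {
		return err
	}

	select {
	case <-readyCh:
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}
```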
Closing as #16118 merged. I am not 100 percent sure it fixes the flake but it's the only potential source for the flake I could spot so far. Let's reopen in case it continues.
Reopening: this is still happening (and yes, I confirmed that the PR in question is forked from a
Can we close #16246 as a duplicate, or is there a need to track it separately?
Seen also in fedora gating tests. [Edit: yes, dup. I was hunting for the log link before closing that one]
Thanks, @edsantiago! I will take a look at this one. A stubborn issue!
The notify proxy has a watcher to check whether the container has left the running state. In that case, Podman should stop waiting for the ready message to prevent a deadlock. Fix this watcher by adding a loop. Fixes the deadlock in containers#16076 surfacing in a timeout. The underlying issue persists though. Also use a timer in the select statement to prevent the goroutine from running unnecessarily long. [NO NEW TESTS NEEDED] Signed-off-by: Valentin Rothberg <[email protected]>
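A hedged sketch of what such a watcher loop can look like in Go (not the actual proxy code; the poll interval and the `isRunning` callback are assumptions):

```go
package notifyproxy

import (
	"context"
	"fmt"
	"time"
)

// waitForReady keeps waiting for READY=1 but bails out once the container
// has left the running state, so the caller cannot deadlock on a container
// that exits without ever notifying.
func waitForReady(ctx context.Context, readyCh <-chan struct{}, isRunning func() (bool, error)) error {
	ticker := time.NewTicker(250 * time.Millisecond) // poll interval is an assumption
	defer ticker.Stop()
	for {
		select {
		case <-readyCh:
			return nil // READY=1 arrived
		case <-ticker.C:
			running, err := isRunning()
			if err != nil {
				return err
			}
			if !running {
				// Container exited without sending READY: stop waiting
				// instead of hanging forever.
				return fmt.Errorf("container stopped before sending READY=1")
			}
		case <-ctx.Done():
			return ctx.Err()
		}
	}
}
```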
Reopening as I don't expect #16284 to fix the underlying issue. It may help to resolve the timeout, but the test should still fail.
@edsantiago have you seen this one flake recently?
One of our most popular flakes recently.
Help is on the way in #16709. This has the highest priority for me this week. Currently refining tests, so I am hopeful the flake will be buried this week.
The flake in containers#16076 is likely related to the notify message not being delivered/read correctly. Move sending the message into an exec session such that flakes will reveal an error message. Signed-off-by: Valentin Rothberg <[email protected]>
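A rough Go illustration of that idea, not the actual test code; it assumes `podman` is on PATH and that systemd-notify is available inside the container:

```go
package kubetest

import (
	"fmt"
	"os/exec"
)

// sendReadyViaExec sends READY=1 from inside the container through an exec
// session, so a delivery failure shows up as a non-zero exit status and
// captured output instead of disappearing silently.
func sendReadyViaExec(ctrName string) error {
	cmd := exec.Command("podman", "exec", ctrName, "systemd-notify", "--ready")
	out, err := cmd.CombinedOutput()
	if err != nil {
		return fmt.Errorf("sending READY=1 via exec session failed: %w: %s", err, out)
	}
	return nil
}
```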
As outlined in containers#16076, a subsequent BARRIER *may* follow the READY message sent by a container. To correctly imitate the behavior of systemd's NOTIFY_SOCKET, the notify proxies spun up by `kube play` must hence process messages for the entirety of the workload. We know that the workload is done and that all containers and pods have exited when the service container exits. Hence, all proxies are closed at that time. The above changes imply that Podman runs for the entirety of the workload and will henceforth act as the MAINPID when running inside of systemd. Prior to this change, the service container acted as the MAINPID, which is no longer possible; Podman would be killed immediately on exit of the service container and could not clean up. The kube template now correctly transitions to inactive instead of failed in systemd. Fixes: containers#16076 Fixes: containers#16515 Signed-off-by: Valentin Rothberg <[email protected]>
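To illustrate the READY/BARRIER handling described above, here is a minimal, hypothetical read loop for such a proxy (not podman's implementation): an sd_notify barrier carries a pipe file descriptor via SCM_RIGHTS, and the receiver releases the sender by closing that descriptor, which is why the proxy has to keep processing messages for the whole workload.

```go
package notifyproxy

import (
	"net"
	"strings"
	"syscall"
)

// proxyLoop reads notify messages for the lifetime of the workload.
// READY=1 is reported via the callback; BARRIER=1 is acknowledged by closing
// the file descriptor(s) passed along with it, which unblocks the sender.
func proxyLoop(conn *net.UnixConn, onReady func()) error {
	buf := make([]byte, 4096)
	oob := make([]byte, 4096)
	for {
		n, oobn, _, _, err := conn.ReadMsgUnix(buf, oob)
		if err != nil {
			return err
		}
		msg := string(buf[:n])
		if strings.Contains(msg, "READY=1") {
			onReady()
		}
		if strings.Contains(msg, "BARRIER=1") {
			scms, err := syscall.ParseSocketControlMessage(oob[:oobn])
			if err != nil {
				continue
			}
			for _, scm := range scms {
				fds, err := syscall.ParseUnixRights(&scm)
				if err != nil {
					continue
				}
				for _, fd := range fds {
					// Closing the passed fd releases the barrier on the
					// sending side.
					syscall.Close(fd)
				}
			}
		}
	}
}
```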
@edsantiago any indications of a flake after the merge?
In the last ten minutes, no :-) Patience, grasshopper. Data collection takes time. I'll let you know late December.
GitHub claims it to be 13 hours :^)
Thanks! I intended to be pro-active; it has been quite a flake/ride.
Two flakes on December 14 (after the Dec 8 merge). Both on PR 16781 which, if I'm gitting correctly, was parented on a Dec 7 commit that did not include #16709. So I think we're good. Thank you @vrothberg!
That was a tough cookie! Thanks for your help and patience, @edsantiago
Looks like Sept 9 is the first logged instance.
So far, f36 only (both amd64 and aarch64), root and rootless.