remote: run --restart=always, then wait: timeout #23473
Comments
I looked into that as it happened a lot in my parallel PR as well, but I didn't find anything obvious. Overall the wait code is rather ugly: it polls the status, so in theory it could miss an exit on a fast-restarting container, but I don't think that explains a 20s hang. And I tried reproducing for a while without luck.
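For illustration only, a minimal Go sketch of the kind of status-polling race described above. The `Container` type and its methods here are hypothetical stand-ins, not the actual libpod API:

```go
// Package wait sketches the status-polling race described in the comment
// above. Container, syncState, and the "exited" state string are
// hypothetical stand-ins, not the actual libpod code.
package wait

import "time"

type Container struct {
	state string // e.g. "running", "exited" — refreshed by syncState
}

// syncState is assumed to re-read the container state from the runtime.
// On a fast-restarting container the state can flip exited -> running
// between two calls, so the caller never observes "exited".
func (c *Container) syncState() {
	// placeholder: a real implementation would query the runtime here
}

// waitByPolling checks the synced state at a fixed interval. If the
// container exits and is restarted between two polls, the exit is missed
// and the loop keeps spinning until some outer timeout fires.
func waitByPolling(c *Container, interval time.Duration) {
	for {
		c.syncState()
		if c.state == "exited" {
			return
		}
		time.Sleep(interval)
	}
}
```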
Soooo..... this is
I've lumped it into this issue. [remote rawhide root](https://api.cirrus-ci.com/v1/artifact/task/5363842849439744/html/sys-remote-rawhide-root-host-sqlite.log.html#t--00216p)
Yes, you are right: kube play --wait uses the same wait API internally, so it is possible that they have the same cause.
Current flake list. There is one non-remote instance (debian), all others are remote. I'm tempted to remove the
I will try to reproduce this; last time I ran it in a loop for hours without any errors. Not sure if there is some magic requirement I was missing to trigger it.
--restart=always has some extra special logic to "make it work" better, but using on-failure shows that the current logic is clearly not working properly.
The current code did several complicated state checks that simply do not work properly on a fast restarting container. It uses a special case for --restart=always but forgot to take care of --restart=on-failure, which always hangs for 20s until it runs into the timeout. The old logic also used to call CheckConmonRunning() but synced the state beforehand, which means it may check a new conmon every time and thus miss exits.

To fix this, the new code is much simpler: check the conmon pid, and if it is no longer running, check the exit file and get the exit code. This is related to containers#23473, but I am not sure if this fixes it because we cannot reproduce.

Signed-off-by: Paul Holzinger <[email protected]>
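For reference, a minimal Go sketch of the simplified approach the commit message describes (check the conmon pid, then read the exit file). All names below are hypothetical stand-ins rather than the actual libpod functions, and a Linux host is assumed:

```go
// Sketch of the simplified wait logic: poll the conmon pid instead of the
// synced container state, then read the exit code from the exit file.
package wait

import (
	"os"
	"strconv"
	"strings"
	"syscall"
	"time"
)

// conmonAlive reports whether the conmon process still exists by sending
// signal 0, which checks for the pid without delivering a signal.
func conmonAlive(conmonPID int) bool {
	return syscall.Kill(conmonPID, 0) == nil
}

// readExitCode parses the exit code that conmon writes to the container's
// exit file once the container process has exited.
func readExitCode(exitFilePath string) (int, error) {
	data, err := os.ReadFile(exitFilePath)
	if err != nil {
		return -1, err
	}
	return strconv.Atoi(strings.TrimSpace(string(data)))
}

// waitForExit polls the conmon pid; once that conmon is gone the container
// run it supervised has exited (even if a restart policy starts a new conmon
// later), so the exit file for this run can be read.
func waitForExit(conmonPID int, exitFilePath string, interval time.Duration) (int, error) {
	for conmonAlive(conmonPID) {
		time.Sleep(interval)
	}
	return readExitCode(exitFilePath)
}
```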
Let me know if you see any hangs after #23601
Last logged instance on the flake tracker was 8-14 so I am going to close this. |