Jobs are not correctly restarted on Podman 4.6.0 #277
Comments
Thanks for the bug report! I'm running into the same issue on Fedora 38. I upgraded to podman 4.6.0-1.fc38.x86_64 on August 1st and that's when I started seeing the issue.
I also ran into this! Thanks for taking the time to investigate and dropping a clear explanation of the issue @jonasdemoor. I've got a small fix that works for me. PR incoming.
I can report that I am experiencing the same problem in the following environment:
Moreover, on a Job version update, the previous allocation's container(s) cannot be stopped, and the new allocation stays pending forever. For this specific instance, this is what I see in the status log of the allocation that will not stop:
I compiled the
Description
We're running Nomad 1.5.3 with Podman 4.6.0 compiled from sources on Debian 11 (bullseye). I attached more details about our environment below.
This is what the state of Nomad and Podman looks like after starting the job:
After stopping the job, Nomad lets Podman correctly stop the container, but doesn't gc/remove it. Even though the job is dead according to Nomad, it keeps trying to get stats for that container.
When I start the Nomad job again, Nomad thinks everything is fine and reports the job as running, but the old Podman container is still there, in a stopped state. By default, the old container should have been removed when the job was stopped, and a new one should have been started by Podman when the job was started again.
It takes a manual `podman rm <stopped_container_id>` to get Nomad back into the right state. Then, Podman starts a new container and the job runs fine.

Downgrading to Podman 4.5.1 solves the issue for us. With that version, the containers are successfully stopped and deleted, as illustrated by the `journalctl` output:

After some digging, I found that the stats API changed a bit between Podman 4.5.1 and 4.6.0. It used to return HTTP 200 with an empty response for stopped containers, but now it returns HTTP 200 with an error message:
I think the upstream Podman PR that introduced those changes is this one: containers/podman#15867
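If you want to check the behaviour on your own machine, a small stand-alone probe like the one below is enough to see the difference. This is only a sketch: the socket path and the API version prefix in the URL are assumptions on my part, adjust them for your setup.

```go
// probe_stats.go - a stand-alone check, not part of the driver.
// Assumptions (adjust for your setup): the Podman API socket lives at
// /run/podman/podman.sock and the libpod stats endpoint is reachable at
// /v4.0.0/libpod/containers/{name}/stats?stream=false.
package main

import (
	"context"
	"fmt"
	"io"
	"net"
	"net/http"
	"os"
)

func main() {
	if len(os.Args) < 2 {
		fmt.Fprintln(os.Stderr, "usage: probe_stats <container-name-or-id>")
		os.Exit(1)
	}
	container := os.Args[1]

	// Talk to the Podman API over its unix socket; the host part of the URL
	// below is a dummy because the dialer ignores it.
	client := &http.Client{
		Transport: &http.Transport{
			DialContext: func(ctx context.Context, _, _ string) (net.Conn, error) {
				var d net.Dialer
				return d.DialContext(ctx, "unix", "/run/podman/podman.sock")
			},
		},
	}

	// stream=false asks for a single stats sample instead of a stream.
	url := fmt.Sprintf("http://podman/v4.0.0/libpod/containers/%s/stats?stream=false", container)
	resp, err := client.Get(url)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	body, _ := io.ReadAll(resp.Body)
	// Against a stopped container: 4.5.1 returned status 200 with an empty
	// body here, 4.6.0 returns status 200 with an error message in the body.
	fmt.Printf("status: %d\nbody: %q\n", resp.StatusCode, string(body))
}
```

Running this against a stopped container shows that 4.6.0 no longer returns an empty body, while the status code is still 200.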
What I think is happening is that the Nomad driver assumes the container is still running fine, since it doesn't handle that new stats API response and only takes the old API behaviour into account: https://github.com/hashicorp/nomad-driver-podman/blob/main/api/container_stats.go#L46
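To make the idea concrete, something along these lines is what I have in mind for that spot. This is an untested sketch on my side: the report/stats structs, their JSON field names, and the sentinel error are guesses for illustration, not the driver's actual types.

```go
// A sketch of the extra case I have in mind. This is NOT the driver's actual
// code: the report/stats structs, their JSON field names, and the sentinel
// error below are guesses for illustration only.
package api

import (
	"encoding/json"
	"errors"
	"fmt"
	"io"
	"net/http"
)

// ErrContainerNotRunning signals the caller that the container is gone, so
// the task can be marked as dead and restarted instead of silently "running".
var ErrContainerNotRunning = errors.New("container is not running")

type containerStats struct {
	CPU float64 `json:"CPU"`
	// ... remaining stats fields omitted for brevity
}

// containerStatsReport mirrors (roughly, as an assumption) the body that
// Podman 4.6.0 returns for stats: either a stats sample or an error message.
type containerStatsReport struct {
	Error string           `json:"Error"`
	Stats []containerStats `json:"Stats"`
}

func parseStatsResponse(res *http.Response) (*containerStats, error) {
	if res.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("unexpected status %d from stats endpoint", res.StatusCode)
	}

	body, err := io.ReadAll(res.Body)
	if err != nil {
		return nil, err
	}

	// Old behaviour (Podman <= 4.5.1): HTTP 200 with an empty body for a
	// stopped container.
	if len(body) == 0 {
		return nil, ErrContainerNotRunning
	}

	var report containerStatsReport
	if err := json.Unmarshal(body, &report); err != nil {
		return nil, err
	}

	// New behaviour (Podman 4.6.0): HTTP 200, but the body carries an error
	// message (and no stats) instead of a stats sample.
	if report.Error != "" || len(report.Stats) == 0 {
		return nil, ErrContainerNotRunning
	}

	return &report.Stats[0], nil
}
```

With a case like that, an HTTP 200 whose body carries an error (or no stats at all) would be treated the same way as the old empty-body response, so the driver would notice the container is gone and Nomad could restart the job properly.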
I tried poking around in the `api/container_stats.go` code to add a new case like this for the new Podman behaviour myself, but my Golang skills are unfortunately not very good. However, I want to assist in any way I can, so feel free to let me know if I can provide additional information and/or testing.

Environment