-
Notifications
You must be signed in to change notification settings - Fork 2.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
exitcode/timestamp consistency after checkpoint creation/restore #9388
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: buck2202 The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
pinging @rhatdan @baude since you commented in the original issue. hope this looks ok! My familiarity with doing PRs is very low, so sorry if I'm missing something here edit: also not that familiar with git. didn't realize that the first line of a commit message was treated as a "subject" with length restriction. will try to fix edit2: my force-push to shorten the commit message looks like it brought in a different commit. I'll leave this to finish its checks, but if it doesn't work, I don't think I know git well enough to do anything less drastic than reforking and trying again :/ |
You can try rebasing atop master - that should remove the extra commit. Overall LGTM but I'd feel more comfortable if someone more familiar with CRIU reviewed this prior to merge. |
Turns out I hadn't fixed the commit msg length issue anyway...guess it doesn't count as a subject without a blank line following. Think I've fixed the errant commit and the string length now. I guess I didn't fully understand the tests requirement. I think flagging none needed would be appropriate here? Does adding another commit with that message take care of it, or does it have to be present in each commit to be merged? Sorry for the problems! I wanted to contribute and I've used git a little in isolation, but collaborative git/CI are new to me. |
I guess I could actually modify one of the checkpoint/restore tests in e2e/checkpoint_test.go to cover these changes, but I haven't had any luck yet following the instructions to actually run a test locally. I'll keep trying when I find time again |
A friendly reminder that this PR had no activity for 30 days. |
@buck2202 what is the scoop on this one. |
@rhatdan Sorry, missed the bot action on this. I commented in the issue thread when it got marked stale (#8432 (comment)) but not here. I wasn't able to get Beyond general review of my approach, the only open code question on my end was whether |
@buck2202, to pass CI, you need to add Can you rebase and repush? |
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: buck2202 The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
I've played with this off and on, but was never able to get the lint, validate, and ginkgo frameworks to function in my build environment following the docs. Seems like there could easily be a regression test on a container's exitcode after a stopping-checkpoint ( So, I've added the edit: just to bring my rationale into the discussion here--the current behavior of edit2: test failure seems unrelated to my commits
|
@adrianreber PTAL @buck2202, apologies. Could you do another rebase? |
From my point of view this does not make sense. Why should a checkpointed container have 137 as exit code? From my point of view that sounds like something did not work. 137 is a SIGKILL which is correct because without If the command There is no reason to do what Docker does for me, because the whole checkpoint/restore is completely different in Podman.
The right solution would be to track the state correctly as checkpointed. But if the Podman maintainers feel that 137 is the right exit status then I am also not against it. For me it just does not make sense. |
I don't think an entire state for Checkpointed is appropriate (adding new states is a substantial burden as we have a lot of internal state-checking logic to ensure commands are in the right state), but we could easily have a bool in the database for "Checkpointed" which is set on a container killed by a checkpoint operation, and cleared again on the container restarting; we can expose this in |
That sounds like a really good idea. |
…me if left running -replace direct ContainerStateStopped assignment with waitForExitFileAndSync() call. follows example of libpod/container_internal.go::stop() to properly capture stopped exit code -remove unconditional assignment to container FinishedTime. this leaves that unset if the container was left running, and is handled elsewhere if not Signed-off-by: David J. Buckmaster <[email protected]>
Signed-off-by: David J. Buckmaster <[email protected]>
[NO TESTS NEEDED] Signed-off-by: David J. Buckmaster <[email protected]>
I agree that both the current behavior and my commits leave something undefined--or, at least, difficult to discern--without additional tracking. From my perspective, I thought it made more sense to rely solely on the My experience may be atypical, though, so I understand that opinions may differ. I run transient containers in a preemptible cloud environment with periodic, non-terminating, exported checkpoints; and a terminating checkpoint either on-demand or when the VM reaches its max-allowed uptime. Because my VMs can be killed at any time (including during a checkpoint operation), I use the I also like @mheon's idea. I rebased, but the checks seem to have failed again unrelated to my commits
|
A friendly reminder that this PR had no activity for 30 days. |
@mheon @buck2202 @adrianreber What should we do with this PR? |
I think we close, and I take over and create a PR to add a Checkpointed bool to the output of |
No objection from me |
A friendly reminder that this PR had no activity for 30 days. |
@mheon Did you ever work on this? |
Negative. Let me do it now. |
Closing in favor of #11471 |
In docker, the exitcode of a container checkpointed without
--leave-running
is 137, which matches the exitcode of a container that is manually stopped. podman leaves the exitcode as 0In
libpod/container_internal_linux.go::checkpoint()
, I usedlibpod/container_internal.go::stop()
as an example and copied the call toc.waitForExitFileAndSync()
rather than manually assigning the exitcode to 137. I assumed this would be better than using 137 as a magic number, but maybe that would've been fine. I'm not totally sure if there are other implications to this function call. Additionally incheckpoint()
, I removed the unconditional assignment to the container's FinishedTime. This happens within thewaitForExitFileAndSync
call chain if the container is stopped, and probably shouldn't have been happening at all if--leave-running
was specified.In
libpod/container_internal_linux.go::restore()
after the container state is set to running, I explicitly set the container exitcode to zero, its StartedTime to now, and clear its FinishedTime.With these changes, the timestamps and exitcode almost track with docker's behavior. I noticed in collecting the output here that docker never actually resets the "FinishedAt" time when a container is restored. If the goal is completely matching docker, I can remove the line that clears it upon restore. IMO, it seems more logical to have a null FinishedAt time for a container that's running.
Podman:
Docker:
Unfortunately I couldn't get the
validate
orlint
tests to function. Not sure what I'm missing in following the instructions in the contributing doc :/Closes #8432