exitcode/timestamp consistency after checkpoint creation/restore #9388

buck2202 · 2021-02-15T21:29:44Z

In docker, the exitcode of a container checkpointed without --leave-running is 137, which matches the exitcode of a container that is manually stopped. podman leaves the exitcode as 0

In libpod/container_internal_linux.go::checkpoint(), I used libpod/container_internal.go::stop() as an example and copied the call to c.waitForExitFileAndSync() rather than manually assigning the exitcode to 137. I assumed this would be better than using 137 as a magic number, but maybe that would've been fine. I'm not totally sure if there are other implications to this function call. Additionally in checkpoint(), I removed the unconditional assignment to the container's FinishedTime. This happens within the waitForExitFileAndSync call chain if the container is stopped, and probably shouldn't have been happening at all if --leave-running was specified.

In libpod/container_internal_linux.go::restore() after the container state is set to running, I explicitly set the container exitcode to zero, its StartedTime to now, and clear its FinishedTime.

With these changes, the timestamps and exitcode almost track with docker's behavior. I noticed in collecting the output here that docker never actually resets the "FinishedAt" time when a container is restored. If the goal is completely matching docker, I can remove the line that clears it upon restore. IMO, it seems more logical to have a null FinishedAt time for a container that's running.

Podman:

$ sudo podman run --detach --name cr busybox /bin/sleep 60
7f2de709a7be88542d03cebf194deca877b5ad69557f48597259767e1b4a8f83
$ sudo podman container inspect cr |grep -E "(\"Created|FinishedAt|StartedAt|ExitCode)"
        "Created": "2021-02-15T16:07:48.815788417-05:00",
            "ExitCode": 0,
            "StartedAt": "2021-02-15T16:07:49.374606028-05:00",
            "FinishedAt": "0001-01-01T00:00:00Z",
$ sudo podman container checkpoint cr
7f2de709a7be88542d03cebf194deca877b5ad69557f48597259767e1b4a8f83
$ sudo podman container inspect cr |grep -E "(\"Created|FinishedAt|StartedAt|ExitCode)"
        "Created": "2021-02-15T16:07:48.815788417-05:00",
            "ExitCode": 137,
            "StartedAt": "2021-02-15T16:07:49.374606028-05:00",
            "FinishedAt": "2021-02-15T16:08:03.296056384-05:00",
$ sudo podman container restore cr
7f2de709a7be88542d03cebf194deca877b5ad69557f48597259767e1b4a8f83
$ sudo podman container inspect cr |grep -E "(\"Created|FinishedAt|StartedAt|ExitCode)"
        "Created": "2021-02-15T16:07:48.815788417-05:00",
            "ExitCode": 0,
            "StartedAt": "2021-02-15T16:08:23.071588181-05:00",
            "FinishedAt": "0001-01-01T00:00:00Z",
$ sleep 60
$ sudo podman container inspect cr |grep -E "(\"Created|FinishedAt|StartedAt|ExitCode)"
        "Created": "2021-02-15T16:07:48.815788417-05:00",
            "ExitCode": 0,
            "StartedAt": "2021-02-15T16:08:23.071588181-05:00",
            "FinishedAt": "2021-02-15T16:09:09.804056384-05:00",

Docker:

$ docker run --detach --name cr busybox /bin/sleep 60
66d0545e3d3932f58e2b59f8ea7489dea1c909ebbf02789883478f98ebd9fc3d
$ docker container inspect cr | grep -E "(\"Created|FinishedAt|StartedAt|ExitCode)"
        "Created": "2021-02-15T21:11:54.129271959Z",
            "ExitCode": 0,
            "StartedAt": "2021-02-15T21:11:54.772211374Z",
            "FinishedAt": "0001-01-01T00:00:00Z"
$ docker checkpoint create cr chk
chk
$ docker container inspect cr | grep -E "(\"Created|FinishedAt|StartedAt|ExitCode)"
        "Created": "2021-02-15T21:11:54.129271959Z",
            "ExitCode": 137,
            "StartedAt": "2021-02-15T21:11:54.772211374Z",
            "FinishedAt": "2021-02-15T21:12:06.879920447Z"
$ docker start --checkpoint chk cr
$ docker container inspect cr | grep -E "(\"Created|FinishedAt|StartedAt|ExitCode)"
        "Created": "2021-02-15T21:11:54.129271959Z",
            "ExitCode": 0,
            "StartedAt": "2021-02-15T21:12:25.566876155Z",
            "FinishedAt": "2021-02-15T21:12:06.879920447Z"
$ sleep 60
$ docker container inspect cr | grep -E "(\"Created|FinishedAt|StartedAt|ExitCode)"
        "Created": "2021-02-15T21:11:54.129271959Z",
            "ExitCode": 0,
            "StartedAt": "2021-02-15T21:12:25.566876155Z",
            "FinishedAt": "2021-02-15T21:13:14.151229075Z"

Unfortunately I couldn't get the validate or lint tests to function. Not sure what I'm missing in following the instructions in the contributing doc :/

Closes #8432

openshift-ci-robot · 2021-02-15T21:29:45Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: buck2202
To complete the pull request process, please assign umohnani8 after the PR has been reviewed.
You can assign the PR to them by writing /assign @umohnani8 in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

buck2202 · 2021-02-15T21:34:43Z

pinging @rhatdan @baude since you commented in the original issue. hope this looks ok! My familiarity with doing PRs is very low, so sorry if I'm missing something here

edit: also not that familiar with git. didn't realize that the first line of a commit message was treated as a "subject" with length restriction. will try to fix

edit2: my force-push to shorten the commit message looks like it brought in a different commit. I'll leave this to finish its checks, but if it doesn't work, I don't think I know git well enough to do anything less drastic than reforking and trying again :/

mheon · 2021-02-15T22:13:06Z

You can try rebasing atop master - that should remove the extra commit.

Overall LGTM but I'd feel more comfortable if someone more familiar with CRIU reviewed this prior to merge.

buck2202 · 2021-02-15T23:13:38Z

Turns out I hadn't fixed the commit msg length issue anyway...guess it doesn't count as a subject without a blank line following. Think I've fixed the errant commit and the string length now.

I guess I didn't fully understand the tests requirement. I think flagging none needed would be appropriate here? Does adding another commit with that message take care of it, or does it have to be present in each commit to be merged?

Sorry for the problems! I wanted to contribute and I've used git a little in isolation, but collaborative git/CI are new to me.

buck2202 · 2021-02-16T00:23:10Z

I guess I could actually modify one of the checkpoint/restore tests in e2e/checkpoint_test.go to cover these changes, but I haven't had any luck yet following the instructions to actually run a test locally. I'll keep trying when I find time again

github-actions · 2021-03-19T00:17:39Z

A friendly reminder that this PR had no activity for 30 days.

rhatdan · 2021-03-22T19:05:17Z

@buck2202 what is the scoop on this one.

buck2202 · 2021-03-22T19:19:36Z

@rhatdan Sorry, missed the bot action on this. I commented in the issue thread when it got marked stale (#8432 (comment)) but not here.

I wasn't able to get make validate, make lint, and tests to run locally when I was working on it; the last of those stopped me from creating a validation/regression test to pass the CI checks (and I think it does deserve a test). I still plan to revisit, but it's been a very busy month.

Beyond general review of my approach, the only open code question on my end was whether FinishedAt should retain its old value after restore (matching docker) or reset to nil once the container is running (what I did, not realizing it was not what docker did until the very end).

vrothberg · 2021-05-31T13:08:52Z

@buck2202, to pass CI, you need to add [NO TESTS NEEDED] to the commit message (see the bottom of https://github.com/containers/podman/pull/9388/checks?check_run_id=1906735716).

Can you rebase and repush?

openshift-ci · 2021-06-08T17:49:38Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: buck2202
To complete the pull request process, please assign umohnani8 after the PR has been reviewed.
You can assign the PR to them by writing /assign @umohnani8 in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

buck2202 · 2021-06-08T18:16:59Z

I've played with this off and on, but was never able to get the lint, validate, and ginkgo frameworks to function in my build environment following the docs. Seems like there could easily be a regression test on a container's exitcode after a stopping-checkpoint (137 rather than 0), but I hesitate to write it without actually being able to exercise it.

So, I've added the [NO TESTS NEEDED] flag per @vrothberg in the interest of getting this past checks for discussion. @adrianreber suggested that he did not see this as an issue that needed fixing--if you decide that this is something you'd like to merge, maybe someone could help me with the regression test.

edit: just to bring my rationale into the discussion here--the current behavior of podman container checkpoint foo forcibly stops the container but leaves a container exitcode of 0, while my naive expectation coming from docker was that the container exitcode would reflect the force-stop. The 0 doesn't allow a "natural" exit to be distinguished from a checkpoint stop in the inspect output--to maintain that distinction in my own scripts for managing transient containers, my workaround has been to exclusively use checkpoint --leave-running, followed by stop if desired.

edit2: test failure seems unrelated to my commits

[+0447s] not ok 212 podman networking: port with --userns=keep-id
[+0447s] # (in test file test/system/500-networking.bats, line 145)
[+0447s] #   `echo "$teststring" | nc 127.0.0.1 $myport' failed
[+0447s] # # /var/tmp/go/src/github.com/containers/podman/bin/podman-remote --url unix:/tmp/podman_tmp_hwB7 rm --all --force
[+0447s] # # /var/tmp/go/src/github.com/containers/podman/bin/podman-remote --url unix:/tmp/podman_tmp_hwB7 ps --all --external --format {{.ID}} {{.Names}}
[+0447s] # # /var/tmp/go/src/github.com/containers/podman/bin/podman-remote --url unix:/tmp/podman_tmp_hwB7 images --all --format {{.Repository}}:{{.Tag}} {{.ID}}
[+0447s] # quay.io/libpod/testimage:20210427 aadc32e2a626
[+0447s] # # /var/tmp/go/src/github.com/containers/podman/bin/podman-remote --url unix:/tmp/podman_tmp_hwB7 run -d --userns=keep-id -p 127.0.0.1:54322:54322 quay.io/libpod/testimage:20210427 nc -l -n -v -p 54322
[+0447s] # 9d9cbd582a67af8fc968b4fab7b766a5524b7ec4b411e091c26857735c93e062
[+0447s] # Ncat: No route to host.
[+0447s] # # [teardown]
[+0447s] # # /var/tmp/go/src/github.com/containers/podman/bin/podman-remote --url unix:/tmp/podman_tmp_hwB7 pod rm --all --force
[+0447s] # # /var/tmp/go/src/github.com/containers/podman/bin/podman-remote --url unix:/tmp/podman_tmp_hwB7 rm --all --force
[+0447s] # 9d9cbd582a67af8fc968b4fab7b766a5524b7ec4b411e091c26857735c93e062

vrothberg · 2021-06-21T11:19:26Z

@adrianreber PTAL

@buck2202, apologies. Could you do another rebase?

adrianreber · 2021-06-21T12:04:58Z

From my point of view this does not make sense. Why should a checkpointed container have 137 as exit code? From my point of view that sounds like something did not work. 137 is a SIGKILL which is correct because without --leave-running CRIU terminates the container process, but this would leave us in the situation to not be able to distinguish between a checkpointed container and a killed container process. This just moves the unclear state of the checkpointed container to something else.

If the command podman container checkpoint returns with a 0 the container should be successfully checkpointed.

There is no reason to do what Docker does for me, because the whole checkpoint/restore is completely different in Podman.

The 0 doesn't allow a "natural" exit to be distinguished from a checkpoint stop in the inspect output--to maintain that distinction in my own scripts for managing transient containers, my workaround has been to exclusively use checkpoint --leave-running, followed by stop if desired.

The right solution would be to track the state correctly as checkpointed. But if the Podman maintainers feel that 137 is the right exit status then I am also not against it. For me it just does not make sense.

mheon · 2021-06-22T15:57:07Z

I don't think an entire state for Checkpointed is appropriate (adding new states is a substantial burden as we have a lot of internal state-checking logic to ensure commands are in the right state), but we could easily have a bool in the database for "Checkpointed" which is set on a container killed by a checkpoint operation, and cleared again on the container restarting; we can expose this in podman inspect for the container.

adrianreber · 2021-06-23T06:19:56Z

I don't think an entire state for Checkpointed is appropriate (adding new states is a substantial burden as we have a lot of internal state-checking logic to ensure commands are in the right state), but we could easily have a bool in the database for "Checkpointed" which is set on a container killed by a checkpoint operation, and cleared again on the container restarting; we can expose this in podman inspect for the container.

That sounds like a really good idea.

…me if left running -replace direct ContainerStateStopped assignment with waitForExitFileAndSync() call. follows example of libpod/container_internal.go::stop() to properly capture stopped exit code -remove unconditional assignment to container FinishedTime. this leaves that unset if the container was left running, and is handled elsewhere if not Signed-off-by: David J. Buckmaster <[email protected]>

Signed-off-by: David J. Buckmaster <[email protected]>

[NO TESTS NEEDED] Signed-off-by: David J. Buckmaster <[email protected]>

buck2202 · 2021-06-23T18:38:18Z

From my point of view this does not make sense. Why should a checkpointed container have 137 as exit code? From my point of view that sounds like something did not work. 137 is a SIGKILL which is correct because without --leave-running CRIU terminates the container process, but this would leave us in the situation to not be able to distinguish between a checkpointed container and a killed container process. This just moves the unclear state of the checkpointed container to something else.

If the command podman container checkpoint returns with a 0 the container should be successfully checkpointed.

I agree that both the current behavior and my commits leave something undefined--or, at least, difficult to discern--without additional tracking. From my perspective, I thought it made more sense to rely solely on the podman container checkpoint exitcode for the checkpoint operation, with the container exitcode only reflecting the container processes' state (so, 137 if SIGKILLed, regardless of the signal source).

My experience may be atypical, though, so I understand that opinions may differ. I run transient containers in a preemptible cloud environment with periodic, non-terminating, exported checkpoints; and a terminating checkpoint either on-demand or when the VM reaches its max-allowed uptime. Because my VMs can be killed at any time (including during a checkpoint operation), I use the podman container checkpoint exitcode to maintain a fs-persistent list of known-valid checkpoints per container. This additional tracking probably biases me away from seeing an issue with a container reporting 137 after a successful checkpoint. My preference was that exitcode 0 be unambiguous, and I do have a workaround for that under the current behavior.

I also like @mheon's idea.

I rebased, but the checks seem to have failed again unrelated to my commits

[+0040s] Successfully built 6cdbe067979c
[+0040s] Successfully tagged env_and_volume_reader
[+0040s] Image for service reader was built because it did not already exist. To rebuild this image you must use `docker-compose build` or `docker-compose up --build`.
[+0040s] Creating env_and_volume_reader_1 ... done
[+0040s] Creating env_and_volume_writer_1 ... done
[+0047s] not ok 3 env_and_volume : port 5001
[+0047s] #  expected: podman_rulez
[+0047s] #    actual: <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
[+0047s] <title>500 Internal Server Error</title>
[+0047s] <h1>Internal Server Error</h1>
[+0047s] <p>The server encountered an internal error and was unable to complete your request. Either the server is overloaded or there is an error in the application.</p>

github-actions · 2021-07-31T00:03:24Z

A friendly reminder that this PR had no activity for 30 days.

rhatdan · 2021-08-01T16:29:15Z

@mheon @buck2202 @adrianreber What should we do with this PR?

mheon · 2021-08-01T18:06:55Z

I think we close, and I take over and create a PR to add a Checkpointed bool to the output of podman inspect.

buck2202 · 2021-08-03T22:18:08Z

No objection from me

github-actions · 2021-09-03T00:03:31Z

A friendly reminder that this PR had no activity for 30 days.

rhatdan · 2021-09-07T11:15:03Z

@mheon Did you ever work on this?

mheon · 2021-09-07T14:52:17Z

Negative. Let me do it now.

mheon · 2021-09-07T18:17:49Z

Closing in favor of #11471

github-actions bot added the stale-pr label Mar 19, 2021

buck2202 mentioned this pull request Apr 26, 2021

minor mismatch with docker: container exitcode is 0 after checkpoint without --leave-running #8432

Closed

github-actions bot removed the stale-pr label May 18, 2021

buck2202 added 3 commits June 23, 2021 13:38

reset ExitCode and FinishedTime for restored container

84a1572

Signed-off-by: David J. Buckmaster <[email protected]>

update container StartedTime upon restore from checkpoint

7488a1c

[NO TESTS NEEDED] Signed-off-by: David J. Buckmaster <[email protected]>

github-actions bot added the stale-pr label Jul 31, 2021

rhatdan removed the stale-pr label Aug 1, 2021

github-actions bot added the stale-pr label Sep 3, 2021

mheon closed this Sep 7, 2021

github-actions bot added the locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. label Sep 22, 2023

github-actions bot locked as resolved and limited conversation to collaborators Sep 22, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

exitcode/timestamp consistency after checkpoint creation/restore #9388

exitcode/timestamp consistency after checkpoint creation/restore #9388

buck2202 commented Feb 15, 2021 •

edited

Loading

openshift-ci-robot commented Feb 15, 2021

buck2202 commented Feb 15, 2021 •

edited

Loading

mheon commented Feb 15, 2021

buck2202 commented Feb 15, 2021

buck2202 commented Feb 16, 2021

github-actions bot commented Mar 19, 2021

rhatdan commented Mar 22, 2021

buck2202 commented Mar 22, 2021

vrothberg commented May 31, 2021

openshift-ci bot commented Jun 8, 2021

buck2202 commented Jun 8, 2021 •

edited

Loading

vrothberg commented Jun 21, 2021

adrianreber commented Jun 21, 2021

mheon commented Jun 22, 2021

adrianreber commented Jun 23, 2021

buck2202 commented Jun 23, 2021

github-actions bot commented Jul 31, 2021

rhatdan commented Aug 1, 2021

mheon commented Aug 1, 2021

buck2202 commented Aug 3, 2021

github-actions bot commented Sep 3, 2021

rhatdan commented Sep 7, 2021

mheon commented Sep 7, 2021

mheon commented Sep 7, 2021

exitcode/timestamp consistency after checkpoint creation/restore #9388

exitcode/timestamp consistency after checkpoint creation/restore #9388

Conversation

buck2202 commented Feb 15, 2021 • edited Loading

openshift-ci-robot commented Feb 15, 2021

buck2202 commented Feb 15, 2021 • edited Loading

mheon commented Feb 15, 2021

buck2202 commented Feb 15, 2021

buck2202 commented Feb 16, 2021

github-actions bot commented Mar 19, 2021

rhatdan commented Mar 22, 2021

buck2202 commented Mar 22, 2021

vrothberg commented May 31, 2021

openshift-ci bot commented Jun 8, 2021

buck2202 commented Jun 8, 2021 • edited Loading

vrothberg commented Jun 21, 2021

adrianreber commented Jun 21, 2021

mheon commented Jun 22, 2021

adrianreber commented Jun 23, 2021

buck2202 commented Jun 23, 2021

github-actions bot commented Jul 31, 2021

rhatdan commented Aug 1, 2021

mheon commented Aug 1, 2021

buck2202 commented Aug 3, 2021

github-actions bot commented Sep 3, 2021

rhatdan commented Sep 7, 2021

mheon commented Sep 7, 2021

mheon commented Sep 7, 2021

buck2202 commented Feb 15, 2021 •

edited

Loading

buck2202 commented Feb 15, 2021 •

edited

Loading

buck2202 commented Jun 8, 2021 •

edited

Loading