Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

exitcode/timestamp consistency after checkpoint creation/restore #9388

Closed
wants to merge 3 commits into from
Closed

Conversation

buck2202
Copy link

@buck2202 buck2202 commented Feb 15, 2021

In docker, the exitcode of a container checkpointed without --leave-running is 137, which matches the exitcode of a container that is manually stopped. podman leaves the exitcode as 0

In libpod/container_internal_linux.go::checkpoint(), I used libpod/container_internal.go::stop() as an example and copied the call to c.waitForExitFileAndSync() rather than manually assigning the exitcode to 137. I assumed this would be better than using 137 as a magic number, but maybe that would've been fine. I'm not totally sure if there are other implications to this function call. Additionally in checkpoint(), I removed the unconditional assignment to the container's FinishedTime. This happens within the waitForExitFileAndSync call chain if the container is stopped, and probably shouldn't have been happening at all if --leave-running was specified.

In libpod/container_internal_linux.go::restore() after the container state is set to running, I explicitly set the container exitcode to zero, its StartedTime to now, and clear its FinishedTime.

With these changes, the timestamps and exitcode almost track with docker's behavior. I noticed in collecting the output here that docker never actually resets the "FinishedAt" time when a container is restored. If the goal is completely matching docker, I can remove the line that clears it upon restore. IMO, it seems more logical to have a null FinishedAt time for a container that's running.

Podman:

$ sudo podman run --detach --name cr busybox /bin/sleep 60
7f2de709a7be88542d03cebf194deca877b5ad69557f48597259767e1b4a8f83
$ sudo podman container inspect cr |grep -E "(\"Created|FinishedAt|StartedAt|ExitCode)"
        "Created": "2021-02-15T16:07:48.815788417-05:00",
            "ExitCode": 0,
            "StartedAt": "2021-02-15T16:07:49.374606028-05:00",
            "FinishedAt": "0001-01-01T00:00:00Z",
$ sudo podman container checkpoint cr
7f2de709a7be88542d03cebf194deca877b5ad69557f48597259767e1b4a8f83
$ sudo podman container inspect cr |grep -E "(\"Created|FinishedAt|StartedAt|ExitCode)"
        "Created": "2021-02-15T16:07:48.815788417-05:00",
            "ExitCode": 137,
            "StartedAt": "2021-02-15T16:07:49.374606028-05:00",
            "FinishedAt": "2021-02-15T16:08:03.296056384-05:00",
$ sudo podman container restore cr
7f2de709a7be88542d03cebf194deca877b5ad69557f48597259767e1b4a8f83
$ sudo podman container inspect cr |grep -E "(\"Created|FinishedAt|StartedAt|ExitCode)"
        "Created": "2021-02-15T16:07:48.815788417-05:00",
            "ExitCode": 0,
            "StartedAt": "2021-02-15T16:08:23.071588181-05:00",
            "FinishedAt": "0001-01-01T00:00:00Z",
$ sleep 60
$ sudo podman container inspect cr |grep -E "(\"Created|FinishedAt|StartedAt|ExitCode)"
        "Created": "2021-02-15T16:07:48.815788417-05:00",
            "ExitCode": 0,
            "StartedAt": "2021-02-15T16:08:23.071588181-05:00",
            "FinishedAt": "2021-02-15T16:09:09.804056384-05:00",

Docker:

$ docker run --detach --name cr busybox /bin/sleep 60
66d0545e3d3932f58e2b59f8ea7489dea1c909ebbf02789883478f98ebd9fc3d
$ docker container inspect cr | grep -E "(\"Created|FinishedAt|StartedAt|ExitCode)"
        "Created": "2021-02-15T21:11:54.129271959Z",
            "ExitCode": 0,
            "StartedAt": "2021-02-15T21:11:54.772211374Z",
            "FinishedAt": "0001-01-01T00:00:00Z"
$ docker checkpoint create cr chk
chk
$ docker container inspect cr | grep -E "(\"Created|FinishedAt|StartedAt|ExitCode)"
        "Created": "2021-02-15T21:11:54.129271959Z",
            "ExitCode": 137,
            "StartedAt": "2021-02-15T21:11:54.772211374Z",
            "FinishedAt": "2021-02-15T21:12:06.879920447Z"
$ docker start --checkpoint chk cr
$ docker container inspect cr | grep -E "(\"Created|FinishedAt|StartedAt|ExitCode)"
        "Created": "2021-02-15T21:11:54.129271959Z",
            "ExitCode": 0,
            "StartedAt": "2021-02-15T21:12:25.566876155Z",
            "FinishedAt": "2021-02-15T21:12:06.879920447Z"
$ sleep 60
$ docker container inspect cr | grep -E "(\"Created|FinishedAt|StartedAt|ExitCode)"
        "Created": "2021-02-15T21:11:54.129271959Z",
            "ExitCode": 0,
            "StartedAt": "2021-02-15T21:12:25.566876155Z",
            "FinishedAt": "2021-02-15T21:13:14.151229075Z"

Unfortunately I couldn't get the validate or lint tests to function. Not sure what I'm missing in following the instructions in the contributing doc :/

Closes #8432

@openshift-ci-robot
Copy link
Collaborator

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: buck2202
To complete the pull request process, please assign umohnani8 after the PR has been reviewed.
You can assign the PR to them by writing /assign @umohnani8 in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@buck2202
Copy link
Author

buck2202 commented Feb 15, 2021

pinging @rhatdan @baude since you commented in the original issue. hope this looks ok! My familiarity with doing PRs is very low, so sorry if I'm missing something here

edit: also not that familiar with git. didn't realize that the first line of a commit message was treated as a "subject" with length restriction. will try to fix

edit2: my force-push to shorten the commit message looks like it brought in a different commit. I'll leave this to finish its checks, but if it doesn't work, I don't think I know git well enough to do anything less drastic than reforking and trying again :/

@mheon
Copy link
Member

mheon commented Feb 15, 2021

You can try rebasing atop master - that should remove the extra commit.

Overall LGTM but I'd feel more comfortable if someone more familiar with CRIU reviewed this prior to merge.

@buck2202
Copy link
Author

Turns out I hadn't fixed the commit msg length issue anyway...guess it doesn't count as a subject without a blank line following. Think I've fixed the errant commit and the string length now.

I guess I didn't fully understand the tests requirement. I think flagging none needed would be appropriate here? Does adding another commit with that message take care of it, or does it have to be present in each commit to be merged?

Sorry for the problems! I wanted to contribute and I've used git a little in isolation, but collaborative git/CI are new to me.

@buck2202
Copy link
Author

I guess I could actually modify one of the checkpoint/restore tests in e2e/checkpoint_test.go to cover these changes, but I haven't had any luck yet following the instructions to actually run a test locally. I'll keep trying when I find time again

@github-actions
Copy link

A friendly reminder that this PR had no activity for 30 days.

@rhatdan
Copy link
Member

rhatdan commented Mar 22, 2021

@buck2202 what is the scoop on this one.

@buck2202
Copy link
Author

@rhatdan Sorry, missed the bot action on this. I commented in the issue thread when it got marked stale (#8432 (comment)) but not here.

I wasn't able to get make validate, make lint, and tests to run locally when I was working on it; the last of those stopped me from creating a validation/regression test to pass the CI checks (and I think it does deserve a test). I still plan to revisit, but it's been a very busy month.

Beyond general review of my approach, the only open code question on my end was whether FinishedAt should retain its old value after restore (matching docker) or reset to nil once the container is running (what I did, not realizing it was not what docker did until the very end).

@vrothberg
Copy link
Member

@buck2202, to pass CI, you need to add [NO TESTS NEEDED] to the commit message (see the bottom of https://github.com/containers/podman/pull/9388/checks?check_run_id=1906735716).

Can you rebase and repush?

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Jun 8, 2021

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: buck2202
To complete the pull request process, please assign umohnani8 after the PR has been reviewed.
You can assign the PR to them by writing /assign @umohnani8 in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@buck2202
Copy link
Author

buck2202 commented Jun 8, 2021

I've played with this off and on, but was never able to get the lint, validate, and ginkgo frameworks to function in my build environment following the docs. Seems like there could easily be a regression test on a container's exitcode after a stopping-checkpoint (137 rather than 0), but I hesitate to write it without actually being able to exercise it.

So, I've added the [NO TESTS NEEDED] flag per @vrothberg in the interest of getting this past checks for discussion. @adrianreber suggested that he did not see this as an issue that needed fixing--if you decide that this is something you'd like to merge, maybe someone could help me with the regression test.

edit: just to bring my rationale into the discussion here--the current behavior of podman container checkpoint foo forcibly stops the container but leaves a container exitcode of 0, while my naive expectation coming from docker was that the container exitcode would reflect the force-stop. The 0 doesn't allow a "natural" exit to be distinguished from a checkpoint stop in the inspect output--to maintain that distinction in my own scripts for managing transient containers, my workaround has been to exclusively use checkpoint --leave-running, followed by stop if desired.

edit2: test failure seems unrelated to my commits
[+0447s] not ok 212 podman networking: port with --userns=keep-id
[+0447s] # (in test file test/system/500-networking.bats, line 145)
[+0447s] #   `echo "$teststring" | nc 127.0.0.1 $myport' failed
[+0447s] # # /var/tmp/go/src/github.com/containers/podman/bin/podman-remote --url unix:/tmp/podman_tmp_hwB7 rm --all --force
[+0447s] # # /var/tmp/go/src/github.com/containers/podman/bin/podman-remote --url unix:/tmp/podman_tmp_hwB7 ps --all --external --format {{.ID}} {{.Names}}
[+0447s] # # /var/tmp/go/src/github.com/containers/podman/bin/podman-remote --url unix:/tmp/podman_tmp_hwB7 images --all --format {{.Repository}}:{{.Tag}} {{.ID}}
[+0447s] # quay.io/libpod/testimage:20210427 aadc32e2a626
[+0447s] # # /var/tmp/go/src/github.com/containers/podman/bin/podman-remote --url unix:/tmp/podman_tmp_hwB7 run -d --userns=keep-id -p 127.0.0.1:54322:54322 quay.io/libpod/testimage:20210427 nc -l -n -v -p 54322
[+0447s] # 9d9cbd582a67af8fc968b4fab7b766a5524b7ec4b411e091c26857735c93e062
[+0447s] # Ncat: No route to host.
[+0447s] # # [teardown]
[+0447s] # # /var/tmp/go/src/github.com/containers/podman/bin/podman-remote --url unix:/tmp/podman_tmp_hwB7 pod rm --all --force
[+0447s] # # /var/tmp/go/src/github.com/containers/podman/bin/podman-remote --url unix:/tmp/podman_tmp_hwB7 rm --all --force
[+0447s] # 9d9cbd582a67af8fc968b4fab7b766a5524b7ec4b411e091c26857735c93e062

@vrothberg
Copy link
Member

@adrianreber PTAL

@buck2202, apologies. Could you do another rebase?

@adrianreber
Copy link
Collaborator

From my point of view this does not make sense. Why should a checkpointed container have 137 as exit code? From my point of view that sounds like something did not work. 137 is a SIGKILL which is correct because without --leave-running CRIU terminates the container process, but this would leave us in the situation to not be able to distinguish between a checkpointed container and a killed container process. This just moves the unclear state of the checkpointed container to something else.

If the command podman container checkpoint returns with a 0 the container should be successfully checkpointed.

There is no reason to do what Docker does for me, because the whole checkpoint/restore is completely different in Podman.

The 0 doesn't allow a "natural" exit to be distinguished from a checkpoint stop in the inspect output--to maintain that distinction in my own scripts for managing transient containers, my workaround has been to exclusively use checkpoint --leave-running, followed by stop if desired.

The right solution would be to track the state correctly as checkpointed. But if the Podman maintainers feel that 137 is the right exit status then I am also not against it. For me it just does not make sense.

@mheon
Copy link
Member

mheon commented Jun 22, 2021

I don't think an entire state for Checkpointed is appropriate (adding new states is a substantial burden as we have a lot of internal state-checking logic to ensure commands are in the right state), but we could easily have a bool in the database for "Checkpointed" which is set on a container killed by a checkpoint operation, and cleared again on the container restarting; we can expose this in podman inspect for the container.

@adrianreber
Copy link
Collaborator

I don't think an entire state for Checkpointed is appropriate (adding new states is a substantial burden as we have a lot of internal state-checking logic to ensure commands are in the right state), but we could easily have a bool in the database for "Checkpointed" which is set on a container killed by a checkpoint operation, and cleared again on the container restarting; we can expose this in podman inspect for the container.

That sounds like a really good idea.

buck2202 added 3 commits June 23, 2021 13:38
…me if left running

-replace direct ContainerStateStopped assignment with waitForExitFileAndSync() call. follows example of libpod/container_internal.go::stop() to properly capture stopped exit code
-remove unconditional assignment to container FinishedTime. this leaves that unset if the container was left running, and is handled elsewhere if not

Signed-off-by: David J. Buckmaster <[email protected]>
[NO TESTS NEEDED]

Signed-off-by: David J. Buckmaster <[email protected]>
@buck2202
Copy link
Author

From my point of view this does not make sense. Why should a checkpointed container have 137 as exit code? From my point of view that sounds like something did not work. 137 is a SIGKILL which is correct because without --leave-running CRIU terminates the container process, but this would leave us in the situation to not be able to distinguish between a checkpointed container and a killed container process. This just moves the unclear state of the checkpointed container to something else.

If the command podman container checkpoint returns with a 0 the container should be successfully checkpointed.

I agree that both the current behavior and my commits leave something undefined--or, at least, difficult to discern--without additional tracking. From my perspective, I thought it made more sense to rely solely on the podman container checkpoint exitcode for the checkpoint operation, with the container exitcode only reflecting the container processes' state (so, 137 if SIGKILLed, regardless of the signal source).

My experience may be atypical, though, so I understand that opinions may differ. I run transient containers in a preemptible cloud environment with periodic, non-terminating, exported checkpoints; and a terminating checkpoint either on-demand or when the VM reaches its max-allowed uptime. Because my VMs can be killed at any time (including during a checkpoint operation), I use the podman container checkpoint exitcode to maintain a fs-persistent list of known-valid checkpoints per container. This additional tracking probably biases me away from seeing an issue with a container reporting 137 after a successful checkpoint. My preference was that exitcode 0 be unambiguous, and I do have a workaround for that under the current behavior.

I also like @mheon's idea.

I rebased, but the checks seem to have failed again unrelated to my commits
[+0040s] Successfully built 6cdbe067979c
[+0040s] Successfully tagged env_and_volume_reader
[+0040s] Image for service reader was built because it did not already exist. To rebuild this image you must use `docker-compose build` or `docker-compose up --build`.
[+0040s] Creating env_and_volume_reader_1 ... done
[+0040s] Creating env_and_volume_writer_1 ... done
[+0047s] not ok 3 env_and_volume : port 5001
[+0047s] #  expected: podman_rulez
[+0047s] #    actual: <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
[+0047s] <title>500 Internal Server Error</title>
[+0047s] <h1>Internal Server Error</h1>
[+0047s] <p>The server encountered an internal error and was unable to complete your request. Either the server is overloaded or there is an error in the application.</p>

@github-actions
Copy link

A friendly reminder that this PR had no activity for 30 days.

@rhatdan
Copy link
Member

rhatdan commented Aug 1, 2021

@mheon @buck2202 @adrianreber What should we do with this PR?

@rhatdan rhatdan removed the stale-pr label Aug 1, 2021
@mheon
Copy link
Member

mheon commented Aug 1, 2021

I think we close, and I take over and create a PR to add a Checkpointed bool to the output of podman inspect.

@buck2202
Copy link
Author

buck2202 commented Aug 3, 2021

No objection from me

@github-actions
Copy link

github-actions bot commented Sep 3, 2021

A friendly reminder that this PR had no activity for 30 days.

@rhatdan
Copy link
Member

rhatdan commented Sep 7, 2021

@mheon Did you ever work on this?

@mheon
Copy link
Member

mheon commented Sep 7, 2021

Negative. Let me do it now.

@mheon
Copy link
Member

mheon commented Sep 7, 2021

Closing in favor of #11471

@mheon mheon closed this Sep 7, 2021
@github-actions github-actions bot added the locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. label Sep 22, 2023
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Sep 22, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. stale-pr
Projects
None yet
Development

Successfully merging this pull request may close these issues.

minor mismatch with docker: container exitcode is 0 after checkpoint without --leave-running
6 participants