Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Final v2.0.5 backports #7402

Conversation

mheon
Copy link
Member

@mheon mheon commented Aug 21, 2020

These are really the last ones. Definitely. For sure this time.

@openshift-ci-robot
Copy link
Collaborator

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mheon

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 21, 2020
@baude
Copy link
Member

baude commented Aug 21, 2020

LGTM

@jwhonce
Copy link
Member

jwhonce commented Aug 21, 2020

/hold
/lgtm

@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Aug 21, 2020
@mheon
Copy link
Member Author

mheon commented Aug 21, 2020

@rhatdan Hold/LGTM this please

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Aug 21, 2020
@mheon
Copy link
Member Author

mheon commented Aug 21, 2020

Oh, TYTY @jwhonce

@TomSweeneyRedHat
Copy link
Member

LGTM
I sense a nice cold beverage in @mheon's near future....

@mheon
Copy link
Member Author

mheon commented Aug 21, 2020

CI is failing, and we are also waiting on one more c/common revendor from @baude to fix Podman remote on Windows and OS X.

@mheon
Copy link
Member Author

mheon commented Aug 24, 2020

@edsantiago Poke - could I ask for a moment of your time to take a look at these tests?

@edsantiago
Copy link
Member

@mheon I can reproduce, it's a real error: looking at logs gives me the supremely unhelpful:

{
  "cause": "some containers failed",
  "message": "error starting some containers: some containers failed",
  "response": 500
}

@mheon
Copy link
Member Author

mheon commented Aug 24, 2020

At this point I'm wondering if it depends on some other part of the test that wasn't backported. Somewhat tempted to disable it.

@edsantiago
Copy link
Member

Server logs with --log-level=debug does not give me any clues either, posting in case it helps you:

time="2020-08-24T07:50:49-06:00" level=debug msg="APIHandler -- Method: POST URL: /v1.40/libpod/pods/bar/start"
time="2020-08-24T07:50:49-06:00" level=debug msg="Strongconnecting node 6af392e59fc0dc894c533504ba775d8fe4a23120a9628f8eb0d7c23e2b01f559"
time="2020-08-24T07:50:49-06:00" level=debug msg="Pushed 6af392e59fc0dc894c533504ba775d8fe4a23120a9628f8eb0d7c23e2b01f559 onto stack"
time="2020-08-24T07:50:49-06:00" level=debug msg="Recursing to successor node 7cb597e875dac131e9020f458ac889b8d24ff4e2afe2943db1ff60efc34ab6d2"
time="2020-08-24T07:50:49-06:00" level=debug msg="Strongconnecting node 7cb597e875dac131e9020f458ac889b8d24ff4e2afe2943db1ff60efc34ab6d2"
time="2020-08-24T07:50:49-06:00" level=debug msg="Pushed 7cb597e875dac131e9020f458ac889b8d24ff4e2afe2943db1ff60efc34ab6d2 onto stack"
time="2020-08-24T07:50:49-06:00" level=debug msg="Finishing node 7cb597e875dac131e9020f458ac889b8d24ff4e2afe2943db1ff60efc34ab6d2. Popped 7cb597e875dac131e9020f458ac889b8d24ff4e2afe2943db1ff60efc34ab6d2 off stack"
time="2020-08-24T07:50:49-06:00" level=debug msg="Finishing node 6af392e59fc0dc894c533504ba775d8fe4a23120a9628f8eb0d7c23e2b01f559. Popped 6af392e59fc0dc894c533504ba775d8fe4a23120a9628f8eb0d7c23e2b01f559 off stack"
time="2020-08-24T07:50:49-06:00" level=debug msg="mounted container \"7cb597e875dac131e9020f458ac889b8d24ff4e2afe2943db1ff60efc34ab6d2\" at \"/dev/shm/test-apiv2.tmp.KfdsUS/overlay/e58e0f3de08a6b1be4818e3f4c5b1d83333ad697196863a0684a7159e539c9df/merged\""
time="2020-08-24T07:50:49-06:00" level=debug msg="Created root filesystem for container 7cb597e875dac131e9020f458ac889b8d24ff4e2afe2943db1ff60efc34ab6d2 at /dev/shm/test-apiv2.tmp.KfdsUS/overlay/e58e0f3de08a6b1be4818e3f4c5b1d83333ad697196863a0684a7159e539c9df/merged"
time="2020-08-24T07:50:49-06:00" level=debug msg="Made network namespace at /run/user/14904/netns/cni-6decf55c-1c60-166a-3ad0-065d2c3a696a for container 7cb597e875dac131e9020f458ac889b8d24ff4e2afe2943db1ff60efc34ab6d2"
time="2020-08-24T07:50:49-06:00" level=debug msg="slirp4netns command: /usr/bin/slirp4netns --disable-host-loopback --mtu 65520 --enable-sandbox --enable-seccomp -c -e 3 -r 4 --netns-type=path /run/user/14904/netns/cni-6decf55c-1c60-166a-3ad0-065d2c3a696a tap0"
time="2020-08-24T07:50:49-06:00" level=debug msg="/etc/system-fips does not exist on host, not mounting FIPS mode secret"
time="2020-08-24T07:50:49-06:00" level=warning msg="User mount overriding libpod mount at \"/dev/shm\""
time="2020-08-24T07:50:49-06:00" level=debug msg="Setting CGroups for container 7cb597e875dac131e9020f458ac889b8d24ff4e2afe2943db1ff60efc34ab6d2 to user-libpod_pod_58022376464d800a078afba95f76124c805933a5e6f5c935d87c7b955f93a616.slice:libpod:7cb597e875dac131e9020f458ac889b8d24ff4e2afe2943db1ff60efc34ab6d2"
time="2020-08-24T07:50:49-06:00" level=debug msg="reading hooks from /usr/share/containers/oci/hooks.d"
time="2020-08-24T07:50:49-06:00" level=debug msg="Created OCI spec for container 7cb597e875dac131e9020f458ac889b8d24ff4e2afe2943db1ff60efc34ab6d2 at /dev/shm/test-apiv2.tmp.KfdsUS/overlay-containers/7cb597e875dac131e9020f458ac889b8d24ff4e2afe2943db1ff60efc34ab6d2/userdata/config.json"
time="2020-08-24T07:50:49-06:00" level=debug msg="/usr/bin/conmon messages will be logged to syslog"
time="2020-08-24T07:50:49-06:00" level=debug msg="running conmon: /usr/bin/conmon" args="[--api-version 1 -c 7cb597e875dac131e9020f458ac889b8d24ff4e2afe2943db1ff60efc34ab6d2 -u 7cb597e875dac131e9020f458ac889b8d24ff4e2afe2943db1ff60efc34ab6d2 -r /usr/bin/crun -b /dev/shm/test-apiv2.tmp.KfdsUS/overlay-containers/7cb597e875dac131e9020f458ac889b8d24ff4e2afe2943db1ff60efc34ab6d2/userdata -p /run/user/14904/containers/overlay-containers/7cb597e875dac131e9020f458ac889b8d24ff4e2afe2943db1ff60efc34ab6d2/userdata/pidfile -n 58022376464d-infra --exit-dir /run/user/14904/libpod/tmp/exits --socket-dir-path /run/user/14904/libpod/tmp/socket -s -l k8s-file:/dev/shm/test-apiv2.tmp.KfdsUS/overlay-containers/7cb597e875dac131e9020f458ac889b8d24ff4e2afe2943db1ff60efc34ab6d2/userdata/ctr.log --log-level debug --syslog --conmon-pidfile /run/user/14904/containers/overlay-containers/7cb597e875dac131e9020f458ac889b8d24ff4e2afe2943db1ff60efc34ab6d2/userdata/conmon.pid --exit-command /home/esm/src/atomic/2018-02.podman/libpod/bin/podman --exit-command-arg --root --exit-command-arg /dev/shm/test-apiv2.tmp.KfdsUS --exit-command-arg --runroot --exit-command-arg /run/user/14904/containers --exit-command-arg --log-level --exit-command-arg debug --exit-command-arg --cgroup-manager --exit-command-arg systemd --exit-command-arg --tmpdir --exit-command-arg /run/user/14904/libpod/tmp --exit-command-arg --runtime --exit-command-arg /usr/bin/crun --exit-command-arg --storage-driver --exit-command-arg overlay --exit-command-arg --storage-opt --exit-command-arg overlay.mount_program=/usr/bin/fuse-overlayfs --exit-command-arg --events-backend --exit-command-arg file --exit-command-arg --syslog --exit-command-arg true --exit-command-arg container --exit-command-arg cleanup --exit-command-arg 7cb597e875dac131e9020f458ac889b8d24ff4e2afe2943db1ff60efc34ab6d2]"
[conmon:d]: failed to write to /proc/self/oom_score_adj: Permission denied

time="2020-08-24T07:50:49-06:00" level=debug msg="Received: -1"
time="2020-08-24T07:50:49-06:00" level=debug msg="Cleaning up container 7cb597e875dac131e9020f458ac889b8d24ff4e2afe2943db1ff60efc34ab6d2"
time="2020-08-24T07:50:49-06:00" level=debug msg="Tearing down network namespace at /run/user/14904/netns/cni-6decf55c-1c60-166a-3ad0-065d2c3a696a for container 7cb597e875dac131e9020f458ac889b8d24ff4e2afe2943db1ff60efc34ab6d2"
time="2020-08-24T07:50:49-06:00" level=debug msg="unmounted container \"7cb597e875dac131e9020f458ac889b8d24ff4e2afe2943db1ff60efc34ab6d2\""
time="2020-08-24T07:50:49-06:00" level=info msg="Request Failed(Internal Server Error): error starting some containers: some containers failed"

@edsantiago
Copy link
Member

well... except that AFAICT the test itself has not changed, only the pod code. This implies that, before your PR, that test was passing.

@mheon
Copy link
Member Author

mheon commented Aug 24, 2020

Damn, you're right.

@edsantiago
Copy link
Member

881af314d introduces the error. I can't seem to reproduce using podman-remote commands

@mheon
Copy link
Member Author

mheon commented Aug 24, 2020

That landed in a separate PR that passed tests with flying colors...

@mheon
Copy link
Member Author

mheon commented Aug 24, 2020

@edsantiago Anything from Conmon in syslog? It seems to be doing the thing where it prints the actual error to the logs and doesn't give it to us.

@edsantiago
Copy link
Member

I see nothing useful via journalctl -f nor tail -f /var/log/messages.

You can very easily reproduce this:

$ ./test/apiv2/test-apiv2 40

@edsantiago
Copy link
Member

Through selective commenting-out of individual tests, I see that what's causing the error is a restart:

#t POST  "libpod/pods/bar/restart (restart on nonexistent pod)" '' 404
t POST libpod/pods/create name=bar 201 .Id~[0-9a-f]\\{64\\}
pod_bar_id=$(jq -r .Id <<<"$output")

#t POST  libpod/pods/bar/restart '' 200 \
#  .Errs=null \
#  .Id=$pod_bar_id

#t GET  libpod/pods/bar/json        200 \
#  .State=Running

t POST  libpod/pods/bar/restart '' 200 \
  .Errs=null \
  .Id=$pod_bar_id

#t POST "libpod/pods/bar/stop?t=invalid" '' 400 \
#  .cause="schema: error converting value for \"t\"" \
#  .message~"Failed to parse parameters for"

podman run -d --pod bar busybox sleep 999

t POST libpod/pods/bar/stop?t=1 '' 200 \
  .Errs=null \
  .Id=$pod_bar_id

t POST libpod/pods/bar/start '' 200 sdf

If I comment out that "restart", the error goes away.

Once again, I cannot reproduce this via command line:

$ ./bin/podman-remote pod create --name=foo;./bin/podman-remote pod restart foo;./bin/podman-remote run -d --pod foo busybox sleep 999;./bin/podman-remote pod stop -t 1 foo;./bin/podman-remote pod start foo
4c75633ad2e1e78ccd9d873e5e45c5f336af13699691b40d5cec0ec769cf9606
4c75633ad2e1e78ccd9d873e5e45c5f336af13699691b40d5cec0ec769cf9606
e00074fa049046a7c9467b09efa237b7fe1a71c794fd13087356fce034bfbcee
4c75633ad2e1e78ccd9d873e5e45c5f336af13699691b40d5cec0ec769cf9606
4c75633ad2e1e78ccd9d873e5e45c5f336af13699691b40d5cec0ec769cf9606

@mheon
Copy link
Member Author

mheon commented Aug 24, 2020

Hmmmm. Investigating.

@openshift-ci-robot openshift-ci-robot removed the lgtm Indicates that a PR is ready to be merged. label Aug 24, 2020
@edsantiago
Copy link
Member

Got it! This is the smallest reproducer I can come up with. The key seems to be to start an already-started container:

$ ./bin/podman-remote pod create --name=foo;./bin/podman-remote pod start foo;./bin/podman-remote pod create --name=bar;./bin/podman-remote run -d --pod bar busybox sleep 999;./bin/podman-remote pod stop -t 1 bar;./bin/podman-remote pod start bar
71e850427778688a4464685cc82874ad2418d1c338a17ce8ff509f13a2e8260a
71e850427778688a4464685cc82874ad2418d1c338a17ce8ff509f13a2e8260a
7449c1cfc2f14c40f476312e73eaab06f919169cd5b2b0c073c5f2d8ee107742
3418123acafdf565cd7bb88c96e60218a3ef26edfa54d5a14d440ebc1c824cd3
7449c1cfc2f14c40f476312e73eaab06f919169cd5b2b0c073c5f2d8ee107742
Error: error starting some containers: some containers failed

There may be a smaller or different reproducer, but I think this should get you on the right track.

baude and others added 3 commits August 24, 2020 11:31
because a pod's network information is dictated by the infra container at creation, a container cannot be created with network attributes.  this has been difficult for users to understand.  we now return an error when a container is being created inside a pod and passes any of the following attributes:

* static IP (v4 and v6)
* static mac
* ports -p (i.e. -p 8080:80)
* exposed ports (i.e. 222-225)
* publish ports from image -P

Signed-off-by: Brent Baude <[email protected]>

<MH: Fixed cherry pick conflicts and compile>

Signed-off-by: Matthew Heon <[email protected]>
Most Libpod containers are made via `pkg/specgen/generate` which
includes code to generate an appropriate exit command which will
handle unmounting the container's storage, cleaning up the
container's network, etc. There is one notable exception: pod
infra containers, which are made entirely within Libpod and do
not touch pkg/specgen. As such, no cleanup process, network never
cleaned up, bad things can happen.

There is good news, though - it's not that difficult to add this,
and it's done in this PR. Generally speaking, we don't allow
passing options directly to the infra container at create time,
but we do (optionally) proxy a pre-approved set of options into
it when we create it. Add ExitCommand to these options, and set
it at time of pod creation using the same code we use to generate
exit commands for normal containers.

Fixes containers#7103

Signed-off-by: Matthew Heon <[email protected]>

<MH: Fixed cherry-pick conflicts>

Signed-off-by: Matthew Heon <[email protected]>
This should help alleviate races where the pod is not fully
cleaned up before subsequent API calls happen.

Signed-off-by: Matthew Heon <[email protected]>
Really. I promise. No more after this.

Signed-off-by: Matthew Heon <[email protected]>
@mheon mheon force-pushed the last_pr_before_205_really_this_time branch from 5bad9c3 to 884355c Compare August 24, 2020 15:36
@edsantiago
Copy link
Member

Update: the following will also reproduce it:

$ ./bin/podman pod create --name=foo;./bin/podman pod start foo;./bin/podman pod create --name=bar;./bin/podman-remote run -d --pod bar busybox sleep 999;./bin/podman-remote pod stop -t 1 bar;./bin/podman-remote pod start bar
cc89c70513e8c93b5f57cba57fd2c869a04e89e3e68cac899bf4c11bbfe1e1a1
cc89c70513e8c93b5f57cba57fd2c869a04e89e3e68cac899bf4c11bbfe1e1a1
84c3ccadb39e23d0dd70384fed896d60bb23d68e33acf7ffe43393a83fce427b
827e76d45de187f17ec67d28508a4e2d2b4758e5d4ef3a54faf7af39e9fd9d71
84c3ccadb39e23d0dd70384fed896d60bb23d68e33acf7ffe43393a83fce427b
Error: error starting some containers: some containers failed

(difference is: the first two commands, the ones with foo, are podman only, not podman-remote).

There is something about podman pod start already-running-pod that is leaving bad state behind somewhere.

@mheon
Copy link
Member Author

mheon commented Aug 24, 2020

Needs containers/storage#698 to fix CI

We need this release out by end of day, so we don't have time to
do this right. Disable the vendor task and manually add c/storage
PR containers#698 to the vendored copy of c/storage to make the tests pass.

Once containers#698 merges into c/storage, we need to remove this commit
and backport it to the v1.20 stable branch, then cut a release
there.

Signed-off-by: Matthew Heon <[email protected]>
@mheon mheon force-pushed the last_pr_before_205_really_this_time branch from d6c2e92 to ae2ee65 Compare August 24, 2020 19:14
mheon added 2 commits August 24, 2020 15:18
Signed-off-by: Matthew Heon <[email protected]>
Signed-off-by: Matthew Heon <[email protected]>
@rhatdan
Copy link
Member

rhatdan commented Aug 24, 2020

Released containers/storage https://github.com/containers/storage/releases/tag/v1.23.2

@baude
Copy link
Member

baude commented Aug 24, 2020

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Aug 24, 2020
@edsantiago
Copy link
Member

Cirrus shows "Failed to stop: Remote host terminated the handshake" on a red background. Probably a flake, although not one that I'm accustomed to. Restarting.

@mheon
Copy link
Member Author

mheon commented Aug 24, 2020

[+0256s] # error opening file /sys/fs/cgroup//system.slice/buildah-buildah651017202-477311.scope/cgroup.freeze: No such file or directory

Lovely. Restarting again.

@edsantiago
Copy link
Member

Oh yeah, that's the elusive #7148

@mheon
Copy link
Member Author

mheon commented Aug 24, 2020

It's happened 3 times in a row now. I think it's escalated from a flake.

I'm going to restart this one more time. If it still fails, we need to chase this down tomorrow morning.

@mheon
Copy link
Member Author

mheon commented Aug 24, 2020

/hold cancel

@openshift-ci-robot openshift-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Aug 24, 2020
@openshift-merge-robot openshift-merge-robot merged commit 024f470 into containers:v2.0 Aug 24, 2020
@github-actions github-actions bot added the locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. label Sep 24, 2023
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Sep 24, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged. locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants