podman checkpoint failed because of 'external socket is used' #12275

Closed
Zplusless opened this issue Nov 12, 2021 · 6 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments.

Comments

@Zplusless

Is this a BUG REPORT or FEATURE REQUEST? (leave only one on its own line)

/kind bug

Description

Checkpointing a container fails, and dump.log says the failure results from "External socket is used".

Steps to reproduce the issue:

  1. Start the container using
 sudo xhost +local:root && sudo podman run -d --env="DISPLAY" --env="QT_X11_NO_MITSHM=1" --volume="/tmp/.X11-unix:/tmp/.X11-unix:rw" -v /tmp/podman/test:/tmp/podman  docker.io/zzdflyz351/snake-edge "python /tmp/podman/src/Snakepygame.py -n podman -i 10.112.145.90 -p 5600"
  2. Checkpoint the container with
sudo podman container checkpoint --tcp-established 3e02 -e test_ssnake.tar.gz
  3. Note: the image docker.io/zzdflyz351/snake-edge starts a container that runs a small GUI multiplayer snake game.

Describe the results you received:

2021-11-12T09:02:52.000920806Z: CRIU checkpointing failed -52
Please check CRIU logfile /var/lib/containers/storage/overlay-containers/3e02a714edf2824319301bcadb73afd889badc7f4ad2e100c7ae06afd9a85f46/userdata/dump.log

Error: /usr/bin/crun checkpoint --image-path /var/lib/containers/storage/overlay-containers/3e02a714edf2824319301bcadb73afd889badc7f4ad2e100c7ae06afd9a85f46/userdata/checkpoint --work-path /var/lib/containers/storage/overlay-containers/3e02a714edf2824319301bcadb73afd889badc7f4ad2e100c7ae06afd9a85f46/userdata --tcp-established 3e02a714edf2824319301bcadb73afd889badc7f4ad2e100c7ae06afd9a85f46 failed: exit status 1

Describe the results you expected:
The container should be dumped (checkpointed) successfully.

Additional information you deem important (e.g. issue happens only occasionally):

Output of podman version:

Version:      3.0.1
API Version:  3.0.0
Go Version:   go1.15.2
Built:        Thu Jan  1 08:00:00 1970
OS/Arch:      linux/amd64

Output of podman info --debug:


host:
  arch: amd64
  buildahVersion: 1.19.4
  cgroupManager: cgroupfs
  cgroupVersion: v1
  conmon:
    package: 'conmon: /usr/libexec/podman/conmon'
    path: /usr/libexec/podman/conmon
    version: 'conmon version 2.0.30, commit: '
  cpus: 8
  distribution:
    distribution: ubuntu
    version: "18.04"
  eventLogger: journald
  hostname: edge-1
  idMappings:
    gidmap:
    - container_id: 0
      host_id: 1000
      size: 1
    - container_id: 1
      host_id: 100000
      size: 65536
    uidmap:
    - container_id: 0
      host_id: 1000
      size: 1
    - container_id: 1
      host_id: 100000
      size: 65536
  kernel: 5.4.0-90-generic
  linkmode: dynamic
  memFree: 4595224576
  memTotal: 16669401088
  ociRuntime:
    name: crun
    package: 'crun: /usr/bin/crun'
    path: /usr/bin/crun
    version: |-
      crun version 0.18.1-7931a-dirty
      commit: 7931a1eab0590eff4041c1f74e2844b297c31cea
      spec: 1.0.0
      +SYSTEMD +SELINUX +APPARMOR +CAP +SECCOMP +EBPF +CRIU +YAJL
  os: linux
  remoteSocket:
    path: /run/user/1000/podman/podman.sock
  security:
    apparmorEnabled: false
    capabilities: CAP_CHOWN,CAP_DAC_OVERRIDE,CAP_FOWNER,CAP_FSETID,CAP_KILL,CAP_NET_BIND_SERVICE,CAP_SETFCAP,CAP_SETGID,CAP_SETPCAP,CAP_SETUID,CAP_SYS_CHROOT
    rootless: true
    seccompEnabled: true
    selinuxEnabled: false
  slirp4netns:
    executable: /usr/bin/slirp4netns
    package: 'slirp4netns: /usr/bin/slirp4netns'
    version: |-
      slirp4netns version 1.1.8
      commit: unknown
      libslirp: 4.3.1-git
      SLIRP_CONFIG_VERSION_MAX: 3
      libseccomp: 2.3.1
  swapFree: 2138554368
  swapTotal: 2147479552
  uptime: 27h 34m 12.94s (Approximately 1.12 days)
registries:
  search:
  - docker.io
  - quay.io
store:
  configFile: /home/edge/.config/containers/storage.conf
  containerStore:
    number: 0
    paused: 0
    running: 0
    stopped: 0
  graphDriverName: vfs
  graphOptions: {}
  graphRoot: /home/edge/.local/share/containers/storage
  graphStatus: {}
  imageStore:
    number: 8
  runRoot: /run/user/1000/containers
  volumePath: /home/edge/.local/share/containers/storage/volumes
version:
  APIVersion: 3.0.0
  Built: 0
  BuiltTime: Thu Jan  1 08:00:00 1970
  GitCommit: ""
  GoVersion: go1.15.2
  OsArch: linux/amd64
  Version: 3.0.1

Package info (e.g. output of rpm -q podman or apt list podman):

$ apt list podman

Listing... Done
podman/unknown,now 100:3.0.1-2 amd64 [installed]

Have you tested with the latest version of Podman and have you checked the Podman Troubleshooting Guide? (https://github.com/containers/podman/blob/master/troubleshooting.md)

No. Podman was recently installed on Ubuntu 18.04.6 using apt install. In my environment, Ubuntu 18.04 must be used.

Additional environment details (AWS, VirtualBox, physical, etc.):

The last few lines of dump.log:

(00.210030) Dumping file-locks
(00.210037) 
(00.210038) Dumping pstree (pid: 7584)
(00.210040) ----------------------------------------
(00.210044) Process: 1(7584)
(00.210058) ----------------------------------------
(00.210067) Dumping 1(7584)'s namespaces
(00.210210) Dump UTS namespace 12 via 7584
(00.210257) Dump IPC namespace 11 via 7584
(00.210294) Dump NET namespace info 10 via 7584
(00.210363) Dumping netns links
(00.210407) IPC shared memory segments: 0
(00.210413) IPC message queues: 0
(00.210417) IPC semaphore sets: 0
(00.245842) Dumping netns links
(00.247155) Namespaces dump complete
(00.247233) cg: Dumping 1 sets
(00.247241) cg:    `- Dumping  of /machine.slice/libpod-3e02a714edf2824319301bcadb73afd889badc7f4ad2e100c7ae06afd9a85f46.scope
(00.247244) cg:    `- Dumping blkio of /machine.slice/libpod-3e02a714edf2824319301bcadb73afd889badc7f4ad2e100c7ae06afd9a85f46.scope
(00.247247) cg:    `- Dumping cpu,cpuacct of /machine.slice/libpod-3e02a714edf2824319301bcadb73afd889badc7f4ad2e100c7ae06afd9a85f46.scope
(00.247249) cg:    `- Dumping cpuset of /machine.slice/libpod-3e02a714edf2824319301bcadb73afd889badc7f4ad2e100c7ae06afd9a85f46.scope
(00.247252) cg:    `- Dumping devices of /machine.slice/libpod-3e02a714edf2824319301bcadb73afd889badc7f4ad2e100c7ae06afd9a85f46.scope
(00.247255) cg:    `- Dumping freezer of /machine.slice/libpod-3e02a714edf2824319301bcadb73afd889badc7f4ad2e100c7ae06afd9a85f46.scope
(00.247257) cg:    `- Dumping hugetlb of /machine.slice/libpod-3e02a714edf2824319301bcadb73afd889badc7f4ad2e100c7ae06afd9a85f46.scope
(00.247260) cg:    `- Dumping memory of /machine.slice/libpod-3e02a714edf2824319301bcadb73afd889badc7f4ad2e100c7ae06afd9a85f46.scope
(00.247263) cg:    `- Dumping name=systemd of /machine.slice/libpod-3e02a714edf2824319301bcadb73afd889badc7f4ad2e100c7ae06afd9a85f46.scope
(00.247265) cg:    `- Dumping net_cls,net_prio of /machine.slice/libpod-3e02a714edf2824319301bcadb73afd889badc7f4ad2e100c7ae06afd9a85f46.scope
(00.247267) cg:    `- Dumping perf_event of /machine.slice/libpod-3e02a714edf2824319301bcadb73afd889badc7f4ad2e100c7ae06afd9a85f46.scope
(00.247270) cg:    `- Dumping pids of /machine.slice/libpod-3e02a714edf2824319301bcadb73afd889badc7f4ad2e100c7ae06afd9a85f46.scope
(00.247272) cg:    `- Dumping rdma of /machine.slice/libpod-3e02a714edf2824319301bcadb73afd889badc7f4ad2e100c7ae06afd9a85f46.scope
(00.247320) cg: Writing CG image
(00.247383) unix: Dumping external sockets
(00.247387) unix: 	Dumping extern: ino 355510 peer_ino 348004 family    1 type    1 state  1 name /tmp/.X11-unix/X10
(00.247398) unix: 	Dumped extern: id 0xb2 ino 355510 peer 0 type 2 state 10 name 19 bytes
(00.247404) unix: 	Runaway socket: ino 355510 peer_ino 348004 family    1 type    1 state  1 name /tmp/.X11-unix/X10
(00.247407) Error (criu/sk-unix.c:865): unix: External socket is used. Consider using --ext-unix-sk option.
(00.247459) Unlock network
(00.247475) Running network-unlock scripts
iptables-restore: invalid option -- 'w'
ip6tables-restore: invalid option -- 'w'
(00.250371) Unfreezing tasks into 1
(00.250387) 	Unseizing 7584 into 1
(00.250424) Error (criu/cr-dump.c:1781): Dumping FAILED.
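
When reading a long dump.log like the one above, it can help to pull out only the error lines and the socket that CRIU reports as "Runaway". The following is a minimal sketch; the function names `criu_errors` and `runaway_sockets` are hypothetical helpers, not part of CRIU or Podman, and the patterns assume the log line format shown in this report.

```shell
#!/bin/sh
# Hypothetical helpers (not part of CRIU or Podman).

# Print only the error lines of a CRIU log file.
# Error lines look like: (00.247407) Error (criu/sk-unix.c:865): ...
criu_errors() {
    log="$1"
    grep -E '^\([0-9]+\.[0-9]+\) Error \(' "$log"
}

# Print the path of every socket reported as "Runaway socket".
# Those lines end with "name <path>", so keep only the text after "name ".
runaway_sockets() {
    log="$1"
    sed -n 's/.*Runaway socket:.* name \(.*\)$/\1/p' "$log"
}
```

For example, `criu_errors /var/lib/containers/storage/overlay-containers/<ID>/userdata/dump.log` would be called with the log path that Podman prints in its failure message, where `<ID>` stands for the full container ID.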

@openshift-ci openshift-ci bot added the kind/bug Categorizes issue or PR as related to a bug. label Nov 12, 2021
@Zplusless
Author

Zplusless commented Nov 12, 2021

Latest test, on Ubuntu 20.04 with Podman 3.3.1:

dump.log:

(00.512897) Switching to 2009046's net for collecting sockets
(00.512982) unix: 	Collected: ino 65686633 peer_ino 65685700 family    1 type    1 state  1 name null
(00.512989) unix: 	Collected: ino 65686635 peer_ino 65685702 family    1 type    1 state  1 name null
(00.512997) unix: 	Collected: ino 65685702 peer_ino 65686635 family    1 type    1 state  1 name /tmp/.X11-unix/X11
(00.513000) unix: 	Collected: ino 65685700 peer_ino 65686633 family    1 type    1 state  1 name /tmp/.X11-unix/X11
(00.513003) unix: 	Collected: ino 65685714 peer_ino 65685713 family    1 type    1 state  1 name null
(00.513006) unix: 	Collected: ino 65685713 peer_ino 65685714 family    1 type    1 state  1 name null
(00.513009) unix: 	Collected: ino 65686956 peer_ino 0 family    1 type    5 state  7 name null
(00.513921) inet: 	Collected: ino 0x3ea48d3 family AF_INET    type SOCK_STREAM    port    37092 state TCP_ESTABLISHED  src_addr 10.88.0.2
(00.515022) netlink: Collect netlink sock 0x3ea3f6b
(00.515027) netlink: Collect netlink sock 0x3ea4dab
(00.515030) netlink: Collect netlink sock 0x3ea3f79
(00.515033) netlink: Collect netlink sock 0x3ea3f83
(00.515035) netlink: Collect netlink sock 0x3ea3f6c
(00.515037) netlink: Collect netlink sock 0x3ea3f6f
(00.515039) netlink: Collect netlink sock 0x3ea3f82
(00.515042) netlink: Collect netlink sock 0x3ea3f6d
(00.515044) netlink: Collect netlink sock 0x3ea3f6e
(00.515052) Collecting pidns 9/2009046
(00.515091) seccomp: Use SECCOMP_FILTER_FLAG_TSYNC for tid_real 2009046
(00.515094) seccomp: 	 Disable filter on tid_rea 2009052, will be propagated
(00.515166) No parent images directory provided
(00.515191) 2009046 has lsm profile containers-default-0.42.1
(00.515208) 2009052 has lsm profile containers-default-0.42.1
(00.515257) ========================================
(00.515261) Dumping task (pid: 2009046)
(00.515264) ========================================
(00.515266) Obtaining task stat ... 
(00.515297) 
(00.515300) Collecting mappings (pid: 2009046)
(00.515302) ----------------------------------------
(00.515495) Error (criu/files-reg.c:1629): Can't lookup mount=1563 for fd=-3 path=/usr/local/bin/python3.7
(00.515507) Error (criu/cr-dump.c:1262): Collect mappings (pid: 2009046) failed with -1
(00.515540) Unlock network
(00.515543) Running network-unlock scripts
(00.518649) Unfreezing tasks into 1
(00.518663) 	Unseizing 2009046 into 1
(00.518689) Error (criu/cr-dump.c:1781): Dumping FAILED.

@mheon
Member

mheon commented Nov 15, 2021

@adrianreber PTAL

@adrianreber
Collaborator

Those are two different errors. The second should be fixed by a kernel update.

Ubuntu has non-upstreamed kernel patches to support shiftfs and they break CRIU on overlayfs: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1857257 and checkpoint-restore/criu#1316

Updating the kernel should fix this, although we have seen the same error on 21.10.

The first error could be fixed with echo ext-unix-sk > /etc/criu/default.conf (if using crun) or echo ext-unix-sk > /etc/criu/runc.conf (if using runc).
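
A minimal sketch of that configuration step, written so that it does not overwrite options already present in the file (the `echo ... >` form replaces the whole file). The function name `enable_ext_unix_sk` is a hypothetical helper, and the default path is the crun one mentioned above; writing under /etc/criu normally requires root.

```shell
#!/bin/sh
# Hypothetical helper (not part of CRIU or Podman): make sure the CRIU
# option "ext-unix-sk" is present in a CRIU configuration file.
# Usage: enable_ext_unix_sk [config-file]
#   default config-file: /etc/criu/default.conf (use /etc/criu/runc.conf for runc)
enable_ext_unix_sk() {
    conf="${1:-/etc/criu/default.conf}"
    # Create the directory if it does not exist yet.
    mkdir -p "$(dirname "$conf")"
    # Append the option only if no line already matches it exactly.
    grep -qxF 'ext-unix-sk' "$conf" 2>/dev/null || echo 'ext-unix-sk' >> "$conf"
}
```

Calling the function twice leaves a single `ext-unix-sk` line in the file. This only addresses the "External socket is used" error; it does not make an X11 client checkpointable.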

But: Not sure what the container is doing, but seeing an X socket sounds like this will not work anyway.

CRIU cannot checkpoint processes using a graphics card because it is not possible to extract the state from the GPU. Anything using the graphics card will not work. There are patches to support AMD GPGPUs but only as an accelerator and not as a graphics card.

@Zplusless
Author

@adrianreber Thanks for your reply.

But: Not sure what the container is doing, but seeing an X socket sounds like this will not work anyway.

Yes, an X socket is used because I'm trying to migrate games to test the performance of cloud gaming running in a mobile edge computing environment.

Since cloud gaming faces the challenge of service migration, is there any chance that checkpointing GUI applications will be supported in the future?

@adrianreber
Collaborator

Theoretically it is possible. But people have been asking for this for years and it still does not exist. Assuming it is possible to extract the hardware state from the GPU, it should be doable, but I do not know whether all graphics cards can even do this. From what I have heard from AMD (https://linuxplumbersconf.org/event/11/contributions/891/), it works for GPGPU use cases, so it should theoretically be possible for graphics use cases. But that is a lot of development work. The kernel interfaces do not exist yet; once they exist, a CRIU plugin needs to be written to handle this, and then you need full support from the hardware vendor to implement it.

It is doable, but, from my point of view, only with the help of the hardware vendor and even then it will take a lot of time.

This is entirely a CRIU problem and not something Podman can solve.

@Zplusless
Author

@adrianreber
Thanks for your detailed explanation.

I'll close this issue.

@github-actions github-actions bot added the locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. label Sep 21, 2023
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Sep 21, 2023