
Podman build hangs when running apt-get install #8240 #8677

Closed
joharohl opened this issue Dec 10, 2020 · 12 comments
Labels
kind/bug · locked - please file new issue/PR · stale-issue

Comments

@joharohl

Is this a BUG REPORT or FEATURE REQUEST? (leave only one on its own line)

/kind bug

Description
I realize this is a pretty special case, but I thought it useful to report nonetheless.

Building a fairly standard Dockerfile that uses an Ubuntu base image and runs apt-get hangs. It works with buildah bud, so I am reporting it here.

While writing this report I tried with a newer kernel (5.8.x) and then it magically works. It would still be useful to figure out what the problem is with the older kernel, as that is what Qubes ships with.

I have been looking around to see if I can find anything suspicious. The closest I have got is that an strace of the apt process shows it hanging while trying to write some of its stdout to a file descriptor. Could this somehow be a problem with how Podman sets up stdout/stderr forwarding? It is also a bit odd that it works with a newer kernel; what changed?
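As an aside, the kind of blocked write described above is easy to reproduce outside Podman. This is a minimal, self-contained sketch (not from the report; all paths are temporary) showing a writer blocking on a FIFO until a reader drains it, which is the same failure mode as a build process whose stdout pipe is never read:

```shell
# A writer blocks on a FIFO until a reader shows up and drains it.
fifo=$(mktemp -u)
mkfifo "$fifo"
# Background writer: opening the FIFO for writing blocks while no reader exists.
( head -c 131072 /dev/zero > "$fifo" ) &
writer=$!
sleep 1
kill -0 "$writer" 2>/dev/null && echo "writer is blocked"
# Attach a reader; the writer can now open the FIFO, write, and exit.
cat "$fifo" > /dev/null
wait "$writer"
rm -f "$fifo"
echo "writer finished"
```

If conmon (or whatever is forwarding the build's output) stops reading its end of the pipe, the container process ends up stuck exactly like the writer above.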

Steps to reproduce the issue:

  1. Use Qubes R4 with kernel 4.19.155
  2. Try to build this Dockerfile using podman build .
FROM ubuntu:bionic

RUN echo "foo"

RUN apt-get update

RUN apt-get update && apt-get install -y \
      acl \
      aptitude \
      bash \
      ca-certificates \
      cron \
      iproute2 \
      python \
      python-apt \
      sudo \
      systemd  \
 && apt-get clean

Describe the results you received:
The echo outputs to stdout; the second RUN with apt-get update finishes but does not echo to stdout; the build hangs forever on the third RUN statement.

Describe the results you expected:
The build should finish (and echo to stdout).

Additional information you deem important (e.g. issue happens only occasionally):

Output of podman version:

Version:      2.2.1
API Version:  2.1.0
Go Version:   go1.14.10
Built:        Tue Dec  8 15:37:43 2020
OS/Arch:      linux/amd64

Output of podman info --debug:

host:
  arch: amd64
  buildahVersion: 1.18.0
  cgroupManager: systemd
  cgroupVersion: v2
  conmon:
    package: conmon-2.0.21-2.fc32.x86_64
    path: /usr/bin/conmon
    version: 'conmon version 2.0.21, commit: 81d18b6c3ffc266abdef7ca94c1450e669a6a388'
  cpus: 4
  distribution:
    distribution: fedora
    version: "32"
  eventLogger: journald
  hostname: dev
  idMappings:
    gidmap:
    - container_id: 0
      host_id: 1000
      size: 1
    - container_id: 1
      host_id: 100000
      size: 65536
    uidmap:
    - container_id: 0
      host_id: 1000
      size: 1
    - container_id: 1
      host_id: 100000
      size: 65536
  kernel: 5.8.16-1.qubes.x86_64
  linkmode: dynamic
  memFree: 6549520384
  memTotal: 8342241280
  ociRuntime:
    name: crun
    package: crun-0.15.1-1.fc32.x86_64
    path: /usr/bin/crun
    version: |-
      crun version 0.15.1
      commit: eb0145e5ad4d8207e84a327248af76663d4e50dd
      spec: 1.0.0
      +SYSTEMD +SELINUX +APPARMOR +CAP +SECCOMP +EBPF +YAJL
  os: linux
  remoteSocket:
    path: /run/user/1000/podman/podman.sock
  rootless: true
  slirp4netns:
    executable: /usr/bin/slirp4netns
    package: slirp4netns-1.1.4-1.fc32.x86_64
    version: |-
      slirp4netns version 1.1.4
      commit: b66ffa8e262507e37fca689822d23430f3357fe8
      libslirp: 4.3.1
      SLIRP_CONFIG_VERSION_MAX: 2
  swapFree: 1073737728
  swapTotal: 1073737728
  uptime: 7m 30.44s
registries:
  search:
  - registry.fedoraproject.org
  - registry.access.redhat.com
  - registry.centos.org
  - docker.io
store:
  configFile: /home/user/.config/containers/storage.conf
  containerStore:
    number: 0
    paused: 0
    running: 0
    stopped: 0
  graphDriverName: overlay
  graphOptions:
    overlay.mount_program:
      Executable: /usr/bin/fuse-overlayfs
      Package: fuse-overlayfs-1.2.0-1.fc32.x86_64
      Version: |-
        fusermount3 version: 3.9.1
        fuse-overlayfs: version 1.1.0
        FUSE library version 3.9.1
        using FUSE kernel interface version 7.31
  graphRoot: /home/user/.local/share/containers/storage
  graphStatus:
    Backing Filesystem: extfs
    Native Overlay Diff: "false"
    Supports d_type: "true"
    Using metacopy: "false"
  imageStore:
    number: 21
  runRoot: /run/user/1000/containers
  volumePath: /home/user/.local/share/containers/storage/volumes
version:
  APIVersion: 2.1.0
  Built: 1607438263
  BuiltTime: Tue Dec  8 15:37:43 2020
  GitCommit: ""
  GoVersion: go1.14.10
  OsArch: linux/amd64
  Version: 2.2.1

Package info (e.g. output of rpm -q podman or apt list podman):

podman-2.2.1-1.fc32.x86_64

Have you tested with the latest version of Podman and have you checked the Podman Troubleshooting Guide?

Yes

Additional environment details (AWS, VirtualBox, physical, etc.):
This is running in a Fedora 32 VM in QubesOS. It is not unlikely that this has something to do with it.

@openshift-ci-robot openshift-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Dec 10, 2020
@giuseppe
Member

could be related to fuse-overlayfs.

Could you strace the fuse-overlayfs process when the container is hanging?

@joharohl
Author

Sorry for being slow. Will check soon and come back with what I can find.

@joharohl
Author

joharohl commented Dec 16, 2020

@giuseppe I have checked again and managed to narrow it down a bit. As far as I can tell, overlayfs is working as it should, although my strace skills are not great, so I am not 100% sure.

I'm now testing with the following two Dockerfiles:

FROM ubuntu:bionic

RUN bash -c 'for i in $(seq 10000); do echo $i; done'

and:

FROM ubuntu:bionic

RUN bash -c 'for i in $(seq 10000); do echo $i >> /tmp/out; done'
RUN cat /tmp/out

The first one never finishes; sometimes there is output to stdout, sometimes not. The second one also never finishes, but the for-loop step always completes. It is the cat that hangs.

If you have any strace tips or other things I should look at, I'd be happy to help. But for my own use case I think I'll write this off as a "kernel" bug and just use a newer kernel.
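For comparison (a local sanity check, not from the thread): the same loop completes immediately when run outside a build with a normal pipe as stdout, which points at the build's output plumbing rather than the loop itself:

```shell
# Run the loop from the first Dockerfile directly; it finishes instantly
# and produces exactly 10000 lines of output.
count=$(bash -c 'for i in $(seq 10000); do echo $i; done' | wc -l)
echo "lines: $count"
```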

@giuseppe
Member

While it hangs, could you check whether there is any cat process active? I'd strace that, if it exists, and see where it is hanging.
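A sketch of that check (the sleep below stands in for the hanging process so the snippet is self-contained; in the real case you would discover the PID with pgrep):

```shell
# Stand-in for the hanging process; in the real case use:
#   pid=$(pgrep -xn cat)
sleep 30 &
pid=$!
if [ -n "$pid" ] && kill -0 "$pid" 2>/dev/null; then
  # Build the strace invocation; print it here rather than running it.
  cmd="strace -f -tt -p $pid"
  echo "run: $cmd"
fi
kill "$pid" 2>/dev/null
wait "$pid" 2>/dev/null || true
```

With the real PID attached, a write() call that never returns would confirm the process is stuck on an undrained output pipe.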

@xrow

xrow commented Dec 31, 2020

I have a similar case. It is reproducible when the Podman REST API is used. Please tell me what I should run to provide more info. Might the buildah process that is running at 100% on one core be the cause?

[root@server005 ~]# date && podman system info && date
Thu Dec 31 05:21:17 CET 2020
host:
  arch: amd64
  buildahVersion: 1.18.0
  cgroupManager: systemd
  cgroupVersion: v1
  conmon:
    package: conmon-2.0.21-1.el8.x86_64
    path: /usr/bin/conmon
    version: 'conmon version 2.0.21, commit: 2994b6043045317d0c8d65974ce57c612c4e8809-dirty'
  cpus: 12
  distribution:
    distribution: '"centos"'
    version: "8"
  eventLogger: journald
  hostname: server005.dc02.xrow.net
  idMappings:
    gidmap: null
    uidmap: null
  kernel: 4.18.0-257.el8.x86_64
  linkmode: dynamic
  memFree: 85004591104
  memTotal: 101076951040
  ociRuntime:
    name: runc
    package: runc-1.0.0-145.rc91.git24a3cf8.el8.x86_64
    path: /usr/bin/runc
    version: 'runc version spec: 1.0.2-dev'
  os: linux
  remoteSocket:
    exists: true
    path: /run/podman/podman.sock
  rootless: false
  slirp4netns:
    executable: ""
    package: ""
    version: ""
  swapFree: 21474832384
  swapTotal: 21474832384
  uptime: 1h 48m 12.99s (Approximately 0.04 days)
registries:
  search:
  - registry.fedoraproject.org
  - registry.access.redhat.com
  - registry.centos.org
  - docker.io
store:
  configFile: /etc/containers/storage.conf
  containerStore:
    number: 21
    paused: 0
    running: 1
    stopped: 20
  graphDriverName: overlay
  graphOptions:
    overlay.ignore_chown_errors: "true"
    overlay.mount_program:
      Executable: /usr/bin/fuse-overlayfs
      Package: fuse-overlayfs-1.3.0-1.el8.x86_64
      Version: |-
        fusermount3 version: 3.2.1
        fuse-overlayfs: version 1.3
        FUSE library version 3.2.1
        using FUSE kernel interface version 7.26
  graphRoot: /var/lib/containers/storage
  graphStatus:
    Backing Filesystem: xfs
    Native Overlay Diff: "false"
    Supports d_type: "true"
    Using metacopy: "false"
  imageStore:
    number: 20
  runRoot: /var/run/containers/storage
  volumePath: /var/lib/containers/storage/volumes
version:
  APIVersion: 2.1.0
  Built: 1607538258
  BuiltTime: Wed Dec  9 19:24:18 2020
  GitCommit: ""
  GoVersion: go1.15.2
  OsArch: linux/amd64
  Version: 2.2.1

Thu Dec 31 05:21:51 CET 2020
top - 05:21:47 up  1:48,  2 users,  load average: 3.43, 2.06, 1.03
Tasks: 342 total,   4 running, 338 sleeping,   0 stopped,   0 zombie
%Cpu(s): 14.9 us,  3.9 sy,  0.0 ni, 67.8 id, 12.9 wa,  0.4 hi,  0.1 si,  0.0 st
MiB Mem :  96394.5 total,  78881.0 free,   3547.0 used,  13966.6 buff/cache
MiB Swap:  20480.0 total,  20480.0 free,      0.0 used.  92046.0 avail Mem 

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND                        
  57850 root      20   0 1893872  78444  21784 R 134.9   0.1   0:27.50 buildah                        
  65418 root      20   0 1374540  40836  18272 R  37.5   0.0   0:01.18 exe                            
  63823 root      20   0   22408  10772   1428 S  14.6   0.0   0:04.11 fuse-overlayfs                 
  64583 root      20   0  263188  49160   6852 S  11.6   0.0   0:02.07 ansible-playboo                
  64412 root      20   0  256168  51072  11420 S   7.0   0.1   0:02.99 ansible-playboo                
   1129 root      20   0  145488  53584  27252 S   3.0   0.1   1:35.56 gitlab-runner                  
  65574 root      20   0   68124  15812   7544 R   2.0   0.0   0:00.06 platform-python                
  36990 root      20   0       0      0      0 D   1.0   0.0   0:01.31 kworker/u50:1+flush-253:0      
     11 root      20   0       0      0      0 I   0.3   0.0   0:01.01 rcu_sched                      
   1858 qemu      20   0   14.3g 945532  20272 S   0.3   1.0   0:54.92 qemu-kvm                       
  41449 root      20   0       0      0      0 I   0.3   0.0   0:00.24 kworker/8:1-xfs-conv/dm-0      
  59895 root      20   0   40728  27080   1424 S   0.3   0.0   0:16.33 fuse-overlayfs                 
  65285 root      20   0   65640   4900   3992 R   0.3   0.0   0:00.02 top                            
      1 root      20   0  243288  11812   8480 S   0.0   0.0   0:04.57 systemd                        
      2 root      20   0       0      0      0 S   0.0   0.0   0:00.01 kthreadd                       
      3 root       0 -20       0      0      0 I   0.0   0.0   0:00.00 rcu_gp                         
      4 root       0 -20       0      0      0 I   0.0   0.0   0:00.00 rcu_par_gp                     
      6 root       0 -20       0      0      0 I   0.0   0.0   0:00.00 kworker/0:0H-kblockd           
      9 root       0 -20       0      0      0 I   0.0   0.0   0:00.00 mm_percpu_wq                   
     10 root      20   0       0      0      0 S   0.0   0.0   0:00.02 ksoftirqd/0                    
     12 root      rt   0       0      0      0 S   0.0   0.0   0:00.00 migration/0                    
     13 root      rt   0       0      0      0 S   0.0   0.0   0:00.00 watchdog/0                     
     14 root      20   0       0      0      0 S   0.0   0.0   0:00.00 cpuhp/0                        
     15 root      20   0       0      0      0 S   0.0   0.0   0:00.00 cpuhp/1                        
     16 root      rt   0       0      0      0 S   0.0   0.0   0:00.00 watchdog/1                     
     17 root      rt   0       0      0      0 S   0.0   0.0   0:00.00 migration/1                    
     18 root      20   0       0      0      0 S   0.0   0.0   0:00.02 ksoftirqd/1                    
     20 root       0 -20       0      0      0 I   0.0   0.0   0:00.00 kworker/1:0H-kblockd           
     21 root      20   0       0      0      0 S   0.0   0.0   0:00.00 cpuhp/2                        
     22 root      rt   0       0      0      0 S   0.0   0.0   0:00.00 watchdog/2 

@xrow

xrow commented Dec 31, 2020

@giuseppe
I can add more info. It happens when the image layers are written to disk. My main problem, though, is that while this is happening I also get timeouts on the API, and my GitLab runner processes crash.

Crash report from the GitLab runner:

Checking out 15942c68 as 3.0...
Skipping Git submodules setup
Executing "step_script" stage of the job script
00:01
Cleaning up file based variables
00:00
ERROR: Job failed (system failure): unable to upgrade to tcp, received 500 (exec.go:50:0s)

iotop output:

Total DISK READ :      80.84 K/s | Total DISK WRITE :     313.55 M/s
Actual DISK READ:      11.55 K/s | Actual DISK WRITE:      32.49 M/s
    TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN     IO>    COMMAND                               
 256475 be/4 root        0.00 B/s   48.02 M/s  0.00 % 99.99 % buildah build-using-~ckerfile docker/helm
 254513 be/4 root        0.00 B/s    0.00 B/s  0.00 % 99.99 % buildah build-using-~ckerfile docker/helm
   2899 be/4 root        0.00 B/s    0.00 B/s  0.00 % 99.99 % [kworker/u49:0+flush-253:0]
 254898 be/4 root       11.55 K/s  415.74 K/s  0.00 % 54.72 % fuse-overlayfs -o me~c70f92dd0e6a0/merged
 255458 be/4 root        0.00 B/s  273.31 K/s  0.00 % 17.13 % fuse-overlayfs -o me~6374bc37e68cc/merged
  91061 be/4 root        0.00 B/s    0.00 B/s  0.00 % 16.60 % [kworker/10:0-kdmflush]
 223725 be/4 root        0.00 B/s    0.00 B/s  0.00 %  5.46 % [kworker/u50:0-xfs-cil/dm-0]
 223159 be/4 root        0.00 B/s    0.00 B/s  0.00 %  1.97 % [kworker/u49:2+xfs-cil/dm-0]
 232101 be/4 root        0.00 B/s    0.00 B/s  0.00 %  1.89 % [kworker/3:0-xfs-conv/dm-0]
 249143 be/4 root        0.00 B/s    0.00 B/s  0.00 %  1.79 % [kworker/0:8-kdmflush]
 169788 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.93 % [kworker/2:0-kdmflush]
 255621 be/4 root       69.29 K/s  415.74 K/s  0.00 %  0.44 % python /usr/bin/yum ~s yum-utils yamllint
 105227 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.37 % [kworker/1:4-kdmflush]
  73296 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.06 % [kworker/4:1-xfs-conv/dm-0]
 244257 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.03 % [kworker/10:7-kdmflush]

Is there any temporary workaround I can use, or can I limit the disk usage so that the API does not report 500? I have seen write I/O of up to 600 MB/s on a single SSD.
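One mitigation idea to experiment with (entirely an assumption on my part, not verified against this issue, and it needs cgroup v2 with the io controller): cap the build's write bandwidth with a systemd transient scope so heavy layer writes do not starve the API. The snippet only constructs and prints the command, since running it needs root and systemd; the device path and limit are placeholders:

```shell
# Hypothetical throttling command (cgroup v2; /dev/sda and 100M are
# placeholders). Printed, not executed, in this sketch.
cmd='systemd-run --scope -p IOWriteBandwidthMax="/dev/sda 100M" podman build .'
echo "$cmd"
```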

@github-actions

A friendly reminder that this issue had no activity for 30 days.

@rhatdan
Member

rhatdan commented Feb 1, 2021

@giuseppe any update on this issue?

@giuseppe
Member

giuseppe commented Feb 2, 2021

it could be related to: containers/conmon#237

@github-actions

github-actions bot commented Mar 5, 2021

A friendly reminder that this issue had no activity for 30 days.

@rhatdan
Member

rhatdan commented Mar 5, 2021

Ok, I am going to assume this is fixed in the current release. Reopen if I am mistaken.

@rhatdan rhatdan closed this as completed Mar 5, 2021
@xrow

xrow commented Mar 5, 2021

Hi,

Since 3.0.x it has improved for me, though I feel not 100%; that might be down to other use cases. If I find some, I will open new tickets.

@github-actions github-actions bot added the locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. label Sep 22, 2023
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Sep 22, 2023