podman machine network connectivity stalls after some uptime #12495

Closed
fingon opened this issue Dec 3, 2021 · 24 comments
Labels: kind/bug, machine, locked - please file new issue/PR

Comments

@fingon

fingon commented Dec 3, 2021

Is this a BUG REPORT or FEATURE REQUEST? (leave only one on its own line)

/kind bug

Description

After some uptime of podman machine on macOS 12, outbound TCP connectivity stops working. ICMP still works, for some reason.

Steps to reproduce the issue:

  1. Leave podman machine running

  2. A day or two later, outbound TCP suddenly stops working; inbound still works (e.g. podman machine ssh). The state is persistent: sudo reboot inside the podman machine does not fix it. Only podman machine stop ; podman machine start fixes it.

Describe the results you received:

[core@localhost ~]$ uptime
 10:51:15 up 1 day, 19:00,  3 users,  load average: 0.04, 0.07, 0.08
[core@localhost ~]$ ping ftp.funet.fi
PING ftp.funet.fi (193.166.3.2) 56(84) bytes of data.
64 bytes from 193.166.3.2 (193.166.3.2): icmp_seq=1 ttl=64 time=0.875 ms
64 bytes from 193.166.3.2 (193.166.3.2): icmp_seq=2 ttl=64 time=1.18 ms
^C
--- ftp.funet.fi ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1005ms
rtt min/avg/max/mdev = 0.875/1.025/1.175/0.150 ms
[core@localhost ~]$ curl http://ftp.funet.fi -o ,x
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:--  0:00:05 --:--:--     0^C
[core@localhost ~]$ ^C

Describe the results you expected:

Network connectivity to keep working.

Additional information you deem important (e.g. issue happens only occasionally):

Output of podman version:

mstenber@kobuta ~>podman version
Client:
Version:      3.4.2
API Version:  3.4.2
Go Version:   go1.17.2
Built:        Fri Nov 12 18:08:25 2021
OS/Arch:      darwin/arm64

Server:
Version:      3.4.1
API Version:  3.4.1
Go Version:   go1.16.8
Built:        Wed Oct 20 17:32:52 2021
OS/Arch:      linux/arm64

Output of podman info --debug:

host:
  arch: arm64
  buildahVersion: 1.23.1
  cgroupControllers:
  - memory
  - pids
  cgroupManager: systemd
  cgroupVersion: v2
  conmon:
    package: conmon-2.0.30-2.fc35.aarch64
    path: /usr/bin/conmon
    version: 'conmon version 2.0.30, commit: '
  cpus: 8
  distribution:
    distribution: fedora
    variant: coreos
    version: "35"
  eventLogger: journald
  hostname: localhost.localdomain
  idMappings:
    gidmap:
    - container_id: 0
      host_id: 1000
      size: 1
    - container_id: 1
      host_id: 100000
      size: 65536
    uidmap:
    - container_id: 0
      host_id: 1000
      size: 1
    - container_id: 1
      host_id: 100000
      size: 65536
  kernel: 5.14.18-300.fc35.aarch64
  linkmode: dynamic
  logDriver: journald
  memFree: 4399333376
  memTotal: 8299778048
  ociRuntime:
    name: crun
    package: crun-1.3-1.fc35.aarch64
    path: /usr/bin/crun
    version: |-
      crun version 1.3
      commit: 8e5757a4e68590326dafe8a8b1b4a584b10a1370
      spec: 1.0.0
      +SYSTEMD +SELINUX +APPARMOR +CAP +SECCOMP +EBPF +CRIU +YAJL
  os: linux
  remoteSocket:
    exists: true
    path: /run/user/1000/podman/podman.sock
  security:
    apparmorEnabled: false
    capabilities: CAP_CHOWN,CAP_DAC_OVERRIDE,CAP_FOWNER,CAP_FSETID,CAP_KILL,CAP_NET_BIND_SERVICE,CAP_SETFCAP,CAP_SETGID,CAP_SETPCAP,CAP_SETUID,CAP_SYS_CHROOT
    rootless: true
    seccompEnabled: true
    seccompProfilePath: /usr/share/containers/seccomp.json
    selinuxEnabled: true
  serviceIsRemote: true
  slirp4netns:
    executable: /usr/bin/slirp4netns
    package: slirp4netns-1.1.12-2.fc35.aarch64
    version: |-
      slirp4netns version 1.1.12
      commit: 7a104a101aa3278a2152351a082a6df71f57c9a3
      libslirp: 4.6.1
      SLIRP_CONFIG_VERSION_MAX: 3
      libseccomp: 2.5.3
  swapFree: 0
  swapTotal: 0
  uptime: 42h 58m 36.81s (Approximately 1.75 days)
plugins:
  log:
  - k8s-file
  - none
  - journald
  network:
  - bridge
  - macvlan
  volume:
  - local
registries:
  search:
  - docker.io
store:
  configFile: /var/home/core/.config/containers/storage.conf
  containerStore:
    number: 2
    paused: 0
    running: 1
    stopped: 1
  graphDriverName: overlay
  graphOptions: {}
  graphRoot: /var/home/core/.local/share/containers/storage
  graphStatus:
    Backing Filesystem: xfs
    Native Overlay Diff: "true"
    Supports d_type: "true"
    Using metacopy: "false"
  imageStore:
    number: 2
  runRoot: /run/user/1000/containers
  volumePath: /var/home/core/.local/share/containers/storage/volumes
version:
  APIVersion: 3.4.1
  Built: 1634740372
  BuiltTime: Wed Oct 20 14:32:52 2021
  GitCommit: ""
  GoVersion: go1.16.8
  OsArch: linux/arm64
  Version: 3.4.1

Package info (e.g. output of rpm -q podman or apt list podman):

mstenber@kobuta ~>brew info podman
podman: stable 3.4.2 (bottled), HEAD
Tool for managing OCI containers and pods
https://podman.io/
/opt/homebrew/Cellar/podman/3.4.2 (170 files, 40.9MB) *
  Poured from bottle on 2021-11-25 at 01:53:38
From: https://github.com/Homebrew/homebrew-core/blob/HEAD/Formula/podman.rb
License: Apache-2.0
...

Have you tested with the latest version of Podman and have you checked the Podman Troubleshooting Guide? (https://github.com/containers/podman/blob/master/troubleshooting.md)

Yes

Additional environment details (AWS, VirtualBox, physical, etc.):

mstenber@kobuta ~>uname -a
Darwin kobuta.local 21.1.0 Darwin Kernel Version 21.1.0: Wed Oct 13 17:33:01 PDT 2021; root:xnu-8019.41.5~1/RELEASE_ARM64_T6000 arm64

aka macOS 12.0.1

@openshift-ci openshift-ci bot added the kind/bug Categorizes issue or PR as related to a bug. label Dec 3, 2021
@rhatdan
Member

rhatdan commented Dec 3, 2021

Something with the proxy?
@baude PTAL

@Luap99
Member

Luap99 commented Dec 4, 2021

It looks like DNS stops working while the network itself still works.
Can you check /etc/resolv.conf and test with nslookup google.com?

@fingon
Author

fingon commented Dec 4, 2021

As described, DNS (=UDP) and ping work from the machine. Anything with TCP stalls.

@fingon
Author

fingon commented Dec 4, 2021

In the example, DNS resolution of ftp.funet.fi and ping both work. Curl does not.
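
The split described here (name resolution succeeds while TCP stalls) can be checked without curl. A minimal sketch, assuming Python 3 is available inside the VM; the helper names `resolve` and `tcp_connect` and the 5-second timeout are illustrative and not part of podman or gvproxy:

```python
import socket


def resolve(host):
    """Return the first IPv4 address for host, or None if resolution fails."""
    try:
        return socket.getaddrinfo(host, None, socket.AF_INET)[0][4][0]
    except socket.gaierror:
        return None


def tcp_connect(host, port, timeout=5.0):
    """Return True if a TCP handshake with host:port completes within timeout."""
    try:
        # create_connection raises OSError (including timeouts) on failure.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

In the stalled state reported above, `resolve("ftp.funet.fi")` would be expected to return an address while `tcp_connect("ftp.funet.fi", 80)` returns False after the timeout.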

@Luap99 Luap99 added the machine label Dec 4, 2021
@Luap99
Member

Luap99 commented Dec 4, 2021

That definitely sounds like an issue with gvproxy. @guillaumerose PTAL

@guillaumerose guillaumerose self-assigned this Dec 13, 2021
@github-actions

A friendly reminder that this issue had no activity for 30 days.

@rhatdan
Member

rhatdan commented Jan 13, 2022

@guillaumerose did you ever make any progress?

@guillaumerose
Contributor

No, not yet. It's not easy to reproduce. I think I need to leave my Mac running and doing things for several days.
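
Since the stall only shows up after a day or more, a small loop that records when outbound TCP first fails might help pin down the timing. This is only a sketch: `watch` and its parameters are hypothetical names, and `probe` is any callable that returns True while connectivity is fine.

```python
import time
from datetime import datetime, timezone


def watch(probe, interval=60.0, max_checks=None, sleep=time.sleep, log=print):
    """Call probe() every `interval` seconds until it returns False.

    Returns the number of successful checks before the first failure,
    or max_checks if no failure was seen.
    """
    ok = 0
    while max_checks is None or ok < max_checks:
        if not probe():
            stamp = datetime.now(timezone.utc).isoformat()
            log(f"{stamp} outbound TCP failed after {ok} good checks")
            return ok
        ok += 1
        sleep(interval)
    return ok
```

A probe could wrap `socket.create_connection((host, 80), timeout=5)` in a try/except for `OSError`. The `sleep` and `log` parameters exist so the loop can be exercised without actually waiting.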

@github-actions

A friendly reminder that this issue had no activity for 30 days.

@rhatdan
Member

rhatdan commented Feb 14, 2022

@fingon @guillaumerose Did this ever reappear?

@fingon
Author

fingon commented Feb 15, 2022

It worked badly enough that I switched to https://github.com/lima-vm/lima. Interestingly, lima's default slirp-based networking had a similar problem (presumably shared bits in qemu?), but with the vmnet.framework backend ( https://github.com/lima-vm/vde_vmnet ) the problem was gone.

@github-actions

A friendly reminder that this issue had no activity for 30 days.

@Luap99
Member

Luap99 commented Mar 18, 2022

@guillaumerose Any progress?

There are 3 thumbs up on this issue, so it would be great if someone could look at this.
Does strace work on macOS? It would be helpful if someone could attach it when the problem occurs to see what is failing.

@github-actions

A friendly reminder that this issue had no activity for 30 days.

@rhatdan
Member

rhatdan commented Apr 18, 2022

@guillaumerose @Luap99 any update on this?

@Luap99
Member

Luap99 commented Apr 19, 2022

I have no way to debug this.

@rhatdan
Member

rhatdan commented Apr 19, 2022

This would seem to be an issue with qemu or with gvproxy.

@github-actions

A friendly reminder that this issue had no activity for 30 days.

@rhatdan
Member

rhatdan commented May 23, 2022

@fingon Are you still seeing this issue?

@fingon
Author

fingon commented May 23, 2022

Not using podman due to that issue, so no. Sounds like others still hit it though.

@umohnani8
Member

umohnani8 commented Jul 21, 2022

@guillaumerose @baude any update on this?

@Luap99
Member

Luap99 commented Jul 21, 2022

Maybe this was fixed by containers/gvisor-tap-vsock#128?
We should update gvproxy to 0.4.0 and see if the problem goes away.

@rhatdan
Member

rhatdan commented Jul 22, 2022

I will close and reopen if this does not fix the problem.

@binaryfields

I think this issue is fixed by gvproxy 0.5.0. I tested it on macOS 13.2 with podman 4.3.1 and a custom build of gvproxy from git.

When I ran podman with gvproxy 0.4.0, the gvproxy process leaked memory, growing to 2GB+, and some of the ports stopped responding after it was put under high network load.

This is no longer happening with gvproxy 0.5.0.

The relevant change may be the following PR:

containers/gvisor-tap-vsock#152
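
For anyone wanting to check whether their own gvproxy shows the memory growth described in this comment, here is a rough sketch run on the macOS host. It assumes the process is named exactly `gvproxy` and that `ps -o rss=` reports kibibytes; the function names and the 2 GiB threshold (taken from the "2GB+" figure above) are illustrative only.

```python
import subprocess


def parse_rss_kib(ps_output):
    """Parse the output of `ps -o rss= -p PID` (resident set size in KiB)."""
    return int(ps_output.strip())


def rss_kib(pid):
    """Ask ps for the resident set size of pid, in KiB."""
    out = subprocess.run(
        ["ps", "-o", "rss=", "-p", str(pid)],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_rss_kib(out)


def gvproxy_pids():
    """Return PIDs of processes named exactly gvproxy (empty list if none)."""
    result = subprocess.run(
        ["pgrep", "-x", "gvproxy"], capture_output=True, text=True
    )
    return [int(line) for line in result.stdout.split()]


def over_limit(kib, limit_gib=2.0):
    """True if kib exceeds limit_gib gibibytes."""
    return kib > limit_gib * 1024 * 1024
```

Something like `[pid for pid in gvproxy_pids() if over_limit(rss_kib(pid))]` would list gvproxy processes above the threshold.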

@github-actions github-actions bot added the locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. label Sep 2, 2023
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Sep 2, 2023
6 participants