Podman fail to autostart containers through quadlet/systemd, works when launched manually, error with pasta #22197

Closed
Froggy232 opened this issue Mar 28, 2024 · 58 comments · Fixed by #24305
Labels: jira, kind/bug (categorizes issue or PR as related to a bug), network (networking related issue or feature), pasta (pasta(1) bugs or features)

Comments

@Froggy232

Issue Description

Hi,
Since the upgrade to Fedora Silverblue 40 / Podman 5, systemd fails to launch containers at boot.
If I try to launch them manually through systemctl --user start container.service, it works as expected.
Thank you!

Steps to reproduce the issue

  1. Manage containers through Quadlet files in ~/.config/containers/systemd
  2. Restart the server and see that the containers failed to launch

Describe the results you received

Containers don't launch at boot; they need to be started manually.

Describe the results you expected

Containers should start at boot.

podman info output

host:
  arch: amd64
  buildahVersion: 1.35.1
  cgroupControllers:
  - cpu
  - io
  - memory
  - pids
  cgroupManager: systemd
  cgroupVersion: v2
  conmon:
    package: conmon-2.1.8-4.fc40.x86_64
    path: /usr/bin/conmon
    version: 'conmon version 2.1.8, commit: '
  cpuUtilization:
    idlePercent: 99.37
    systemPercent: 0.21
    userPercent: 0.42
  cpus: 32
  databaseBackend: sqlite
  distribution:
    distribution: fedora
    variant: silverblue
    version: "40"
  eventLogger: journald
  freeLocks: 2047
  hostname: homeserver
  idMappings:
    gidmap:
    - container_id: 0
      host_id: 1020
      size: 1
    - container_id: 1
      host_id: 1703936
      size: 65536
    uidmap:
    - container_id: 0
      host_id: 1020
      size: 1
    - container_id: 1
      host_id: 1703936
      size: 65536
  kernel: 6.8.1-300.fc40.x86_64
  linkmode: dynamic
  logDriver: journald
  memFree: 64334761984
  memTotal: 67334115328
  networkBackend: netavark
  networkBackendInfo:
    backend: netavark
    dns:
      package: aardvark-dns-1.10.0-1.fc40.x86_64
      path: /usr/libexec/podman/aardvark-dns
      version: aardvark-dns 1.10.0
    package: netavark-1.10.3-3.fc40.x86_64
    path: /usr/libexec/podman/netavark
    version: netavark 1.10.3
  ociRuntime:
    name: crun
    package: crun-1.14.4-1.fc40.x86_64
    path: /usr/bin/crun
    version: |-
      crun version 1.14.4
      commit: a220ca661ce078f2c37b38c92e66cf66c012d9c1
      rundir: /run/user/1020/crun
      spec: 1.0.0
      +SYSTEMD +SELINUX +APPARMOR +CAP +SECCOMP +EBPF +CRIU +LIBKRUN +WASM:wasmedge +YAJL
  os: linux
  pasta:
    executable: /usr/bin/pasta
    package: passt-0^20240320.g71dd405-1.fc40.x86_64
    version: |
      pasta 0^20240320.g71dd405-1.fc40.x86_64
      Copyright Red Hat
      GNU General Public License, version 2 or later
        <https://www.gnu.org/licenses/old-licenses/gpl-2.0.html>
      This is free software: you are free to change and redistribute it.
      There is NO WARRANTY, to the extent permitted by law.
  remoteSocket:
    exists: false
    path: /run/user/1020/podman/podman.sock
  security:
    apparmorEnabled: false
    capabilities: CAP_CHOWN,CAP_DAC_OVERRIDE,CAP_FOWNER,CAP_FSETID,CAP_KILL,CAP_NET_BIND_SERVICE,CAP_SETFCAP,CAP_SETGID,CAP_SETPCAP,CAP_SETUID,CAP_SYS_CHROOT
    rootless: true
    seccompEnabled: true
    seccompProfilePath: /usr/share/containers/seccomp.json
    selinuxEnabled: true
  serviceIsRemote: false
  slirp4netns:
    executable: ""
    package: ""
    version: ""
  swapFree: 146028879872
  swapTotal: 146028879872
  uptime: 0h 14m 2.00s
  variant: ""
plugins:
  authorization: null
  log:
  - k8s-file
  - none
  - passthrough
  - journald
  network:
  - bridge
  - macvlan
  - ipvlan
  volume:
  - local
registries:
  search:
  - registry.fedoraproject.org
  - registry.access.redhat.com
  - docker.io
  - quay.io
store:
  configFile: /var/srv/media-server/.config/containers/storage.conf
  containerStore:
    number: 0
    paused: 0
    running: 0
    stopped: 0
  graphDriverName: overlay
  graphOptions: {}
  graphRoot: /srv/media-server/.local/share/containers/storage
  graphRootAllocated: 3999065440256
  graphRootUsed: 1034920087552
  graphStatus:
    Backing Filesystem: btrfs
    Native Overlay Diff: "true"
    Supports d_type: "true"
    Supports shifting: "false"
    Supports volatile: "true"
    Using metacopy: "false"
  imageCopyTmpDir: /var/tmp
  imageStore:
    number: 14
  runRoot: /run/user/1020/containers
  transientStore: false
  volumePath: /var/srv/media-server/.local/share/containers/storage/volumes
version:
  APIVersion: 5.0.0
  Built: 1710806400
  BuiltTime: Tue Mar 19 01:00:00 2024
  GitCommit: ""
  GoVersion: go1.22.0
  Os: linux
  OsArch: linux/amd64
  Version: 5.0.0

Podman in a container

No

Privileged Or Rootless

Rootless

Upstream Latest Release

No

Additional environment details

Fedora Silverblue 40 up-to-date

Additional information

Logs of a container :

mars 28 12:15:09 homeserver jellyfin[7039]: Error: pasta failed with exit code 1:
mars 28 12:15:09 homeserver jellyfin[7039]: External interface not usable

@Froggy232 Froggy232 added the kind/bug Categorizes issue or PR as related to a bug. label Mar 28, 2024
@Luap99 Luap99 added network Networking related issue or feature pasta pasta(1) bugs or features labels Mar 28, 2024
@Luap99
Member

Luap99 commented Mar 28, 2024

You have to make sure your network is fully set up before the unit is started.

@rhatdan
Member

rhatdan commented Mar 29, 2024

This feels like it could be related to the same question in #22057

@flyingfishflash

flyingfishflash commented Mar 29, 2024

I have not been able to get a rootless user quadlet to wait for my network to be ready even adding

[Unit]
wants=nss-online.target
after=nss-online.target

No issues on 4.9.3

@Luap99
Member

Luap99 commented Mar 29, 2024

@flyingfishflash You cannot wait for system units from user units, see systemd/systemd#3312

I wasn't aware that the user units start before the network is fully set up and that it causes such big trouble with pasta. Note you do not need to downgrade, you can just change the default back to slirp4netns in containers.conf, see the last part in the pasta section on https://blog.podman.io/2024/03/podman-5-0-breaking-changes-in-detail/

You could also do something like this #22190 (comment)

Of course none of this is a proper solution but I am sure we will find something to address this in a better way soon.
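
A minimal sketch of the containers.conf change mentioned above, written as a drop-in file so an existing containers.conf is left alone. The `default_rootless_network_cmd` key under `[network]` is the setting Podman 5 uses for the rootless default; the function name `write_slirp_dropin`, the file name `10-slirp4netns.conf`, and the choice of a `containers.conf.d` drop-in directory are assumptions made for illustration.

```shell
#!/bin/sh
# Hypothetical helper: write a containers.conf drop-in that selects
# slirp4netns as the default rootless network mode.
# $1 = target directory (assumed: ~/.config/containers/containers.conf.d)
write_slirp_dropin() {
    dir="$1"
    mkdir -p "$dir" || return 1
    cat > "$dir/10-slirp4netns.conf" <<'EOF'
[network]
default_rootless_network_cmd = "slirp4netns"
EOF
}

# Example call (not executed here):
# write_slirp_dropin "${XDG_CONFIG_HOME:-$HOME/.config}/containers/containers.conf.d"
```

Note that in the `podman info` output from the original report the slirp4netns executable is empty, so the slirp4netns package would likely need to be installed before this setting can take effect.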

@flyingfishflash

flyingfishflash commented Mar 29, 2024

@Luap99 - thank you for this tip re containers.conf!

@gdonval

gdonval commented Apr 12, 2024

You could also do something like this #22190 (comment)

No. It's as much of a bad practice today as it was 50 years ago.

@Klowner

Klowner commented Apr 25, 2024

I ran into this issue today and finally learned that systemd user level units apparently can't depend on system level units (such as network-online.target)

I've managed a workaround that satisfies my desire to avoid arbitrary timeouts by creating a user-level network-online.service and network-online.target

# ~/.config/systemd/user/network-online.service
[Unit]
Description=User-level proxy to system-level network-online.target

[Service]
# systemd directive names are case-sensitive, so this must be "Type", not "type"
Type=oneshot
ExecStart=/bin/bash -c 'until systemctl --machine=%[email protected] is-active network-online.target; do sleep 1; done'

[Install]
WantedBy=default.target

# ~/.config/systemd/user/network-online.target
[Unit]
Description=User-level network-online.target
Requires=network-online.service
Wants=network-online.service
After=network-online.service

Then in your quadlet units:

[Unit]
After=network-online.target

@soiamsoNG

It seems to just work once you can ping an external IP (including the gateway IP).

@djarbz

djarbz commented May 6, 2024

I'll share my workaround, but it might be a good idea to have a podman network --health command to verify by driver and network and such.

[Unit]
Description=Wait for network to be online via NetworkManager or Systemd-Networkd

[Service]
# `nm-online -s` waits until the point when NetworkManager logs
# "startup complete". That is when startup actions are settled and
# devices and profiles reached a conclusive activated or deactivated
# state. It depends on which profiles are configured to autoconnect and
# also depends on profile settings like ipv4.may-fail/ipv6.may-fail,
# which affect when a profile is considered fully activated.
# Check NetworkManager logs to find out why wait-online takes a certain
# time.

Type=oneshot
# At least one of these should work depending if using NetworkManager or Systemd-Networkd
ExecStart=/bin/bash -c ' \
    if command -v nm-online &>/dev/null; then \
        nm-online -s -q; \
    elif command -v /usr/lib/systemd/systemd-networkd-wait-online &>/dev/null; then \
        /usr/lib/systemd/systemd-networkd-wait-online; \
    else \
        echo "Error: Neither nm-online nor systemd-networkd-wait-online found."; \
        exit 1; \
    fi'
ExecStartPost=ip -br addr
RemainAfterExit=yes

# Set $NM_ONLINE_TIMEOUT variable for timeout in seconds.
# Edit with `systemctl edit <THIS SERVICE NAME>`.
#
# Note, this timeout should commonly not be reached. If your boot
# gets delayed too long, then the solution is usually not to decrease
# the timeout, but to fix your setup so that the connected state
# gets reached earlier.
Environment=NM_ONLINE_TIMEOUT=60

[Install]
WantedBy=default.target

@secext2022

Another workaround:

We can copy network-online.target from the system units to the user units, with a small modification, like this:

$ cat /etc/systemd/user/network-online.target
[Unit]
Description=Network online for systemd --user
Documentation=man:systemd.special(7)
Documentation=https://systemd.io/NETWORK_ONLINE
#After=network.target

$ cat /etc/systemd/user/systemd-networkd-wait-online.service
[Unit]
Description=Wait network online for systemd --user
Documentation=man:systemd-networkd-wait-online.service(8)
Before=network-online.target

[Service]
Type=oneshot
ExecStart=/usr/lib/systemd/systemd-networkd-wait-online
RemainAfterExit=yes

[Install]
WantedBy=network-online.target

Or you can put these files in ~/.config/systemd/user to apply them to a single user only.

Then enable the service as a user:

$ systemctl --user enable systemd-networkd-wait-online.service

Finally, we can make Podman wait for the network to be online, like this:

$ cat ~/.config/containers/systemd/my-app.container
[Unit]
Wants=network-online.target
After=network-online.target

reference link: https://unix.stackexchange.com/questions/216919/how-can-i-make-my-user-services-wait-till-the-network-is-online

@WildPenquin

Hi,

Any idea for a workaround when using NetworkManager?

I tried to adapt @secext2022 's workaround, but the user service still "thinks" the Network is online approx. 7 seconds too early. I tried to change the parameter for nm-online by removing the -s, but the behavior is still the same.

dog /etc/systemd/user/network-online.target:

#  SPDX-License-Identifier: LGPL-2.1-or-later
#
#  This file is part of systemd.
#
#  systemd is free software; you can redistribute it and/or modify it
#  under the terms of the GNU Lesser General Public License as published by
#  the Free Software Foundation; either version 2.1 of the License, or
#  (at your option) any later version.

[Unit]
Description=Network is Online
Documentation=man:systemd.special(7)
Documentation=https://systemd.io/NETWORK_ONLINE
# After=network.target

/etc/systemd/user/NetworkManager-wait-online.service:

[Unit]
Description=Network Manager Wait Online for Users
Documentation=man:NetworkManager-wait-online.service(8)
Requires=NetworkManager.service
After=NetworkManager.service
Before=network-online.target

[Service]
# `nm-online -s` waits until the point when NetworkManager logs
# "startup complete". That is when startup actions are settled and
# devices and profiles reached a conclusive activated or deactivated
# state. It depends on which profiles are configured to autoconnect and
# also depends on profile settings like ipv4.may-fail/ipv6.may-fail,
# which affect when a profile is considered fully activated.
# Check NetworkManager logs to find out why wait-online takes a certain
# time.

Type=oneshot
ExecStart=/usr/bin/nm-online -q
RemainAfterExit=yes

# Set $NM_ONLINE_TIMEOUT variable for timeout in seconds.
# Edit with `systemctl edit NetworkManager-wait-online`.
#
# Note, this timeout should commonly not be reached. If your boot
# gets delayed too long, then the solution is usually not to decrease
# the timeout, but to fix your setup so that the connected state
# gets reached earlier.
Environment=NM_ONLINE_TIMEOUT=60

[Install]
WantedBy=network-online.target

journalctl -b0 | grep Online:

Jul 17 12:43:09 archnuke systemd[1]: Starting Network Manager Wait Online...
Jul 17 12:43:09 archnuke systemd[706]: Reached target Network is Online.
Jul 17 12:43:16 archnuke systemd[1]: Finished Network Manager Wait Online.
Jul 17 12:43:16 archnuke systemd[1]: Reached target Network is Online.

The above is the system log, 12:43:09 is the user service. As the user running the podman container, LANG=C journalctl --user -b0 | grep Online:

Jul 17 12:43:09 archnuke systemd[706]: Reached target Network is Online.

Not sure why the NetworkManager-wait-online is not in the user log, it is enabled for the user:

systemctl --user status NetworkManager-wait-online.service 
○ NetworkManager-wait-online.service - Network Manager Wait Online for Users
     Loaded: loaded (/etc/xdg/systemd/user/NetworkManager-wait-online.service; enabled; preset: enabled)
     Active: inactive (dead)
       Docs: man:NetworkManager-wait-online.service(8)

As another workaround, I'm thinking of adding, for now, another dirty workaround to the Quadlet:
ExecStartPre=/bin/sh -c 'until ping -c1 google.com; do sleep 1; done;'

@djarbz

djarbz commented Jul 17, 2024

I haven't used /etc/systemd/user, but my unit works (at least I haven't noticed an issue) when placed in ~/.config/systemd/user.

@secext2022

@WildPenquin

Please check this in the container service:

[Unit]
Wants=network-online.target
After=network-online.target

@secext2022

$ systemctl --user status my-app.service
● my-app.service - example deno/fresh app
     Loaded: loaded (/var/home/fc-test/.config/containers/systemd/my-app.container; generated)
    Drop-In: /usr/lib/systemd/user/service.d
             └─10-timeout-abort.conf
     Active: active (running) since Wed 2024-07-17 04:21:49 UTC; 20h ago
   Main PID: 2026 (conmon)
$ systemctl --user list-dependencies my-app
my-app.service
● ├─app.slice
● ├─basic.target
● │ ├─systemd-tmpfiles-setup.service
● │ ├─paths.target
● │ ├─sockets.target
● │ │ └─dbus.socket
● │ └─timers.target
● │   └─systemd-tmpfiles-clean.timer
● └─network-online.target
●   └─systemd-networkd-wait-online.service

@WildPenquin

WildPenquin commented Jul 19, 2024

Hi @secext2022 ,

The Unit section is defined correctly.

As per my log, the problem is that the NetworkManager-wait-online user service finishes much too soon, much sooner than the system-level one. I believe (meaning I'm not sure) that nm-online does not work correctly when run as a user (not designed to be run as a user?).

As yet another workaround, I've added ExecStartPre=/bin/sh -c 'until ping -c1 192.168.66.6; do sleep 1; done;' under [Service]. On the TODO list, I'm going to test whether this works correctly if I change my interface to be managed by systemd-networkd and use the systemd-networkd-wait-online service instead.

$ systemctl --user status pande-pmc.service

● pande-pmc.service - PandESportS MC-serveri
     Loaded: loaded (/home/minecraft/.config/containers/systemd/pande-pmc.container; generated)
     Active: active (running) since Fri 2024-07-19 16:04:56 EEST; 4min 55s ago
 Invocation: 9858022ff77a4dd38327d8c513324e7d
    Process: 829 ExecStartPre=/bin/sh -c until ping -c1 192.168.66.6; do sleep 1; done; (code=exited, status=0/SUCCESS)
   Main PID: 906 (conmon)
      Tasks: 82 (limit: 28525)
     Memory: 6.2G (peak: 6.2G)
        CPU: 1min 6.141s

$ systemctl --user list-dependencies pande-pmc.service

pande-pmc.service
● ├─app.slice
● ├─basic.target
● │ ├─paths.target
● │ ├─sockets.target
● │ │ ├─dbus.socket
● │ │ ├─dirmngr.socket
● │ │ ├─drkonqi-coredump-launcher.socket
● │ │ ├─gpg-agent-browser.socket
● │ │ ├─gpg-agent-extra.socket
● │ │ ├─gpg-agent-ssh.socket
● │ │ ├─gpg-agent.socket
● │ │ ├─keyboxd.socket
● │ │ ├─p11-kit-server.socket
● │ │ ├─pipewire-pulse.socket
● │ │ └─pipewire.socket
● │ └─timers.target
○ │   ├─drkonqi-coredump-cleanup.timer
○ │   └─drkonqi-sentry-postman.timer
● └─network-online.target
○   └─NetworkManager-wait-online.service

config/containers/systemd/pande-pmc.container:

[Unit]
Description=PandESportS MC-serveri

After=network-online.target
Wants=network-online.target


[Container]
AutoUpdate=registry
ContainerName=PandEPMC
Image=docker.io/gameservermanagers/gameserver:pmc
Volume=pandepmc:/data
LogDriver=k8s-file
PublishPort=25560:25560/tcp
PublishPort=25560:25560/udp
PodmanArgs=--log-opt=path=/home/minecraft/PandEPMClog.k8s
Timezone=local

[Service]
ExecStartPre=/bin/sh -c 'until ping -c1 192.168.66.6; do sleep 1; done;'
# Restart=always
Restart=no

[Install]
WantedBy=multi-user.target default.target

@WildPenquin

WildPenquin commented Jul 19, 2024

After reading this thread and the comments in systemd/systemd#3312, I think that issue has much cleaner workarounds than many of the ones here. The workarounds in this thread are often long and convoluted for a relatively simple issue, and they may break if the system configuration changes, because they are not configuration-agnostic. The systemd issue has cleaner and simpler ones:

  • Make the whole user@UID service depend on network-online (RFE: monitor system units from user manager systemd/systemd#3312 (comment)) - but read the whole comment for caveats! This will work if you have a dedicated user for running containers which are useless without a network (so the caveats don't matter). Rename the [email protected] to include the UID so that this is not enabled for all users. It takes 3 lines and changes the user@ service.
  • Make one user service which checks the system-level network-online.target (RFE: monitor system units from user manager systemd/systemd#3312 (comment)). Then make the quadlets depend on this service. This is 4 lines of code in one simple service file, which should work as long as the system network-online.target is configured properly. You could replace the systemctl is-active check with a ping to your gateway or, say, Google, depending on what your services actually need, to work around badly written software ("online" does not necessarily mean a connection to the Internet, nor, I presume, even to your default gateway). But there's no need to "copy" *-wait-online to the user services, which is prone to break (and does not work for NM at all, it seems).

I haven't tested those, but they should work judging from the thumbs =).

I'm also starting to think maybe we should not be discussing workarounds here that much since it adds noise to actually solving the issue (which is: podman user containers should not fail at boot if networking is up). (As a general remark, no services should fail for whatever network error, but instead handle the situation, as network connections are unreliable. All these workarounds should be unnecessary!).

I'm sorry for adding noise here myself, too =).

EDIT: My chosen workaround for the issue (cleanest in my opinion, less prone to break; I chose to name it check-network-online.service but it could be whatever you want it to be):

/etc/systemd/user/check-network-online.service:

[Unit]
Description=Check for system level network-online.target (for users)

[Service]
Type=oneshot
ExecStart=bash -c 'until systemctl is-active network-online.target; do sleep 1; done'
RemainAfterExit=yes

[Install]
WantedBy=default.target

Enable this service for the user. In badly behaving user services (such as podman quadlets), add:

After=check-network-online.service

Of course, YMMV!
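
The `until ...; do sleep 1; done` loops used in the workarounds above never give up. A minimal sketch of the same polling idea with an upper bound on the number of attempts could look like this; the helper name `wait_until` and its argument order are assumptions for illustration, not part of Podman or systemd.

```shell
#!/bin/sh
# Hypothetical helper: run a command once per second until it succeeds
# or until the given number of attempts is used up.
# Usage: wait_until ATTEMPTS COMMAND [ARGS...]
wait_until() {
    attempts="$1"
    shift
    while [ "$attempts" -gt 0 ]; do
        # Succeed as soon as the probed command exits with status 0.
        if "$@" >/dev/null 2>&1; then
            return 0
        fi
        attempts=$((attempts - 1))
        # Pause between attempts, but not after the final failed one.
        if [ "$attempts" -gt 0 ]; then
            sleep 1
        fi
    done
    return 1
}

# Example call (not executed here):
# wait_until 60 systemctl is-active network-online.target
```

With a bound in place, a network that never comes up makes the unit fail with a clear error instead of leaving it stuck in the activating state.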

@sbrivio-rh
Collaborator

I'm also starting to think maybe we should not be discussing workarounds here that much since it adds noise to actually solving the issue

I personally don't find it distracting.

(which is: podman user containers should not fail at boot if networking is up). (As a general remark, no services should fail for whatever network error, but instead handle the situation, as network connections are unreliable. All these workarounds should be unnecessary!).

The thing is, pasta(1) picks host addresses and routes by default. This is by design as it allows you to avoid (implicit) NAT altogether. If there's nothing there, it doesn't know what to pick, so it exits.

We're now considering implementing an optional netlink monitoring function that would dynamically create and delete routes and addresses as they come and go on the host, see also #22959 (comment). That should be robust enough.
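
To see whether the host currently offers anything for pasta to pick, one rough check is to look for a default route. The sketch below only approximates that idea and is not pasta's actual selection logic; the helper name `default_route_iface` is hypothetical, and it assumes input in the format printed by `ip route show default`.

```shell
#!/bin/sh
# Hypothetical helper: read `ip route show default` style output on stdin
# and print the interface name of the first default route found.
# Returns non-zero when no default route is present.
default_route_iface() {
    awk '
        $1 == "default" {
            # Look for the "dev <name>" pair on the route line.
            for (i = 2; i < NF; i++) {
                if ($i == "dev") { print $(i + 1); found = 1; exit }
            }
        }
        END { exit found ? 0 : 1 }
    '
}

# Example call (not executed here):
# ip -4 route show default | default_route_iface
```

If this prints nothing right after boot but prints an interface name a few seconds later, the unit is most likely being started before the network is ready.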

@vrothberg
Member

@Luap99 @rhatdan @ygalblum shall we update the quadlet docs to point that out?

Sitting in a meeting where this issue was brought up.

@gdonval

gdonval commented Sep 20, 2024

If the doc said "Quadlets are currently broken. Please see that bug report XXX we have with systemd.", at the top in red and bold, I guess the situation would be improved tremendously. Acknowledging current limits and bugs is a big part of establishing trust with users.

As it is, users stumble across this again and again. I can't speak for the general industry but here, no one wants to hear about podman again for instance.

@Luap99
Member

Luap99 commented Oct 17, 2024

#24305 implements the workaround; it would be great if some folks could test it.

Luap99 added a commit to Luap99/libpod that referenced this issue Oct 18, 2024
This service is meant to be used by quadlet as replacement for
network-online.target as this does not work for rootless users.

see containers#22197

Signed-off-by: Paul Holzinger <[email protected]>
Luap99 added a commit to Luap99/libpod that referenced this issue Oct 18, 2024
As documented in the issue there is no way to wait for system units from
the user session[1]. This causes problems for rootless quadlet units as
they might be started before the network is fully up. While this was
always the case and thus was never really noticed, the main thing that
triggered a bunch of errors was the switch to pasta.

Pasta requires the network to be fully up in order to correctly select
the right "template" interface based on the routes. If it cannot find a
suitable interface it just fails and we cannot start the container,
understandably leading to a lot of frustration from users.

As there is no sign of any movement on the systemd issue we work around it
here by using our own user unit that checks if the system session
network-online.target is ready.

Now for testing it is a bit complicated. While we do now correctly test
the root and rootless generator since commit ada75c0 the resulting
Wants/After= lines differ between them and there is no logic in the
test files themselves to say if root/rootless to match specifics. One idea
was to use `assert-key-is-rootless/root` but that seemed like more
duplication for little reason so use a regex and allow both to make it
pass always. To still have some test coverage add a check in the system
test to ask systemd if we did indeed have the right dependencies where
we can check for exact root/rootless name match.

[1] systemd/systemd#3312

Fixes containers#22197

Signed-off-by: Paul Holzinger <[email protected]>
@topas-rec
Contributor

topas-rec commented Oct 27, 2024

Thanks!

I think I have this issue because

  • I cannot access my container after host reboot
  • restarting the container without host reboot makes it accessible and
  • I have pasta[539]: External interface not usable in my hosts boot logs

Switching to slirp4netns, as suggested in this issue, makes container access work after a reboot as well.

Now I tested podman 5.2.5 which should include #24305.
With pasta as the default backend the issue is not solved. Should it be solved? How can I help?

Simply put, my machine has five network interfaces through which it is accessed.
(I also use a bond, rate limiting with IFB interfaces, and VLAN interfaces, but I don't think any of this causes the issue, since "everything else works"™ and the issue is gone with slirp4netns.)

@Luap99
Member

Luap99 commented Oct 27, 2024

Now I tested podman 5.2.5 which should include #24305.

Why do you think 5.2.5 included this fix? The release notes are very clear about what it contains: https://github.com/containers/podman/releases/tag/v5.2.5.
It is in v5.3.0-rc1 so it will land in 5.3.0 final.
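
To check quickly whether an installed Podman is at least the release that carries the fix, a small version comparison can help. This is a sketch under assumptions: the helper name `version_ge` is hypothetical, it relies on a `sort` that supports `-V` (GNU coreutils does), and it does not treat pre-release suffixes such as `-rc1` specially.

```shell
#!/bin/sh
# Hypothetical helper: succeed when version $1 is greater than or equal
# to version $2, using version-aware sorting.
version_ge() {
    # The smaller of the two versions ends up on the first line.
    lowest=$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)
    [ "$lowest" = "$2" ]
}

# Example call (not executed here); `podman --version` prints
# "podman version X.Y.Z", so the third field is the version number:
# version_ge "$(podman --version | awk '{print $3}')" 5.3.0 && echo "fix should be included"
```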

@topas-rec
Contributor

Because I guessed that when a PR is merged and a release is created then those changes are in.

I took the time to read through the release notes and of course didn't find the change listed there. Since it was missing, I looked at a previous release too, to find out how much I can rely on the release notes. Some projects don't mention all the changes in there. And: people make mistakes. Things that should be in the release notes are sometimes left out.

Then I looked at the branch that the fix was merged in. Since it wasn't merged in master or main (which I expected) I tried to find out how the merge strategy looks like. I didn't find a graph view on github and then gave up. I didn't want to spend the time to clone it, which I should've done - yes.

So that's why.

Thanks for letting me know which release the fix is in. I just want to help and I'll try to check better next time.

@sbrivio-rh
Collaborator

Then I looked at the branch that the fix was merged in. Since it wasn't merged in master or main (which I expected) I tried to find out how the merge strategy looks like. I didn't find a graph view on github and then gave up. I didn't want to spend the time to clone it, which I should've done - yes.

Tip, as I'm familiar with git but not with GitHub and it took me a while to spot this: information equivalent to:

$ git describe --contains 57b022782bba8cd48865f9dd84e9fea8a1588e4c
v5.3.0-rc1~10^2~1

is found, on "commits" pages, just after the end of the commit message. Say, at the page for 57b0227:

main (#24305) 
v5.3.0-rc1
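
When scripting the `git describe --contains` lookup shown above, the ancestry suffix (`~10^2~1`) usually needs to be stripped to get just the tag name. A minimal sketch, with the helper name `describe_to_tag` being hypothetical:

```shell
#!/bin/sh
# Hypothetical helper: strip the ~N / ^N ancestry suffix from
# `git describe --contains` output, leaving only the tag name.
describe_to_tag() {
    # Remove everything from the first "~" or "^" to the end of the string.
    printf '%s\n' "${1%%[~^]*}"
}

# Example call (not executed here):
# describe_to_tag "$(git describe --contains 57b022782bba8cd48865f9dd84e9fea8a1588e4c)"
```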

@Luap99
Member

Luap99 commented Oct 29, 2024

Yes it shows it on the commits page, however that only works for things going forward. Generally speaking fixes for a new patch (.z) release will not show up in there as it will not pick up the backport commits into the release branch. So for that you would manually need to check the backport commits in the release branch which of course is annoying but I would say the release notes for the patch releases should be complete and not miss stuff as we only do a few backports most of the time. But of course we are human and sometimes things are missed.

@urbenlegend

I think this bug may need to be re-opened. I am on Podman 5.3 and I am still getting issues where my rootless containers are not properly starting when I log in.

Nov 20 17:24:13 arch-desktop podman[1042]: time="2024-11-20T17:24:13-08:00" level=error msg="Starting some container dependencies"
Nov 20 17:24:13 arch-desktop podman[1042]: time="2024-11-20T17:24:13-08:00" level=error msg="\"setting up Pasta: pasta failed with exit code 1:\\nExternal interface not usable\\n\""
Nov 20 17:24:13 arch-desktop podman[1042]: Error: unable to start container "658c7a404e78463fefbb4ecd8ae413efefdb0b49ce44af8c838edecd92f3084b": setting up Pasta: pasta failed with exit code 1:
Nov 20 17:24:13 arch-desktop podman[1042]: External interface not usable
Nov 20 17:24:13 arch-desktop podman[1042]: Error: unable to start container "67226d53cf17d652b1cedb5ad563e5a4416c8e119223231a83e9e33d61fd26d7": starting some containers: internal libpod error
Nov 20 17:24:13 arch-desktop podman[1042]: time="2024-11-20T17:24:13-08:00" level=info msg="Received shutdown.Stop(), terminating!" PID=1042
Nov 20 17:24:13 arch-desktop systemd[974]: podman-restart.service: Main process exited, code=exited, status=125/n/a
Nov 20 17:24:13 arch-desktop systemd[974]: podman-restart.service: Failed with result 'exit-code'.
Nov 20 17:24:13 arch-desktop systemd[974]: Failed to start Podman Start All Containers With Restart Policy Set To Always.
Nov 20 17:24:13 arch-desktop systemd[974]: podman-restart.service: Consumed 187ms CPU time, 100.1M memory peak.

This occurs when NetworkManager is set to connect to my Wifi on log in (Connection is set to be only available for my user and wifi password is stored in an encrypted form). If I set it to be available for all users with a key stored in plaintext, then the wifi connects long before I get to log in, and my containers restart properly.

@sbrivio-rh
Collaborator

> This occurs when NetworkManager is set to connect to my Wifi on log in

Which means that Podman/systemd units should wait quite a long time before bringing containers up.

What would be your expectation? We could also decide that pasta, instead of refusing to start, would assign the container some fake address and routes (like slirp4netns used to do), but then you lose the (default) seamless/transparent addressing.

Or would you expect that your containers start only as you log in and your WiFi password is decrypted?

> I think this bug may need to be re-opened

I'm not sure, it covered a scenario that's different enough to be considered another issue altogether, I think.

@urbenlegend

> Or would you expect that your containers start only as you log in and your WiFi password is decrypted?

I think that's exactly what I would expect for rootless containers created by users that don't have linger enabled.

Here's my situation. I have several containers that I start up using podman-compose. I enable the user-level podman-restart service in an attempt to have them restart whenever I log in. The problem is that I think podman-restart isn't waiting for network at all. It is attempting to start up the containers while my computer is connecting to the wifi. Container startup fails as a result.

Essentially, the user level podman-restart service is functionally useless if the network takes a long time to come up.

@zbynekwinkler

I have a server without wifi, and it does not work either. The wait unit installed at /usr/lib/systemd/user/podman-user-wait-network-online.service, where I find ExecStart=sh -c 'until systemctl is-active network-online.target; do sleep 0.5; done', exits very early in the boot process, certainly at a time when pasta still complains.

I have tried many things to work around this issue. The only one working for me is waiting until the sshd is accessible at port 22. This is what I ended up with:

ExecStart=/bin/bash -c 'until nc -vzw 1 $(hostname -I | cut -f1 -d" ") 22; do sleep 1; done'

The difference is about 5s - meaning that the sshd is accessible on the outside statically assigned IP only about 5s after systemctl says that the network-online.target is active.

I don't have NetworkManager and/or wifi on the server. Only ethernet with static IP. Nothing fancy. The containers are rootless with linger enabled.
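
A polling loop like the ones above never gives up if the condition is never met. As a minimal sketch, assuming whole-second intervals, the same idea can be bounded with a timeout. The helper name `wait_until` is hypothetical and not part of Podman or systemd.

```shell
#!/bin/sh
# wait_until TIMEOUT INTERVAL COMMAND [ARGS...]
# Hypothetical helper: re-run COMMAND every INTERVAL seconds until it
# succeeds, giving up once roughly TIMEOUT seconds have been slept.
# Both values must be whole seconds because of the shell arithmetic.
wait_until() {
    timeout_s=$1
    interval_s=$2
    shift 2
    elapsed=0
    until "$@"; do
        if [ "$elapsed" -ge "$timeout_s" ]; then
            return 1    # condition never became true in time
        fi
        sleep "$interval_s"
        elapsed=$((elapsed + interval_s))
    done
    return 0
}

# Example (the address is a placeholder): wait up to 60 seconds for sshd.
# wait_until 60 1 nc -zw 1 192.0.2.10 22
```

The return status tells the caller whether the condition was met, so a unit using it can fail visibly instead of hanging.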

@sbrivio-rh
Collaborator

> I find ExecStart=sh -c 'until systemctl is-active network-online.target; do sleep 0.5; done' exits very early in the boot process

Any idea why?

@Luap99
Member

Luap99 commented Nov 21, 2024

If network-online.target succeeds too early, then this is out of scope for podman/quadlet. We cannot possibly handle every network setup and know what "done" means, which is exactly why I check for network-online.target: it already is such a definition.

You can manually fix the target or override podman-user-wait-network-online.service with whatever command you want.
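
One way to do such an override is a systemd drop-in in the user's unit directory. The following is only a sketch under assumptions: the function name `write_wait_override` and the file name `override.conf` are made up for illustration, and the check command passed in must not contain single quotes, `%` or `$`, since systemd would interpret those.

```shell
#!/bin/sh
# write_wait_override UNIT_DIR CHECK_CMD
# Hypothetical helper: writes a drop-in that replaces the ExecStart of
# podman-user-wait-network-online.service with a loop around CHECK_CMD.
# UNIT_DIR is the user unit directory, e.g. ~/.config/systemd/user.
write_wait_override() {
    unit_dir=$1
    check_cmd=$2
    dropin_dir="$unit_dir/podman-user-wait-network-online.service.d"
    mkdir -p "$dropin_dir" || return 1
    cat > "$dropin_dir/override.conf" <<EOF
[Service]
# An empty ExecStart= clears the command shipped with the unit.
ExecStart=
ExecStart=/bin/sh -c 'until $check_cmd; do sleep 0.5; done'
EOF
}

# Example: wait until the main routing table is non-empty (an
# illustrative check, not one recommended in this thread), then reload.
# write_wait_override "${XDG_CONFIG_HOME:-$HOME/.config}/systemd/user" \
#     "ip route show | grep -q ."
# systemctl --user daemon-reload
```

Running `systemctl --user edit podman-user-wait-network-online.service` should produce an equivalent drop-in interactively.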

> podman-restart.service

I only fixed quadlet units; I forgot to change podman-restart.service and podman-kube@.service, as they also start containers.
(Note: we strongly recommend using quadlet units over podman-restart.service, as systemd's restart logic is much better than the podman run --restart flag.)

@urbenlegend

> I forgot to change podman-restart.service and podman-kube@.service, as they also start containers.
> (Note: we strongly recommend using quadlet units over podman-restart.service, as systemd's restart logic is much better than the podman run --restart flag.)

I would love it if similar patches were made to those user services as well, as my workflow currently involves dealing with a lot of Docker Compose files and not Quadlet.

@Luap99
Member

Luap99 commented Nov 21, 2024

Yes, I was not trying to imply that we should not fix them. They definitely need to be fixed the same way. I filed a new issue, #24637, to avoid spamming this long issue any further.

@zbynekwinkler

> > I find ExecStart=sh -c 'until systemctl is-active network-online.target; do sleep 0.5; done' exits very early in the boot process
>
> Any idea why?

Not really. When it exits, the eth interface is up and the static IP is assigned. It seems online. It is just that pasta does not like it yet. I am not sure what the ultimate precondition for starting pasta is. Being online as defined by systemd does not seem to be enough.

> If network-online.target succeeds too early, then this is out of scope for podman/quadlet. We cannot possibly handle every network setup and know what "done" means, which is exactly why I check for network-online.target: it already is such a definition.

This is the simplest setup possible. Single eth interface, static IP defined in /etc/network/interfaces. What else is missing for pasta to start?

@sbrivio-rh
Collaborator

> What else is missing for pasta to start?

It's also looking for an interface with a route. Not even a default route, just a route, because it shows that that interface is not completely useless.
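
That condition can be approximated from a shell. This is a rough sketch, not pasta's actual detection logic: the hypothetical function `has_routed_iface` reads `ip route show` style output on standard input and succeeds if any route names a device other than `lo`.

```shell
#!/bin/sh
# has_routed_iface
# Hypothetical helper: reads routing table text (as printed by
# "ip route show") on stdin and exits 0 if at least one route refers to
# a non-loopback device. This only approximates what pasta looks for.
has_routed_iface() {
    awk '
        {
            for (i = 1; i < NF; i++)
                if ($i == "dev" && $(i + 1) != "lo")
                    found = 1
        }
        END { exit found ? 0 : 1 }
    '
}

# Example: block until some interface has an IPv4 or IPv6 route.
# until { ip -4 route show; ip -6 route show; } | has_routed_iface; do
#     sleep 1
# done
```

Because the function only parses text, it can be tried against saved `ip route show` output from a failing boot to see whether a route was present at that moment.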

@flyingfishflash

flyingfishflash commented Nov 25, 2024 via email

@stratdev3

stratdev3 commented Dec 29, 2024

@Luap99 thanks for the workaround.

Important note:

I'm on NixOS, where the current release of podman is v5.2.3.
After spending a couple of days testing your workaround, it never worked: although network-online was active, pasta still threw an error.

I then tested other targets, and multi-user.target was the only one that worked:

/bin/bash -c 'until systemctl is-active multi-user.target; do sleep 0.5; done;'
