[v3.2] Cannot get journald cursor when running in namespace with /var/log overmounted #10863

Closed
maybe-sybr opened this issue Jul 6, 2021 · 5 comments

@maybe-sybr
Contributor

maybe-sybr commented Jul 6, 2021

/kind bug

Description

v3.2.1 introduced a breakage in my environment where I run the vanilla Fedora podman (i.e. with systemd support) under namespaces - specifically, injected underneath a usernetes namespace which has its own mount namespace and overmounts /var/log/. The issue manifests as podman logs -f <somecontainer> spitting out the following error:

DEBU[0000] Initializing event backend journald
ERRO[0000] Error waiting for container to exit: failed to get journal cursor: failed to get cursor: cannot assign requested address

and then exiting non-zero.

An strace shows:

[pid 916143] getpid()                   = 916136
[pid 916143] gettid()                   = 916143
[pid 916143] futex(0x7f69684604e0, FUTEX_WAKE_PRIVATE, 2147483647) = 0
[pid 916143] openat(AT_FDCWD, "/run/log/journal", O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY) = 10
[pid 916143] newfstatat(10, "", {st_mode=S_IFDIR|S_ISGID|0755, st_size=40, ...}, AT_EMPTY_PATH) = 0
[pid 916143] getdents64(10, 0x7f69540042c0 /* 2 entries */, 32768) = 48
[pid 916143] getdents64(10, 0x7f69540042c0 /* 0 entries */, 32768) = 0
[pid 916143] fstatfs(10, {f_type=TMPFS_MAGIC, f_bsize=4096, f_blocks=804376, f_bfree=803910, f_bavail=803910, f_files=819200, f_ffree=817903, f_fsid={val=[0, 0]}, f_namelen=255, f_frsize=4096, f_flags=ST_VALID|ST_NOSUID|ST_NODEV}) = 0
[pid 916143] close(10)                  = 0
[pid 916143] openat(AT_FDCWD, "/var/log/journal", O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY) = -1 ENOENT (No such file or directory)
[pid 916143] futex(0xc000600150, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 916159] <... futex resumed>)       = 0
[pid 916143] <... futex resumed>)       = 1
[pid 916159] nanosleep({tv_sec=0, tv_nsec=3000},  <unfinished ...>
[pid 916143] write(2, "\33[31mERRO\33[0m[0000] Error waitin"..., 142ERRO[0000] Error waiting for container to exit: failed to get journal cursor: failed to get cursor: cannot assign requested address
) = 142
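
For context, here's a minimal sketch of how that error can surface. I'm assuming (not confirmed against the podman source) that the journald log path reads the journal through the github.com/coreos/go-systemd/v22/sdjournal bindings; with /var/log/journal overmounted away and /run/log/journal empty, the journal never gets positioned on an entry, so asking for a cursor fails with EADDRNOTAVAIL, which is rendered as "cannot assign requested address".

// Minimal sketch, assuming the coreos/go-systemd sdjournal bindings;
// not podman's actual code path.
package main

import (
    "fmt"

    "github.com/coreos/go-systemd/v22/sdjournal"
)

func main() {
    j, err := sdjournal.NewJournal()
    if err != nil {
        fmt.Println("open journal:", err)
        return
    }
    defer j.Close()

    // With no journal files visible, Next() advances over zero entries.
    n, err := j.Next()
    fmt.Println("entries advanced:", n, "err:", err)

    // Without a current entry the cursor cannot be read, which is where
    // "failed to get cursor: cannot assign requested address" comes from.
    if _, err := j.GetCursor(); err != nil {
        fmt.Println("failed to get cursor:", err)
    }
}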

The repro below (step 4) works on 3.2.0, breaks in 3.2.1 due to c3f6ef6 having been backported, and is not fixed in 3.2.2 (#10744) because the bug was actually only exposed by c3f6ef6 rather than being caused by it. I asked for the changes backported in #10744 to land in 3.2.2 and @vrothberg obliged (thanks!), but after seeing the same misbehaviour I had to do a bit more bisecting.

It turns out that the issue is triggered by c3f6ef6 (confirmed by reverting that change on top of the affected tags), which changes how we wait for a container to exit by using runtime events instead of following a file until EOF. However, on main, cherry-picking that change on top of any commit from the 3.2.0-rc0 divergence up to fb4a0c5 also causes the misbehaviour.

fb4a0c5 pulls in some changes from containers/common which appear to be the correct fix for this misbehaviour, rather than the changes landed in #10744 - specifically containers/common@7c0d472, plus containers/common@a1b80b5 which fixed some missed imports in the former commit. Those changes landed in common v0.40.0 and don't appear to have been backported to 0.38.

My suggested fix is to backport containers/common@7c0d472 and containers/common@a1b80b5 into a common v0.38.15 and then include that in a podman 3.2.3. For the moment, I'm just going to move back to the tip of main to unblock my own work :)
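
For anyone checking a candidate common revision, here's a hypothetical standalone check (not part of podman; it just compiles against the containers/common config package quoted in the diff further down) that prints which defaults get resolved. Run it once on the host and once inside the step-4 namespace; with the fix, the latter should report "file" for the events logger instead of "journald":

// Hypothetical check against a candidate containers/common revision.
package main

import (
    "fmt"

    "github.com/containers/common/pkg/config"
)

func main() {
    cfg, err := config.DefaultConfig()
    if err != nil {
        panic(err)
    }
    // With pid 1 != systemd (step 4 of the repro), the events logger
    // should come out as "file" once the fix is in place.
    fmt.Println("events_logger:", cfg.Engine.EventsLogger)
    fmt.Println("log_driver:", cfg.Containers.LogDriver)
}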

Steps to reproduce the issue:

I've sanitised the following repro which I think demonstrates the behaviour portably enough.

  1. Use a podman built with systemd support

  2. Unshare a mount NS and overmount /var/log with tmpfs or something (you also need to unshare userns and map root)

  • unshare -Urm then mount -t tmpfs tmpfs /var/log
  3. Run a container in the mount NS, attempt to podman logs -f it, and observe the error message
  • you'll need to set some magic _CONTAINERS env vars to let podman know that the userns is set up already
    • export _CONTAINERS_ROOTLESS_UID=<youruid>
    • export _CONTAINERS_USERNS_CONFIGURED="true"
  • I do this command to follow a short-lived container:
$ podman --cgroup-manager=cgroupfs --log-level=debug logs -fn \
    "$(podman --cgroup-manager=cgroupfs run -d docker.io/library/alpine sh -c "sleep 2; echo hi")"`

As a one-liner:

$ podman pull docker.io/library/alpine  # make sure you have this image before entering the userns, pulling doesn't work there
$ unshare -Urm env _CONTAINERS_ROOTLESS_UID="$(id -u "${USER}")" _CONTAINERS_USERNS_CONFIGURED="true" sh -c 'mount -t tmpfs tmpfs /var/log && podman logs --cgroup-manager=cgroupfs --log-level=debug -fn "$(podman --cgroup-manager=cgroupfs run -d docker.io/library/alpine sh -c "sleep 2; echo hi")"'

This is enough to demonstrate how the breakage occurs and strace things, but not exactly the same as my environment (see step 4). I've separated this out since it may be an easier system to reason about along the way to step 4 when confirming my observations :)

  4. Observing the exact sub-usernetes behaviour is a little more involved. We need to unshare more namespaces so that we end up running without systemd as pid 1
  • unshare mount, user, pid namespaces and make sure you fork a child to avoid memory allocation issues with unshare -Urmpf
  • overmount /var/log again and also mount the pidns' procfs with mount -t proc proc /proc
  • Run the podman command from step 3

As a one-liner again:

$ unshare -Urmpf env _CONTAINERS_ROOTLESS_UID="$(id -u "${USER}")" _CONTAINERS_USERNS_CONFIGURED="true" sh -c 'mount -t tmpfs tmpfs /var/log && mount -t proc proc /proc && podman logs --cgroup-manager=cgroupfs --log-level=debug -fn "$(podman --cgroup-manager=cgroupfs run -d docker.io/library/alpine sh -c "sleep 2; echo hi")"'

On 3.2.1/3.2.2 and the current tip of the 3.2 branch (6f0bf16) this breaks.

With the following diff (changes to the vendorised containers/common) applied to the tip of the 3.2 branch, it works:
diff --git a/vendor/github.com/containers/common/pkg/config/default.go b/vendor/github.com/containers/common/pkg/config/default.go
index 133f9f073..794efc32e 100644
--- a/vendor/github.com/containers/common/pkg/config/default.go
+++ b/vendor/github.com/containers/common/pkg/config/default.go
@@ -191,7 +191,7 @@ func DefaultConfig() (*Config, error) {
 			Init:       false,
 			InitPath:   "",
 			IPCNS:      "private",
-			LogDriver:  DefaultLogDriver,
+			LogDriver:  defaultLogDriver(),
 			LogSizeMax: DefaultLogSizeMax,
 			NetNS:      netns,
 			NoHosts:    false,
diff --git a/vendor/github.com/containers/common/pkg/config/nosystemd.go b/vendor/github.com/containers/common/pkg/config/nosystemd.go
index 5b82b1389..6e39a6ccd 100644
--- a/vendor/github.com/containers/common/pkg/config/nosystemd.go
+++ b/vendor/github.com/containers/common/pkg/config/nosystemd.go
@@ -3,9 +3,17 @@
 package config
 
 func defaultCgroupManager() string {
-	return "cgroupfs"
+	return CgroupfsCgroupsManager
 }
 
 func defaultEventsLogger() string {
 	return "file"
 }
+
+func defaultLogDriver() string {
+	return DefaultLogDriver
+}
+
+func useSystemd() bool {
+	return false
+}
diff --git a/vendor/github.com/containers/common/pkg/config/systemd.go b/vendor/github.com/containers/common/pkg/config/systemd.go
index 02e5c4ac2..ed014126b 100644
--- a/vendor/github.com/containers/common/pkg/config/systemd.go
+++ b/vendor/github.com/containers/common/pkg/config/systemd.go
@@ -3,11 +3,23 @@
 package config
 
 import (
+	"io/ioutil"
+	"strings"
+	"sync"
+
 	"github.com/containers/common/pkg/cgroupv2"
 	"github.com/containers/storage/pkg/unshare"
 )
 
+var (
+	systemdOnce sync.Once
+	usesSystemd bool
+)
+
 func defaultCgroupManager() string {
+	if !useSystemd() {
+		return CgroupfsCgroupsManager
+	}
 	enabled, err := cgroupv2.Enabled()
 	if err == nil && !enabled && unshare.IsRootless() {
 		return CgroupfsCgroupsManager
@@ -15,6 +27,32 @@ func defaultCgroupManager() string {
 
 	return SystemdCgroupsManager
 }
+
 func defaultEventsLogger() string {
-	return "journald"
+	if useSystemd() {
+		return "journald"
+	}
+	return "file"
+}
+
+func defaultLogDriver() string {
+	// If we decide to change the default for logdriver, it should be done here.
+	if useSystemd() {
+		return DefaultLogDriver
+	}
+
+	return DefaultLogDriver
+
+}
+
+func useSystemd() bool {
+	systemdOnce.Do(func() {
+		dat, err := ioutil.ReadFile("/proc/1/comm")
+		if err == nil {
+			val := strings.TrimSuffix(string(dat), "\n")
+			usesSystemd = (val == "systemd")
+		}
+		return
+	})
+	return usesSystemd
 }

Describe the results you received:

Podman failed to get a cursor for the journal at /var/log/journal because that path does not exist (it is overmounted).

Describe the results you expected:

In step 3 of my repro, podman should have seen that pid 1 was still systemd and died, because attempting to access /var/log/journal (and /run/log/journal I suppose, which podman does try first) is the correct choice to make there.

In step 4 of my repro, pid 1 is no longer systemd, so we should fall back to "file" logging rather than "journald" per the changes in common. The container should run and logs should be followed until the container exits.
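
To spell out the fallback I'm expecting, here's the containers/common change quoted above condensed into a standalone form (the real code caches the result with sync.Once and uses ioutil.ReadFile; this is just the decision logic):

// Condensed sketch of the expected fallback: only default to journald
// when pid 1 is systemd.
package main

import (
    "fmt"
    "os"
    "strings"
)

func defaultEventsLogger() string {
    comm, err := os.ReadFile("/proc/1/comm")
    if err == nil && strings.TrimSpace(string(comm)) == "systemd" {
        return "journald"
    }
    return "file"
}

func main() {
    // Inside the step-4 namespace /proc/1/comm is the unshared shell,
    // so this should print "file"; on a systemd host it prints "journald".
    fmt.Println(defaultEventsLogger())
}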

Additional information you deem important (e.g. issue happens only occasionally):

Output of podman version:

[maybe@karma proj]$ which podman
/usr/bin/podman
[maybe@karma proj]$ podman --version
podman version 3.2.2

Output of podman info --debug:

[maybe@karma proj]$ podman info --debug
host:
  arch: amd64
  buildahVersion: 1.21.0
  cgroupControllers: []
  cgroupManager: systemd
  cgroupVersion: v2
  conmon:
    package: conmon-2.0.27-2.fc34.x86_64
    path: /usr/bin/conmon
    version: 'conmon version 2.0.27, commit: '
  cpus: 8
  distribution:
    distribution: fedora
    version: "34"
  eventLogger: journald
  hostname: karma
  idMappings:
    gidmap:
    - container_id: 0
      host_id: 1337
      size: 1
    - container_id: 1
      host_id: 100000
      size: 65536
    uidmap:
    - container_id: 0
      host_id: 1337
      size: 1
    - container_id: 1
      host_id: 100000
      size: 65536
  kernel: 5.12.13-300.fc34.x86_64
  linkmode: dynamic
  memFree: 2506313728
  memTotal: 16473620480
  ociRuntime:
    name: crun
    package: crun-0.20.1-1.fc34.x86_64
    path: /usr/bin/crun
    version: |-
      crun version 0.20.1
      commit: 0d42f1109fd73548f44b01b3e84d04a279e99d2e
      spec: 1.0.0
      +SYSTEMD +SELINUX +APPARMOR +CAP +SECCOMP +EBPF +CRIU +YAJL
  os: linux
  remoteSocket:
    path: /run/user/1337/podman/podman.sock
  security:
    apparmorEnabled: false
    capabilities: CAP_CHOWN,CAP_DAC_OVERRIDE,CAP_FOWNER,CAP_FSETID,CAP_KILL,CAP_NET_BIND_SERVICE,CAP_SETFCAP,CAP_SETGID,CAP_SETPCAP,CAP_SETUID,CAP_SYS_CHROOT
    rootless: true
    seccompEnabled: true
    seccompProfilePath: /usr/share/containers/seccomp.json
    selinuxEnabled: true
  serviceIsRemote: false
  slirp4netns:
    executable: /usr/bin/slirp4netns
    package: slirp4netns-1.1.9-1.fc34.x86_64
    version: |-
      slirp4netns version 1.1.8+dev
      commit: 6dc0186e020232ae1a6fcc1f7afbc3ea02fd3876
      libslirp: 4.4.0
      SLIRP_CONFIG_VERSION_MAX: 3
      libseccomp: 2.5.0
  swapFree: 25948839936
  swapTotal: 26684153856
  uptime: 3h 44m 12.55s (Approximately 0.12 days)
registries:
  registry.chopstix.lan:
    Blocked: false
    Insecure: true
    Location: registry.chopstix.lan:5000
    MirrorByDigestOnly: false
    Mirrors: null
    Prefix: registry.chopstix.lan
  registry.chopstix.lan:5000:
    Blocked: false
    Insecure: true
    Location: registry.chopstix.lan:5000
    MirrorByDigestOnly: false
    Mirrors: null
    Prefix: registry.chopstix.lan:5000
  search:
  - registry.fedoraproject.org
  - registry.access.redhat.com
  - registry.centos.org
  - docker.io
store:
  configFile: /home/maybe/.config/containers/storage.conf
  containerStore:
    number: 20
    paused: 0
    running: 1
    stopped: 19
  graphDriverName: overlay
  graphOptions:
    overlay.mount_program:
      Executable: /usr/bin/fuse-overlayfs
      Package: fuse-overlayfs-1.5.0-1.fc34.x86_64
      Version: |-
        fusermount3 version: 3.10.4
        fuse-overlayfs: version 1.5
        FUSE library version 3.10.4
        using FUSE kernel interface version 7.31
  graphRoot: /home/maybe/.local/share/containers/storage
  graphStatus:
    Backing Filesystem: extfs
    Native Overlay Diff: "false"
    Supports d_type: "true"
    Using metacopy: "false"
  imageStore:
    number: 95
  runRoot: /run/user/1337/containers
  volumePath: /home/maybe/.local/share/containers/storage/volumes
version:
  APIVersion: 3.2.2
  Built: 1624664959
  BuiltTime: Sat Jun 26 09:49:19 2021
  GitCommit: ""
  GoVersion: go1.16.4
  OsArch: linux/amd64
  Version: 3.2.2

Package info (e.g. output of rpm -q podman or apt list podman):

[maybe@karma proj]$ rpm -q podman
podman-3.2.2-1.fc34.x86_64

Have you tested with the latest version of Podman and have you checked the Podman Troubleshooting Guide? (https://github.com/containers/podman/blob/master/troubleshooting.md)

Yes

Additional environment details (AWS, VirtualBox, physical, etc.):

@vrothberg vrothberg self-assigned this Jul 6, 2021
@vrothberg
Member

Thanks for the excellent report, @maybe-sybr. I will take care of backporting etc.

@vrothberg
Member

containers/common#668

@maybe-sybr
Contributor Author

Much appreciated @vrothberg :)

@rhatdan
Member

rhatdan commented Jul 10, 2021

@vrothberg has this been backported?

@vrothberg
Member

Yes, the fixes are in the main branch and the v3.2 branch (not yet released).
