This issue was moved to a discussion.


Podman misinterprets %h symbol in the bind volume source when container created/started from a unit (BTRFS file system) #11547

Closed
PavelSosin-320 opened this issue Sep 13, 2021 · 31 comments
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@PavelSosin-320

Is this a BUG REPORT or FEATURE REQUEST? (leave only one on its own line)

/kind bug

Description

When a container that uses a bind volume is created or run from a unit generated by the podman generate systemd command, the source path can be expressed either as an absolute path or via systemd % specifiers, %h for example. All % "path" specifiers like %t and %h are expanded into absolute paths, i.e. they must be accepted by podman as a valid bind volume source. All subdirectories relative to %h, %v, etc. should be accepted as well. I hope the local driver and fuse mount support this scenario.

Steps to reproduce the issue:

  1. Create a container from the docker.io/theiaide/theia image with a bind volume such as /home/user/project:/home/project
  2. Generate a systemd unit from the container.
  3. Put the unit into ~/.config/systemd/user and enable it.
  4. Create a "project workspace directory" ~/project.
  5. Modify the unit .service file: add ConditionPathExists=%h/project (to be sure that the bind volume source exists and its full path is correctly expanded) and express the bind volume source as -v %h/project:/home/project (see the sketch below).
  6. Start the unit using systemctl start.

Describe the results you received:
"Invalid reference" error message.

Describe the results you expected:
It must work, because a systemd unit must be shareable between users.
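
For reference, a minimal sketch of the unit lines from step 5, assuming a service generated by podman generate systemd (the container name and port mapping here are placeholders, not the generated values):

```ini
[Unit]
# %h expands to the user's home directory; assert the volume source exists.
ConditionPathExists=%h/project

[Service]
# systemd expands %h before podman ever parses the -v value.
ExecStart=/usr/bin/podman run --name theia -d -p 3000:3000 \
    -v %h/project:/home/project docker.io/theiaide/theia
```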

Additional information you deem important (e.g. issue happens only occasionally):
It works OK in WSL because the workspace is located in the host filesystem
Output of podman version:
Podman Version 3.3.1
API Version: 3.3.1


**Output of `podman info --debug`:**

(paste your output here)


**Package info (e.g. output of `rpm -q podman` or `apt list podman`):**

(paste your output here)


**Have you tested with the latest version of Podman and have you checked the Podman Troubleshooting Guide? (https://github.com/containers/podman/blob/master/troubleshooting.md)**


Yes

**Additional environment details (AWS, VirtualBox, physical, etc.):**
Fedora 34 Workstation on the laptop.
--security-opt label=disable has been added.
Standalone podman run -v /home/pavelsosin/project:/home/project [TheiaIDE](https://hub.docker.com/r/theiaide/theia) is OK.
@openshift-ci openshift-ci bot added the kind/bug Categorizes issue or PR as related to a bug. label Sep 13, 2021
@PavelSosin-320
Author

Sorry! It may be unrelated, because it happens regardless of the volume type. Simply put, podman run and podman container create return error code 125 when executed by a rootless user, i.e. from a systemd unit on my Fedora WS in the user session, like in #2197. Then the cid file is not created, and stop and rm don't work. It never happens when podman runs in the terminal, only when it is started from the unit using systemctl --user start, daemon-reexec, etc. Hint: the Podman REST API service is running in the background.

@rhatdan
Member

rhatdan commented Sep 13, 2021

@vrothberg PTAL

@PavelSosin-320
Author

@giuseppe I think that it is exactly #2172, because the main difference between the working case (the WSL distro, i.e. an ext4 filesystem) and the failing Fedora 34 WS (i.e. a btrfs filesystem) is the filesystem. But I expected that in 3.1 the configuration file should be corrected. Please look at the attachment.
podman-info.log
I'm using a Fedora WS GUI session as a rootless user. Which conf file defines the overlay driver behavior in this case?

@giuseppe
Member

could you share the systemd service file generated by Podman?

@PavelSosin-320
Author

PavelSosin-320 commented Sep 14, 2021 via email

@PavelSosin-320 PavelSosin-320 changed the title Podman misinterprets %h symbol in the bind volume source when container created/started from a unit Podman misinterprets %h symbol in the bind volume source when container created/started from a unit (BTRFS file system) Sep 16, 2021
@PavelSosin-320
Author

The root cause is certainly related to the filesystem type: Fedora 34 uses BTRFS by default, mounted at /home. If user data is located under /home too, but not necessarily on the btrfs subvolume, the cross-fs mount will never work. The generated unit checks that the Podman tree and container storage exist, but it doesn't check or trigger mounting of the subvolume, nor validate that the user's data are on the same FS as the container storage. The same problem appears in the anonymous volume scenario - nothing enforces that everything is on the same volume. storage.conf allows everything. The WSL VM with its single ext4 filesystem works perfectly: no btrfs, and host folders are mounted using an MS-specific mechanism. How should bind mounts work on BTRFS and subvolumes?
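
One way to test that theory (a sketch; the storage path assumes the default rootless location):

```console
# Which filesystem and device back the bind-volume source and the graph root?
findmnt -T ~/project
findmnt -T ~/.local/share/containers/storage
# List the BTRFS subvolumes under the home filesystem (needs btrfs-progs,
# and may require elevated privileges depending on the setup).
btrfs subvolume list ~
```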

@vrothberg
Member

> could you share the systemd service file generated by Podman?

@PavelSosin-320, please share the systemd service file.

@PavelSosin-320
Author

From the Podman on WSL instance with some comments and TODOs
PodmanUnit.txt

@vrothberg
Member

Thanks, can you also share the contents of run-r8df511c2cb034c33a1ec70d63b670529.service?

@PavelSosin-320
Author

Sorry for the long delay due to the holidays. I tried to run the theia image via systemd-run as a transient unit and the result is:

```console
[pavelsosin@Dell user]$ journalctl --unit run-u1276.service
-- Journal begins at Mon 2021-09-13 16:33:09 IDT, ends at Thu 2021-09-30 17:28:05 IDT. --
Sep 30 17:26:17 Dell systemd[1]: Started /usr/bin/podman run -d -p 3000:3000 -v /home/pavelsosin/project:/home/project docker.io/theiaide/theia.
Sep 30 17:26:17 Dell podman[384240]: 2021-09-30 17:26:17.896394707 +0300 IDT m=+0.044832762 image pull docker.io/theiaide/theia
Sep 30 17:26:18 Dell podman[384240]: 2021-09-30 17:26:18.338036042 +0300 IDT m=+0.486474147 container create ea48aca6642ccc7e87476b7611950d711efb28569efd8823a749e13a1d2f0c0e (image=docker.io/theiaide/theia:latest, name=determined_hertz)
Sep 30 17:26:19 Dell podman[384240]: 2021-09-30 17:26:19.592840034 +0300 IDT m=+1.741278171 container init ea48aca6642ccc7e87476b7611950d711efb28569efd8823a749e13a1d2f0c0e (image=docker.io/theiaide/theia:latest, name=determined_hertz)
Sep 30 17:26:19 Dell podman[384240]: 2021-09-30 17:26:19.937165559 +0300 IDT m=+2.085603734 container start ea48aca6642ccc7e87476b7611950d711efb28569efd8823a749e13a1d2f0c0e (image=docker.io/theiaide/theia:latest, name=determined_hertz)
Sep 30 17:26:19 Dell podman[384240]: ea48aca6642ccc7e87476b7611950d711efb28569efd8823a749e13a1d2f0c0e
Sep 30 17:26:20 Dell podman[384445]: 2021-09-30 17:26:20.229830119 +0300 IDT m=+0.072758624 container died ea48aca6642ccc7e87476b7611950d711efb28569efd8823a749e13a1d2f0c0e (image=docker.io/theiaide/theia:latest, name=determined_hertz)
Sep 30 17:26:22 Dell podman[384445]: 2021-09-30 17:26:22.572881081 +0300 IDT m=+2.415809589 container cleanup ea48aca6642ccc7e87476b7611950d711efb28569efd8823a749e13a1d2f0c0e (image=docker.io/theiaide/theia:latest, name=determined_hertz)
Sep 30 17:26:22 Dell systemd[1]: run-u1276.service: Deactivated successfully.
Sep 30 17:26:22 Dell systemd[1]: run-u1276.service: Consumed 1.134s CPU time
```

In other words, the bind mount succeeded. This is good news. The bad news is that, possibly, either podman or the storage driver looks for something in the unit environment that doesn't exist - "empty".
I also checked that no hidden dependency on a systemd fs* target affects the bind, because the AssertMount /home check passes. The BTRFS /home userspace FS is mounted, i.e. /home works as a BTRFS subvolume.
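
For reference, the transient unit above corresponds to an invocation of roughly this shape (reconstructed from the first journal line; the journal shows the system manager, systemd[1], so this was not run with --user):

```console
# Wrap the podman command in a transient systemd service unit.
systemd-run /usr/bin/podman run -d -p 3000:3000 \
    -v /home/pavelsosin/project:/home/project docker.io/theiaide/theia
```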

@PavelSosin-320
Author

PavelSosin-320 commented Oct 3, 2021

Since all possible scenarios that I tested, including image volumes and anonymous volumes, worked OK, I suppose the root cause is that Podman parses the -v option value exactly as described in the documentation:
"Any source that does not begin with a . or / will be treated as the name of a named volume. If a volume with that name does not exist, it will be created" - in a blind way. Systemd unit placeholders that start with % don't conform to this pattern and are not substituted like environment variables. A unit, unlike systemd-run, ignores the session environment - it runs without a shell. For example, the equivalent of $(PWD) is WorkingDirectory=.
And indeed it worked fine for me:

  1. I added a WorkingDirectory=%h directive
  2. And addressed the source directory of the bind volume via . (see the sketch below). I suppose the same pattern can designate a "project folder" with all necessary files to be passed to a container started as a service in combination with --new.
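
A sketch of that workaround in the unit file (hand-edited; podman resolves a volume source beginning with . against the working directory of the process):

```ini
[Service]
# %h expands to the user's home; the relative source below resolves against it.
WorkingDirectory=%h
ExecStart=/usr/bin/podman run --name theia -d -p 3000:3000 \
    -v ./project:/home/project docker.io/theiaide/theia
```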

@PavelSosin-320
Author

PavelSosin-320 commented Oct 10, 2021

@vwbusguy Unfortunately, now I can say definitely that the issue is related to the btrfs filesystem, because exactly the same syntax works correctly on the ext4 filesystem of the WSL Fedora VM instance but doesn't work on the Fedora 34 Desktop with its default btrfs FS. Since BTRFS has its own kernel module and mount utilities, it can conflict with the FUSE mount. In Docker's documentation this issue is addressed explicitly in the Docker doc on the BTRFS storage driver. Although podman info on Fedora reports that the backing FS is btrfs, all other configuration looks the same as on an ext4 FS.
Maybe it is the same as #4764, but in the case of creating the container from a unit generated with the --new option, only the "invalid reference" message bubbles to the surface, produces err 125, and stops unit execution.

@PavelSosin-320
Author

After eliminating BTRFS-related issues via Podman with BTRFS, I found a very simple thing: when Podman tries to create a container from a systemd unit run by a rootless user, it can't find the storage configuration, and that causes the "invalid reference" error. Testing via systemd-run results in:

```console
Oct 12 12:47:23 Dell systemd[1]: Started /usr/bin/podman container create --conmon-pidfile %t/container-theiaUnit.pid --cidfile %t/container-theiaUnit.ctr-id cgroups=no-conmon --replace -p >
Oct 12 12:47:23 Dell podman[21664]: time="2021-10-12T12:47:23+03:00" level=warning msg="Storage configuration is unset - using hardcoded default graph root "/var/lib/containers/storage""
Oct 12 12:47:23 Dell podman[21664]: time="2021-10-12T12:47:23+03:00" level=warning msg="Storage configuration is unset - using hardcoded default graph root "/var/lib/containers/storage""
Oct 12 12:47:23 Dell podman[21664]: time="2021-10-12T12:47:23+03:00" level=warning msg="Storage configuration is unset - using hardcoded default graph root "/var/lib/containers/storage""
Oct 12 12:47:23 Dell podman[21664]: Error: invalid reference format
Oct 12 12:47:23 Dell systemd[1]: run-u236.service: Main process exited, code=exited, status=125/n/a
Oct 12 12:47:23 Dell systemd[1]: run-u236.service: Failed with result 'exit-code'.
```

Even after importing HOME and XDG_RUNTIME_DIR into the user's systemd manager environment, the issue is not resolved.
Intensive use of environment variables in the configuration, as in storage.conf, is not safe, because the session environment doesn't exist in the systemd runtime - nothing is inherited automatically.
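
For reference, the import mentioned above was done along these lines (it did not resolve the issue here, but this is how the user manager's environment is inspected and extended):

```console
# Show the environment block the user manager passes to its units.
systemctl --user show-environment
# Import selected variables from the current login session into the manager.
systemctl --user import-environment HOME XDG_RUNTIME_DIR
```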

@PavelSosin-320
Author

I see the dependency on the btrfs storage driver: it creates every container as a subvolume (!) in $HOME/.local/share/containers/storage/btrfs/subvolumes/. So the real runroot for rootless containers has to be adjusted. It would be better to use %h in the runroot option, because HOME has to be imported into the systemd environment according to the systemctl --user value; it doesn't happen automatically.

@vwbusguy

Indeed, it does, and that's generally not a problem for btrfs, unless you want to try to fsck all of them at once for some reason. Otherwise, subvols in btrfs are cheap. But yeah, the problem is that systemd won't grok the default PID file location and will assume the container isn't healthy and running when it is, and will continuously restart it (depending on the container/service restart policy) after a minute or so. Oddly enough, just commenting out the PIDFile line in the service file seems to make it work just fine, but I haven't tried this with a bunch of different container services on one host.
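
For reference, the line in question in a generated service file looks roughly like this (a sketch; the actual pidfile path varies per container and podman version):

```ini
[Service]
Type=forking
# Commenting this out is the workaround described above:
#PIDFile=/run/user/1000/containers/overlay-containers/<container-id>/userdata/conmon.pid
```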

@PavelSosin-320
Author

Really sorry, colleagues! But I found that the container-storage BTRFS driver code intensively uses the OS home to create the path to the individual container's subvolume. Unfortunately, a container created as part of a generated systemd unit can't use the HOME env variable safely, because the service that invokes podman run, podman container create, etc. doesn't have a session environment and runs "homeless". Only units that create nothing in the storage can work safely. The session environment is created by pam or systemd generators, with no synchronization with the unit execution. The same is true for the other XDG-based env variables: XDG_HOME, CONFIG_HOME, DATA_HOME, etc. The %h placeholder doesn't mean that HOME exists in the environment; everything is created during session creation. A systemd service needs a statically created environment that comes from the unit .service file, an environment file, or a unit config per user/service.

@PavelSosin-320
Author

@vwbusguy I don't think that IO throughput is so critical in the development environment - the niche of Podman. But btrfs has a lot of benefits when used by a rootless user due to the features it offers: isolation, data safety, and ease of maintenance. The volume's content is clearly visible without the irrelevant high/low/merged details, and a snapshot of the container's volume is ready for use when debugging failure states, without additional machinery.

@PavelSosin-320
Author

PavelSosin-320 commented Oct 18, 2021

Some advance: after importing HOME and all CONTAINERS_* env variables into the systemd environment using systemctl --user import-environment, the "Invalid reference" message disappears, i.e. the storage driver works, but the return code is still 125. Podman info runs OK outside the unit context, but now container creation fails without any error message. Which data can I collect in such a situation?
PPS: podman info --debug runs in the systemd environment via systemd-run, but there is no sign of the storage configuration in its output! Is storage.conf ignored?

@vwbusguy

How would it know to use btrfs driver vs overlay if storage.conf is ignored?

@PavelSosin-320
Author

To eliminate the "Invalid reference" message, after learning lessons from running podman container create via systemd-run, I:

  1. Created my own systemd-env.conf file containing the HOME definition for my user in $XDG_CONFIG_HOME/containers
  2. Added EnvironmentFile=/home/.. systemd-env.conf and PassEnvironment=HOME to my unit file (see the sketch below)

and "invalid reference" disappeared as expected. 😁 But error 125 is returned by Podman anyway. At least the hope that Podman can use environment variables when managed by systemd units is alive.
Maybe using the unit's config files created via systemctl --user edit is more convenient.
The Fedora documentation explicitly states that the HOME environment variable is created during login, i.e. it is not automatically available in units.
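
A sketch of that setup (the file path and HOME value here are illustrative; the truncated path in step 2 is left as reported above):

```ini
# $XDG_CONFIG_HOME/containers/systemd-env.conf (step 1, hypothetical contents):
#   HOME=/home/<user>
#
# Additions to the unit's [Service] section (step 2):
[Service]
EnvironmentFile=%h/.config/containers/systemd-env.conf
PassEnvironment=HOME
```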

@PavelSosin-320
Author

PavelSosin-320 commented Oct 24, 2021

Hint: something went wrong in Docker's BTRFS driver too: moby/moby#42253. It is interesting what https://github.com/AkihiroSuda did there. Podman only describes the failure in the wrong way. Indeed, some operations like subvolume create and show don't need root privileges, but in some cases the /home subvolume is not accessible to the rootless user. Simple ls, read, and write into the /home..... subvolume as a rootless user work without mount? Does mount fail without FUSE outside the user session? Maybe the mount namespaces of conmon and the one created by systemd conflict? A systemd unit created for a pod, without conmon and the --new option, works OK. The pod provides its own CG as a parent for the inner containers.

@PavelSosin-320
Author

Finally, I suppose that I hit #4678. This is a 1.5-year-old issue without a solution; only a workaround was proposed. But it looks very similar to the zombie Podman REST API server process issue at startup. Systemd can't tolerate unorganized packs of processes - otherwise a systemd-based system would fill up with zombie processes and leaked CGs. Systemd tends to organize groups of processes, and if a long-living CG is needed, a Podman scope under the user.slice managed by logind can be used. It creates CGs with predictable names. I played with it to get rid of the zombie REST API server process and it worked well - everything that belongs to the scope disappears.
And, BINGO - it works!
When I run podman container create using systemd-run with the --scope option, everything works!
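
A sketch of the working invocation (assuming the rootless user manager; the container name and arguments are placeholders):

```console
# --scope wraps the command in a transient scope unit instead of a service,
# so the processes land in a cgroup with a predictable name.
systemd-run --user --scope /usr/bin/podman container create --name theiaUnit docker.io/theiaide/theia
```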

@rhatdan
Member

rhatdan commented Oct 25, 2021

Please open a PR to fix this in containers/storage.

@PavelSosin-320
Author

@rhatdan The scope creation is purely a crun duty. I don't think that "external" transient scope creation using systemd-run can be used in production. I just upgraded crun to the recent version 1.2 for Fedora 34 and will test it as soon as possible to be sure that it works correctly. But meanwhile, can somebody from the Podman team check that Podman invokes crun correctly with the --rootless and --systemd-cgroup option values, and then processes the exit code properly?

@PavelSosin-320
Author

Crun has been tested and exposes the same very old cgroup-manager issue for rootless users: Podman has to follow the containers.conf configuration and use systemd as the cgroup manager for root and rootless users. To manage a container running as a systemd service, a systemd unit of kind scope - created either through the systemd manager DBUS API or through manually executed systemd-run - is absolutely necessary. busctl works for rootless users; every user has their own bus socket, so there is no reason to suspect that the API has some additional restrictions.

```console
$ busctl
3015 systemd pavelsosin :1.324 user@1000.service
$ ls -al $XDG_RUNTIME_DIR/bus
srw-rw-rw-. 1 pavelsosin pavelsosin 0 Oct 14 11:34 /run/user/1000/bus
```

This issue opencontainers/runc#2163 was raised, closed, and now it comes back as a systemd unit execution showstopper. Has PR 2281 reached Fedora 34?
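
For reference, the cgroup manager Podman uses is selected in containers.conf; a minimal sketch of the relevant setting (shown here as a rootless per-user override):

```toml
# ~/.config/containers/containers.conf
[engine]
# Create cgroups through systemd, i.e. via the user bus shown above.
cgroup_manager = "systemd"
```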

@giuseppe
Member

are you using BTRFS as the storage backend or is the storage configured to use overlay?

@PavelSosin-320
Author

@giuseppe 1. Crun is not guilty! I reverted the Podman configuration to the old runc and got the same result - error 125/n/a.
I'm afraid that the CG path passed to the OCI runtime is wrong: when a user's unit-managed service is started, the part of the cg tree above the container's scope is defined by explicitly declared unit dependencies. PAM and all the slice and service CGs triggered by PAM may not exist when runc or crun tries to create the container. Unfortunately, no crun options work for me and I can't test it by passing a CG created by a slice unit.

@giuseppe
Member

I am curious to know if this works when using overlay instead of btrfs.

Have you tried changing the storage driver?

@vwbusguy

I've also had this happen with overlay.

```toml
driver = "overlay"
runroot = "/run/user/1000/containers"
graphroot = "/home/scott/.local/share/containers/storage"
[storage.options]
  size = ""
  remap-uids = ""
  remap-gids = ""
  ignore_chown_errors = ""
  remap-user = ""
  remap-group = ""
  skip_mount_home = ""
  mount_program = "/usr/bin/fuse-overlayfs"
```

@PavelSosin-320
Author

I experienced some strange adverse effects after a Fedora update brought a systemd and DBus upgrade along with their utilities. They expect some environment variables and access rights in the scope of a systemd service:

XDG_RUNTIME_DIR=/run/user/1000
DBUS_SESSION_BUS_ADDRESS=unix:path=/run/user/1000/bus

These became an organic part of the rootless systemd environment. If Podman or the OCI runtime uses systemd to manage control groups, they have to obtain and respect them. The machine concept and machine slice were introduced too, along with a new control groups interface - relevant if Podman wants to adopt the Kata OCI runtime, as the current containers.conf content suggests to me. Please send me the link to the OCI invocation module source and I am eager to do a "code review".

@PavelSosin-320
Author

PavelSosin-320 commented Nov 7, 2021

Playing with runc vs crun I found that runc has a strong requirement that the RunRoot where the bundle is stored must be on tmpfs, i.e. /run/user/... But Fedora (and the possible future WSL Fedora distro based on the WinBTRFS driver) boxes users inside the distro's /home FS, which is always BTRFS. Does somebody know how to bind mount across different FS types?
On my Fedora Desktop it is done via a third FS - gvfs, a real omnivore: it mounts my cloud storage into my home area:

```console
gvfsd-fuse on /run/user/1000/gvfs type fuse.gvfsd-fuse (rw,nosuid,nodev,relatime,user_id=1000,group_id=1000)
ls -al /run/user/1000/gvfs
dr-x------. 1 pavelsosin pavelsosin 0 Jan 1 1970 'google-drive:host=gmail.com,user=pavel.sosin'
```

Even if it is hard to believe that a direct call to mount will succeed, I'm wondering why my /etc/fstab has no tmpfs entry that could be marked with the users option and would allow mounting /run/user/ to the rootless home subvolume. I suppose that tmpfs is mounted by a certain systemd unit of type .mount run by the systemd user manager and made available via the PrivateTmp= directive, as described in the systemd docs. Otherwise, how do rootless users have tmp dirs?
Indeed, it doesn't?
Instead I have:

```console
lsblk -e7
NAME   MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sda      8:0    0 931.5G  0 disk
├─sda1   8:1    0   600M  0 part /boot/efi
├─sda2   8:2    0     1G  0 part /boot
└─sda3   8:3    0 929.9G  0 part /home/pavelsosin/lib/containers/storage/btrfs
zram0  252:0    0   7.6G  0 disk [SWAP]
```

Is the sda3 mountpoint a visual aberration or something else? Who created the mount point for the block device in ~/lib? Is ~/.local/share (XDG_DATA_HOME) expected? Systemd has nothing in its environment pointing to this path that could be exported to the services.

@containers containers locked and limited conversation to collaborators Nov 10, 2021

