Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix VMA handling for shmem maps #2022

Merged

Conversation

ymanton
Copy link
Contributor

@ymanton ymanton commented Dec 12, 2022

Patch originally written by @imachug to get httpd restored inside a container in unprivileged mode.

Fixes #2019.

@ymanton ymanton force-pushed the unprivileged-fixes-for-release branch from 8ea5bbe to 8e4550c Compare December 12, 2022 23:05
@ymanton ymanton changed the title Unprivileged fixes for release Fix VMA handling for shmem maps Dec 12, 2022
@ymanton
Copy link
Contributor Author

ymanton commented Dec 14, 2022

@imachug, can you take a look at these patches and confirm they work as expected on top of current criu-dev? Are you OK with the co-authored-by tag as it is?

Also, can you suggest how test cases might be written?

@purplesyringa
Copy link
Contributor

Hmm, I don't think this handles the case I described here? I'm a bit unsure if it's really a problem because I was thinking about the problem from a theoretical point of view, but do my concerns about this seem reasonable to you?

@ymanton
Copy link
Contributor Author

ymanton commented Dec 15, 2022

Hmm, I don't think this handles the case I described here? I'm a bit unsure if it's really a problem because I was thinking about the problem from a theoretical point of view, but do my concerns about this seem reasonable to you?

Sorry, we may be misunderstanding each other. I meant does current criu-dev plus the patches here allow you to restore a containerized httpd via Podman as you mentioned in #1930? The problem of how to handle deleted files with only partial mappings across one or more processes can be discussed in #2019, but since it's currently a theoretical problem and there are no patches I don't think it should hold up the next CRIU release.

@purplesyringa
Copy link
Contributor

purplesyringa commented Dec 15, 2022

Yes, it works. I'd rather you added a note that --unprivileged behavior wrt. shared memory is known to be broken, and if something seems to work, it might only be by chance or involve UB.

Co-Authored-By is fine by me. I don't think I can give an advice on the test cases except for running the whole hole test suite in podman and disabling the tests that seem to break due to permission errors?

@ymanton ymanton force-pushed the unprivileged-fixes-for-release branch from 8e4550c to f6546e3 Compare December 16, 2022 16:56
@ymanton
Copy link
Contributor Author

ymanton commented Dec 16, 2022

Yes, it works. I'd rather you added a note that --unprivileged behavior wrt. shared memory is known to be broken, and if something seems to work, it might only be by chance or involve UB.

I've added the following text to the man-page in the unprivileged mode section:

Note that for some operations, having a capability in a namespace other than
the init namespace (i.e. the default/root namespace) is not sufficient. For
example, in order to read symlinks in proc/[pid]/map_files CRIU requires
CAP_CHECKPOINT_RESTORE in the init namespace; having CAP_CHECKPOINT_RESTORE
while running in another user namespace (e.g. in a container) does not allow
CRIU to read symlinks in /proc/[pid]/map_files.

Without access to /proc/[pid]/map_files checkpointing/restoring processes
that have mapped deleted files may not be possible.

@ymanton ymanton marked this pull request as ready for review December 16, 2022 16:58
@purplesyringa
Copy link
Contributor

purplesyringa commented Dec 16, 2022

Then maybe let's add a small check to fail the dump if the mmap'ed offset of a shared memory region is not 0? This would allow us to say "If it works, it works" instead of "It might or might not work and you'll never know which one it is"

@codecov-commenter
Copy link

codecov-commenter commented Dec 16, 2022

Codecov Report

Base: 70.73% // Head: 70.68% // Decreases project coverage by -0.05% ⚠️

Coverage data is based on head (f7bb0b1) compared to base (0965eab).
Patch coverage: 13.33% of modified lines in pull request are covered.

❗ Current head f7bb0b1 differs from pull request most recent head 388ad7c. Consider uploading reports for the commit 388ad7c to get more accurate results

📣 This organization is not using Codecov’s GitHub App Integration. We recommend you install it so Codecov can continue to function properly for your repositories. Learn more

Additional details and impacted files
@@             Coverage Diff              @@
##           criu-dev    #2022      +/-   ##
============================================
- Coverage     70.73%   70.68%   -0.05%     
============================================
  Files           132      132              
  Lines         33120    33137      +17     
============================================
- Hits          23428    23424       -4     
- Misses         9692     9713      +21     
Impacted Files Coverage Δ
criu/proc_parse.c 68.57% <0.00%> (-0.57%) ⬇️
criu/shmem.c 51.89% <36.36%> (-0.27%) ⬇️
criu/arch/x86/crtools.c 64.15% <0.00%> (-4.05%) ⬇️
criu/seize.c 59.83% <0.00%> (-0.21%) ⬇️
criu/arch/x86/cpu.c 74.65% <0.00%> (-0.18%) ⬇️
criu/cr-restore.c 67.44% <0.00%> (+0.18%) ⬆️
criu/uffd.c 79.36% <0.00%> (+0.47%) ⬆️

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report at Codecov.
📢 Do you have feedback about the report comment? Let us know in this issue.

criu/shmem.c Outdated Show resolved Hide resolved
criu/shmem.c Outdated Show resolved Hide resolved
@avagin
Copy link
Member

avagin commented Dec 23, 2022

I think our CI has to run zdtm.py from a sub-user-namespace. You can try to do something like this:

$ unshare -Ucfp python  test/zdtm.py run --rootless -f h -t zdtm/static/maps00

@ymanton ymanton force-pushed the unprivileged-fixes-for-release branch from f6546e3 to 0236c51 Compare January 9, 2023 18:36
@ymanton
Copy link
Contributor Author

ymanton commented Jan 13, 2023

I think our CI has to run zdtm.py from a sub-user-namespace. You can try to do something like this:

$ unshare -Ucfp python  test/zdtm.py run --rootless -f h -t zdtm/static/maps00

Thanks. Needs --mount-proc so that ZDTM can see the right PIDs, but I had to run an interactive shell via unshare ... bash and then run the test. Running it directly as above or even unshare ... bash -c "..." leaves the test as a zombie after it's been dumped. Output looks like:

...
+ 368:     return flag in tdesc.get('flags', '').split()
+2399:         for t in torun:
+2483:         launcher.finish()
+2194:         self.__wait_all()
+2177:         while self.__subs:
+2178:             self.__wait_one(0)
+2111:         pid = -1
+2112:         status = -1
+2113:         signal.alarm(10)
+2114:         while True:
+2115:             try:
+2116:                 pid, status = os.waitpid(0, flags)
========================= Run zdtm/static/maps00 in h ==========================
Start test
./maps00 --pidfile=maps00.pid --outfile=maps00.out --filename=maps00.test
Run criu dump
Wait for zdtm/static/maps00(56) to die for 0.100000
Wait for zdtm/static/maps00(56) to die for 0.200000
Wait for zdtm/static/maps00(56) to die for 0.400000
Wait for zdtm/static/maps00(56) to die for 0.800000
Wait for zdtm/static/maps00(56) to die for 1.600000
Wait for zdtm/static/maps00(56) to die for 3.200000
Wait for zdtm/static/maps00(56) to die for 6.400000
+  48:     print("==== ALARM ====")
==== ALARM ====
Wait for zdtm/static/maps00(56) to die for 12.800000
Wait for zdtm/static/maps00(56) to die for 25.600000
    PID TTY          TIME CMD
     56 ?        00:00:00 maps00 <defunct>
    PID TTY      STAT   TIME COMMAND
      1 pts/3    S+     0:00 python test/zdtm.py --debug run --ignore-taint --rootless -f h -t zdtm/static/maps00
      4 pts/3    S+     0:00 ./zdtm_ct zdtm.py
      5 pts/3    S+     0:00  \_ python zdtm.py
      6 pts/3    S+     0:00      \_ python zdtm.py
     60 pts/3    R+     0:00          \_ ps axf 56
     56 ?        Zs     0:00 [maps00] <defunct>
############ Test zdtm/static/maps00 FAIL at zdtm/static/maps00 die ############
Test output: ================================
11:00:29.515:    56: map: ptr 0x7f2dcca5f000 flag        2 prot        0
11:00:29.515:    56: map: ptr 0x7f2dcca5d000 flag        1 prot        0
11:00:29.515:    56: map: ptr 0x7f2dcca5b000 flag       22 prot        0
11:00:29.516:    56: map: ptr 0x7f2dcca59000 flag       21 prot        0

...

Traceback (most recent call last):
  File "/home/ymanton/criu/test/zdtm.py", line 1932, in do_run_test
    cr(cr_api, t, opts)
  File "/home/ymanton/criu/test/zdtm.py", line 1603, in cr
    test.gone()
  File "/home/ymanton/criu/test/zdtm.py", line 597, in gone
    self.__wait_task_die()
  File "/home/ymanton/criu/test/zdtm.py", line 442, in __wait_task_die
    wait_pid_die(int(self.__pid), self.__name, self.__timeout)
  File "/home/ymanton/criu/test/zdtm.py", line 364, in wait_pid_die
    raise test_fail_exc("%s die" % who)
__main__.test_fail_exc: zdtm/static/maps00 die
...

Not clear to me why the dumped process is being cleaned up under an interactive shell but not otherwise... but at least it effectively tests the patches here. Fails as expected on criu-dev, passes with these patches.

@ymanton ymanton force-pushed the unprivileged-fixes-for-release branch from 0236c51 to adb39ac Compare January 16, 2023 18:42
@ymanton
Copy link
Contributor Author

ymanton commented Jan 16, 2023

Then maybe let's add a small check to fail the dump if the mmap'ed offset of a shared memory region is not 0? This would allow us to say "If it works, it works" instead of "It might or might not work and you'll never know which one it is"

@imachug, I'm not clear on this. A non-zero offset for anonymous shared mappings is not allowed; do you mean we should check this for shared mappings backed by deleted files? If so, the current code will not fall back to /proc/$pid/mem, it will fail as follows:

b'(00.008055) Failed to open map_files/7fe73250b000-7fe73250d000, try to go via [/home/ymanton/criu/test/zdtm/static/maps00.test-07 (deleted)] path'
b"(00.008080) Error (criu/proc_parse.c:340): Can't open mapped [/home/ymanton/criu/test/zdtm/static/maps00.test-07 (deleted)]: No such file or directory"
b"(00.008087) Error (criu/proc_parse.c:667): Can't open 642's mapfile link 7fe73250b000: No such file or directory"
b'(00.008097) Error (criu/cr-dump.c:1564): Collect mappings (pid: 642) failed with -1'

@purplesyringa
Copy link
Contributor

Yes, that was what I deemed the problem. I think you're right. Sorry for confusion. Deleted mappings should perhaps be addressed in a separate PR, if possible. I'm a bit iffy on trying to open a path ending with (deleted), as that might lead to a bug if such file does exist, but that's a totally unrelated problem, and it might have been addressed already.

@avagin
Copy link
Member

avagin commented Jan 31, 2023

I have merged this pr, so let's do fixes on top.

criu/shmem.c Outdated Show resolved Hide resolved
criu/shmem.c Outdated Show resolved Hide resolved
criu/shmem.c Outdated Show resolved Hide resolved
@avagin
Copy link
Member

avagin commented Jan 31, 2023

Thanks. Needs --mount-proc so that ZDTM can see the right PIDs, but I had to run an interactive shell via unshare ... bash and then run the test. Running it directly as above or even unshare ... bash -c "..." leaves the test as a zombie after it's been dumped.

@ymanton I think the next command has to work:

unshare -Ucfpm --mount-proc  bash -c "python  test/zdtm.py run -t zdtm/static/env00 --ignore-taint --rootless -f h && echo PASSED"

The echo after zdtm doesn't allow bach to exec the last command and so it continues collecting zombies like init.

@ymanton ymanton force-pushed the unprivileged-fixes-for-release branch 2 times, most recently from e885358 to 9c1f939 Compare February 3, 2023 04:38
@avagin
Copy link
Member

avagin commented Feb 8, 2023

+ ssh default 'cd /vagrant/criu; unshare -Ucfpm --mount-proc bash -c "./test/zdtm.py run -t zdtm/static/maps00 -f h --rootless && true"'
userns is supported
========================= Run zdtm/static/maps00 in h ==========================
Start test
./maps00 --pidfile=maps00.pid --outfile=maps00.out --filename=maps00.test
Run criu dump
=[log]=> dump/zdtm/static/maps00/70/1/dump.log
------------------------ grep Error ------------------------
b'(00.003926)'
b'(00.003928) Collecting mappings (pid: 70)'
b'(00.003930) ----------------------------------------'
b'(00.003972) Failed to open map_files/400000-402000, try to go via [/vagrant/criu/test/zdtm/static/maps00] path'
b'(00.003980) Error (criu/proc_parse.c:350): Failed to resolve mapping 400000 filename'
b"(00.003987) Error (criu/proc_parse.c:667): Can't open 70's mapfile link 400000: Operation not permitted"
b'(00.003991) Error (criu/cr-dump.c:1564): Collect mappings (pid: 70) failed with -1'
b'(00.004031) Unlock network'
b'(00.004037) Unfreezing tasks into 1'
b'(00.004039) \tUnseizing 70 into 1'
b'(00.004059) Error (criu/cr-dump.c:2099): Dumping FAILED.'
------------------------ ERROR OVER ------------------------
################## Test zdtm/static/maps00 FAIL at CRIU dump ###################

@ymanton ymanton force-pushed the unprivileged-fixes-for-release branch 2 times, most recently from 4ec4018 to bce7510 Compare February 8, 2023 20:45
@avagin
Copy link
Member

avagin commented Feb 8, 2023

@ymanton could you try out this patch:

diff --git a/scripts/build/Dockerfile.alpine b/scripts/build/Dockerfile.alpine
index eced46c22..19b08315f 100644
--- a/scripts/build/Dockerfile.alpine
+++ b/scripts/build/Dockerfile.alpine
@@ -21,7 +21,9 @@ RUN apk update && apk add \
        py3-pip \
        py3-protobuf \
        python3 \
-       sudo
+       sudo \
+       libcap-utils \
+       util-linux
 
 COPY . /criu
 WORKDIR /criu
diff --git a/scripts/ci/run-ci-tests.sh b/scripts/ci/run-ci-tests.sh
index 7b64c6b06..b00a30a72 100755
--- a/scripts/ci/run-ci-tests.sh
+++ b/scripts/ci/run-ci-tests.sh
@@ -289,6 +291,13 @@ ip net add test
 ./test/zdtm.py run -t zdtm/static/env00 -t zdtm/transition/fork -t zdtm/static/ghost_holes00 -t zdtm/static/socket-tcp -t zdtm/static/msgque -k always
 ./test/crit-recode.py
 
+# Rootless tests
+make -C test/zdtm/ cleanout
+rm -rf test/dump
+setcap cap_checkpoint_restore,cap_sys_ptrace+eip criu/criu
+sudo --user=#65534 --group=#65534 unshare -Ucfpm --mount-proc -- bash -c "./test/zdtm.py run -t zdtm/static/maps00 -f h --rootless && true"
+setcap -r criu/criu
+
 # more crit testing
 make -C test/others/crit run
 

@ymanton ymanton force-pushed the unprivileged-fixes-for-release branch from bce7510 to 89a672b Compare February 9, 2023 02:29
@ymanton
Copy link
Contributor Author

ymanton commented Feb 9, 2023

Thanks @avagin, it works in some cases, but it fails the same way as my previous attempt on Fedora. It fails on Ubuntu 20.04 because libcap doesn't know about cap_checkpoint_restore; but I guess we can add a capsh --supports=... check to deal with that sort of thing.

I don't have easy access to Fedora, but I can get on a RHEL machine. I'll try to reproduce it there and see what's going on.

@avagin
Copy link
Member

avagin commented Feb 9, 2023

It fails on Ubuntu 20.04 because libcap doesn't know about cap_checkpoint_restore;

#2061

@ymanton ymanton force-pushed the unprivileged-fixes-for-release branch from 89a672b to 0cf19fa Compare February 9, 2023 17:04
@ymanton
Copy link
Contributor Author

ymanton commented Feb 10, 2023

Works on RHEL 9 as well. I got on a Fedora 37 VM and finally it fails like above. The error is from

criu/criu/proc_parse.c

Lines 349 to 353 in 0965eab

if (vma->vmst->st_dev != vfi_dev || vma->vmst->st_ino != vfi->ino) {
pr_err("Failed to resolve mapping %lx filename\n", (unsigned long)vma->e->start);
close(fd);
return -1;
}

When it fails it seems to be because the device number we get from smaps doesn't match what we get from stat.

For failing cases:

# stat
  File: /home/ymanton/criu/test/zdtm/static/maps00
  Size: 120856          Blocks: 240        IO Block: 4096   regular file
Device: 0,36    Inode: 5107        Links: 1

# smaps
00400000-00402000 r--p 00000000 00:1c 5107                               /home/ymanton/criu/test/zdtm/static/maps00

In working environments it looks like:

# stat
  File: /home/ymanton/criu/test/zdtm/static/maps00
  Size: 116224          Blocks: 232        IO Block: 4096   regular file
Device: fd00h/64768d    Inode: 135918877   Links: 1

# smaps
559a2d8d5000-559a2d8d7000 r--p 00000000 fd:00 135918877                  /home/ymanton/criu/test/zdtm/static/maps00

I do not know why this is, but comments near the top of vma_get_mapfile_user seem to say that we are hoping smaps will have the correct info but maybe not.

If anyone thinks they know what is going on I would appreciate some info, otherwise I'll see if I can find an explanation.

@avagin
Copy link
Member

avagin commented Feb 10, 2023

@ymanton could you show /proc/self/mountinfo from this host? It can be a btrfs issue.

@ymanton
Copy link
Contributor Author

ymanton commented Feb 10, 2023

@avagin you're right, file is on btrfs.

[ymanton@localhost criu]$ cat /proc/self/mountinfo
...
58 1 0:28 /root / rw,relatime shared:1 - btrfs /dev/vda5 rw,seclabel,compress=zstd:1,space_cache=v2,subvolid=256,subvol=/root
...
89 58 0:28 /home /home rw,relatime shared:39 - btrfs /dev/vda5 rw,seclabel,compress=zstd:1,space_cache=v2,subvolid=257,subvol=/home
...

Looks like this is a known issue: https://criu.org/Filesystems_pecularities

Seems we should use something like phys_stat_dev_match() in the code above to compare the devices.

@ymanton ymanton force-pushed the unprivileged-fixes-for-release branch from 0cf19fa to 0640188 Compare February 13, 2023 18:41
@ymanton
Copy link
Contributor Author

ymanton commented Feb 13, 2023

I think I've fixed the btrfs issue, and it looks like the test passes elsewhere but Fedora and Cent Stream still fail on restore; clone3 -> EPERM. I disabled that path and tried again but writing to ns_last_pid also returns EPERM.

@ymanton
Copy link
Contributor Author

ymanton commented Feb 14, 2023

I think I've fixed the btrfs issue, and it looks like the test passes elsewhere but Fedora and Cent Stream still fail on restore; clone3 -> EPERM. I disabled that path and tried again but writing to ns_last_pid also returns EPERM.

@adrianreber, any idea about the above? It works as expected when I run in unprivileged mode without using unshare, but for some reason cap_checkpoint_restore inside a namespace doesn't work as expected, and only on Fedora/Cent Stream.

@adrianreber
Copy link
Member

Is it selinux related? Can you switch to permissive to test it?

@ymanton ymanton force-pushed the unprivileged-fixes-for-release branch from 0640188 to 388ad7c Compare February 14, 2023 16:32
@ymanton
Copy link
Contributor Author

ymanton commented Feb 14, 2023

Yeah that was it, works in Permissive mode, thanks.

@adrianreber
Copy link
Member

Yeah that was it, works in Permissive mode, thanks.

You could report it here: https://github.com/fedora-selinux/selinux-policy

Ask to have the same access to clone3 and ns_last_pid in a user namespace.

Signed-off-by: Younes Manton <[email protected]>
@ymanton ymanton force-pushed the unprivileged-fixes-for-release branch from ec01921 to 9e0d819 Compare February 14, 2023 18:25
If we don't have access to map_files and instead have to get the data
from /proc/$pid/mem we can close and reset the fd before passing it to
do_dump_one_shmem() which can then check it before trying to seek past
holes, eliminating the need for a separate seek_data_supported boolean.

Signed-off-by: Younes Manton <[email protected]>
@ymanton ymanton force-pushed the unprivileged-fixes-for-release branch from 9e0d819 to 0ae799a Compare February 14, 2023 18:31
CAP_CHECKPOINT_RESTORE does not give access to /proc/$pid/map_files in
user namespaces. In order to test that CRIU in unprivileged mode can
dump and restore anonymous shared memory pages we will run the maps00
tests in a user namespace.

Signed-off-by: Younes Manton <[email protected]>
If we can't access a map_files entry directly and instead have to follow
the link and access the file via a filesystem path we need to properly
deal with files on btrfs subvolumes.

Signed-off-by: Younes Manton <[email protected]>
@ymanton ymanton force-pushed the unprivileged-fixes-for-release branch from 0ae799a to 66f6ac4 Compare February 14, 2023 21:25
@ymanton
Copy link
Contributor Author

ymanton commented Feb 14, 2023

Finally, all green.

@avagin avagin merged commit eff3c2a into checkpoint-restore:criu-dev Feb 17, 2023
@avagin
Copy link
Member

avagin commented Feb 17, 2023

Merged. Thanks a lot!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Changes that are required to C/R httpd in podman
5 participants