Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow CRIU to support restoring into an existing PID namespace #1056

Merged
merged 5 commits into from
Jul 8, 2020

Conversation

adrianreber
Copy link
Member

Trying to checkpoint one container of a pod in CRI-O fails with

Error (criu/namespaces.c:1081): Can't dump a pid namespace without the process init

because, as far as I understand it, pods can share namespaces. It seems the PID namespace if controlled by the initial container in that pod which runs a pause process.

To enable checkpointing containers out of a pod this PR enables CRIU to restore out of PID namespace and into another namespace. I can restore process from the host PID namespace in another PID namespace. Out of PID namespace into the host PID namespace. All directions are working.

Restoring into an existing PID namespace comes with possibility of PID collisions.

Using clone3() the whole thing is pretty simple: setns(); clone3();

Not using clone3() makes it much more complicated as it is necessary to write to ns_last_pid of the destination PID namespace and thus a helper is required: setns(), fork(), open('ns_last_pid'), write(), close(), waitpid().

This seems to work so far, all tests are happy. After implementing it I am not sure it makes sense to checkpoint single containers out of a pod or if it would make more sense to checkpoint the complete pod. Not sure.

test/pki/cacert.pem Outdated Show resolved Hide resolved
Copy link
Member

@0x7f454c46 0x7f454c46 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work!
Sorry about late-reviewing, sitting under lockdown a bit affects my productivity :-/

syntax = "proto2";

message pidns_entry {
optional string ext_key = 1;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should it be required?
Empty file without entries looks a bit strange, but I don't insist.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was also thinking about this. Not sure. I was looking at the external network namespace support implementation and that uses optional. With PID namespace support it is however the only entry in the protobuf definition. Having the only field as optional is kind of strange.

So, yes, required would be probably a good idea right now as the file will only exist if ext_key is set. If we ever get another entry here then it would be better to be optional.

Interestingly I just read that the proto3 format has dropped required and optional completely.

I will change it to required if we think that this is better. I am unsure myself what do to with it.

criu/cr-restore.c Show resolved Hide resolved
criu/cr-restore.c Show resolved Hide resolved
criu/cr-restore.c Show resolved Hide resolved
criu/cr-restore.c Show resolved Hide resolved
criu/cr-restore.c Show resolved Hide resolved
Documentation/criu.txt Show resolved Hide resolved
test/others/ns_ext/run.sh Show resolved Hide resolved
criu/cr-restore.c Show resolved Hide resolved
criu/cr-restore.c Show resolved Hide resolved
criu/cr-restore.c Show resolved Hide resolved
@adrianreber
Copy link
Member Author

Thanks for the reviews everyone. I will update this PR.

@avagin Can you create a new magic for me?

@adrianreber
Copy link
Member Author

I was able to address most things which came up during the review:

  • more comments in the test script
  • switched to call_in_child_process instead of doing fork manually
  • renamed function to set_next_pid
  • replaced BUG macros with error messages

If it should be called pid or pidns when using this via --external is still open, but because the network namespace support also only uses net and not netns I am 👍 for pid even if pidns is clearer.

I am also unsure if the protobuf field should be required or optional.

I also need someone to create a magic value which fits the existing scheme.

@adrianreber adrianreber force-pushed the pidns branch 4 times, most recently from 003e532 to 82b2687 Compare June 16, 2020 05:54
@adrianreber
Copy link
Member Author

Tests are happy right now, but sometimes I see errors like this (centos):

+ ../../../criu/criu dump -v4 -t 31156 -o dump.log -D images --external 'pid[4026532585]:test_ns' --external 'pid[4026532585]:test_ns2'
+ RESULT=1
+ cat images/dump.log
+ grep -B 5 Error
(00.004863) Dumping task cwd id 0x6 root id 0x7
(00.004909) Dumping file-locks
(00.004915) 
(00.004916) Dumping pstree (pid: 31156)
(00.004918) ----------------------------------------
(00.004919) Error (criu/pstree.c:297): The root process 8 is not a session leader. Consider using --shell-job option
(00.007373) Unlock network
(00.007383) Unfreezing tasks into 1
(00.007386) 	Unseizing 31156 into 1
(00.007401) Error (criu/cr-dump.c:1775): Dumping FAILED.
+ '[' 1 '!=' 0 ']'
+ echo 'CRIU dump failed'
CRIU dump failed

or on alpine

+ ../../../criu/criu dump -v4 -t 1216 -o dump.log -D images --external 'pid[4026532743]:test_ns' --external 'pid[4026532743]:test_ns2'
+ RESULT=1
+ + cat images/dump.log
grep -B 5 Error
(00.003015) Wait for ack 70 on daemon socket
pie: 7: __sent ack msg: 70 70 0
pie: 7: Daemon waits for command
(00.003029) Fetched ack: 70 70 0
(00.003033) sid=0 pgid=0 pid=7
(00.003037) Error (criu/cr-dump.c:1332): A session leader of 1216(7) is outside of its pid namespace
--
(00.003425) 1216 (native) is going to execute the syscall 15, required is 15
(00.003442) 1216 was stopped
(00.003702) Unlock network
(00.003711) Unfreezing tasks into 1
(00.003714) 	Unseizing 1216 into 1
(00.003730) Error (criu/cr-dump.c:1775): Dumping FAILED.
+ '[' 1 '!=' 0 ']'
+ echo 'CRIU dump failed'
CRIU dump failed
+ echo FAIL
FAIL

Not sure why this happens. Is this somehow racy? But it is during dump which did not really change in this PR. Any ideas recommendations if my test setup is wrong?

@Snorch
Copy link
Member

Snorch commented Jun 17, 2020

Maybe the difference is in how you run your root process, setsid wrapper might help.

@adrianreber
Copy link
Member Author

Maybe the difference is in how you run your root process, setsid wrapper might help.

Could you give some more details what you mean. I am already using setsid and I also tried it with a setsid before the unshare. The problem is that I only see it sometimes on alpine or centos. Non of the other distribution have failed so far.

@Snorch
Copy link
Member

Snorch commented Jun 18, 2020

Error (criu/cr-dump.c:1332): A session leader of 1216(7) is outside of its pid namespace

It means that your root task with pid [1216, 7] (1216 in criu pidns and 7 in it's own pidns) has sid [smth, 0], which means that session leader is from lower pid namespace, likely from criu one or even lower.

Your task [1216, 7] should do setsid before exec and have sid [1216, 7], so that where will be no problem anymore. You are obviousely using setsid somehow wrong as your root task is not a session leader.

TODO: create correct magic

Signed-off-by: Adrian Reber <[email protected]>
This loads and stores the key for an external PID namespace if specified
by the user using: --external pid[<inode>]:<label>

Preparation for restoring into existing PID namespaces.

Signed-off-by: Adrian Reber <[email protected]>
This allows CRIU to restore a process into an existing PID namespace.
During checkpointing the PID namespace can be marked as external using:

   --external pid[<inode>]:<label>

This can be the host PID namespace or any other PID namespace.

During restore a process can be restored into an existing PID namespace
using:

  --inherit-fd fd[<FD>]:<label>

The <label> has to be the same for checkpoint and restore. CRIU uses
the <label> to know which resource the user means.

A process can start in the host PID namespace and can be moved to
another namespace or the other way around. Any PID namespace can be
used.

This is necessary to checkpoint containers in a POD which share certain
namespaces and this code to support external PID namespaces is the first
step towards checkpointing and restoring containers which belong to a
POD.

This is not using the --join-ns functionality as it is not possible to
move existing process to a PID namespace using setns(). Only child
process will be created in the PID namespace and that is why setns()
needs to be called before clone().

Signed-off-by: Adrian Reber <[email protected]>
Adapt netns_ext tests to also work with pid namespaces and move it from
test/others/netns_ext/ to test/others/ns_ext/.

Also enable ns_ext tests in Travis runs.

Signed-off-by: Adrian Reber <[email protected]>
@adrianreber
Copy link
Member Author

Are there any further comments regarding required or optional in the protobuf file?

@adrianreber
Copy link
Member Author

Another ping. The only open point so far is optional or required.

ino2=$(ls -iL $MNT2 | awk '{ print $1 }')
exec 33< $MNT1
exec 34< $MNT2
$CRIU dump -v4 -t $pid -o dump.log -D images --external $NS[$ino]:test_ns --external $NS[$ino2]:test_ns2
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in the case of pidns, ino and ino2 will be equal to each other?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And I think you don't dump pid2, do you? It would be nice to have some comments in this script.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I hope you are aware that you wrote that script initially before I extended it to also test restoring into an existing PID namespace 😉

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very often, I write fewer comments than I have to;).

elif [[ "$NS" == "pid" ]]; then
RND1=$RANDOM
RND2=$RANDOM
setsid unshare -p -f setsid bash -c "setsid sh _run.sh pidfile2 $RND2 & . _run.sh pidfile $RND1" < /dev/zero &> output &
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be nice to have a comment which explains these three setsid-s

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know. Only with asetsid at all three places it succeeds with Ubuntu, CentOS and Alpine. I do not understand this part. It seems, especially the alpine busybox behaves differently.

@avagin avagin merged commit 620668e into checkpoint-restore:criu-dev Jul 8, 2020
adrianreber added a commit to adrianreber/criu that referenced this pull request Jul 17, 2020
I pushed the wrong branch to the pidns PR

checkpoint-restore#1056

which resulted in the wrong patches getting merged.

This is the actual result from the review.

Signed-off-by: Adrian Reber <[email protected]>
@adrianreber adrianreber mentioned this pull request Jul 17, 2020
adrianreber added a commit to adrianreber/criu that referenced this pull request Jul 17, 2020
I pushed the wrong branch to the pidns PR

checkpoint-restore#1056

which resulted in the wrong patches getting merged.

This is the actual result from the review.

Signed-off-by: Adrian Reber <[email protected]>
adrianreber added a commit to adrianreber/criu that referenced this pull request Jul 21, 2020
I pushed the wrong branch to the pidns PR

checkpoint-restore#1056

which resulted in the wrong patches getting merged.

This is the actual result from the review.

Signed-off-by: Adrian Reber <[email protected]>
avagin pushed a commit that referenced this pull request Jul 23, 2020
I pushed the wrong branch to the pidns PR

#1056

which resulted in the wrong patches getting merged.

This is the actual result from the review.

Signed-off-by: Adrian Reber <[email protected]>
avagin pushed a commit to avagin/criu that referenced this pull request Oct 1, 2020
I pushed the wrong branch to the pidns PR

checkpoint-restore#1056

which resulted in the wrong patches getting merged.

This is the actual result from the review.

Signed-off-by: Adrian Reber <[email protected]>
avagin pushed a commit to avagin/criu that referenced this pull request Oct 12, 2020
I pushed the wrong branch to the pidns PR

checkpoint-restore#1056

which resulted in the wrong patches getting merged.

This is the actual result from the review.

Signed-off-by: Adrian Reber <[email protected]>
avagin pushed a commit that referenced this pull request Oct 20, 2020
I pushed the wrong branch to the pidns PR

#1056

which resulted in the wrong patches getting merged.

This is the actual result from the review.

Signed-off-by: Adrian Reber <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants