flakey/broken detection of criu version #18856

Open
edsantiago opened this issue Jun 12, 2023 · 10 comments
Labels: flakes (Flakes from Continuous Integration), stale-issue

Comments

@edsantiago
Member

Bizarre new flake:

[It] podman checkpoint latest running container
...
# podman-remote [options] container checkpoint second
Error: checkpoint/restore requires at least CRIU 31100

All other checkpoint/restore tests pass on that same task, so obviously this is a false error.

Only one instance in my logs: f37 remote

First step, I would suggest, might be to instrument libpod/container_internal_common.go:checkpointRestoreSupported() so it emits the criu version it thinks it's finding.
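A minimal sketch of that kind of instrumentation, not the actual podman code: `checkVersion` and `formatVersion` are hypothetical helpers, the minimum is taken from the error message above, and the `major*10000 + minor*100 + patch` encoding is an assumption. The point is only that the error carries the version that was actually detected.

```go
package main

import "fmt"

// minCriuVersion mirrors the number in the error message above.
const minCriuVersion = 31100

// formatVersion renders an integer version as major.minor.patch,
// assuming the major*10000 + minor*100 + patch encoding.
func formatVersion(v int) string {
	return fmt.Sprintf("%d.%d.%d", v/10000, (v/100)%100, v%100)
}

// checkVersion is a hypothetical helper: when the check fails, the error
// includes the detected version so a bogus reading shows up in the logs.
func checkVersion(detected int) error {
	if detected < minCriuVersion {
		return fmt.Errorf("checkpoint/restore requires at least CRIU %d, detected %d (%s)",
			minCriuVersion, detected, formatVersion(detected))
	}
	return nil
}

func main() {
	// A detected version of 0 would hint at a failed query, not an old CRIU.
	fmt.Println(checkVersion(0))
	fmt.Println(checkVersion(31700))
}
```

With a message like this, a reading of 0 would immediately suggest that the query itself failed.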

edsantiago added the flakes label Jun 12, 2023
@Luap99
Member

Luap99 commented Jun 12, 2023

Likely an underlying error in the RPC protocol between podman and criu. I think the first step is to return the original error instead of treating every error as a minimum-version mismatch.

Luap99 added a commit to Luap99/libpod that referenced this issue Jun 12, 2023
There is a weird issue containers#18856 which causes the version check to fail.
Return the underlying error in these cases so we can see it and debug
it.

Signed-off-by: Paul Holzinger <[email protected]>
@edsantiago
Member Author

Here we go! f38 remote:

# podman-remote [options] container checkpoint ..... 
Error: failed to check for criu version: write criu-xprt-cln: broken pipe

@github-actions

A friendly reminder that this issue had no activity for 30 days.

@Luap99
Member

Luap99 commented Apr 4, 2024

I assume we never saw it again, so I am closing it.

Luap99 closed this as not planned Apr 4, 2024
@edsantiago
Member Author

Last week, actually:

Error: failed to check for criu version: write criu-xprt-cln: broken pipe

@Luap99
Member

Luap99 commented Apr 4, 2024

:(

Luap99 reopened this Apr 4, 2024
@edsantiago
Member Author

Today in system tests, rawhide remote:

<+024ms> # # podman-remote container checkpoint 53b2c1ff91face3d7788b502eec0415a74c752347d6d8cdc280a165a5b192b3c
<+461ms> # 53b2c1ff91face3d7788b502eec0415a74c752347d6d8cdc280a165a5b192b3c
         #
<+014ms> # # podman-remote container restore --ignore-static-ip --ignore-static-mac 53b2c1ff91face3d7788b502eec0415a74c752347d6d8cdc280a165a5b192b3c
         # [03:39:16.939415637]
         # Error: failed to check for criu version: write criu-xprt-cln: broken pipe

Luap99 added a commit to Luap99/go-criu that referenced this issue Jun 27, 2024
In the podman CI we are seeing a weird flake during criu version
detection[1]. The write to the socket just fails with broken pipe.
The logical thing to assume here is that the child exited. However the
current code never reports back the child error from wait nor does it
try to capture the output from it. This fixes both. The cleanup error is
now added to the returned error so the caller sees both.

As errors.Join is used from the std lib bump the minimum go version to
1.20.

[1] containers/podman#18856

Signed-off-by: Paul Holzinger <[email protected]>
rst0git pushed a commit to Luap99/go-criu that referenced this issue Jul 15, 2024
In the podman CI we are seeing a weird flake during criu version
detection[1]. The write to the socket just fails with broken pipe.
The logical thing to assume here is that the child exited. However the
current code never reports back the child error from wait nor does it
try to capture the output from it. This fixes both. The cleanup error is
now added to the returned error so the caller sees both.

As errors.Join is used from the std lib bump the minimum go version to
1.20.

[1] containers/podman#18856

Signed-off-by: Paul Holzinger <[email protected]>
Signed-off-by: Radostin Stoyanov <[email protected]>
Luap99 added a commit to Luap99/go-criu that referenced this issue Jul 19, 2024
In the podman CI we are seeing a weird flake during criu version
detection[1]. The write to the socket just fails with broken pipe.
The logical thing to assume here is that the child exited. However the
current code never reports back the child error from wait. The cleanup
error is now added to the returned error so the caller sees both.

The output is not captured as this causes hangs when the fds are passed
into child processes.

As errors.Join is used from the std lib bump the minimum go version to
1.20.

[1] containers/podman#18856

Signed-off-by: Paul Holzinger <[email protected]>
@edsantiago
Member Author

As far as flakes go, this is a pretty sweet & mellow one: infrequent, easily recognized & categorized. One more, f40

Luap99 added a commit to Luap99/libpod that referenced this issue Aug 19, 2024
There is no new version yet but we would like to use the new code[1] to debug
a flake[2] in the podman CI. It will not fix it but the new error might
give us a better idea what is going on.

[1] checkpoint-restore/go-criu#175
[2] containers#18856

Signed-off-by: Paul Holzinger <[email protected]>
edsantiago pushed a commit to edsantiago/libpod that referenced this issue Aug 19, 2024
There is no new version yet but we would like to use the new code[1] to debug
a flake[2] in the podman CI. It will not fix it but the new error might
give us a better idea what is going on.

[1] checkpoint-restore/go-criu#175
[2] containers#18856

Signed-off-by: Paul Holzinger <[email protected]>
@Luap99
Member

Luap99 commented Oct 25, 2024

Error: failed to check for criu version: write criu-xprt-cln: broken pipe
criu swrk failed: exit status 255

Not much more helpful I fear

https://api.cirrus-ci.com/v1/artifact/task/6055388473196544/html/int-podman-fedora-40-root-host-sqlite.log.html#t--Podman-checkpoint-podman-checkpoint-container-with-export-and-verify-non-default-runtime--1


Getting the actual stderr from criu would be useful, but it seems to be impossible to implement correctly: checkpoint-restore/go-criu#175 (comment)
