Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cleaning up fd to avoid thread hangup #280

Closed
wants to merge 1 commit into from

Conversation

CaptainVincent
Copy link
Contributor

Adding the missing section.
The original behavior was inside instance_linux.rs, uncertain whether it applies to Windows.

@jprendes
Copy link
Collaborator

Why do we need to do this?
IIUC, File should close the fd when dropped already.
Before this change I would have expected the File to be dropped at either of the following:

  • after a redirect(), as it take()s the value out of the Option.
  • when the last clone of Stdio is dropped, as the Arc count reaches 0 ans drops its content.

I would say in general it's when calling redirect.

I don't see where the fd is leaked outside of these cases. Do you you when that happens?

@CaptainVincent
Copy link
Contributor Author

CaptainVincent commented Aug 27, 2023

I could reproduce hang up issue by run basic demo
sudo ctr run --rm --runtime=io.containerd.wasmedge.v1 ghcr.io/containerd/runwasi/wasi-demo-app:latest testwasm /wasi-demo-app.wasm echo 'hello' (wasmtime shim has same behavior)

In fact, this issue doesn't have much output information.
Currently, I only know that the status switches to TaskStateWrapper::Exited when the task ends.
And I'm trying to locate the first commit where this issue first appeared. 11d867a

So I'm attempting to fill in what I thought was missing to address this issue. However, you might be right; this might not be the true root cause. I'll first fix the CI and then give more thought to your feedback.

@jprendes
Copy link
Collaborator

I can totally see how 11d867a could have introduced the issue. But I don't know why.

Maybe something related to the process being cloned?

@CaptainVincent
Copy link
Contributor Author

yup, me too. 🤔
I will keep try conduct further testing to give more logical explanation for this PR.

@jprendes
Copy link
Collaborator

jprendes commented Aug 27, 2023

I could reproduce hang up issue by run basic demo
sudo ctr run --rm --runtime=io.containerd.wasmedge.v1 ghcr.io/containerd/runwasi/wasi-demo-app:latest testwasm /wasi-demo-app.wasm echo 'hello' (wasmtime shim has same behavior)

Does it happen every time or sporadically? I can't reproduce it locally, even after 1500 iterations. What is your test environment?

@CaptainVincent
Copy link
Contributor Author

CaptainVincent commented Aug 28, 2023

Does it happen every time or sporadically? I can't reproduce it locally, even after 1500 iterations. What is your test environment?

What ~ really?! I ran it on our development server and it hangs every time. That's really strange.
Did I miss any relevant information that might be available beyond what's presented below?

  • ctr version
    ctr containerd.io 1.6.22

  • os version
    No LSB modules are available.
    Distributor ID: Ubuntu
    Description: Ubuntu 22.04.3 LTS
    Release: 22.04
    Codename: jammy

@CaptainVincent
Copy link
Contributor Author

CaptainVincent commented Aug 28, 2023

I add CI to test demo with those commits before and after stdio refactor.
ci-7a63fd2 OK
ci-11d867a Still hangs and time out

@jprendes
Copy link
Collaborator

I just tried in a Ubuntu 22.04 VM, but I can't reproduce.
Ubuntu 22.04 did install containerd 1.7.2.
I will try with 20.04

@jprendes
Copy link
Collaborator

I tried with a 20.04 VM and the same.
It also has contianerd 1.7.2. I wonder if it's a problem only with older containerd.
Will keep diggingd.

@CaptainVincent
Copy link
Contributor Author

CaptainVincent commented Aug 28, 2023

In here, I upgrade my github runner to containerd v1.7.5 and check it inside workflow before running again

Run ctr --version
ctr github.com/containerd/containerd v1.7.5

But no luck 🥲

@jprendes
Copy link
Collaborator

jprendes commented Aug 28, 2023

Ok, I see what's happening.
The redirect() method gets called in a different process.
This means that the Stdio copy in original process' Instance never gets consumed.

Doesn't explain why it works fine for me locally, but now we can fix it :-)

@CaptainVincent CaptainVincent deleted the cleanup-fds branch August 30, 2023 23:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants