Pull request jobs are flaking with k3s #322

Closed
jsturtevant opened this issue Sep 18, 2023 · 17 comments

@jsturtevant
Contributor

Wasmer has failed in the e2e tests in several PRs recently:

#320
#319
#318

The logs don't give a ton of info; it looks like the pods are stuck in the Pending state.

fyi @0xE282B0 @dierbei
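
When this flakes again, a bit more output from the job would help pin down why the scheduler leaves the pods Pending. A minimal sketch of extra commands, assuming the same sudo bin/k3s kubectl invocation the e2e job already uses:

# list everything, including kube-system, to see whether the control plane itself is healthy
sudo bin/k3s kubectl get pods --all-namespaces -o wide

# the Events section of describe usually states why a pod is stuck Pending
# (unschedulable node, missing RuntimeClass, image pull problems, ...)
sudo bin/k3s kubectl describe pods

# node conditions and recent cluster events
sudo bin/k3s kubectl describe nodes
sudo bin/k3s kubectl get events --sort-by=.metadata.creationTimestamp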

@jprendes
Collaborator

jprendes commented Sep 18, 2023

I'm not sure it's related to Wasmer; I think it's rather flakiness from k3s.

@jprendes
Collaborator

jprendes commented Sep 18, 2023

I'm hoping #323 can give us further insight into the issue.

  • It has already failed once with kind + wasmer, but passed with k3s + wasmer + ubuntu-22.04 in #1195
  • It has failed once with k3s + wasmer + (ubuntu-20.04 + ubuntu-22.04) in #1197

@jsturtevant jsturtevant changed the title Wasmer is flaking in Pull request jobs Pull request jobs are flaking Sep 18, 2023
@jsturtevant jsturtevant changed the title Pull request jobs are flaking Pull request jobs are flaking with k3s Sep 18, 2023
@jsturtevant
Contributor Author

@dierbei
Contributor

dierbei commented Sep 18, 2023

@jsturtevant @jprendes I'll take a look at that.

Right now my plan is:

  1. Look at the containerd logs while the pods are in the Pending state (see the sketch below).
  2. If all of them are Pending, check whether there are any conflicts (e.g. as in the unit tests).

I'm going to build a k3s environment and test it.
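
A rough sketch of where those logs live on a k3s node, assuming a default k3s layout (the containerd log path is k3s's default, not something confirmed from this CI setup):

# k3s bundles its own containerd; by default its log sits under the agent directory
sudo tail -n 200 /var/lib/rancher/k3s/agent/containerd/containerd.log

# when k3s runs as a systemd service, k3s/kubelet messages go to the journal
sudo journalctl -u k3s --since "10 minutes ago" --no-pager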

@Mossaka
Member

Mossaka commented Sep 19, 2023

Anyone tried reproducing it locally?

@dierbei
Contributor

dierbei commented Sep 19, 2023

Anyone tried reproducing it locally?

I've been a little busy the last couple of days; I'll give it a try as soon as I can.

@dierbei
Contributor

dierbei commented Sep 20, 2023

Anyone tried reproducing it locally?

Unfortunately, I ran make test-wasmer 13 times without any problems.

My OS is Ubuntu 22.04.

I'm continuing to try.

@0xE282B0
Contributor

Hi, I noticed that the first shim with a sidecar that comes up after installation is stuck. When I delete it, it starts without problems.
Often it is Wasmer because it is the first one that starts, but I've observed it with Lunatic and WasmEdge as well.
Like in this test report: KWasm/kwasm-node-installer#43 (comment)
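
As a stop-gap, deleting the stuck pod lets the Deployment's ReplicaSet schedule a fresh one, which then comes up fine per the observation above; a sketch (the pod name is a placeholder):

# delete the stuck pod; the ReplicaSet immediately creates a replacement
sudo bin/k3s kubectl delete pod <stuck-pod-name>

# watch the replacement come up
sudo bin/k3s kubectl get pods -w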

@dierbei
Contributor

dierbei commented Sep 20, 2023

@jprendes @jsturtevant @Mossaka @0xE282B0 I'm experiencing a Pending status, but right now I'm not quite sure what the problem is.

https://github.com/dierbei/runwasi/actions/runs/6248281600/job/16963026527#step:7:28


@0xE282B0
Contributor

Not sure if it is the same problem, but in my case I get this error message on the Linux container with kubectl describe pod ...:

Error: failed to create containerd task: 
  failed to start shim: start failed: io.containerd.wasmtime.v1: 
    Other("failed to setup namespaces: 
      Other: could not open network namespace /proc/0/ns/net: No such file or directory (os error 2)")
: exit status 1: unknown

Then the sidecar container is in a restart loop.
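
For context on that path: PID 0 is never a real userspace process, so /proc/0 cannot exist; the shim has apparently been handed an unset sandbox PID when building the network-namespace path. Purely illustrative:

# /proc/<pid>/ns/net only exists for a running process, so PID 0 can never resolve
ls -l /proc/0/ns/net
# ls: cannot access '/proc/0/ns/net': No such file or directory

# compare with a real process, e.g. the current shell
ls -l /proc/$$/ns/net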

@dierbei
Contributor

dierbei commented Sep 21, 2023

I took a closer look at the logs and realized that it seems to be because kubectl apply -f deploy.yaml runs too early.

Notice that the --all-namespaces query does not return any pods in the kube-system namespace.

What seems to be happening is that the kube-system pods have not started yet when the workload is applied.

sudo bin/k3s kubectl get pods --all-namespaces
NAMESPACE   NAME                        READY   STATUS    RESTARTS   AGE
default     wasi-demo-79d9475fd-p8r7k   0/2     Pending   0          107s
default     wasi-demo-79d9475fd-p7sqz   0/2     Pending   0          107s
default     wasi-demo-79d9475fd-c848k   0/2     Pending   0          107s
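
If that race is the cause, one possible mitigation (a sketch only, not necessarily what #323 does) is to block until the node and the kube-system add-ons are ready before applying deploy.yaml:

# wait for the node to register and report Ready
sudo bin/k3s kubectl wait --for=condition=Ready node --all --timeout=120s

# wait for the kube-system deployments (coredns, metrics-server, ...) to become Available;
# wrapped in a retry because "kubectl wait" errors out while the deployments do not exist yet
timeout 180 bash -c \
  'until sudo bin/k3s kubectl wait --for=condition=Available deployment --all -n kube-system --timeout=10s; do sleep 5; done'

# only then deploy the test workload
sudo bin/k3s kubectl apply -f deploy.yaml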

@jprendes
Collaborator

I took a closer look at the logs and realized that it seems to be because kubectl apply -f deploy.yaml runs too early.

I think you are right.
I've added some mitigation here, which runs the k3s test for all runtimes:

That is compared to before, when I didn't manage to get a single clean run.

Now, if we check the failed run, it failed because the kube-system pods never came up within 1 minute, even before the shims were involved, so it's not a problem with the shims.

@jsturtevant
Contributor Author

jsturtevant commented Oct 3, 2023

Some additional logs and investigation here: #346 (comment)

@jsturtevant
Contributor Author

I've opened containerd/rust-extensions#210 since our logs don't have timing info in them.

@jsturtevant
Contributor Author

jsturtevant commented Oct 6, 2023

Besides #347, I'm seeing the following:

cp: cannot stat '/var/lib/rancher/k3s/agent/etc/containerd/config.toml': No such file or directory

@jprendes
Collaborator

jprendes commented Oct 6, 2023

I think the best thing to do in that case, as well as when k3s's kube-system pods fail to start, is to uninstall k3s and install it again.

That looks like a corrupted k3s installation.
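
For reference, when k3s is installed via the upstream install script it also drops an uninstall script, so a reset could look roughly like this (a sketch assuming the script-based install, not necessarily how the CI provisions k3s):

# remove the existing (possibly corrupted) installation, including /var/lib/rancher/k3s
sudo /usr/local/bin/k3s-uninstall.sh

# reinstall from the upstream script and check that the service came back
curl -sfL https://get.k3s.io | sh -
sudo systemctl is-active k3s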

@jprendes
Collaborator

This has been fixed by #353
