fix wait for terminated pods or for minion nodes in the just-restarted cluster #11106
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull request has been approved by: prezha. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
@prezha I tried the PR and I'm still getting the same issue on my end.
@spowelljr thanks for trying it! I can't replicate it locally; can I kindly ask you to append …
@prezha Output attached
Hmmm, CrashLoopBackOff for kindnet and storage-provisioner. Can you please also share the logs of those containers?
Attached
My fault, those logs don't really contain anything useful, I don't think. kindnet
Storage
Thanks, @spowelljr, interesting! Do you mind if we …
/ok-to-test
It looks like it's related to that specific combination of driver (docker) and runtime (containerd); with the kvm + containerd combo it looks like it's working.
btw,
I can't run the docker combo on my machine because it gets a mix of cgroup managers (cgroupfs and systemd) that I cannot overcome at the moment.
That could be something about kindnet; maybe try bumping the kindnet version and see if that fixes it? We also need to check the CNI folder used on the ISO, and whether the VM drivers use the bridge CNI.
kvm2 driver with docker runtime
Times for minikube start: 56.1s 54.9s 51.9s 58.2s 57.5s
Times for minikube (PR 11106) ingress: 43.4s 43.9s 44.1s 43.5s 43.5s

docker driver with docker runtime
Times for minikube start: 25.4s 23.0s 23.6s 22.3s 22.3s
Times for minikube ingress: 31.0s 34.5s 34.0s 33.6s 36.1s

docker driver with containerd runtime

kvm2 driver with docker runtime
Times for minikube start: 56.6s 50.8s 54.0s 51.9s 46.7s
Times for minikube ingress: 33.8s 35.3s 34.4s 34.3s 34.3s

docker driver with docker runtime
Times for minikube start: 22.6s 21.2s 22.6s 21.2s 21.6s
Times for minikube ingress: 28.5s 28.0s 28.5s 36.5s 31.5s

docker driver with containerd runtime
Force-pushed from e6896f8 to b4ff382
kvm2 driver with docker runtime
Times for minikube start: 49.0s 47.4s 48.6s 47.4s 47.6s
Times for minikube ingress: 34.8s 43.8s 36.3s 39.2s 43.4s

docker driver with docker runtime
Times for minikube start: 23.8s 21.1s 21.8s 21.6s 21.4s
Times for minikube ingress: 30.5s 32.0s 36.0s 34.5s 38.1s

docker driver with containerd runtime

kvm2 driver with docker runtime
Times for minikube start: 52.6s 47.4s 48.3s 51.3s 47.8s
Times for minikube ingress: 42.9s 43.4s 43.7s 37.8s 44.4s

docker driver with docker runtime
Times for minikube (PR 11106) start: 21.9s 20.9s 21.8s 22.0s 22.1s
Times for minikube ingress: 32.5s 34.5s 35.0s 34.0s 32.5s

docker driver with containerd runtime

kvm2 driver with docker runtime
Times for minikube start: 50.2s 47.5s 51.7s 55.0s 46.8s
Times for minikube ingress: 35.3s 35.2s 34.3s 33.8s 34.2s

docker driver with docker runtime
Times for minikube start: 22.0s 21.0s 21.1s 21.8s 20.9s
Times for minikube ingress: 30.5s 29.0s 36.5s 29.5s 31.0s

docker driver with containerd runtime
I think I've split it all correctly and cleaned up this PR now...
Force-pushed from f3573eb to 89c5ec2
Force-pushed from 3873b8a to 357b593
kvm2 driver with docker runtime
Times for minikube start: 50.1s 50.5s 46.6s 51.1s 49.9s
Times for minikube ingress: 34.8s 36.7s 36.7s 35.4s 34.3s

docker driver with docker runtime
Times for minikube start: 21.3s 22.7s 21.3s 21.8s 21.6s
Times for minikube ingress: 29.5s 29.0s 28.5s 27.0s 27.5s

docker driver with containerd runtime

kvm2 driver with docker runtime
Times for minikube start: 49.3s 46.7s 47.5s 47.0s 46.9s
Times for minikube ingress: 37.8s 34.3s 42.8s 36.2s 37.7s

docker driver with docker runtime
Times for minikube start: 22.5s 22.5s 21.3s 21.7s 22.0s
Times for minikube (PR 11106) ingress: 35.5s 38.0s 38.0s 40.5s 34.5s

docker driver with containerd runtime

kvm2 driver with docker runtime
Times for minikube start: 49.3s 50.7s 50.0s 50.2s 53.9s
Times for minikube ingress: 36.4s 33.8s 34.8s 37.7s 34.4s

docker driver with docker runtime
Times for minikube start: 20.9s 21.3s 21.8s 20.9s 23.1s
Times for minikube (PR 11106) ingress: 32.5s 29.0s 29.0s 33.5s 31.5s

docker driver with containerd runtime
@prezha in none of the PRs (including the broken-up PRs) does the containerd test finish...
@medyagh yes, I think this is the correct 'wait' behaviour, because critical pods didn't reach the 'running' phase or the 'ready' condition. Extracts from the log:
and
Now, the difference is that the (successful) 'none' run uses no CNI, while this (failed) 'docker + containerd' run uses kindnet. The (now separated) #11178 and #11185 should help in addressing the issue itself, so in the context of 'wait', I think this is the expected and desired behaviour.
This change makes the Docker_Linux failure count high. Docker_Linux Jenkins run: completed with 8 / 233 failures in 46.28 minute(s).
```
@@ -63,7 +66,14 @@ func WaitExtra(cs *kubernetes.Clientset, labels []string, timeout time.Duration)
	}
	if match || pod.Spec.PriorityClassName == "system-cluster-critical" || pod.Spec.PriorityClassName == "system-node-critical" {
		if err := waitPodCondition(cs, pod.Name, pod.Namespace, core.PodReady, timeout); err != nil {
```
How about, instead of waiting until it fails and then checking the error message to see if it is a "Terminated" pod, we only range through pods which are not Terminated at the beginning?
```go
if err != nil {
	return fmt.Errorf("error listing pods in %q namespace: %w", meta.NamespaceSystem, err)
}
```
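The filter-first approach suggested in this comment could look like the following minimal sketch. It is self-contained and hypothetical: `Pod` and `PodPhase` here are simplified stand-ins for the `k8s.io/api/core/v1` types, not minikube's actual code.

```go
package main

import "fmt"

// PodPhase is a simplified stand-in for core.PodPhase from k8s.io/api/core/v1.
type PodPhase string

const (
	PodRunning   PodPhase = "Running"
	PodPending   PodPhase = "Pending"
	PodSucceeded PodPhase = "Succeeded"
	PodFailed    PodPhase = "Failed"
)

// Pod is a simplified stand-in for core.Pod.
type Pod struct {
	Name  string
	Phase PodPhase
}

// filterWaitable drops pods that are already terminated (Succeeded/Failed),
// so the caller only waits on pods that can still become Ready.
func filterWaitable(pods []Pod) []Pod {
	var out []Pod
	for _, p := range pods {
		if p.Phase == PodRunning || p.Phase == PodPending {
			out = append(out, p)
		}
	}
	return out
}

func main() {
	pods := []Pod{
		{"kindnet-abc", PodRunning},
		{"old-job", PodSucceeded},
		{"storage-provisioner", PodPending},
	}
	// Only the running and pending pods survive the filter.
	for _, p := range filterWaitable(pods) {
		fmt.Println(p.Name)
	}
}
```

With this shape, the wait loop never even starts a watch on a terminated pod, instead of starting one and later classifying the resulting error.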
Or, as @ilya-zuyev noted, `for _, pod := range pods.Items {` might have a condition we could check.
i think something like that (ie, 'skipping!') was my original intention:
minikube/pkg/minikube/bootstrapper/bsutil/kverify/pod_ready.go
Lines 137 to 139 in 357b593
```go
if pod.Status.Phase != core.PodRunning && pod.Status.Phase != core.PodPending {
	return core.ConditionUnknown, fmt.Sprintf("pod %q in %q namespace has status phase %q (skipping!): %+v", pod.Name, pod.Namespace, pod.Status.Phase, pod.Status)
}
```
but it somehow got lost along the way (when i also added host checking)
i'll have a look...
@medyagh @ilya-zuyev thanks for catching this!
it shouldn't have waited on non-running and non-pending pods because of:
minikube/pkg/minikube/bootstrapper/bsutil/kverify/pod_ready.go
Lines 137 to 139 in 357b593
```go
if pod.Status.Phase != core.PodRunning && pod.Status.Phase != core.PodPending {
	return core.ConditionUnknown, fmt.Sprintf("pod %q in %q namespace has status phase %q (skipping!): %+v", pod.Name, pod.Namespace, pod.Status.Phase, pod.Status)
}
```
and
minikube/pkg/minikube/bootstrapper/bsutil/kverify/pod_ready.go
Lines 105 to 108 in 357b593
```go
if status == core.ConditionUnknown {
	klog.Info(reason)
	return false, fmt.Errorf(reason)
}
```
then
minikube/pkg/minikube/bootstrapper/bsutil/kverify/pod_ready.go
Lines 116 to 118 in 357b593
```go
if err := wait.PollImmediate(kconst.APICallRetryInterval, kconst.DefaultControlPlaneTimeout, checkCondition); err != nil {
	return fmt.Errorf("waitPodCondition: %w", err)
}
```
would exit immediately (due to the error returned from checkCondition()), but I missed that specific case in:
minikube/pkg/minikube/bootstrapper/bsutil/kverify/pod_ready.go
Lines 72 to 73 in 357b593
```go
terminatedPod := strings.HasSuffix(err.Error(), "not found")
if !unstartedMinion.MatchString(err.Error()) && !terminatedPod {
```
so i've corrected that in 6779c72 in #11209 to test it there first
@ilya-zuyev @medyagh actually, this was ok; no need to change anything here, as wait
works as intended: it will skip if the pod is gone or if the just-restarted minion node is not yet ready (the detailed explanation above still holds).
i'll close this one in favour of #11209
/close
@prezha: Closed this PR. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Fixes #11101 and #10674.
We shouldn't wait for terminated pods or for minion nodes in the just-restarted cluster, and, in general, we should wait until after the CNI is applied.
Example logs are given here.