[Flake] TestDevGracefulCancel no output for 10 mins #6424
Happened twice on the same PR: https://github.com/GoogleContainerTools/skaffold/pull/6422/checks?check_run_id=3315882185
Was this issue seen on any GitHub Actions runs as well? If not, this flake may be gone, as we have migrated off of Travis to GitHub Actions for CI/CD.
This continues to reproduce on Kokoro as well. I've just seen it on a random PR. Tests failed after a 50-minute run (roughly 10 minutes of actual test runtime, of which about 4 minutes were integration tests, then 40 minutes waiting on input). Full logs are at https://source.cloud.google.com/results/invocations/3e0997d9-10ea-4eaf-ab6c-9603b73f482a/targets
Sometimes it makes it past
This does repro locally, although with a much lower chance than in the CI.
My suspicion is that this is a timing issue (we probably send the signal before it is hooked up for cancellation). I added a log line indicating the skaffold process has received the signal. Even though the signal is received after we print the Ctrl+C note, it still hangs sometimes:
time="2021-09-23T09:58:55-07:00" level=info msg="Still waiting for pods [getting-started]"
time="2021-09-23T09:58:55-07:00" level=info msg="Still waiting for pods [getting-started]"
Waiting for deployments to stabilize...
time="2021-09-23T09:58:55-07:00" level=info msg="Still waiting for pods [getting-started]"
Deployments stabilized in 12.129231ms
Press Ctrl+C to exit
Watching for changes...
time="2021-09-23T09:58:56-07:00" level=info msg="Pods marked as ready: map[getting-started:true]"
time="2021-09-23T09:58:56-07:00" level=warning msg="received signal: interrupt" subtask=-1 task=DevLoop
[..hangs..]
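For context, here is a minimal Go sketch (not skaffold's actual signal wiring) of the pattern at play: register the signal handler up front and tie it to context cancellation, so a SIGINT delivered at any point after registration surfaces as `ctx.Done()` rather than racing the setup.

```go
// A minimal sketch, not skaffold's actual signal wiring: tie SIGINT/SIGTERM
// to context cancellation so a signal received at any point after
// registration shows up as ctx.Done() instead of racing the setup.
package main

import (
	"context"
	"fmt"
	"os"
	"os/signal"
	"syscall"
)

func main() {
	// Register the handler before any long-running work starts.
	ctx, stop := signal.NotifyContext(context.Background(), os.Interrupt, syscall.SIGTERM)
	defer stop()

	fmt.Println("Press Ctrl+C to exit")
	<-ctx.Done() // unblocks once SIGINT/SIGTERM is delivered
	fmt.Println("received signal, shutting down")
}
```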
More debugging:
Deregister is implemented here; I suspect that's where it gets stuck. Here's a stack trace dump that shows it's indeed stuck there.
I have a branch here with a bunch of log statements; just run it with the integration test command above: https://github.com/ahmetb/skaffold-1/tree/debug-6424
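Piecing together the stack trace and the commit messages below, the deadlock pattern appears to be a blocking channel send inside the pod watcher while it holds its mutex, with Deregister waiting on that same mutex. Below is a hedged Go sketch of that pattern; the type and method names (other than Deregister, mentioned above) are illustrative, not skaffold's actual identifiers.

```go
// A hedged sketch of the suspected deadlock, not skaffold's code:
// broadcast blocks sending to a receiver that stopped reading while it
// still holds w.mu, so Deregister (which needs the same mutex) blocks too.
package watcher

import "sync"

type PodEvent struct{ Name string }

type podWatcher struct {
	mu        sync.Mutex
	receivers map[chan PodEvent]bool
}

// Register hands out a channel the caller is expected to keep draining.
func (w *podWatcher) Register(ch chan PodEvent) {
	w.mu.Lock()
	defer w.mu.Unlock()
	w.receivers[ch] = true
}

// Deregister needs the same mutex the broadcasting goroutine is holding.
func (w *podWatcher) Deregister(ch chan PodEvent) {
	w.mu.Lock()
	defer w.mu.Unlock()
	delete(w.receivers, ch)
}

// broadcast never returns if a receiver stopped reading: the unbuffered
// send blocks forever, and the held mutex drags Deregister down with it.
func (w *podWatcher) broadcast(ev PodEvent) {
	w.mu.Lock()
	defer w.mu.Unlock()
	for ch := range w.receivers {
		ch <- ev
	}
}
```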
Make the Kubernetes pod watcher context aware, and select on context.Done(). This means we can stop waiting, and acting on, pod events when the context has been cancelled. This should help avoid deadlocks that can occur when pod watcher event receivers stop reading from the channel that they've registered with the pod watcher. I'm not sure it'll completely eliminate the problem though. What if a receiver stops reading (because context is cancelled) while the pod watcher is in the middle of sending pod events to all the receivers? Fixes: GoogleContainerTools#6424
Make the Kubernetes pod watcher context aware, and select on `context.Done()`. This means we can stop waiting on, and acting on, pod events when the context has been cancelled. Remove waiting on `context.Done()` in the Kubernetes log aggregator, container manager, and pod port forwarder. This is to eliminate the chance that the pod watcher sends a `PodEvent` on a channel without a waiting receiver. This should help avoid deadlocks that can occur when pod watcher event receivers stop reading from the channel that they've registered with the pod watcher. We still close the channels on the receiver side, which could increase the chances of regression and re-occurrence of this issue. Also use an RWMutex in the pod watcher, though we could move this change to a separate commit. Fixes: GoogleContainerTools#6424
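Under the same assumptions as the earlier sketch, the fix these commits describe roughly amounts to replacing that sketch's `broadcast` with a send that also selects on `context.Done()` (and adding "context" to its imports). This is a paraphrase of the described change, not skaffold's actual code.

```go
// Replacing the sketch's broadcast: each send also selects on ctx.Done(),
// so a cancelled context unblocks the watcher even when a receiver has
// stopped draining its channel.
func (w *podWatcher) broadcast(ctx context.Context, ev PodEvent) {
	w.mu.Lock()
	defer w.mu.Unlock()
	for ch := range w.receivers {
		select {
		case ch <- ev:
		case <-ctx.Done():
			return // stop delivering pod events once the dev loop is cancelled
		}
	}
}
```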
Add log statements to better understand what's happening in Kokoro. Unfortunately I haven't been able to reproduce the problem locally (on Linux) so far. Related: GoogleContainerTools#6424, GoogleContainerTools#6643, GoogleContainerTools#6662
Debugging the flaky `TestDevGracefulCancel/multi-config-microservices` integration test showed that the kubectl port forwarder was stuck with goroutines waiting on channels (one per resource). Search for `goroutine 235` and `goroutine 234` in this Kokoro log: https://source.cloud.google.com/results/invocations/a9749ab5-8762-4319-a2be-f67c7440f7a2/targets/skaffold%2Fpresubmit/log This change means that the forwarder also listens for context cancellation. **Related**: GoogleContainerTools#6424, GoogleContainerTools#6643, GoogleContainerTools#6662, GoogleContainerTools#6685
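As a rough illustration of that forwarder change (not the actual kubectl forwarder code; `forwardWhenReady` and `resource` are hypothetical names), a goroutine that previously blocked only on a per-resource channel now also selects on `ctx.Done()`, so a cancelled dev loop can't leave it parked forever the way goroutines 234/235 were in the Kokoro dump:

```go
// A rough illustration, not the actual kubectl forwarder: the goroutine
// that waits for a resource to become forwardable also watches ctx.Done(),
// so cancellation can't leave it blocked on the per-resource channel.
package forward

import "context"

type resource struct{ name string }

func forwardWhenReady(ctx context.Context, ready <-chan resource) {
	select {
	case r := <-ready:
		_ = r // start port-forwarding for r here
	case <-ctx.Done():
		return // dev loop cancelled before the resource was ready
	}
}
```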