
[Flake] TestDevGracefulCancel no output for 10 mins #6424

Closed
tejal29 opened this issue Aug 12, 2021 · 6 comments · Fixed by #6643
Labels
kind/todo (implementation task/epic for the skaffold team), meta/test-flake

Comments

tejal29 (Contributor) commented Aug 12, 2021

=== RUN   TestDevGracefulCancel/getting-started
time="2021-08-12T20:45:55Z" level=info msg="Ran [skaffold dev --namespace skaffoldbnrts --default-repo gcr.io/k8s-skaffold] in 32.464 seconds"
time="2021-08-12T20:45:55Z" level=info msg="Namespace: skaffoldvjhqk"
time="2021-08-12T20:45:55Z" level=info msg="Running [skaffold dev --namespace skaffoldvjhqk --default-repo gcr.io/k8s-skaffold] in examples/getting-started"
time="2021-08-12T20:45:55Z" level=info msg="Waiting for pods [getting-started] to be ready"
Listing files to watch...
 - skaffold-example
Generating tags...
 - skaffold-example -> gcr.io/k8s-skaffold/skaffold-example:v1.29.0-40-g50a585d
Checking cache...
 - skaffold-example: Found Locally
Starting test...
Tags used in deployment:
 - skaffold-example -> gcr.io/k8s-skaffold/skaffold-example:55673b440103e61c7db744acfb318e6027d024e45e3da530a384c2a5962c6f48
Starting deploy...
Loading images into kind cluster nodes...
 - gcr.io/k8s-skaffold/skaffold-example:55673b440103e61c7db744acfb318e6027d024e45e3da530a384c2a5962c6f48 -> Loaded
Images loaded in 920.236524ms
 - pod/getting-started created
time="2021-08-12T20:45:57Z" level=info msg="Still waiting for pods [getting-started]"
Waiting for deployments to stabilize...
time="2021-08-12T20:45:57Z" level=info msg="Still waiting for pods [getting-started]"
Deployments stabilized in 49.670341ms
Waiting for deployments to stabilize...
time="2021-08-12T20:45:57Z" level=info msg="Still waiting for pods [getting-started]"
Deployments stabilized in 16.485819ms
Press Ctrl+C to exit
Watching for changes...
time="2021-08-12T20:45:59Z" level=info msg="Pods marked as ready: map[getting-started:true]"
No output has been received in the last 10m0s, this potentially indicates a stalled build or something wrong with the build itself.
Check the details on how to adjust your build configuration on: https://docs.travis-ci.com/user/common-build-problems/#build-times-out-because-no-output-was-received
tejal29 added the meta/test-flake and kind/todo labels Aug 12, 2021
tejal29 changed the title from "[Flake] TestDevGracefulCancel no output for 20 mins" to "[Flake] TestDevGracefulCancel no output for 10 mins" Aug 12, 2021
tejal29 (Contributor, Author) commented Aug 12, 2021

aaron-prindle (Contributor) commented:

Was this issue seen on any GitHub Actions runs as well? If not, this flake may be gone, as we have migrated off of Travis to GitHub Actions for CI/CD.

ahmetb (Contributor) commented Sep 22, 2021

This continues to reproduce on Kokoro as well.

I've just seen it on a random PR. The tests failed after a 50-minute run: roughly 10 minutes of actual test run (about 4 minutes of integration tests), then 40 minutes of waiting. Full logs are at https://source.cloud.google.com/results/invocations/3e0997d9-10ea-4eaf-ab6c-9603b73f482a/targets

ahmetb (Contributor) commented Sep 23, 2021

Sometimes it makes it past TestDevGracefulCancel/getting-started and fails at TestDevGracefulCancel/multi-config-microservices. https://source.cloud.google.com/results/invocations/0ae17eca-0fda-4f28-8947-d30f12d0efa7/targets/skaffold%2Fpresubmit/log

ahmetb (Contributor) commented Sep 23, 2021

This does repro locally, although with a much lower chance than in CI.

time env INTEGRATION_TEST_ARGS="-run=TestDevGracefulCancel/getting-started -count=100" make integration

My suspicion is that this is a timing issue (we probably send the signal before it is wired up for cancellation). I added a log line indicating that the skaffold process has received the signal. Even though the signal is received after we print the Ctrl+C note, it still hangs sometimes:

time="2021-09-23T09:58:55-07:00" level=info msg="Still waiting for pods [getting-started]"
time="2021-09-23T09:58:55-07:00" level=info msg="Still waiting for pods [getting-started]"
Waiting for deployments to stabilize...
time="2021-09-23T09:58:55-07:00" level=info msg="Still waiting for pods [getting-started]"
Deployments stabilized in 12.129231ms
Press Ctrl+C to exit
Watching for changes...
time="2021-09-23T09:58:56-07:00" level=info msg="Pods marked as ready: map[getting-started:true]"
time="2021-09-23T09:58:56-07:00" level=warning msg="received signal: interrupt" subtask=-1 task=DevLoop
[..hangs..]
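For reference, a rough, hypothetical sketch of what the scenario looks like from the test's side (illustrative flags and timings, not the actual integration test code): start `skaffold dev`, send SIGINT once the dev loop is up, and give the process a deadline to exit.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"time"
)

func main() {
	// Start `skaffold dev` roughly the way the integration test does.
	cmd := exec.Command("skaffold", "dev", "--namespace", "testns")
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	if err := cmd.Start(); err != nil {
		panic(err)
	}

	// The real test waits for dev-loop output before cancelling; a fixed
	// sleep stands in for that here.
	time.Sleep(30 * time.Second)

	// Send the interrupt that should trigger graceful cancellation.
	if err := cmd.Process.Signal(os.Interrupt); err != nil {
		panic(err)
	}

	// Fail if skaffold does not exit within a deadline; in the flake it
	// never exits, so CI kills the job after 10 minutes of no output.
	done := make(chan error, 1)
	go func() { done <- cmd.Wait() }()
	select {
	case err := <-done:
		fmt.Println("skaffold exited:", err)
	case <-time.After(10 * time.Minute):
		fmt.Println("skaffold hung after SIGINT")
	}
}
```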

ahmetb (Contributor) commented Sep 23, 2021

More debugging:
In dev.go the statement `defer r.deployer.GetLogger().Stop()` gets stuck.
It calls Stop() on LoggerMux,
which in turn calls Stop() on the Kubernetes LogAggregator.
In that method, this statement specifically blocks:

a.podWatcher.Deregister(a.events)

Deregister is implemented in watcher.go; I suspect it's the w.receiverLock.Lock() statement there that actually blocks.

Here's a stack trace dump that shows it's indeed stuck there.
There's another goroutine in podWatcher.Start() that also calls w.receiverLock.Lock().
Presumably its for-loop is stuck sending to a channel (receiver <- ...) while holding the lock, so it never unlocks.

SIGABRT: abort
PC=0x7fff204c1cde m=0 sigcode=0

goroutine 0 [idle]:
runtime.pthread_cond_wait(0x81631c0, 0x8163180)
	/usr/local/Cellar/go/1.17/libexec/src/runtime/sys_darwin.go:393 +0x34
runtime.semasleep(0xffffffffffffffff)
	/usr/local/Cellar/go/1.17/libexec/src/runtime/os_darwin.go:66 +0x93
runtime.notesleep(0x8162f70)
	/usr/local/Cellar/go/1.17/libexec/src/runtime/lock_sema.go:182 +0x85
runtime.mPark()
	/usr/local/Cellar/go/1.17/libexec/src/runtime/proc.go:1441 +0x2a
runtime.stoplockedm()
	/usr/local/Cellar/go/1.17/libexec/src/runtime/proc.go:2602 +0x7b
runtime.schedule()
	/usr/local/Cellar/go/1.17/libexec/src/runtime/proc.go:3299 +0x3d
runtime.park_m(0xc000622b60)
	/usr/local/Cellar/go/1.17/libexec/src/runtime/proc.go:3516 +0x12d
runtime.mcall()
	/usr/local/Cellar/go/1.17/libexec/src/runtime/asm_amd64.s:307 +0x43

goroutine 1 [semacquire, 5 minutes]:
sync.runtime_SemacquireMutex(0xc000401a70, 0x38, 0x65559bd9)
	/usr/local/Cellar/go/1.17/libexec/src/runtime/sema.go:71 +0x25
sync.(*Mutex).lockSlow(0xc0002d8958)
	/usr/local/Cellar/go/1.17/libexec/src/sync/mutex.go:138 +0x271
sync.(*Mutex).Lock(0xc0002d8958)
	/usr/local/Cellar/go/1.17/libexec/src/sync/mutex.go:81 +0x51
github.com/GoogleContainerTools/skaffold/pkg/skaffold/kubernetes.(*podWatcher).Deregister(0xc0002d8940, 0xc00081e180)
	/Users/ahmetb/workspace/skaffold/pkg/skaffold/kubernetes/watcher.go:65 +0x2e
github.com/GoogleContainerTools/skaffold/pkg/skaffold/kubernetes/logger.(*LogAggregator).Stop(0xc000e9a0b0)
	/Users/ahmetb/workspace/skaffold/pkg/skaffold/kubernetes/logger/log.go:177 +0x32a
github.com/GoogleContainerTools/skaffold/pkg/skaffold/log.LoggerMux.Stop({0xc000d61140, 0x1, 0x1})
	/Users/ahmetb/workspace/skaffold/pkg/skaffold/log/logger_mux.go:43 +0x274
github.com/GoogleContainerTools/skaffold/pkg/skaffold/runner/v1.(*SkaffoldRunner).Dev.func9()
	/Users/ahmetb/workspace/skaffold/pkg/skaffold/runner/v1/dev.go:328 +0x204
github.com/GoogleContainerTools/skaffold/pkg/skaffold/runner/v1.(*SkaffoldRunner).Dev(0xc000bb8160, {0x6f18150, 0xc0002d4480}, {0x6ed6c60, 0xc000bd8a50}, {0xc0003188c8, 0x1, 0x1})
	/Users/ahmetb/workspace/skaffold/pkg/skaffold/runner/v1/dev.go:371 +0x277a
github.com/GoogleContainerTools/skaffold/cmd/skaffold/app/cmd.runDev.func5({0x6f5bbc0, 0xc000bb8160}, {0xc000dce040, 0x1, 0x1})
	/Users/ahmetb/workspace/skaffold/cmd/skaffold/app/cmd/dev.go:73 +0x1bc
github.com/GoogleContainerTools/skaffold/cmd/skaffold/app/cmd.withRunner({0x6f18150, 0xc0002d4480}, {0x6ed6c60, 0xc000bd8a50}, 0xc000c0f548)
	/Users/ahmetb/workspace/skaffold/cmd/skaffold/app/cmd/runner.go:55 +0x20d
github.com/GoogleContainerTools/skaffold/cmd/skaffold/app/cmd.runDev({0x6f18150, 0xc0002d4480}, {0x6ed6c60, 0xc000bd8a50})
	/Users/ahmetb/workspace/skaffold/cmd/skaffold/app/cmd/dev.go:67 +0x385
github.com/GoogleContainerTools/skaffold/cmd/skaffold/app/cmd.(*builder).NoArgs.func1(0x0, {0x4, 0xc000c0f680, 0xc00035e0a0})
	/Users/ahmetb/workspace/skaffold/cmd/skaffold/app/cmd/commands.go:135 +0x9a
github.com/GoogleContainerTools/skaffold/cmd/skaffold/app/cmd.applyPostRunHooks.func1(0xc0007262c0, {0xc000bc3380, 0x0, 0x4})
	/Users/ahmetb/workspace/skaffold/cmd/skaffold/app/cmd/commands.go:164 +0x6e
github.com/spf13/cobra.(*Command).execute(0xc0007262c0, {0xc000bc3340, 0x4, 0x4})
	/Users/ahmetb/go/pkg/mod/github.com/spf13/[email protected]/command.go:856 +0xaf0
github.com/spf13/cobra.(*Command).ExecuteC(0xc00071c000)
	/Users/ahmetb/go/pkg/mod/github.com/spf13/[email protected]/command.go:974 +0x8b3
github.com/spf13/cobra.(*Command).Execute(0xc00071c000)
	/Users/ahmetb/go/pkg/mod/github.com/spf13/[email protected]/command.go:902 +0x2f
github.com/spf13/cobra.(*Command).ExecuteContext(0xc00071c000, {0x6f18150, 0xc0002d4480})
	/Users/ahmetb/go/pkg/mod/github.com/spf13/[email protected]/command.go:895 +0x73
github.com/GoogleContainerTools/skaffold/cmd/skaffold/app.Run({0x6ed5e80, 0xc00012c008}, {0x6ed5e80, 0xc00012c010})
	/Users/ahmetb/workspace/skaffold/cmd/skaffold/app/skaffold.go:47 +0x492
main.main()
	/Users/ahmetb/workspace/skaffold/cmd/skaffold/skaffold.go:50 +0x2bd

goroutine 21 [select]:
go.opencensus.io/stats/view.(*worker).start(0xc000160a00)
	/Users/ahmetb/go/pkg/mod/[email protected]/stats/view/worker.go:276 +0x135
created by go.opencensus.io/stats/view.init.0
	/Users/ahmetb/go/pkg/mod/[email protected]/stats/view/worker.go:34 +0xb6

goroutine 22 [chan receive]:
k8s.io/klog/v2.(*loggingT).flushDaemon(0x8161ce0)
	/Users/ahmetb/go/pkg/mod/k8s.io/klog/[email protected]/klog.go:1164 +0x5b
created by k8s.io/klog/v2.init.0
	/Users/ahmetb/go/pkg/mod/k8s.io/klog/[email protected]/klog.go:418 +0x128

goroutine 7 [chan receive]:
github.com/golang/glog.(*loggingT).flushDaemon(0x8161b40)
	/Users/ahmetb/go/pkg/mod/github.com/golang/[email protected]/glog.go:882 +0x5b
created by github.com/golang/glog.init.0
	/Users/ahmetb/go/pkg/mod/github.com/golang/[email protected]/glog.go:410 +0x1a8

goroutine 9 [chan receive, 5 minutes]:
github.com/GoogleContainerTools/skaffold/pkg/skaffold/event/v2.newHandler.func1()
	/Users/ahmetb/workspace/skaffold/pkg/skaffold/event/v2/event.go:59 +0x4b
created by github.com/GoogleContainerTools/skaffold/pkg/skaffold/event/v2.newHandler
	/Users/ahmetb/workspace/skaffold/pkg/skaffold/event/v2/event.go:57 +0x1ca

goroutine 10 [chan receive, 5 minutes]:
github.com/GoogleContainerTools/skaffold/pkg/skaffold/event.newHandler.func1()
	/Users/ahmetb/workspace/skaffold/pkg/skaffold/event/event.go:60 +0x45
created by github.com/GoogleContainerTools/skaffold/pkg/skaffold/event.newHandler
	/Users/ahmetb/workspace/skaffold/pkg/skaffold/event/event.go:58 +0x18a

goroutine 11 [chan receive, 5 minutes]:
github.com/rjeczalik/notify.(*recursiveTree).dispatch(0xc000317400)
	/Users/ahmetb/go/pkg/mod/github.com/rjeczalik/[email protected]/tree_recursive.go:125 +0x51
created by github.com/rjeczalik/notify.newRecursiveTree
	/Users/ahmetb/go/pkg/mod/github.com/rjeczalik/[email protected]/tree_recursive.go:119 +0x25b

goroutine 12 [syscall, 5 minutes, locked to thread]:
github.com/rjeczalik/notify._Cfunc_CFRunLoopRun()
	_cgo_gotypes.go:229 +0x45
github.com/rjeczalik/notify.init.1.func1()
	/Users/ahmetb/go/pkg/mod/github.com/rjeczalik/[email protected]/watcher_fsevents_cgo.go:72 +0x45
created by github.com/rjeczalik/notify.init.1
	/Users/ahmetb/go/pkg/mod/github.com/rjeczalik/[email protected]/watcher_fsevents_cgo.go:64 +0x33

goroutine 37 [syscall, 5 minutes]:
os/signal.signal_recv()
	/usr/local/Cellar/go/1.17/libexec/src/runtime/sigqueue.go:166 +0x28
os/signal.loop()
	/usr/local/Cellar/go/1.17/libexec/src/os/signal/signal_unix.go:24 +0x1d
created by os/signal.Notify.func1.1
	/usr/local/Cellar/go/1.17/libexec/src/os/signal/signal.go:151 +0x30

goroutine 72 [chan send, 5 minutes]:
github.com/GoogleContainerTools/skaffold/pkg/skaffold/kubernetes.(*podWatcher).Start.func5()
	/Users/ahmetb/workspace/skaffold/pkg/skaffold/kubernetes/watcher.go:151 +0xa25
created by github.com/GoogleContainerTools/skaffold/pkg/skaffold/kubernetes.(*podWatcher).Start
	/Users/ahmetb/workspace/skaffold/pkg/skaffold/kubernetes/watcher.go:99 +0x8de

goroutine 46 [IO wait, 5 minutes]:
internal/poll.runtime_pollWait(0x6ed89a0, 0x72)
	/usr/local/Cellar/go/1.17/libexec/src/runtime/netpoll.go:229 +0x45
internal/poll.(*pollDesc).wait(0xc000d65218, 0x72, 0x0)
	/usr/local/Cellar/go/1.17/libexec/src/internal/poll/fd_poll_runtime.go:84 +0x88
internal/poll.(*pollDesc).waitRead(0xc000d65218, 0x0)
	/usr/local/Cellar/go/1.17/libexec/src/internal/poll/fd_poll_runtime.go:89 +0x37
internal/poll.(*FD).Read(0xc000d65200, {0xc000dc3000, 0x1000, 0x1000})
	/usr/local/Cellar/go/1.17/libexec/src/internal/poll/fd_unix.go:167 +0x453
net.(*netFD).Read(0xc000d65200, {0xc000dc3000, 0x1000, 0x1000})
	/usr/local/Cellar/go/1.17/libexec/src/net/fd_posix.go:56 +0x73
net.(*conn).Read(0xc000011618, {0xc000dc3000, 0x1000, 0x1000})
	/usr/local/Cellar/go/1.17/libexec/src/net/net.go:183 +0xa9
net/http.(*persistConn).Read(0xc000dcd320, {0xc000dc3000, 0x1000, 0x1000})
	/usr/local/Cellar/go/1.17/libexec/src/net/http/transport.go:1926 +0x1dc
bufio.(*Reader).fill(0xc000da1260)
	/usr/local/Cellar/go/1.17/libexec/src/bufio/bufio.go:101 +0x294
bufio.(*Reader).Peek(0xc000da1260, 0x1)
	/usr/local/Cellar/go/1.17/libexec/src/bufio/bufio.go:139 +0x178
net/http.(*persistConn).readLoop(0xc000dcd320)
	/usr/local/Cellar/go/1.17/libexec/src/net/http/transport.go:2087 +0x285
created by net/http.(*Transport).dialConn
	/usr/local/Cellar/go/1.17/libexec/src/net/http/transport.go:1747 +0x27f6

goroutine 47 [select, 5 minutes]:
net/http.(*persistConn).writeLoop(0xc000dcd320)
	/usr/local/Cellar/go/1.17/libexec/src/net/http/transport.go:2386 +0x17f
created by net/http.(*Transport).dialConn
	/usr/local/Cellar/go/1.17/libexec/src/net/http/transport.go:1748 +0x2865

rax    0x104
rbx    0xbdd7e00
rcx    0x7ffeefbff088
rdx    0x4000
rdi    0x81631c0
rsi    0x400100004100
rbp    0x7ffeefbff130
rsp    0x7ffeefbff088
r8     0x0
r9     0xa0
r10    0x0
r11    0x246
r12    0x8163180
r13    0x16
r14    0x400100004100
r15    0x4000
rip    0x7fff204c1cde
rflags 0x247
cs     0x7
fs     0x0
gs     0x0

I have a branch here with a bunch of log statements; just run it with the integration test command above: https://github.com/ahmetb/skaffold-1/tree/debug-6424
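
To make the hang easier to reason about, here is a minimal, self-contained Go sketch of the deadlock pattern described above (hypothetical names, not skaffold's actual podWatcher code): one goroutine holds receiverLock while blocked on an unbuffered channel send, and Deregister blocks on the same lock.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// watcher mimics the structure described above: a mutex guarding a
// list of receiver channels, with Start fanning events out to them.
type watcher struct {
	receiverLock sync.Mutex
	receivers    []chan int
}

func (w *watcher) Register(ch chan int) {
	w.receiverLock.Lock()
	defer w.receiverLock.Unlock()
	w.receivers = append(w.receivers, ch)
}

func (w *watcher) Deregister(ch chan int) {
	w.receiverLock.Lock() // blocks: Start still holds the lock while stuck on a send
	defer w.receiverLock.Unlock()
	for i, r := range w.receivers {
		if r == ch {
			w.receivers = append(w.receivers[:i], w.receivers[i+1:]...)
			return
		}
	}
}

func (w *watcher) Start() {
	for ev := 0; ; ev++ {
		w.receiverLock.Lock()
		for _, r := range w.receivers {
			r <- ev // blocks forever once the receiver stops reading
		}
		w.receiverLock.Unlock()
		time.Sleep(100 * time.Millisecond)
	}
}

func main() {
	w := &watcher{}
	events := make(chan int)
	w.Register(events)
	go w.Start()

	<-events                           // read one event, then stop reading (like the logger after Ctrl+C)
	time.Sleep(300 * time.Millisecond) // let Start re-acquire the lock and block on the next send
	w.Deregister(events)               // never returns
	fmt.Println("unreachable")
}
```

Running this delivers one event and then aborts with Go's "all goroutines are asleep - deadlock!" report: one goroutine in semacquire on the mutex and one stuck on a chan send, matching the pair of goroutines in the dump above.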

halvards added a commit to halvards/skaffold that referenced this issue Sep 28, 2021
Make the Kubernetes pod watcher context aware, and select on
context.Done(). This means we can stop waiting, and acting on, pod
events when the context has been cancelled.

This should help avoid deadlocks that can occur when pod watcher event
receivers stop reading from the channel that they've registered with the
pod watcher.

I'm not sure it'll completely eliminate the problem though. What if a
receiver stops reading (because context is cancelled) while the pod
watcher is in the middle of sending pod events to all the receivers?

Fixes: GoogleContainerTools#6424
halvards added a commit to halvards/skaffold that referenced this issue Sep 29, 2021
Make the Kubernetes pod watcher context aware, and select on
`context.Done()`. This means we can stop waiting, and acting on, pod
events when the context has been cancelled.

Remove waiting on `context.Done()` in the Kubernetes log aggregator,
container manager, and pod port forwarder. This is to eliminate the
chance that the pod watcher sends a `PodEvent` on a channel without a
waiting receiver.

This should help avoid deadlocks that can occur when pod watcher event
receivers stop reading from the channel that they've registered with the
pod watcher.

We still close the channels on the receiver side, which could increase
the chances of regression and re-occurrence of this issue.

Also uses a RWMutex in the pod watcher, though we could move this change
to a separate commit.

Fixes: GoogleContainerTools#6424
MarlonGamez pushed a commit that referenced this issue Sep 29, 2021
Make the Kubernetes pod watcher context aware, and select on
`context.Done()`. This means we can stop waiting, and acting on, pod
events when the context has been cancelled.

Remove waiting on `context.Done()` in the Kubernetes log aggregator,
container manager, and pod port forwarder. This is to eliminate the
chance that the pod watcher sends a `PodEvent` on a channel without a
waiting receiver.

This should help avoid deadlocks that can occur when pod watcher event
receivers stop reading from the channel that they've registered with the
pod watcher.

We still close the channels on the receiver side, which could increase
the chances of regression and re-occurrence of this issue.

Also uses a RWMutex in the pod watcher, though we could move this change
to a separate commit.

Fixes: #6424
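
For illustration, a minimal sketch of the shape of this fix (hypothetical types, not skaffold's actual watcher code), reusing the fan-out structure from the deadlock sketch above: every send now also selects on ctx.Done(), so cancelling the context unblocks the sender and lets Deregister acquire the lock.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

type watcher struct {
	mu        sync.RWMutex // the commit also switches the pod watcher to a RWMutex
	receivers []chan int
}

func (w *watcher) Register(ch chan int) {
	w.mu.Lock()
	defer w.mu.Unlock()
	w.receivers = append(w.receivers, ch)
}

func (w *watcher) Deregister(ch chan int) {
	w.mu.Lock()
	defer w.mu.Unlock()
	for i, r := range w.receivers {
		if r == ch {
			w.receivers = append(w.receivers[:i], w.receivers[i+1:]...)
			return
		}
	}
}

// Start fans events out to receivers, but every send also selects on
// ctx.Done(), so a cancelled context unblocks the send even if the
// receiver has stopped reading.
func (w *watcher) Start(ctx context.Context) {
	for ev := 0; ; ev++ {
		select {
		case <-ctx.Done():
			return
		case <-time.After(100 * time.Millisecond): // stands in for a pod event arriving
		}
		w.mu.RLock()
		for _, r := range w.receivers {
			select {
			case r <- ev: // receiver is reading: event delivered
			case <-ctx.Done(): // receiver gave up: cancellation unblocks the send
				w.mu.RUnlock()
				return
			}
		}
		w.mu.RUnlock()
	}
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	w := &watcher{}
	events := make(chan int)
	w.Register(events)
	go w.Start(ctx)

	<-events             // read one event, then stop reading, as in the hang above
	cancel()             // cancellation unblocks the watcher's pending send
	w.Deregister(events) // now acquires the lock instead of deadlocking
	fmt.Println("watcher stopped cleanly")
}
```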
halvards added a commit to halvards/skaffold that referenced this issue Oct 6, 2021
Adding log statements to better understand what's happening in kokoro.
Unfortunately I haven't been able to reproduce the problem locally (on
Linux) so far.

Related: GoogleContainerTools#6424, GoogleContainerTools#6643, GoogleContainerTools#6662
halvards added a commit to halvards/skaffold that referenced this issue Oct 7, 2021
Debugging the flaky `TestDevGracefulCancel/multi-config-microservices`
integration test showed that the kubectl port forwarder was stuck with
goroutines waiting on channels (one per resource).

Search for `goroutine 235` and `goroutine 234` in this Kokoro log:
https://source.cloud.google.com/results/invocations/a9749ab5-8762-4319-a2be-f67c7440f7a2/targets/skaffold%2Fpresubmit/log

This change means that the forwarder also listens for context cancellation.

**Related**: GoogleContainerTools#6424, GoogleContainerTools#6643, GoogleContainerTools#6662, GoogleContainerTools#6685
briandealwis pushed a commit that referenced this issue Oct 7, 2021
Debugging the flaky `TestDevGracefulCancel/multi-config-microservices`
integration test showed that the kubectl port forwarder was stuck with
goroutines waiting on channels (one per resource).

Search for `goroutine 235` and `goroutine 234` in this Kokoro log:
https://source.cloud.google.com/results/invocations/a9749ab5-8762-4319-a2be-f67c7440f7a2/targets/skaffold%2Fpresubmit/log

This change means that the forwarder also listens for context cancellation.

**Related**: #6424, #6643, #6662, #6685