
[Flake] TestDevGracefulCancel no output for 10 mins #6424

Closed
tejal29 opened this issue Aug 12, 2021 · 6 comments · Fixed by #6643
Labels
kind/todo (implementation task/epic for the skaffold team), meta/test-flake

Comments

tejal29 (Contributor) commented Aug 12, 2021

=== RUN   TestDevGracefulCancel/getting-started
time="2021-08-12T20:45:55Z" level=info msg="Ran [skaffold dev --namespace skaffoldbnrts --default-repo gcr.io/k8s-skaffold] in 32.464 seconds"
time="2021-08-12T20:45:55Z" level=info msg="Namespace: skaffoldvjhqk"
time="2021-08-12T20:45:55Z" level=info msg="Running [skaffold dev --namespace skaffoldvjhqk --default-repo gcr.io/k8s-skaffold] in examples/getting-started"
time="2021-08-12T20:45:55Z" level=info msg="Waiting for pods [getting-started] to be ready"
Listing files to watch...
 - skaffold-example
Generating tags...
 - skaffold-example -> gcr.io/k8s-skaffold/skaffold-example:v1.29.0-40-g50a585d
Checking cache...
 - skaffold-example: Found Locally
Starting test...
Tags used in deployment:
 - skaffold-example -> gcr.io/k8s-skaffold/skaffold-example:55673b440103e61c7db744acfb318e6027d024e45e3da530a384c2a5962c6f48
Starting deploy...
Loading images into kind cluster nodes...
 - gcr.io/k8s-skaffold/skaffold-example:55673b440103e61c7db744acfb318e6027d024e45e3da530a384c2a5962c6f48 -> Loaded
Images loaded in 920.236524ms
 - pod/getting-started created
time="2021-08-12T20:45:57Z" level=info msg="Still waiting for pods [getting-started]"
Waiting for deployments to stabilize...
time="2021-08-12T20:45:57Z" level=info msg="Still waiting for pods [getting-started]"
Deployments stabilized in 49.670341ms
Waiting for deployments to stabilize...
time="2021-08-12T20:45:57Z" level=info msg="Still waiting for pods [getting-started]"
Deployments stabilized in 16.485819ms
Press Ctrl+C to exit
Watching for changes...
time="2021-08-12T20:45:59Z" level=info msg="Pods marked as ready: map[getting-started:true]"
No output has been received in the last 10m0s, this potentially indicates a stalled build or something wrong with the build itself.
Check the details on how to adjust your build configuration on: https://docs.travis-ci.com/user/common-build-problems/#build-times-out-because-no-output-was-received
tejal29 added the meta/test-flake and kind/todo labels Aug 12, 2021
tejal29 changed the title from "[Flake] TestDevGracefulCancel no output for 20 mins" to "[Flake] TestDevGracefulCancel no output for 10 mins" Aug 12, 2021
tejal29 (Contributor, Author) commented Aug 12, 2021

aaron-prindle (Contributor) commented:

Was this issue seen on any GitHub Actions runs as well? If not, this flake may be gone, as we have migrated off of Travis to GitHub Actions for CI/CD.

ahmetb (Contributor) commented Sep 22, 2021

This continues to reproduce on Kokoro as well.

I've just seen it on a random PR. The tests failed after a 50-minute run: roughly 10 minutes of actual test run (about 4 minutes of integration tests), then 40 minutes of waiting. Full logs are at https://source.cloud.google.com/results/invocations/3e0997d9-10ea-4eaf-ab6c-9603b73f482a/targets

ahmetb (Contributor) commented Sep 23, 2021

Sometimes it makes it past TestDevGracefulCancel/getting-started and fails at TestDevGracefulCancel/multi-config-microservices. https://source.cloud.google.com/results/invocations/0ae17eca-0fda-4f28-8947-d30f12d0efa7/targets/skaffold%2Fpresubmit/log

ahmetb (Contributor) commented Sep 23, 2021

This does repro locally, although with a much lower chance than in CI.

time env INTEGRATION_TEST_ARGS="-run=TestDevGracefulCancel/getting-started -count=100" make integration

My suspicion is that this is a timing issue (we probably send the signal before it is wired up for cancellation). I added a log line indicating that the skaffold process has received the signal. Even though the signal is received after we print the Ctrl+C note, it still hangs sometimes:

time="2021-09-23T09:58:55-07:00" level=info msg="Still waiting for pods [getting-started]"
time="2021-09-23T09:58:55-07:00" level=info msg="Still waiting for pods [getting-started]"
Waiting for deployments to stabilize...
time="2021-09-23T09:58:55-07:00" level=info msg="Still waiting for pods [getting-started]"
Deployments stabilized in 12.129231ms
Press Ctrl+C to exit
Watching for changes...
time="2021-09-23T09:58:56-07:00" level=info msg="Pods marked as ready: map[getting-started:true]"
time="2021-09-23T09:58:56-07:00" level=warning msg="received signal: interrupt" subtask=-1 task=DevLoop
[..hangs..]
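For reference, a rough, hypothetical sketch of what the scenario looks like from the test's side (illustrative flags and timings, not the actual integration test code): start `skaffold dev`, send SIGINT once the dev loop is up, and give the process a deadline to exit.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"time"
)

func main() {
	// Start `skaffold dev` roughly the way the integration test does.
	cmd := exec.Command("skaffold", "dev", "--namespace", "testns")
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	if err := cmd.Start(); err != nil {
		panic(err)
	}

	// The real test waits for dev-loop output before cancelling; a fixed
	// sleep stands in for that here.
	time.Sleep(30 * time.Second)

	// Send the interrupt that should trigger graceful cancellation.
	if err := cmd.Process.Signal(os.Interrupt); err != nil {
		panic(err)
	}

	// Fail if skaffold does not exit within a deadline; in the flake it
	// never exits, so CI kills the job after 10 minutes of no output.
	done := make(chan error, 1)
	go func() { done <- cmd.Wait() }()
	select {
	case err := <-done:
		fmt.Println("skaffold exited:", err)
	case <-time.After(10 * time.Minute):
		fmt.Println("skaffold hung after SIGINT")
	}
}
```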

ahmetb (Contributor) commented Sep 23, 2021

More debugging:
In dev.go the statement `defer r.deployer.GetLogger().Stop()` gets stuck.
It calls Stop() on LoggerMux,
which in turn calls Stop() on the Kubernetes LogAggregator.
In that method, this statement specifically blocks:

a.podWatcher.Deregister(a.events)

Deregister is implemented in watcher.go; I suspect it's the w.receiverLock.Lock() statement there that actually blocks.

Here's a stack trace dump that shows it's indeed stuck there.
There's another goroutine in podWatcher.Start() that also calls w.receiverLock.Lock().
Presumably its for-loop is stuck sending to a channel (receiver <- ...) while holding the lock, so it never unlocks.

SIGABRT: abort
PC=0x7fff204c1cde m=0 sigcode=0

goroutine 0 [idle]:
runtime.pthread_cond_wait(0x81631c0, 0x8163180)
	/usr/local/Cellar/go/1.17/libexec/src/runtime/sys_darwin.go:393 +0x34
runtime.semasleep(0xffffffffffffffff)
	/usr/local/Cellar/go/1.17/libexec/src/runtime/os_darwin.go:66 +0x93
runtime.notesleep(0x8162f70)
	/usr/local/Cellar/go/1.17/libexec/src/runtime/lock_sema.go:182 +0x85
runtime.mPark()
	/usr/local/Cellar/go/1.17/libexec/src/runtime/proc.go:1441 +0x2a
runtime.stoplockedm()
	/usr/local/Cellar/go/1.17/libexec/src/runtime/proc.go:2602 +0x7b
runtime.schedule()
	/usr/local/Cellar/go/1.17/libexec/src/runtime/proc.go:3299 +0x3d
runtime.park_m(0xc000622b60)
	/usr/local/Cellar/go/1.17/libexec/src/runtime/proc.go:3516 +0x12d
runtime.mcall()
	/usr/local/Cellar/go/1.17/libexec/src/runtime/asm_amd64.s:307 +0x43

goroutine 1 [semacquire, 5 minutes]:
sync.runtime_SemacquireMutex(0xc000401a70, 0x38, 0x65559bd9)
	/usr/local/Cellar/go/1.17/libexec/src/runtime/sema.go:71 +0x25
sync.(*Mutex).lockSlow(0xc0002d8958)
	/usr/local/Cellar/go/1.17/libexec/src/sync/mutex.go:138 +0x271
sync.(*Mutex).Lock(0xc0002d8958)
	/usr/local/Cellar/go/1.17/libexec/src/sync/mutex.go:81 +0x51
github.com/GoogleContainerTools/skaffold/pkg/skaffold/kubernetes.(*podWatcher).Deregister(0xc0002d8940, 0xc00081e180)
	/Users/ahmetb/workspace/skaffold/pkg/skaffold/kubernetes/watcher.go:65 +0x2e
github.com/GoogleContainerTools/skaffold/pkg/skaffold/kubernetes/logger.(*LogAggregator).Stop(0xc000e9a0b0)
	/Users/ahmetb/workspace/skaffold/pkg/skaffold/kubernetes/logger/log.go:177 +0x32a
github.com/GoogleContainerTools/skaffold/pkg/skaffold/log.LoggerMux.Stop({0xc000d61140, 0x1, 0x1})
	/Users/ahmetb/workspace/skaffold/pkg/skaffold/log/logger_mux.go:43 +0x274
github.com/GoogleContainerTools/skaffold/pkg/skaffold/runner/v1.(*SkaffoldRunner).Dev.func9()
	/Users/ahmetb/workspace/skaffold/pkg/skaffold/runner/v1/dev.go:328 +0x204
github.com/GoogleContainerTools/skaffold/pkg/skaffold/runner/v1.(*SkaffoldRunner).Dev(0xc000bb8160, {0x6f18150, 0xc0002d4480}, {0x6ed6c60, 0xc000bd8a50}, {0xc0003188c8, 0x1, 0x1})
	/Users/ahmetb/workspace/skaffold/pkg/skaffold/runner/v1/dev.go:371 +0x277a
github.com/GoogleContainerTools/skaffold/cmd/skaffold/app/cmd.runDev.func5({0x6f5bbc0, 0xc000bb8160}, {0xc000dce040, 0x1, 0x1})
	/Users/ahmetb/workspace/skaffold/cmd/skaffold/app/cmd/dev.go:73 +0x1bc
github.com/GoogleContainerTools/skaffold/cmd/skaffold/app/cmd.withRunner({0x6f18150, 0xc0002d4480}, {0x6ed6c60, 0xc000bd8a50}, 0xc000c0f548)
	/Users/ahmetb/workspace/skaffold/cmd/skaffold/app/cmd/runner.go:55 +0x20d
github.com/GoogleContainerTools/skaffold/cmd/skaffold/app/cmd.runDev({0x6f18150, 0xc0002d4480}, {0x6ed6c60, 0xc000bd8a50})
	/Users/ahmetb/workspace/skaffold/cmd/skaffold/app/cmd/dev.go:67 +0x385
github.com/GoogleContainerTools/skaffold/cmd/skaffold/app/cmd.(*builder).NoArgs.func1(0x0, {0x4, 0xc000c0f680, 0xc00035e0a0})
	/Users/ahmetb/workspace/skaffold/cmd/skaffold/app/cmd/commands.go:135 +0x9a
github.com/GoogleContainerTools/skaffold/cmd/skaffold/app/cmd.applyPostRunHooks.func1(0xc0007262c0, {0xc000bc3380, 0x0, 0x4})
	/Users/ahmetb/workspace/skaffold/cmd/skaffold/app/cmd/commands.go:164 +0x6e
github.com/spf13/cobra.(*Command).execute(0xc0007262c0, {0xc000bc3340, 0x4, 0x4})
	/Users/ahmetb/go/pkg/mod/github.com/spf13/[email protected]/command.go:856 +0xaf0
github.com/spf13/cobra.(*Command).ExecuteC(0xc00071c000)
	/Users/ahmetb/go/pkg/mod/github.com/spf13/[email protected]/command.go:974 +0x8b3
github.com/spf13/cobra.(*Command).Execute(0xc00071c000)
	/Users/ahmetb/go/pkg/mod/github.com/spf13/[email protected]/command.go:902 +0x2f
github.com/spf13/cobra.(*Command).ExecuteContext(0xc00071c000, {0x6f18150, 0xc0002d4480})
	/Users/ahmetb/go/pkg/mod/github.com/spf13/[email protected]/command.go:895 +0x73
github.com/GoogleContainerTools/skaffold/cmd/skaffold/app.Run({0x6ed5e80, 0xc00012c008}, {0x6ed5e80, 0xc00012c010})
	/Users/ahmetb/workspace/skaffold/cmd/skaffold/app/skaffold.go:47 +0x492
main.main()
	/Users/ahmetb/workspace/skaffold/cmd/skaffold/skaffold.go:50 +0x2bd

goroutine 21 [select]:
go.opencensus.io/stats/view.(*worker).start(0xc000160a00)
	/Users/ahmetb/go/pkg/mod/[email protected]/stats/view/worker.go:276 +0x135
created by go.opencensus.io/stats/view.init.0
	/Users/ahmetb/go/pkg/mod/[email protected]/stats/view/worker.go:34 +0xb6

goroutine 22 [chan receive]:
k8s.io/klog/v2.(*loggingT).flushDaemon(0x8161ce0)
	/Users/ahmetb/go/pkg/mod/k8s.io/klog/[email protected]/klog.go:1164 +0x5b
created by k8s.io/klog/v2.init.0
	/Users/ahmetb/go/pkg/mod/k8s.io/klog/[email protected]/klog.go:418 +0x128

goroutine 7 [chan receive]:
github.com/golang/glog.(*loggingT).flushDaemon(0x8161b40)
	/Users/ahmetb/go/pkg/mod/github.com/golang/[email protected]/glog.go:882 +0x5b
created by github.com/golang/glog.init.0
	/Users/ahmetb/go/pkg/mod/github.com/golang/[email protected]/glog.go:410 +0x1a8

goroutine 9 [chan receive, 5 minutes]:
github.com/GoogleContainerTools/skaffold/pkg/skaffold/event/v2.newHandler.func1()
	/Users/ahmetb/workspace/skaffold/pkg/skaffold/event/v2/event.go:59 +0x4b
created by github.com/GoogleContainerTools/skaffold/pkg/skaffold/event/v2.newHandler
	/Users/ahmetb/workspace/skaffold/pkg/skaffold/event/v2/event.go:57 +0x1ca

goroutine 10 [chan receive, 5 minutes]:
github.com/GoogleContainerTools/skaffold/pkg/skaffold/event.newHandler.func1()
	/Users/ahmetb/workspace/skaffold/pkg/skaffold/event/event.go:60 +0x45
created by github.com/GoogleContainerTools/skaffold/pkg/skaffold/event.newHandler
	/Users/ahmetb/workspace/skaffold/pkg/skaffold/event/event.go:58 +0x18a

goroutine 11 [chan receive, 5 minutes]:
github.com/rjeczalik/notify.(*recursiveTree).dispatch(0xc000317400)
	/Users/ahmetb/go/pkg/mod/github.com/rjeczalik/[email protected]/tree_recursive.go:125 +0x51
created by github.com/rjeczalik/notify.newRecursiveTree
	/Users/ahmetb/go/pkg/mod/github.com/rjeczalik/[email protected]/tree_recursive.go:119 +0x25b

goroutine 12 [syscall, 5 minutes, locked to thread]:
github.com/rjeczalik/notify._Cfunc_CFRunLoopRun()
	_cgo_gotypes.go:229 +0x45
github.com/rjeczalik/notify.init.1.func1()
	/Users/ahmetb/go/pkg/mod/github.com/rjeczalik/[email protected]/watcher_fsevents_cgo.go:72 +0x45
created by github.com/rjeczalik/notify.init.1
	/Users/ahmetb/go/pkg/mod/github.com/rjeczalik/[email protected]/watcher_fsevents_cgo.go:64 +0x33

goroutine 37 [syscall, 5 minutes]:
os/signal.signal_recv()
	/usr/local/Cellar/go/1.17/libexec/src/runtime/sigqueue.go:166 +0x28
os/signal.loop()
	/usr/local/Cellar/go/1.17/libexec/src/os/signal/signal_unix.go:24 +0x1d
created by os/signal.Notify.func1.1
	/usr/local/Cellar/go/1.17/libexec/src/os/signal/signal.go:151 +0x30

goroutine 72 [chan send, 5 minutes]:
github.com/GoogleContainerTools/skaffold/pkg/skaffold/kubernetes.(*podWatcher).Start.func5()
	/Users/ahmetb/workspace/skaffold/pkg/skaffold/kubernetes/watcher.go:151 +0xa25
created by github.com/GoogleContainerTools/skaffold/pkg/skaffold/kubernetes.(*podWatcher).Start
	/Users/ahmetb/workspace/skaffold/pkg/skaffold/kubernetes/watcher.go:99 +0x8de

goroutine 46 [IO wait, 5 minutes]:
internal/poll.runtime_pollWait(0x6ed89a0, 0x72)
	/usr/local/Cellar/go/1.17/libexec/src/runtime/netpoll.go:229 +0x45
internal/poll.(*pollDesc).wait(0xc000d65218, 0x72, 0x0)
	/usr/local/Cellar/go/1.17/libexec/src/internal/poll/fd_poll_runtime.go:84 +0x88
internal/poll.(*pollDesc).waitRead(0xc000d65218, 0x0)
	/usr/local/Cellar/go/1.17/libexec/src/internal/poll/fd_poll_runtime.go:89 +0x37
internal/poll.(*FD).Read(0xc000d65200, {0xc000dc3000, 0x1000, 0x1000})
	/usr/local/Cellar/go/1.17/libexec/src/internal/poll/fd_unix.go:167 +0x453
net.(*netFD).Read(0xc000d65200, {0xc000dc3000, 0x1000, 0x1000})
	/usr/local/Cellar/go/1.17/libexec/src/net/fd_posix.go:56 +0x73
net.(*conn).Read(0xc000011618, {0xc000dc3000, 0x1000, 0x1000})
	/usr/local/Cellar/go/1.17/libexec/src/net/net.go:183 +0xa9
net/http.(*persistConn).Read(0xc000dcd320, {0xc000dc3000, 0x1000, 0x1000})
	/usr/local/Cellar/go/1.17/libexec/src/net/http/transport.go:1926 +0x1dc
bufio.(*Reader).fill(0xc000da1260)
	/usr/local/Cellar/go/1.17/libexec/src/bufio/bufio.go:101 +0x294
bufio.(*Reader).Peek(0xc000da1260, 0x1)
	/usr/local/Cellar/go/1.17/libexec/src/bufio/bufio.go:139 +0x178
net/http.(*persistConn).readLoop(0xc000dcd320)
	/usr/local/Cellar/go/1.17/libexec/src/net/http/transport.go:2087 +0x285
created by net/http.(*Transport).dialConn
	/usr/local/Cellar/go/1.17/libexec/src/net/http/transport.go:1747 +0x27f6

goroutine 47 [select, 5 minutes]:
net/http.(*persistConn).writeLoop(0xc000dcd320)
	/usr/local/Cellar/go/1.17/libexec/src/net/http/transport.go:2386 +0x17f
created by net/http.(*Transport).dialConn
	/usr/local/Cellar/go/1.17/libexec/src/net/http/transport.go:1748 +0x2865

rax    0x104
rbx    0xbdd7e00
rcx    0x7ffeefbff088
rdx    0x4000
rdi    0x81631c0
rsi    0x400100004100
rbp    0x7ffeefbff130
rsp    0x7ffeefbff088
r8     0x0
r9     0xa0
r10    0x0
r11    0x246
r12    0x8163180
r13    0x16
r14    0x400100004100
r15    0x4000
rip    0x7fff204c1cde
rflags 0x247
cs     0x7
fs     0x0
gs     0x0

I have a branch here with a bunch of log statements; just run it with the integration test command above: https://github.com/ahmetb/skaffold-1/tree/debug-6424
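
To make the hang easier to reason about, here is a minimal, self-contained Go sketch of the deadlock pattern described above (hypothetical names, not skaffold's actual podWatcher code): one goroutine holds receiverLock while blocked on an unbuffered channel send, and Deregister blocks on the same lock.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// watcher mimics the structure described above: a mutex guarding a
// list of receiver channels, with Start fanning events out to them.
type watcher struct {
	receiverLock sync.Mutex
	receivers    []chan int
}

func (w *watcher) Register(ch chan int) {
	w.receiverLock.Lock()
	defer w.receiverLock.Unlock()
	w.receivers = append(w.receivers, ch)
}

func (w *watcher) Deregister(ch chan int) {
	w.receiverLock.Lock() // blocks: Start still holds the lock while stuck on a send
	defer w.receiverLock.Unlock()
	for i, r := range w.receivers {
		if r == ch {
			w.receivers = append(w.receivers[:i], w.receivers[i+1:]...)
			return
		}
	}
}

func (w *watcher) Start() {
	for ev := 0; ; ev++ {
		w.receiverLock.Lock()
		for _, r := range w.receivers {
			r <- ev // blocks forever once the receiver stops reading
		}
		w.receiverLock.Unlock()
		time.Sleep(100 * time.Millisecond)
	}
}

func main() {
	w := &watcher{}
	events := make(chan int)
	w.Register(events)
	go w.Start()

	<-events                           // read one event, then stop reading (like the logger after Ctrl+C)
	time.Sleep(300 * time.Millisecond) // let Start re-acquire the lock and block on the next send
	w.Deregister(events)               // never returns
	fmt.Println("unreachable")
}
```

Running this delivers one event and then aborts with Go's "all goroutines are asleep - deadlock!" report: one goroutine in semacquire on the mutex and one stuck on a chan send, matching the pair of goroutines in the dump above.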

halvards added a commit to halvards/skaffold that referenced this issue Sep 28, 2021
Make the Kubernetes pod watcher context aware, and select on
context.Done(). This means we can stop waiting, and acting on, pod
events when the context has been cancelled.

This should help avoid deadlocks that can occur when pod watcher event
receivers stop reading from the channel that they've registered with the
pod watcher.

I'm not sure it'll completely eliminate the problem though. What if a
receiver stops reading (because context is cancelled) while the pod
watcher is in the middle of sending pod events to all the receivers?

Fixes: GoogleContainerTools#6424
halvards added a commit to halvards/skaffold that referenced this issue Sep 29, 2021
Make the Kubernetes pod watcher context aware, and select on
`context.Done()`. This means we can stop waiting, and acting on, pod
events when the context has been cancelled.

Remove waiting on `context.Done()` in the Kubernetes log aggregator,
container manager, and pod port forwarder. This is to eliminate the
chance that the pod watcher sends a `PodEvent` on a channel without a
waiting receiver.

This should help avoid deadlocks that can occur when pod watcher event
receivers stop reading from the channel that they've registered with the
pod watcher.

We still close the channels on the receiver side, which could increase
the chances of regression and re-occurrence of this issue.

Also uses a RWMutex in the pod watcher, though we could move this change
to a separate commit.

Fixes: GoogleContainerTools#6424
MarlonGamez pushed a commit that referenced this issue Sep 29, 2021
Make the Kubernetes pod watcher context aware, and select on
`context.Done()`. This means we can stop waiting, and acting on, pod
events when the context has been cancelled.

Remove waiting on `context.Done()` in the Kubernetes log aggregator,
container manager, and pod port forwarder. This is to eliminate the
chance that the pod watcher sends a `PodEvent` on a channel without a
waiting receiver.

This should help avoid deadlocks that can occur when pod watcher event
receivers stop reading from the channel that they've registered with the
pod watcher.

We still close the channels on the receiver side, which could increase
the chances of regression and re-occurrence of this issue.

Also uses a RWMutex in the pod watcher, though we could move this change
to a separate commit.

Fixes: #6424
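
For illustration, a minimal sketch of the shape of this fix (hypothetical types, not skaffold's actual watcher code), reusing the fan-out structure from the deadlock sketch above: every send now also selects on ctx.Done(), so cancelling the context unblocks the sender and lets Deregister acquire the lock.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

type watcher struct {
	mu        sync.RWMutex // the commit also switches the pod watcher to a RWMutex
	receivers []chan int
}

func (w *watcher) Register(ch chan int) {
	w.mu.Lock()
	defer w.mu.Unlock()
	w.receivers = append(w.receivers, ch)
}

func (w *watcher) Deregister(ch chan int) {
	w.mu.Lock()
	defer w.mu.Unlock()
	for i, r := range w.receivers {
		if r == ch {
			w.receivers = append(w.receivers[:i], w.receivers[i+1:]...)
			return
		}
	}
}

// Start fans events out to receivers, but every send also selects on
// ctx.Done(), so a cancelled context unblocks the send even if the
// receiver has stopped reading.
func (w *watcher) Start(ctx context.Context) {
	for ev := 0; ; ev++ {
		select {
		case <-ctx.Done():
			return
		case <-time.After(100 * time.Millisecond): // stands in for a pod event arriving
		}
		w.mu.RLock()
		for _, r := range w.receivers {
			select {
			case r <- ev: // receiver is reading: event delivered
			case <-ctx.Done(): // receiver gave up: cancellation unblocks the send
				w.mu.RUnlock()
				return
			}
		}
		w.mu.RUnlock()
	}
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	w := &watcher{}
	events := make(chan int)
	w.Register(events)
	go w.Start(ctx)

	<-events             // read one event, then stop reading, as in the hang above
	cancel()             // cancellation unblocks the watcher's pending send
	w.Deregister(events) // now acquires the lock instead of deadlocking
	fmt.Println("watcher stopped cleanly")
}
```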
halvards added a commit to halvards/skaffold that referenced this issue Oct 6, 2021
Adding log statements to better understand what's happening in kokoro.
Unfortunately I haven't been able to reproduce the problem locally (on
Linux) so far.

Related: GoogleContainerTools#6424, GoogleContainerTools#6643, GoogleContainerTools#6662
halvards added a commit to halvards/skaffold that referenced this issue Oct 7, 2021
Debugging the flaky `TestDevGracefulCancel/multi-config-microservices`
integration test showed that the kubectl port forwarder was stuck with
goroutines waiting on channels (one per resource).

Search for `goroutine 235` and `goroutine 234` in this Kokoro log:
https://source.cloud.google.com/results/invocations/a9749ab5-8762-4319-a2be-f67c7440f7a2/targets/skaffold%2Fpresubmit/log

This change means that the forwarder also listens for context cancellation.

**Related**: GoogleContainerTools#6424, GoogleContainerTools#6643, GoogleContainerTools#6662, GoogleContainerTools#6685
briandealwis pushed a commit that referenced this issue Oct 7, 2021
Debugging the flaky `TestDevGracefulCancel/multi-config-microservices`
integration test showed that the kubectl port forwarder was stuck with
goroutines waiting on channels (one per resource).

Search for `goroutine 235` and `goroutine 234` in this Kokoro log:
https://source.cloud.google.com/results/invocations/a9749ab5-8762-4319-a2be-f67c7440f7a2/targets/skaffold%2Fpresubmit/log

This change means that the forwarder also listens for context cancellation.

**Related**: #6424, #6643, #6662, #6685