
Fatal error from Firehose causes nozzle to spin #112

Closed
johnsonj opened this issue Sep 11, 2017 · 1 comment

Comments

@johnsonj
Contributor

The firehose is hitting a fatal error and appears to try to shut down the nozzle, but the nozzle process does not exit; it just spins idle.

Example error:

{"timestamp":"1505114919.113666058","source":"stackdriver-nozzle","message":"stackdriver-nozzle.firehose","log_level":3,"data":{"cleanup":"The metrics buffer was successfully flushed before shutdown","error":"websocket: close 1008 (policy violation): Client did not respond to ping before keep-alive timeout expired.","trace":"goroutine 1 [running]:\ngithub.com/cloudfoundry-community/stackdriver-tools/src/stackdriver-nozzle/vendor/github.com/cloudfoundry/lager.(*logger).Fatal(0xc420152240, 0xa37f5f, 0x8, 0xd94ac0, 0xc4203618a0, 0xc420145370, 0x1, 0x1)\n\t/home/jrjohnson/dev/src/github.com/cloudfoundry-community/stackdriver-tools/src/stackdriver-nozzle/vendor/github.com/cloudfoundry/lager/logger.go:132 +0xca\nmain.main()\n\t/home/jrjohnson/dev/src/github.com/cloudfoundry-community/stackdriver-tools/src/stackdriver-nozzle/main.go:61 +0x620\n"}}

The app either needs to retry the connection or exit. This issue is related to #107, which has the symptom of no metrics/logs being reported.
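
For illustration, a hedged sketch of those two options (connectFirehose and isRetryable are hypothetical stand-ins, not the nozzle's actual code):

```go
// Hypothetical sketch of the two options, not the nozzle's actual main.go:
// on a fatal firehose error either exit so BOSH restarts the job, or retry
// the connection with backoff. connectFirehose and isRetryable are stand-ins.
package main

import (
	"errors"
	"log"
	"strings"
	"time"
)

func connectFirehose() <-chan error {
	// hypothetical stub: the real nozzle opens the firehose websocket here
	errs := make(chan error, 1)
	errs <- errors.New("websocket: close 1008 (policy violation): keep-alive timeout")
	return errs
}

func isRetryable(err error) bool {
	// hypothetical: treat keep-alive timeouts as worth reconnecting for
	return strings.Contains(err.Error(), "keep-alive")
}

func main() {
	backoff := time.Second
	for {
		errs := connectFirehose()
		err := <-errs
		if !isRetryable(err) {
			// Option 1: exit loudly so BOSH restarts the nozzle.
			log.Fatalf("fatal firehose error: %v", err)
		}
		// Option 2: retry the connection instead of spinning idle.
		log.Printf("firehose error: %v; reconnecting in %s", err, backoff)
		time.Sleep(backoff)
		if backoff < time.Minute {
			backoff *= 2
		}
	}
}
```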

@johnsonj
Contributor Author

We used to crash via a nil pointer dereference when we failed to keep up with the firehose. When this happens we get a fatal error from the firehose and a flood of nil messages sent to 'ReceiveMessage'. This code was added to catch the nil messages and raise an error, which would go through the normal errors channel and be logged.
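
Roughly, the shape of that path looks like this (a sketch with illustrative names, not the actual nozzle code):

```go
// Rough sketch only: firehoseClient, Envelope, and the errs channel are
// illustrative names, not the actual stackdriver-nozzle types.
package firehose

import "errors"

type Envelope struct{ Origin string }

type firehoseClient interface {
	// ReceiveMessage starts returning nil envelopes once loggregator has
	// given up on a slow consumer.
	ReceiveMessage() (*Envelope, error)
}

func pump(client firehoseClient, messages chan<- *Envelope, errs chan<- error) {
	for {
		env, err := client.ReceiveMessage()
		if err != nil {
			// Fatal errors from loggregator (like the keep-alive timeout above)
			// travel down the same errors channel and get logged by main.
			errs <- err
			return
		}
		if env == nil {
			// Previously this nil crashed the nozzle; now it is surfaced as an
			// error through the normal errors channel instead.
			errs <- errors.New("firehose delivered a nil message")
			return
		}
		messages <- env
	}
}
```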

Now that we aren't crashing, we are relying on the fatal error from loggregator (via this channel) to shut the nozzle down properly, which it sometimes does and sometimes doesn't.

When it doesn't, we can attach to the spinning nozzle and observe a few things:

  1. The heartbeater is still running. This is visible in the console and with a quick look at the goroutines:
    (gdb) info goroutine
    1 waiting  runtime.gopark
    ...
    * 21 syscall  runtime.notetsleepg
    ...
    
  2. The main goroutine (goroutine 1) is blocked trying to send the done <- struct{}{} message:
(gdb) goroutine 1 bt
#0  runtime.gopark (unlockf={void (struct runtime.g *, void *, bool *)} 0xc4204dfa98, lock=0xc4203fb7f8, reason="chan send", 
    traceEv=22 '\026', traceskip=3) at /home/jrjohnson/.gvm/gos/go1.9/src/runtime/proc.go:278
#1  0x000000000042f00e in runtime.goparkunlock (lock=0xc4203fb7f8, reason="chan send", traceEv=22 '\026', traceskip=3)
    at /home/jrjohnson/.gvm/gos/go1.9/src/runtime/proc.go:283
#2  0x00000000004056ab in runtime.chansend (c=0xc4203fb7a0, ep=0xc4204dfbc0, block=true, callerpc=9289424, ~r4=224)
    at /home/jrjohnson/.gvm/gos/go1.9/src/runtime/chan.go:222
#3  0x0000000000405433 in runtime.chansend1 (c=0xc4203fb7a0, elem=0xc4204dfbc0) at /home/jrjohnson/.gvm/gos/go1.9/src/runtime/chan.go:113
>>> #4  0x00000000008dbed0 in github.com/cloudfoundry-community/stackdriver-tools/src/stackdriver-nozzle/nozzle.(*Nozzle).Stop (n=0xc4200964c0)
    at /home/jrjohnson/dev/src/github.com/cloudfoundry-community/stackdriver-tools/src/stackdriver-nozzle/nozzle/nozzle.go:74
#5  0x000000000045affb in runtime.call32 () at /home/jrjohnson/.gvm/gos/go1.9/src/runtime/asm_amd64.s:509
#6  0x000000000042cc73 in runtime.gopanic (e=...) at /home/jrjohnson/.gvm/gos/go1.9/src/runtime/panic.go:491
#7  0x00000000008becf3 in github.com/cloudfoundry-community/stackdriver-tools/src/stackdriver-nozzle/vendor/github.com/cloudfoundry/lager.(*logger).Fatal (l=0xc420160240, action="firehose", err=..., 
    data= []github.com/cloudfoundry-community/stackdriver-tools/src/stackdriver-nozzle/vendor/github.com/cloudfoundry/lager.Data = {...})
    at /home/jrjohnson/dev/src/github.com/cloudfoundry-community/stackdriver-tools/src/stackdriver-nozzle/vendor/github.com/cloudfoundry/lager/logger.go:154
#8  0x00000000008dd5c0 in main.main ()
    at /home/jrjohnson/dev/src/github.com/cloudfoundry-community/stackdriver-tools/src/stackdriver-nozzle/main.go:61

This means we are telling the nozzle to stop multiple times after the goroutine that drains the done channel has already shut down.

This state causes the nozzle to appear alive to BOSH but prevents any logs/metrics from being sent to Stackdriver.
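
The hang can be reproduced in isolation; here is a minimal standalone sketch (not the actual nozzle code) of a Stop() that sends on an unbuffered done channel whose only receiver has already exited:

```go
// Minimal standalone reproduction of the hang (not the actual nozzle code):
// the worker drains one done message and exits, so a second send on the
// unbuffered channel blocks forever.
package main

import (
	"fmt"
	"time"
)

type nozzle struct {
	done chan struct{}
}

func (n *nozzle) Start() {
	go func() {
		<-n.done // worker exits after draining the first done message
		fmt.Println("worker stopped")
	}()
}

func (n *nozzle) Stop() {
	// Blocks until someone receives; after the first Stop nothing is left
	// to receive, so a second call parks this goroutine in "chan send".
	n.done <- struct{}{}
}

func main() {
	n := &nozzle{done: make(chan struct{})}
	n.Start()
	n.Stop() // first call: worker receives and exits
	time.Sleep(100 * time.Millisecond)
	n.Stop() // second call: blocks forever, like goroutine 1 in the backtrace above
	// Here the runtime reports "all goroutines are asleep - deadlock!"; in the
	// real nozzle the heartbeater goroutine is still running, so the process
	// just spins and still looks healthy to BOSH.
}
```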

johnsonj added a commit to johnsonj/gcp-tools-release that referenced this issue Sep 13, 2017
The done channel was being left open after the consumer drained the done
message. Calling the Stop() method twice in this state would cause the
program to hang while trying to send done <- struct{}{}.

Change the nozzle to keep track of its running state and to return an error
when it is stopped incorrectly.

See cloudfoundry-community#112 for more detail
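
A hedged sketch of the change the commit describes (field and error names are illustrative): Stop() checks a running flag and returns an error instead of blocking when the nozzle is already stopped.

```go
// Sketch of the described fix, with illustrative names: track running state so
// a second Stop() returns an error instead of blocking on the done channel.
package nozzle

import (
	"errors"
	"sync"
)

type Nozzle struct {
	mu      sync.Mutex
	running bool
	done    chan struct{}
}

func (n *Nozzle) Start() {
	n.mu.Lock()
	defer n.mu.Unlock()
	n.running = true
	n.done = make(chan struct{})
	go n.run()
}

func (n *Nozzle) run() {
	<-n.done // exits once Stop() signals shutdown
}

func (n *Nozzle) Stop() error {
	n.mu.Lock()
	defer n.mu.Unlock()
	if !n.running {
		return errors.New("nozzle is not running")
	}
	n.running = false
	n.done <- struct{}{} // exactly one send per successful Stop, so this cannot hang
	return nil
}
```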