
Fatal error from Firehose causes nozzle to spin #112

Closed
johnsonj opened this issue Sep 11, 2017 · 1 comment

Comments

@johnsonj
Contributor

The firehose is hitting a fatal error and appears to try to shut down the nozzle, but the nozzle process does not exit; it just spins idle.

Example error:

{"timestamp":"1505114919.113666058","source":"stackdriver-nozzle","message":"stackdriver-nozzle.firehose","log_level":3,"data":{"cleanup":"The metrics buffer was successfully flushed before shutdown","error":"websocket: close 1008 (policy violation): Client did not respond to ping before keep-alive timeout expired.","trace":"goroutine 1 [running]:\ngithub.com/cloudfoundry-community/stackdriver-tools/src/stackdriver-nozzle/vendor/github.com/cloudfoundry/lager.(*logger).Fatal(0xc420152240, 0xa37f5f, 0x8, 0xd94ac0, 0xc4203618a0, 0xc420145370, 0x1, 0x1)\n\t/home/jrjohnson/dev/src/github.com/cloudfoundry-community/stackdriver-tools/src/stackdriver-nozzle/vendor/github.com/cloudfoundry/lager/logger.go:132 +0xca\nmain.main()\n\t/home/jrjohnson/dev/src/github.com/cloudfoundry-community/stackdriver-tools/src/stackdriver-nozzle/main.go:61 +0x620\n"}}

The app either needs to retry the connection or exit. This issue is related to #107, which has the symptom of no metrics/logs being reported.
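
For illustration, a hedged sketch of those two options (connectFirehose and isRetryable are hypothetical stand-ins, not the nozzle's actual code):

```go
// Hypothetical sketch of the two options, not the nozzle's actual main.go:
// on a fatal firehose error either exit so BOSH restarts the job, or retry
// the connection with backoff. connectFirehose and isRetryable are stand-ins.
package main

import (
	"errors"
	"log"
	"strings"
	"time"
)

func connectFirehose() <-chan error {
	// hypothetical stub: the real nozzle opens the firehose websocket here
	errs := make(chan error, 1)
	errs <- errors.New("websocket: close 1008 (policy violation): keep-alive timeout")
	return errs
}

func isRetryable(err error) bool {
	// hypothetical: treat keep-alive timeouts as worth reconnecting for
	return strings.Contains(err.Error(), "keep-alive")
}

func main() {
	backoff := time.Second
	for {
		errs := connectFirehose()
		err := <-errs
		if !isRetryable(err) {
			// Option 1: exit loudly so BOSH restarts the nozzle.
			log.Fatalf("fatal firehose error: %v", err)
		}
		// Option 2: retry the connection instead of spinning idle.
		log.Printf("firehose error: %v; reconnecting in %s", err, backoff)
		time.Sleep(backoff)
		if backoff < time.Minute {
			backoff *= 2
		}
	}
}
```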

@johnsonj
Contributor Author

We used to crash via a nil pointer dereference when we failed to keep up with the firehose. When this happens we get a fatal error from the firehose and a flood of nil messages sent to 'ReceiveMessage'. This code was added to catch the nil messages and raise an error, which would go through the normal errors channel and be logged.
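
Roughly, the shape of that path looks like this (a sketch with illustrative names, not the actual nozzle code):

```go
// Rough sketch only: firehoseClient, Envelope, and the errs channel are
// illustrative names, not the actual stackdriver-nozzle types.
package firehose

import "errors"

type Envelope struct{ Origin string }

type firehoseClient interface {
	// ReceiveMessage starts returning nil envelopes once loggregator has
	// given up on a slow consumer.
	ReceiveMessage() (*Envelope, error)
}

func pump(client firehoseClient, messages chan<- *Envelope, errs chan<- error) {
	for {
		env, err := client.ReceiveMessage()
		if err != nil {
			// Fatal errors from loggregator (like the keep-alive timeout above)
			// travel down the same errors channel and get logged by main.
			errs <- err
			return
		}
		if env == nil {
			// Previously this nil crashed the nozzle; now it is surfaced as an
			// error through the normal errors channel instead.
			errs <- errors.New("firehose delivered a nil message")
			return
		}
		messages <- env
	}
}
```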

Now that we aren't crashing, we are relying on the fatal error from loggregator (via this channel) to shut the nozzle down properly, which it sometimes does and sometimes doesn't.

When it doesn't, we can attach to the spinning nozzle and observe a few things:

  1. The heartbeater is still running. This is visible in the console and with a quick look at the goroutines:
    (gdb) info goroutine
    1 waiting  runtime.gopark
    ...
    * 21 syscall  runtime.notetsleepg
    ...
    
  2. The main goroutine (goroutine 1) is blocked trying to send the done <- struct{}{} message:
(gdb) goroutine 1 bt
#0  runtime.gopark (unlockf={void (struct runtime.g *, void *, bool *)} 0xc4204dfa98, lock=0xc4203fb7f8, reason="chan send", 
    traceEv=22 '\026', traceskip=3) at /home/jrjohnson/.gvm/gos/go1.9/src/runtime/proc.go:278
#1  0x000000000042f00e in runtime.goparkunlock (lock=0xc4203fb7f8, reason="chan send", traceEv=22 '\026', traceskip=3)
    at /home/jrjohnson/.gvm/gos/go1.9/src/runtime/proc.go:283
#2  0x00000000004056ab in runtime.chansend (c=0xc4203fb7a0, ep=0xc4204dfbc0, block=true, callerpc=9289424, ~r4=224)
    at /home/jrjohnson/.gvm/gos/go1.9/src/runtime/chan.go:222
#3  0x0000000000405433 in runtime.chansend1 (c=0xc4203fb7a0, elem=0xc4204dfbc0) at /home/jrjohnson/.gvm/gos/go1.9/src/runtime/chan.go:113
>>> #4  0x00000000008dbed0 in github.com/cloudfoundry-community/stackdriver-tools/src/stackdriver-nozzle/nozzle.(*Nozzle).Stop (n=0xc4200964c0)
    at /home/jrjohnson/dev/src/github.com/cloudfoundry-community/stackdriver-tools/src/stackdriver-nozzle/nozzle/nozzle.go:74
#5  0x000000000045affb in runtime.call32 () at /home/jrjohnson/.gvm/gos/go1.9/src/runtime/asm_amd64.s:509
#6  0x000000000042cc73 in runtime.gopanic (e=...) at /home/jrjohnson/.gvm/gos/go1.9/src/runtime/panic.go:491
#7  0x00000000008becf3 in github.com/cloudfoundry-community/stackdriver-tools/src/stackdriver-nozzle/vendor/github.com/cloudfoundry/lager.(*logger).Fatal (l=0xc420160240, action="firehose", err=..., 
    data= []github.com/cloudfoundry-community/stackdriver-tools/src/stackdriver-nozzle/vendor/github.com/cloudfoundry/lager.Data = {...})
    at /home/jrjohnson/dev/src/github.com/cloudfoundry-community/stackdriver-tools/src/stackdriver-nozzle/vendor/github.com/cloudfoundry/lager/logger.go:154
#8  0x00000000008dd5c0 in main.main ()
    at /home/jrjohnson/dev/src/github.com/cloudfoundry-community/stackdriver-tools/src/stackdriver-nozzle/main.go:61

This means we are telling the nozzle to stop multiple times after the goroutine that drains the done channel has already shut down.

This state causes the nozzle to appear alive to BOSH but prevents any logs/metrics from being sent to Stackdriver.
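
The hang can be reproduced in isolation; here is a minimal standalone sketch (not the actual nozzle code) of a Stop() that sends on an unbuffered done channel whose only receiver has already exited:

```go
// Minimal standalone reproduction of the hang (not the actual nozzle code):
// the worker drains one done message and exits, so a second send on the
// unbuffered channel blocks forever.
package main

import (
	"fmt"
	"time"
)

type nozzle struct {
	done chan struct{}
}

func (n *nozzle) Start() {
	go func() {
		<-n.done // worker exits after draining the first done message
		fmt.Println("worker stopped")
	}()
}

func (n *nozzle) Stop() {
	// Blocks until someone receives; after the first Stop nothing is left
	// to receive, so a second call parks this goroutine in "chan send".
	n.done <- struct{}{}
}

func main() {
	n := &nozzle{done: make(chan struct{})}
	n.Start()
	n.Stop() // first call: worker receives and exits
	time.Sleep(100 * time.Millisecond)
	n.Stop() // second call: blocks forever, like goroutine 1 in the backtrace above
	// Here the runtime reports "all goroutines are asleep - deadlock!"; in the
	// real nozzle the heartbeater goroutine is still running, so the process
	// just spins and still looks healthy to BOSH.
}
```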

johnsonj added a commit to johnsonj/gcp-tools-release that referenced this issue Sep 13, 2017
The done channel was being left open after the consumer drained the done
message. Calling the Stop() method twice in this state would cause the
program to hang while trying to send done <- struct{}{}.

Change the nozzle to keep track of its running state and to return an error
when it is stopped incorrectly.

See cloudfoundry-community#112 for more detail
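
A hedged sketch of the change the commit describes (field and error names are illustrative): Stop() checks a running flag and returns an error instead of blocking when the nozzle is already stopped.

```go
// Sketch of the described fix, with illustrative names: track running state so
// a second Stop() returns an error instead of blocking on the done channel.
package nozzle

import (
	"errors"
	"sync"
)

type Nozzle struct {
	mu      sync.Mutex
	running bool
	done    chan struct{}
}

func (n *Nozzle) Start() {
	n.mu.Lock()
	defer n.mu.Unlock()
	n.running = true
	n.done = make(chan struct{})
	go n.run()
}

func (n *Nozzle) run() {
	<-n.done // exits once Stop() signals shutdown
}

func (n *Nozzle) Stop() error {
	n.mu.Lock()
	defer n.mu.Unlock()
	if !n.running {
		return errors.New("nozzle is not running")
	}
	n.running = false
	n.done <- struct{}{} // exactly one send per successful Stop, so this cannot hang
	return nil
}
```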