Fatal error from Firehose causes nozzle to spin #112
Comments
We used to crash with a null pointer dereference when we failed to keep up with the firehose. When this happens we get a fatal error from the firehose and a ton of null messages sent to `ReceiveMessage`. This code was added to catch the null messages and raise an error, which goes through the normal errors channel and is logged. Now that we aren't crashing, we rely on the fatal error from loggregator (via this channel) to shut down the nozzle properly. Sometimes it does, and sometimes it doesn't. When it doesn't, we can attach to the spinning nozzle and observe a few things:
This means we are telling the nozzle to stop multiple times after this goroutine has shut down. In this state the nozzle appears alive to BOSH, but no logs or metrics are sent to Stackdriver.
The done channel was being left open after the consumer drained the done message. Calling the Stop() method twice in this state would cause the program to hang while trying to send done <- struct{}{}. Change the nozzle to keep track of its running state and to return an error when it is stopped incorrectly. See cloudfoundry-community#112 for more detail.
The firehose is hitting a fatal error and seems to try to shut down the nozzle, but the nozzle process does not exit; it just spins idle.
Example error:
The app either needs to retry the connection or exit. The issue is related to #107 which has the symptom of no metrics/logs being reported.