nozzle: crash on MaxRetriesReached #151
Conversation
If we reach the max number of connection retries, we need to do something besides spin. Log them as fatal errors, which will cause a crash. Fixes cloudfoundry-community#149
/cc @knyar @fluffle
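For context, a minimal Go sketch of what crashing on exhausted retries could look like in a nozzle's firehose loop. It assumes the noaa consumer surfaces `consumer.ErrMaxRetriesReached` on its error channel; the doppler address, subscription ID, and token are placeholders, and the actual change in this PR may be structured differently.

```go
package main

import (
	"crypto/tls"
	"log"

	"github.com/cloudfoundry/noaa/consumer"
)

func main() {
	// Hypothetical doppler endpoint and token; the real nozzle builds these
	// from its configuration.
	c := consumer.New("wss://doppler.example.com:443", &tls.Config{}, nil)
	msgs, errs := c.Firehose("example-nozzle", "oauth-token")

	go func() {
		for err := range errs {
			if err == consumer.ErrMaxRetriesReached {
				// Crash instead of spinning once reconnect attempts are
				// exhausted; the platform restarts the nozzle with a fresh
				// retry budget.
				log.Fatalf("firehose: %v", err)
			}
			log.Printf("firehose error: %v", err)
		}
	}()

	for msg := range msgs {
		_ = msg // hand the envelope off to the Stackdriver batcher
	}
}
```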
If it fixes a bug then I guess this is fine, but why is there a connection retry limit? All you're doing when committing suicide with Fatal() is resetting that limit to zero. This seems like a band-aid to work around poor client behaviour.

The default retry limit is 1000: https://github.com/cloudfoundry/noaa/blob/master/consumer/async.go#L23
But it can be changed: https://github.com/cloudfoundry/noaa/blob/master/consumer/async.go#L47
Unfortunately we can't set it to e.g. 0 to have infinite retries: https://github.com/cloudfoundry/noaa/blob/master/consumer/async.go#L328

It looks like the retry counter is not being reset by a successful reconnection. A successful connection will also not reset the sleep time, which means that after 8 disconnects + reconnects -- for the other 992 reconnects -- the nozzle is waiting 1 minute between reconnects and effectively dropping firehose data during that time: https://github.com/cloudfoundry/noaa/blob/master/consumer/async.go#L239

This seems like a bug in the noaa consumer. I can see an argument for limiting the number of contiguous unsuccessful retries, but if the client library successfully reconnects, both the counter and retry delay should be reset.

Taking a step back, why is the nozzle getting disconnected from the firehose so regularly? ISTR seeing some form of websocket timeout errors in the nozzle's logs. There may be another bug here too...
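For illustration, a short Go sketch of raising the retry limit and of the backoff arithmetic behind the "8 disconnects, then a 1-minute wait" observation above. It assumes the setter linked at async.go#L47 is `Consumer.SetMaxRetryCount`, and that the reconnect delay starts at 500ms and doubles per failed attempt up to a one-minute cap; the endpoint is hypothetical and the real constants live in noaa's consumer/async.go.

```go
package main

import (
	"crypto/tls"
	"fmt"
	"time"

	"github.com/cloudfoundry/noaa/consumer"
)

func main() {
	// Hypothetical doppler endpoint; the nozzle reads this from its config.
	c := consumer.New("wss://doppler.example.com:443", &tls.Config{}, nil)

	// Assumed setter for the retry limit; 1000 is the documented default,
	// and 0 does not mean "retry forever".
	c.SetMaxRetryCount(1000)

	// Backoff arithmetic, assuming a 500ms starting delay that doubles per
	// failed reconnect and is capped at one minute: by roughly the 8th retry
	// the consumer waits a full minute between attempts.
	delay := 500 * time.Millisecond
	for attempt := 1; attempt <= 10; attempt++ {
		fmt.Printf("retry %2d: wait %v\n", attempt, delay)
		delay *= 2
		if delay > time.Minute {
			delay = time.Minute
		}
	}
}
```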
Reviewed 3 of 3 files at r1.
Thanks @fluffle! Agreed on the noaa bug; we need to investigate why this is happening.
The NOAA folks correctly pointed out that I'd missed the counters being reset in the callback run on connect. So something else is wrong, which is frustrating because the symptoms match counters not being reset. Having said that, with the latest nozzle builds we have on our PCF instance I can't see any websocket disconnections, so I wonder if the disconnects were an artifact of reducing the batch size to 1, and thus taking ~2-3 minutes to send all timeseries to Stackdriver.
A test with the batch size reduced to 1 did not result in the nozzle getting behind, but this may be because it is no longer able to post all the metrics. We have a work-around for now, so I guess I'll wait and see if we have any suicides.