nozzle: crash on MaxRetriesReached #151
Conversation
If we reach the max number of connection retries, we need to do something besides spin. Log them as fatal errors, which will cause a crash. Fixes cloudfoundry-community#149
/cc @knyar @fluffle
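For context, a minimal Go sketch of what crashing on exhausted retries could look like in a nozzle's firehose loop. It assumes the noaa consumer surfaces `consumer.ErrMaxRetriesReached` on its error channel; the doppler address, subscription ID, and token are placeholders, and the actual change in this PR may be structured differently.

```go
package main

import (
	"crypto/tls"
	"log"

	"github.com/cloudfoundry/noaa/consumer"
)

func main() {
	// Hypothetical doppler endpoint and token; the real nozzle builds these
	// from its configuration.
	c := consumer.New("wss://doppler.example.com:443", &tls.Config{}, nil)
	msgs, errs := c.Firehose("example-nozzle", "oauth-token")

	go func() {
		for err := range errs {
			if err == consumer.ErrMaxRetriesReached {
				// Crash instead of spinning once reconnect attempts are
				// exhausted; the platform restarts the nozzle with a fresh
				// retry budget.
				log.Fatalf("firehose: %v", err)
			}
			log.Printf("firehose error: %v", err)
		}
	}()

	for msg := range msgs {
		_ = msg // hand the envelope off to the Stackdriver batcher
	}
}
```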
If it fixes a bug then I guess this is fine, but why is there a connection retry limit? All you're doing when committing suicide with Fatal() is resetting that limit to zero. This seems like a band-aid to work around poor client behaviour.

The default retry limit is 1000: https://github.com/cloudfoundry/noaa/blob/master/consumer/async.go#L23
But it can be changed: https://github.com/cloudfoundry/noaa/blob/master/consumer/async.go#L47
Unfortunately we can't set it to e.g. 0 to have infinite retries: https://github.com/cloudfoundry/noaa/blob/master/consumer/async.go#L328

It looks like the retry counter is not being reset by a successful reconnection. A successful connection will also not reset the sleep time, which means that after 8 disconnects + reconnects -- for the other 992 reconnects -- the nozzle is waiting 1 minute between reconnects and effectively dropping firehose data during that time: https://github.com/cloudfoundry/noaa/blob/master/consumer/async.go#L239

This seems like a bug in the noaa consumer. I can see an argument for limiting the number of contiguous unsuccessful retries, but if the client library successfully reconnects, both the counter and retry delay should be reset.

Taking a step back, why is the nozzle getting disconnected from the firehose so regularly? ISTR seeing some form of websocket timeout errors in the nozzle's logs. There may be another bug here too...
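For illustration, a short Go sketch of raising the retry limit and of the backoff arithmetic behind the "8 disconnects, then a 1-minute wait" observation above. It assumes the setter linked at async.go#L47 is `Consumer.SetMaxRetryCount`, and that the reconnect delay starts at 500ms and doubles per failed attempt up to a one-minute cap; the endpoint is hypothetical and the real constants live in noaa's consumer/async.go.

```go
package main

import (
	"crypto/tls"
	"fmt"
	"time"

	"github.com/cloudfoundry/noaa/consumer"
)

func main() {
	// Hypothetical doppler endpoint; the nozzle reads this from its config.
	c := consumer.New("wss://doppler.example.com:443", &tls.Config{}, nil)

	// Assumed setter for the retry limit; 1000 is the documented default,
	// and 0 does not mean "retry forever".
	c.SetMaxRetryCount(1000)

	// Backoff arithmetic, assuming a 500ms starting delay that doubles per
	// failed reconnect and is capped at one minute: by roughly the 8th retry
	// the consumer waits a full minute between attempts.
	delay := 500 * time.Millisecond
	for attempt := 1; attempt <= 10; attempt++ {
		fmt.Printf("retry %2d: wait %v\n", attempt, delay)
		delay *= 2
		if delay > time.Minute {
			delay = time.Minute
		}
	}
}
```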
Reviewed 3 of 3 files at r1.
Thanks @fluffle! Agreed on the noaa bug; we need to investigate why this is happening.
The NOAA folks correctly pointed out that I'd missed the counters being reset in the callback run on connect. So something else is wrong, which is frustrating because the symptoms match counters not being reset. Having said that, with the latest nozzle builds we have on our PCF instance I can't see any websocket disconnections, so I wonder if the disconnects were an artifact of reducing the batch size to 1, and thus taking ~2-3 minutes to send all timeseries to Stackdriver.
A test with the batch size reduced to 1 did not result in the nozzle getting behind, but this may be because it is no longer able to post all the metrics. We have a work-around for now, so I guess I'll wait and see if we have any suicides.