
nozzle: crash on MaxRetriesReached #151

Conversation

johnsonj
Contributor

@johnsonj johnsonj commented Nov 8, 2017

If we reach the max number of connection retries, we need to do something
besides spin. Log these errors as fatal, which will cause the nozzle to crash.
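
A minimal sketch of the intent (the endpoint, token, and logger are placeholders, and it assumes the noaa consumer delivers an exported ErrMaxRetriesReached value on the error channel returned by Firehose; the real nozzle wires this through its own logging):

```go
package main

import (
	"crypto/tls"
	"log"

	"github.com/cloudfoundry/noaa/consumer"
)

func main() {
	// Placeholder endpoint and token for illustration only.
	c := consumer.New("wss://doppler.example.com:443", &tls.Config{InsecureSkipVerify: true}, nil)
	msgs, errs := c.Firehose("example-subscription", "example-oauth-token")

	go func() {
		for err := range errs {
			// ErrMaxRetriesReached is assumed to be exported by the noaa
			// consumer version in use; crash instead of spinning forever
			// once the consumer gives up reconnecting.
			if err == consumer.ErrMaxRetriesReached {
				log.Fatalf("firehose: %v", err)
			}
			log.Printf("firehose error: %v", err)
		}
	}()

	for msg := range msgs {
		_ = msg // hand envelopes to the nozzle's processing pipeline here
	}
}
```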

Fixes #149

/cc @knyar @fluffle



@fluffle
Collaborator

fluffle commented Nov 9, 2017

If it fixes a bug then I guess this is fine, but why is there a connection retry limit? All you're doing when committing suicide with Fatal() is resetting that limit to zero. This seems like a band-aid to work around poor client behaviour.

The default retry limit is 1000:

https://github.com/cloudfoundry/noaa/blob/master/consumer/async.go#L23

But it can be changed:

https://github.com/cloudfoundry/noaa/blob/master/consumer/async.go#L47

Unfortunately we can't set it to e.g. 0 to have infinite retries:

https://github.com/cloudfoundry/noaa/blob/master/consumer/async.go#L328
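
For illustration, roughly how a client would raise it (assuming the SetMaxRetryCount method at the line linked above; the Doppler URL and count are placeholders):

```go
package nozzle

import (
	"crypto/tls"

	"github.com/cloudfoundry/noaa/consumer"
)

// newFirehoseConsumer sketches where a nozzle could raise the reconnect
// limit. SetMaxRetryCount is assumed from the async.go line linked above.
func newFirehoseConsumer(dopplerURL string, skipSSLValidation bool) *consumer.Consumer {
	c := consumer.New(dopplerURL, &tls.Config{InsecureSkipVerify: skipSSLValidation}, nil)
	c.SetMaxRetryCount(5000) // default is 1000; 0 is not "retry forever"
	return c
}
```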

It looks like the retry counter is not being reset by a successful reconnection. A successful connection will also not reset the sleep time, which means that after 8 disconnects + reconnects -- for the other 992 reconnects -- the nozzle is waiting 1 minute between reconnects and effectively dropping firehose data during that time.

https://github.com/cloudfoundry/noaa/blob/master/consumer/async.go#L239
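
To put rough numbers on that, a sketch of the doubling delay (assuming the 500 ms initial delay and 1 minute cap that async.go defaults to):

```go
package nozzle

import "time"

// retryDelayAfter sketches the doubling backoff: 500ms, 1s, 2s, ... capped
// at one minute. With these assumed defaults the cap is reached around the
// 8th consecutive reconnect, which is where the "waiting 1 minute" figure
// above comes from.
func retryDelayAfter(attempt int) time.Duration {
	const (
		minDelay = 500 * time.Millisecond // assumed default
		maxDelay = time.Minute            // assumed default
	)
	d := minDelay
	for i := 0; i < attempt && d < maxDelay; i++ {
		d *= 2
	}
	if d > maxDelay {
		d = maxDelay
	}
	return d
}
```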

This seems like a bug in the noaa consumer. I can see an argument for limiting the number of contiguous unsuccessful retries, but if the client library successfully reconnects, both the counter and retry delay should be reset.

Taking a step back, why is the nozzle getting disconnected from the firehose so regularly? I seem to recall seeing some form of websocket timeout errors in the nozzle's logs. There may be another bug here too...


Review status: 0 of 3 files reviewed at latest revision, all discussions resolved, some commit checks failed.



@fluffle
Collaborator

fluffle commented Nov 9, 2017

LGTM, but I filed cloudfoundry/noaa#37 because we shouldn't have to do this.


Review status: 0 of 3 files reviewed at latest revision, all discussions resolved, some commit checks failed.



@fluffle
Collaborator

fluffle commented Nov 9, 2017

Reviewed 3 of 3 files at r1.
Review status: all files reviewed at latest revision, all discussions resolved, some commit checks failed.



@johnsonj
Contributor Author

johnsonj commented Nov 9, 2017

Thanks @fluffle! Agreed on the noaa bug and the need to investigate why this is happening.

@johnsonj johnsonj merged commit 8cfcb61 into cloudfoundry-community:develop Nov 9, 2017
@fluffle
Collaborator

fluffle commented Nov 16, 2017

The NOAA folks correctly pointed out that I'd missed the counters being reset in the callback run on connect. So something else is wrong, which is frustrating because the symptoms match counters not being reset.

Having said that, with the latest nozzle builds we have on our PCF instance, I can't see any websocket disconnections, so I wonder if the disconnects were an artifact of reducing the batch size to 1 and thus taking ~2-3 minutes to send all timeseries to Stackdriver.

@fluffle
Collaborator

fluffle commented Nov 16, 2017

A test with the batch size reduced to 1 did not result in the nozzle getting behind, but this may be because it is no longer able to post all the metrics. We have a work-around for now, so I guess I'll wait and see if we have any suicides.
