Handle expiration and auth errors on reconnects. #499
Conversation
In general we want to retry a connection that got an auth error, whether it's a violation or an expiration. However, we want to close the connection if we get the same error twice from the same server. Also replaced code that is now present in the jwt lib. Signed-off-by: Derek Collison <[email protected]>
LGTM - may want to have a test that shows the connection is closed if we get the same error twice.
Will add that test.
Should set max reconnects to a higher value or infinite for this test.
Signed-off-by: Derek Collison <[email protected]>
LGTM
// We should give up if we tried twice on this server and got the
// same error.
if nc.current.lastErr == err {
	defer nc.Close()
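To make the intent of the snippet above easier to follow, here is a minimal, self-contained sketch of the same check. The types and names (srv, conn, handleAuthError) are hypothetical stand-ins for the client's internal state, not the library's actual code.

```go
package main

// Hypothetical, simplified types standing in for the client's internal state.
type srv struct {
	lastErr error // last auth error seen while (re)connecting to this server
}

type conn struct {
	current *srv
	closed  bool
}

func (nc *conn) close() { nc.closed = true }

// handleAuthError mirrors the intent of the diff: retry after the first auth
// error from a server, but close the connection if the same server returns
// the identical error a second time.
func (nc *conn) handleAuthError(err error) {
	if nc.current.lastErr == err {
		nc.close()
		return
	}
	nc.current.lastErr = err
}
```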
Hey @kozlovic, should we be shutting down here or just removing this one server from the list? If there are more servers in the list, we probably should not give up.
Why not.
If I am connected to a cluster, and one server is bad but others are ok, we should be able to reconnect.
I meant yes, why not do that. But I can't really recall why we did this in the first place. What problem exactly were we trying to solve? Is there a situation where one server will always return an auth error while the other servers in the cluster do not (without it being a misconfig)? If auth is being updated on a server and a client connects while that server is still returning an error, we could end up removing it from the list of URLs even though the server, once updated, would have accepted the connection later on.
If I revoke your credentials in some form or fashion, I wanted the default behavior to not keep trying forever.
The expectation that a NATS client automatically reconnects through almost all scenarios is quite deeply entrenched; for example, quoting the docs (emphasis mine):
As a user, this reconnect behaviour has become just a normal expectation to have, even across languages; that's just how NATS clients work, in my experience. We're looking to add features that further entrench this, for example a fire-and-forget, async-error-producing Connect(). So I appreciate the need for dealing with revoked creds - and I think we already have a well understood method for doing that with sane defaults. The library will only attempt 60 reconnects by default, with jitter, (soon) backoffs and all that good stuff.

Would treating auth errors as connection errors, in the sense of the reconnect loop, not serve your needs? By default, after 60 connect failures - auth error or otherwise - the library will raise an async error. Those who do not wish to have this behaviour already understand the options and probably use them. If we wish to specifically support changing creds files we can do so with a stat check between retries.

The alternative is that a user who wants this forever-reconnect behaviour will have to basically reimplement a lot of the features of the package: tracking all subscribes that were made, recreating them, keeping hold of handles to queues and more. Effectively rewriting the entire flow logic of the package. We also already have a LOT of options, so adding an opt-out around this is probably not that good.
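As a rough illustration of the defaults referenced above, here is a hedged sketch using options that exist in the nats.go client; the URL and handler bodies are placeholders, not anything from this PR.

```go
package main

import (
	"log"
	"time"

	"github.com/nats-io/nats.go"
)

func main() {
	// Sketch of the defaults discussed above: a capped number of reconnect
	// attempts, with async errors surfaced to a handler instead of killing
	// the application.
	nc, err := nats.Connect("nats://demo.nats.io:4222",
		nats.MaxReconnects(60), // the default cap mentioned above
		nats.ReconnectWait(2*time.Second),
		nats.ErrorHandler(func(_ *nats.Conn, _ *nats.Subscription, err error) {
			// Async errors (auth or otherwise) arrive here.
			log.Printf("async error: %v", err)
		}),
		nats.ClosedHandler(func(_ *nats.Conn) {
			log.Println("connection closed: reconnect attempts exhausted")
		}),
	)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()
}
```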
The other thing is we need to start a doc on client changes - we should describe what the changes are, and why. This will help keep a task list of what needs doing, with perhaps a reference to the PR. I agree with RI that the current contract with the client is that we won't close.
I don't claim to understand all the scenarios where NATS is used, but I'm using the client on low-end IoT edge devices deployed in remote locations, where our only connection to them is NATS. Also, the application on the edge devices does a lot of different things like reading sensors, running rules, controlling stuff, etc. -- thus it needs to keep running; it can't keep restarting if there is a NATS connection error. So, I'm currently configuring the client with the following options:
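The actual option list from this comment is not reproduced above; as a purely illustrative, hedged sketch of that kind of setup (infinite reconnects with exponential backoff), something along these lines is possible with options available in current versions of the nats.go client:

```go
package main

import (
	"time"

	"github.com/nats-io/nats.go"
)

// connectEdge is a hedged sketch only; the exact options in the original
// comment are not shown, so these are illustrative choices, not the
// commenter's actual configuration.
func connectEdge(url string) (*nats.Conn, error) {
	return nats.Connect(url,
		nats.RetryOnFailedConnect(true), // keep trying even if the first connect fails
		nats.MaxReconnects(-1),          // never stop reconnecting
		nats.CustomReconnectDelay(func(attempts int) time.Duration {
			// Exponential backoff, capped at roughly 8.5 minutes, so a fleet
			// of edge devices does not flood the server.
			if attempts > 9 {
				attempts = 9
			}
			return time.Duration(1<<uint(attempts)) * time.Second
		}),
	)
}
```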
So the above covers most cases (like cellular communication errors, etc.). The exponential backoff keeps the edge devices from flooding the server if something does go wrong. But there is one case that is not handled: the NATS server may be started with wrong or missing auth information, and the current behavior is that the NATS connection on the edge devices is closed and quits retrying. When the server deployment error is discovered and fixed, the clients are stuck. I could exit the app and restart on NATS connection closed, but that is not desirable, as buffered publish messages are then lost and the normal operation of the edge system is affected (rules, etc.).

This is perhaps a little different type of distributed system than microservices, where a microservice is dispensable -- if one exits, you have a system that simply starts more. You also have more control over a microservice, as you control whatever is hosting it. For an IoT edge device, the only connection you have to it is NATS, and you can't just throw the edge application away if there are problems -- it is performing essential tasks.

My initial thought was to be able to use the nc object in various parts of the edge app, so having an nc object that persisted for the life of the app seems useful. However, it may be better to abstract all the NATS logic (much like you would a database interface) so the lifecycle of the connection can be better maintained. But if I have to re-create the nc object, then I lose the reconnect buffer containing published messages that have not been sent yet, subscriptions would have to be restarted, etc. (I'm new to NATS, so just thinking out loud -- this may or may not be relevant.)
The balance I want to achieve is between doing the right thing when it is a config error that is quickly fixed, and not promoting DOS scenarios where accounts can be revoked with a large number of connections that just keep trying to connect.
I understand why you need this @derekcollison, but what is a user supposed to do here? If it's a piece of software that should work forever - and there are many scenarios where this is desirable - we give them no choice AND we change the default behaviour in a way that makes this impossible to achieve AND we make it impossible to discover that this behaviour exists, as it's so orthogonal to everything else. Worse, if they want reconnect-style behaviour they have to reimplement many features already in the library.

Users who want reconnects - assuming they figured out this is going to happen - will either handle the situation and reconnect, or exit, restart and reconnect. They will come back anyway. Worse, they will come back at the bottom of their expo backoff counts, meaning many rapid connection attempts. So rather than helping, this is going to make it worse. Better to treat it as a reconnect based on the expressed desired reconnect behaviour and move to the slowest backoff possible.

Users who don't want this use our already safe defaults of capped reconnects. The default behaviour is safe for the server. For me the worst class of user is those who are unaware this massive behaviour change happened in a minor release and who will now have their fleet just die unexpectedly.
If we add some code to detect that a given server was restarted or config reloaded, and the client will then attempt again on that server, then the user will get the correct behavior without doing anything.
Is there one size that fits all here, or should it be configurable in the client? A typical microservice behavior might be to just exit and go away -- new ones will be spawned as necessary -- but remote IoT devices are bricked and the app needs to be restarted, losing functionality during this process. For now I'm doing the following -- not a huge deal for the first application I'm working on, as it can tolerate a restart, and 15m of lost connectivity is acceptable for the rare instances we misconfigure the server: https://github.com/simpleiot/simpleiot/blob/feature-nats/api/nats-edge.go#L52

nc.SetClosedHandler(func(_ *nats.Conn) {
	// Wait, then exit so the supervisor restarts the app and the connection
	// is rebuilt from scratch.
	log.Println("Connection to NATS is closed! -- this should never happen, waiting 15m then exiting")
	time.Sleep(15 * time.Minute)
	os.Exit(-1)
})
Detecting a server restart seems like a neat idea, but I'm having trouble visualizing it -- would the client continue to poll the server for changes to see if it restarted or was reconfigured? How is this different than just retrying the connect?
Might be the same, would need to look more closely. As for the ClosedHandler, why not try to reconnect there and, if successful, replace nc?
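A hedged sketch of what that suggestion might look like; the subject name and handlers are hypothetical, and, as the replies below note, every subscription (and any buffered publishes) on the old connection would have to be recreated by hand.

```go
package main

import (
	"log"

	"github.com/nats-io/nats.go"
)

var nc *nats.Conn

// installClosedHandler reconnects manually when the connection is closed and
// swaps in the new *nats.Conn. Hypothetical sketch only: the subject and
// resubscription logic stand in for whatever the application actually uses.
func installClosedHandler() {
	nc.SetClosedHandler(func(_ *nats.Conn) {
		newNC, err := nats.Connect(nats.DefaultURL)
		if err != nil {
			log.Printf("manual reconnect failed: %v", err)
			return
		}
		// Every application subscription must be re-established on the new
		// connection; the library will not carry them over.
		if _, err := newNC.Subscribe("sensors.>", func(m *nats.Msg) {
			log.Printf("got %q", m.Data)
		}); err != nil {
			log.Printf("resubscribe failed: %v", err)
		}
		nc = newNC
	})
}
```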
If you replace nc you lose all the reconnect buffers, automatic subscription recreation, etc.? The app would need to be significantly restructured if those are chan subscribes made in various places?
Yes, you would need to do all the work the initial connection did regarding subscriptions etc. as part of this.
That's the crux of my complaint about this behaviour change: like when STAN reconnect handling was added late, it leads to a massive internal app re-org. We need to find a way to solve this without forcing that.
I think that is a good goal. I want to make sure we align with the early DNA of NATS, which is that it protects itself as well. So during revocation or some other change that invalidates a client connection, I want the default behavior to be that the client does not continue trying to reconnect.
I'd reverse this or adjust things so it behaves like a normal disconnect until we figure out something closer to the typical model.