
Handle expiration and auth errors on reconnects. #499

Merged: 2 commits into master from expired, Jul 31, 2019
Conversation

@derekcollison (Member)

In general we want to retry a connection that got an auth error, whether it's a violation or an expiration. However, we want to close the connection if we get the same error twice on the same server.

Also replaced code now present in jwt lib.

Signed-off-by: Derek Collison [email protected]

@kozlovic (Member) left a comment

LGTM - may want to have a test that shows that the connection is closed if we get the same error twice.

@coveralls commented Jul 31, 2019

Coverage increased (+0.09%) to 92.508% when pulling e2a5163 on expired into 6eb6122 on master.

@derekcollison (Member Author)

Will add that test.

@kozlovic (Member) left a comment

Should set max reconnects to a higher value or infinite for this test.

nats_test.go (outdated review thread)
@kozlovic (Member) left a comment

LGTM

// We should give up if we tried twice on this server and got the
// same error.
if nc.current.lastErr == err {
	defer nc.Close()
@derekcollison (Member Author)

Hey @kozlovic should we be shutting down here or just removing this one from the list? If there are more servers in the list we probably should not give up.

@kozlovic (Member)

Why not.

@derekcollison (Member Author)

If I am connected to a cluster, and one server is bad but others are ok, we should be able to reconnect.
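
A rough standalone sketch of that alternative: drop only the server that keeps failing with the same error, and close only once no servers remain. The type and helper below are illustrative stand-ins for the client's internal srvPool/current bookkeeping, not the actual implementation.

```go
package main

import (
	"errors"
	"fmt"
)

// server is an illustrative stand-in for the client's internal pool entry
// (the real fields are the unexported srvPool/current/lastErr seen above).
type server struct {
	url     string
	lastErr error
}

// pruneOnRepeatedErr keeps a server after its first auth error, drops it from
// the pool after the same error twice, and reports whether any servers remain.
func pruneOnRepeatedErr(pool []*server, cur *server, err error) ([]*server, bool) {
	if cur.lastErr != err {
		cur.lastErr = err // first time we see this error on this server: retry later
		return pool, true
	}
	kept := pool[:0]
	for _, s := range pool {
		if s != cur {
			kept = append(kept, s)
		}
	}
	// Only when the pool is empty would we fall back to closing the connection.
	return kept, len(kept) > 0
}

func main() {
	a := &server{url: "nats://a:4222"}
	b := &server{url: "nats://b:4222"}
	pool := []*server{a, b}
	authErr := errors.New("authorization violation")

	pool, ok := pruneOnRepeatedErr(pool, a, authErr) // first failure: keep a
	fmt.Println(len(pool), ok)                       // 2 true
	pool, ok = pruneOnRepeatedErr(pool, a, authErr)  // same error again: drop a
	fmt.Println(len(pool), ok)                       // 1 true -> keep reconnecting via b
}
```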

@kozlovic (Member)

I meant yes, why not do that. But I can't really recall why we did this in the first place. What problem exactly were we trying to solve? Is there a situation where one server will always return an auth error while the other servers in the cluster do not (without it being a misconfiguration)? If auth is being updated on a server and a client connects while the server is still returning the error, we could end up removing that server from the list of URLs even though it would have accepted the connection once the update completed.

@derekcollison (Member Author)

If I revoke your credentials in some form or fashion I wanted the default behavior to not keep trying forever.

@ripienaar (Contributor)

The expectation that a NATS client automatically reconnects through almost all scenarios is quite deeply entrenched. For example, quoting the docs (emphasis mine):

Most, if not all, of the client libraries will reconnect to the NATS system if they
are disconnected for **any reason**. The reconnect logic can differ by library, 
so check your client library's documentation.

In general, the client will try to re-connect to one of the servers it knows about,
either through the URLs provided in the connect call or the URLs provided by 
the NATS system during earlier connects. This feature allows NATS applications 
and the NATS system itself to self heal and reconfigure itself with no additional 
configuration or intervention. The library may have several options to help control 
reconnect behavior, to notify about reconnect state and to inform about a new 
server.

As a user, this reconnect behaviour has become just a normal expectation to have, even across languages; that's just how NATS clients work, in my experience.

We're looking to add features that further entrench this, for example a fire-and-forget Connect() that reports errors asynchronously.

So I appreciate the need for dealing with revoked creds - and I think we already have a well-understood method for doing that with sane defaults. The library will only attempt 60 reconnects by default, with jittered backoffs (soon) and all that good stuff.

Would treating auth errors as connection errors in the sense of the reconnect loop not serve your needs? By default after 60 connect failures - auth error or otherwise - the library will raise an async error.
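
A minimal sketch of the behaviour being argued for, using only options that already exist; the handler bodies and the cap shown here are illustrative, not a proposal for new API:

```go
package main

import (
	"log"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect(nats.DefaultURL,
		nats.MaxReconnects(60), // the library's default cap; -1 means retry forever
		nats.ErrorHandler(func(_ *nats.Conn, _ *nats.Subscription, err error) {
			// Auth failures and other connection errors are reported here asynchronously.
			log.Printf("async error: %v", err)
		}),
		nats.ClosedHandler(func(_ *nats.Conn) {
			log.Println("gave up after exhausting reconnect attempts")
		}),
	)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()
}
```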

Those who do not wish to have this behaviour already understand and probably use MaxReconnects(-1) with the expectation that the package will forever retry.

If we wish to specifically support changing creds files we can do so with a stat check between retries.
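
A user-land sketch of that stat check, done from the reconnect-delay callback; the creds path, URL, and fixed delay are illustrative assumptions:

```go
package main

import (
	"log"
	"os"
	"time"

	"github.com/nats-io/nats.go"
)

func main() {
	const credsPath = "/etc/app/user.creds" // hypothetical path

	var lastMod time.Time
	if fi, err := os.Stat(credsPath); err == nil {
		lastMod = fi.ModTime()
	}

	nc, err := nats.Connect("nats://demo.nats.io:4222", // illustrative URL
		nats.UserCredentials(credsPath),
		nats.MaxReconnects(-1),
		nats.CustomReconnectDelay(func(attempts int) time.Duration {
			// Stat the creds file between retries: a newer mtime means it was
			// rotated, and the file is read again on the next connect attempt.
			if fi, err := os.Stat(credsPath); err == nil && fi.ModTime().After(lastMod) {
				lastMod = fi.ModTime()
				log.Printf("creds file updated; retrying with new credentials")
			}
			return 2 * time.Second
		}),
	)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()
}
```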

The alternative is that a user who wants this forever-reconnect behaviour will have to basically reimplement a lot of the features of the package - tracking all subscriptions that were made, recreating them, keeping hold of handles to queues and more. Effectively rewriting the entire flow logic of the package.

We also already have a LOT of options; adding an OptOut for this is probably not that good.

@aricart (Member) commented May 7, 2020

The other thing is we need to start a doc on client changes - we should be describing what the changes are, and why. This will help keep a task list of what needs doing, with perhaps a reference to the PR.

I agree with RI that the current contract with the client is that we won't close.

@cbrake (Contributor) commented Aug 5, 2020

I don't claim to understand all the scenarios where NATS is used, but I'm testing the client on low-end IoT edge devices deployed in remote locations where our only connection to them is NATS. Also, the application on the edge devices does a lot of different things like reading sensors, running rules, controlling stuff, etc. -- thus it needs to keep running; it can't keep restarting every time there is a NATS connection error. So, I'm currently configuring the client with the following options:

        nc, err := nats.Connect(server,
                nats.Timeout(30*time.Second),
                nats.DrainTimeout(30*time.Second),
                nats.PingInterval(2*time.Minute),
                nats.MaxPingsOutstanding(5),
                nats.RetryOnFailedConnect(true),
                nats.ReconnectBufSize(5*1024*1024),
                nats.MaxReconnects(-1),
                nats.SetCustomDialer(&net.Dialer{
                        KeepAlive: -1,
                }),
                nats.CustomReconnectDelay(func(attempts int) time.Duration {
                        delay := ExpBackoff(attempts, 10*time.Minute)
                        log.Printf("NATS reconnect attempts: %v, delay: %v", attempts, delay)
                        return delay
                }),
                nats.Token(authToken),
        )

So the above covers most cases (like cellular communication errors, etc.). The exponential backoff keeps the edge devices from flooding the server if something does go wrong. But there is one case that is not handled: the NATS server may be started with wrong or missing auth information, and the current behavior is that the NATS connection on the edge devices is closed and quits retrying. When the server deployment error is discovered and fixed, the clients are stuck. I could exit the app and restart it when the NATS connection is closed, but that is not desirable, as buffered publish messages would then be lost and the normal operation of the edge system would be affected (rules, etc.).

This is perhaps a little different type of distributed system than microservices, where a microservice is dispensable -- if one exits, you have a system that simply starts more. You also have more control over a microservice, as you have control over whatever is hosting it. For an IoT edge device, the only connection you have to it is NATS, and you can't just throw the edge application away if there are problems -- it is performing essential tasks.

My initial thought was to be able to use the nc object in various parts of the edge app, so having an nc object that persists for the life of the app seems useful. However, it may be better to abstract all the NATS logic (much like you would a database interface) so the lifecycle of the connection can be better maintained. But if I have to re-create the nc object, then I lose the reconnect buffer containing published messages that have not been sent yet, subscriptions would have to be restarted, etc.

(I'm new to NATS, so just thinking out loud -- this may or may not be relevant).

@derekcollison (Member Author)

The balance I want to achieve is between doing the right thing when it is a config error that is quickly fixed, and not promoting DoS scenarios where an account with a large number of connections is revoked and they all just keep trying to connect.

@ripienaar (Contributor)

I understand why you need this @derekcollison, but what is a user supposed to do here?

If it's a piece of software that should work forever - there are many scenarios where this is desirable - we give them no choice AND we change the default behaviour in a way that makes this impossible to achieve AND we make it impossible to discover that this behaviour exists, as it's so orthogonal to everything else.

Worse, if they want reconnect-style behaviour they basically have to reimplement many features already in the library.

Users who want reconnects - assuming they figured out this is going to happen - will either handle the situation and reconnect, or exit, restart and reconnect. Either way they will come back. Worse, they will come back at the bottom of their exponential backoff counts, meaning many rapid connections. So rather than helping, this is going to make it worse. Better to treat it as a reconnect, based on the expressed desired reconnect behaviour, and move to the slowest backoff possible.

Users who don't want this use our already safe defaults of capped reconnects. The default behaviour is safe for the server.

For me the worst class of users is those who are unaware that this massive behaviour change happened in a minor release and who will now have their fleet just die unexpectedly.

@derekcollison (Member Author)

If we add some code to detect that a given server was restarted or its config was reloaded, and have the client attempt that server again, then the user will get the correct behavior without doing anything.

@cbrake (Contributor) commented Aug 14, 2020

Is there one size that fits all here, or should it be configurable in the client? A typical microservice behavior might be to just exit and go away -- new ones will be spawned as necessary, but remote IoT devices are bricked and the app needs to be restarted, losing functionality in the process.

For now I'm doing the following -- not a huge deal for the first application I'm working on, as it can tolerate a restart, and 15m of lost connectivity is acceptable for the rare instances where we misconfigure the server:

https://github.com/simpleiot/simpleiot/blob/feature-nats/api/nats-edge.go#L52

	nc.SetClosedHandler(func(_ *nats.Conn) {
		log.Println("Connection to NATS is closed! -- this should never happen, waiting 15m then exiting")
		time.Sleep(15 * time.Minute)
		os.Exit(-1)
	})

@cbrake (Contributor) commented Aug 14, 2020

Detecting a server restart seems like a neat idea, but I'm having trouble visualizing it -- would the client continue to poll the server for changes to see if it restarted or was reconfigured? How is this different from just retrying the connect?

@derekcollison (Member Author)

Might be the same, would need to look more closely.

As for the ClosedHandler, why not try to reconnect there and if successful replace nc?

@ripienaar (Contributor)

> As for the ClosedHandler, why not try to reconnect there and if successful replace nc?

If you replace nc you lose all the reconnect buffers, automatic subscription re-creates, etc.? The app would need to be significantly restructured if those are channel subscriptions made in various places?

@derekcollison (Member Author)

Yes, we would need to do all the work the initial connection did regarding subscriptions, etc., as part of this.
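
Roughly, that application-side work looks like the sketch below: dial again from the ClosedHandler, redo every subscription by hand, and swap the handle. The URL, subject, and helper are illustrative assumptions; the handle swap is racy for any goroutine still using the old conn, and whatever sat in the reconnect buffer is gone.

```go
package main

import (
	"log"

	"github.com/nats-io/nats.go"
)

func main() {
	url := "nats://demo.nats.io:4222" // illustrative URL

	resubscribe := func(c *nats.Conn) {
		// Every subscription the app made has to be recreated by hand.
		if _, err := c.Subscribe("sensors.>", func(m *nats.Msg) {
			log.Printf("msg on %s", m.Subject)
		}); err != nil {
			log.Printf("resubscribe failed: %v", err)
		}
	}

	var nc *nats.Conn
	nc, err := nats.Connect(url,
		nats.ClosedHandler(func(_ *nats.Conn) {
			// The old conn is closed for good; anything still buffered is lost.
			newNC, err := nats.Connect(url)
			if err != nil {
				log.Printf("manual reconnect failed: %v", err)
				return
			}
			resubscribe(newNC)
			nc = newNC // every holder of the old handle must be updated as well
		}),
	)
	if err != nil {
		log.Fatal(err)
	}
	resubscribe(nc)
	select {} // keep the process running for the sake of the sketch
}
```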

@ripienaar (Contributor)

> Yes, we would need to do all the work the initial connection did regarding subscriptions, etc., as part of this.

That's the crux of my complaint about this behaviour change: like when STAN reconnect handling was added late, it leads to a massive internal app re-org.

We need to find a way to solve this without forcing that.

@derekcollison (Member Author)

I think that is a good goal. I want to make sure we align with the early DNA of NATS, that it protects itself as well. So during a revocation or some other change that invalidates a client connection, I want the default behavior to be that the client does not continue to try to reconnect.

@ripienaar (Contributor)

I'd reverse this or adjust things so it behaves like a normal disconnect until we figure out something closer to the typical model.
