
Handle expiration and auth errors on reconnects. #499

Merged: 2 commits into master from expired, Jul 31, 2019
Conversation

@derekcollison (Member)

In general we want to retry a connection that got an auth error, whether it's a violation or an expiration. However, we want to close the connection if we get the same error twice on the same server.

Also replaced code now present in jwt lib.

Signed-off-by: Derek Collison [email protected]

@kozlovic (Member) left a comment

LGTM - may want to have a test that shows that the connection is closed if we get the same error twice.

@coveralls commented Jul 31, 2019

Coverage increased (+0.09%) to 92.508% when pulling e2a5163 on expired into 6eb6122 on master.

@derekcollison (Member Author)

Will add that test.

@kozlovic (Member) left a comment

Should set max reconnects to a higher value or infinite for this test.

nats_test.go (outdated review thread)
@kozlovic (Member) left a comment

LGTM

// We should give up if we tried twice on this server and got the
// same error.
if nc.current.lastErr == err {
	defer nc.Close()
@derekcollison (Member Author)

Hey @kozlovic should we be shutting down here or just removing this one from the list? If there are more servers in the list we probably should not give up.

@kozlovic (Member)

Why not.

@derekcollison (Member Author)

If I am connected to a cluster, and one server is bad but others are ok, we should be able to reconnect.
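
A rough standalone sketch of that alternative: drop only the server that keeps failing with the same error, and close only once no servers remain. The type and helper below are illustrative stand-ins for the client's internal srvPool/current bookkeeping, not the actual implementation.

```go
package main

import (
	"errors"
	"fmt"
)

// server is an illustrative stand-in for the client's internal pool entry
// (the real fields are the unexported srvPool/current/lastErr seen above).
type server struct {
	url     string
	lastErr error
}

// pruneOnRepeatedErr keeps a server after its first auth error, drops it from
// the pool after the same error twice, and reports whether any servers remain.
func pruneOnRepeatedErr(pool []*server, cur *server, err error) ([]*server, bool) {
	if cur.lastErr != err {
		cur.lastErr = err // first time we see this error on this server: retry later
		return pool, true
	}
	kept := pool[:0]
	for _, s := range pool {
		if s != cur {
			kept = append(kept, s)
		}
	}
	// Only when the pool is empty would we fall back to closing the connection.
	return kept, len(kept) > 0
}

func main() {
	a := &server{url: "nats://a:4222"}
	b := &server{url: "nats://b:4222"}
	pool := []*server{a, b}
	authErr := errors.New("authorization violation")

	pool, ok := pruneOnRepeatedErr(pool, a, authErr) // first failure: keep a
	fmt.Println(len(pool), ok)                       // 2 true
	pool, ok = pruneOnRepeatedErr(pool, a, authErr)  // same error again: drop a
	fmt.Println(len(pool), ok)                       // 1 true -> keep reconnecting via b
}
```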

@kozlovic (Member)

I meant yes, why not do that. But I can't really recall why we did this in the first place. What problem exactly were we trying to solve? Is there a situation where one server will always return an auth error while the other servers in the cluster do not (without it being a misconfiguration)? If auth is being updated on a server and a client connects while the server is still returning the error, we could end up removing that server from the list of URLs even though it would have accepted the connection once the update completed.

@derekcollison (Member Author)

If I revoke your credentials in some form or fashion I wanted the default behavior to not keep trying forever.

@ripienaar (Contributor)

The expectation that a NATS client automatically reconnects through almost all scenarios is quite deeply entrenched. For example, quoting the docs (emphasis mine):

Most, if not all, of the client libraries will reconnect to the NATS system if they
are disconnected for **any reason**. The reconnect logic can differ by library, 
so check your client library's documentation.

In general, the client will try to re-connect to one of the servers it knows about,
either through the URLs provided in the connect call or the URLs provided by 
the NATS system during earlier connects. This feature allows NATS applications 
and the NATS system itself to self heal and reconfigure itself with no additional 
configuration or intervention. The library may have several options to help control 
reconnect behavior, to notify about reconnect state and to inform about a new 
server.

As a user, this reconnect behaviour has become just a normal expectation to have, even across languages; that's just how NATS clients work, in my experience.

We're looking to add features that further entrench this, for example a fire-and-forget Connect() that reports errors asynchronously.

So I appreciate the need for dealing with revoked creds - and I think we already have a well-understood method for doing that with sane defaults. The library will only attempt 60 reconnects by default, with jittered backoffs (soon) and all that good stuff.

Would treating auth errors as connection errors in the sense of the reconnect loop not serve your needs? By default after 60 connect failures - auth error or otherwise - the library will raise an async error.
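
A minimal sketch of the behaviour being argued for, using only options that already exist; the handler bodies and the cap shown here are illustrative, not a proposal for new API:

```go
package main

import (
	"log"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect(nats.DefaultURL,
		nats.MaxReconnects(60), // the library's default cap; -1 means retry forever
		nats.ErrorHandler(func(_ *nats.Conn, _ *nats.Subscription, err error) {
			// Auth failures and other connection errors are reported here asynchronously.
			log.Printf("async error: %v", err)
		}),
		nats.ClosedHandler(func(_ *nats.Conn) {
			log.Println("gave up after exhausting reconnect attempts")
		}),
	)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()
}
```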

Those who do not wish to have this behaviour already understand and probably use MaxReconnects(-1) with the expectation that the package will forever retry.

If we wish to specifically support changing creds files we can do so with a stat check between retries.
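
A user-land sketch of that stat check, done from the reconnect-delay callback; the creds path, URL, and fixed delay are illustrative assumptions:

```go
package main

import (
	"log"
	"os"
	"time"

	"github.com/nats-io/nats.go"
)

func main() {
	const credsPath = "/etc/app/user.creds" // hypothetical path

	var lastMod time.Time
	if fi, err := os.Stat(credsPath); err == nil {
		lastMod = fi.ModTime()
	}

	nc, err := nats.Connect("nats://demo.nats.io:4222", // illustrative URL
		nats.UserCredentials(credsPath),
		nats.MaxReconnects(-1),
		nats.CustomReconnectDelay(func(attempts int) time.Duration {
			// Stat the creds file between retries: a newer mtime means it was
			// rotated, and the file is read again on the next connect attempt.
			if fi, err := os.Stat(credsPath); err == nil && fi.ModTime().After(lastMod) {
				lastMod = fi.ModTime()
				log.Printf("creds file updated; retrying with new credentials")
			}
			return 2 * time.Second
		}),
	)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()
}
```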

The alternative is that a user who wants this forever-reconnect behaviour will have to basically reimplement a lot of the features of the package - tracking all subscriptions that were made, recreating them, keeping hold of handles to queues and more. Effectively rewriting the entire flow logic of the package.

We also already have a LOT of options; adding an OptOut for this is probably not that good.

@aricart (Member) commented May 7, 2020

The other thing is we need to start a doc on client changes - we should be describing what the changes are, and why. This will help keep a task list of what needs doing, with perhaps a reference to the PR.

I agree with RI that the current contract with the client is that we won't close.

@cbrake (Contributor) commented Aug 5, 2020

I don't claim to understand all the scenarios where NATS is used, but I'm testing the client on low-end IoT edge devices deployed in remote locations where our only connection to them is NATS. Also, the application on the edge devices does a lot of different things like reading sensors, running rules, controlling stuff, etc. -- thus it needs to keep running; it can't keep restarting every time there is a NATS connection error. So, I'm currently configuring the client with the following options:

        nc, err := nats.Connect(server,
                nats.Timeout(30*time.Second),
                nats.DrainTimeout(30*time.Second),
                nats.PingInterval(2*time.Minute),
                nats.MaxPingsOutstanding(5),
                nats.RetryOnFailedConnect(true),
                nats.ReconnectBufSize(5*1024*1024),
                nats.MaxReconnects(-1),
                nats.SetCustomDialer(&net.Dialer{
                        KeepAlive: -1,
                }),
                nats.CustomReconnectDelay(func(attempts int) time.Duration {
                        delay := ExpBackoff(attempts, 10*time.Minute)
                        log.Printf("NATS reconnect attempts: %v, delay: %v", attempts, delay)
                        return delay
                }),
                nats.Token(authToken),
        )

So the above covers most cases (like cellular communication errors, etc.). The exponential backoff keeps the edge devices from flooding the server if something does go wrong. But there is one case that is not handled: the NATS server may be started with wrong or missing auth information, and the current behavior is that the NATS connection on the edge devices is closed and quits retrying. When the server deployment error is discovered and fixed, the clients are stuck. I could exit the app and restart it when the NATS connection is closed, but that is not desirable, as buffered publish messages would then be lost and the normal operation of the edge system would be affected (rules, etc.).

This is perhaps a little different type of distributed system than microservices, where a microservice is dispensable -- if one exits, you have a system that simply starts more. You also have more control over a microservice, as you have control over whatever is hosting it. For an IoT edge device, the only connection you have to it is NATS, and you can't just throw the edge application away if there are problems -- it is performing essential tasks.

My initial thought was to be able to use the nc object in various parts of the edge app, so having an nc object that persists for the life of the app seems useful. However, it may be better to abstract all the NATS logic (much like you would a database interface) so the lifecycle of the connection can be better maintained. But if I have to re-create the nc object, then I lose the reconnect buffer containing published messages that have not been sent yet, subscriptions would have to be restarted, etc.

(I'm new to NATS, so just thinking out loud -- this may or may not be relevant).

@derekcollison (Member Author)

The balance I want to achieve is between doing the right thing when it is a config error that is quickly fixed, and not promoting DoS scenarios where an account with a large number of connections is revoked and they all just keep trying to connect.

@ripienaar (Contributor)

I understand why you need this @derekcollison, but what is a user supposed to do here?

If it's a piece of software that should work forever - there are many scenarios where this is desirable - we give them no choice AND we change the default behaviour in a way that makes this impossible to achieve AND we make it impossible to discover that this behaviour exists, as it's so orthogonal to everything else.

Worse, if they want reconnect-style behaviour they basically have to reimplement many features already in the library.

Users who want reconnects - assuming they figured out this is going to happen - will either handle the situation and reconnect, or exit, restart and reconnect. Either way they will come back. Worse, they will come back at the bottom of their exponential backoff counts, meaning many rapid connections. So rather than helping, this is going to make it worse. Better to treat it as a reconnect, based on the expressed desired reconnect behaviour, and move to the slowest backoff possible.

Users who don't want this use our already safe defaults of capped reconnects. The default behaviour is safe for the server.

For me the worst class of users is those who are unaware that this massive behaviour change happened in a minor release and who will now have their fleet just die unexpectedly.

@derekcollison (Member Author)

If we add some code to detect that a given server was restarted or its config was reloaded, and have the client attempt that server again, then the user will get the correct behavior without doing anything.

@cbrake (Contributor) commented Aug 14, 2020

Is there one size that fits all here, or should it be configurable in the client? A typical microservice behavior might be to just exit and go away -- new ones will be spawned as necessary, but remote IoT devices are bricked and the app needs to be restarted, losing functionality in the process.

For now I'm doing the following -- not a huge deal for the first application I'm working on, as it can tolerate a restart, and 15m of lost connectivity is acceptable for the rare instances where we misconfigure the server:

https://github.com/simpleiot/simpleiot/blob/feature-nats/api/nats-edge.go#L52

	nc.SetClosedHandler(func(_ *nats.Conn) {
		log.Println("Connection to NATS is closed! -- this should never happen, waiting 15m then exiting")
		time.Sleep(15 * time.Minute)
		os.Exit(-1)
	})

@cbrake (Contributor) commented Aug 14, 2020

Detecting a server restart seems like a neat idea, but I'm having trouble visualizing it -- would the client continue to poll the server for changes to see if it restarted or was reconfigured? How is this different from just retrying the connect?

@derekcollison (Member Author)

Might be the same, would need to look more closely.

As for the ClosedHandler, why not try to reconnect there and if successful replace nc?

@ripienaar (Contributor)

> As for the ClosedHandler, why not try to reconnect there and if successful replace nc?

If you replace nc you lose all the reconnect buffers, automatic subscription re-creates, etc.? The app would need to be significantly restructured if those are channel subscriptions made in various places?

@derekcollison (Member Author)

Yes, we would need to do all the work the initial connection did regarding subscriptions, etc., as part of this.
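
Roughly, that application-side work looks like the sketch below: dial again from the ClosedHandler, redo every subscription by hand, and swap the handle. The URL, subject, and helper are illustrative assumptions; the handle swap is racy for any goroutine still using the old conn, and whatever sat in the reconnect buffer is gone.

```go
package main

import (
	"log"

	"github.com/nats-io/nats.go"
)

func main() {
	url := "nats://demo.nats.io:4222" // illustrative URL

	resubscribe := func(c *nats.Conn) {
		// Every subscription the app made has to be recreated by hand.
		if _, err := c.Subscribe("sensors.>", func(m *nats.Msg) {
			log.Printf("msg on %s", m.Subject)
		}); err != nil {
			log.Printf("resubscribe failed: %v", err)
		}
	}

	var nc *nats.Conn
	nc, err := nats.Connect(url,
		nats.ClosedHandler(func(_ *nats.Conn) {
			// The old conn is closed for good; anything still buffered is lost.
			newNC, err := nats.Connect(url)
			if err != nil {
				log.Printf("manual reconnect failed: %v", err)
				return
			}
			resubscribe(newNC)
			nc = newNC // every holder of the old handle must be updated as well
		}),
	)
	if err != nil {
		log.Fatal(err)
	}
	resubscribe(nc)
	select {} // keep the process running for the sake of the sketch
}
```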

@ripienaar (Contributor)

> Yes, we would need to do all the work the initial connection did regarding subscriptions, etc., as part of this.

That's the crux of my complaint about this behaviour change: like when STAN reconnect handling was added late, it leads to a massive internal app re-org.

We need to find a way to solve this without forcing that.

@derekcollison (Member Author)

I think that is a good goal. I want to make sure we align with the early DNA of NATS, that it protects itself as well. So during a revocation or some other change that invalidates a client connection, I want the default behavior to be that the client does not continue to try to reconnect.

@ripienaar (Contributor)

I'd reverse this or adjust things so it behaves like a normal disconnect until we figure out something closer to the typical model.
