Fix the error callback from not being called sometimes #1683

pjsg · 2016-12-26T23:46:09Z

This PR is for the dev branch rather than for master.
This PR is compliant with the other contributing guidelines as well (if not, please describe why).
I have thoroughly tested my contribution.
The code changes are reflected in the documentation at docs/en/*.

There were a number of issues (now fixed);

If the socket failed to connect before the timer went off, then it wasn't reported. It now is
If the timer went off before the socket connected, then the connection attempt was not aborted.
If auto-reconnect was set, then errors were never reported. Now they are reported on the first connection attempt.

The downside is that there is a behavior change -- if the first connection fails, then reconnect is not performed.

devsaurus · 2016-12-27T10:44:07Z

The downside is that there is a behavior change -- if the first connection fails, then reconnect is not performed.

IMO this isn't really a downside but a clearer behavior. I like auto reconnect, but only in case a connection was established at all in the beginning. Observing that connection can be achieved at least once rules out a bunch of potential networking problems. Automatic algos can then keep on massaging the network layer as required.

…evied

devyte · 2016-12-27T14:49:29Z

I mentioned this elsewhere, but I thought to bring it up here as well. Feel free to yell at me if I'm talking nonesense :)
I think it may make sense to keep trying to reconnect even if the first attempt fails. Consider the case where the ESP and the broker are on the same power rail, and power fails. When power comes back, both ESP and broker will start up at the same time, but the ESP will likely come online first (boot time is short). It will then try to connect to the broker, and fail, because the broker isn't online yet. If the auto-reconnect stops after the very first fail, it will never connect. The user would then need to add a whole additional layer on top to detect if the broker is online before trying to connect.
Wouldn't a better api be to specify a max number of auto-reconnect attempts, then give up and trigger a callback if all attempts failed?

devsaurus · 2016-12-27T15:17:23Z

@devyte I noticed your post elsewhere after posting here and think that the case you describe should be considered.
My statement above originated from the understanding that the autoreconnect option means to automatically reconnect when the connection dropped - with the focus on the re. My view was probably a bit too narrow when assuming that a connection has been established initially.

Picking up your API proposal (not fully backwards compatible though):
autoreconnect n connection retries: 0 for false, > 0 for n retries, < 0 to retry infinitely, default 0

Does this sound meaningful?

devyte · 2016-12-28T05:51:58Z

That looks about right! Thank you for taking my feedback into account :)

pjsg · 2016-12-28T14:36:00Z

I'm think slightly differently -- how about this?

Autoreconnect = 0 -- no auto reconnect behavior. Error signalled via error callback
Autoreconnect = 1 -- auto reconnects after first success. Error signalled via error callback.
Autoreconnect = 2 -- auto connect/reconnect. ALL errors signalled via error callback.

The error callback gets an additional argument that says whether this is a 'final' error or not. A final error means that no further activity is expected.

Also, make sure that retries are done with an exponential backoff (with a max of 1000 seconds).

I really don't want to have a mode where, if the server ip is wrong (or password is wrong), there is no error indication given. Of course, the programmer can ignore the error, or pass a nil as the error callback.

marcelstoer · 2016-12-28T14:40:57Z

Thanks for picking this up Arnim.

autoreconnect n connection retries: 0 for false, > 0 for n retries, < 0 to retry infinitely, default 0

It may be a perfectly valid approach in the C/embedded world, I have little to compare it with, but I don't like that. I prefer explicit APIs. English is not my mother tongue but to me a parameter named autoreconnect must semantically be a boolean.
I'd also argue that the "n retries" feature won't be used much. Isn't it so that you either want retry or not and if so you want it to be infinite? Furthermore, which ever way the MQTT behaviour will eventually be modified it should be consistent with the WiFi auto-(re)connect feature.

devyte · 2016-12-28T15:36:09Z

@marcelstoer an explicit api is ok too. I don't really have a preference for how this is offered.

The name "autoreconnect" does not necessarily imply true|false. It can be the name of a feature, which takes many parameters. But like I said, I don't really have a preference for the api, so I'd rather leave that up to you :)

The n-retries approach is widely used in pretty much everything, including the Linux OS itself. It pretty much says: "retry n times, after that give up". Therefore, I would argue that it would be used quite often, especially in view of what @pjsg said.
Some users, such as me, would prefer to keep on trying for a very long time, or maybe even forever. This makes for a "soft connection" behavior, where connection errors can be recovered without restarting every single ESP in the network.
To explain this a bit: my use case is a wide wifi with many wifi routers interconnected via WDS. The thing is, the communication between the routers is not always stable, and sometimes I even have to reboot them. If I reboot the routers, all ESPs lose communication to the broker. If the autoreconnect is not handled properly, the ESPs would just give up and sit there staring at the wall. You could argue that specifying a timeout would be enough, but to be honest I've seen the wifi network take from a few seconds up to 15mins to recover. I prefer to have the ESPs just keep trying, eventually I get the wifi back up, and then the ESPs recover by themselves.

Also, like I explained in the previous post, all ESPs, the broker, and all wifi routers, among other devices, are on the same power circuit. If the power circuit goes down, when it comes back up all devices will power on at the same time, but will come online at very different times. I suspect that this is rather common among users, as is the case where the ESP is faster to come online that the broker. Therefore, if the first connection attempt by the ESP MQTT to the broker fails, it makes sense to keep retrying, at least for a while. This is also covered by the same n-retries approach, so it makes for simple usage for the programmer covering both cases in the same way.

marcelstoer · 2016-12-28T15:41:36Z

make sure that retries are done with an exponential backoff (with a max of 1000 seconds).

Yes, I believe that's important, too. I can't judge though whether 1k seconds is long enough.

devyte · 2016-12-28T15:52:38Z

@pjsg
you said:

make sure that retries are done with an exponential backoff

Exponential backoffs are usually done to cover two cases at rather opposite ends of the spectrum:
-quick and fast, where minimal overhead is desired
-long and slow, in which case you don't want to keep trying quickly, e.g.: overhead of waiting an extra few mins after an hour-long outage is ok

In this MQTT reconnect case, I don't think it's necessary to cover both ends of the spectrum. Personally, I feel that adding exponential backoff here would overcomplicate things, but that's just my own thought.

you said:

Autoreconnect = 1 -- auto reconnects after first success

I assume that you want to force an error if the first connection attempt fails, and only try to auto-reconnect if the first try succeeds. This is a Bad Idea, like I explained previously.

you said:

error callback gets an additional argument that says whether this is a 'final' error or not

This is a very good idea, it makes sense to know inside the callback whether more attempts will be tried or not.

you said:

I really don't want to have a mode where, if the server ip is wrong (or password is wrong), there is no error indication given. Of course, the programmer can ignore the error, or pass a nil as the error callback.

The callback serves the purpose of error handling. If the callback receives the "final" argument, or "i-th of n" attempts argument, then an error indication can be implemented, either for each auto-reconnect attempt, or only for the final one (i.e.: I tried n times and I'm now giving up), or both.
If the programmer is too lazy to implement proper error handling, or ignores the error, or passes nil for the callback, then I'd say he deserves his fate. Error handling is fundamental in any half-way decent code.

marcelstoer · 2017-01-20T09:23:51Z

I removed the 2.0.0 milestone as we're past the two-weeks-before-master-drop milestone and there are some unresolved issues (as per this discussion).

karrots · 2017-01-20T14:58:15Z

It seems like auto-reconnection after the error without handling the error doesn't do anything towards fixing the error. Aside from connection time-out in @devyte's power example, everything else needs the Lua programmer or user to take action. If the call back worked correctly then the Lua programmer can decide to reconnect or stop and notify the user.

As it stands now the error handler doesn't always get called and if you try to stop the auto-reconnect by calling close on the client object the connections keep happening until the error handler happens again and gives a nil reference error.

It seems like first steps should be to fix the bugs in the system; then improve the system with preference changes (if still necessary).

Fix the error call back so it works consistently.
Fix the ability to stop auto-reconnect.

After those two items are fixed the Lua programmer can choose how to correctly deal with each error. Currently, they feel tied down because the system isn't working consistently.

pjsg · 2017-01-22T22:43:14Z

The current approach is that the erro callback is supposed to be called at most once after a connect call is made. This is a final callback, no further action will be taken.

If autoreconnect means "always retry errors" then the programmer will get no error callbacks at all (there is no final error).

I'm thinking that autoreconnect true means that errors are retried (no matter what the error is). I'm going to stick a reasonable timeout there -- say 10 seconds, However, all errors will be signalled to the error callback. An argument will be passed to say if there is a retry pending. Calling close would stop the retry.

Does this meet people's needs?

devyte · 2017-01-22T23:06:31Z

Again, consider the case where the ESP and the broker power on at the same time. The ESP will take a few seconds to come online and try to connect to the broker. However, the broker could be a large system like a pc, or at least a mini-version of one, generally linux-based. How long would such a system take, from power on to the broker coming online? If you really must put a timeout on retries, this must be taken into account in a robust manner. Otherwise, after every power failure, all ESPs will need to be restarted manually, because they will have timed out.

…

On Jan 22, 2017 7:43 PM, "Philip Gladstone" ***@***.***> wrote: The current approach is that the erro callback is *supposed* to be called at most once after a connect call is made. This is a final callback, no further action will be taken. If autoreconnect means "always retry errors" then the programmer will get no error callbacks at all (there is no final error). I'm thinking that autoreconnect true means that errors are retried (no matter what the error is). I'm going to stick a reasonable timeout there -- say 10 seconds, However, *all* errors will be signalled to the error callback. An argument will be passed to say if there is a retry pending. Calling close would stop the retry. Does this meet people's needs? — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#1683 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AQC6BqwSd9vN-i-9Zkseu-cFCN_o_o3yks5rU9uIgaJpZM4LV8v8> .

pjsg · 2017-01-23T01:31:16Z

The programmer has two choices:

They can manage the retries themselves. Don't set the auto-reconnect flag
The system will continually retry (for ever). Each time the connection fails, the error callback will be called. If the programmer wants to stop after 3 hours of errors, then they can do that and call :close()

I suspect that most people will choose the second option (infinite retries).

pjsg · 2017-01-23T01:45:39Z

Actually, now I stare at the mqtt code some more, I'm not sure that I can make it do anything consistent around reconnects.

I'm now feeling that the auto-reconnect option is a very bad idea. The programmer should deal with the reconnects by doing these from the error callback. The error callback should be guaranteed to be called at most once per connection. It should be called whenever the system gives up trying to get a good connection, or an existing connection breaks.

Also, given that the default for clean_session is true, it doesn't make sense to transparently reconnect anyway (as the broker throws away the state when the connection dies).

I'm now in favor of the following:

Mark the auto-reconnect as deprecated with large warnings.
Work on the reliability of the error callback
Add an example showing how to do reconnect on error.

marcelstoer · 2017-01-23T07:47:45Z

Philip, your proposal sounds reasonable, I'm ok with that.

Mark the auto-reconnect as deprecated with large warnings.

Had we somehow resolved #1538 we could make sure programers don't miss that.

pjsg · 2017-01-24T02:00:00Z

I've updated the documentation to make it clear that auto-reconnect should not be used. It particular, it interacts badly with cleansession.

I think that the underlying problem (of the missing callback) is also resolved here.

NicolSpies · 2017-01-24T17:54:22Z

@pjsg Way back when I started reporting about the mqtt becoming "stale" or unresponsive in #1406 the reconnect functionality was part of the situation. The following would have been nice:

If auto-reconnect is set, when the mqtt connection becomes "stale" or unresponsive, auto-reconnect is tried as many times as I have told it to try to reconnect. If it is does not succeed after the tries the "offline" event in triggered with a auto-reconnect "true" flag allowing me to "know" reconnect failed and will allow for specific code in the callback.
If auto-reconnect is not set, the "offline" event is triggered immediately with a auto-reconnect "false" flag.
This way the developer can let mqtt try to reconnect first before passing the "offline" callback.

Now, different scenarios I experienced complicates the suggestion above. The scenarios consists of a combination of one or more of the following:

ESP connection to AP failed.
ESP connected to AP but AP connection to the internet failed.
ESP connected to AP and AP connected to internet but connection to the mqtt broker failed.
Dependent on the existence and combination of the above, auto-reconnection is futile and hence my original response regarding the "ping" functionality to establish what has failed before attempting an reconnect.

Thus as a final thought regarding the auto-reconnect: It should only be attempted if none of the 3 conditions above are true or present. If this can be incorporated in the back-end functionality and returned as flags in the callback it will be great.

jmattsson · 2017-03-02T05:07:21Z

Is this waiting for a particular milestone?

marcelstoer · 2017-03-04T11:08:25Z

Is this waiting for a particular milestone?

No, not really. But maybe for another approval (besides mine) from someone more familiar with the C-side of things 😉
Furthermore, the deprecation warning could now be embedded in the code (see #1683 (comment)).

jmattsson · 2017-03-05T01:01:20Z

👍 from me then, and +1 on adding deprecation warning in code

marcelstoer · 2017-04-04T20:08:51Z

I just tried to add the deprecation warning to the code so we could finally merge this...and I failed 😞

I added this

if ( auto_reconnect == RECONNECT_POSSIBLE ) {
    platform_print_deprecation_note("autoreconnect == 1 is deprecated", "in the next version");
}

to https://github.com/nodemcu/nodemcu-firmware/pull/1683/files#diff-904c0a57714312fb890c87ccf9beb2a4R1076 but the compiler complained about

implicit declaration of function 'platform_print_deprecation_note'

Why is this? After all, platform.h is included.

devsaurus · 2017-04-04T20:16:36Z

Why is this? After all, platform.h is included.

Yes, but the feature branch in this PR diverged before the deprecation function in ba9d3af was merged. platform_print_deprecation_note() is not included of this code base.

marcelstoer · 2017-04-04T20:21:10Z

Oh sorry, silly me...why didn't I check. I didn't realize this PR is that old. I'll merge it then and add the snippet to dev right afterwards.

marcelstoer · 2017-04-04T20:54:04Z

There you go: b645100.

* Fix the error callback from not being called sometimes * Moved the setting of the reconnect status to after the connack is recevied * Increase the irom0_seg size * Updated the documentation * Make it clearer that autoreconnect is deprecated

Fix the error callback from not being called sometimes

89b9182

Moved the setting of the reconnect status to after the connack is rec…

b8a4132

…evied

marcelstoer added this to the 2.0.0 milestone Jan 1, 2017

Merge remote-tracking branch 'upstream/dev' into mqtt-connect

b56390e

marcelstoer added the bugfix label Jan 2, 2017

marcelstoer removed this from the 2.0.0 milestone Jan 20, 2017

pjsg added 2 commits January 22, 2017 14:23

Merge remote-tracking branch 'upstream/dev' into mqtt-connect

0090828

Increase the irom0_seg size

61a1ec8

pjsg added 3 commits January 23, 2017 08:38

Formatting changes

afcb257

Updated the documentation

ff27f2e

Make it clearer that autoreconnect is deprecated

bd7ef86

pjsg mentioned this pull request Jan 24, 2017

mqtt: event 'offline' sometimes not called #1731

Closed

marcelstoer mentioned this pull request Feb 1, 2017

MQTT doesn't detect broker outage and overflows message queue #1772

Closed

marcelstoer approved these changes Feb 9, 2017

View reviewed changes

marcelstoer added this to the 2.0.0-follow-up milestone Mar 4, 2017

marcelstoer merged commit 66ffa6c into nodemcu:dev Apr 4, 2017

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix the error callback from not being called sometimes #1683

Fix the error callback from not being called sometimes #1683

pjsg commented Dec 26, 2016

devsaurus commented Dec 27, 2016

devyte commented Dec 27, 2016

devsaurus commented Dec 27, 2016

devyte commented Dec 28, 2016 via email •

edited by marcelstoer

Loading

pjsg commented Dec 28, 2016

marcelstoer commented Dec 28, 2016 •

edited

Loading

devyte commented Dec 28, 2016

marcelstoer commented Dec 28, 2016

devyte commented Dec 28, 2016

marcelstoer commented Jan 20, 2017

karrots commented Jan 20, 2017

pjsg commented Jan 22, 2017

devyte commented Jan 22, 2017 via email

pjsg commented Jan 23, 2017

pjsg commented Jan 23, 2017

marcelstoer commented Jan 23, 2017

pjsg commented Jan 24, 2017

NicolSpies commented Jan 24, 2017

jmattsson commented Mar 2, 2017

marcelstoer commented Mar 4, 2017

jmattsson commented Mar 5, 2017

marcelstoer commented Apr 4, 2017 •

edited

Loading

devsaurus commented Apr 4, 2017

marcelstoer commented Apr 4, 2017

marcelstoer commented Apr 4, 2017

Fix the error callback from not being called sometimes #1683

Fix the error callback from not being called sometimes #1683

Conversation

pjsg commented Dec 26, 2016

devsaurus commented Dec 27, 2016

devyte commented Dec 27, 2016

devsaurus commented Dec 27, 2016

devyte commented Dec 28, 2016 via email • edited by marcelstoer Loading

pjsg commented Dec 28, 2016

marcelstoer commented Dec 28, 2016 • edited Loading

devyte commented Dec 28, 2016

marcelstoer commented Dec 28, 2016

devyte commented Dec 28, 2016

marcelstoer commented Jan 20, 2017

karrots commented Jan 20, 2017

pjsg commented Jan 22, 2017

devyte commented Jan 22, 2017 via email

pjsg commented Jan 23, 2017

pjsg commented Jan 23, 2017

marcelstoer commented Jan 23, 2017

pjsg commented Jan 24, 2017

NicolSpies commented Jan 24, 2017

jmattsson commented Mar 2, 2017

marcelstoer commented Mar 4, 2017

jmattsson commented Mar 5, 2017

marcelstoer commented Apr 4, 2017 • edited Loading

devsaurus commented Apr 4, 2017

marcelstoer commented Apr 4, 2017

marcelstoer commented Apr 4, 2017

devyte commented Dec 28, 2016 via email •

edited by marcelstoer

Loading

marcelstoer commented Dec 28, 2016 •

edited

Loading

marcelstoer commented Apr 4, 2017 •

edited

Loading