Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix the error callback from not being called sometimes #1683

Merged
merged 8 commits into from
Apr 4, 2017

Conversation

pjsg
Copy link
Member

@pjsg pjsg commented Dec 26, 2016

Fixes #1680

  • This PR is for the dev branch rather than for master.
  • This PR is compliant with the other contributing guidelines as well (if not, please describe why).
  • I have thoroughly tested my contribution.
  • The code changes are reflected in the documentation at docs/en/*.

There were a number of issues (now fixed);

  • If the socket failed to connect before the timer went off, then it wasn't reported. It now is
  • If the timer went off before the socket connected, then the connection attempt was not aborted.
  • If auto-reconnect was set, then errors were never reported. Now they are reported on the first connection attempt.

The downside is that there is a behavior change -- if the first connection fails, then reconnect is not performed.

@devsaurus
Copy link
Member

The downside is that there is a behavior change -- if the first connection fails, then reconnect is not performed.

IMO this isn't really a downside but a clearer behavior. I like auto reconnect, but only in case a connection was established at all in the beginning. Observing that connection can be achieved at least once rules out a bunch of potential networking problems. Automatic algos can then keep on massaging the network layer as required.

@devyte
Copy link

devyte commented Dec 27, 2016

I mentioned this elsewhere, but I thought to bring it up here as well. Feel free to yell at me if I'm talking nonesense :)
I think it may make sense to keep trying to reconnect even if the first attempt fails. Consider the case where the ESP and the broker are on the same power rail, and power fails. When power comes back, both ESP and broker will start up at the same time, but the ESP will likely come online first (boot time is short). It will then try to connect to the broker, and fail, because the broker isn't online yet. If the auto-reconnect stops after the very first fail, it will never connect. The user would then need to add a whole additional layer on top to detect if the broker is online before trying to connect.
Wouldn't a better api be to specify a max number of auto-reconnect attempts, then give up and trigger a callback if all attempts failed?

@devsaurus
Copy link
Member

@devyte I noticed your post elsewhere after posting here and think that the case you describe should be considered.
My statement above originated from the understanding that the autoreconnect option means to automatically reconnect when the connection dropped - with the focus on the re. My view was probably a bit too narrow when assuming that a connection has been established initially.

Picking up your API proposal (not fully backwards compatible though):
autoreconnect n connection retries: 0 for false, > 0 for n retries, < 0 to retry infinitely, default 0

Does this sound meaningful?

@devyte
Copy link

devyte commented Dec 28, 2016 via email

@pjsg
Copy link
Member Author

pjsg commented Dec 28, 2016

I'm think slightly differently -- how about this?

  • Autoreconnect = 0 -- no auto reconnect behavior. Error signalled via error callback
  • Autoreconnect = 1 -- auto reconnects after first success. Error signalled via error callback.
  • Autoreconnect = 2 -- auto connect/reconnect. ALL errors signalled via error callback.

The error callback gets an additional argument that says whether this is a 'final' error or not. A final error means that no further activity is expected.

Also, make sure that retries are done with an exponential backoff (with a max of 1000 seconds).

I really don't want to have a mode where, if the server ip is wrong (or password is wrong), there is no error indication given. Of course, the programmer can ignore the error, or pass a nil as the error callback.

@marcelstoer
Copy link
Member

marcelstoer commented Dec 28, 2016

Thanks for picking this up Arnim.

autoreconnect n connection retries: 0 for false, > 0 for n retries, < 0 to retry infinitely, default 0

It may be a perfectly valid approach in the C/embedded world, I have little to compare it with, but I don't like that. I prefer explicit APIs. English is not my mother tongue but to me a parameter named autoreconnect must semantically be a boolean.
I'd also argue that the "n retries" feature won't be used much. Isn't it so that you either want retry or not and if so you want it to be infinite? Furthermore, which ever way the MQTT behaviour will eventually be modified it should be consistent with the WiFi auto-(re)connect feature.

@devyte
Copy link

devyte commented Dec 28, 2016

@marcelstoer an explicit api is ok too. I don't really have a preference for how this is offered.

The name "autoreconnect" does not necessarily imply true|false. It can be the name of a feature, which takes many parameters. But like I said, I don't really have a preference for the api, so I'd rather leave that up to you :)

The n-retries approach is widely used in pretty much everything, including the Linux OS itself. It pretty much says: "retry n times, after that give up". Therefore, I would argue that it would be used quite often, especially in view of what @pjsg said.
Some users, such as me, would prefer to keep on trying for a very long time, or maybe even forever. This makes for a "soft connection" behavior, where connection errors can be recovered without restarting every single ESP in the network.
To explain this a bit: my use case is a wide wifi with many wifi routers interconnected via WDS. The thing is, the communication between the routers is not always stable, and sometimes I even have to reboot them. If I reboot the routers, all ESPs lose communication to the broker. If the autoreconnect is not handled properly, the ESPs would just give up and sit there staring at the wall. You could argue that specifying a timeout would be enough, but to be honest I've seen the wifi network take from a few seconds up to 15mins to recover. I prefer to have the ESPs just keep trying, eventually I get the wifi back up, and then the ESPs recover by themselves.

Also, like I explained in the previous post, all ESPs, the broker, and all wifi routers, among other devices, are on the same power circuit. If the power circuit goes down, when it comes back up all devices will power on at the same time, but will come online at very different times. I suspect that this is rather common among users, as is the case where the ESP is faster to come online that the broker. Therefore, if the first connection attempt by the ESP MQTT to the broker fails, it makes sense to keep retrying, at least for a while. This is also covered by the same n-retries approach, so it makes for simple usage for the programmer covering both cases in the same way.

@marcelstoer
Copy link
Member

make sure that retries are done with an exponential backoff (with a max of 1000 seconds).

Yes, I believe that's important, too. I can't judge though whether 1k seconds is long enough.

@devyte
Copy link

devyte commented Dec 28, 2016

@pjsg
you said:

make sure that retries are done with an exponential backoff

Exponential backoffs are usually done to cover two cases at rather opposite ends of the spectrum:
-quick and fast, where minimal overhead is desired
-long and slow, in which case you don't want to keep trying quickly, e.g.: overhead of waiting an extra few mins after an hour-long outage is ok

In this MQTT reconnect case, I don't think it's necessary to cover both ends of the spectrum. Personally, I feel that adding exponential backoff here would overcomplicate things, but that's just my own thought.

you said:

Autoreconnect = 1 -- auto reconnects after first success

I assume that you want to force an error if the first connection attempt fails, and only try to auto-reconnect if the first try succeeds. This is a Bad Idea, like I explained previously.

you said:

error callback gets an additional argument that says whether this is a 'final' error or not

This is a very good idea, it makes sense to know inside the callback whether more attempts will be tried or not.

you said:

I really don't want to have a mode where, if the server ip is wrong (or password is wrong), there is no error indication given. Of course, the programmer can ignore the error, or pass a nil as the error callback.

The callback serves the purpose of error handling. If the callback receives the "final" argument, or "i-th of n" attempts argument, then an error indication can be implemented, either for each auto-reconnect attempt, or only for the final one (i.e.: I tried n times and I'm now giving up), or both.
If the programmer is too lazy to implement proper error handling, or ignores the error, or passes nil for the callback, then I'd say he deserves his fate. Error handling is fundamental in any half-way decent code.

@marcelstoer marcelstoer added this to the 2.0.0 milestone Jan 1, 2017
@marcelstoer marcelstoer removed this from the 2.0.0 milestone Jan 20, 2017
@marcelstoer
Copy link
Member

I removed the 2.0.0 milestone as we're past the two-weeks-before-master-drop milestone and there are some unresolved issues (as per this discussion).

@karrots
Copy link
Contributor

karrots commented Jan 20, 2017

It seems like auto-reconnection after the error without handling the error doesn't do anything towards fixing the error. Aside from connection time-out in @devyte's power example, everything else needs the Lua programmer or user to take action. If the call back worked correctly then the Lua programmer can decide to reconnect or stop and notify the user.

As it stands now the error handler doesn't always get called and if you try to stop the auto-reconnect by calling close on the client object the connections keep happening until the error handler happens again and gives a nil reference error.

It seems like first steps should be to fix the bugs in the system; then improve the system with preference changes (if still necessary).

  1. Fix the error call back so it works consistently.
  2. Fix the ability to stop auto-reconnect.

After those two items are fixed the Lua programmer can choose how to correctly deal with each error. Currently, they feel tied down because the system isn't working consistently.

@pjsg
Copy link
Member Author

pjsg commented Jan 22, 2017

The current approach is that the erro callback is supposed to be called at most once after a connect call is made. This is a final callback, no further action will be taken.

If autoreconnect means "always retry errors" then the programmer will get no error callbacks at all (there is no final error).

I'm thinking that autoreconnect true means that errors are retried (no matter what the error is). I'm going to stick a reasonable timeout there -- say 10 seconds, However, all errors will be signalled to the error callback. An argument will be passed to say if there is a retry pending. Calling close would stop the retry.

Does this meet people's needs?

@devyte
Copy link

devyte commented Jan 22, 2017 via email

@pjsg
Copy link
Member Author

pjsg commented Jan 23, 2017

The programmer has two choices:

  • They can manage the retries themselves. Don't set the auto-reconnect flag

  • The system will continually retry (for ever). Each time the connection fails, the error callback will be called. If the programmer wants to stop after 3 hours of errors, then they can do that and call :close()

I suspect that most people will choose the second option (infinite retries).

@pjsg
Copy link
Member Author

pjsg commented Jan 23, 2017

Actually, now I stare at the mqtt code some more, I'm not sure that I can make it do anything consistent around reconnects.

I'm now feeling that the auto-reconnect option is a very bad idea. The programmer should deal with the reconnects by doing these from the error callback. The error callback should be guaranteed to be called at most once per connection. It should be called whenever the system gives up trying to get a good connection, or an existing connection breaks.

Also, given that the default for clean_session is true, it doesn't make sense to transparently reconnect anyway (as the broker throws away the state when the connection dies).

I'm now in favor of the following:

  • Mark the auto-reconnect as deprecated with large warnings.
  • Work on the reliability of the error callback
  • Add an example showing how to do reconnect on error.

@marcelstoer
Copy link
Member

Philip, your proposal sounds reasonable, I'm ok with that.

Mark the auto-reconnect as deprecated with large warnings.

Had we somehow resolved #1538 we could make sure programers don't miss that.

@pjsg
Copy link
Member Author

pjsg commented Jan 24, 2017

I've updated the documentation to make it clear that auto-reconnect should not be used. It particular, it interacts badly with cleansession.

I think that the underlying problem (of the missing callback) is also resolved here.

@NicolSpies
Copy link

@pjsg Way back when I started reporting about the mqtt becoming "stale" or unresponsive in #1406 the reconnect functionality was part of the situation. The following would have been nice:

If auto-reconnect is set, when the mqtt connection becomes "stale" or unresponsive, auto-reconnect is tried as many times as I have told it to try to reconnect. If it is does not succeed after the tries the "offline" event in triggered with a auto-reconnect "true" flag allowing me to "know" reconnect failed and will allow for specific code in the callback.
If auto-reconnect is not set, the "offline" event is triggered immediately with a auto-reconnect "false" flag.
This way the developer can let mqtt try to reconnect first before passing the "offline" callback.

Now, different scenarios I experienced complicates the suggestion above. The scenarios consists of a combination of one or more of the following:

ESP connection to AP failed.
ESP connected to AP but AP connection to the internet failed.
ESP connected to AP and AP connected to internet but connection to the mqtt broker failed.
Dependent on the existence and combination of the above, auto-reconnection is futile and hence my original response regarding the "ping" functionality to establish what has failed before attempting an reconnect.

Thus as a final thought regarding the auto-reconnect: It should only be attempted if none of the 3 conditions above are true or present. If this can be incorporated in the back-end functionality and returned as flags in the callback it will be great.

@jmattsson
Copy link
Member

Is this waiting for a particular milestone?

@marcelstoer marcelstoer added this to the 2.0.0-follow-up milestone Mar 4, 2017
@marcelstoer
Copy link
Member

Is this waiting for a particular milestone?

No, not really. But maybe for another approval (besides mine) from someone more familiar with the C-side of things 😉
Furthermore, the deprecation warning could now be embedded in the code (see #1683 (comment)).

@jmattsson
Copy link
Member

👍 from me then, and +1 on adding deprecation warning in code

@marcelstoer
Copy link
Member

marcelstoer commented Apr 4, 2017

I just tried to add the deprecation warning to the code so we could finally merge this...and I failed 😞

I added this

if ( auto_reconnect == RECONNECT_POSSIBLE ) {
    platform_print_deprecation_note("autoreconnect == 1 is deprecated", "in the next version");
}

to https://github.com/nodemcu/nodemcu-firmware/pull/1683/files#diff-904c0a57714312fb890c87ccf9beb2a4R1076 but the compiler complained about

implicit declaration of function 'platform_print_deprecation_note'

Why is this? After all, platform.h is included.

@devsaurus
Copy link
Member

Why is this? After all, platform.h is included.

Yes, but the feature branch in this PR diverged before the deprecation function in ba9d3af was merged. platform_print_deprecation_note() is not included of this code base.

@marcelstoer
Copy link
Member

Oh sorry, silly me...why didn't I check. I didn't realize this PR is that old. I'll merge it then and add the snippet to dev right afterwards.

@marcelstoer marcelstoer merged commit 66ffa6c into nodemcu:dev Apr 4, 2017
@marcelstoer
Copy link
Member

There you go: b645100.

eiselekd pushed a commit to eiselekd/nodemcu-firmware that referenced this pull request Jan 7, 2018
* Fix the error callback from not being called sometimes
* Moved the setting of the reconnect status to after the connack is recevied
* Increase the irom0_seg size
* Updated the documentation
* Make it clearer that autoreconnect is deprecated
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants