MQTT client stays connected but becomes stale or unresponsive #1406

Closed
NicolSpies opened this issue Jul 18, 2016 · 16 comments

Comments

@NicolSpies

Expected behavior

If the MQTT connection becomes stale or unresponsive, the MQTT client should fire the offline event and reconnect automatically, provided client:connect was called with auto reconnect enabled.

Actual behavior

  1. On ESP startup the MQTT client connects to the broker and subscribes to topics. Topics with payloads can be published and received.
  2. The connection is periodically confirmed by publishing a dummy message and verifying a successful return.
  3. When the MQTT connection becomes stale or unresponsive, the MQTT broker sends the LWT, confirming the ungraceful client disconnection.
  4. When this happens the client still considers itself connected and the offline event does not fire. This is confirmed by the "already connected" message when trying to reconnect in this state.
  5. If the stale connection is closed using the client:close() method, the offline event fires.
  6. The auto reconnect option set for the client does not take effect after the offline event has fired.
  7. The client object has not been destroyed, but when the client:connect method is executed again no events are triggered and the reconnection does not take place.
  8. The only way to re-establish the MQTT connection is to restart the ESP. Everything then works normally until the connection becomes stale again. The stale state occurs about once every few hours.

Test code

-- The standard example code in the NodeMCU documentation is used
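
The exact script was not posted. As a rough illustration only, the setup described in the steps above might look like the sketch below. It assumes the 2016-era NodeMCU `mqtt` and `tmr` APIs (`mqtt.Client`, `lwt`, `on`, `connect` with the autoreconnect argument, `subscribe`, `publish`, `tmr.alarm`). The function name `start_client`, the client id, the topic names and the broker address are placeholders, not values from this report. The modules are passed in as arguments so the function can also be exercised off-device with stand-ins.

```lua
-- Sketch only. Client id, topics and broker address are placeholders,
-- not values taken from this issue.
local function start_client(mqtt, tmr, broker, port)
  local m = mqtt.Client("esp-node-1", 120)      -- client id, keepalive (s)

  -- Last will: the broker publishes this when the client drops ungracefully.
  m:lwt("/lwt/esp-node-1", "offline", 0, 0)

  m:on("offline", function(client) print("offline event fired") end)
  m:on("message", function(client, topic, data) print(topic, data) end)

  -- Arguments: host, port, secure (0 = plain TCP), autoreconnect (1 = on).
  m:connect(broker, port, 0, 1, function(client)
    print("connected")
    client:subscribe("/cmd/esp-node-1", 0, function() print("subscribed") end)
    -- Periodic dummy publish used to confirm the link (step 2 above).
    tmr.alarm(1, 60000, tmr.ALARM_AUTO, function()
      local accepted = client:publish("/heartbeat/esp-node-1", "ping", 0, 0)
      if not accepted then print("publish rejected") end
    end)
  end)
  return m
end

-- On the device (address is hypothetical):
-- start_client(mqtt, tmr, "192.168.1.10", 1883)
```

With this kind of setup, the behavior reported here is that the "offline" handler never runs even though the broker has already published the LWT message.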

NodeMCU version

Lua 5.1.4 on SDK 1.5.1(e67da894)
branch: dev
commit: adf7173
SSL: false
modules: adc,bit,cjson,enduser_setup,file,gpio,i2c,mdns,mqtt,net,node,rotary,rtctime,sntp,tmr,uart,wifi

Hardware

ESP12-E is used in a standard configuration.

@marcelstoer
Member

Isn't that a dup of #1394 and #1395 (which themselves seem dups)?

@NicolSpies
Author

Yes, I have been re-testing and believe we can make this the root issue describing the sequence of events, and link the other two issues, which are interrelated, to this one.

@NicolSpies
Author

In point 6 of the actual behavior above, the auto reconnect does not fire when the MQTT connection has become stale or unresponsive.

HOWEVER,

the auto reconnect does fire and executes successfully if the MQTT client goes offline immediately after a new MQTT connection has been established.

@djphoenix
Contributor

@NicolSpies can I ask you to try the patch from #1349? Maybe it will clarify something...

@NicolSpies
Author

@djphoenix unfortunately not possible, as I only program in Lua on firmware obtained from NodeMCU custom builds.

@djphoenix
Contributor

OK, try out this binary: link
It's the latest dev (3eccf5) with the patch from #1349. Module selection is similar to your build.
Note that you may need to upload esp_init_data_default.bin according to the flash docs.

@NicolSpies
Author

Thanks, will try as soon as I have time (in a few days).

@djphoenix
Contributor

@NicolSpies any progress or reports here?

@NicolSpies
Author

@djphoenix, apologies for only getting to it now. Good news:

  1. The connection and subscription process still operates as before.
  2. When the MQTT connection drops, the offline event fires and auto reconnect takes place (immediately).
  3. After auto reconnection the subscriptions are still valid and re-subscription is not required.

@marcelstoer
Member

Thanks Nicol for testing. I conclude that the code Yuri provided fixes all the problems you reported, correct? If so, then please close this issue.

Looks like we only need to track #1349 then and hope Yuri finds the time to turn that patch into a PR.

@NicolSpies
Author

Agreed, on all points, my money is on Yuri for the PR. 👍

@NicolSpies
Author

NicolSpies commented Aug 18, 2016

Additional information: this may be a new condition not discovered before. It might be that the connection is fine, hence no offline event, but the moment a message is received a disconnect happens, hence the LWT. In this condition the on message callback is not triggered as it normally is when the connection is responsive.

Something else that might be of value. I publish a message every 60 seconds but messages are not received often.

@djphoenix, Hi Yury, I have been testing the last dev merge for #1445 since the release.
The condition where the MQTT connection becomes unresponsive to incoming messages or subscriptions, without the offline event being triggered, has surfaced again.

The sequence of events is as follows. The connection becomes unresponsive. The offline event is not triggered. When an MQTT message is received, the LWT is triggered immediately. As before, the reconnection is not initiated, most probably because the offline event is not triggered.

The strange thing is that it worked for a while and then stopped, without any change to the way MQTT is used. Formatting and reflashing the ESP with the originally "working" build does not restore operation.
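
Since the offline event cannot be relied on in this state, one possible application-level workaround is a heartbeat watchdog: subscribe to the client's own heartbeat topic, record when the echo last arrived, and escalate if it stays silent for too long. The sketch below shows only the timing logic in plain Lua. It does not address the underlying firmware behavior. `new_watchdog`, the 180 s timeout and the wiring shown in the comments are illustrative assumptions, not something tested in this thread.

```lua
-- Sketch of a heartbeat watchdog. Times are in seconds and are supplied
-- by the caller, so the logic itself has no NodeMCU dependency.
local function new_watchdog(timeout_s, on_stale)
  local last_seen = 0
  local fired = false
  return {
    -- Call from the "message" handler whenever the heartbeat echo arrives.
    feed = function(now_s)
      last_seen = now_s
      fired = false
    end,
    -- Call from a periodic timer. Runs on_stale once per silent period
    -- and returns true while the link is considered stale.
    check = function(now_s)
      if not fired and now_s - last_seen > timeout_s then
        fired = true
        on_stale()
      end
      return fired
    end,
  }
end

-- Possible wiring on the device (assumed, adapt to your script):
--   local wd = new_watchdog(180, function() node.restart() end)
--   m:on("message", function(c, topic, data) wd.feed(tmr.time()) end)
--   tmr.alarm(2, 30000, tmr.ALARM_AUTO, function() wd.check(tmr.time()) end)
```

Restarting is used as the escalation in the wiring comment because, as reported above, calling client:connect again on the stale client did not restore the connection.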

@djphoenix
Contributor

@NicolSpies Hello again.
As I see it, your case was not covered by my patch... So strange things are happening. The MQTT client with my patch was tested "in-the-wild" for weeks (in a production-like environment), and no issues like yours appeared.
So let's go deeper. Can you post an example of your code? As I understand it, you post a lot of messages to the MQTT broker... Server and network configuration may also be important.

@marcelstoer
Member

@NicolSpies I still can't quite picture what the real issue is but I suspect it should be tracked separately.

@NicolSpies
Author

Hi, I am testing "in the wild" on two identical units in a production-like environment in order to obtain a clearer picture of the real issue. Will report back when a pattern emerges.

@marcelstoer
Member

Thanks Nicol, feel free to create an issue here then.
