MQTT drops connection for no apparent reason #2932
Comments
Can you please be more specific about which branch you are using now? Have you tried with clean mosquitto? |
@KT819GM Sorry, I am using the NodeMCU 2.2.1-master_20190405 build. I have tried connecting to Mosquitto with NodeMCU and it's flawless; however, that's not a fair test, as Mosquitto does not support secure connections. I have to say, I am disappointed with the lack of support from @joysfera @NicolSpies @pjsg on this issue. For the novice, this is a massive learning curve and I was hoping that an expert could contribute. It's not something a lot of people are able to fix themselves. |
@georeb , I strongly suspect 27Kb heap is not enough for a successful TLS azure handshake as has been discussed in detail in previous posts. For Telegram I need the heap to be at least 35Kb with a |
@NicolSpies Perhaps I should have been clearer; I am connected with 27Kb heap, not attempting to connect with 27Kb heap. The handshake is successful and I am connected. The issue intermittently arises on a message receive, publish or most commonly, sending a keep-alive packet (basically, any communication event between the host and client and vice versa). There is no pattern, just completely unpredictable. I have debugged as you've described and the only useful information from that debug is above:
I'd expect to see an |
@georeb , sorry to disappoint you with the lack of support. Couple of notes from my own experience:
|
No problem @joysfera I understand that this is all open source, therefore no-one is obliged to progress this, however, this is all only as powerful as the support users are provided with. I get unjustifiably frustrated sometimes. Oh well!
Your 'work around' solution may work for Mosquitto but isn't a viable solution for when using Azure, as users are charged for every message they process, which, being Microsoft, gets quite expensive.
Sorry, my mistake, I wasn't aware of this. It's been at least 2 years since I last used it.
I disagree, I use many different TLS aspects of the NodeMCU firmware and have no issue usually, just MQTT. I've learnt that you just have to code as cleanly as possible. @NicolSpies I agree that there is a bug in the MQTT module somewhere, rather than being a TLS issue, but where and why has this been ignored for 3 years!? This is also backed up (as I mentioned in #1811) by the following debug I got in 2017 and I still get now:
Where can we go from here? We have a clear and known issue thanks to @joysfera sniffing around with Wireshark, so why are we just ignoring this problem? |
The NodeMCU MQTT module got some fundamental improvements about a year ago, IIRC. For example, a received message fragmented by TCP is now correctly concatenated at the NodeMCU side. This didn't work previously at all. I assume the change needed to implement this could also have fixed most (if not all) of the other MQTT issues I had originally. I don't go back and test old things due to lack of time and also because there is no possibility to upgrade firmware in units that have been deployed already. FYI, I use the 20181207 release in all the nodes deployed in the last 10 months but I also have many nodes running some older firmwares (2017, 2016 and earlier) including those that infamously corrupt memory and do other nasty things. Nevertheless they all seem to work rather stable (knocking on wood!), mostly thanks to automatic NodeMCU reboot when things go hairy. FYI2, I may have the largest NodeMCU deployment of you all. Also, my software is rather complex and mostly does not fit into memory so it tends to reboot when, say, a timer fires and a web request arrives at the same time. And I cannot update old firmware because there isn't any OTA (I didn't go for the rboot thingy), so it's a pain when NodeMCU breaks compatibility with old firmware (which happens surprisingly often). That's also why I am generally not much interested in NodeMCU progress because I simply cannot use it easily (I still provide OTA update of my software even to several years old nodes running outdated and buggy firmware). I also have an interesting experience trying to run my Lua software on ESP32 (I was hoping TLS would be stable there). It forced me to do things I was hoping I would never have to do, but that's another story. Anyway, the MQTT without TLS seems to do what I need, most of the time, using the workarounds I explained earlier/above. Wish you better luck with Azure. |
@NicolSpies Any thoughts? Is there a way to monitor the behaviours of the TLS handshake that is maintained? I suspect that when my software processes the messages from Azure and generates the required response, that perhaps I use too much heap and the connection is dropped. Is this possible or even able to be shown on the debug? Like I said, I don't get any |
Compiling an image with all the debug switches and using the tls.debug(3) command provides reams of debug information on all the TLS messages exchanged between the 8266 and the server. I have posted a number of such debugs on the list. Without the debug logs it is impossible to say what the problem could be. I had TLS related problems in the past not related to low heap that could only be spotted by fine-combing the debug log line by line. |
@NicolSpies Okay, that sounds promising and I'll comb through as much debug output as required if it helps fix the issue for me and others... Currently my debug output gives this:
PLEASE NOTE
How do I enable this |
@georeb I assume you have included You could find something similar to the following in the debug info related to the failure you reported:
0x7880 is MBEDTLS_ERR_SSL_PEER_CLOSE_NOTIFY, the normal shutdown of a TLS connection. This means the connection was otherwise fine, but the server shut it down. This usually means an MQTT protocol error: the server didn't like something about what was sent. |
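When matching a return value against the mbedTLS headers, it helps to remember that mbedTLS error codes are negative integers, usually written as `-0x....`. The following is a minimal sketch of that formatting. `format_tls_error` is a hypothetical helper, not part of mbedTLS or NodeMCU, and the constant is redefined locally from the value quoted in this thread.

```c
#include <stddef.h>
#include <stdio.h>

/* Local copy of the value quoted in this thread; in a real build it would
 * come from the mbedTLS headers. */
#define PEER_CLOSE_NOTIFY (-0x7880)

/* Hypothetical helper: renders an mbedTLS-style error code as text such as
 * "-0x7880" so it can be searched for in the headers. Returns the value
 * returned by snprintf. */
int format_tls_error(int ret, char *out, size_t out_len)
{
    if (ret < 0) {
        /* 0u - (unsigned) ret yields the magnitude without signed overflow. */
        return snprintf(out, out_len, "-0x%04X", 0u - (unsigned int) ret);
    }
    return snprintf(out, out_len, "0x%04X", (unsigned int) ret);
}
```

Calling `format_tls_error(PEER_CLOSE_NOTIFY, buf, sizeof buf)` leaves `-0x7880` in `buf`, which is the form used in the issue description.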
@NicolSpies WOW that really outputs a lot of info, thanks for that!! I'll start sifting :) However, using This is the exact point that the watchdog barks...
|
@georeb, You need to step away a bit to see the big picture and follow the logical process from the start to see what it was busy with.
This refers to a handshake failure. You will be able to confirm this reading the debug info from the start to see if this is still during the initial TLS handshake process. I suspect it is and the failure happens before the TLS handshake is confirmed to be established. If this is so, the problem is not related to MQTT. |
@NicolSpies @pjsg @TerryE The watchdog timer reset is unrelated to this issue. It occurs every time I try to connect to Azure. I have got around this by issuing
Anyway...
Extract from a successful PUBLISH event
Extract from an unsuccessful PUBLISH event and immediate disconnect by host
To save you the time sifting through, below are the only 4 lines of inconsistencies between a successful and unsuccessful PUBLISH event. Remember that nothing has been changed between these two publish attempts. I am sending exactly the same payload to exactly the same topic...
Publishing the payload
Successful PUBLISH event
Unsuccessful PUBLISH event
It appears to show a payload encryption mismatch, differing between 80 or 112 bytes in length! I should clarify what I mean when I state 'unsuccessful PUBLISH event'. You can see that the publish routine does actually always publish the (sometimes bad) data, but the data that has been published is rejected by the host. |
@georeb , try the same but with the QoS in the publish command set to 1 and 2 respectively to see if it makes a difference. |
@NicolSpies Okay, I'll give that a try, although, please see a quote from Azure below...
I understand that your suggestion would change the QoS to 1 but what does a 'retain' value of 2 do? I always thought that retain could be either 0 or 1 (although this isn't specified in the docs). |
@georeb, the more I investigate, the more I run into old posts about more or less the same issue. The fact that it is still around indicates to me that either nobody is trying to do what you are, or they have accepted that it cannot work. All I can suggest is to try and find a pattern that could give a clue as to why the problem is intermittent. More eyes on the problem would also help, even if it is only to suggest what to try. |
@georeb, the problem could be caused by MQTT, TLS or Azure. It would help to have eyes on the server side as well. I do not know if a local Azure setup is possible. You could try another secure MQTT server to see if the fault persists. You know it is fine over an unsecured connection, so that could maybe rule MQTT out. It does work but occasionally breaks down; could it be connection instability? Run as standalone code using LFS to ensure lots of heap. Hope this helps |
@NicolSpies I think I may have found the culprit... An
if( mode == MBEDTLS_MODE_STREAM ||
    ( mode == MBEDTLS_MODE_CBC
#if defined(MBEDTLS_SSL_ENCRYPT_THEN_MAC)
      && ssl->session_out->encrypt_then_mac == MBEDTLS_SSL_ETM_DISABLED
#endif
    ) )
I have tried to get an output of the values within the
I've tried using
MBEDTLS_SSL_DEBUG_MSG( 1, ( "MODE = ", mode ) );
MBEDTLS_SSL_DEBUG_MSG( 1, ( "MBEDTLS_MODE_STREAM = ", MBEDTLS_MODE_STREAM ) );
MBEDTLS_SSL_DEBUG_MSG( 1, ( "MBEDTLS_MODE_CBC = ", MBEDTLS_MODE_CBC ) );
MBEDTLS_SSL_DEBUG_MSG( 1, ( "ssl->session_out->encrypt_then_mac = ", ssl->session_out->encrypt_then_mac ) );
MBEDTLS_SSL_DEBUG_MSG( 1, ( "MBEDTLS_SSL_ETM_DISABLED = ", MBEDTLS_SSL_ETM_DISABLED ) );
Can you shed any light on this @TerryE and/or @nwf ? I'm assuming that an incorrect cipher mode is being selected, resulting in Azure rejecting the published data and kicking the device offline... |
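A likely reason those debug calls print only the labels: in mbedTLS 2.x the inner parenthesised list is forwarded to a printf-style formatter, so a value only appears if the format string has a conversion such as `%d`. The sketch below reproduces that call shape with stand-ins. `DEBUG_MSG`, `debug_print` and `last_debug_line` are hypothetical names for illustration, not the mbedTLS implementation.

```c
#include <stdarg.h>
#include <stdio.h>

/* Stand-in for the debug sink: keeps the most recent formatted line. */
char last_debug_line[128];

void debug_print(int level, const char *fmt, ...)
{
    va_list ap;
    (void) level; /* the level is ignored in this sketch */
    va_start(ap, fmt);
    vsnprintf(last_debug_line, sizeof last_debug_line, fmt, ap);
    va_end(ap);
}

/* Hypothetical macro with the same double-parenthesis call shape. */
#define STRIP_PARENS(...) __VA_ARGS__
#define DEBUG_MSG(level, args) debug_print(level, STRIP_PARENS args)

/* No conversion in the format string: the extra argument is ignored. */
const char *log_mode_without_specifier(int mode)
{
    DEBUG_MSG(1, ("MODE = ", mode));
    return last_debug_line;
}

/* With %d the value is actually rendered. */
const char *log_mode_with_specifier(int mode)
{
    DEBUG_MSG(1, ("MODE = %d", mode));
    return last_debug_line;
}
```

Under that assumption, a call such as `MBEDTLS_SSL_DEBUG_MSG( 1, ( "mode = %d", mode ) );` would be the form to try in the firmware.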
@georeb, accolades for your determination not to give up. You have been doing a lot of digging to try and isolate the area where the problem could occur. In the past I did the same to enable the hardcore developers to give their expert opinion without the hours of trying to find a pattern that could give a clue of the problem. @nwf and @TerryE provided excellent guidance that allowed me to fix the problem. |
@NicolSpies Haha, I don't like giving up!
Exactly my mentality! I have the time to dig, just need @TerryE and @nwf to keep me heading in the right direction... In addition to my last comment: It appears that, on an unsuccessful publish/keepalive, as suspected, I am also getting:
...followed by...
...then...
Which falls in line with the fact that the |
MbedTLS Support have replied with this:
And then more worryingly, this:
Which cannot be lack of RAM, as TLS debug reports 26432 bytes of free heap at this IF statement. |
Right, so, by implementing extra TLS debug messages, I have determined the following:
Constants (that we already knew):
MBEDTLS_MODE_STREAM = 7
MBEDTLS_MODE_CBC = 2
MBEDTLS_SSL_ETM_DISABLED = 0
On a successful MQTT publish:
mode = 2
ssl->session_out->encrypt_then_mac = 0
On an unsuccessful MQTT publish:
mode = 2
ssl->session_out->encrypt_then_mac = 1701013878
So that’s the discrepancy that's causing the IF statement to return false; @TerryE and @nwf I would really appreciate your input on this, I've done the leg work here, so I'm hoping a quick glance is all you need to get to the bottom of this...?! |
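To make the effect of those values concrete, here is a small sketch that evaluates the quoted condition with the numbers reported above, assuming encrypt-then-MAC support is compiled in. The enum names and `quoted_condition` are local stand-ins rather than the mbedTLS definitions.

```c
#include <stdbool.h>

/* Values as reported in this thread's debug output. */
enum {
    MODE_CBC = 2,
    MODE_STREAM = 7,
    ETM_DISABLED = 0
};

/* Mirrors the quoted condition with the encrypt-then-MAC clause included. */
bool quoted_condition(int mode, int encrypt_then_mac)
{
    return mode == MODE_STREAM ||
           (mode == MODE_CBC && encrypt_then_mac == ETM_DISABLED);
}
```

With `mode = 2`, the condition holds for `encrypt_then_mac = 0` and fails for `encrypt_then_mac = 1701013878`, so the two publishes take different branches even though nothing in the negotiated session should have changed.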
FYI, the unsuccessful mac value is |
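One way to read the stray value: 1701013878 is 0x65636976, and on a little-endian core such as the ESP8266 those four bytes sit in memory as 0x76 0x69 0x63 0x65. A minimal sketch of that decoding follows; `u32_to_le_ascii` is a hypothetical helper written for this note.

```c
#include <stddef.h>
#include <stdint.h>

/* Copies the four bytes of `value` into `out` in little-endian order
 * (least significant byte first) and NUL-terminates the result.
 * `out` must have room for 5 bytes. */
void u32_to_le_ascii(uint32_t value, char out[5])
{
    for (size_t i = 0; i < 4; i++) {
        out[i] = (char) ((value >> (8 * i)) & 0xFF);
    }
    out[4] = '\0';
}
```

For 1701013878 the result is the ASCII text `vice`. That would be consistent with string data having been written over the field, although which string and which writer is not established by this alone.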
@georeb I agree that this needs some diagnostic time with someone who knows their way around using the gdbstub interface, but unfortunately I am up to my eyes in alligators at the moment trying to get the Lua53 release out. Perhaps @nwf or @HHHartmann can help? |
Skimming mbedTLS's source, There is no particularly good way to debug such things; perhaps printing out more of the session structure could give a hint as to who's overflowing some adjacent buffer. If the gdbstub understands watchpoints (@TerryE?), then that's another option as well. |
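As a sketch of the "print out more of the structure" idea, a small hex dump helper can render the raw bytes around a suspect field so that stray text or patterns stand out. `hex_dump` is a hypothetical helper, not an mbedTLS or NodeMCU function, and where to call it from is left open.

```c
#include <stddef.h>
#include <stdio.h>

/* Writes `len` bytes starting at `data` into `out` as space-separated hex
 * pairs. Returns the number of characters written (excluding the NUL), or
 * -1 if `out` is too small to hold the whole dump. */
int hex_dump(const void *data, size_t len, char *out, size_t out_len)
{
    const unsigned char *p = data;
    size_t used = 0;

    if (out_len == 0) {
        return -1;
    }
    out[0] = '\0';
    for (size_t i = 0; i < len; i++) {
        /* Two hex digits per byte, plus a separator for all but the first. */
        size_t need = (i == 0) ? 2 : 3;
        if (used + need + 1 > out_len) {
            return -1;
        }
        used += (size_t) snprintf(out + used, out_len - used,
                                  i == 0 ? "%02x" : " %02x",
                                  (unsigned int) p[i]);
    }
    return (int) used;
}
```

In the firmware one might pass the address and size of the session structure discussed above and feed the resulting string to a debug message, dumping it before and after each publish to see when the bytes change.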
@georeb, @HHHartmann, picking up @nwf's point: I do understand the gdbstub and
But I worked out all of this by trial and error and looking at the sources. I also feel that it is a mistake for me to be the single point of all such knowledge on the project, hence #2731. If anyone has basic C development skills, preferably has used |
Re #2731: @georeb @NicolSpies @joysfera @nwf @HHHartmann, any volunteers? |
Using Thank you @NicolSpies for suggesting to use This fixed my unique issue; however, it doesn't explain why
Perhaps this potential memory corruption is something worth investigating... |
@georeb , great news. To confirm, the random disconnection does not happen anymore ? |
@NicolSpies No, the device is bomb proof now :) The ciphers were conflicting on what looks like the event of a memory corruption. |
Happy to contribute where I can, however I'm a novice at C. I only dip in and out when I absolutely have to, case in point. |
@georeb, I would like to try it as well. Would you mind sharing your code with me offline? |
@NicolSpies My code is using the standard MQTT module, following the connection method explained here. You'll have to sign up for an Azure account to try it however... |
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. |
I'm not sure this is stale; we may still have some un-diagnosed memory corruption. |
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. |
Expected behavior
Stay connected to MQTT broker, subscribe, publish and send keep-alive packets without randomly disconnecting!
Actual behaviour
Intermittent disconnects with no rhyme nor reason.
Disconnects on a message receive
Disconnects on a publish
Disconnects on sending a keep-alive packet
Error code from NodeMCU
MBEDTLS_ERR_SSL_PEER_CLOSE_NOTIFY -0x7880
The peer notified us that the connection is going to be closed.
Error code from Azure
404104
The connection was closed by the client, but IoT Hub doesn't know why.
One of them is lying.
I'm connected with 27k heap (using LFS) so memory shouldn't be a problem (and doesn't error)
Test code
Connecting to Microsoft Azure IoT Hub (a pretty standard requirement for any IoT device)
NodeMCU version
Master branch
Hardware
ESP8266-12S