Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

mqtt not properly acknowledging messages? #1773

Closed
nwf opened this issue Feb 1, 2017 · 8 comments
Closed

mqtt not properly acknowledging messages? #1773

nwf opened this issue Feb 1, 2017 · 8 comments
Labels

Comments

@nwf
Copy link
Member

nwf commented Feb 1, 2017

Something appears to go wrong, occasionally, especially with empty messages. I've tested this against both a mosquitto and a vernemq broker, so I think it really is an issue with nodemcu's mqtt implementation.

I am running the following code on the device

rtctime.set(0)
m = mqtt.Client("NODE",15,nil,nil,1)
m:on("connect",function(c) print("CONNECT") end)
m:on("offline",function(c) print("OFFLINE") end)
m:on("message",function(c,t,m) print("MESSAGE",rtctime.get(),t,m) end)
m:connect("[broker redacted]",1883,0,0)
m:subscribe("ctfws/game/message",2)

The device itself identifies as

NodeMCU custom build by frightanic.com
	branch: dev
	commit: a271ce4d34da4efb2a25114da7a5725d344c85ce
	SSL: true
	modules: bit,cjson,cron,file,gpio,i2c,mqtt,net,node,rtcmem,rtctime,sntp,tmr,uart,wifi,tls
 build 	built on: 2017-01-20 06:54
 powered by Lua 5.1.4 on SDK 2.0.0(656edbf)

On another machine, I am using mosquitto_pub to lob messages at the broker. If I run

mosquitto_pub [...] -q 2 -t ctfws/game/message -r -m 'Test 1'

then I get the expected single shot delivery:

MESSAGE	55	ctfws/game/message	Test 1

with the message never repeated. But if I then run

mosquitto_pub [...] -q 2 -t ctfws/game/message -r -m ''

then I get the message sent several times, 20 seconds apart:

MESSAGE	109	ctfws/game/message	nil
MESSAGE	129	ctfws/game/message	nil
MESSAGE	150	ctfws/game/message	nil

20 seconds is the default for mosquitto's "retry_interval" value, so this looks very much like broker-initiated retransmission, and indeed the broker screams bloody murder:

1485983002: Warning: Received PUBREC from NODE for an unknown packet identifier 2.

repeated over and over. Curiously this seems to happen most often with empty messages, but I have seen it with others, too.

If I downgrade the QoS on transmission from 2 to 1, and send

mosquitto_pub [...] -q 1 -t ctfws/game/message -r -m 'Test 1'        
mosquitto_pub [...] -q 1 -t ctfws/game/message -r -m ''      
mosquitto_pub [...] -q 1 -t ctfws/game/message -r -m 'Test 2'

I will not get the PUBREC warning above in the broker's logs, but I will be rewarded with an empty message being re-delivered out of order after the retry timeout:

MESSAGE	17	ctfws/game/message	Test 1
MESSAGE	20	ctfws/game/message	nil
MESSAGE	24	ctfws/game/message	Test 2
MESSAGE	41	ctfws/game/message	nil

I have seen this happen on other builds, too, including on master rather than dev.

If there's anything more I can add, please don't hesitate to ask.
Thanks much!
--nwf;

@marcelstoer
Copy link
Member

m:connect("[broker redacted]",1883,0,0)
m:subscribe("ctfws/game/message",2)

Not the root cause here but I'm surprised this even works considering that you should subscribe/publish only if connected.

@nwf
Copy link
Member Author

nwf commented Mar 28, 2017

Almost surely not the relevant cause; I'll fix the minimized example to respect that constraint, tho'.

@nwf
Copy link
Member Author

nwf commented Apr 1, 2017

OK, this is going to be a long comment. I have asked for, and gotten, a dev-branch build with debugging turned on (thanks to the builder!). This build identifies itself as

NodeMCU custom build by frightanic.com
	branch: dev
	commit: 9c71c64dcd2b2705774a0f3d00c9a5cbc8677afc
	SSL: false
	modules: mqtt,net,node,rtctime,tmr,uart,wifi
 build 	built on: 2017-04-01 00:17
 powered by Lua 5.1.4 on SDK 2.0.0(656edbf)

I have revised the minimum (non-)working example thusly. Having set all of wifiss, wifipw, user, passwd, topic, qos and broker, I run the following right after the device restarts

wifi.setmode(wifi.STATION, false)
wifi.sta.config( { ["ssid"] = wifiss, ["pwd"] = wifipw, ["save"] = false } )
wifi.sta.connect()

when I see that DHCP has succeeded, I run

rtctime.set(0)
m = mqtt.Client("NODE",15,user,passwd,1)
m:on("connect",function(c) print("CONNECT") ; m:subscribe(topic,2,function() print("SUBSCRIBED") end) end)
m:on("offline",function(c) print("OFFLINE") end)
m:on("message",function(c,t,m) print("MESSAGE",rtctime.get(),m) end)
m:connect(broker,1883,0,0)

On a nearby host, I send some traffic, including an empty message, thus:

P=(-h $BROKER -u ... -P ... -t $TOPIC)
for i in `seq 1 5`; do mosquitto_pub "$P[@]" -q $QOS -m $i; sleep 5; done
mosquitto_pub ."$P[@]" -q $QOS -m ''
for i in `seq 11 15`; do mosquitto_pub "$P[@]" -q $QOS -m $i; sleep 5; done

For all QOS=0, this seems to work fine, and no nil message appears out of sequence; messages come in and debugging spews from the machine:

enter mqtt_socket_received.
MQTT_DATA: type: 3, qos: 0, msg_id: 0, pending_id: 0
enter deliver_publish.
MESSAGE	415	9
leave deliver_publish.
send_if_poss, queue size: 0
leave mqtt_socket_received.

However, if I set all the QOS to 1 and re-run the experiment, something is immediately fishy:

enter mqtt_socket_received.
MQTT_DATA: type: 3, qos: 1, msg_id: 5, pending_id: 0
MQTT: Queue response QoS: 1
enter deliver_publish.
MESSAGE	1171	5
leave deliver_publish.
Sent: 4
send_if_poss, queue size: 1
leave mqtt_socket_received.
enter mqtt_socket_sent.
sent1, queue size: 1
send_if_poss, queue size: 0
sent2, queue size: 0
leave mqtt_socket_sent.
[... ping timer debugging printouts elided ...]
enter mqtt_socket_received.
MQTT_DATA: type: 3, qos: 1, msg_id: 0, pending_id: 0
MQTT: Queue response QoS: 1
enter deliver_publish.
MESSAGE	1178	nil
leave deliver_publish.
Sent: 4
send_if_poss, queue size: 1
leave mqtt_socket_received.
enter mqtt_socket_sent.
sent1, queue size: 1
send_if_poss, queue size: 0
sent2, queue size: 0
leave mqtt_socket_sent.
[... ping timer debugging printouts elided ...]
enter mqtt_socket_received.
MQTT_DATA: type: 3, qos: 1, msg_id: 7, pending_id: 0
MQTT: Queue response QoS: 1
enter deliver_publish.
MESSAGE	1179	11
leave deliver_publish.
Sent: 4
send_if_poss, queue size: 1
leave mqtt_socket_received.
enter mqtt_socket_sent.
sent1, queue size: 1
send_if_poss, queue size: 0
sent2, queue size: 0
leave mqtt_socket_sent.
[... ping timer debugging printouts elided, as well as receipt of messages with bodies 12 and 13 ...]
enter mqtt_socket_received.
MQTT_DATA: type: 3, qos: 1, msg_id: 0, pending_id: 0
MQTT: Queue response QoS: 1
enter deliver_publish.
MESSAGE	1198	nil
leave deliver_publish.
Sent: 4
send_if_poss, queue size: 1
leave mqtt_socket_received.
enter mqtt_socket_sent.
sent1, queue size: 1
send_if_poss, queue size: 0
sent2, queue size: 0
leave mqtt_socket_sent.
enter mqtt_socket_received.
MQTT_DATA: type: 3, qos: 1, msg_id: 10, pending_id: 0
MQTT: Queue response QoS: 1
enter deliver_publish.
MESSAGE	1198	14
leave deliver_publish.
Sent: 4
send_if_poss, queue size: 1
leave mqtt_socket_received.
enter mqtt_socket_sent.
sent1, queue size: 1
send_if_poss, queue size: 0
sent2, queue size: 0
leave mqtt_socket_sent.

Zooming in a bit, I see

MQTT_DATA: type: 3, qos: 1, msg_id: 5, pending_id: 0
MQTT_DATA: type: 3, qos: 1, msg_id: 0, pending_id: 0
MQTT_DATA: type: 3, qos: 1, msg_id: 7, pending_id: 0
MQTT_DATA: type: 3, qos: 1, msg_id: 0, pending_id: 0
MQTT_DATA: type: 3, qos: 1, msg_id: 10, pending_id: 0

The message IDs are perfectly incremented up through and including message "5", then mysteriously 0 rather than 6 for the empty message, then 7 and on again for the non-empty messages. That looks like a parser bug on the nodemcu side. In fact, 0 looks like an error value from mqtt_get_id, but nobody seems to be paying attention (I mean, they're just using the returned 0 as the message id).

In fact, looking at mqtt_get_id's handling of MQTT_MSG_TYPE_PUBLISH... surely the check for QOS 0 should be moved before the topic parsing, since we won't use all that work if QOS is zero. In fact, is the problem here just that if(i + 2 >= length) (line 254 as of 426e805) should be if (i+2 > length) instead? We're absolutely OK with consuming every byte of a message if there isn't a payload.

Anyway, back to the test... running this with QOS=2 really confuses things; we don't even always make it to the empty message before falling offline:

enter mqtt_socket_received.
MQTT_DATA: type: 3, qos: 2, msg_id: 3, pending_id: 0
MQTT: Queue response QoS: 2
enter deliver_publish.
MESSAGE	2364	3
leave deliver_publish.
Sent: 4
send_if_poss, queue size: 1
leave mqtt_socket_received.
enter mqtt_socket_sent.
sent1, queue size: 1
sent2, queue size: 1
leave mqtt_socket_sent.
enter mqtt_socket_timer.
timer, queue size: 1
Sent: 4
send_if_poss, queue size: 1
keep_alive_tick: 0
leave mqtt_socket_timer.
enter mqtt_socket_received.
MQTT_DATA: type: 6, qos: 1, msg_id: 3, pending_id: 3
MQTT: Response PUBCOMP
send_if_poss, queue size: 1
leave mqtt_socket_received.
enter mqtt_socket_sent.
sent1, queue size: 1
send_if_poss, queue size: 0
sent2, queue size: 0
leave mqtt_socket_sent.
enter mqtt_socket_disconnected.
OFFLINE
leave mqtt_socket_disconnected.

I am not sure what to make of that.

@nwf
Copy link
Member Author

nwf commented Apr 1, 2017

It occurs to me, belatedly, to tag @pjsg for his opinion on the above.

devsaurus pushed a commit that referenced this issue Apr 19, 2017
@marcelstoer
Copy link
Member

Now that #1906 is merged do you want to keep this open?

@nwf
Copy link
Member Author

nwf commented Apr 20, 2017

Yes; I believe #1906 is only a partial fix to the issues enumerated above. In particular, I do not believe it has bearing on the QoS=2 case above.

eiselekd pushed a commit to eiselekd/nodemcu-firmware that referenced this issue Jan 7, 2018
@stromnet
Copy link
Contributor

stromnet commented Dec 2, 2018

Please take this for a new test now with #2571 merged to dev, I did some quick tests and it seems at least I don't get a screaming warnings in mosquitto, but I don't think its 100% perfect, seeing some repeated messages for some reason, when there is no sleep between the sends (i.e between the empty msg and the first on in second loop):

1543749425: Received PUBREC from 3071920 (Mid: 17) 
1543749425: Sending PUBREL to 3071920 (Mid: 17)                       
1543749425: Received PUBREC from 3071920 (Mid: 17)                  
1543749425: Sending PUBREL to 3071920 (Mid: 17)
1543749425: Received PUBREC from 3071920 (Mid: 18)
1543749425: Sending PUBREL to 3071920 (Mid: 18)                       
1543749425: Received PUBREC from 3071920 (Mid: 18)                  
1543749425: Sending PUBREL to 3071920 (Mid: 18)

Don't have time to dig further myself though!

@stale
Copy link

stale bot commented Nov 27, 2019

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Nov 27, 2019
@stale stale bot closed this as completed Dec 11, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants