Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Mosquitto 5.1.1 is broken. #1887

Closed
adamf663 opened this issue Feb 26, 2021 · 158 comments · Fixed by #2007
Closed

Mosquitto 5.1.1 is broken. #1887

adamf663 opened this issue Feb 26, 2021 · 158 comments · Fixed by #2007

Comments

@adamf663
Copy link

The problem

Environment

  • Add-on with the issue: Mosquitto broker
  • Add-on release with the issue: 5.1.1
  • Last working add-on release (if known): 5.1
  • Operating environment (OS/Supervised): hassio

Problem-relevant configuration

Traceback/Error logs

Additional information

@LifeBandit666
Copy link

LifeBandit666 commented Feb 27, 2021

I know OP has left zero information here and it will probably be closed out, but on a surface level I have to agree. Something has gone wrong with this point update.

I get my whole Home Assistant instance go down overnight. When I try to access it over the network I cant connect.

After a week of troubleshooting I have narrowed it down to my Mqtt server. What seems to be happening is it loses connection to HA and throws the whole thing off. HA is still running and inaccessible. I have to power cycle to get it up and running again and when I do, I get errors in HA logs about "timed out waiting for mid 2" for Mqtt, and errors in mosquitto about being unable to find HA.

I also find my Zigbee2mqtt devices are found at startup then immediately lost again (presumably when the time out occurs) and have to restart Zigbee2mqtt a couple of times to get it running again.

I wish i could prove that the HA hang is due to mosquitto but since it's inaccessible (but still running) I can't. What I can say is that in Recorder I can see the only devices that are unavailable are Zigbee2mqtt and this drops because of mosquitto (there's zero errors in the Zigbee2mqtt logs in debug mode).

I've restored an old backup of the previous version of Mosquitto to try and combat the issue. I'll try to remember to report back.

Edit: I think the same issue is #1817 and #1814

Also this one home-assistant/core#45036

@LifeBandit666
Copy link

Ok so after rolling back to 5.1 version of Mosquitto I no longer get the timeout errors on startup in the core log and Zigbee2mqtt seems to boot from the get go again (instead of requiring multiple reboots to find Mosquitto) Hopefully this will also fix the issue of my whole HA server becoming unresponsive periodically too.

This has been a week long hunt to try and find out why my system is dying which I was initially blaming on a new cable for my SSD. I've also tried looking at my Powerline system thinking it was cutting connection to HA.

@hoopsta1423
Copy link

Yup, I updated last night and the add-on CPU usage spiked up to 150%. All my zwave2mqtt devices stopped working. Rolled back to 5.1 this morning and it's back to normal

@grantalewis
Copy link

Confirming same. Very strange, sluggish system recently with hundreds of lines similar to

2021-02-27 07:18:29 ERROR (MainThread) [homeassistant.components.mqtt] Timed out waiting for mid 98

in home-assistant.log. After rolling Mosquitto back to 5.1, no more errors in the log, performance back to normal.

@peter-vanpoucke
Copy link

peter-vanpoucke commented Feb 27, 2021

I also needed to rollback to 5.1. I had the feeling Hassio couldn't log in any longer, but didn't put time in investigating. 😔

@andersmoldin
Copy link

andersmoldin commented Feb 27, 2021

I too seem to have this problem after installing Mosquitto 5.1.1. I did not make a snapshot, so I wonder if it is possible to install a specific version of Mosquitto, or in any other way downgrade to 5.1?

Edit: Attaching a screenshot. Connecting locally works fine, but connecting through my duckdns address suddenly doesn't work – after upgrading Mosquitto from 5.1. to 5.1.1 yesterday.
Screenshot 2021-02-27 at 14 59 03

@willysoft80
Copy link

Confirming same. Very strange, sluggish system recently with hundreds of lines similar to

2021-02-27 07:18:29 ERROR (MainThread) [homeassistant.components.mqtt] Timed out waiting for mid 98

in home-assistant.log. After rolling Mosquitto back to 5.1, no more errors in the log, performance back to normal.

I have the same problem and the home assistant has become slow to read the topics

@realthk
Copy link

realthk commented Feb 27, 2021

I have the same problem and the home assistant has become slow to read the topics

Yes, that was what I've noticed first: most of my stuff is MQTT-based, and it suddenly took a few seconds to switch on a light through MQTT, controlled by a door- or motion-sensor, also connected through MQTT. I tried restart, then hundreds of "Timed out waiting for mid xxx" messages in the log.

Found this topic, rolled back MQTT to 5.1 and no messages any more, and light is switched on well under a second again.

@Dinth
Copy link

Dinth commented Feb 27, 2021

Yep, same here - none of the things using MQTT in my home works anymore - Openzwave, Ebus, Zigbee2MQTT, Valetudo - all the devices went "Unavailable". I have tried restarting every component but still cant get any of my devices to work.
Strangle - MQTT addon log shows that addons are connected:

1614440933: Client zwave already connected, closing old connection.
1614440933: New bridge connected from 172.30.33.2 as zwave (p2, c1, k60, u'addons').
1614440933: Client blueIris already connected, closing old connection.
1614440933: New client connected from 10.10.1.14 as blueIris (p1, c1, k20, u'xxx').
1614440933: Client mqttjs_c5c88712 already connected, closing old connection.
1614440933: New client connected from 172.30.33.5 as mqttjs_c5c88712 (p2, c1, k60, u'xxx').
1614440933: Client ebusd_21.1_770 already connected, closing old connection.
1614440933: New client connected from 10.10.25.2 as ebusd_21.1_770 (p1, c1, k60, u'xxx').
1614440933: Client blueIris already connected, closing old connection.
1614440933: New client connected from 10.10.1.14 as blueIris (p1, c1, k20, u'xxx').

But the devices those components are responsible for are unavailable in HA.
Also when im trying to connect to MQTT using an MQTT client, it doesnt connect anymore:
1614440958: Socket error on client auto-FBB4A2D6-C2EF-EC2F-C7C9-04379354796B, disconnecting.

@JoJa1101
Copy link

Same here, MQTT not working well anymore. Getting a ton of [homeassistant.components.mqtt] Timed out waiting for mid messages

@davidms12
Copy link

I have one instance running 5.1 and it's working fine. My main machine I upgraded to 5.1.1 - big mistake - all devices through mqtt are now showing as unavailable and tons of "timeouts waiting for mid..."

@JoJa1101
Copy link

Exactly: looks like this since 5.1.1

image

@G3rry71
Copy link

G3rry71 commented Feb 28, 2021

Same issue here! How do I force 5.1.0 installation from Supervisor? Thanks

@adamf663
Copy link
Author

Don't quote me as I'm guessing. I had used an old snapshot.
I think a 'docker pull homeassistant/aarch64-addon-mosquitto:5.1' might work. The only ways I know to access the docker containers is through the portainer addon, or by ssh'ing to -p 22222 root@. I think enabling 22222 also requires an addon.

@Dinth
Copy link

Dinth commented Feb 28, 2021

I think that restoring a snapshot (it is possible to restore JUST the addon) is the only option

@sir106
Copy link

sir106 commented Feb 28, 2021

after restart of home assistant and upgrading i have the same problem.

@JoJa1101
Copy link

this is srsly annoying right now! I have to switch out lights manually and close shutters now ^^

@maxlyth
Copy link

maxlyth commented Mar 1, 2021

Same symptoms here with 5.1.1 update

In the log I was getting a lot of 'Socket error on client DVES..' errors so I thought this was caused by my recent changes to inter-VLAN routing of Unfi-Wifi config.

Wasted too many hours on this before I saw this thread and reverted just the Mosquitto part of a recent snapshot which instantly fixed the problem.

@ahknight
Copy link

ahknight commented Mar 1, 2021

Exact same issue. Everything's broken with 5.1.1. Luckily had an old snapshot and I'll be trying to restore from that.

@JoJa1101
Copy link

JoJa1101 commented Mar 1, 2021

I did the same. Rolled back only Mosquitto to 5.1 and it works well again.

@alex-savin
Copy link

@JoJa1101 How did you roll back?

@blacknacoustic
Copy link

I did the same. Rolled back only Mosquitto to 5.1 and it works well again.

Hey can anyone help me roll back? How would i do that?

@bsmeding
Copy link

bsmeding commented Mar 2, 2021

Same here only rollback from old snapshot restored the mqtt function

Would be a nice feature to get a version dropdown by the addons so a rollback would be easier for everyone

@JoJa1101
Copy link

JoJa1101 commented Mar 2, 2021

@JoJa1101 How did you roll back?

I made a snapshot on my HASS Testsystem (was still running on Mosquitto 5.1) and did the partial rollback on my Prod System.

@alex-savin
Copy link

alex-savin commented Mar 2, 2021

The issue is that socat (which serves as auth point) is being continuously executed again and again!

/data # ps axu
PID   USER     TIME  COMMAND
    1 root      0:03 /sbin/docker-init -- /init /run.sh
    8 root      0:00 s6-svscan -t0 /var/run/s6/services
   36 root      0:00 foreground  if   /etc/s6/init/init-stage2-redirfd   foreground    if     if      s6-echo      -n      --      [s6-init] making user provided files available at /var/run/s6/etc...          foreground      backtick      -n      S6_RUNTIME_
   37 root      0:00 s6-supervise s6-fdholderd
   48 root      0:00 foreground  s6-setsid  -gq  --  with-contenv  backtick  -D  0  -n  S6_LOGGING   printcontenv   S6_LOGGING    importas  S6_LOGGING  S6_LOGGING  ifelse   s6-test   ${S6_LOGGING}   -eq   2     redirfd   -w   1   /var/run/s6/uncaught-logs-fi
  189 root      0:00 bash /usr/bin/bashio /run.sh
  246 root      6:14 socat TCP-LISTEN:8080,fork,reuseaddr SYSTEM:/bin/auth_srv.sh
  247 root     22:02 mosquitto -c /etc/mosquitto.conf
 5453 root      0:00 socat TCP-LISTEN:8080,fork,reuseaddr SYSTEM:/bin/auth_srv.sh
 5454 root      0:00 socat TCP-LISTEN:8080,fork,reuseaddr SYSTEM:/bin/auth_srv.sh
 5455 root      0:00 bash /usr/bin/bashio /bin/auth_srv.sh
 5459 root      0:00 ps axu
32379 root      0:00 /bin/sh

@rpitera
Copy link

rpitera commented Mar 2, 2021

Glad I found this; I was just about to install and try to migrate from CloudMQTT to a local instance. I only used CM because I needed an external server to use with an older mobile location method that I don't use anymore since the HA App came out. Guess I'll subscribe to this thread and wait it out.

@Mariusthvdb
Copy link

very much an issue here, especially since the mqtt binary_sensors keep triggering the automations on each reload (don't know how to call it but Mqtt gets re-triggered quite often now...)

do we have an add-on dev listening in on this?

@poudenes
Copy link

poudenes commented Mar 3, 2021

Have same issue... It seems that the speed of mqtt is also changed...
I get those errors as well while restart HA.
I see that my zwave2mqtt reaction time is changed with a short delay.

I don't know if this is because of the bug that's how in latest version

@davidms12
Copy link

Any Devs looking into this? thx.

@geoffrey-vl
Copy link

geoffrey-vl commented May 15, 2021

Is it possible to use an alternative mqtt broker in home assistant while this is being fixed?

See #1887 (comment)

@cometsfernbroccoli
Copy link

Is it possible to use an alternative mqtt broker in home assistant while this is being fixed?

I just installed the eclipse-mosquitto in docker, exposed the nessecary ports and its working perfectly for now. hope the add-on will be fixed asap..

for now, everything is up and running and working smoothly..

@matejzero
Copy link

I just installed the eclipse-mosquitto in docker, exposed the nessecary ports and its working perfectly for now. hope the add-on will be fixed asap..

Did the same and it's now working really fast.

In the core addon, when I initially connected with MQTT Explorer, it needed 1-2s to connect and then sometimes 5s+ to get topics. Reconnect was faster.

This most likely also resolves my issues with automations not working: https://community.home-assistant.io/t/on-motion-wait-until-illuminance-sensor-is-updated

@frenck
Copy link
Member

frenck commented May 17, 2021

Status update time 🕐

I've created quite a few versions of the add-on the past week and ended up with the simplest of them all...
(NOTE TO SELF: Nothing beats KISS 😓 )

  • MQTT ✅
  • MQTT over WebSocket ✅
  • MQTT with SSL ✅
  • MQTT over WebSocket with SSL ✅
  • Local users auth ✅
  • Home Assistant users auth ✅
  • Home Assistant integration discovery ✅
  • Supervisor add-on services ✅
  • Migrated everything to S6 overlay ✅
  • Migrated everything to bashio ✅
  • Running on latest Alpine Linux ✅
  • Can handle a metric ton of auth requests in a couple of milliseconds per request ✅
  • Duration testing ✅
  • Anonymous users 🔴 Removed

I'm happy with the result, it seems to perform better. Currently got Z-Wave JS 2 MQTT, Home Assistant, a local mosquitto_sub subscribed to all topics and MQTT explorer connected; Around 25 ESPChip on ESPHome sending all a lot of status updates each second. Additionally opened up a script that tries to send in updates as fast as possible (while disconnecting/connecting between each update).

This is how fast it is now, each line in the screenshot below is a client that publishes a time (with microseconds) as fast as possible while re-connecting between each update:

image

I'll leave this all running overnight to see how it holds and do a final round of reviewing my own code in the morning.

Morning update: Everything still alive and kicking.

@matejzero
Copy link

Do you maybe have numbers from the current mqtt addon to compare, to see if the poor results can be replicated in the old addon:)

@CubieMedia
Copy link

Can handle a metric ton of auth requests in a couple of milliseconds per request

Very well done, this is hard to test as the problem always was only the first login (of each device) that took very long (> 5 sec).

Thanks for putting you head into this!

@matejzero
Copy link

@frenck if you connect to the MQTT server, do you get all topics right away? In the problematic server, my connection took some time (between 0-2s), but then I had to wait 5s-6s for topics to apear. Is that mitigated in your new version?

@frenck
Copy link
Member

frenck commented May 18, 2021

if you connect to the MQTT server, do you get all topics right away?

Yes, however, this of course is affected by the amount of data you store, and how fast both the client and server are.

Very well done, this is hard to test as the problem always was only the first login (of each device) that took very long (> 5 sec).

True! There is a big difference in the handling now though, but will run a test without caching layers to measure the first-time every time.

@frenck
Copy link
Member

frenck commented May 18, 2021

@CubieMedia Auth/ACL caching disabled, local user:

image

Auth/ACL caching disabled, Home Assistant user account:

image

@matejzero
Copy link

Doesn't seem to do much difference, around 3ms for local users and a few more when asking HA. This is fast!

if you connect to the MQTT server, do you get all topics right away?

Yes, however, this of course is affected by the amount of data you store, and how fast both the client and server are.

I'm currently running the obsolete MQTT addon and compared to the core addon, it connects and shows all topics instantly compared to core addon where it takes a few seconds to get them. I wanted to do some mosquito_pub tests today but seeing as you got pretty far with the updated version, I'll just wait for the new release and test with that.

@CubieMedia
Copy link

With the HA user it looks like 8-10 ms and Auth/ACL disabled should create an environment like a restart and devices newly connect to the broker.

Test approved!

The 10ms are putting my hopes up that this will work soon again! Can i get any info of what you changed when the new version is released? Startup parameters from MQTT Broker or something container specific?

@frenck
Copy link
Member

frenck commented May 18, 2021

Can i get any info of what you changed when the new version is released?

There is a changelog and, of course, it is open source: Yes, you can see that.

The main difference in this case, is that part of the authentication handling is now done by the NGINX webserver. That thing is able to handle a lot of requests: fast. This replaces the bash scripts piped to socat in the current version.

@CubieMedia
Copy link

The main difference in this case, is that part of the authentication handling is now done by the NGINX webserver. That thing is able to handle a lot of requests: fast. This replaces the bash scripts piped to socat in the current version.

@frenck Thank you for that short inside.

@matejzero
Copy link

Will report results once the PR is merged and new version is available. Or is there an easy way to test it before it's merged?

@matejzero
Copy link

matejzero commented May 19, 2021

I did some quick testing and initial filing is that it works as it should now. MQTT client connects really fast and gets all it's topic at the same time.

Motion and button responses are also way faster then before, the same as with deprecated mqtt client from community addons.

There are also no reconnect issues in home assistant with MQTT.

Great work @frenck

@srnoth
Copy link

srnoth commented May 19, 2021 via email

@CubieMedia
Copy link

Yes it is working again, well done!

@Geoff571
Copy link

I'm happy to report things are looking a lot more responsive now that I've upgraded to 6.0.1, the errors in the logs and laggy switch/light responses seem to be gone.

Many thanks @frenck

@bsmeding
Copy link

Yes working perfectly and fast! Thanks @frenck !

@pbzdyl
Copy link

pbzdyl commented May 19, 2021

Thank you @frenck - great job!

@jeffgraham-hub
Copy link

Is anyone else seeing errors in the supervisor log after upgrading to 6.0.1.? If I disable the watchdog on Mosquitto the errors quit. I also rolled back to 5.1.1 and verified no errors with watchdog enabled.

21-05-21 04:59:12 ERROR (MainThread) [asyncio] Task exception was never retrieved
future: <Task finished name='Task-207448' coro=<Scheduler._run_task.<locals>._wrap_task() done, defined at /usr/src/supervisor/supervisor/misc/scheduler.py:58> exception=TypeError('an integer is required (got type str)')>
Traceback (most recent call last):
  File "/usr/src/supervisor/supervisor/misc/scheduler.py", line 62, in _wrap_task
    await task.coro_callback()
  File "/usr/src/supervisor/supervisor/misc/tasks.py", line 416, in _watchdog_addon_application
    if addon.in_progress or await addon.watchdog_application():
  File "/usr/src/supervisor/supervisor/addons/addon.py", line 479, in watchdog_application
    return await self.sys_run_in_executor(check_port, self.ip_address, port)
  File "/usr/local/lib/python3.8/concurrent/futures/thread.py", line 57, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/usr/src/supervisor/supervisor/utils/__init__.py", line 43, in check_port
    result = sock.connect_ex((str(address), port))
TypeError: an integer is required (got type str)

@jrhbcn
Copy link

jrhbcn commented May 21, 2021

@jeffgraham-hub I see the same error in the logs.

@realthk
Copy link

realthk commented May 21, 2021

Same here, though still having errors like these after reverting back to Mosquitto 5.1, but only during startup, so probably from another addon (with MQTT 6.x the supervisor log was flooded with these even during the day, not only at startup)

@bsmeding
Copy link

Yes same errors

@Mariusthvdb
Copy link

Mariusthvdb commented May 21, 2021

these errors are gone/fixed in supervisor-2021.05.1

seeing this though in the MQTT log:

1621565643: New connection from 125.64.94.138 on port 8883.
1621565643: Socket error on client <unknown>, disconnecting.
1621565643: New connection from 125.64.94.138 on port 8883.
1621565644: Socket error on client <unknown>, disconnecting.
1621565644: New connection from 125.64.94.138 on port 8883.
1621565644: Socket error on client <unknown>, disconnecting.
1621565644: New connection from 125.64.94.138 on port 8883.
1621565645: Socket error on client <unknown>, disconnecting.
1621565645: New connection from 125.64.94.138 on port 8883.

@frenck
Copy link
Member

frenck commented May 21, 2021

Please stop digging into old issues reporting unrelated new issues. Instead, just create a new issue for each individual thing so it can be tracked and looked at.

Thanks.

@jacoscar
Copy link

I upgraded to 6.0.1 and it was fine for a day but now I’m experiencing delays again

5.1 was fine though

@tylerszabo
Copy link

@frenck I've come up with a slightly less abrasive option for a similar comment. It's a bit more verbose but feel free to copy and paste it. The goal is to avoid discouraging good faith participation while still adhering to your preferred workflow. While it's not as efficient to add a buffer for the emotional state of the reader it can be considered a "social optimization" to add a bit more explanation.

"If this isn't resolved or there's a new problem please create a new issue for each individual thing so it can be tracked and looked at.

We have a workflow that involves opening a new issue instead of reviving old ones; this greatly helps with our tracking and organizing. Re-opening an issue or combining multiple subjects into a single issue adds complexity to the tracking."

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.