Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

zha lights work OK following homeassistant.restart, but stop responding within hours #124

Closed
wixoff opened this issue Jul 10, 2018 · 243 comments

Comments

@wixoff
Copy link

wixoff commented Jul 10, 2018

I posted something about this in the Community Forum, but I hope it will get better visibility here.

At the moment zha network has seven light bulbs, three door sensors, and one switched outlet. The sensors are reliable, but the lights and switch become unavailable (and unresponsive to commands) after a few hours.

SETUP

Currently the lights are all Osram/Sylvania Lightify (U.S. versions) - five are RGBW A19 and two are two are Tunable White A19. When they work, they work great - almost instant response. The switched outlet is an IRIS v2 plug (reads as CentraLite 3210-L, with the odd dual z-wave/zigbee radios).

I’m not sure what the issue is, because ZHA had been pretty darn reliable for a good number of months. (I also have a Tradfi 1000lm A19 bulb that I was able to include months ago, but I have since reset it, nuked my zigbee.db, and attempted to re-add, and it will no longer show up.) Sometimes repeating the request via the UI several times in a row will cause the bulb to wake up and respond, and eventually even that will change to no responses whatsoever.

I also have three Visonic MCT-340E door/window sensors spread around a fairly large house. Even after the bulbs (and the IRIS plug) quit responding, these sensors still work and are very reliable. The built-in temperature sensor works too, on one of them; the others never change their temperature and one of those is stuck at 32F.

ERRORS

Here’s what an error in the log looks like after the lights stop responding - this represents an attempt to turn off a light via the hass UI:

Mon Jul 09 2018 00:10:40 GMT-0400 (EDT)
Error executing service ServiceCall light.turn_off: entity_id=['light.osram_lightify_a19_rgbw_00a3342d_3']
Traceback (most recent call last):
File "/usr/src/app/homeassistant/core.py", line 1021, in _event_to_service_call
await service_handler.func(service_call)
File "/usr/src/app/homeassistant/components/light/__init__.py", line 362, in async_handle_light_service
await light.async_turn_off(**params)
File "/usr/src/app/homeassistant/components/light/zha.py", line 127, in async_turn_off
await self._endpoint.on_off.off()
File "/usr/local/lib/python3.6/site-packages/zigpy/device.py", line 89, in request
expect_reply=expect_reply,
File "/usr/local/lib/python3.6/site-packages/bellows/zigbee/application.py", line 213, in request
assert sequence not in self._pending
AssertionError

And here is another error, trying to use a scene to turn off four lights (as noted above, the Tradfri light fails because it's currently disconnected):

2018-07-10 07:11:09 WARNING (MainThread) [homeassistant.helpers.state] reproduce_state: Unable to find entity switch.ikea_of_sweden_tradfri_bulb_e26_ws_opal_980lm_fe477406_1
2018-07-10 07:11:09 ERROR (MainThread) [homeassistant.core] Error executing service ServiceCall <light.turn_off: entity_id=['light.osram_lightify_a19_tunable_white_0001d4ac_3', 'light.osram_lightify_a19_tunable_white_0001e6c8_3', 'light.osram_lightify_a19_rgbw_000a381d_3']>
Traceback (most recent call last):
File "/usr/src/app/homeassistant/core.py", line 1021, in _event_to_service_call
await service_handler.func(service_call)
File "/usr/src/app/homeassistant/components/light/__init__.py", line 362, in async_handle_light_service
await light.async_turn_off(**params)
File "/usr/src/app/homeassistant/components/light/zha.py", line 127, in async_turn_off
await self._endpoint.on_off.off()
File "/usr/local/lib/python3.6/site-packages/zigpy/device.py", line 89, in request
expect_reply=expect_reply,
File "/usr/local/lib/python3.6/site-packages/bellows/zigbee/application.py", line 213, in request
assert sequence not in self._pending
AssertionError

There are no other zha-related errors in the log, other than during startup.

I can't imagine this is correct behavior. The five RGBW lights and the IRIS plug are all within 10 feet of the HUSBZB-1 stick, most of them line-of-sight. And yet they drop off just as quickly as the ones further away. And the other two Tunable White bulbs are only another 5 feet past the zigbee outlet switch (which should be a router), but they are behind a modern-contstruction wall (wood studs, wallboard, paint).

As I mentioned in my Community post, I'll update the firmware on the bulbs, but I don't expect improvement because the lights were working pretty well a few hass releases ago.

@walthowd
Copy link

A long shot, but try modifying your bellows/zigbee/application.py and set CONFIG_PACKET_BUFFER_COUNT to 0x4b.

From what I have seen the main issue on the Si Labs EM zigbee chip is lack of memory and according to their documentation the PACKET BUFFER COUNT is the number one contributing factor to memory usage.

@wixoff
Copy link
Author

wixoff commented Jul 10, 2018

Thanks @walthowd - I will give that a try. Are you the same walt that's in the Community thread?

EDIT: application.py edited and hass restarted at 1:45 pm EDT. Let's see what happens!

@wixoff
Copy link
Author

wixoff commented Jul 10, 2018

Didn't work. Attempted to turn off one of the zigbee bulbs at 2:40 pm EDT and it failed. I don't think there was anything else queued - I just manually turned off one single bulb via the hass UI, waited 5-10 seconds, and it popped itself back on in the UI. Light remained on. Same type of error shows in the log.

I tried again about 15 seconds later, and it worked. Light off. This is typical -- bad reliability first, total loss of functionality later.

@wixoff
Copy link
Author

wixoff commented Jul 12, 2018

Some more verbose logging, as suggested by user adminiuga in the Community thread:

https://paste.drhack.net/?65a1595f54bc5328#7mPFttNdjg7ha59Zic0cuBa37Vr6+ghmHKf98Ng9LWU=

(that's a pastebin-like link, for those who might be woried)

These are excerpts, the first one being an error during setup, and the other two being errors that appear when a remote button is pressed to turn a light on. So I have excerpted sections from the log entry that begin where the remote button press is recognized and continue all the way up through the error and traceback.

@wixoff wixoff closed this as completed Jul 12, 2018
@wixoff
Copy link
Author

wixoff commented Jul 13, 2018

Oops, inadvertently closed.

@wixoff wixoff reopened this Jul 13, 2018
@wixoff
Copy link
Author

wixoff commented Jul 25, 2018

Here's the same verbose log but in a non-expiring link:

https://paste.drhack.net/?9f4a4ed3411e7b05#mhgZ6H4ofzqLlxxDW2RJijQv/8+QYafpjlvGTyXF5fM=

When my zigbee devices aren't working, all I need to do is restart hass. Something has regressed, and per the Community thread, one user had to revert to hass 0.67.1 to get it working again. That's about the time I started having trouble too.

@Uniqen
Copy link

Uniqen commented Jul 30, 2018

I've got the same issue as @wixoff. Using a bunch of tradfri and Telegesis ETRX357USB-LRS+8M (flashed with EZSP firmware).

2018-07-30` 21:25:36 ERROR (MainThread) [homeassistant.core] Error executing service <ServiceCall light.turn_on: entity_id=['light.ikea_of_sweden_tradfri_bulb_gu10_w_400lm_fed472da_1', 'light.ikea_of_sweden_tradfri_bulb_gu10_w_400lm_fed548f1_1', 'light.ikea_of_sweden_tradfri_bulb_gu10_w_400lm_fed5a563_1', 'light.ikea_of_sweden_tradfri_bulb_gu10_w_400lm_fed7f30a_1', 'light.ikea_of_sweden_tradfri_bulb_gu10_w_400lm_fed978ee_1', 'light.ikea_of_sweden_tradfri_bulb_gu10_w_400lm_fed98735_1']>
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/homeassistant/core.py", line 1021, in _event_to_service_call
    await service_handler.func(service_call)
  File "/usr/lib/python3.6/site-packages/homeassistant/components/light/__init__.py", line 360, in async_handle_light_service
    await light.async_turn_on(**params)
  File "/usr/lib/python3.6/site-packages/homeassistant/components/light/zha.py", line 108, in async_turn_on
    duration
  File "/usr/lib/python3.6/site-packages/zigpy/device.py", line 89, in request
    expect_reply=expect_reply,
  File "/usr/lib/python3.6/site-packages/bellows/zigbee/application.py", line 213, in request
    assert sequence not in self._pending
AssertionError

@techdoutdev
Copy link

I downgraded due to all the same issues as mentioned here; been running 0.67.1 (as @wixoff mentioned above) - no issues like the issues seen in the later version. I believe this is just before HA updated to bellow 0.5.2 in HA 0.68.

@rcloran
Copy link
Collaborator

rcloran commented Sep 3, 2018

Is this a dup of #78 ?

@walthowd
Copy link

walthowd commented Sep 3, 2018

@rcloran Yes, this is a dupe of 78 in zigpy/zigpy. I think we found the root issue there - asyncio throwing exceptions which leads to abandoned sequence IDs that eventfully fill self._pending. Over time every request will throw an assertion error.

zigpy/zigpy#78

@rcloran
Copy link
Collaborator

rcloran commented Sep 3, 2018

OK, let’s keep the discussion in one place.

@rcloran rcloran closed this as completed Sep 3, 2018
@rcloran rcloran reopened this Sep 3, 2018
@rcloran
Copy link
Collaborator

rcloran commented Sep 3, 2018

Oh, I guess the issue is in bellows, not zigpy.

@walthowd
Copy link

walthowd commented Sep 3, 2018

sorry, put this in the wrong place twice -- was typing with a toddler on my lap!

Tested out some exception catching this afternoon, still seeing a few abandoned/orphaned sequence IDs in pending, not sure why. Here is the modifed request function of zigbee/application.py I'm running:

        aps_frame.groupId = t.uint16_t(0)
        aps_frame.sequence = t.uint8_t(sequence)

        try:
            v = await asyncio.wait_for(self._ezsp.sendUnicast(self.direct, nwk, aps_frame, sequence, data), timeout)
        except asyncio.TimeoutError:
            LOGGER.debug("sendunicast timeout %s:%s:%s", nwk, dst_ep, cluster)
            self._pending.pop(sequence)
            send_fut.cancel()
            if expect_reply:
                reply_fut.cancel()
            raise DeliveryError("Message send failure timeout")

        # Wait for messageSentHandler message
        v = await send_fut
        if expect_reply:
            # Wait for reply
            try:
                v = await asyncio.wait_for(reply_fut, timeout)
            except:
                self._pending.pop(sequence)
                raise DeliveryError("Message reply timeout %s" % (sequence, ))
        return v

Did another tortue test of toggling remote bulbs and many other bulbs, and had a few sequence IDs get stuck again. The first in the below log is sequence 152 which starts on line 56 of the home assistant log:

https://www.dropbox.com/s/fwjq9rafsu7wl4z/stuck.txt?dl=0

By the end of the log there are two other stuck sequences in self._pending (12 and 228) but those are after the sequence IDs have wrapped around at least once.

@rcloran
Copy link
Collaborator

rcloran commented Sep 3, 2018

#132 should fix what looks like the common case of this symptom. I'm fairly unsatisfied that I can't figure out what made it regress.

@walthowd : I can't repro this other case. Let me send you a diff with what will probably be useful logging changes.

@walthowd
Copy link

walthowd commented Sep 3, 2018

@rcloran I can reproduce the issue fairly reliably with my setup. If you send me the diff I can spin up HA with a stripped down ZHA only config and post the log as well as start looking myself. Thanks.

@rcloran rcloran closed this as completed in 65783d1 Sep 3, 2018
@rcloran rcloran reopened this Sep 4, 2018
@rcloran
Copy link
Collaborator

rcloran commented Sep 4, 2018

@walthowd : Can you repro with https://github.com/rcloran/bellows-rcloran/commit/11f9aec24b2ae1a150eb1f250a3f0b7933bee8da ? I'm mostly interested in where we lose the request. It's on top of 0.7.0, which I will push a release of shortly and update hass with. I haven't tested that logging code much, so there might be bugs just in the logging itself ... should be easy fixes if you run into any problems with it.

@walthowd
Copy link

walthowd commented Sep 4, 2018

@rcloran Here is the first run through the sequence IDs. Sequence ID 43 stuck. I'll leave it running a bit more to see if there is a pattern:

https://www.dropbox.com/s/8inubip7k624keq/forty-three.txt?dl=0

@rcloran
Copy link
Collaborator

rcloran commented Sep 4, 2018

@walthowd : OK, looks like we're getting an incomingMessageHandler and no messageSentHandler at all. Thanks, this should be enough to work on. I'll try take a look at it next weekend. If anyone else wants to take a look, we should probably call _handle_frame_sent from within _handle_reply if send_fut is not done().

@StephenWetzel
Copy link

StephenWetzel commented Sep 11, 2018

EDIT: To anyone having the symptoms I describe here the issue is actually this one: #37

Not sure if I'm having the same issue but my symptoms seem very similar. Devices work fine for a few hours, then suddenly stop. When that happens my logs have a ERROR (MainThread) [homeassistant.core] Timer got out of sync. Resetting and then fill with WARNING (MainThread) [homeassistant.components.sensor] Updating zha sensor took longer than the scheduled update interval 0:00:30.

Turning on debugging shows this message when it happens DEBUG (MainThread) [bellows.uart] Error frame: b'c20251a8bd7e' and I'm just now noticing that the two recent times it's happened the frame hash was exactly the same b'c20251a8bd7e'.

Like I said I'm not sure if this is the same issue or not, but it also started a few months ago, around the time power consumption monitoring was added. I also have CentraLite 3210-L switches, and my only other zigbeee devices are Sengled bulbs. I can open another issue if you don't think it's related.

Two recent logs:
https://pastebin.com/qkNc04ih

https://pastebin.com/zxzWXwRV

@techdoutdev
Copy link

@StephenWetzel yup - these are the exact same symptoms I experienced. I'm using the same bulbs and switches as well.

damarco pushed a commit that referenced this issue Mar 10, 2019
Use timeout when awaiting for a messageSent callback. Targeted as a partial fix for #124
@valexi7
Copy link

valexi7 commented Mar 10, 2019

Darn. I ditched my Telegesis Zigbee stick and moved to CC2531.
https://www.zigbee2mqtt.io/

My problems seems to be gone now...

@Adminiuga
Copy link
Collaborator

I've been running my production system (about 65 devices) on Raspberry Pi2 for about two weeks now.
Just homeassistant with bunch of automations, but no any databases, except the recorder.
Problems are mostly encountered during HA restarts, when it hammers the disk IO and those SD cards are slow. When running, it runs fine and I have only two occasions of a non-commanded EZSP resets:

  1. once it just stopped responding to any commands send by HA, but still was sending data to HA. This was reported by others as well
  2. Once it got Error frame "Exceeded Maximum ACK timeout count" -- which also seems to be a commonality of running ZHA on Pi
    In both cases Reset EZSP stack on error frames or missing heartbeats #147 recovered successfully.

@tbrock47
Copy link

Can someone tell me how to force the ZHA polling behavior in 0.90.0 that was "fixed" in 0.90.1?

My devices were actually reporting dependably.

@scottocs11
Copy link

I'm mobile now, but I had to re pair my zigbee devices to my home assistant after all the shit I tried.

@musicatwrk
Copy link

musicatwrk commented Mar 23, 2019

@Adminiuga

I've been running my production system (about 65 devices) on Raspberry Pi2 for about two weeks now.
Just homeassistant with bunch of automations, but no any databases, except the recorder.
Problems are mostly encountered during HA restarts, when it hammers the disk IO and those SD cards are slow. When running, it runs fine and I have only two occasions of a non-commanded EZSP resets:

  1. once it just stopped responding to any commands send by HA, but still was sending data to HA. This was reported by others as well
  2. Once it got Error frame "Exceeded Maximum ACK timeout count" -- which also seems to be a commonality of running ZHA on Pi
    In both cases Reset EZSP stack on error frames or missing heartbeats #147 recovered successfully.

This is very similar to the behavior I was running on the dedicated server I built after your previous suggestion (turns out the usb pass-thru was a problem), once I transitioned to the other system, my lockups disappeared, but my lights would still stop responding correctly.

I'd receive data for however many hours the system would decide to run (usually around 10-12), then all of a sudden (around the time some of my bigger zigbee automations would occur), I'd start seeing the system attempt sending out messages into my zigbee network, but no confirmation of the message getting sent or a response.

I fought with that for about a week or 2 trying to make it work, finally gave up and put the system on my more powerful host system (i7, 16gb ram, physical HDD etc). I had originally ditched plans to put it on the host OS for service segmentation.

After installing the system onto my faster host, I've enjoyed relatively decent stability in my whole HA instance (I have other bugs to work out but they aren't relevant here). I also have a large mix of zigbee devices in case any one wonders about complexity. (7 Lightify, 12 Sengleds, 3 GE Link, 1 Commercial Electric, SmartThings Plug)

@brlodi
Copy link

brlodi commented Mar 23, 2019

@tbrock47 looking at what was changed 0.90.0 → 0.90.1 I can confidently say you were just experiencing a lucky coincidence that it was apparently working before you upgraded again. A significant part of the polling was actually completely broken due to a typo.

@rjgrandy
Copy link

rjgrandy commented Mar 23, 2019

Wow, I started with HA not too long ago and got up and running with a docker rpi HA installation with the HUSBZB. I am coming from Smart Things and wanted to port all my stuff over. I've been slowly working my way through transferring all the devices. ZHA has been a huge point of frustration for me as it seems super unstable.

My most recent update was to 0.90.1 and all my zigbee would work for a bit then everything went unavailable, not even a reboot fixed it.

I just installed Yoda-x's bellows and lo and behold, my zigbee is back to working! But it seemed to only last for a few minutes...

However based on this thread I should be upgrading my rpi to a better PS, better SD or maybe a SSD and limiting the recorder.

My log file is filling up with these 2 lines:

2019-03-23 06:35:07 INFO (MainThread) [homeassistant.components.http.view] Serving / to 192.168.1.246 (auth: False)
2019-03-23 06:35:21 WARNING (MainThread) [homeassistant.components.sensor] Updating zha sensor took longer than the scheduled update interval 0:00:30

@tbrock47
Copy link

@brlodi Maybe that's the key. If polling was broken in 0.90.0, how do I modify my ZHA polling behavior to similarly replicate what was going on in 0.90.0? I'm guessing my installation just performed better with no/infrequent polling?

Can I simply add should_poll: false to customize for those devices/domain? Having trouble finding clear documentation on how to modify polling behavior on ZHA unlike Z-Wave without having to alter component files.

@MartinHjelmare @Adminiuga @robbiet480 I'm posting in here rather than opening a new issue.

damarco pushed a commit that referenced this issue Mar 27, 2019
This is one of the fixes targeted to address #124

Reset EZSP if we receive error frames
Reset EZSP if there're missing heartbeats (EZSP nop command) from EZSP
Reset EZSP if serial connection is lost
@drjared88
Copy link

Just wanted to provide an update that since 0.90.0 I am seeing zha lock up again even using @Yoda-x branch of bellows. Not sure if something else changed on my side or not but here are logs if anyone wants to see.

https://hastebin.com/imuzekotuy.bash

@musicatwrk
Copy link

@drjared88 I'm also seeing periodic lockups again, however, I think they may be related to a bug in my Lightify bulb firmware corrupting my zigbee mesh.
https://support.smartthings.com/hc/en-us/articles/214191863-How-to-connect-SYLVANIA-Bulbs

Note: SYLVANIA bulbs have a known firmware issue that can occasionally cause them to fail as ZigBee repeaters. This can potentially cause other ZigBee devices to fail to update and/or respond.

To attempt to isolate the problem, I've slowly been moving individual bulbs off my HUSBZB-1 to a lightify bridge, partly to make sure the firmware is fully up to date, and also to test the theory of the faulty firmware. Since I started this process, my lights have slowly become more stable. I still run into the odd lockup, but I can't see any errors on the HA side that would indicate a problem there.

If I can get 2 weeks without a lockup (eventually) related restart, I'll be happy, and then slowly start to hopefully migrate bulbs back to the HUSBZB. It's tedious, but it is worth a try if you have any lightify bulbs.

@drjared88
Copy link

@musicatwrk I don't have lightify bulbs but I do have a lightify switch. That being said this was working well for quite awhile before last weeks HA upgrade.

@musicatwrk
Copy link

@drjared88 Fair enough.

@drjared88
Copy link

The current beta of home assistant uses it's own branch of bellows and zigpy now. So far I've had no lock ups though I haven't ran it that long. If anyone else wants to test it that would be awesome. Lots of other zha improvements as well in 0.91.

@Adminiuga
Copy link
Collaborator

Update: So I've been running my production system on Pi2 for about 4 weeks now. About 65 Zigbee devices + weather forecast. No InfluxDB/Prometheus/MySQL DBs, but it also runs a samba dc.

Sometime between 0.90 and 0.91 beta releases, there was a period when EZSP was resetting like every 30min-1.5 hours. Now, my recorder component was configure to keep data for 7 days and record everything, but select domains and entities. This resulted in home-assistant-v2.db growing to about 1GB in 2-3 days. I examined the recorder db and there were a lot of entities which I didn't really need and after tailoring the recorder, now DB grows about 20mb a day and results were amazing. No more resets during normal operations and console have been much snapier.

So if you are running on Raspberry Pi and using EZSP based Zigbee radios, make sure you are not running any disk IO intensive tasks/components. Running disk IO benchmark utilities resulted in excessive EZSP resets, so I don't think running bellows in a separate thread is going to solve the underlying issue. Running the system of SSD was much better, but still disk IO benchmark resulted in reset, although to far less extent than running from SD.

@tbrock47
Copy link

tbrock47 commented Apr 6, 2019

I've had zero lockups lately on everything since 90.1.
I limited my recorder to 24 hours and excluded domains I didn't care about history on.
I also turned off polling on all my Sengled bulbs.
One or both actions has made a huge impact.

I'm not sold that the SD card is the bottleneck here. I would hazard a guess that its more likely the Pi bus speed, not the card itself. SD cards can be used in 4K video recording after all.

Just want to clear up that little bit of misinformation.

@Adminiuga
Copy link
Collaborator

Could be the bus, I don't know much about Pi arch to tell exactly. What I do know that disk io is notoriously slow to my personal taste on PIs.
4K video recording is a sequential writing. I'd expect quite different performance from random read/writes. And journaling FS does not help the performance and this is also quite a difference from writing video.

Also, forgot to mention, excluding group domain from recorder, helped to reduce unnecessary disk io.

@tbrock47
Copy link

tbrock47 commented Apr 6, 2019 via email

@Adminiuga
Copy link
Collaborator

groups actually caught me off guard. I was excluding specific entities I wasn't interested in, but those were included in the groups and therefore were triggering group state changes registering in recorder.

@musicatwrk
Copy link

@Adminiuga Just noticed there's an update for 0.91.1 and that some polling code was removed from the component. Is this to address the entity unavailable bug or another bug? I couldn't tell from the commit comments.

@dmulcahey
Copy link
Contributor

91.2 will fix the lights going unavailable

@musicatwrk
Copy link

musicatwrk commented Apr 6, 2019

My HERO!
hahaha.

I just hope whatever the "unavailable" bug is, is the behavior i'm actually experiencing now!

@Adminiuga
Copy link
Collaborator

Fixed via #147

@walthowd
Copy link

@Adminiuga Congrats on closing this monster of a bug! And thanks for all the hard work and long hours you've been putting into this project!

@Adminiuga
Copy link
Collaborator

Need to have room for the new ones :D Thank you.

@tbrock47
Copy link

Yes. Thanks so much for all the work on this critical component!

@bereska
Copy link

bereska commented May 27, 2019

Great team work, mates!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests