zha lights work OK following homeassistant.restart, but stop responding within hours #124
Comments
A long shot, but try modifying your bellows/zigbee/application.py and set CONFIG_PACKET_BUFFER_COUNT to 0x4b. From what I have seen the main issue on the Si Labs EM zigbee chip is lack of memory and according to their documentation the PACKET BUFFER COUNT is the number one contributing factor to memory usage. |
Thanks @walthowd - I will give that a try. Are you the same walt that's in the Community thread? EDIT: application.py edited and hass restarted at 1:45 pm EDT. Let's see what happens! |
Didn't work. Attempted to turn off one of the zigbee bulbs at 2:40 pm EDT and it failed. I don't think there was anything else queued - I just manually turned off one single bulb via the hass UI, waited 5-10 seconds, and it popped itself back on in the UI. Light remained on. Same type of error shows in the log. I tried again about 15 seconds later, and it worked. Light off. This is typical -- bad reliability first, total loss of functionality later. |
Some more verbose logging, as suggested by user adminiuga in the Community thread: https://paste.drhack.net/?65a1595f54bc5328#7mPFttNdjg7ha59Zic0cuBa37Vr6+ghmHKf98Ng9LWU= (that's a pastebin-like link, for those who might be worried). These are excerpts, the first one being an error during setup, and the other two being errors that appear when a remote button is pressed to turn a light on. Each excerpt begins where the remote button press is recognized and continues all the way up through the error and traceback. |
Oops, inadvertently closed. |
Here's the same verbose log but in a non-expiring link: https://paste.drhack.net/?9f4a4ed3411e7b05#mhgZ6H4ofzqLlxxDW2RJijQv/8+QYafpjlvGTyXF5fM= When my zigbee devices aren't working, all I need to do is restart hass. Something has regressed, and per the Community thread, one user had to revert to hass 0.67.1 to get it working again. That's about the time I started having trouble too. |
I've got the same issue as @wixoff. Using a bunch of tradfri and Telegesis ETRX357USB-LRS+8M (flashed with EZSP firmware).
|
I downgraded due to all the same issues as mentioned here; been running 0.67.1 (as @wixoff mentioned above) - no issues like the ones seen in the later version. I believe this is just before HA updated to bellows 0.5.2 in HA 0.68. |
Is this a dup of #78 ? |
@rcloran Yes, this is a dupe of 78 in zigpy/zigpy. I think we found the root issue there - asyncio throwing exceptions, which leads to abandoned sequence IDs that eventually fill self._pending. Over time every request will throw an assertion error. |
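A minimal sketch of the failure mode described in the comment above, not the actual bellows code. The class names, the `_send` helper, and the one-byte wrap of the sequence counter are assumptions made for illustration; only `self._pending` and the assertion symptom come from the thread. It shows how an exception raised between registering a sequence ID and awaiting its reply leaves the entry behind, and how a try/finally releases it.

```python
import asyncio


class LeakyApp:
    """Hypothetical stand-in for the request bookkeeping, not bellows itself."""

    def __init__(self):
        self._pending = {}  # sequence ID -> future awaiting the reply
        self._seq = 0

    def _next_seq(self):
        # Assumption: sequence IDs fit in one byte and wrap around.
        self._seq = (self._seq + 1) % 256
        return self._seq

    async def _send(self, seq, fail):
        # Placeholder for the radio call; `fail` simulates a transport error.
        if fail:
            raise RuntimeError("send failed")
        self._pending[seq].set_result("ok")

    async def request(self, fail=False):
        seq = self._next_seq()
        # Once a stale entry exists for a reused ID, this assertion fires.
        assert seq not in self._pending
        self._pending[seq] = asyncio.get_running_loop().create_future()
        await self._send(seq, fail)  # an exception here abandons the entry
        result = await self._pending[seq]
        del self._pending[seq]
        return result


class SafeApp(LeakyApp):
    async def request(self, fail=False):
        seq = self._next_seq()
        assert seq not in self._pending
        self._pending[seq] = asyncio.get_running_loop().create_future()
        try:
            await self._send(seq, fail)
            return await self._pending[seq]
        finally:
            # Always release the sequence ID, even when the send raises.
            self._pending.pop(seq, None)


async def count_stuck(app, failures=3):
    # Issue a few failing requests and report how many entries are left over.
    for _ in range(failures):
        try:
            await app.request(fail=True)
        except RuntimeError:
            pass
    return len(app._pending)


leaky_stuck = asyncio.run(count_stuck(LeakyApp()))
safe_stuck = asyncio.run(count_stuck(SafeApp()))
print(leaky_stuck, safe_stuck)
```

With the leaky variant every failed send leaves one entry in the dictionary, so after enough failures a reused sequence ID trips the assertion on each new request.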
OK, let’s keep the discussion in one place. |
Oh, I guess the issue is in bellows, not zigpy. |
Sorry, put this in the wrong place twice -- was typing with a toddler on my lap! Tested out some exception catching this afternoon, still seeing a few abandoned/orphaned sequence IDs in pending, not sure why. Here is the modified request function of zigbee/application.py I'm running:
Did another torture test of toggling remote bulbs and many other bulbs, and had a few sequence IDs get stuck again. The first in the log below is sequence 152, which starts on line 56 of the home assistant log: https://www.dropbox.com/s/fwjq9rafsu7wl4z/stuck.txt?dl=0 By the end of the log there are two other stuck sequences in self._pending (12 and 228), but those are after the sequence IDs have wrapped around at least once. |
@rcloran I can reproduce the issue fairly reliably with my setup. If you send me the diff I can spin up HA with a stripped down ZHA only config and post the log as well as start looking myself. Thanks. |
@walthowd : Can you repro with https://github.com/rcloran/bellows-rcloran/commit/11f9aec24b2ae1a150eb1f250a3f0b7933bee8da ? I'm mostly interested in where we lose the request. It's on top of 0.7.0, which I will push a release of shortly and update hass with. I haven't tested that logging code much, so there might be bugs just in the logging itself ... should be easy fixes if you run into any problems with it. |
@rcloran Here is the first run through the sequence IDs. Sequence ID 43 stuck. I'll leave it running a bit more to see if there is a pattern: https://www.dropbox.com/s/8inubip7k624keq/forty-three.txt?dl=0 |
@walthowd : OK, looks like we're getting an incomingMessageHandler and no messageSentHandler at all. Thanks, this should be enough to work on. I'll try to take a look at it next weekend. If anyone else wants to take a look, we should probably call _handle_frame_sent from within _handle_reply if send_fut is not done(). |
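The suggestion above can be sketched roughly as follows. The names `_handle_frame_sent`, `_handle_reply` and `send_fut` come from the comment; the `Tracker` class, its method signatures, and the tuple layout of the pending table are hypothetical and only illustrate the idea, they are not the bellows implementation.

```python
import asyncio


class Tracker:
    """Hypothetical bookkeeping: one (send_fut, reply_fut) pair per sequence ID."""

    def __init__(self):
        self._pending = {}

    def start(self, seq):
        loop = asyncio.get_running_loop()
        self._pending[seq] = (loop.create_future(), loop.create_future())
        return self._pending[seq]

    def _handle_frame_sent(self, seq, success=True):
        # Normally driven by the messageSent callback.
        send_fut, _ = self._pending[seq]
        if not send_fut.done():
            send_fut.set_result(success)

    def _handle_reply(self, seq, payload):
        send_fut, reply_fut = self._pending[seq]
        if not send_fut.done():
            # A reply implies the frame went out, so resolve the send future
            # here instead of waiting for a callback that never arrives.
            self._handle_frame_sent(seq)
        del self._pending[seq]
        if not reply_fut.done():
            reply_fut.set_result(payload)


async def demo():
    tracker = Tracker()
    send_fut, reply_fut = tracker.start(43)
    # Only the reply arrives; no messageSent callback is ever delivered.
    tracker._handle_reply(43, b"reply")
    sent = await send_fut
    reply = await reply_fut
    return sent, reply, len(tracker._pending)


result = asyncio.run(demo())
print(result)
```

The point of the design is that neither future can be left unresolved once a reply has been seen, so the sequence ID is always released.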
EDIT: To anyone having the symptoms I describe here, the issue is actually this one: #37 Not sure if I'm having the same issue but my symptoms seem very similar. Devices work fine for a few hours, then suddenly stop. When that happens my logs have a Turning on debugging shows this message when it happens Like I said I'm not sure if this is the same issue or not, but it also started a few months ago, around the time power consumption monitoring was added. I also have CentraLite 3210-L switches, and my only other zigbee devices are Sengled bulbs. I can open another issue if you don't think it's related. Two recent logs: |
@StephenWetzel yup - these are the exact same symptoms I experienced. I'm using the same bulbs and switches as well. |
Use timeout when awaiting for a messageSent callback. Targeted as a partial fix for #124
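A minimal sketch of what awaiting the messageSent callback with a timeout could look like, assuming the callback resolves an asyncio future. The `wait_for_sent` helper and the `SEND_TIMEOUT` value are hypothetical and not taken from the actual fix.

```python
import asyncio

SEND_TIMEOUT = 0.05  # seconds; illustrative value, not taken from bellows


async def wait_for_sent(send_fut, timeout=SEND_TIMEOUT):
    """Return True if the callback fired in time, False if it never came."""
    try:
        return await asyncio.wait_for(send_fut, timeout)
    except asyncio.TimeoutError:
        # wait_for cancels the future on timeout, so the caller can clean up
        # the sequence ID instead of blocking forever.
        return False


async def demo():
    loop = asyncio.get_running_loop()

    confirmed = loop.create_future()
    loop.call_later(0.01, confirmed.set_result, True)  # callback arrives

    lost = loop.create_future()  # callback never arrives

    return await wait_for_sent(confirmed), await wait_for_sent(lost)


outcome = asyncio.run(demo())
print(outcome)
```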
Darn. I ditched my Telegesis Zigbee stick and moved to CC2531. My problems seem to be gone now... |
I've been running my production system (about 65 devices) on Raspberry Pi2 for about two weeks now.
|
Can someone tell me how to force the ZHA polling behavior in 0.90.0 that was "fixed" in 0.90.1? My devices were actually reporting dependably. |
I'm mobile now, but I had to re-pair my zigbee devices to my home assistant after all the shit I tried. |
This is very similar to the behavior I was running on the dedicated server I built after your previous suggestion (turns out the usb pass-thru was a problem), once I transitioned to the other system, my lockups disappeared, but my lights would still stop responding correctly. I'd receive data for however many hours the system would decide to run (usually around 10-12), then all of a sudden (around the time some of my bigger zigbee automations would occur), I'd start seeing the system attempt sending out messages into my zigbee network, but no confirmation of the message getting sent or a response. I fought with that for about a week or 2 trying to make it work, finally gave up and put the system on my more powerful host system (i7, 16gb ram, physical HDD etc). I had originally ditched plans to put it on the host OS for service segmentation. After installing the system onto my faster host, I've enjoyed relatively decent stability in my whole HA instance (I have other bugs to work out but they aren't relevant here). I also have a large mix of zigbee devices in case any one wonders about complexity. (7 Lightify, 12 Sengleds, 3 GE Link, 1 Commercial Electric, SmartThings Plug) |
@tbrock47 looking at what was changed 0.90.0 → 0.90.1 I can confidently say you were just experiencing a lucky coincidence that it was apparently working before you upgraded again. A significant part of the polling was actually completely broken due to a typo. |
Wow, I started with HA not too long ago and got up and running with a docker rpi HA installation with the HUSBZB. I am coming from Smart Things and wanted to port all my stuff over. I've been slowly working my way through transferring all the devices. ZHA has been a huge point of frustration for me as it seems super unstable. My most recent update was to 0.90.1 and all my zigbee would work for a bit then everything went unavailable, not even a reboot fixed it. I just installed Yoda-x's bellows and lo and behold, my zigbee is back to working! But it seemed to only last for a few minutes... However based on this thread I should be upgrading my rpi to a better PS, better SD or maybe a SSD and limiting the recorder. My log file is filling up with these 2 lines:
|
@brlodi Maybe that's the key. If polling was broken in 0.90.0, how do I modify my ZHA polling behavior to similarly replicate what was going on in 0.90.0? I'm guessing my installation just performed better with no/infrequent polling? Can I simply add @MartinHjelmare @Adminiuga @robbiet480 I'm posting in here rather than opening a new issue. |
This is one of the fixes targeted to address #124: reset EZSP if we receive error frames; reset EZSP if heartbeats (EZSP nop command) from EZSP are missing; reset EZSP if the serial connection is lost.
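The heartbeat part of this can be sketched as a small watchdog loop. This is only an illustration under assumptions: the `Watchdog` class, its parameters, and the `nop`/`reset` callables are hypothetical stand-ins and not the code from the fix.

```python
import asyncio


class Watchdog:
    """Hypothetical watchdog: polls a no-op command and resets after misses."""

    def __init__(self, nop, reset, interval=0.01, max_failures=3):
        self._nop = nop              # coroutine function standing in for EZSP nop
        self._reset = reset          # coroutine function that resets the radio
        self._interval = interval
        self._max_failures = max_failures
        self.resets = 0

    async def run(self, rounds):
        failures = 0
        for _ in range(rounds):
            try:
                await asyncio.wait_for(self._nop(), self._interval)
                failures = 0  # a heartbeat got through
            except asyncio.TimeoutError:
                failures += 1
            if failures >= self._max_failures:
                await self._reset()
                self.resets += 1
                failures = 0
            await asyncio.sleep(self._interval)


async def demo():
    async def hung_nop():
        await asyncio.sleep(1)  # never answers within the interval

    async def reset():
        pass  # a real reset would reopen the serial port and reinitialize

    dog = Watchdog(hung_nop, reset)
    await dog.run(rounds=6)
    return dog.resets


resets = asyncio.run(demo())
print(resets)
```

Requiring several consecutive misses before resetting is one possible choice to avoid resetting on a single slow response; the thresholds here are arbitrary.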
Just wanted to provide an update that since 0.90.0 I am seeing zha lock up again even using @Yoda-x branch of bellows. Not sure if something else changed on my side or not but here are logs if anyone wants to see. |
@drjared88 I'm also seeing periodic lockups again, however, I think they may be related to a bug in my Lightify bulb firmware corrupting my zigbee mesh.
To attempt to isolate the problem, I've slowly been moving individual bulbs off my HUSBZB-1 to a lightify bridge, partly to make sure the firmware is fully up to date, and also to test the theory of the faulty firmware. Since I started this process, my lights have slowly become more stable. I still run into the odd lockup, but I can't see any errors on the HA side that would indicate a problem there. If I can get 2 weeks without a lockup (eventually) related restart, I'll be happy, and then slowly start to hopefully migrate bulbs back to the HUSBZB. It's tedious, but it is worth a try if you have any lightify bulbs. |
@musicatwrk I don't have lightify bulbs but I do have a lightify switch. That being said this was working well for quite awhile before last weeks HA upgrade. |
@drjared88 Fair enough. |
The current beta of home assistant uses its own branch of bellows and zigpy now. So far I've had no lockups, though I haven't run it that long. If anyone else wants to test it that would be awesome. Lots of other zha improvements as well in 0.91. |
Update: So I've been running my production system on Pi2 for about 4 weeks now. About 65 Zigbee devices + weather forecast. No InfluxDB/Prometheus/MySQL DBs, but it also runs a samba dc. Sometime between 0.90 and 0.91 beta releases, there was a period when EZSP was resetting like every 30min-1.5 hours. Now, my So if you are running on Raspberry Pi and using EZSP based Zigbee radios, make sure you are not running any disk IO intensive tasks/components. Running disk IO benchmark utilities resulted in excessive EZSP resets, so I don't think running |
I've had zero lockups lately on everything since 90.1. I'm not sold that the SD card is the bottleneck here. I would hazard a guess that it's more likely the Pi bus speed, not the card itself. SD cards can be used in 4K video recording after all. Just want to clear up that little bit of misinformation. |
Could be the bus, I don't know enough about the Pi architecture to tell exactly. What I do know is that disk IO is notoriously slow, for my personal taste, on Pis. Also, forgot to mention, excluding the group domain from recorder helped to reduce unnecessary disk IO. |
Agreed. Group seemed to be a big help considering I have a group for every
room.
…On Fri, Apr 5, 2019, 8:33 PM Alexei Chetroi ***@***.***> wrote:
Could be the bus, I don't know much about Pi arch to tell exactly. What I
do know that disk io is notoriously slow to my personal taste on PIs.
4K video recording is a sequential writing. I'd expect quite different
performance from random read/writes. And journaling FS does not help the
performance and this is also quite a difference from writing video.
Also, forgot to mention, excluding group domain from recorder, helped to
reduce unnecessary disk io.
|
groups actually caught me off guard. I was excluding specific entities I wasn't interested in, but those were included in the groups and therefore were triggering group state changes registering in recorder. |
@Adminiuga Just noticed there's an update for 0.91.1 and that some polling code was removed from the component. Is this to address the entity unavailable bug or another bug? I couldn't tell from the commit comments. |
91.2 will fix the lights going unavailable |
My HERO! I just hope whatever the "unavailable" bug is, is the behavior I'm actually experiencing now! |
Fixed via #147 |
@Adminiuga Congrats on closing this monster of a bug! And thanks for all the hard work and long hours you've been putting into this project! |
Need to have room for the new ones :D Thank you. |
Yes. Thanks so much for all the work on this critical component! |
Great team work, mates! |
I posted something about this in the Community Forum, but I hope it will get better visibility here.
At the moment zha network has seven light bulbs, three door sensors, and one switched outlet. The sensors are reliable, but the lights and switch become unavailable (and unresponsive to commands) after a few hours.
SETUP
Currently the lights are all Osram/Sylvania Lightify (U.S. versions) - five are RGBW A19 and two are Tunable White A19. When they work, they work great - almost instant response. The switched outlet is an IRIS v2 plug (reads as CentraLite 3210-L, with the odd dual z-wave/zigbee radios).
I’m not sure what the issue is, because ZHA had been pretty darn reliable for a good number of months. (I also have a Tradfri 1000lm A19 bulb that I was able to include months ago, but I have since reset it, nuked my zigbee.db, and attempted to re-add, and it will no longer show up.) Sometimes repeating the request via the UI several times in a row will cause the bulb to wake up and respond, and eventually even that will change to no responses whatsoever.
I also have three Visonic MCT-340E door/window sensors spread around a fairly large house. Even after the bulbs (and the IRIS plug) quit responding, these sensors still work and are very reliable. The built-in temperature sensor works too, on one of them; the others never change their temperature and one of those is stuck at 32F.
ERRORS
Here’s what an error in the log looks like after the lights stop responding - this represents an attempt to turn off a light via the hass UI:
And here is another error, trying to use a scene to turn off four lights (as noted above, the Tradfri light fails because it's currently disconnected):
There are no other zha-related errors in the log, other than during startup.
I can't imagine this is correct behavior. The five RGBW lights and the IRIS plug are all within 10 feet of the HUSBZB-1 stick, most of them line-of-sight. And yet they drop off just as quickly as the ones further away. The other two Tunable White bulbs are only another 5 feet past the zigbee outlet switch (which should be a router), but they are behind a modern-construction wall (wood studs, wallboard, paint).
As I mentioned in my Community post, I'll update the firmware on the bulbs, but I don't expect improvement because the lights were working pretty well a few hass releases ago.