-
-
Notifications
You must be signed in to change notification settings - Fork 31.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
lutron_caseta BRIDGE_TIMEOUT too short #110067
lutron_caseta BRIDGE_TIMEOUT too short #110067
Comments
Hey there @swails, @bdraco, @danaues, mind taking a look at this issue as it has been labeled with an integration ( Code owner commandsCode owners of
(message by CodeOwnersMention) lutron_caseta documentation |
I think this makes the most sense. I added the timeout in #45769 because it would block startup if the bridge was offline. Its actually quite a bit higher than we would like as HA will throw a warning if it takes more than 10s to startup the integration. We probably only need to wait ~5-6s (maybe 8-9 even) for the connection to establish, and than it would be fine to wait ~50s for the communication (I hope it wouldn't ever take longer than that) as I believe there is another warning that kicks in at 60s. That's going to require a change to |
There is a chance that the cause was the event loop being blocked if the system was I/O or CPU bound loading python code. We load more code in the executor in 2024.3.x so there is a chance it won't be needed to increase the timeout anymore. |
@bdraco Sounds good. I will try with 2024.3 when it's out. |
@bdraco FWIW, 2024.3 unfortunately does not resolve the issue. |
We found a lot more places where integrations were doing blocking I/O in the event loop in the last few days. 2024.4 will fix a lot more of those places so if its an event loop being blocked problem, it will probably be solved. If the device actually does take that long to give back the data, it will need a code change. |
I'm assuming the latter, as there are loads of LEAP calls being made to parse the whole QSX home. There's not much else going on in my HA and it's on pretty beefy hardware. Just pulling a quick debug log while "reloading" the integration, it looks like the initialization takes about 42 seconds and is making ~800 LEAP calls to the QSX processor. Maybe bumping the timeout to 60 is naive, but might make this go away? 😆 |
Another thing is that the pylutroncaseta library uses the built-in, json parser, which is very very slow. We could switch it to use orjson like everything in Home Assistant does. But I'm speculating on what the problem is. |
Oh I really like that idea. I'll try that and see what impact it has. |
Eh, looks like little to no impact. I suppose I shouldn't be surprised as it's far from a CPU-bound task with all that network activity. |
I haven't verified this but it looks like the design of the library currently isn't very asynchronous, and it could probably be refactored to use its own asyncio protocol instead of using streams and be so much faster |
The streams do add a lot of overhead but I don't think thats the problem |
The whole login process has quite a sync design. I think we could send all the requests and wait for them to come back instead |
gurumitts/pylutron-caseta#163 might solve the issue but I don't have any of the newer Lutron stuff. It works great with my Caseta and RA2 select systems |
All the tests are likely failing on that PR as they assume everything comes in synchronously |
There hasn't been any activity on this issue recently. Due to the high number of incoming GitHub notifications, we have to clean some of the old issues, as many of them have already been resolved with the latest updates. |
Still an active issue |
Hey @jonoberheide, do you have a temporary solution for this? Every time I restart HA it fails to re-initialize to QSX. Note: Adjusting bridge timeout did not resolve this for me |
@averj I overrode the default BRIDGE_TIMEOUT via custom_components. It's not pretty, but it works. Do you think your issue is timeout related or not? What are you seeing in the logs? Do you have a lot of Lutron zones? |
@jonoberheide I meant to edit my comment, I didn't include a version in the manifest.json so it was failing to load the "custom integration". Once I did that it was able to initialize. I opened issue #126001 and then came across yours shortly after, I think it is similar if not the same as what you are experiencing. Thank you for discovering this workaround, saved me a lot of time. |
@averj Yeah, looks like it may be the same. If you look at your debug log, how much time does it take and how many LEAP calls? Mine was like 40+ seconds and 800+ LEAP calls. @bdraco had some good ideas on a more elegant fix of separating the timeout logic, but it required more plumbing both in pylutron-caseta and HA than I'm able to handle. |
@jonoberheide It's taking a little over 50 seconds and sent about 1300 LEAP calls. There's roughly 350 devices in this system. @bdraco removed himself as a code owner for this integration a couple of weeks ago. If anyone else has the capacity to work on this I'm happy to help with testing/validation. |
I'm having this issue as well with a Lutron Homeworks QSX install. I was able to use the workaround:
I'm not looking forward to maintaining a custom version this component, but the show must go on! I'm motivated to provide assistance to anyone who might be able to help resolve this timeout issue. |
Now that I have a HA dev environment and a little experience, I'll take a swing at this. |
Happy to do a review and help get something merged. However I don't have the capacity to do support for this integration at this time. |
Added a pull request (gurumitts/pylutron-caseta#176) to the upstream library that will do a connectivity test to the bridge separate from the long/full load that happens during the Smartbridge.connect() function. If this approach is accepted upstream, then I'll add a modification to HA-land to test connectivity with this new function and then also increase the BRIDGE_TIMEOUT to 59 seconds. |
Also added a PR for this gurumitts/pylutron-caseta#175 |
It might look something like this:
|
The problem
Hi! I have a large Lutron QSX system. The default BRIDGE_TIMEOUT of 35 seconds in components/lutron_caseta/const.py is not sufficient to perform all the required LEAP queries to connect to the bridge and parse all the response data.
A couple ideas for resolution:
(1) Increase the timeout.
(2) Make the timeout configurable during integration setup, so that it's easier to tweak for large systems.
(3) Separate the connect timeout from a retrieval/parsing timeout to help debug initial connection attempt quickly (network-level connectivity, proper Lutron creds, etc) while providing a more generous timeout for the second part.
Naively editing and increasing the const.py timeout on the filesystem allowed the integration to load up successfully!
Also, a PR for API paging support in the upstream pylutron_caseta package is also needed to integrate: gurumitts/pylutron-caseta#161
What version of Home Assistant Core has the issue?
2024.2.0
What was the last working version of Home Assistant Core?
No response
What type of installation are you running?
Home Assistant OS
Integration causing the issue
lutron_caseta
Link to integration documentation on our website
https://www.home-assistant.io/integrations/lutron_caseta/
Diagnostics information
No response
Example YAML snippet
No response
Anything in the logs that might be useful for us?
No response
Additional information
No response
The text was updated successfully, but these errors were encountered: