lutron_caseta BRIDGE_TIMEOUT too short #110067

jonoberheide · 2024-02-09T01:41:01Z

The problem

Hi! I have a large Lutron QSX system. The default BRIDGE_TIMEOUT of 35 seconds in components/lutron_caseta/const.py is not sufficient to perform all the required LEAP queries to connect to the bridge and parse all the response data.

A couple ideas for resolution:

(1) Increase the timeout.

(2) Make the timeout configurable during integration setup, so that it's easier to tweak for large systems.

(3) Separate the connect timeout from a retrieval/parsing timeout to help debug initial connection attempt quickly (network-level connectivity, proper Lutron creds, etc) while providing a more generous timeout for the second part.

Naively editing and increasing the const.py timeout on the filesystem allowed the integration to load up successfully!

Also, a PR for API paging support in the upstream pylutron_caseta package is also needed to integrate: gurumitts/pylutron-caseta#161

What version of Home Assistant Core has the issue?

2024.2.0

What was the last working version of Home Assistant Core?

No response

What type of installation are you running?

Home Assistant OS

Integration causing the issue

lutron_caseta

Link to integration documentation on our website

https://www.home-assistant.io/integrations/lutron_caseta/

Diagnostics information

No response

Example YAML snippet

No response

Anything in the logs that might be useful for us?

No response

Additional information

No response

home-assistant · 2024-02-09T01:41:06Z

Hey there @swails, @bdraco, @danaues, mind taking a look at this issue as it has been labeled with an integration (lutron_caseta) you are listed as a code owner for? Thanks!

Code owner commands

Code owners of lutron_caseta can trigger bot actions by commenting:

@home-assistant close Closes the issue.
@home-assistant rename Awesome new title Renames the issue.
@home-assistant reopen Reopen the issue.
@home-assistant unassign lutron_caseta Removes the current integration label and assignees on the issue, add the integration domain after the command.
@home-assistant add-label needs-more-information Add a label (needs-more-information, problem in dependency, problem in custom component) to the issue.
@home-assistant remove-label needs-more-information Remove a label (needs-more-information, problem in dependency, problem in custom component) on the issue.

_{^{(message by CodeOwnersMention)}}

lutron_caseta documentation
lutron_caseta source
_{^{(message by IssueLinks)}}

bdraco · 2024-02-09T02:39:16Z

(3) Separate the connect timeout from a retrieval/parsing timeout to help debug initial connection attempt quickly (network-level connectivity, proper Lutron creds, etc) while providing a more generous timeout for the second part.

I think this makes the most sense. I added the timeout in #45769 because it would block startup if the bridge was offline. Its actually quite a bit higher than we would like as HA will throw a warning if it takes more than 10s to startup the integration.

We probably only need to wait ~5-6s (maybe 8-9 even) for the connection to establish, and than it would be fine to wait ~50s for the communication (I hope it wouldn't ever take longer than that) as I believe there is another warning that kicks in at 60s.

That's going to require a change to pylutron_caseta as well. Consider this comment a 👍 for accepting a PR in HA to that effect.

bdraco · 2024-02-28T07:08:26Z

There is a chance that the cause was the event loop being blocked if the system was I/O or CPU bound loading python code. We load more code in the executor in 2024.3.x so there is a chance it won't be needed to increase the timeout anymore.

jonoberheide · 2024-02-28T12:38:40Z

@bdraco Sounds good. I will try with 2024.3 when it's out.

jonoberheide · 2024-03-07T00:41:31Z

@bdraco FWIW, 2024.3 unfortunately does not resolve the issue.

bdraco · 2024-03-07T02:56:20Z

We found a lot more places where integrations were doing blocking I/O in the event loop in the last few days. 2024.4 will fix a lot more of those places so if its an event loop being blocked problem, it will probably be solved. If the device actually does take that long to give back the data, it will need a code change.

jonoberheide · 2024-03-07T03:10:15Z

I'm assuming the latter, as there are loads of LEAP calls being made to parse the whole QSX home. There's not much else going on in my HA and it's on pretty beefy hardware.

Just pulling a quick debug log while "reloading" the integration, it looks like the initialization takes about 42 seconds and is making ~800 LEAP calls to the QSX processor.

Maybe bumping the timeout to 60 is naive, but might make this go away? 😆

bdraco · 2024-03-07T03:27:28Z

Another thing is that the pylutroncaseta library uses the built-in, json parser, which is very very slow. We could switch it to use orjson like everything in Home Assistant does. But I'm speculating on what the problem is.

jonoberheide · 2024-03-07T14:29:35Z

Oh I really like that idea. I'll try that and see what impact it has.

jonoberheide · 2024-03-07T18:53:29Z

Eh, looks like little to no impact. I suppose I shouldn't be surprised as it's far from a CPU-bound task with all that network activity.

bdraco · 2024-03-07T19:25:11Z

I haven't verified this but it looks like the design of the library currently isn't very asynchronous, and it could probably be refactored to use its own asyncio protocol instead of using streams and be so much faster

bdraco · 2024-03-07T19:30:24Z

The streams do add a lot of overhead but I don't think thats the problem

bdraco · 2024-03-07T19:32:13Z

The whole login process has quite a sync design. I think we could send all the requests and wait for them to come back instead

bdraco · 2024-03-07T20:07:25Z

gurumitts/pylutron-caseta#163 might solve the issue but I don't have any of the newer Lutron stuff. It works great with my Caseta and RA2 select systems

bdraco · 2024-03-07T20:08:12Z

All the tests are likely failing on that PR as they assume everything comes in synchronously

issue-triage-workflows · 2024-06-05T21:06:33Z

There hasn't been any activity on this issue recently. Due to the high number of incoming GitHub notifications, we have to clean some of the old issues, as many of them have already been resolved with the latest updates.
Please make sure to update to the latest Home Assistant version and check if that solves the issue. Let us know if that works for you by adding a comment 👍
This issue has now been marked as stale and will be closed if no further activity occurs. Thank you for your contributions.

jonoberheide · 2024-06-05T21:43:45Z

Still an active issue

averj · 2024-09-24T10:31:19Z

Hey @jonoberheide, do you have a temporary solution for this? Every time I restart HA it fails to re-initialize to QSX.

Note: Adjusting bridge timeout did not resolve this for me

jonoberheide · 2024-09-24T13:58:48Z

@averj I overrode the default BRIDGE_TIMEOUT via custom_components. It's not pretty, but it works.

Do you think your issue is timeout related or not? What are you seeing in the logs? Do you have a lot of Lutron zones?

averj · 2024-09-24T14:54:46Z

@jonoberheide I meant to edit my comment, I didn't include a version in the manifest.json so it was failing to load the "custom integration". Once I did that it was able to initialize. I opened issue #126001 and then came across yours shortly after, I think it is similar if not the same as what you are experiencing. Thank you for discovering this workaround, saved me a lot of time.

jonoberheide · 2024-09-24T15:02:52Z

@averj Yeah, looks like it may be the same. If you look at your debug log, how much time does it take and how many LEAP calls? Mine was like 40+ seconds and 800+ LEAP calls.

@bdraco had some good ideas on a more elegant fix of separating the timeout logic, but it required more plumbing both in pylutron-caseta and HA than I'm able to handle.

averj · 2024-09-24T16:00:10Z

@jonoberheide It's taking a little over 50 seconds and sent about 1300 LEAP calls. There's roughly 350 devices in this system. @bdraco removed himself as a code owner for this integration a couple of weeks ago. If anyone else has the capacity to work on this I'm happy to help with testing/validation.

beausmith · 2024-11-13T07:51:12Z

I'm having this issue as well with a Lutron Homeworks QSX install. I was able to use the workaround:

Copy the lutron_caseta component to config/custom_components/lutron_caseta
Update the BRIDGE_TIMEOUT = 35 to BRIDGE_TIMEOUT = 35 in config/custom_components/lutron_caseta/const.py
Restart HA.
Install the custom component integration as you would the official integration. (Note that it will appear as a sibling to the official component in the list, but marked with a custom component icon.)
Then following the rest of the steps to set up.

I'm not looking forward to maintaining a custom version this component, but the show must go on!

I'm motivated to provide assistance to anyone who might be able to help resolve this timeout issue.

jonoberheide · 2024-11-13T12:58:39Z

Now that I have a HA dev environment and a little experience, I'll take a swing at this.

bdraco · 2024-11-13T13:18:31Z

Happy to do a review and help get something merged. However I don't have the capacity to do support for this integration at this time.

jonoberheide · 2024-11-14T03:39:51Z

Added a pull request (gurumitts/pylutron-caseta#176) to the upstream library that will do a connectivity test to the bridge separate from the long/full load that happens during the Smartbridge.connect() function. If this approach is accepted upstream, then I'll add a modification to HA-land to test connectivity with this new function and then also increase the BRIDGE_TIMEOUT to 59 seconds.

jonoberheide · 2024-11-14T14:56:52Z

Another thing is that the pylutroncaseta library uses the built-in, json parser, which is very very slow. We could switch it to use orjson like everything in Home Assistant does. But I'm speculating on what the problem is.

Also added a PR for this gurumitts/pylutron-caseta#175

jonoberheide · 2024-11-14T15:23:00Z

It might look something like this:

diff --git a/homeassistant/components/lutron_caseta/__init__.py b/homeassistant/components/lutron_caseta/__init__.py
index 26fc5ba153..664c277897 100644
--- a/homeassistant/components/lutron_caseta/__init__.py
+++ b/homeassistant/components/lutron_caseta/__init__.py
@@ -168,6 +168,10 @@ async def async_setup_entry(
         _LOGGER.error("Invalid certificate used to connect to bridge at %s", host)
         return False
 
+    # Test out our connection and credentials first before loading lutron devices
+    if not await bridge.test_connection():
+        raise ConfigEntryNotReady(f"Timed out while trying to connect to bridge at {host}")
+
     timed_out = True
     with contextlib.suppress(TimeoutError):
         async with asyncio.timeout(BRIDGE_TIMEOUT):
@@ -177,7 +181,7 @@ async def async_setup_entry(
     if timed_out or not bridge.is_connected():
         await bridge.close()
         if timed_out:
-            raise ConfigEntryNotReady(f"Timed out while trying to connect to {host}")
+            raise ConfigEntryNotReady(f"Timed out while trying to load bridge devices at {host}")
         if not bridge.is_connected():
             raise ConfigEntryNotReady(f"Cannot connect to {host}")
 
diff --git a/homeassistant/components/lutron_caseta/config_flow.py b/homeassistant/components/lutron_caseta/config_flow.py
index cd566b767f..fe7e7259a3 100644
--- a/homeassistant/components/lutron_caseta/config_flow.py
+++ b/homeassistant/components/lutron_caseta/config_flow.py
@@ -231,12 +231,20 @@ class LutronCasetaFlowHandler(ConfigFlow, domain=DOMAIN):
             )
             return None
 
+        # Test out our connection and credentials first before loading lutron devices
+        if not await bridge.test_connection():
+            _LOGGER.error(
+                "Timeout while trying to connect to bridge at %s",
+                self.data[CONF_HOST],
+            )
+            return None
+
         try:
             async with asyncio.timeout(BRIDGE_TIMEOUT):
                 await bridge.connect()
         except TimeoutError:
             _LOGGER.error(
-                "Timeout while trying to connect to bridge at %s",
+                "Timeout while trying to load bridge devices at %s",
                 self.data[CONF_HOST],
             )
         else:
diff --git a/homeassistant/components/lutron_caseta/const.py b/homeassistant/components/lutron_caseta/const.py
index 7493878bec..dfd0b5270c 100644
--- a/homeassistant/components/lutron_caseta/const.py
+++ b/homeassistant/components/lutron_caseta/const.py
@@ -33,7 +33,7 @@ ACTION_RELEASE = "release"
 
 CONF_SUBTYPE = "subtype"
 
-BRIDGE_TIMEOUT = 35
+BRIDGE_TIMEOUT = 59
 
 UNASSIGNED_AREA = "Unassigned"

home-assistant bot added the integration: lutron_caseta label Feb 9, 2024

home-assistant bot assigned bdraco Feb 9, 2024

mdonoughe mentioned this issue Feb 22, 2024

Update pylutron-caseta to 0.20.0 #111169

Merged

20 tasks

bdraco mentioned this issue Mar 7, 2024

Refactor login process to be more async gurumitts/pylutron-caseta#163

Draft

jonoberheide mentioned this issue Apr 17, 2024

Increase BRIDGE_TIMEOUT for large Lutron QSX deployments #115725

Closed

20 tasks

issue-triage-workflows bot added the stale label Jun 5, 2024

issue-triage-workflows bot removed the stale label Jun 5, 2024

bdraco added the no-stale label Jun 6, 2024

averj mentioned this issue Sep 30, 2024

Lutron Caseta failing to initialize with QSX System #126001

Closed

jonoberheide linked a pull request Nov 14, 2024 that will close this issue

add a helper function test bridge connectivity gurumitts/pylutron-caseta#176

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

lutron_caseta BRIDGE_TIMEOUT too short #110067

lutron_caseta BRIDGE_TIMEOUT too short #110067

jonoberheide commented Feb 9, 2024

home-assistant bot commented Feb 9, 2024

bdraco commented Feb 9, 2024

bdraco commented Feb 28, 2024

jonoberheide commented Feb 28, 2024

jonoberheide commented Mar 7, 2024

bdraco commented Mar 7, 2024 •

edited

Loading

jonoberheide commented Mar 7, 2024

bdraco commented Mar 7, 2024

jonoberheide commented Mar 7, 2024

jonoberheide commented Mar 7, 2024

bdraco commented Mar 7, 2024

bdraco commented Mar 7, 2024

bdraco commented Mar 7, 2024

bdraco commented Mar 7, 2024

bdraco commented Mar 7, 2024

issue-triage-workflows bot commented Jun 5, 2024

jonoberheide commented Jun 5, 2024

averj commented Sep 24, 2024 •

edited

Loading

jonoberheide commented Sep 24, 2024

averj commented Sep 24, 2024

jonoberheide commented Sep 24, 2024

averj commented Sep 24, 2024 •

edited

Loading

beausmith commented Nov 13, 2024 •

edited

Loading

jonoberheide commented Nov 13, 2024

bdraco commented Nov 13, 2024

jonoberheide commented Nov 14, 2024

jonoberheide commented Nov 14, 2024

jonoberheide commented Nov 14, 2024

lutron_caseta BRIDGE_TIMEOUT too short #110067

lutron_caseta BRIDGE_TIMEOUT too short #110067

Comments

jonoberheide commented Feb 9, 2024

The problem

What version of Home Assistant Core has the issue?

What was the last working version of Home Assistant Core?

What type of installation are you running?

Integration causing the issue

Link to integration documentation on our website

Diagnostics information

Example YAML snippet

Anything in the logs that might be useful for us?

Additional information

home-assistant bot commented Feb 9, 2024

bdraco commented Feb 9, 2024

bdraco commented Feb 28, 2024

jonoberheide commented Feb 28, 2024

jonoberheide commented Mar 7, 2024

bdraco commented Mar 7, 2024 • edited Loading

jonoberheide commented Mar 7, 2024

bdraco commented Mar 7, 2024

jonoberheide commented Mar 7, 2024

jonoberheide commented Mar 7, 2024

bdraco commented Mar 7, 2024

bdraco commented Mar 7, 2024

bdraco commented Mar 7, 2024

bdraco commented Mar 7, 2024

bdraco commented Mar 7, 2024

issue-triage-workflows bot commented Jun 5, 2024

jonoberheide commented Jun 5, 2024

averj commented Sep 24, 2024 • edited Loading

jonoberheide commented Sep 24, 2024

averj commented Sep 24, 2024

jonoberheide commented Sep 24, 2024

averj commented Sep 24, 2024 • edited Loading

beausmith commented Nov 13, 2024 • edited Loading

jonoberheide commented Nov 13, 2024

bdraco commented Nov 13, 2024

jonoberheide commented Nov 14, 2024

jonoberheide commented Nov 14, 2024

jonoberheide commented Nov 14, 2024

bdraco commented Mar 7, 2024 •

edited

Loading

averj commented Sep 24, 2024 •

edited

Loading

averj commented Sep 24, 2024 •

edited

Loading

beausmith commented Nov 13, 2024 •

edited

Loading