Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

lutron_caseta BRIDGE_TIMEOUT too short #110067

Open
jonoberheide opened this issue Feb 9, 2024 · 28 comments · May be fixed by gurumitts/pylutron-caseta#176
Open

lutron_caseta BRIDGE_TIMEOUT too short #110067

jonoberheide opened this issue Feb 9, 2024 · 28 comments · May be fixed by gurumitts/pylutron-caseta#176

Comments

@jonoberheide
Copy link
Contributor

The problem

Hi! I have a large Lutron QSX system. The default BRIDGE_TIMEOUT of 35 seconds in components/lutron_caseta/const.py is not sufficient to perform all the required LEAP queries to connect to the bridge and parse all the response data.

A couple ideas for resolution:

(1) Increase the timeout.

(2) Make the timeout configurable during integration setup, so that it's easier to tweak for large systems.

(3) Separate the connect timeout from a retrieval/parsing timeout to help debug initial connection attempt quickly (network-level connectivity, proper Lutron creds, etc) while providing a more generous timeout for the second part.

Naively editing and increasing the const.py timeout on the filesystem allowed the integration to load up successfully!

Also, a PR for API paging support in the upstream pylutron_caseta package is also needed to integrate: gurumitts/pylutron-caseta#161

What version of Home Assistant Core has the issue?

2024.2.0

What was the last working version of Home Assistant Core?

No response

What type of installation are you running?

Home Assistant OS

Integration causing the issue

lutron_caseta

Link to integration documentation on our website

https://www.home-assistant.io/integrations/lutron_caseta/

Diagnostics information

No response

Example YAML snippet

No response

Anything in the logs that might be useful for us?

No response

Additional information

No response

@home-assistant
Copy link

home-assistant bot commented Feb 9, 2024

Hey there @swails, @bdraco, @danaues, mind taking a look at this issue as it has been labeled with an integration (lutron_caseta) you are listed as a code owner for? Thanks!

Code owner commands

Code owners of lutron_caseta can trigger bot actions by commenting:

  • @home-assistant close Closes the issue.
  • @home-assistant rename Awesome new title Renames the issue.
  • @home-assistant reopen Reopen the issue.
  • @home-assistant unassign lutron_caseta Removes the current integration label and assignees on the issue, add the integration domain after the command.
  • @home-assistant add-label needs-more-information Add a label (needs-more-information, problem in dependency, problem in custom component) to the issue.
  • @home-assistant remove-label needs-more-information Remove a label (needs-more-information, problem in dependency, problem in custom component) on the issue.

(message by CodeOwnersMention)


lutron_caseta documentation
lutron_caseta source
(message by IssueLinks)

@bdraco
Copy link
Member

bdraco commented Feb 9, 2024

(3) Separate the connect timeout from a retrieval/parsing timeout to help debug initial connection attempt quickly (network-level connectivity, proper Lutron creds, etc) while providing a more generous timeout for the second part.

I think this makes the most sense. I added the timeout in #45769 because it would block startup if the bridge was offline. Its actually quite a bit higher than we would like as HA will throw a warning if it takes more than 10s to startup the integration.

We probably only need to wait ~5-6s (maybe 8-9 even) for the connection to establish, and than it would be fine to wait ~50s for the communication (I hope it wouldn't ever take longer than that) as I believe there is another warning that kicks in at 60s.

That's going to require a change to pylutron_caseta as well. Consider this comment a 👍 for accepting a PR in HA to that effect.

@bdraco
Copy link
Member

bdraco commented Feb 28, 2024

There is a chance that the cause was the event loop being blocked if the system was I/O or CPU bound loading python code. We load more code in the executor in 2024.3.x so there is a chance it won't be needed to increase the timeout anymore.

@jonoberheide
Copy link
Contributor Author

@bdraco Sounds good. I will try with 2024.3 when it's out.

@jonoberheide
Copy link
Contributor Author

@bdraco FWIW, 2024.3 unfortunately does not resolve the issue.

@bdraco
Copy link
Member

bdraco commented Mar 7, 2024

We found a lot more places where integrations were doing blocking I/O in the event loop in the last few days. 2024.4 will fix a lot more of those places so if its an event loop being blocked problem, it will probably be solved. If the device actually does take that long to give back the data, it will need a code change.

@jonoberheide
Copy link
Contributor Author

I'm assuming the latter, as there are loads of LEAP calls being made to parse the whole QSX home. There's not much else going on in my HA and it's on pretty beefy hardware.

Just pulling a quick debug log while "reloading" the integration, it looks like the initialization takes about 42 seconds and is making ~800 LEAP calls to the QSX processor.

Maybe bumping the timeout to 60 is naive, but might make this go away? 😆

@bdraco
Copy link
Member

bdraco commented Mar 7, 2024

Another thing is that the pylutroncaseta library uses the built-in, json parser, which is very very slow. We could switch it to use orjson like everything in Home Assistant does. But I'm speculating on what the problem is.

@jonoberheide
Copy link
Contributor Author

Oh I really like that idea. I'll try that and see what impact it has.

@jonoberheide
Copy link
Contributor Author

Eh, looks like little to no impact. I suppose I shouldn't be surprised as it's far from a CPU-bound task with all that network activity.

@bdraco
Copy link
Member

bdraco commented Mar 7, 2024

I haven't verified this but it looks like the design of the library currently isn't very asynchronous, and it could probably be refactored to use its own asyncio protocol instead of using streams and be so much faster

@bdraco
Copy link
Member

bdraco commented Mar 7, 2024

The streams do add a lot of overhead but I don't think thats the problem

@bdraco
Copy link
Member

bdraco commented Mar 7, 2024

The whole login process has quite a sync design. I think we could send all the requests and wait for them to come back instead

@bdraco
Copy link
Member

bdraco commented Mar 7, 2024

gurumitts/pylutron-caseta#163 might solve the issue but I don't have any of the newer Lutron stuff. It works great with my Caseta and RA2 select systems

@bdraco
Copy link
Member

bdraco commented Mar 7, 2024

All the tests are likely failing on that PR as they assume everything comes in synchronously

@issue-triage-workflows
Copy link

There hasn't been any activity on this issue recently. Due to the high number of incoming GitHub notifications, we have to clean some of the old issues, as many of them have already been resolved with the latest updates.
Please make sure to update to the latest Home Assistant version and check if that solves the issue. Let us know if that works for you by adding a comment 👍
This issue has now been marked as stale and will be closed if no further activity occurs. Thank you for your contributions.

@jonoberheide
Copy link
Contributor Author

Still an active issue

@averj
Copy link

averj commented Sep 24, 2024

Hey @jonoberheide, do you have a temporary solution for this? Every time I restart HA it fails to re-initialize to QSX.

Note: Adjusting bridge timeout did not resolve this for me

@jonoberheide
Copy link
Contributor Author

@averj I overrode the default BRIDGE_TIMEOUT via custom_components. It's not pretty, but it works.

Do you think your issue is timeout related or not? What are you seeing in the logs? Do you have a lot of Lutron zones?

@averj
Copy link

averj commented Sep 24, 2024

@jonoberheide I meant to edit my comment, I didn't include a version in the manifest.json so it was failing to load the "custom integration". Once I did that it was able to initialize. I opened issue #126001 and then came across yours shortly after, I think it is similar if not the same as what you are experiencing. Thank you for discovering this workaround, saved me a lot of time.

@jonoberheide
Copy link
Contributor Author

@averj Yeah, looks like it may be the same. If you look at your debug log, how much time does it take and how many LEAP calls? Mine was like 40+ seconds and 800+ LEAP calls.

@bdraco had some good ideas on a more elegant fix of separating the timeout logic, but it required more plumbing both in pylutron-caseta and HA than I'm able to handle.

@averj
Copy link

averj commented Sep 24, 2024

@jonoberheide It's taking a little over 50 seconds and sent about 1300 LEAP calls. There's roughly 350 devices in this system. @bdraco removed himself as a code owner for this integration a couple of weeks ago. If anyone else has the capacity to work on this I'm happy to help with testing/validation.

@beausmith
Copy link

beausmith commented Nov 13, 2024

I'm having this issue as well with a Lutron Homeworks QSX install. I was able to use the workaround:

  1. Copy the lutron_caseta component to config/custom_components/lutron_caseta
  2. Update the BRIDGE_TIMEOUT = 35 to BRIDGE_TIMEOUT = 35 in config/custom_components/lutron_caseta/const.py
  3. Restart HA.
  4. Install the custom component integration as you would the official integration. (Note that it will appear as a sibling to the official component in the list, but marked with a custom component icon.)
  5. Then following the rest of the steps to set up.

I'm not looking forward to maintaining a custom version this component, but the show must go on!

I'm motivated to provide assistance to anyone who might be able to help resolve this timeout issue.

@jonoberheide
Copy link
Contributor Author

Now that I have a HA dev environment and a little experience, I'll take a swing at this.

@bdraco
Copy link
Member

bdraco commented Nov 13, 2024

Happy to do a review and help get something merged. However I don't have the capacity to do support for this integration at this time.

@jonoberheide
Copy link
Contributor Author

Added a pull request (gurumitts/pylutron-caseta#176) to the upstream library that will do a connectivity test to the bridge separate from the long/full load that happens during the Smartbridge.connect() function. If this approach is accepted upstream, then I'll add a modification to HA-land to test connectivity with this new function and then also increase the BRIDGE_TIMEOUT to 59 seconds.

@jonoberheide
Copy link
Contributor Author

Another thing is that the pylutroncaseta library uses the built-in, json parser, which is very very slow. We could switch it to use orjson like everything in Home Assistant does. But I'm speculating on what the problem is.

Also added a PR for this gurumitts/pylutron-caseta#175

@jonoberheide
Copy link
Contributor Author

It might look something like this:

diff --git a/homeassistant/components/lutron_caseta/__init__.py b/homeassistant/components/lutron_caseta/__init__.py
index 26fc5ba153..664c277897 100644
--- a/homeassistant/components/lutron_caseta/__init__.py
+++ b/homeassistant/components/lutron_caseta/__init__.py
@@ -168,6 +168,10 @@ async def async_setup_entry(
         _LOGGER.error("Invalid certificate used to connect to bridge at %s", host)
         return False
 
+    # Test out our connection and credentials first before loading lutron devices
+    if not await bridge.test_connection():
+        raise ConfigEntryNotReady(f"Timed out while trying to connect to bridge at {host}")
+
     timed_out = True
     with contextlib.suppress(TimeoutError):
         async with asyncio.timeout(BRIDGE_TIMEOUT):
@@ -177,7 +181,7 @@ async def async_setup_entry(
     if timed_out or not bridge.is_connected():
         await bridge.close()
         if timed_out:
-            raise ConfigEntryNotReady(f"Timed out while trying to connect to {host}")
+            raise ConfigEntryNotReady(f"Timed out while trying to load bridge devices at {host}")
         if not bridge.is_connected():
             raise ConfigEntryNotReady(f"Cannot connect to {host}")
 
diff --git a/homeassistant/components/lutron_caseta/config_flow.py b/homeassistant/components/lutron_caseta/config_flow.py
index cd566b767f..fe7e7259a3 100644
--- a/homeassistant/components/lutron_caseta/config_flow.py
+++ b/homeassistant/components/lutron_caseta/config_flow.py
@@ -231,12 +231,20 @@ class LutronCasetaFlowHandler(ConfigFlow, domain=DOMAIN):
             )
             return None
 
+        # Test out our connection and credentials first before loading lutron devices
+        if not await bridge.test_connection():
+            _LOGGER.error(
+                "Timeout while trying to connect to bridge at %s",
+                self.data[CONF_HOST],
+            )
+            return None
+
         try:
             async with asyncio.timeout(BRIDGE_TIMEOUT):
                 await bridge.connect()
         except TimeoutError:
             _LOGGER.error(
-                "Timeout while trying to connect to bridge at %s",
+                "Timeout while trying to load bridge devices at %s",
                 self.data[CONF_HOST],
             )
         else:
diff --git a/homeassistant/components/lutron_caseta/const.py b/homeassistant/components/lutron_caseta/const.py
index 7493878bec..dfd0b5270c 100644
--- a/homeassistant/components/lutron_caseta/const.py
+++ b/homeassistant/components/lutron_caseta/const.py
@@ -33,7 +33,7 @@ ACTION_RELEASE = "release"
 
 CONF_SUBTYPE = "subtype"
 
-BRIDGE_TIMEOUT = 35
+BRIDGE_TIMEOUT = 59
 
 UNASSIGNED_AREA = "Unassigned"

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment