Make re-connect logic more robust #177

Closed
wants to merge 2 commits into from

Conversation

postlund
Collaborator

@postlund commented Nov 21, 2020

This change fixes a few problems with the current re-connect loop, where no re-connect attempts would be made. This usually happens if a device is already in use by someone else (e.g. app). It also fixes a problem where the heartbeater task would not quit (yielding those pesky AttributeError: 'NoneType' object has no attribute 'write' errors). I have only tested it myself, but I would like to get feedback from at least someone else. Gonna ask for help in #160.
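For context, a minimal sketch of the kind of heartbeat task whose clean shutdown is at stake here, assuming an asyncio-based setup where the task must be cancelled before the underlying writer is dropped (the names and interval are illustrative, not the exact pytuya code):

import asyncio
import logging

_LOGGER = logging.getLogger(__name__)

HEARTBEAT_INTERVAL = 10  # seconds, illustrative value


async def heartbeat_loop(protocol):
    """Send periodic heartbeats until the task is cancelled."""
    _LOGGER.debug("Started heartbeat loop")
    try:
        while True:
            await protocol.heartbeat()  # assumed coroutine on the protocol object
            await asyncio.sleep(HEARTBEAT_INTERVAL)
    except asyncio.CancelledError:
        # If this task is not cancelled when the connection closes, the next
        # heartbeat runs against a writer that is already None, which is the
        # AttributeError mentioned above.
        pass
    finally:
        _LOGGER.debug("Stopped heartbeat loop")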

@postlund added the bug (Something isn't working) and enhancement (New feature or request) labels Nov 21, 2020
@postlund self-assigned this Nov 21, 2020
@ultratoto14
Collaborator

Hi @postlund, I'm having some difficulties checking this one as I was on passive devices (which work so well for my lights 😄).

So I switched to this branch with some of my devices powered and some not.
The device with id 54802417d8f15b95251f ([548...51f] in the logs) was on, but even after 10 minutes it's still reported as unavailable in HA.
home-assistant.log (1).zip

This is the same hardware as 548024172cf432b1846f, which was also on at start and is reported as available.

@ultratoto14
Collaborator

@postlund An additional remark: some bulbs are reported as available but do not respond to commands. Nothing in the logs, no exceptions.

@postlund
Collaborator Author

@ultratoto14 Hmm, I don't see any apparent issue. Since I haven't touched that code, it should work as before, assuming the connection succeeds. Availability is based on whether the DP referenced by the entity's ID is present in the datapoints we fetch from the device, which seems to be the case here. Maybe you can add some debug logging in available (common.py) and print self._status, to see which datapoints have been received? I will fix the PR for passive devices once this has been sorted out and merged, since there are a few dependencies.
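A minimal sketch of the kind of debug logging suggested here, assuming an available property in common.py backed by a self._status dict of received datapoints (the class and attribute names are illustrative, not the exact integration code):

import logging

_LOGGER = logging.getLogger(__name__)


class LocalTuyaEntity:
    """Illustrative stand-in for the entity class in common.py."""

    def __init__(self, dp_id):
        self._dp_id = dp_id
        self._status = {}  # datapoints last received from the device

    @property
    def available(self):
        """Available when the entity's datapoint is among the received datapoints."""
        # Temporary debug output to see exactly what the device reported.
        _LOGGER.debug(
            "Availability check for DP %s, received datapoints: %s",
            self._dp_id,
            self._status,
        )
        return str(self._dp_id) in self._status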

@ultratoto14
Collaborator

@postlund OK, I'll try to find time to retest it; I will be pretty busy in the coming weeks.

@ultratoto14
Collaborator

@postlund Found some time; I've relaunched with this branch for the moment.

My devices are configured through YAML, so I removed all the devices from the integration page and restarted HA.
I had some exceptions during startup. I'm not sure they're linked to this PR, but they may explain some of the weird behavior I've had at times.

2020-11-23 14:42:38 ERROR (MainThread) [homeassistant.config_entries] Error setting up entry SP22 Socket4 (YAML) for sensor
Traceback (most recent call last):
  File "/usr/src/homeassistant/homeassistant/config_entries.py", line 236, in async_setup
    result = await component.async_setup_entry(hass, self)  # type: ignore
  File "/usr/src/homeassistant/homeassistant/components/sensor/__init__.py", line 67, in async_setup_entry
    return await hass.data[DOMAIN].async_setup_entry(entry)
  File "/usr/src/homeassistant/homeassistant/helpers/entity_component.py", line 155, in async_setup_entry
    raise ValueError("Config entry has already been setup!")
ValueError: Config entry has already been setup!
grep '\[homeassistant.config_entries\]'  home-assistant.log
2020-11-23 14:42:35 ERROR (MainThread) [homeassistant.config_entries] Error unloading entry Calex Bulb 3 (YAML) for light
2020-11-23 14:42:36 ERROR (MainThread) [homeassistant.config_entries] Error setting up entry Calex Bulb 3 (YAML) for light
2020-11-23 14:42:37 ERROR (MainThread) [homeassistant.config_entries] Error setting up entry SP22 Parents (YAML) for sensor
2020-11-23 14:42:37 ERROR (MainThread) [homeassistant.config_entries] Error setting up entry SP22 Socket3 (YAML) for sensor
2020-11-23 14:42:37 ERROR (MainThread) [homeassistant.config_entries] Error setting up entry SP22 Enzo (YAML) for sensor
2020-11-23 14:42:38 ERROR (MainThread) [homeassistant.config_entries] Error setting up entry SP22 Socket4 (YAML) for sensor

These logs appear on the first start.

After the third restart it comes up OK: switching devices physically off and on works, and the time to detect availability varies but seems normal.

I also observed that having multiple devices physically off during startup seems to increase the probability of getting into trouble. Maybe we're facing a race during startup. When I encounter these problems, HA takes a long time to start (around 5 minutes instead of less than one). I don't know if any of this helps you.

@postlund
Collaborator Author

@ultratoto14 I will have a look at that. I'm a bit unsure what causes the startup delays, but that should hopefully be easy to settle. The config entry issue is something I thought I had fixed. Can you paste your config?

@ultratoto14
Collaborator

I've just listed the Tuya configs:

############ TUYA #################
localtuya:

  - host: 192.168.15.160
    device_id: ---------------------
    local_key: ----------------
    friendly_name: RGBW Led Strip LSC
    protocol_version: "3.3"
    entities:
      - platform: light
        friendly_name: RGBW Led Strip LSC
        id: 20
        color_mode: 21
        # brightness: 22
        # color_temp: 23
        color: 24
        scene: 25
        # brightness_upper: 1000
        # color_temp_min_kelvin: 2700
        # color_temp_max_kelvin: 6500
      # - platform: sensor
      #   friendly_name: RGBW Led Strip LSC - 25
      #   id: 25

  - host: 192.168.15.161
    device_id: --------------------
    local_key: ----------------
    friendly_name: LSC 1
    protocol_version: "3.3"
    # passive_device: True
    entities:
      - platform: light
        friendly_name: Bureau 1
        id: 1
        color_mode: 2
        brightness: 3
        color_temp: 4
        color: 5
        scene: 6
        brightness_upper: 255
        color_temp_min_kelvin: 2700
        color_temp_max_kelvin: 6500
        


  - host: 192.168.15.162
    device_id: --------------------
    local_key: ----------------
    friendly_name: LSC 2
    protocol_version: "3.3"
    # passive_device: True
    entities:
      - platform: light
        friendly_name: Bureau 2
        id: 1
        color_mode: 2
        brightness: 3
        color_temp: 4
        color: 5
        scene: 6
        brightness_upper: 255
        color_temp_min_kelvin: 2700
        color_temp_max_kelvin: 6500

  - host: 192.168.15.163
    device_id: --------------------
    local_key: ----------------
    friendly_name: Testtttt LSC
    protocol_version: "3.3"
    # passive_device: True
    entities:
      - platform: light
        friendly_name: Testtttt
        id: 1
        color_mode: 2
        brightness: 3
        color_temp: 4
        color: 5
        scene: 6
        brightness_upper: 255
        color_temp_min_kelvin: 2700
        color_temp_max_kelvin: 6500

  - host: 192.168.15.165
    device_id: ---------------------
    local_key: ----------------
    friendly_name: Calex 1
    protocol_version: "3.3"
    # passive_device: True
    entities:
      - platform: light
        friendly_name: Salon 1
        id: 20
        color_mode: 21
        brightness: 22
        color_temp: 23
        color: 24
        scene: 25
        brightness_upper: 1000
        color_temp_min_kelvin: 2200
        color_temp_max_kelvin: 4000

  - host: 192.168.15.166
    device_id: ---------------------
    local_key: ----------------
    friendly_name: Calex 2
    protocol_version: "3.3"
    # passive_device: True
    entities:
      - platform: light
        friendly_name: Salon 2
        id: 20
        color_mode: 21
        brightness: 22
        color_temp: 23
        color: 24
        scene: 25
        brightness_upper: 1000
        color_temp_min_kelvin: 2200
        color_temp_max_kelvin: 4000
        
  - host: 192.168.15.167
    device_id: --------------------zl 
    local_key: ----------------
    friendly_name: Calex 3
    protocol_version: "3.3"
    # passive_device: True
    entities:
      - platform: light
        friendly_name: Salon 3
        id: 20
        color_mode: 21
        brightness: 22
        color_temp: 23
        color: 24
        scene: 25
        brightness_upper: 1000
        color_temp_min_kelvin: 2200
        color_temp_max_kelvin: 4000

  - host: 192.168.15.168
    device_id: ---------------------
    local_key: ----------------
    friendly_name: Calex 4
    protocol_version: "3.3"
    # passive_device: True
    entities:
      - platform: light
        friendly_name: Salon 4
        id: 20
        color_mode: 21
        brightness: 22
        color_temp: 23
        color: 24
        scene: 25
        brightness_upper: 1000
        color_temp_min_kelvin: 2200
        color_temp_max_kelvin: 4000

  - host: 192.168.15.169
    device_id: --------------------xh
    local_key: ----------------
    friendly_name: Calex Bulb 1
    protocol_version: "3.3"
    entities:
      - platform: light
        friendly_name: Calex Bulb Salon 1
        id: 20
        color_mode: 21
        brightness: 22
        color_temp: 23
        color: 24
        scene: 25
        brightness_upper: 1000
        color_temp_min_kelvin: 2200
        color_temp_max_kelvin: 4000

  - host: 192.168.15.170
    device_id: ---------------------
    local_key: ----------------
    friendly_name: Calex Bulb 2
    protocol_version: "3.3"
    entities:
      - platform: light
        friendly_name: Calex Bulb Salon 2
        id: 20
        color_mode: 21
        brightness: 22
        color_temp: 23
        color: 24
        scene: 25
        brightness_upper: 1000
        color_temp_min_kelvin: 2200
        color_temp_max_kelvin: 4000

  - host: 192.168.15.171
    device_id: ----------------------
    local_key: ----------------
    friendly_name: Calex Bulb 3
    protocol_version: "3.3"
    entities:
      - platform: light
        friendly_name: Calex Bulb Salon 3
        id: 20
        color_mode: 21
        brightness: 22
        color_temp: 23
        color: 24
        scene: 25
        brightness_upper: 1000
        color_temp_min_kelvin: 2200
        color_temp_max_kelvin: 4000

  - host: 192.168.15.172
    device_id: ---------------------
    local_key: ----------------
    friendly_name: Calex Bulb 4
    protocol_version: "3.3"
    entities:
      - platform: light
        friendly_name: Calex Bulb Salon 4
        id: 20
        color_mode: 21
        brightness: 22
        color_temp: 23
        color: 24
        scene: 25
        brightness_upper: 1000
        color_temp_min_kelvin: 2200
        color_temp_max_kelvin: 4000
            
  - host: 192.168.15.173
    device_id: ---------------------
    local_key: ----------------
    friendly_name: SP22 Enzo
    protocol_version: "3.3"
    entities:
      - platform: switch
        friendly_name: SP22 Enzo
        id: 1
      
      - platform: sensor
        friendly_name: SP22 Enzo - Tension
        id: 20
        scaling: 0.1
        device_class: voltage
        unit_of_measurement: V

      - platform: sensor
        friendly_name: SP22 Enzo - Courant
        id: 18
        device_class: current
        unit_of_measurement: mA

      - platform: sensor
        friendly_name: SP22 Enzo - Puissance
        id: 19
        scaling: 0.1
        device_class: power
        unit_of_measurement: W

  - host: 192.168.15.174
    device_id: ---------------------
    local_key: ----------------
    friendly_name: SP22 Parents
    protocol_version: "3.3"
    entities:
      - platform: switch
        friendly_name: SP22 Parents
        id: 1
      
      - platform: sensor
        friendly_name: SP22 Parents - Tension
        id: 20
        scaling: 0.1
        device_class: voltage
        unit_of_measurement: V

      - platform: sensor
        friendly_name: SP22 Parents - Courant
        id: 18
        device_class: current
        unit_of_measurement: mA

      - platform: sensor
        friendly_name: SP22 Parents - Puissance
        id: 19
        scaling: 0.1
        device_class: power
        unit_of_measurement: W

  - host: 192.168.15.175
    device_id: ---------------------
    local_key: ----------------
    friendly_name: SP22 Socket3
    protocol_version: "3.3"
    entities:
      - platform: switch
        friendly_name: SP22 Socket3
        id: 1
      
      - platform: sensor
        friendly_name: SP22 Socket3 - Tension
        id: 20
        scaling: 0.1
        device_class: voltage
        unit_of_measurement: V

      - platform: sensor
        friendly_name: SP22 Socket3 - Courant
        id: 18
        device_class: current
        unit_of_measurement: mA

      - platform: sensor
        friendly_name: SP22 Socket3 - Puissance
        id: 19
        scaling: 0.1
        device_class: power
        unit_of_measurement: W

  - host: 192.168.15.176
    device_id: --------------------sh
    local_key: ----------------
    friendly_name: SP22 Socket4
    protocol_version: "3.3"
    entities:
      - platform: switch
        friendly_name: SP22 Socket4
        id: 1
      
      - platform: sensor
        friendly_name: SP22 Socket4 - Tension
        id: 20
        scaling: 0.1
        device_class: voltage
        unit_of_measurement: V

      - platform: sensor
        friendly_name: SP22 Socket4 - Courant
        id: 18
        device_class: current
        unit_of_measurement: mA

      - platform: sensor
        friendly_name: SP22 Socket4 - Puissance
        id: 19
        scaling: 0.1
        device_class: power
        unit_of_measurement: W

sensor:

  - platform: integration
    source: sensor.sp22_enzo_puissance
    name: sp22_enzo_wattage
    unit_time: h
    round: 1

  - platform: integration
    source: sensor.sp22_parents_puissance
    name: sp22_parents_wattage
    unit_time: h
    round: 1

utility_meter:

  sp22_enzo_energy:
    source: sensor.sp22_enzo_wattage
    cycle: daily

  sp22_parents_energy:
    source: sensor.sp22_parents_puissance
    cycle: daily

@ultratoto14
Collaborator

@postlund Is the YAML configuration helping?

An additional remark: when I hit the problem of a device not being discovered or a wrong status being reported to HA, a simple reload of the integration makes things OK again.

@ultratoto14
Collaborator

@postlund If I encounter the slow startup problem, maybe I can use this; would it be useful for you?

@postlund
Collaborator Author

@ultratoto14 I haven't had the time to look at it yet, will do that soon. The problem isn't that something "hangs" in the way that a profiler would find it. It's more that Home Assistant expects the entities to be reported as added ok. But I will look into it.

@ultratoto14
Collaborator

Hi @postlund, as mentioned earlier I will have less time until the end of the year, but I'll try to help as soon as I can.

About this PR: I stayed on it for the past two days. I'm sorry to say it, but it doesn't seem to help with re-connection. Devices that are switched off for a while and then back on are only randomly discovered; after a while (>15 min) they may all be discovered.

In the passive device PR it works every time (I was not able to get a failing bulb); I'm not sure if the re-connection logic is the same, but it works. Since this PR is based on time (as in master) instead of on received messages, it should behave the same once the reconnect is triggered, but it doesn't. Tell me if I can help in any way.

@postlund
Collaborator Author

@ultratoto14 Please do note that this PR is not about discovery speed or anything like it. It is purely about making the re-connection routine more robust (don't mix it up with passive devices; that's something else). What is currently on master sometimes ends up in a state where no re-connects are made at all. The fact that the connection is re-established at all likely means it's working.

However, since the backoff time (before re-connecting) is capped at five minutes, it shouldn't take that long to re-connect. Do you see any errors in the log?
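For reference, a minimal sketch of the capped-backoff connect loop being described, with placeholder names and numbers (not the exact PR code):

import asyncio
import logging

_LOGGER = logging.getLogger(__name__)

MAX_BACKOFF = 300  # cap the wait between attempts at five minutes


async def connect_loop(host, connect):
    """Retry connecting forever, doubling the wait up to MAX_BACKOFF seconds."""
    backoff = 5
    while True:
        _LOGGER.debug("Waiting %d seconds before connecting to %s", backoff, host)
        await asyncio.sleep(backoff)
        try:
            return await connect(host)
        except OSError as err:
            _LOGGER.debug("Connect to %s failed: %s", host, err)
            backoff = min(backoff * 2, MAX_BACKOFF)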

@ultratoto14
Collaborator

@postlund I clearly understand that it's not about discovery time; I just said that some of my bulbs do not reconnect even after 25 or 35 minutes, not just 5. I was drawing a parallel: the reconnect always works on the passive PR, while here the reconnect does not work every time.

About the logs: they are so verbose, as I have plenty of devices, that in the end they make my system fail. I get an exception trace on each failing reconnect, which surely doesn't help with debugging. I may get back to this PR, try to limit the logging to only one or two devices, and see.

@ultratoto14
Collaborator

ultratoto14 commented Nov 27, 2020

Hi, I let it run all day. The ones you should look at are:
[548...46f]
[548...51f]

2020-11-27 18:50:12 DEBUG (MainThread) [custom_components.localtuya.common] [548...46f] Waiting 300 seconds before connecting to 192.168.15.161
...
2020-11-27 18:51:25 DEBUG (MainThread) [custom_components.localtuya.common] [548...51f] Waiting 300 seconds before connecting to 192.168.15.162

At 18:54 I switched them both on.
Then we can see the same kind of logs for both:

2020-11-27 18:56:25 DEBUG (MainThread) [custom_components.localtuya.pytuya] [548...51f] Started heartbeat loop
2020-11-27 18:56:25 DEBUG (MainThread) [custom_components.localtuya.pytuya] [548...51f] Sending command heartbeat (device type: type_0a)
2020-11-27 18:56:25 DEBUG (MainThread) [custom_components.localtuya.pytuya] [548...51f] Send payload: b'{}'
2020-11-27 18:56:25 DEBUG (MainThread) [custom_components.localtuya.pytuya] [548...51f] Waiting for sequence number -100
2020-11-27 18:56:25 DEBUG (MainThread) [custom_components.localtuya.common] [548...51f] Retrieving initial state
2020-11-27 18:56:25 DEBUG (MainThread) [custom_components.localtuya.pytuya] [548...51f] Sending command status (device type: type_0a)
2020-11-27 18:56:25 DEBUG (MainThread) [custom_components.localtuya.pytuya] [548...51f] Send payload: b'{"gwId":"54802417d8f15b95251f","devId":"54802417d8f15b95251f"}'
2020-11-27 18:56:25 DEBUG (MainThread) [custom_components.localtuya.pytuya] [548...51f] Waiting for sequence number 1
2020-11-27 18:56:25 DEBUG (MainThread) [custom_components.localtuya.pytuya] [548...51f] Connection lost: [Errno 104] Connection reset by peer
2020-11-27 18:56:25 DEBUG (MainThread) [custom_components.localtuya.pytuya] [548...51f] Closing connection
2020-11-27 18:56:25 DEBUG (MainThread) [custom_components.localtuya.pytuya] [548...51f] Stopped heartbeat loop
2020-11-27 18:56:25 DEBUG (MainThread) [custom_components.localtuya.pytuya] [548...51f] Connection lost: None
2020-11-27 18:56:25 DEBUG (MainThread) [custom_components.localtuya.pytuya] [548...51f] Wait was aborted for seqno 1
2020-11-27 18:56:25 WARNING (MainThread) [custom_components.localtuya.common] [548...51f] Failed to retrieve initial state, will re-connect
2020-11-27 18:56:25 DEBUG (MainThread) [custom_components.localtuya.common] [548...51f] Starting connect loop to 192.168.15.162
2020-11-27 18:56:25 DEBUG (MainThread) [custom_components.localtuya.common] [548...51f] Waiting 300 seconds before connecting to 192.168.15.162

debug.log.zip

They both reconnect on the second try.

@ultratoto14
Collaborator

So the first connection is closed by a connection reset by peer; this may mean that the device is there but does not answer correctly, or closes the connection. Could we reset the loop counter in that case?
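A small sketch of that idea, assuming a backoff value like in the connect loop discussed above; treating a reset-by-peer as a fast-retry case is only the suggestion being made here, not existing behavior:

import errno

MIN_BACKOFF = 5
MAX_BACKOFF = 300


def next_backoff(current, err):
    """Choose the next wait: retry quickly after a reset by peer, otherwise back off."""
    if isinstance(err, OSError) and err.errno == errno.ECONNRESET:
        # The device answered (and dropped us), so it is reachable; no need to
        # wait the full backoff before trying again.
        return MIN_BACKOFF
    return min(current * 2, MAX_BACKOFF)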

@ultratoto14
Collaborator

Hi @postlund, I see more and more issues about re-connection not working or status being lost. I also noticed that I lost access to some of my smart sockets during the night, with no error in the log.

I have a spare bulb that I can use for testing and focus on. I also have an HA instance running on a Pi that just reports some specific temperature sensors. So if you have ideas about adding a lot of debug logging that might help you, I can use that HA instance with the spare bulb to help identify the connection problems.

What do you think?

@jfparis

jfparis commented Dec 5, 2020

Hi @postlund

I believe the code currently in the branch is a net improvement compared to what we had before.

However, my logbook is still full of device-unavailable entries (lasting a couple of seconds, i.e. the time for a timeout / reconnect cycle), which would be #136.

Anyway, I am keeping this branch running for now. Please @ me if you need any further iteration tested.

@ultratoto14
Collaborator

Hi @postlund, continuing to check this one. I have a feeling the problem may be linked to the time between switch-on and reconnection. As I already said, the passive device PR always reconnects, and it does so within seconds (<20) of switch-on. Without passive_device, the reconnection may occur up to 5 minutes later. Could it be that, between those first 20 seconds and the following minutes, the bulb connects to the cloud, preventing the integration from connecting to it?

I'll try blocking the internet access of the bulb like in #195 and see if it helps.

@postlund
Collaborator Author

postlund commented Dec 7, 2020

Sorry @ultratoto14 (and you others) for not following up properly. I do think this PR brings the intended improvements to the table: making sure that we don't get stuck in a state where we don't re-connect at all. Whether the re-connects themselves fail or not is a different story. However... I'm starting to think that we should abandon this and move towards making the "passive device" functionality the default and only implementation. The only downside I can think of is that people with more obscure network configurations will have to make sure broadcasts propagate properly (e.g. between subnets or VLANs). But look at the gains: a much less complex code base, much faster re-connects (likely with better results, as devices only broadcast when they are ready for a connection), and less overhead since re-connects only occur when needed... I feel we can probably be OK with that. Any objections?

@ultratoto14
Collaborator

@postlund, I agree that passive mode can be the default one.
If you want, I can test the passive PR with my "always on" devices too?

@jfparis

jfparis commented Dec 7, 2020

I can already see myself with a violent iptables-induced headache :) I have all my plugs on a separate VLAN / wireless network, given where they call back home.

If this passive mode is supported by those Gosund plugs I have, I'm happy to help with some tests and documentation for router configuration (please do me a favour and keep the current branch as-is in git for a while).

@postlund
Collaborator Author

postlund commented Dec 7, 2020

@jfparis All Tuya devices broadcast their presence every five seconds on UDP port 6668 on the local network (when no one is connected). The "passive" thing is just something I came up with, referring to localtuya passively listening for broadcasts before connecting instead of actively trying to connect.
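A minimal sketch of passively listening for those broadcasts with asyncio, using the port mentioned above; the payload is only logged, since real discovery messages are Tuya-framed and parsing them is out of scope here:

import asyncio
import logging

_LOGGER = logging.getLogger(__name__)

DISCOVERY_PORT = 6668  # port mentioned above


class DiscoveryListener(asyncio.DatagramProtocol):
    """Log every broadcast datagram seen on the discovery port."""

    def datagram_received(self, data, addr):
        _LOGGER.debug("Broadcast from %s: %d bytes", addr[0], len(data))


async def main():
    loop = asyncio.get_running_loop()
    transport, _ = await loop.create_datagram_endpoint(
        DiscoveryListener, local_addr=("0.0.0.0", DISCOVERY_PORT)
    )
    try:
        await asyncio.sleep(30)  # listen for half a minute
    finally:
        transport.close()


if __name__ == "__main__":
    logging.basicConfig(level=logging.DEBUG)
    asyncio.run(main())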

@jfparis

jfparis commented Dec 7, 2020

All right, so this would come down to forwarding the broadcasts from the IoT VLAN to the Home Assistant host. That should not be too much brain damage.

@postlund
Collaborator Author

postlund commented Dec 7, 2020

Exactly, a simple rule for that port should be sufficient.

@postlund
Collaborator Author

postlund commented Dec 7, 2020

I'll leave this PR open for now, but let's call it dead and focus on the passive device support instead. There are a few fixes I need to port over and a bunch of code to remove. Keep an eye on #171 as I will need testers.

@Matssa56

Matssa56 commented Dec 8, 2020

Just FYI, this might be linked to #226. I tried to close the connection directly at the router and I can confirm that it tries to connect every minute or so, flooding the network with DNS requests; it doesn't help with the network dropouts, and maybe it's even worse than before. However, I didn't forward the port on the router directly to my RPi4 with HA; it might be worth a try, but I'm not really sure what it will do ^^'

@jfparis

jfparis commented Dec 10, 2020

@postlund Not sure whether this message belongs here or in the passive PR, but here it goes:

  • Forwarding UDP broadcast datagrams from a properly segregated VLAN built on a typical Asus router is a pain, as those setups depend on bridged networks and usually run on (now antique) Linux 2.6 kernels. (Datagram forwarding on big iron or a 5.0 kernel would be easier, but it's a different story on widely available hardware.) => If there were an easy way to have passive and active systems co-exist, that would be ideal.
  • I am now looking at alternative solutions (to a forward).
    • I have had to set up avahi (mDNS) forwarding to allow discovery of Chromecasts by Hass, so I'm looking into a similar solution.
    • Are you relying on the datagram sender's IP address to identify potential candidates, or is it included in the message?

cheers

@postlund
Collaborator Author

@jfparis I hear you; I would still prefer not to have to maintain this code, though. The address is part of the discovery message, so what's in the IP header doesn't matter. If it helps, I can develop a small script that you run on the network where your devices are and that forwards discovery messages to another IP. I guess it should work to just point it at Home Assistant, assuming no firewall blocks the packets.
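A rough sketch of what such a forwarder could look like: run it on the network where the devices are and it relays every broadcast datagram unchanged to a target IP. The port and target address are placeholders, and this is only an illustration of the idea, not the script being offered:

import socket

DISCOVERY_PORT = 6668              # port the devices broadcast on (per the comment above)
TARGET = ("192.168.1.10", 6668)    # hypothetical Home Assistant host

# Listen for broadcasts on the IoT network and relay them unchanged.
listener = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
listener.bind(("0.0.0.0", DISCOVERY_PORT))

sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

while True:
    data, addr = listener.recvfrom(4096)
    # Relay the raw payload; the device address is inside the message itself,
    # so the IP header does not need rewriting (as noted above).
    sender.sendto(data, TARGET)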

@jfparis

jfparis commented Dec 14, 2020 via email

@postlund
Collaborator Author

@jfparis Sounds perfect! Feel free to submit a PR when you are done. Preferably, create a directory called scripts in the root of this repo and put it there.

@postlund closed this Dec 17, 2020
Labels: bug (Something isn't working), enhancement (New feature or request)
4 participants