Make re-connect logic more robust #177
Conversation
Hi @postlund, I'm having some difficulties checking this one, as I was on passive devices (which work so well for my lights 😄). So I switched to this one with some of my devices powered and some not. This is the same hardware as the
@postlund An additional remark: some bulbs are reported as available but do not respond to commands. Nothing in the logs. No exception.
@ultratoto14 Hmm, I don't see any apparent issue. Since I haven't touched that code, it should work as before, assuming the connection succeeds. Availability is based on whether the DP used by `id` is present in the datapoints we fetch from the device, which seems to be the case. Maybe you can add some debug logs in
@postlund OK, I'll try to find time and retest it; I will be pretty busy in the coming weeks.
@postlund Found some time, relaunched with this branch at the moment. My devices are configured through YAML, so I removed all the devices from the integration page and restarted HA.
These logs happen on the first start. After the third restart, it comes up OK: switching physically off and on works, and the time to detect availability varies, which seems normal. I also observed that having multiple devices physically off during startup may increase the probability of getting into trouble. Maybe we face a race during startup. It seems that when I encounter these problems, HA takes a long time to start up (around 5 minutes instead of less than one). Don't know if any of this helps you.
@ultratoto14 I will have a look at that. I'm a bit unsure what causes the delays at startup, but that should hopefully be easy to settle. The config entry issue is something I thought I had fixed. Can you paste your config?
Just listed the Tuya configs:

```yaml
############ TUYA #################
localtuya:
  - host: 192.168.15.160
    device_id: --------------------
    local_key: ----------------
    friendly_name: RGBW Led Strip LSC
    protocol_version: "3.3"
    entities:
      - platform: light
        friendly_name: RGBW Led Strip LSC
        id: 20
        color_mode: 21
        # brightness: 22
        # color_temp: 23
        color: 24
        scene: 25
        # brightness_upper: 1000
        # color_temp_min_kelvin: 2700
        # color_temp_max_kelvin: 6500
      # - platform: sensor
      #   friendly_name: RGBW Led Strip LSC - 25
      #   id: 25
  - host: 192.168.15.161
    device_id: --------------------
    local_key: ----------------
    friendly_name: LSC 1
    protocol_version: "3.3"
    # passive_device: True
    entities:
      - platform: light
        friendly_name: Bureau 1
        id: 1
        color_mode: 2
        brightness: 3
        color_temp: 4
        color: 5
        scene: 6
        brightness_upper: 255
        color_temp_min_kelvin: 2700
        color_temp_max_kelvin: 6500
  - host: 192.168.15.162
    device_id: --------------------
    local_key: ----------------
    friendly_name: LSC 2
    protocol_version: "3.3"
    # passive_device: True
    entities:
      - platform: light
        friendly_name: Bureau 2
        id: 1
        color_mode: 2
        brightness: 3
        color_temp: 4
        color: 5
        scene: 6
        brightness_upper: 255
        color_temp_min_kelvin: 2700
        color_temp_max_kelvin: 6500
  - host: 192.168.15.163
    device_id: --------------------
    local_key: ----------------
    friendly_name: Testtttt LSC
    protocol_version: "3.3"
    # passive_device: True
    entities:
      - platform: light
        friendly_name: Testtttt
        id: 1
        color_mode: 2
        brightness: 3
        color_temp: 4
        color: 5
        scene: 6
        brightness_upper: 255
        color_temp_min_kelvin: 2700
        color_temp_max_kelvin: 6500
  - host: 192.168.15.165
    device_id: --------------------
    local_key: ----------------
    friendly_name: Calex 1
    protocol_version: "3.3"
    # passive_device: True
    entities:
      - platform: light
        friendly_name: Salon 1
        id: 20
        color_mode: 21
        brightness: 22
        color_temp: 23
        color: 24
        scene: 25
        brightness_upper: 1000
        color_temp_min_kelvin: 2200
        color_temp_max_kelvin: 4000
  - host: 192.168.15.166
    device_id: --------------------
    local_key: ----------------
    friendly_name: Calex 2
    protocol_version: "3.3"
    # passive_device: True
    entities:
      - platform: light
        friendly_name: Salon 2
        id: 20
        color_mode: 21
        brightness: 22
        color_temp: 23
        color: 24
        scene: 25
        brightness_upper: 1000
        color_temp_min_kelvin: 2200
        color_temp_max_kelvin: 4000
  - host: 192.168.15.167
    device_id: --------------------zl
    local_key: ----------------
    friendly_name: Calex 3
    protocol_version: "3.3"
    # passive_device: True
    entities:
      - platform: light
        friendly_name: Salon 3
        id: 20
        color_mode: 21
        brightness: 22
        color_temp: 23
        color: 24
        scene: 25
        brightness_upper: 1000
        color_temp_min_kelvin: 2200
        color_temp_max_kelvin: 4000
  - host: 192.168.15.168
    device_id: --------------------
    local_key: ----------------
    friendly_name: Calex 4
    protocol_version: "3.3"
    # passive_device: True
    entities:
      - platform: light
        friendly_name: Salon 4
        id: 20
        color_mode: 21
        brightness: 22
        color_temp: 23
        color: 24
        scene: 25
        brightness_upper: 1000
        color_temp_min_kelvin: 2200
        color_temp_max_kelvin: 4000
  - host: 192.168.15.169
    device_id: --------------------xh
    local_key: ----------------
    friendly_name: Calex Bulb 1
    protocol_version: "3.3"
    entities:
      - platform: light
        friendly_name: Calex Bulb Salon 1
        id: 20
        color_mode: 21
        brightness: 22
        color_temp: 23
        color: 24
        scene: 25
        brightness_upper: 1000
        color_temp_min_kelvin: 2200
        color_temp_max_kelvin: 4000
  - host: 192.168.15.170
    device_id: --------------------
    local_key: ----------------
    friendly_name: Calex Bulb 2
    protocol_version: "3.3"
    entities:
      - platform: light
        friendly_name: Calex Bulb Salon 2
        id: 20
        color_mode: 21
        brightness: 22
        color_temp: 23
        color: 24
        scene: 25
        brightness_upper: 1000
        color_temp_min_kelvin: 2200
        color_temp_max_kelvin: 4000
  - host: 192.168.15.171
    device_id: --------------------
    local_key: ----------------
    friendly_name: Calex Bulb 3
    protocol_version: "3.3"
    entities:
      - platform: light
        friendly_name: Calex Bulb Salon 3
        id: 20
        color_mode: 21
        brightness: 22
        color_temp: 23
        color: 24
        scene: 25
        brightness_upper: 1000
        color_temp_min_kelvin: 2200
        color_temp_max_kelvin: 4000
  - host: 192.168.15.172
    device_id: --------------------
    local_key: ----------------
    friendly_name: Calex Bulb 4
    protocol_version: "3.3"
    entities:
      - platform: light
        friendly_name: Calex Bulb Salon 4
        id: 20
        color_mode: 21
        brightness: 22
        color_temp: 23
        color: 24
        scene: 25
        brightness_upper: 1000
        color_temp_min_kelvin: 2200
        color_temp_max_kelvin: 4000
  - host: 192.168.15.173
    device_id: --------------------
    local_key: ----------------
    friendly_name: SP22 Enzo
    protocol_version: "3.3"
    entities:
      - platform: switch
        friendly_name: SP22 Enzo
        id: 1
      - platform: sensor
        friendly_name: SP22 Enzo - Tension
        id: 20
        scaling: 0.1
        device_class: voltage
        unit_of_measurement: V
      - platform: sensor
        friendly_name: SP22 Enzo - Courant
        id: 18
        device_class: current
        unit_of_measurement: mA
      - platform: sensor
        friendly_name: SP22 Enzo - Puissance
        id: 19
        scaling: 0.1
        device_class: power
        unit_of_measurement: W
  - host: 192.168.15.174
    device_id: --------------------
    local_key: ----------------
    friendly_name: SP22 Parents
    protocol_version: "3.3"
    entities:
      - platform: switch
        friendly_name: SP22 Parents
        id: 1
      - platform: sensor
        friendly_name: SP22 Parents - Tension
        id: 20
        scaling: 0.1
        device_class: voltage
        unit_of_measurement: V
      - platform: sensor
        friendly_name: SP22 Parents - Courant
        id: 18
        device_class: current
        unit_of_measurement: mA
      - platform: sensor
        friendly_name: SP22 Parents - Puissance
        id: 19
        scaling: 0.1
        device_class: power
        unit_of_measurement: W
  - host: 192.168.15.175
    device_id: --------------------
    local_key: ----------------
    friendly_name: SP22 Socket3
    protocol_version: "3.3"
    entities:
      - platform: switch
        friendly_name: SP22 Socket3
        id: 1
      - platform: sensor
        friendly_name: SP22 Socket3 - Tension
        id: 20
        scaling: 0.1
        device_class: voltage
        unit_of_measurement: V
      - platform: sensor
        friendly_name: SP22 Socket3 - Courant
        id: 18
        device_class: current
        unit_of_measurement: mA
      - platform: sensor
        friendly_name: SP22 Socket3 - Puissance
        id: 19
        scaling: 0.1
        device_class: power
        unit_of_measurement: W
  - host: 192.168.15.176
    device_id: --------------------sh
    local_key: ----------------
    friendly_name: SP22 Socket4
    protocol_version: "3.3"
    entities:
      - platform: switch
        friendly_name: SP22 Socket4
        id: 1
      - platform: sensor
        friendly_name: SP22 Socket4 - Tension
        id: 20
        scaling: 0.1
        device_class: voltage
        unit_of_measurement: V
      - platform: sensor
        friendly_name: SP22 Socket4 - Courant
        id: 18
        device_class: current
        unit_of_measurement: mA
      - platform: sensor
        friendly_name: SP22 Socket4 - Puissance
        id: 19
        scaling: 0.1
        device_class: power
        unit_of_measurement: W

sensor:
  - platform: integration
    source: sensor.sp22_enzo_puissance
    name: sp22_enzo_wattage
    unit_time: h
    round: 1
  - platform: integration
    source: sensor.sp22_parents_puissance
    name: sp22_parents_wattage
    unit_time: h
    round: 1

utility_meter:
  sp22_enzo_energy:
    source: sensor.sp22_enzo_wattage
    cycle: daily
  sp22_parents_energy:
    source: sensor.sp22_parents_wattage
    cycle: daily
```
@postlund Is the YAML configuration helping? An additional remark: when I have the problem of one device not being discovered, or a wrong status reported to HA, a simple reload of the integration makes things OK again.
@ultratoto14 I haven't had the time to look at it yet, will do that soon. The problem isn't that something "hangs" in a way that a profiler would find; it's more that Home Assistant expects the entities to be reported as added OK. But I will look into it.
Hi @postlund, as mentioned earlier I will have less time until the end of the year, but I will try to help as soon as I can. About this PR: I stayed on it for the past two days. I'm sorry to say it, but it doesn't seem to help with re-connection. Devices that are switched off for a while and then back on are randomly discovered; after a while (>15 m) they may all be discovered. In the passive device PR, it works every time (I was not able to get a failing bulb); I'm not sure if the re-connection logic is the same, but it works. As this one is based on time (as master) instead of messages received, it should behave the same once the reconnect is triggered, but it does not. Tell me if I can help in any way.
@ultratoto14 Please do note that this PR is not about discovery speed or anything like it. It is purely about making the re-connection routine more robust (don't mix it up with passive devices, that's something else). What is currently on
However... since there's a cap of five minutes on the backoff time (before re-connecting), it shouldn't take that long for it to re-connect. Do you see any errors in the log?
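The five-minute cap mentioned above is the usual capped exponential backoff pattern. A minimal sketch, assuming a 5-second base delay and a doubling factor (the PR's actual constants may differ):

```python
def next_backoff(attempt: int, base: float = 5.0, cap: float = 300.0) -> float:
    """Delay before re-connect attempt number `attempt` (0-based).

    Doubles each time and is capped at five minutes (300 s), matching
    the cap described in the discussion; `base` is an assumed value.
    """
    return min(cap, base * (2 ** attempt))

# Delays grow 5, 10, 20, ... and then stay pinned at the 300 s cap:
delays = [next_backoff(n) for n in range(8)]
# -> [5.0, 10.0, 20.0, 40.0, 80.0, 160.0, 300.0, 300.0]
```

With such a cap, a device that comes back online should at worst be retried five minutes after its last failed attempt, which is why delays of 25+ minutes point at a different problem.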
@postlund I clearly understand that it's not about discovery time; I just said that some of my bulbs do not reconnect, not only after 5 minutes but after 25 and 35 minutes. I just drew the parallel that the reconnect always works on the passive PR, while here the reconnect does not work every time. About the logs: they are so verbose, as I have plenty of devices, that in the end they make my system fail. I have an exception trace on each failing reconnect, which surely does not help in debugging. I may get back to this PR and try to limit the logs to only one or two devices and see.
Hi, I let it run all day; the ones that you should look at are:
At 18:54 I switched them both on.
They both reconnect on the second try.
So the first connection is closed by a "connection reset by peer"; this may indicate that the device is there but does not answer correctly, or closes the connection. Could we reset the loop counter in that case?
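The suggestion above could be sketched as follows; `should_reset_backoff` is a hypothetical helper for illustration, not part of the PR:

```python
import errno


def should_reset_backoff(exc: BaseException) -> bool:
    """Decide whether a failed connect should restart the backoff from zero.

    A "connection reset by peer" means the device is reachable but dropped
    us, so retrying quickly (as suggested above) seems reasonable; plain
    timeouts keep backing off as before.
    """
    if isinstance(exc, ConnectionResetError):
        return True
    return isinstance(exc, OSError) and exc.errno == errno.ECONNRESET
```

The reconnect loop would then check each caught exception and, when this returns `True`, fall back to the shortest retry delay instead of the next backoff step.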
Hi @postlund, I see more and more issues about re-connection not working or losing status. I also noticed that I lost access to some of my smart sockets during the night, with no error log. I have a spare bulb that I can use to test and focus on. I also have an HA instance running on a Pi that just reports some specific temperature sensors. So if you have ideas about adding a lot of debug logs that may help you, I can use that HA instance with this spare bulb to help you identify the connection problems. What do you think?
Hi @postlund, I believe the code currently in the branch is a net improvement compared to what we had before.
However, my logbook is still full of "device unavailable" entries (lasting a couple of seconds, the time for a timeout / reconnect loop), which would be #136. Anyway, I am keeping this branch running for now. Please @ me if you need to test any further iteration.
Hi @postlund, continuing to check this one. I have a feeling that the problem may be linked to the time between the switch-on and the reconnection. As I already said, the passive device PR always reconnects, and it does so within seconds (<20) after the switch-on. Without `passive_device`, the reconnection may occur up to 5 minutes later. Could it be that between those 20 s and the following minutes the bulb connects to the cloud, preventing the integration from connecting to it? I'll try blocking the bulb's internet access like in #195 and see if it helps.
Sorry @ultratoto14 (and you others) for not following up properly. I do think this PR brings the intended improvements to the table: making sure that we don't get stuck in a state where we don't re-connect at all. Whether the re-connects fail or not is a different story. However... I'm starting to think that we should abandon this and move towards making the "passive device" functionality the default and only implementation. The only downside I can think of is that people with more obscure network configurations will have to make sure broadcasts propagate properly (e.g. between subnets or VLANs). But consider the gains: a much less complex code base, much faster re-connects (likely with better results, as devices only broadcast when they are ready for a connection), and less overhead, since re-connects only occur when needed. I feel that we can probably be OK with that. Any objections?
@postlund, I agree that passive mode can be the default one.
I can already see myself with a violent iptables-induced headache :) I have all my plugs on a separate VLAN / wireless network, given, well, where they call back home. If this passive mode is supported by those Gosund plugs I have, I'm happy to help with some tests and documentation for router configuration (please do me a favour and keep this current branch as-is in git for a while).
@jfparis All Tuya devices broadcast their presence every five seconds on UDP port 6668 on the local network (when no one is connected). The "passive" thing is just something I came up with, referring to localtuya passively listening for broadcasts before connecting instead of actively trying to connect.
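Observing those broadcasts can be sketched with a plain UDP socket. This is illustrative only: real protocol-3.3 broadcasts carry encrypted payloads that localtuya decodes further, and the function and parameter names here are made up:

```python
import socket


def collect_broadcast_sources(timeout: float = 10.0, port: int = 6668) -> set:
    """Passively collect source addresses of UDP broadcasts on the Tuya
    discovery port until `timeout` seconds pass with no packet.

    Sketch only: the payloads would still need to be decrypted/parsed
    to extract device ids and capabilities.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("", port))          # listen on all interfaces
    sock.settimeout(timeout)
    seen = set()
    try:
        while True:
            _, (addr, _) = sock.recvfrom(4096)
            seen.add(addr)         # remember which host announced itself
    except socket.timeout:
        pass                       # quiet for `timeout` seconds: stop
    finally:
        sock.close()
    return seen
```

Running this for ten seconds on a network with Tuya devices should list every idle device, since each one broadcasts roughly every five seconds.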
All right, so this would come down to forwarding the broadcast from the IoT VLAN to the Home Assistant host. That should not be too much brain damage.
Exactly, a simple rule for that port should be sufficient.
I'll leave this PR open for now, but let's call it dead and focus on the passive device support. There are a few fixes I need to port over and a bunch of code to remove. Keep an eye on #171, as I will need testers.
Just FYI, this might be linked to #226. I tried to close the connection directly at the router and I can confirm that it tries to connect every minute or so, flooding with DNS requests, and it doesn't help the network dropouts; maybe it's even worse than before. However, I didn't forward the port on the router directly to my RPi4 with HA. Might be worth a try, but I'm not really sure what it will do ^^'
@postlund Not sure whether this message belongs here or in the passive PR, but here it goes:
cheers |
@jfparis I hear you, but I would still prefer not to have to maintain this code. The address is part of the discovery message, so what's in the IP header doesn't matter. If it helps, I can develop a small script you can run on the network where your devices are that forwards discovery messages to another IP. I guess it should work to just point it at Home Assistant, assuming no firewall blocks the packets.
Let's see how it all works (on the passive branch). I was planning to do that precise script over the holiday (will aim to make it nice and documented, assuming it can be useful to others).
@jfparis Sounds perfect! Feel free to submit a PR when you are done. Preferably, create a directory called
This change fixes a few problems with the current re-connect loop, where no re-connect attempts would be made. This usually happens if a device is already in use by someone else (e.g. the app). It also fixes a problem where the heartbeater task would not quit, yielding those pesky
`AttributeError: 'NoneType' object has no attribute 'write'`
errors. I have only tested it myself, but I would like to get feedback from at least someone else. Gonna ask for help in #160.
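The heartbeat fix described above can be sketched roughly like this (illustrative asyncio code, not localtuya's actual classes): the task is cancelled on disconnect, and the writer is checked before every write, so a torn-down transport can no longer raise the `AttributeError`:

```python
import asyncio


class Heartbeater:
    """Hypothetical sketch: a periodic keep-alive that shuts down cleanly."""

    def __init__(self, interval: float = 10.0):
        self.interval = interval
        self.writer = None
        self._task = None

    def start(self, writer) -> None:
        # Must be called while an event loop is running.
        self.writer = writer
        self._task = asyncio.ensure_future(self._beat())

    async def _beat(self) -> None:
        # Guard on self.writer so we never call .write() on None after
        # the transport has been torn down.
        while self.writer is not None:
            self.writer.write(b"ping")
            await asyncio.sleep(self.interval)

    async def stop(self) -> None:
        self.writer = None
        if self._task is not None:
            self._task.cancel()  # make sure the task actually quits
            try:
                await self._task
            except asyncio.CancelledError:
                pass
            self._task = None
```

The key points are the two halves of `stop()`: clearing the writer reference stops further writes immediately, and cancelling (then awaiting) the task guarantees it does not linger after a disconnect.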