-
-
Notifications
You must be signed in to change notification settings - Fork 31.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Terminate scripts with until and while conditions that execute more than 10000 times #115110
Conversation
…than 100000 times - warns at 10000 - terminates at 100000 2024-04-07 02:37:57.336 WARNING (MainThread) [homeassistant.helpers.script] Until condition [{condition: template, value_template: Template<template=({{1 != 1}}) renders=20000>}] in script `loop` is looping more than 10000 times 2024-04-07 02:38:11.902 CRITICAL (MainThread) [homeassistant.helpers.script] Until condition [{condition: template, value_template: Template<template=({{1 != 1}}) renders=200000>}] in script `loop` terminated because it looping more than 100000 times fixes #115042
almost 0400 here so will have to write tests tomorrow |
100000 is likely too high. I can still get a blue to crash before the protection kicks in |
It also depends on what is done in each step |
Yes, and the amount of ram in the system. 100000 was fine because I have ~16GB of ram in my test env, but the blue only has ~4GB |
The system gets unstable at 27000 iterations and hard crashes at Apr 07 18:47:15 homeassistant homeassistant[512]: 2024-04-07 08:46:11.930 INFO (MainThread) [homeassistant.components.automation.loop] loop: Repeating sequence: Iteration 54246 |
Its also having trouble with the trace being so large. I dropped it to a 25000 cap |
Now the system having trouble coming back up because traces are too large
|
25000 is enough to keep the system stable if they only run it once. But since we store multiple traces, if it runs again, we still end up in bad place |
system crashes on shutdown writing traces now |
10000 nodes seems to be the most a trace can handle if we save 5x of them |
with 10000, the traces only get to
|
56.2M still seems like a lot for 5 traces, but it does stay stable on the blue now |
Even 10000 iterations seems like its pushing the limits of the system design, it does seem to be OK. |
Look like a nice limit for a breaking change |
Usually, we don't tag breaking changes for backport but since the impact is the system hanging up and it does so ~153x faster in 2024.4.x, the condition can't become true soon enough to avoid the hang most of the time like it could in earlier versions. This one seems worth making an exception for given the result is a hang of the full system. If someone disagrees feel free to untag the milestone. Removing the milestone could also be justified since the problem could still happen on older versions, but was much less likely. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM,
Thnx @bdraco 👍
thanks |
make it configurable + a default value |
Usually, we don't tag breaking changes for backport but since the impact is the system hanging up and it does so ~153x faster in 2024.4.x, the condition can't become true soon enough to avoid the hang most of the time like it could in earlier versions.
This one seems worth making an exception for given the result is a hang of the full system. If someone disagrees feel free to untag the milestone. Removing the milestone could also be justified since the problem could still happen on older versions, but was much less likely.
Breaking change
To prevent the system from locking up when an automation is looping, there is now a hard limit of 10000 loops for repeat
until
andwhile
conditions in scripts and automations. A warning will be logged when the loop reaches 5000 iterations.Proposed change
TLDR: Polling loops (usually without a delay) in automations can hang the system. Previously these loops would execute a lot slower before 2024.4.x and users relied on them finishing in time before the system ran out of memory.
Extremely high numbers were chosen here to minimize the number of use cases affected.
It is expected that this will cause some automation to stop working because we have cases where users expect their automation to check the condition in a loop for many iterations, even if it is consuming nearly all of the system's CPU time. Affected automations would likely crash the system anyway, so the impact should be minimal.
Additionally, we now yield to the event loop between iterations of
until
andwhile
conditions to keep the system responsive while all the CPU time is being consumed and give the user a chance to terminate the script themselves. Yielding to the loop has the unfortunate side effect of making alluntil
andwhile
conditions more latent but is limited to scripts with these conditions.These loop designs are typically created because the automation author copied the code from somewhere or needed to learn how to write better automation. They didn't expect the system to lock up because the automation design could have been better.
Ideally, these types of automation should be replaced with
wait_template
instead, as it generally will not consume all the CPU time while waiting for a condition to be true. However, we need to ensure the system remains stable even with a less-than-ideal automation design since blueprints exist that use loops to poll conditions instead.The python equivalent to what these script are doing is:
2024-04-07 02:37:57.336 WARNING (MainThread) [homeassistant.helpers.script] Until condition [{condition: template, value_template: Template<template=({{1 != 1}}) renders=20000>}] in script
loopis looping more than 10000 times
2024-04-07 02:38:11.902 CRITICAL (MainThread) [homeassistant.helpers.script] Until condition [{condition: template, value_template: Template<template=({{1 != 1}}) renders=200000>}] in script
loopterminated because it looping more than 100000 times
Type of change
Additional information
Checklist
ruff format homeassistant tests
)If user exposed functionality or configuration variables are added/changed:
If the code communicates with devices, web services, or third-party tools:
Updated and included derived files by running:
python3 -m script.hassfest
.requirements_all.txt
.Updated by running
python3 -m script.gen_requirements_all
..coveragerc
.To help with the load of incoming pull requests: