Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Terminate scripts with until and while conditions that execute more than 10000 times #115110

Merged
merged 14 commits into from
Apr 7, 2024

Conversation

bdraco
Copy link
Member

@bdraco bdraco commented Apr 7, 2024

Usually, we don't tag breaking changes for backport but since the impact is the system hanging up and it does so ~153x faster in 2024.4.x, the condition can't become true soon enough to avoid the hang most of the time like it could in earlier versions.

This one seems worth making an exception for given the result is a hang of the full system. If someone disagrees feel free to untag the milestone. Removing the milestone could also be justified since the problem could still happen on older versions, but was much less likely.

Breaking change

To prevent the system from locking up when an automation is looping, there is now a hard limit of 10000 loops for repeat until and while conditions in scripts and automations. A warning will be logged when the loop reaches 5000 iterations.

Proposed change

TLDR: Polling loops (usually without a delay) in automations can hang the system. Previously these loops would execute a lot slower before 2024.4.x and users relied on them finishing in time before the system ran out of memory.

Extremely high numbers were chosen here to minimize the number of use cases affected.

It is expected that this will cause some automation to stop working because we have cases where users expect their automation to check the condition in a loop for many iterations, even if it is consuming nearly all of the system's CPU time. Affected automations would likely crash the system anyway, so the impact should be minimal.

Additionally, we now yield to the event loop between iterations of until and while conditions to keep the system responsive while all the CPU time is being consumed and give the user a chance to terminate the script themselves. Yielding to the loop has the unfortunate side effect of making all until and while conditions more latent but is limited to scripts with these conditions.

These loop designs are typically created because the automation author copied the code from somewhere or needed to learn how to write better automation. They didn't expect the system to lock up because the automation design could have been better.

Ideally, these types of automation should be replaced with wait_template instead, as it generally will not consume all the CPU time while waiting for a condition to be true. However, we need to ensure the system remains stable even with a less-than-ideal automation design since blueprints exist that use loops to poll conditions instead.

The python equivalent to what these script are doing is:

while True:
  if switch.is_on():
    break

2024-04-07 02:37:57.336 WARNING (MainThread) [homeassistant.helpers.script] Until condition [{condition: template, value_template: Template<template=({{1 != 1}}) renders=20000>}] in script loop is looping more than 10000 times
2024-04-07 02:38:11.902 CRITICAL (MainThread) [homeassistant.helpers.script] Until condition [{condition: template, value_template: Template<template=({{1 != 1}}) renders=200000>}] in script loop terminated because it looping more than 100000 times

Type of change

  • Dependency upgrade
  • Bugfix (non-breaking change which fixes an issue)
  • New integration (thank you!)
  • New feature (which adds functionality to an existing integration)
  • Deprecation (breaking change to happen in the future)
  • Breaking change (fix/feature causing existing functionality to break)
  • Code quality improvements to existing code or addition of tests

Additional information

Checklist

  • The code change is tested and works locally.
  • Local tests pass. Your PR cannot be merged unless tests pass
  • There is no commented out code in this PR.
  • I have followed the development checklist
  • I have followed the perfect PR recommendations
  • The code has been formatted using Ruff (ruff format homeassistant tests)
  • Tests have been added to verify that the new code works.

If user exposed functionality or configuration variables are added/changed:

If the code communicates with devices, web services, or third-party tools:

  • The manifest file has all fields filled out correctly.
    Updated and included derived files by running: python3 -m script.hassfest.
  • New or updated dependencies have been added to requirements_all.txt.
    Updated by running python3 -m script.gen_requirements_all.
  • For the updated dependencies - a link to the changelog, or at minimum a diff between library versions is added to the PR description.
  • Untested files have been added to .coveragerc.

To help with the load of incoming pull requests:

…than 100000 times

- warns at 10000
- terminates at 100000

2024-04-07 02:37:57.336 WARNING (MainThread) [homeassistant.helpers.script] Until condition [{condition: template, value_template: Template<template=({{1 != 1}}) renders=20000>}] in script `loop` is looping more than 10000 times
2024-04-07 02:38:11.902 CRITICAL (MainThread) [homeassistant.helpers.script] Until condition [{condition: template, value_template: Template<template=({{1 != 1}}) renders=200000>}] in script `loop` terminated because it looping more than 100000 times

fixes #115042
homeassistant/helpers/script.py Outdated Show resolved Hide resolved
homeassistant/helpers/script.py Outdated Show resolved Hide resolved
@bdraco
Copy link
Member Author

bdraco commented Apr 7, 2024

almost 0400 here so will have to write tests tomorrow

@bdraco
Copy link
Member Author

bdraco commented Apr 7, 2024

100000 is likely too high. I can still get a blue to crash before the protection kicks in

@jbouwh
Copy link
Contributor

jbouwh commented Apr 7, 2024

100000 is likely too high. I can still get a blue to crash before the protection kicks in

It also depends on what is done in each step

@bdraco
Copy link
Member Author

bdraco commented Apr 7, 2024

100000 is likely too high. I can still get a blue to crash before the protection kicks in

It also depends on what is done in each step

Yes, and the amount of ram in the system. 100000 was fine because I have ~16GB of ram in my test env, but the blue only has ~4GB

@bdraco
Copy link
Member Author

bdraco commented Apr 7, 2024

The system gets unstable at 27000 iterations and hard crashes at

Apr 07 18:47:15 homeassistant homeassistant[512]: 2024-04-07 08:46:11.930 INFO (MainThread) [homeassistant.components.automation.loop] loop: Repeating sequence: Iteration 54246

@bdraco bdraco changed the title Terminate until and while conditions in scripts if they execute more than 100000 times Terminate until and while conditions in scripts if they execute more than 25000 times Apr 7, 2024
@bdraco
Copy link
Member Author

bdraco commented Apr 7, 2024

Its also having trouble with the trace being so large.

I dropped it to a 25000 cap

@bdraco
Copy link
Member Author

bdraco commented Apr 7, 2024

Now the system having trouble coming back up because traces are too large

homeassistant:/config/.storage# du -sch trace.saved_traces 
112.4M	trace.saved_traces
112.4M	total

@bdraco
Copy link
Member Author

bdraco commented Apr 7, 2024

25000 is enough to keep the system stable if they only run it once. But since we store multiple traces, if it runs again, we still end up in bad place

@bdraco
Copy link
Member Author

bdraco commented Apr 7, 2024

system crashes on shutdown writing traces now

@bdraco bdraco changed the title Terminate until and while conditions in scripts if they execute more than 25000 times Terminate until and while conditions in scripts if they execute more than 10000 times Apr 7, 2024
@bdraco
Copy link
Member Author

bdraco commented Apr 7, 2024

10000 nodes seems to be the most a trace can handle if we save 5x of them

@bdraco
Copy link
Member Author

bdraco commented Apr 7, 2024

with 10000, the traces only get to

56.2M	trace.saved_traces
56.2M	total

@bdraco
Copy link
Member Author

bdraco commented Apr 7, 2024

56.2M still seems like a lot for 5 traces, but it does stay stable on the blue now

@bdraco
Copy link
Member Author

bdraco commented Apr 7, 2024

Even 10000 iterations seems like its pushing the limits of the system design, it does seem to be OK.

@jbouwh
Copy link
Contributor

jbouwh commented Apr 7, 2024

Even 10000 iterations seems like its pushing the limits of the system design, it does seem to be OK.

Look like a nice limit for a breaking change

@bdraco
Copy link
Member Author

bdraco commented Apr 7, 2024

Usually, we don't tag breaking changes for backport but since the impact is the system hanging up and it does so ~153x faster in 2024.4.x, the condition can't become true soon enough to avoid the hang most of the time like it could in earlier versions.

This one seems worth making an exception for given the result is a hang of the full system. If someone disagrees feel free to untag the milestone. Removing the milestone could also be justified since the problem could still happen on older versions, but was much less likely.

@bdraco bdraco added this to the 2024.4.2 milestone Apr 7, 2024
@bdraco bdraco changed the title Terminate until and while conditions in scripts if they execute more than 10000 times Terminate scripts with until and while conditions that execute more than 10000 times Apr 7, 2024
Copy link
Contributor

@jbouwh jbouwh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM,
Thnx @bdraco 👍

@bdraco bdraco marked this pull request as ready for review April 7, 2024 21:02
@bdraco bdraco requested a review from a team as a code owner April 7, 2024 21:02
@bdraco
Copy link
Member Author

bdraco commented Apr 7, 2024

thanks

@bdraco bdraco merged commit 569f54d into dev Apr 7, 2024
38 checks passed
@bdraco bdraco deleted the terminate_until_while_more_than_100000 branch April 7, 2024 21:02
frenck pushed a commit that referenced this pull request Apr 8, 2024
@frenck frenck mentioned this pull request Apr 8, 2024
@50494554524F
Copy link

make it configurable + a default value

@joostlek
Copy link
Member

joostlek commented Apr 8, 2024

workflow.png

@github-actions github-actions bot locked and limited conversation to collaborators Apr 9, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Infinite loop in automation/script repeat causes Home Assistant to freeze
5 participants