[BUG] v3005rc2: Highstate slows to a crawl/never terminates with state_aggregate enabled #62439

Open
1 of 6 tasks
frebib opened this issue Aug 8, 2022 · 14 comments · Fixed by #62529
Labels
Bug (broken, incorrect, or confusing behavior) · info-needed (waiting for more info) · Performance · Regression (the issue is a bug that breaks functionality known to work in previous releases)

Comments

@frebib
Contributor

frebib commented Aug 8, 2022

Description
Highstate on v3005rc2 is marginally slower than on v3004, but still within an acceptable range (I measured roughly 5% slower). Adding state_aggregate: true to the minion config causes the highstate to freeze for minutes or hours between state calls:

[INFO    ] The service foo is already running
[INFO    ] Completed state [foo] at time 11:02:59.918796 (duration_in_ms=109.407)
[DEBUG   ] Could not LazyLoad service.mod_aggregate: 'service.mod_aggregate' is not available.
[INFO    ] Running state [bar.service] at time 11:06:11.907213
[DEBUG   ] Could not LazyLoad file.mod_aggregate: 'file.mod_aggregate' is not available.
[INFO    ] Running state [/etc/bar/baz.conf] at time 11:07:52.187364

(Logs abbreviated and redacted.) Note the timestamps. Later on I saw hangs lasting at least 3 hours, at which point I gave up and killed the job to retrieve the profiling data.

Setup
Our highstate contains ~3.5k states and makes heavy use of various requisites to enforce correctness and ordering, because Salt makes little attempt to do that for us. This is a "standard" minion/master setup with highstates invoked via salt-call state.highstate. Running with -l all yields nothing more useful than -l debug.

  • on-prem machine
  • VM (Virtualbox, KVM, etc. please specify)
  • VM running on a cloud service, please be explicit and add details
  • container (Kubernetes, Docker, containerd, etc. please specify)
  • or a combination, please be explicit
  • jails if it is FreeBSD

Steps to Reproduce the behavior
Enable state_aggregate: true in a sufficiently large codebase.
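
To confirm the option is actually being picked up, here is a quick hypothetical sanity check using Salt's Python API; it assumes a local install and the default /etc/salt/minion config path:

# Hypothetical check that the minion config really has the option enabled.
import salt.config

opts = salt.config.minion_config("/etc/salt/minion")
print("state_aggregate:", opts.get("state_aggregate", False))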

Expected behavior
No freezing


Versions Report

salt --versions-report
Salt Version:
          Salt: 3005rc2

Dependency Versions:
          cffi: 1.14.5
      dateutil: 2.8.1
        Jinja2: 2.11.3
      M2Crypto: 0.38.0
       msgpack: 1.0.0
     pycparser: 2.20
  pycryptodome: 3.9.7
        Python: 3.9.2 (default, Feb 28 2021, 17:03:44)
        PyYAML: 5.3.1
         PyZMQ: 20.0.0
       Tornado: 4.5.3
           ZMQ: 4.3.4

System Versions:
          dist: debian 11 bullseye
        locale: utf-8
       machine: x86_64
       release: 5.15.54-cloudflare-2022.7.5
        system: Linux
       version: Debian GNU/Linux 11 bullseye

Additional context
I suspect that this is caused by 8168b25 and exacerbated by #59123.
A profile of the running minion job process shows the overwhelming majority of CPU time (95%+) spent in check_requisite, which makes me think it's stuck in an infinite loop, or something similar.
[Screenshot: profile of the minion job process showing the vast majority of CPU time in check_requisite]

I'm speculating that this could have always been broken and it's only now apparent since state_aggregate was fixed.

frebib added the Bug (broken, incorrect, or confusing behavior) and needs-triage labels Aug 8, 2022
@garethgreenaway garethgreenaway self-assigned this Aug 8, 2022
@garethgreenaway
Contributor

@frebib Thanks for the report. Can you provide a small subset of example states where you're seeing this behavior? Thanks!

@OrangeDog
Contributor

Did it do the same in v3004, or is this new behaviour?

OrangeDog added the info-needed (waiting for more info) label Aug 9, 2022
@Ch3LL Ch3LL added this to the Phosphorus v3005.0 milestone Aug 10, 2022
@frebib
Contributor Author

frebib commented Aug 12, 2022

@garethgreenaway I'm not sure this is related to any specific SLS configuration, but I will try to dig out a short series of states that execute in our state graph around a point where it freezes. My suspicion is that the considerable use of requisites is the cause, given my experience with the state compiler and what the profile output shows. I'd expect that this can be reproduced on any sizeable state graph.

@OrangeDog I'll try backporting c9a6217 to v3004 and see if it reproduces. As it stands I have that option enabled on a handful of v3004 minions with no detriment, but I also noticed zero improvement/change when enabling it, so it's still plausible that it's broken in 3004. I'll report back.

OrangeDog added the Regression (the issue is a bug that breaks functionality known to work in previous releases) and Performance labels Aug 12, 2022
@frebib
Contributor Author

frebib commented Aug 12, 2022

It seems that v3004 is unaffected, so this must be a regression somehow. I am seeing [ERROR ] Recursive requisite found when enabling state_aggregate: true, even on 3004 (with c9a6217 applied), so I'll try to debug that further and see if it's related, or possibly the cause of this.

@frebib
Contributor Author

frebib commented Aug 12, 2022

Here is a simple reproducer that causes the Recursive requisite found error on 3004 with the backport applied:

local:
----------
          ID: git
    Function: pkg.installed
      Result: False
     Comment: Recursive requisite found
     Changes:
----------
          ID: /tmp/foo
    Function: file.managed
      Result: False
     Comment: One or more requisite failed: test.sls.git
     Started: 09:14:09.194697
    Duration: 0.006 ms
     Changes:
----------
          ID: less
    Function: pkg.installed
      Result: False
     Comment: One or more requisite failed: test.sls./tmp/foo
     Started: 09:14:09.194833
    Duration: 0.002 ms
     Changes:

Summary for local
------------
Succeeded: 0
Failed:    3
------------
Total states run:     3
Total run time:   0.008 ms
jgroocock@133m11:~$ cat test.sls
/tmp/foo:
  file.managed:
    - content: This is some content

less:
  pkg.installed:
    - require:
      - file: /tmp/foo
git:
  pkg.installed:
    - require_in:
      - file: /tmp/foo

I guess this is either a bug in the implementation of the pkg mod_aggregate or (more likely IMHO) a fundamental flaw in the mod_aggregate system.

That doesn't seem to cause 3005 to freak out and spin forever though, so it's possibly not the cause of this particular bug.
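
For context, here is a rough, simplified model of what a pkg-style mod_aggregate does. This is illustrative only and is not the actual salt/states/pkg.py implementation:

def mod_aggregate(low, chunks, running):
    # Simplified sketch: fold package names from other pending pkg.installed
    # chunks into this low chunk, and mark those chunks as aggregated so they
    # are skipped later. Illustrative only, not Salt's real code.
    if low.get("fun") != "installed":
        return low
    pkgs = [low["name"]]
    for chunk in chunks:
        if chunk is low or chunk.get("__agg__"):
            continue
        if chunk.get("state") == "pkg" and chunk.get("fun") == "installed":
            pkgs.append(chunk["name"])
            chunk["__agg__"] = True  # this chunk becomes a no-op later
    low["pkgs"] = sorted(set(pkgs))
    return low

The requisite merging itself happens separately in the state compiler (_aggregate_requisites, visible in the patch later in this thread), but the net effect is that the one surviving pkg chunk carries the requisites of every pkg state it swallowed, which is how a require and a require_in on different packages can end up attached to the same node.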

@MKLeb
Contributor

MKLeb commented Aug 16, 2022

Hi @frebib, I've been looking at this. I think the test.sls you gave is expected to fail due to recursive requisites, since your pkg.installed states will be aggregated to both require /tmp/foo and be required in /tmp/foo, if I'm understanding correctly.

As for your original report, were you able to gather a small collection of states that reproduces the issue? We are hoping to re-tag with whatever fixes are required for this by tomorrow.

@frebib
Contributor Author

frebib commented Aug 17, 2022

due to recursive requisites since your pkg.installed states will be aggregated to both require /tmp/foo and be required in /tmp/foo

This is precisely the problem my example is trying to highlight. Without aggregation, the produced state graph is a correct DAG: pkg(git) -> file(/tmp/foo) -> pkg(less). The one thing that aggregating states should never do is modify the graph such that it is no longer acyclic (https://en.wikipedia.org/wiki/Cycle_graph#Directed_cycle_graph), because a cycle makes it impossible to find a start node (one with no incoming edges, https://en.wikipedia.org/wiki/Directed_acyclic_graph#Topological_sorting_and_recognition) from which to begin traversal/execution.
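
To make the cycle concrete, here is a minimal sketch (hypothetical, using Python's standard-library graphlib rather than anything in Salt) showing that the original requisite graph sorts fine, while merging the two pkg states into one aggregated node produces a cycle:

from graphlib import TopologicalSorter, CycleError

def sort_or_report(graph):
    # graph maps each node to the set of nodes it requires (its predecessors)
    try:
        return list(TopologicalSorter(graph).static_order())
    except CycleError as exc:
        return f"recursive requisite (cycle): {exc.args[1]}"

# Original graph: pkg(git) -> file(/tmp/foo) -> pkg(less)
original = {
    "file:/tmp/foo": {"pkg:git"},   # git is require_in /tmp/foo
    "pkg:less": {"file:/tmp/foo"},  # less requires /tmp/foo
}
print(sort_or_report(original))     # valid order: git, /tmp/foo, less

# After aggregation, git and less collapse into one pkg node that both
# requires /tmp/foo and is required by /tmp/foo.
aggregated = {
    "file:/tmp/foo": {"pkg:aggregated"},
    "pkg:aggregated": {"file:/tmp/foo"},
}
print(sort_or_report(aggregated))   # reports the cycle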

I'm speculating that this phenomenon/fundamental flaw in the aggregation system is possibly the cause of the issue described in this bug report, due to cycles in the state graph that the compiler doesn't catch. My example above isn't complex enough to trigger the issue because of the trivial size of its state graph, but a considerably larger one, say ~3500 states with tens of thousands of requisites (and lots of pkg states), may be complex enough to get the state compiler stuck either in an infinite loop, or in one that won't complete in any reasonable amount of human time. This is purely speculation, but given the (at this point) well-understood shortcomings of the way Salt handles requisites (#59123), and the profile screenshot above showing the CPU time spent in check_requisite, I'm suggesting the two are possibly related and caused by the same area of code.

Inspecting the aggregated high (what does that even mean?) object for an aggregated pkg state in my highstate shows that it adds many recursive requisites with huge loops spanning many states, right at the time the highstate locks up.
Because our highstate contains a lot of internal and sensitive/confidential data, it's not feasible to share it, and it would take weeks of work to unpick.
I can, however, say that I'm pretty sure this should be easy enough to reproduce with:

  • a lot of states (1000+), including pkg.installed, file.managed, service.running, cmd.run which make up the vast majority of our highstate.
  • an overwhelmingly large number of requisites to build a sufficiently complex DAG. We use these mostly to enforce ordering, hence the large number. (A hypothetical generator along these lines is sketched after this list.)
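
Something along these lines might approximate that shape; this is a hypothetical generator for a synthetic reproducer, not our actual states, and all names in it are invented:

# Generate a large synthetic SLS: many pkg.installed and file.managed states
# wired together with require/require_in, so that pkg aggregation has plenty
# of requisites to fold together. Hypothetical; tune N upwards as needed.
N = 1000

lines = []
for i in range(N):
    lines += [
        f"repro-pkg-{i}:",
        "  pkg.installed:",
        f"    - name: dummy-package-{i}",
        "",
        f"/tmp/agg-repro/file-{i}.conf:",
        "  file.managed:",
        f"    - contents: 'generated file {i}'",
        "    - makedirs: True",
        "    - require:",
        f"      - pkg: repro-pkg-{i}",
    ]
    if i:
        # chain each file to the previous one to densify the requisite graph
        lines.append(f"      - file: /tmp/agg-repro/file-{i - 1}.conf")
    if i + 1 < N:
        # mix in require_in so aggregation folds requisites in both directions
        lines += [
            "    - require_in:",
            f"      - pkg: repro-pkg-{i + 1}",
        ]
    lines.append("")

with open("agg_repro.sls", "w") as fh:
    fh.write("\n".join(lines) + "\n")

Applying the generated file with state_aggregate: true in the minion config (and test=True, since the packages don't exist) should, in theory, show whether the pauses scale with the number of states and requisites.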

The behaviour exhibited around the stalls is identical to that in #59123, which we see normally (in 2019.2.x and 3004.x, without state_aggregate enabled), but for much, much longer. Here's an example from the logs (at -l all!):

Expand for some not very interesting logs:
10:20:47,861 [INFO ][743293] Executing command systemctl in directory '/root'
10:20:47,892 [DEBUG ][743293] stdout: * auditd.service - Security Auditing Service Loaded: loaded (/etc/systemd/system/auditd.service; enabled; vendor preset: enabled)
10:20:47,894 [INFO ][743293] Executing command systemctl in directory '/root'
10:20:47,923 [DEBUG ][743293] stdout: active
10:20:47,925 [INFO ][743293] Executing command systemctl in directory '/root'
10:20:47,951 [DEBUG ][743293] stdout: enabled
10:20:47,951 [INFO ][743293] The service auditd is already running
10:20:47,952 [INFO ][743293] Completed state [auditd] at time 10:20:47.952219 (duration_in_ms=94.282)

!! Note the ~7 minute pause here

10:28:03,221 [DEBUG ][743293] Could not LazyLoad service.mod_aggregate: 'service.mod_aggregate' is not available.
10:28:03,222 [INFO ][743293] Running state [redacted.service] at time 10:28:03.221996
10:28:03,222 [INFO ][743293] Executing state service.dead for [redacted.service]
10:28:03,224 [INFO ][743293] Executing command systemctl in directory '/root'
10:28:03,253 [DEBUG ][743293] stdout: Unit redacted.service could not be found.
10:28:03,254 [DEBUG ][743293] retcode: 4
10:28:03,254 [INFO ][743293] The named service redacted.service is not available
10:28:03,255 [INFO ][743293] Completed state [redacted.service] at time 10:28:03.255152 (duration_in_ms=33.155)

!! Note the ~4 minute pause here

10:32:02,744 [DEBUG ][743293] Could not LazyLoad file.mod_aggregate: 'file.mod_aggregate' is not available.
10:32:02,745 [INFO ][743293] Running state [/etc/audit/plugins.d/redacted.conf] at time 10:32:02.745108
<snip.. things happen for a while>
10:32:04,284 [INFO ][743293] The service systemd-timesyncd is already dead
10:32:04,285 [INFO ][743293] Completed state [systemd-timesyncd] at time 10:32:04.285189 (duration_in_ms=93.596)
10:32:04,297 [DEBUG ][743293] Could not LazyLoad service.mod_aggregate: 'service.mod_aggregate' is not available.
10:32:04,297 [INFO ][743293] Running state [ntp] at time 10:32:04.297888
10:32:04,298 [INFO ][743293] Executing state service.dead for [ntp]
10:32:04,299 [INFO ][743293] Executing command systemctl in directory '/root'
10:32:04,322 [DEBUG ][743293] stdout: Unit ntp.service could not be found.
10:32:04,323 [DEBUG ][743293] retcode: 4
10:32:04,323 [INFO ][743293] The named service ntp is not available
10:32:04,324 [INFO ][743293] Completed state [ntp] at time 10:32:04.324223 (duration_in_ms=26.333)

!! Note ~64 minute(!!!!) pause here

11:36:10,200 [salt.utils.lazy :102 ][DEBUG ][743293] Could not LazyLoad file.mod_aggregate: 'file.mod_aggregate' is not available.
11:36:10,201 [salt.state :2159][INFO ][743293] Running state [/var/log/chrony] at time 11:36:10.201216
11:36:10,202 [salt.state :2190][INFO ][743293] Executing state file.absent for [/var/log/chrony]

My final thoughts here are that

  • It's not going to be possible to fix this before the final 3005 release; it's part of a much larger architectural problem in the state compiler.
  • state_aggregate doesn't work currently, so I'm going to leave it disabled and carry on.
  • State aggregation should happen after generating the highstate but before state execution starts, so that recursion checks and other sanity checks can be performed (as should requisite calculation). A sketch of that ordering follows this list.
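
A sketch of the ordering that last bullet argues for, with invented names (this is not Salt's current call flow):

def run_highstate(chunks, aggregators):
    # 1. Aggregation pass over the complete chunk list, before anything runs.
    aggregated = []
    for chunk in chunks:
        agg = aggregators.get(chunk["state"])
        aggregated.append(agg(chunk, chunks) if agg else chunk)
    # 2. Sanity pass: the aggregated requisite graph must still be a DAG.
    cycle = find_requisite_cycle(aggregated)       # hypothetical helper
    if cycle:
        raise RecursiveRequisiteError(cycle)       # hypothetical error type
    # 3. Only then execute, in dependency order.
    for chunk in topological_order(aggregated):    # hypothetical helper
        execute(chunk)                             # hypothetical executor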

@garethgreenaway
Contributor

I was able to spend a bit of time on this one and was able to reproduce the issue with the small example state. At first the exhibited behavior of producing an error about a recursive requisite seemed correct, until I started to dig in more and realized it wasn't actually a recursive error. The issue seems to be the order in which things happen when handling aggregation; using the small example state with the following patch, the states run in the desired manner:

diff --git a/salt/state.py b/salt/state.py
index 65cb7a64d7..ce206bea6b 100644
--- a/salt/state.py
+++ b/salt/state.py
@@ -864,7 +864,7 @@ class State:
             # the state function in the low we're looking at
             # and __agg__ is True, add the requisites from the
             # chunk to those in the low.
-            if chunk["state"] == low["state"] and chunk.get("__agg__"):
+            if chunk["state"] == low["state"] and chunk.get("__agg__") and low["name"] != chunk["name"]:
                 for req in frozenset.union(
                     *[STATE_REQUISITE_KEYWORDS, STATE_REQUISITE_IN_KEYWORDS]
                 ):
@@ -891,8 +891,8 @@ class State:
             agg_fun = "{}.mod_aggregate".format(low["state"])
             if agg_fun in self.states:
                 try:
-                    low = self.states[agg_fun](low, chunks, running)
                     low = self._aggregate_requisites(low, chunks)
+                    low = self.states[agg_fun](low, chunks, running)
                     low["__agg__"] = True
                 except TypeError:
                     log.error("Failed to execute aggregate for state %s", low["state"])

After this patch, any requisites for states that will be affected by aggregation are calculated first, and then the aggregation for the particular state is run. The patch above also ignores the requisite check if the chunk is the same state as the low that was passed in. Because of the aggregation that happens with the pkg.installed state, the less package will be installed before the file.managed state runs.
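
As a rough mental model of the requisite-merge step with the new guard, here is a simplified sketch (the real logic is State._aggregate_requisites in salt/state.py and handles more requisite keywords and edge cases):

REQUISITE_KEYWORDS = {"require", "require_in", "watch", "watch_in",
                      "onchanges", "onchanges_in", "onfail", "onfail_in"}

def aggregate_requisites(low, chunks):
    # Simplified sketch: merge requisites from already-aggregated chunks of
    # the same state type into this low, skipping the low itself (the name
    # guard added by the patch above). Illustrative only.
    merged = dict(low)
    for chunk in chunks:
        if (chunk["state"] == low["state"]
                and chunk.get("__agg__")
                and chunk["name"] != low["name"]):
            for req in REQUISITE_KEYWORDS:
                for target in chunk.get(req, []):
                    merged.setdefault(req, [])
                    if target not in merged[req]:
                        merged[req].append(target)
    return merged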

@frebib
Contributor Author

frebib commented Aug 20, 2022

I can confirm that the patch above certainly improves the situation, reducing the 64-minute pause after the ntp.service state to just 3 minutes. Whilst still not ideal, that's near or within the realm of usability.

Aug 20 10:36:20 131dm3 salt-highstate[2182799]: [INFO    ] Completed state [ntp] at time 10:36:20.167489 (duration_in_ms=25.827)
<long (but shorter than before) pause here>
Aug 20 10:39:32 131dm3 salt-highstate[2182799]: [INFO    ] Running state [/var/log/chrony] at time 10:39:32.226601

I'll test this some more, and assuming I see no weirdness/brokenness, I'll carry this patch on our v3005rc2 branch and re-enable state_aggregate on some nodes as a longer test.
Thanks for digging into this @garethgreenaway

Edit: The above was from a different node type than the one where we see the worst impact. I just checked that one and there is an improvement, but it's not as good: 64 minutes down to 10 minutes.

Aug 20 10:51:12 131m16 salt-highstate[441069]: [INFO    ] Completed state [ntp] at time 10:51:12.947702 (duration_in_ms=64.516)
Aug 20 11:01:33 131m16 salt-highstate[441069]: [INFO    ] Running state [/var/log/chrony] at time 11:01:33.579171

I'm thinking that the pause here is possibly still just the slow requisite-calculation problem, and mod_aggregate merely exacerbates it by the way it aggregates requisites.
Either way, it's still a big improvement.

Edit again:
The highstate is still running (it normally takes a couple of minutes). I think this has just moved the delays to other points in the highstate instead. There was a very lengthy delay much later on, after the couple-minute pause at the ntp service state:

Aug 20 10:42:06 131dm3 salt-highstate[2182799]: [INFO    ] File /etc/resolv.conf is in the correct state
Aug 20 10:42:06 131dm3 salt-highstate[2182799]: [INFO    ] Completed state [/etc/resolv.conf] at time 10:42:06.196400 (duration_in_ms=19.164)
Aug 20 12:21:29 131dm3 salt-highstate[2182799]: [INFO    ] Running state [foo] at time 12:21:29.346176
Aug 20 12:21:29 131dm3 salt-highstate[2182799]: [INFO    ] Executing state user.present for [foo]

@garethgreenaway
Contributor

@frebib Thanks for the update, and glad to see there is a bit of a speed-up here. Are you able to share what the ntp state looks like? I'm curious what that delay could be doing. Does it rely on packages being installed? If there are a lot of them and they're being aggregated, this could be taking a while to run.

@frebib
Contributor Author

frebib commented Sep 14, 2022

So I tried out the attached PR a few weeks ago (sorry about the silence, I've been busy with other things, including trying to get v3005 to work).
In my test the change made no difference to the highstate runtime; it still paused in exactly the same places as before, and for the same lengthy durations. I think the PR is worthwhile, as it does appear to bring some small improvements to the state compiler aside from this issue, but it definitely does not solve the problem described in this issue ticket.

As for our configuration, I have eyed up the ntp state after passing it through parts of the state compiler, and I can firmly say that showing you the contents of that SLS file would not help, because most of the requisites are entirely unrelated, having been pulled in by the aggregation of requisites from pkg states in many other files. There is simply this:

# If ntp is running on the machine, kill that service (chrony is meant to replace it).
# Also uninstall the package
ntp:
  service.dead: []
  pkg.removed: []

@frebib
Contributor Author

frebib commented Sep 21, 2022

#62529 does not fix this. Can it be re-opened?

@Ch3LL Ch3LL reopened this Sep 22, 2022
@garethgreenaway
Contributor

@frebib Thanks for the feedback, but without the states that are causing the issues for you, being able to reproduce this issue is going to be very difficult.

@garethgreenaway
Contributor

@frebib Revisiting this one, I'm not sure the fix that was associated with this issue was ultimately correct. @MKLeb's comment may be right that the recursive requisite error is itself correct, since both the require and require_in requisites would end up associated with the package state following the aggregation of the packages. I am curious how you expected the end data to look such that the aggregation would work even with these conflicting requisites.
