[BUG] v3005rc2: Highstate slows to a crawl/never terminates with state_aggregate enabled #62439
Comments
@frebib Thanks for the report. Can you provide a small subset of example states where you're seeing this behavior? Thanks!
Did it do the same in v3004, or is this new behaviour?
@garethgreenaway I'm not sure this is related to any specific SLS configuration, but I will try to dig out a short series of states that execute in our state graph around the point where it freezes. My suspicion is that the considerable use of requisites is the cause, given my experience with the state compiler and what the profile output shows. I'd expect this to be reproducible on any sizeable state graph. @OrangeDog I'll try backporting c9a6217 to v3004 and see if it reproduces there. As it stands I have that option enabled on a handful of v3004 minions with no detriment, but I also noticed zero improvement/change when enabling it, so it's still plausible that it's broken in 3004 too. I'll report back.
It seems that v3004 is unaffected, so this must be a regression somehow. I am seeing
Here is a simple reproducer that causes the
I guess this is either a bug in the implementation of the pkg
That doesn't seem to cause 3005 to freak out and spin forever though, so it's possibly not the cause of this particular bug.
Hi @frebib, been looking at this. I think the
As for your original report, were you able to gather a small collection of states that reproduces the issue? We are hoping to re-tag with whatever fixes are required for this by tomorrow.
So this is precisely the problem that my example is trying to highlight. Without aggregating these states, the produced state graph is a correct DAG. I'm speculating that this phenomenon/fundamental flaw in the aggregation system is possibly the cause of the issue described in this bug report, due to cycles in the state graph that the compiler doesn't catch (a hypothetical sketch of such a cycle follows at the end of this comment). My example above isn't complex enough to trigger the issue because its state graph is trivially small, but a considerably larger one, say ~3500 states with tens of thousands of requisites (and lots of pkg states), may well be complex enough to get the state compiler stuck either in an infinite loop or in one that won't complete in any reasonable amount of human time. This is purely speculation, but it fits the (at this point) well-understood shortcomings of the way Salt handles requisites (#59123) and the captured profile screenshot above showing the CPU time spent in check_requisite. Inspecting the aggregated
The behaviour exhibited around the stalls is identical to that in #59123, which we see normally (in 2019.2.x, 3004.x, without state_aggregate).

Some not very interesting logs:
10:20:47,861 [INFO ][743293] Executing command systemctl in directory '/root'
10:20:47,892 [DEBUG ][743293] stdout: * auditd.service - Security Auditing Service Loaded: loaded (/etc/systemd/system/auditd.service; enabled; vendor preset: enabled)
10:20:47,894 [INFO ][743293] Executing command systemctl in directory '/root'
10:20:47,923 [DEBUG ][743293] stdout: active
10:20:47,925 [INFO ][743293] Executing command systemctl in directory '/root'
10:20:47,951 [DEBUG ][743293] stdout: enabled
10:20:47,951 [INFO ][743293] The service auditd is already running
10:20:47,952 [INFO ][743293] Completed state [auditd] at time 10:20:47.952219 (duration_in_ms=94.282)
!! Note the ~7 minute pause here
10:28:03,221 [DEBUG ][743293] Could not LazyLoad service.mod_aggregate: 'service.mod_aggregate' is not available.
!! Note the ~4 minute pause here
10:32:02,744 [DEBUG ][743293] Could not LazyLoad file.mod_aggregate: 'file.mod_aggregate' is not available.
!! Note ~64 minute(!!!!) pause here
11:36:10,200 [salt.utils.lazy :102 ][DEBUG ][743293] Could not LazyLoad file.mod_aggregate: 'file.mod_aggregate' is not available.

My final thoughts here are that
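To illustrate the kind of cycle being described above, here is a hypothetical minimal set of states. This is not the original reproducer (which was not captured above); all state IDs, package names, and file paths are invented for illustration.

```yaml
# Hypothetical states; names and paths are invented.
install-nginx:
  pkg.installed:
    - name: nginx

nginx-config:
  file.managed:
    - name: /etc/nginx/nginx.conf
    - require:
      - pkg: install-nginx

install-certbot:
  pkg.installed:
    - name: certbot
    - require:
      - file: nginx-config
```

Unaggregated, the graph is install-nginx → nginx-config → install-certbot, a valid DAG. With pkg aggregation the two pkg.installed chunks are folded into a single pkg call, which then has to run both before nginx-config (nginx must exist first) and after it (certbot requires the config), i.e. a cycle, which on a small graph trips the recursive-requisite check and on a much larger graph could plausibly leave the compiler churning.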
I was able to spend a bit of time on this one and was able to reproduce the issue with the small example state. At first the exhibited behavior of producing an error about a recursive requisite seemed correct, until I started to dig in more and realized it wasn't actually a recursive error. The issue seems to be the order in which things happen when handling aggregation; with the following patch, the small example states run in the desired manner:

diff --git a/salt/state.py b/salt/state.py
index 65cb7a64d7..ce206bea6b 100644
--- a/salt/state.py
+++ b/salt/state.py
@@ -864,7 +864,7 @@ class State:
# the state function in the low we're looking at
# and __agg__ is True, add the requisites from the
# chunk to those in the low.
- if chunk["state"] == low["state"] and chunk.get("__agg__"):
+ if chunk["state"] == low["state"] and chunk.get("__agg__") and low["name"] != chunk["name"]:
for req in frozenset.union(
*[STATE_REQUISITE_KEYWORDS, STATE_REQUISITE_IN_KEYWORDS]
):
@@ -891,8 +891,8 @@ class State:
agg_fun = "{}.mod_aggregate".format(low["state"])
if agg_fun in self.states:
try:
- low = self.states[agg_fun](low, chunks, running)
low = self._aggregate_requisites(low, chunks)
+ low = self.states[agg_fun](low, chunks, running)
low["__agg__"] = True
except TypeError:
log.error("Failed to execute aggregate for state %s", low["state"]) After this patch, any requisites for states that will be effected by aggregation will be calculated first and then the aggregation for the particular state is run. The patch above also ignores the requisite check if the chunk is the same state as the low that was passed in. Because of the aggregation that is happening with the |
I can confirm that the patch above certainly improves the situation, reducing the 64-minute pause after the
I'll test this some more and, assuming I see no weirdness/brokenness, I'll carry this patch on our v3005rc2 branch and re-enable state_aggregate on some nodes as a longer test. Edit: the above was a different node type to the one where we see the worst impact; I just checked that one and there is an improvement, but it's not as good: 64 minutes down to 10 minutes.
I'm thinking that the pause here is possibly still just the slow requisite calculation problem, and mod_aggregate merely exacerbates it by the way it aggregates requisites. Edit again:
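To put the speculation above in concrete (entirely invented) terms: folding every aggregated pkg chunk's requisites onto the low that actually runs means that one low carries the union of all of them, so the already-expensive requisite checking has far more to chew on for that chunk. A toy model, with made-up numbers:

```python
# Toy model with invented numbers: 300 pkg states, each with 5 requisites of its own.
pkg_chunks = [
    {"state": "pkg", "name": f"pkg{i}", "require": [{"file": f"cfg{i}-{j}"} for j in range(5)]}
    for i in range(300)
]

# Aggregation folds every pkg chunk's requisites onto the single low being run.
aggregated_require = []
for chunk in pkg_chunks:
    aggregated_require.extend(chunk["require"])

print(len(pkg_chunks[0]["require"]))  # 5    requisites checked for one pkg low before
print(len(aggregated_require))        # 1500 requisites checked for the aggregated low
```

Whether that growth is actually the dominant cost here is, as above, speculation; the toy only illustrates how the per-low requisite lists balloon under aggregation.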
@frebib Thanks for the update and glad to see there is a bit of a speed-up for this. Are you able to share what the
So I tried out the attached PR a few weeks ago (sorry about the silence, I've been busy with other things, including trying to get v3005 to work). As for our configuration, I have eyed up the
#62529 does not fix this. Can it be re-opened?
@frebib Thanks for the feedback, but without the states that are causing the issues for you, being able to reproduce this issue is going to be very difficult.
@frebib Revisiting this one, I'm not sure the fix that was associated with this issue was ultimately correct, and @MKLeb's comments about the recursive requisite being the correct result may be right, as both the
Description
Highstate on v3005rc2 is marginally slower than v3004, but still within an acceptable range (I measured roughly 5% slower). Adding
state_aggregate: true
to minion conf causes the highstate to freeze for minutes/hours between state calls (logs abbreviated and redacted; note the timestamps). Later on I saw hangs lasting for at least 3 hours, at which point I gave up and killed the job to retrieve the profiling data.
Setup
Our highstate contains ~3.5k states and makes heavy use of various requisites to enforce correctness and ordering, because Salt makes little attempt to do it for us. This is a "standard" minion/master setup, with highstates being invoked with
salt-call state.highstate
. Running with -lall yields nothing more useful than -ldebug.
Steps to Reproduce the behavior
Enable
state_aggregate: true
in a sufficiently large codebase.
Expected behavior
No freezing
Screenshots
If applicable, add screenshots to help explain your problem.
Versions Report
salt --versions-report
(Provided by running salt --versions-report. Please also mention any differences in master/minion versions.)
Additional context
I suspect that this is caused by 8168b25 and exacerbated by #59123
A profile of the running minion job process shows an overwhelming majority of CPU time (95%+) stuck in
check_requisite
which makes me think it's stuck either in an infinite loop or something similar. I'm speculating that this could have always been broken and it's only now apparent since state_aggregate was fixed.