[BUG] Salt master under load produces duplicate jobs #64060
Comments
Does this only happen through salt-api? Are you able to replicate it from the CLI?
@whytewolf I doubt I would be able to... The issue occurs only very rarely (it has happened only 4 or 5 times so far). I suspect it can only happen when the salt master (meaning the salt-master process(es)) is under extreme load. As this is a production infrastructure, I cannot play around and try things. But if I can prepare anything (enable some special logging, add a line or two of code), I could do that and wait for the next occurrence.
One thing I would like to know is whether you have setproctitle installed. With it you should be able to track the PIDs to two separate processes. Most likely this is a race condition and two separate workers are picking up the job at the exact same time from the IPC. From the log you provided I see that there are two PIDs, 1934 and 1928. If you have setproctitle installed, you should be able to see what those processes belong to. This is more likely to happen on a stressed system, as the job is likely to take more time to get to the point where it deletes the job off the IPC bus after picking it up. But the load average hitting 100? That on its own is suspicious. You should not be hitting a load of 100 with only 10 workers and 400 minions, no matter how many minions return at once. Are there any other configs you have in place? Do you have any reactors set up? And how does your pillar setup look?
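As a rough illustration, a cron-friendly snapshot of the salt-master processes could be collected like this (a sketch only; it assumes psutil is available on the master, and the process-name match only gives useful worker names when setproctitle is installed):

```python
#!/usr/bin/env python3
"""Snapshot salt-master processes so PIDs seen in the master log can be
correlated with worker processes later (names are only meaningful with
setproctitle installed)."""
import datetime

import psutil  # assumption: psutil is installed on the master

now = datetime.datetime.now().isoformat(timespec="seconds")
for proc in psutil.process_iter(["pid", "name", "cmdline"]):
    name = proc.info["name"] or ""
    cmdline = " ".join(proc.info["cmdline"] or [])
    if "salt-master" in name or "salt-master" in cmdline:
        # with setproctitle the worker role (e.g. an MWorker) shows up here
        print(f"{now} pid={proc.info['pid']} {name} :: {cmdline}")
```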
Yes, setproctitle is installed, but as the master node is being rebooted often (e.g. to scale it up before the maintenance window), the PIDs are long gone. I will instruct the team to collect the process list whenever it happens again. What you describe is my suspicion as well: two workers pull the same job from a queue before the first one can acknowledge it. Regarding the high load: I do not even know what a reactor is in the Salt context, so I probably do not have one. We could of course forego the scaling-up, thus increasing the pressure on the master VM even further to make the symptoms more prominent.
So, the pull for git_pillar is not the same as the render; it still needs to go through the render process for each call a minion does. You are not doing anything like using pillar.items in your Jinja, are you? That forces a new render of the pillar for each call. If you set the master into profile logging, it should show the pillar rendering stats, which will give an idea of what pillar is doing. Another thing to look at, especially with this issue, is what the I/O on the system looks like. Also, I can't believe I didn't notice this before, but 4 CPUs for 10 workers is not enough; for 10 workers you need at least 6 CPUs. But you also shouldn't need 10 workers for 400 minions. See https://docs.saltproject.io/en/latest/topics/tutorials/intro_scale.html
We do not use pillar.items, but pillar.get very often. Is that also a bad thing? I/O is not an issue (it tops out at 100 IOPS and 10% I/O time during the highest load), and there is also plenty of RAM and no swapping. Thanks for the hint about the workers; I will go down to two workers and see what happens. I will also try out profile logging (the SIGUSR2 procedure). It might take until the week after next for results.
No, pillar.get works with the built-in in-memory pillar. I called out pillar.items because it forces a refresh of the pillar on the master for each call. The default worker_threads of 5 should be enough and should work with 4 CPUs.
I tried the SIGUSR2 signal today before running a single job, but nothing happened. I set
The maintenance was today, and with the "regularly-sized" salt master node, it again failed miserably. There were ca. 150 state.apply jobs scheduled. The machine had 4 GB RAM and 2 vCPUs (AWS t3.medium). This time, there were 4 worker_threads configured. Within a few minutes the system load went up to 100 and the salt API became unreachable. I picked out this snippet from the logs (filtered for only one PID), which does not look healthy:
Afterwards, our maintenance team resized the VM to a big 8-vCPU machine, and then the run was successful. Memory/swapping was never an issue. The profile logging does not seem to reveal anything interesting. I'll check if I can have a "test=true" option added to our tooling; maybe it happens then, too...
Yes, please see if test=True can bring out the same behavior. However, this is interesting: given the number of event errors seen, you might also want to log the event bus, either manually or through an event returner; this could give us an idea of what is happening.
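For watching the bus manually, a minimal listener sketch along these lines should work when run as root on the master (the pattern follows the Salt event-system docs; the config path and options may need adjusting for your setup):

```python
# Minimal master event-bus listener; run on the salt master as root.
# Assumes the default /etc/salt/master config location.
import salt.config
import salt.utils.event

opts = salt.config.client_config("/etc/salt/master")
event_bus = salt.utils.event.get_event(
    "master",
    sock_dir=opts["sock_dir"],
    opts=opts,
    listen=True,
)

# Print every event tag and payload; grep the output for the job ids later.
for ev in event_bus.iter_events(full=True):
    if ev:
        print(ev["tag"], ev["data"])
```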
I am trying to get "test=true" running. I try not to hijack this bug report as a teach-me-salt-cherrypy-api thread... but the API docs are a bit vague, and for now I am still trying to figure out which data structure will convey "test=true" to CherryPy.

The named minion is definitely unique. The VM exists only once (otherwise many other things would go wrong here). Also, the minion ID exists only once in the output of ... I have added a cron job to collect the output of ...

I am also trying to get that event_returner working; after adding the configuration for it to the master config file, the given file at least was created.
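A sketch of one way test=True might be passed through rest_cherrypy (this assumes the local_async client and that kwarg is forwarded to the minion; the URL, credentials, eauth backend, and minion ID are placeholders, not the real setup):

```python
# Hypothetical sketch: pass test=True through salt-api (rest_cherrypy).
# URL, username/password and eauth backend below are placeholders.
import requests

API = "https://salt-master.example:8000"  # placeholder

# 1) authenticate and obtain a token
login = requests.post(
    f"{API}/login",
    headers={"Accept": "application/json"},
    json={"username": "saltapi", "password": "secret", "eauth": "pam"},
).json()
token = login["return"][0]["token"]

# 2) queue a dry-run highstate for a single minion
resp = requests.post(
    API,
    headers={"X-Auth-Token": token, "Accept": "application/json"},
    json=[{
        "client": "local_async",
        "tgt": "some-minion-id",     # placeholder minion id
        "fun": "state.highstate",
        "kwarg": {"test": True},     # the dry-run switch
    }],
)
print(resp.status_code, resp.json())
```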
During testing of the test=true implementation (the test environment of course has only a few VMs, not 400), I noticed two things so far:
I wonder why the salt master would spawn an lvm process on the master VM? As for the high load, I guess I will soon have everything needed to diagnose it in place, and will report back when I find something...
The lvm command comes from the pillar spinning up the minionMaster and filling its own grains. The minionMaster is a version of a minion that is used for running modules from within the rendering engines used in the master, such as orchestration, reactors, or pillar.
While I did not get around to testing the dry run in time, the maintenance team got lucky(?): another maintenance Wednesday, another duplicate job. Note that this time, the maintenance team again resized the VM to 16 GB/8 vCPUs prior to the procedure. I now have the event log file. As it is huge and full of sensitive information, I grepped out the lines with the two relevant job IDs. I also included the processes with PIDs as I found them now. 20230524042638207502 is the "magic duplicate", rejected by the minion.
If you need more info, please let me know. Nevertheless, I was not totally idle: on my test environment, while it has only a few VMs, I noticed the salt master reaching a load average of up to 0.2 during a single state.highstate job, so maybe I can reproduce it here...
Interesting. The same PID caught both jobs, so this wasn't some race condition in picking up the job; it looks like it acted as if it was called a second time. Also, with the second job being started almost six seconds later, there must be something else going on here. Do you have any logs in the /var/log/salt/api log around the same timestamps?
@whytewolf yes, I have the API logs, but they seem useless, as the minion ID is in the body:
But I already took a full packet capture some time ago when it happened (see OP), to verify that only one POST /minions request goes over the wire for the duplicate job. The other POST requests are for other minions. The client service logged the request for this minion (before sending it) at 04:25:51, so it took about half a minute until it got processed in Salt.
How are you doing targeting? From some of the logs it looks like you are triggering against each minion individually, creating a new event for each minion on its own. Is it possible to switch to targeting with multiple targets?
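For illustration, a list-targeted call through the API could look roughly like this (a sketch assuming rest_cherrypy and the local_async client; tgt_type "list" addresses an explicit set of minion IDs, and the URL, token, and minion names are placeholders):

```python
# Sketch: one job for a batch of minions instead of one job per minion.
# URL, token and minion ids are placeholders.
import requests

API = "https://salt-master.example:8000"  # placeholder
TOKEN = "..."                             # placeholder auth token

resp = requests.post(
    API,
    headers={"X-Auth-Token": TOKEN, "Accept": "application/json"},
    json=[{
        "client": "local_async",
        "tgt": ["node-a1", "node-b1", "node-c1"],  # placeholder minion ids
        "tgt_type": "list",
        "fun": "state.highstate",
    }],
)
# A single jid covers all listed minions; results can still be read per minion.
print(resp.json())
```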
Yes, the jobs are for individual minions; that's part of the service calling Salt, which schedules the jobs for individual minions to satisfy certain ordering/scheduling constraints (most of the minions are part of node clusters, where only one node may be down/under maintenance at the same time). Each job is polled every 10 s, and to lighten the load I already implemented a limit of 50 tasks in the calling service (effectively resulting in at most 50 concurrent Salt jobs at any given time).
OK. The output from the API looks like it is only at info level. Can you turn up the logging to debug? It should have more information about the events being sent from it. I am trying to figure out whether the duplicate events are coming from the master or from within the API process.
OK. I have set log_level: debug in /etc/salt/master (it was "profile" before). If the API process needs specific configuration, too, please let me know.
Nope, that should be the same setting. debug comes after profile, and the API doesn't have anything that gets profiled.
Quick note: today I was present during the maintenance window and took a quick peek. The processes eating up CPU look like this in
So somehow jobs.list_job is very expensive? Some of them run processes on the salt master that are, IMHO, totally unrelated, which might be the cause of the load.
Which master_job_cache are you using? The default localfs one? If so, then yes, it is going to be expensive. You are hitting the master with thousands of jobs, and each job is a separate file that gets saved to the filesystem. They are kept for 24 hours, so jobs.list_jobs is going to spend its time looking up each and every job file created on the filesystem. And this CAN have an effect on the IPC transfer, as it goes through socket files on the filesystem: if the filesystem is overloaded because it has to work hard to build the jobs list, it could potentially cause an OS-level block on the IPC. See https://docs.saltproject.io/en/latest/topics/tutorials/intro_scale.html
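To get a feel for how much the default local job cache has accumulated, counting the cached job files is a quick check (a sketch; /var/cache/salt/master/jobs is the default location for the local job cache, so adjust the path if your cachedir differs):

```python
# Rough gauge of the local job cache; jobs.list_jobs has to walk all of this.
# The default cachedir is assumed; adjust the path for non-default setups.
from pathlib import Path

job_cache = Path("/var/cache/salt/master/jobs")
dirs = files = 0
for path in job_cache.rglob("*"):
    if path.is_dir():
        dirs += 1
    else:
        files += 1
print(f"{job_cache}: {dirs} directories, {files} files")
```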
@whytewolf I have not configured anything related to the job cache, so it is the default. I now suspect that "minionMaster" concept you mentioned earlier: if every GET on a job results in running countless child processes on the master (running ...), that might explain the load. Sadly (or luckily, maybe), we had no duplicate jobs during maintenance this time. Nevertheless, I put the log files (including DEBUG) aside... If there's something to look for, please let me know.
So, no, minionMaster is not something that can be turned off; it is central to how pillar within Salt works. The only reason I mentioned it was because you asked about the lvm process. It was never a consideration for this issue, as the minionMaster is part of pillar, not of how jobs are grabbed off the publish bus. Since the logs we care about are the ones from when there are duplicates, let me know if you get some logs with duplicates.
To note: the 3006.3 version will have some configuration options added that should help with blocking processes, which in turn might help reduce the duplicates. So once that is released, we will want you to test that version.
OK. During the last few maintenances, the issue did not appear (the team always provides a bigger VM for the Salt master). I still have the event_returner turned on and debug level enabled. For another issue, I tried the current 3006 RC yesterday, but got hit by another bug, so I am stuck on 3005 for now.
3006 has been out of RC for several months now. What version did you try, exactly?
Huh, that took some advanced browser-history digging :) I had hit this issue while testing something else: #62851. But never mind, all of that had nothing to do with the issue here.
Ahh, that issue was fixed in 3006.1. The current latest is 3006.2, which is a CVE release, with 3006.3 right around the corner.
Hmm. Unfortunately, without being able to duplicate the issue, with a larger VM able to handle the work without duplicates, and with potential load fixes coming in the next 3006.x release, which is right around the corner, I think I am going to have to close this. When 3006.3 is released, please give it a try, and if it still has issues, open up a new ticket.
Description
Rarely, when issuing many jobs (salt master machine under heavy load), we observe the phenomenon that one state.highstate job issued via the API is internally "duplicated": the Salt master generates two job IDs. The first is processed "normally", but the 202 ACCEPTED response contains the second one. Querying for it, the master (rightly) complains that a state.highstate job is already running.
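A simplified sketch of the call pattern described above (the real client is Java/Apache HTTP Client; the Python below is only for illustration, with placeholder URL, token, and minion ID):

```python
# Illustration only: POST to /minions, then poll /jobs/<jid>.
# URL, token and minion id are placeholders, not the real setup.
import requests

API = "https://salt-master.example:8000"   # placeholder
HEADERS = {"X-Auth-Token": "...", "Accept": "application/json"}

# queue the highstate; rest_cherrypy answers 202 ACCEPTED with the jid(s)
resp = requests.post(
    f"{API}/minions",
    headers=HEADERS,
    json={"tgt": "some-minion-id", "fun": "state.highstate"},
)
print(resp.status_code)                    # expected: 202
jids = [ret["jid"] for ret in resp.json()["return"]]

# poll the job(s); with the bug, the second jid reports "already running"
for jid in jids:
    job = requests.get(f"{API}/jobs/{jid}", headers=HEADERS).json()
    print(jid, job["return"])
```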
Setup
Some setup facts:
During the procedure, the system load avg quickly reaches 100. That alone is most likely worth a separate issue.
But every now and then, the above phenomenon occurs: a POST /minions request with state.highstate on a minion is issued. Then querying for the second one obviously results in a message like:
The function "state.highstate" is running as PID 28193 and was started at 2023, Apr 12 04:54:15.509763 with jid <first-jid>
At first, I obviously suspected our own service (Java, Apache HTTP Client, known to have built-in retry functionality), so I set up tcpdump beforehand. And indeed, only one POST request is seen on the wire.
I sorted all timestamped events into a timeline:
Indeed, the two log entries from the salt master for the two jobs happen after receiving the job via the REST API, but before responding to it.
Versions Report
salt --versions-report
(Provided by running salt --versions-report. Please also mention any differences in master/minion versions.)