[BUG] Salt master under load produces duplicate jobs #64060

Closed
MartinEmrich opened this issue Apr 12, 2023 · 31 comments
Labels
Bug (broken, incorrect, or confusing behavior), needs-triage

Comments

@MartinEmrich

Description

Rarely, when issuing many jobs (salt master machine under heavy load), we observe the phenomenon that one state.highstate job issued via API is internally "duplicated". Salt Master generates two Job IDs. The first is processed "normally", but the 202 ACCEPTED response contains the second one. Querying for it, the master (rightly) complains that a state.highstate job is already running.

Setup

Some setup facts:

  • Salt-Master (and minions) run Salt 3005.1 on CentOS 7 installed from official RPM repo.
  • Master is serving 400 Minions. Everything is on AWS.
  • When we do our scheduled update window, we scale the Master VM to 4 vCPUs and 16GB RAM (the latter is not used up).
  • For that update procedure, a service issues state.highstate jobs via the Salt REST API and polls for the job statuses regularly, limited to 50 concurrent jobs (a request sketch follows the config below).
  • That regular poll (GET /jobs/) is done with a 10s interval, thus resulting on average in ca. 5 requests/s.
  • Performance/API-relevant salt master config:
interface: 0.0.0.0
timeout: 30
worker_threads: 10
top_file_merging_strategy: merge

rest_cherrypy:
  port: 8000
  debug: True
  disable_ssl: True
...
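For illustration, here is a rough Python sketch of that submit-and-poll pattern (not our actual client, which is a Java service; master address and token are placeholders):

# Rough sketch of the submit-and-poll pattern, not the actual client code.
# MASTER and TOKEN are placeholders; error handling is omitted.
import json
import time
import urllib.request

MASTER = "http://salt-master.internal:8000"  # placeholder address
TOKEN = "<token from POST /login>"           # placeholder token

def call(method, path, body=None):
    req = urllib.request.Request(
        MASTER + path,
        data=json.dumps(body).encode() if body is not None else None,
        method=method,
        headers={
            "X-Auth-Token": TOKEN,
            "Content-Type": "application/json",
            "Accept": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# POST /minions answers 202 ACCEPTED with the jid of the queued job ...
accepted = call("POST", "/minions", [{"tgt": "someminion", "fun": "state.highstate"}])
jid = accepted["return"][0]["jid"]

# ... which is then polled every 10 s via GET /jobs/<jid> until a return shows up.
while not call("GET", "/jobs/" + jid)["return"][0]:
    time.sleep(10)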

During the procedure, the system load avg quickly reaches 100. That alone is most likely worth a separate issue.
But every now and then, the above phenomenon occurs:

  • POST /minions request with state.highstate on a minion is issued.
  • Salt Master generates (and logs) two Job IDs.
  • The first one is actually executed on the minion.
  • The 202 ACCEPTED response contains the second one.

Then querying for the second one obviously results in a message like The function "state.highstate" is running as PID 28193 and was started at 2023, Apr 12 04:54:15.509763 with jid <first-jid>

At first, I obviously suspected our own service (Java, Apache HTTP Client, known to have built-in retry functionality), so I set up tcpdump beforehand. And indeed, only one POST request is seen on the wire.

I sorted all timestamped events into a "time line":

Time (CEST)     Source of time stamp          Message/Comment

06:53:43        (API client)                  starting highstate on minion
06:53:43.680303 (pcap timestamp)              POST /minions request
06:53:43        (HTTP response Date header)   HTTP response for POST received later has a Date: header with this time

06:54:15.509763 (Salt Master response body)   Start time of job 20230412045415509763
06:54:15.724    (Salt Master log)             2023-04-12 04:54:15,724 [salt.master      :2479][INFO    ][1934] User api Published command state.highstate with jid 20230412045415509763
06:54:15.759    (Salt Master log)             2023-04-12 04:54:15,759 [salt.master      :2479][INFO    ][1928] User api Published command state.highstate with jid 20230412045415567009
06:54:16.029    (Salt Master log)             2023-04-12 04:54:16,029 [salt.utils.job   :85  ][INFO    ][1935] Got return from <minion-id> for job 20230412045415567009
06:54:16.289338 (pcap timestamp)              Response Saltmaster 202 ACCEPTED for job 20230412045415567009
06:54:16        (API client)                  receives and logs response Job 20230412045415567009 has been accepted.
06:54:16.290551 (pcap timestamp)              first Request from API client for GET /jobs/20230412045415567009 (regular polling for job status)
06:54:16        (HTTP response Date header)   HTTP response Date: header for GET /jobs/20230412045415567009

06:54:48.817459 (pcap timestamp)              Response of Salt-Master to the GET request (including ["The function \"state.highstate\" is running as PID 28193 and was started at 2023, Apr 12 04:54:15.509763 with jid 20230412045415509763"] )
06:54:48        (API client)                  logs error due to already running job.

Indeed, the two Salt master log lines for the two jobs appear after the job was received via the REST API, but before the response was sent.

Versions Report

salt --versions-report
Salt Version:
          Salt: 3005.1

Dependency Versions:
          cffi: 1.15.0
      cherrypy: unknown
      dateutil: 2.8.2
     docker-py: Not Installed
         gitdb: Not Installed
     gitpython: Not Installed
        Jinja2: 2.11.1
       libgit2: 0.28.4
      M2Crypto: 0.35.2
          Mako: Not Installed
       msgpack: 0.6.2
  msgpack-pure: Not Installed
  mysql-python: Not Installed
     pycparser: 2.21
      pycrypto: 2.6.1
  pycryptodome: Not Installed
        pygit2: 1.0.3
        Python: 3.6.8 (default, Nov 16 2020, 16:55:22)
  python-gnupg: Not Installed
        PyYAML: 3.13
         PyZMQ: 18.0.1
         smmap: Not Installed
       timelib: Not Installed
       Tornado: 4.5.3
           ZMQ: 4.1.4

System Versions:
          dist: centos 7 Core
        locale: UTF-8
       machine: x86_64
       release: 3.10.0-1160.88.1.el7.x86_64
        system: Linux
       version: CentOS Linux 7 Core
MartinEmrich added the "Bug" and "needs-triage" labels on Apr 12, 2023
anilsil added this to the Sulfur v3006.1 milestone on Apr 18, 2023
@whytewolf
Collaborator

does this only happen through salt-api? are you able to replicate it from the cli?

@MartinEmrich
Author

@whytewolf I doubt I would be able to... The issue occurs only very rarely (it has happened only 4 or 5 times so far). I suspect it can only happen if the salt master (meaning the salt-master process(es)) is under extreme load. As this is production infrastructure, I cannot play around and try things.

But if I can prepare anything (enable some special logging, add a line or two of code), I could do that and wait for the next occurrence.

@whytewolf
Collaborator

one thing i would like to see is if you have setproctitle installed. you should be able to track the pid to two separate processes. most likely this is a race condition and two separate workers are picking up the job at the exact same time from the ipc. from the log you did provide i see that there are two pids. 1934 and 1928. if you have setproctitle installed you should see what those processes belong to.
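for example (rough linux-only sketch, assuming setproctitle is installed so the worker name shows up in the cmdline):

# Map the two PIDs from the log above to their process titles via /proc.
def proc_title(pid: int) -> str:
    with open(f"/proc/{pid}/cmdline", "rb") as f:
        return f.read().replace(b"\x00", b" ").decode().strip()

for pid in (1934, 1928):
    print(pid, proc_title(pid))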

This is more likely to happen under a stressed system as the job is likely to take more time to get to the point that it deletes the job off the ipc bus after picking it up.

but the load avg hitting 100? that on its own is suspicious. you should not be hitting a load of 100 with only 10 workers and 400 minions, no matter how many minions return at once. are there any other configs you have in place?

do you have any reactors setup? and how does your pillar setup look?

@MartinEmrich
Author

Yes, setproctitle is installed, but as the master node is being rebooted often (e.g. to scale it up before the maintenance window), the PIDs are long gone. But I will instruct the team to collect the process list whenever it happens again.

But what you describe is my suspicion as well: two workers pull the same job from a queue before the first one can acknowledge it.

For the high load:

I do not even know what a reactor is in the salt context, so I probably do not have one.
Pillar-wise, we have two sources: a git-backed pillar, and an ext_pillar python script collecting minion-specific data from an HTTP API endpoint.
But according to the logs, the git pillar is pulled only every 60s, and the ext_pillar script is executed only once or twice for each minion.
The minion config is completely stock except rejectedRetry: true (should be irrelevant with registered minions) and the master IP.

We could of course forego the scaling-up, thus even increasing pressure on the master VM to make symptoms more prominent.

@whytewolf
Collaborator

so, the pull for git_pillar is not the same as the render. it still needs to go through the process of the render for each call a minion does. you're not doing anything like using pillar.items in your jinja are you? as that forces a new render of the pillar for each call. if you set the master into profile logging it should show the pillar rendering stats. which will give an idea of what pillar is doing.

another thing to look for, esp. with this issue, is what the io on the system looks like.

also can't believe i didn't notice this before but 4 cpus for 10 workers is not enough. for 10 workers you would need at least 6 cpus. but you also shouldn't need 10 workers for 400 minions.

see https://docs.saltproject.io/en/latest/topics/tutorials/intro_scale.html

@MartinEmrich
Author

we do not use pillar.items, but very often {{ salt['pillar.get']("some:optional:pillar","default-value") }}, as {{ pillar.some.optional.pillar }} will fail to render if the pillar is not set.

Is that also a bad thing?

I/O is not an issue (it tops out at 100 IOPS and 10% IO-time during the highest load); there is also plenty of RAM and no swapping.

Also thanks for the hint about the workers, I will go down to two workers and see what happens. I will also try out profile logging (the SIGUSR2 procedure). It might take until the week after next for results.

@whytewolf
Collaborator

no, pillar.get works with the built-in in-memory pillar. i call out pillar.items because it forces a refresh of the pillar on the master for each call.

the default worker_threads of 5 should be enough, and should work with 4 cpus.

@MartinEmrich
Author

I tried the SIGUSR2 signal today before running a single job, but nothing happened. I set log_level: profile, then some PROFILE messages appeared in the log file.
I also set the workers to 4. After this week's maintenance window, I will report back if something interesting comes up.

@MartinEmrich
Author

The maintenance was today, and with the "regularly-sized" salt master node, it again failed miserably. There were ca. 150 state.apply jobs scheduled. The machine had 4GB RAM and 2 vCPUs (AWS t3.medium). This time, there were 4 worker_threads configured.

But within a few minutes, the system load went up to 100, the salt API was then unreachable.

I picked out this snippet from the logs (filtered to a single PID); it does not look healthy:

2023-05-10 04:15:29,238 [salt.utils.event :912 ][ERROR   ][24288] Event iteration failed with exception: 'str' object has no attribute 'items'
2023-05-10 04:15:29,240 [salt.utils.job   :85  ][INFO    ][24288] Got return from someminion513 for job 20230510041313211738
2023-05-10 04:15:29,490 [salt.utils.event :912 ][ERROR   ][24288] Event iteration failed with exception: 'str' object has no attribute 'items'
2023-05-10 04:15:29,494 [salt.utils.job   :85  ][INFO    ][24288] Got return from someminion425 for job 20230510041315620062
2023-05-10 04:15:29,774 [salt.utils.event :912 ][ERROR   ][24288] Event iteration failed with exception: 'str' object has no attribute 'items'
2023-05-10 04:15:29,792 [salt.utils.job   :85  ][INFO    ][24288] Got return from someminion557 for job 20230510041324261744
2023-05-10 04:15:29,958 [salt.utils.event :912 ][ERROR   ][24288] Event iteration failed with exception: 'str' object has no attribute 'items'
2023-05-10 04:15:30,109 [salt.master      :1643][ERROR   ][24288] Received minion error from [someminion466]: The minion function caused an exception
2023-05-10 04:15:30,111 [salt.utils.job   :85  ][INFO    ][24288] Got return from someminion388 for job 20230510041319015497
2023-05-10 04:15:30,340 [salt.utils.event :912 ][ERROR   ][24288] Event iteration failed with exception: 'str' object has no attribute 'items'
2023-05-10 04:15:30,449 [salt.master      :1643][ERROR   ][24288] Received minion error from [someminion448]: The minion function caused an exception
2023-05-10 04:15:30,646 [salt.master      :1643][ERROR   ][24288] Received minion error from [someminion155]: The minion function caused an exception
2023-05-10 04:15:30,832 [salt.master      :1643][ERROR   ][24288] Received minion error from [someminion210]: The minion function caused an exception
2023-05-10 04:15:30,835 [salt.utils.job   :85  ][INFO    ][24288] Got return from someminion155 for job 20230510041312553864
2023-05-10 04:15:30,985 [salt.utils.event :912 ][ERROR   ][24288] Event iteration failed with exception: 'str' object has no attribute 'items'
2023-05-10 04:15:30,987 [salt.utils.job   :85  ][INFO    ][24288] Got return from someminion513 for job 20230510041313211738
2023-05-10 04:15:31,110 [salt.utils.event :912 ][ERROR   ][24288] Event iteration failed with exception: 'str' object has no attribute 'items'
2023-05-10 04:15:31,110 [salt.loaded.int.returner.local_cache:150 ][ERROR   ][24288] An extra return was detected from minion someminion513, please verify the minion, this could be a replay attack
2023-05-10 04:15:31,112 [salt.utils.job   :85  ][INFO    ][24288] Got return from someminion425 for job 20230510041315620062
2023-05-10 04:15:31,244 [salt.utils.event :912 ][ERROR   ][24288] Event iteration failed with exception: 'str' object has no attribute 'items'
2023-05-10 04:15:31,244 [salt.loaded.int.returner.local_cache:150 ][ERROR   ][24288] An extra return was detected from minion someminion425, please verify the minion, this could be a replay attack
2023-05-10 04:15:31,246 [salt.utils.job   :85  ][INFO    ][24288] Got return from someminion490 for job 20230510041319676792
2023-05-10 04:15:31,407 [salt.utils.event :912 ][ERROR   ][24288] Event iteration failed with exception: 'str' object has no attribute 'items'
2023-05-10 04:15:32,656 [salt.master      :1643][ERROR   ][24288] Received minion error from [someminion490]: The minion function caused an exception
2023-05-10 04:15:32,657 [salt.utils.job   :85  ][INFO    ][24288] Got return from someminion270 for job 20230510041318100936
2023-05-10 04:15:32,933 [salt.utils.event :912 ][ERROR   ][24288] Event iteration failed with exception: 'str' object has no attribute 'items'
2023-05-10 04:15:35,994 [salt.utils.job   :85  ][INFO    ][24288] Got return from someminion388 for job 20230510041319015497
2023-05-10 04:15:36,232 [salt.utils.event :912 ][ERROR   ][24288] Event iteration failed with exception: 'str' object has no attribute 'items'
2023-05-10 04:15:36,233 [salt.loaded.int.returner.local_cache:150 ][ERROR   ][24288] An extra return was detected from minion someminion388, please verify the minion, this could be a replay attack
2023-05-10 04:15:37,372 [salt.utils.job   :85  ][INFO    ][24288] Got return from someminion490 for job 20230510041319676792
2023-05-10 04:15:37,585 [salt.utils.event :912 ][ERROR   ][24288] Event iteration failed with exception: 'str' object has no attribute 'items'
2023-05-10 04:15:37,586 [salt.loaded.int.returner.local_cache:150 ][ERROR   ][24288] An extra return was detected from minion someminion490, please verify the minion, this could be a replay attack

Afterwards, our maintenance team resized the VM to a big 8-vCPU machine, and the run then succeeded. Memory/swapping was never an issue.

The profile logging does not seem to reveal anything interesting. All PROFILE log entries relate to pillars/pillar gitfs, and all timings are well below 1s (most under 0.01s).

I'll check if I can have a "test=true" option added to our tooling, maybe it happens then, too....

@whytewolf
Collaborator

yes please see if test=True can bring out the same behavior.

however this is interesting. "An extra return was detected from minion someminion513, please verify the minion, this could be a replay attack" means the minion returned the same jid twice. this is different than the job running twice; this is the minion trying to return the information twice. this normally only happens if there is more than one minion with the same key and id.

given the number of event errors seen, you might also want to log the event bus, either manually through salt-run state.event, or through an event_returner such as https://docs.saltproject.io/en/latest/ref/returners/all/salt.returners.rawfile_json.html#module-salt.returners.rawfile_json

this could give us an idea of what is happening.

@MartinEmrich
Author

I am trying to get "test=true" running. I try not to hijack this bug report as a teach-me-salt-cherrypy-api... but the API docs are a bit vague, and I am still trying to find which data structure will convey "test=true" to cherrypy... for now "kwargs": { "test": true } is ignored, "true" and "True" as well. I'll keep trying other variants ("kwarg" w/o "s", positional arg, ...)
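For reference, my current best guess at the request body for a dry run (untested; assuming keyword arguments go into the singular "kwarg" key of the lowstate):

# Hypothetical lowstate for POST /minions with a dry run; "kwarg" (singular)
# is the next variant I will try, it may well be wrong.
body = [{
    "tgt": "someminion",
    "fun": "state.highstate",
    "kwarg": {"test": True},
}]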

The named minion is definitely unique. The VM exists only once (otherwise many other things would go wrong here). Also the minion id exists only once in the output of salt-key -L, and it's green, too.
I also saw no strange things (in fact, nothing) in the minion local logfile.

I have added a cron job to collect the output of salt-run state.event during the next maintenance.

I am also trying to get that event_returner working; after adding this to the master config file, the given file at least was created:

event_return:
  - rawfile_json

rawfile_json.filename: /var/log/salt/event_rawfile.log
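To dig through that file afterwards (it is one JSON event per line), a small helper like this should do:

# Sketch: print timestamp and tag of every event in the rawfile_json log
# that mentions a given jid; path matches the config above.
import json

def events_for(jid, path="/var/log/salt/event_rawfile.log"):
    with open(path) as fh:
        for line in fh:
            if jid not in line:
                continue
            evt = json.loads(line)
            yield evt.get("data", {}).get("_stamp"), evt.get("tag")

for stamp, tag in events_for("<jid>"):
    print(stamp, tag)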

@MartinEmrich
Author

During testing the test=true implementation (the test environment has of course only a few VMs, not 400), I noticed two things so far:

  • System load during processing one instance at a time still rises to 0.15 to 0.2.
  • hitting ps aux regularly, by chance I noticed this process outside of the salt-master process tree, like here (last two lines):
root      1499  0.0  3.2 468204 63992 ?        Ss   May10   0:13 /usr/bin/python3 /usr/bin/salt-api
root      1593  0.6  3.8 2557616 76020 ?       Sl   May10  54:56  \_ /usr/bin/python3 /usr/bin/salt-api
root      6659  0.0  3.8 645652 75868 ?        Ss   May15   0:12 /usr/bin/python3 /usr/bin/salt-master MainProcess
root      6793  0.0  2.8 640868 56636 ?        Sl   May15   0:00  \_ /usr/bin/python3 /usr/bin/salt-master PubServerChannel._publish_daemon
root      6794  0.0  2.9 678488 59080 ?        S    May15   0:00  \_ /usr/bin/python3 /usr/bin/salt-master EventPublisher
root      6797  0.4  3.4 652532 69360 ?        S    May15   4:48  \_ /usr/bin/python3 /usr/bin/salt-master Maintenance
root      6798  0.0  2.6 639112 53636 ?        S    May15   0:01  \_ /usr/bin/python3 /usr/bin/salt-master ReqServer ReqServer_ProcessManager
root      6799  0.0  2.8 876696 56040 ?        Sl   May15   0:02  |   \_ /usr/bin/python3 /usr/bin/salt-master ReqServer MWorkerQueue
root      6801  0.0  4.4 794624 87792 ?        Sl   May15   0:10  |   \_ /usr/bin/python3 /usr/bin/salt-master ReqServer MWorker-0
root      6802  0.0  4.3 784880 86544 ?        Sl   May15   0:08  |   \_ /usr/bin/python3 /usr/bin/salt-master ReqServer MWorker-1
root      6809  0.0  4.5 791032 90852 ?        Sl   May15   0:08  |   \_ /usr/bin/python3 /usr/bin/salt-master ReqServer MWorker-2
root      6810  0.0  4.5 801008 90156 ?        Rl   May15   0:09  |   \_ /usr/bin/python3 /usr/bin/salt-master ReqServer MWorker-3
root      6812  0.0  4.5 801328 89800 ?        Sl   May15   0:08  |   \_ /usr/bin/python3 /usr/bin/salt-master ReqServer MWorker-4
root       907  0.6  3.4 674864 68736 ?        Sl   09:51   0:00  \_ /usr/bin/python3 /usr/bin/salt-master FileserverUpdate
root      3130 13.0  3.7 763892 74208 ?        S    09:52   0:00 /usr/bin/python3 /usr/bin/salt-master ReqServer MWorker-0 ProcessFunc(AsyncClientMixin._proc_function, fun=jobs.list_job jid=
root      3192  0.0  0.3  58804  7780 ?        D    09:52   0:00  \_ /usr/sbin/lvm vgs -o vg_name --noheadings

I wonder why the salt master would spawn an lvm process on the master VM?

For the high load, I guess I will soon have everything in place to diagnose it, and will report back when I find something...

@whytewolf
Collaborator

the lvm command is from the pillar spinning up the minionMaster and filling its own grains. the minionMaster is a version of a minion that is used for running modules from within rendering engines that are used in the master. such as orchestration, reactor, or pillar.

@MartinEmrich
Author

While I did not get around to testing the dry run in time, the maintenance team got lucky(?): another maintenance Wednesday, another job duplicate. Note that this time, the maintenance team again resized the VM to 16GB/8vCPUs prior to the procedure.

I now have the event log file. As it is huge and full of sensitive information, I grepped out the lines with the two relevant job ids. I also included the processes with PIDs as I found them now.

20230524042638207502 is the "magic duplicate", rejected by the minion.
20230524042632117957 is the "real" state.highstate job actually run on the minion.

# master

salt-master-20230524.log:2023-05-24 04:26:32,129 [salt.master      :2479][INFO    ][2044] User cds Published command state.highstate with jid 20230524042632117957
salt-master-20230524.log:2023-05-24 04:26:38,243 [salt.master      :2479][INFO    ][2044] User cds Published command state.highstate with jid 20230524042638207502
salt-master-20230524.log:2023-05-24 04:26:38,468 [salt.utils.job   :85  ][INFO    ][2042] Got return from someminion489 for job 20230524042638207502
salt-master-20230524.log:2023-05-24 04:28:15,547 [salt.utils.job   :85  ][INFO    ][2043] Got return from someminion489 for job 20230524042632117957

# event_rawfile

salt-master-20230524-event_rawfile.log:{"data": {"minions": ["someminion489"], "_stamp": "2023-05-24T04:26:32.118374"}, "tag": "20230524042632117957"}
salt-master-20230524-event_rawfile.log:{"data": {"jid": "20230524042632117957", "tgt_type": "glob", "tgt": "someminion489", "user": "cds", "fun": "state.highstate", "arg": [], "minions": ["someminion489"], "missing": [], "_stamp": "2023-05-24T04:26:32.125620"}, "tag": "salt/job/20230524042632117957/new"}
salt-master-20230524-event_rawfile.log:{"data": {"minions": ["someminion489"], "_stamp": "2023-05-24T04:26:38.207993"}, "tag": "20230524042638207502"}
salt-master-20230524-event_rawfile.log:{"data": {"jid": "20230524042638207502", "tgt_type": "glob", "tgt": "someminion489", "user": "cds", "fun": "state.highstate", "arg": [], "minions": ["someminion489"], "missing": [], "_stamp": "2023-05-24T04:26:38.220949"}, "tag": "salt/job/20230524042638207502/new"}
salt-master-20230524-event_rawfile.log:{"data": {"cmd": "_return", "id": "someminion489", "success": false, "return": ["The function \"state.highstate\" is running as PID 1473 and was started at 2023, May 24 04:26:32.117957 with jid 20230524042632117957"], "retcode": 1, "jid": "20230524042638207502", "fun": "state.highstate", "fun_args": [], "out": "highstate", "_stamp": "2023-05-24T04:26:38.469375"}, "tag": "salt/job/20230524042638207502/ret/someminion489"}
salt-master-20230524-event_rawfile.log:{"data": {"fun": "runner.jobs.list_job", "jid": "20230524042638663334", "user": "cds", "fun_args": [{"jid": "20230524042638207502"}], "_stamp": "2023-05-24T04:26:41.535423"}, "tag": "salt/run/20230524042638663334/new"}
salt-master-20230524-event_rawfile.log:{"data": {"fun": "runner.jobs.list_job", "jid": "20230524042638663334", "user": "cds", "fun_args": [{"jid": "20230524042638207502"}], "_stamp": "2023-05-24T04:26:43.241007", "return": {"jid": "20230524042638207502", "Function": "state.highstate", "Arguments": [], "Target": "someminion489", "Target-type": "glob", "User": "cds", "Minions": ["someminion489"], "StartTime": "2023, May 24 04:26:38.207502", "Result": {"someminion489": {"return": ["The function \"state.highstate\" is running as PID 1473 and was started at 2023, May 24 04:26:32.117957 with jid 20230524042632117957"], "retcode": 1, "success": false, "out": "highstate"}}}, "success": true}, "tag": "salt/run/20230524042638663334/ret"}
salt-master-20230524-event_rawfile.log:{"data": {"cmd": "_return", "id": "someminion489", "success": true, "return": {XXX}, "retcode": 0, "jid": "20230524042632117957", "fun": "state.highstate", "fun_args": [], "out": "highstate", "_stamp": "2023-05-24T04:28:15.547914"}, "tag": "salt/job/20230524042632117957/ret/someminion489"}

# processes (ps aux)

root      1080  0.0  0.4 571340 70332 ?        Ss   03:54   0:03 /usr/bin/python3 /usr/bin/salt-master MainProcess
root      1923  0.0  0.5 575036 86184 ?        Sl   03:54   0:03  \_ /usr/bin/python3 /usr/bin/salt-master PubServerChannel._publish_daemon
root      1927  0.2  0.7 629532 118332 ?       S    03:54   0:25  \_ /usr/bin/python3 /usr/bin/salt-master EventPublisher
root      1968  0.4  0.4 589604 74504 ?        S    03:54   0:46  \_ /usr/bin/python3 /usr/bin/salt-master Maintenance
root      2035  0.3  0.4 585624 69548 ?        Sl   03:54   0:33  \_ /usr/bin/python3 /usr/bin/salt-master EventReturn
root      2036  0.0  0.3 568588 58488 ?        S    03:54   0:00  \_ /usr/bin/python3 /usr/bin/salt-master ReqServer ReqServer_ProcessManager
root      2041  0.3  0.7 724384 113708 ?       Sl   03:54   0:39  |   \_ /usr/bin/python3 /usr/bin/salt-master ReqServer MWorkerQueue
root      2042  1.5  0.5 722424 92748 ?        Sl   03:54   2:39  |   \_ /usr/bin/python3 /usr/bin/salt-master ReqServer MWorker-0
root      2043  1.6  0.5 722364 92876 ?        Sl   03:54   2:47  |   \_ /usr/bin/python3 /usr/bin/salt-master ReqServer MWorker-1
root      2044  1.5  0.5 722392 93084 ?        Sl   03:54   2:40  |   \_ /usr/bin/python3 /usr/bin/salt-master ReqServer MWorker-2
root      2050  1.5  0.5 722432 92848 ?        Sl   03:54   2:43  |   \_ /usr/bin/python3 /usr/bin/salt-master ReqServer MWorker-3
root      4291  0.7  0.4 602000 69012 ?        Sl   06:45   0:01  \_ /usr/bin/python3 /usr/bin/salt-master FileserverUpdate
root      1081  0.0  0.3 400660 61024 ?        Ss   03:54   0:01 /usr/bin/python3 /usr/bin/salt-api MainProcess
root      1662  4.2  1.7 5446128 272132 ?      Sl   03:54   7:23  \_ /usr/bin/python3 /usr/bin/salt-api RunNetapi(salt.loaded.int.netapi.rest_cherrypy)

If you need more info, please let me know.

Nevertheless I was not totally idle: on my test environment, while having only a few VMs, I noticed the salt master reaching a load avg of up to 0.2 during a single state.highstate job, so maybe I can reproduce it here...

@whytewolf
Collaborator

interesting. the same pid caught both jobs. this wasn't some race condition in picking up the job. it looks like it acted like it was called a second time. also with the second job being started almost six seconds later there must be something else going on here.

Do you have any logs in the /var/log/salt/api log around the same timestamps? 2023-05-24 04:26:32,129 & 2023-05-24 04:26:38,243

@MartinEmrich
Author

@whytewolf yes, I have the API logs, but they seem useless, as the minion ID is only in the request body (so the access log does not show which POST belongs to which minion):

2023-05-24 04:26:31,715 [cherrypy.access.140256773553680:282 ][INFO    ][1662] x.x.x.x - - [24/May/2023:04:26:31] "GET /jobs/20230524042419585899 HTTP/1.1" 200 281 "" "Apache-HttpClient/4.5.13 (Java/17.0.7)"
2023-05-24 04:26:31,777 [cherrypy.access.140256773553680:282 ][INFO    ][1662] x.x.x.x - - [24/May/2023:04:26:31] "GET /jobs/20230524042334935911 HTTP/1.1" 200 283 "" "Apache-HttpClient/4.5.13 (Java/17.0.7)"
2023-05-24 04:26:32,113 [cherrypy.access.140256773553680:282 ][INFO    ][1662] x.x.x.x - - [24/May/2023:04:26:32] "GET /jobs/20230524042504797899 HTTP/1.1" 200 287 "" "Apache-HttpClient/4.5.13 (Java/17.0.7)"
2023-05-24 04:26:32,817 [cherrypy.access.140256773553680:282 ][INFO    ][1662] x.x.x.x - - [24/May/2023:04:26:32] "GET /jobs/20230524042442458484 HTTP/1.1" 200 287 "" "Apache-HttpClient/4.5.13 (Java/17.0.7)"
2023-05-24 04:26:34,269 [cherrypy.access.140256773553680:282 ][INFO    ][1662] x.x.x.x - - [24/May/2023:04:26:34] "POST /minions HTTP/1.1" 202 150 "" "Apache-HttpClient/4.5.13 (Java/17.0.7)"
2023-05-24 04:26:36,111 [cherrypy.access.140256773553680:282 ][INFO    ][1662] x.x.x.x - - [24/May/2023:04:26:36] "GET /jobs/20230524042444981041 HTTP/1.1" 200 283 "" "Apache-HttpClient/4.5.13 (Java/17.0.7)"
2023-05-24 04:26:37,224 [cherrypy.access.140256773553680:282 ][INFO    ][1662] x.x.x.x - - [24/May/2023:04:26:37] "GET /jobs/20230524042517378915 HTTP/1.1" 200 287 "" "Apache-HttpClient/4.5.13 (Java/17.0.7)"
2023-05-24 04:26:37,328 [cherrypy.access.140256773553680:282 ][INFO    ][1662] x.x.x.x - - [24/May/2023:04:26:37] "GET /jobs/20230524042534207528 HTTP/1.1" 200 283 "" "Apache-HttpClient/4.5.13 (Java/17.0.7)"
2023-05-24 04:26:37,402 [cherrypy.access.140256773553680:282 ][INFO    ][1662] x.x.x.x - - [24/May/2023:04:26:37] "GET /jobs/20230524042541460397 HTTP/1.1" 200 287 "" "Apache-HttpClient/4.5.13 (Java/17.0.7)"
2023-05-24 04:26:37,436 [cherrypy.access.140256773553680:282 ][INFO    ][1662] x.x.x.x - - [24/May/2023:04:26:37] "GET /jobs/20230524042404907278 HTTP/1.1" 200 283 "" "Apache-HttpClient/4.5.13 (Java/17.0.7)"
2023-05-24 04:26:37,555 [cherrypy.access.140256773553680:282 ][INFO    ][1662] x.x.x.x - - [24/May/2023:04:26:37] "GET /jobs/20230524042509058599 HTTP/1.1" 200 287 "" "Apache-HttpClient/4.5.13 (Java/17.0.7)"
2023-05-24 04:26:37,583 [cherrypy.access.140256773553680:282 ][INFO    ][1662] x.x.x.x - - [24/May/2023:04:26:37] "GET /jobs/20230524042452054385 HTTP/1.1" 200 287 "" "Apache-HttpClient/4.5.13 (Java/17.0.7)"
2023-05-24 04:26:37,883 [cherrypy.access.140256773553680:282 ][INFO    ][1662] x.x.x.x - - [24/May/2023:04:26:37] "GET /jobs/20230524042342656387 HTTP/1.1" 200 283 "" "Apache-HttpClient/4.5.13 (Java/17.0.7)"
2023-05-24 04:26:37,915 [cherrypy.access.140256773553680:282 ][INFO    ][1662] x.x.x.x - - [24/May/2023:04:26:37] "GET /jobs/20230524042228790675 HTTP/1.1" 200 281 "" "Apache-HttpClient/4.5.13 (Java/17.0.7)"
2023-05-24 04:26:38,107 [cherrypy.access.140256773553680:282 ][INFO    ][1662] x.x.x.x - - [24/May/2023:04:26:38] "GET /jobs/20230524042325413814 HTTP/1.1" 200 283 "" "Apache-HttpClient/4.5.13 (Java/17.0.7)"
2023-05-24 04:26:38,264 [cherrypy.access.140256773553680:282 ][INFO    ][1662] x.x.x.x - - [24/May/2023:04:26:38] "POST /minions HTTP/1.1" 202 150 "" "Apache-HttpClient/4.5.13 (Java/17.0.7)"
2023-05-24 04:26:38,325 [cherrypy.access.140256773553680:282 ][INFO    ][1662] x.x.x.x - - [24/May/2023:04:26:38] "GET /jobs/20230524042504458050 HTTP/1.1" 200 283 "" "Apache-HttpClient/4.5.13 (Java/17.0.7)"
2023-05-24 04:26:38,403 [cherrypy.access.140256773553680:282 ][INFO    ][1662] x.x.x.x - - [24/May/2023:04:26:38] "GET /jobs/20230524042532956903 HTTP/1.1" 200 287 "" "Apache-HttpClient/4.5.13 (Java/17.0.7)"
2023-05-24 04:26:38,485 [cherrypy.access.140256773553680:282 ][INFO    ][1662] x.x.x.x - - [24/May/2023:04:26:38] "GET /jobs/20230524042558503019 HTTP/1.1" 200 287 "" "Apache-HttpClient/4.5.13 (Java/17.0.7)"
2023-05-24 04:26:38,815 [cherrypy.access.140256773553680:282 ][INFO    ][1662] x.x.x.x - - [24/May/2023:04:26:38] "GET /jobs/20230524042232395517 HTTP/1.1" 200 281 "" "Apache-HttpClient/4.5.13 (Java/17.0.7)"
2023-05-24 04:26:39,562 [cherrypy.access.140256773553680:282 ][INFO    ][1662] x.x.x.x - - [24/May/2023:04:26:39] "POST /minions HTTP/1.1" 202 150 "" "Apache-HttpClient/4.5.13 (Java/17.0.7)"
2023-05-24 04:26:40,354 [cherrypy.access.140256773553680:282 ][INFO    ][1662] x.x.x.x - - [24/May/2023:04:26:40] "GET /jobs/20230524042418896635 HTTP/1.1" 200 281 "" "Apache-HttpClient/4.5.13 (Java/17.0.7)"
2023-05-24 04:26:40,524 [cherrypy.access.140256773553680:282 ][INFO    ][1662] x.x.x.x - - [24/May/2023:04:26:40] "GET /jobs/20230524042446722365 HTTP/1.1" 200 287 "" "Apache-HttpClient/4.5.13 (Java/17.0.7)"
2023-05-24 04:26:40,954 [cherrypy.access.140256773553680:282 ][INFO    ][1662] x.x.x.x - - [24/May/2023:04:26:40] "GET /jobs/20230524042447441576 HTTP/1.1" 200 283 "" "Apache-HttpClient/4.5.13 (Java/17.0.7)"

But I already took a full packet capture some time ago when it happened (see OP), to verify that only one POST /minions request goes over the wire for the duplicate job. The other POST requests are for other minions.

the client service logged the request for this minion (before sending it) at 04:25:51, so it took about half a minute until it got processed in salt.

@whytewolf
Collaborator

how are you doing targeting? from some of the logs it looks like you are targeting each minion individually, creating a new event for each minion on its own. is it possible to switch to targeting with multiple targets?
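for example (rough sketch), a single lowstate with a list target instead of one POST per minion:

# Illustrative only: one job covering a batch of minions via a list target.
body = [{
    "tgt": ["someminion489", "someminion490", "someminion513"],
    "tgt_type": "list",
    "fun": "state.highstate",
}]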

@MartinEmrich
Author

Yes, the jobs are for individual minions; that's part of the service calling salt, which schedules the jobs for individual minions to satisfy certain ordering/scheduling constraints (most of the minions are part of node clusters, where only one node may be down/under maintenance at the same time).

Each job is polled every 10s, and to lighten the load I already implemented a limit of 50 tasks in the calling service (effectively resulting in 50 concurrent Salt jobs at any given time).
Thus the average of 5 polls per second.

@whytewolf
Collaborator

ok. the output from the api looks like it is only at info level. can you turn up the logging to debug? it should have more information about the events being sent from it. i am trying to figure out if the duplicate events are coming from the master, or from within the api process.

@MartinEmrich
Author

ok. I have set:

log_level: debug
log_level_logfile: debug

in /etc/salt/master (was "profile" before). If the API process needs specific configuration, too, please let me know.

@whytewolf
Collaborator

nope, that should be the same setting. debug is after profile, and the api doesn't have anything that gets profiled.

@MartinEmrich
Author

MartinEmrich commented Jul 5, 2023

Quick note: Today I am present during the maintenance window and took a quick peek. The processes eating up CPU look like this in ps aux:

root     26432 58.0  0.2 723580 84024 ?        R    04:55   0:00 /usr/bin/python3 /usr/bin/salt-master ReqServer MWorker-0 ProcessFunc(AsyncClientMixin._proc_function, fun=jobs.list_job jid=20230705045525099
root     26434 60.0  0.2 723664 84540 ?        R    04:55   0:00 /usr/bin/python3 /usr/bin/salt-master ReqServer MWorker-2 ProcessFunc(AsyncClientMixin._proc_function, fun=jobs.list_job jid=20230705045525082
root     26734 51.0  0.2 723408 84152 ?        S    04:55   0:00 /usr/bin/python3 /usr/bin/salt-master ReqServer MWorker-2 ProcessFunc(AsyncClientMixin._proc_function, fun=jobs.list_job jid=20230705045525289
root     26793 47.0  0.2 723408 84148 ?        S    04:55   0:00 /usr/bin/python3 /usr/bin/salt-master ReqServer MWorker-2 ProcessFunc(AsyncClientMixin._proc_function, fun=jobs.list_job jid=20230705045525340
root     27647  0.0  0.0 176152  3004 ?        R    04:55   0:00  \_ rpm --eval %{_host_cpu}
root     27153  0.0  0.2 714320 82704 ?        R    04:55   0:00 /usr/bin/python3 /usr/bin/salt-master ReqServer MWorker-3 ProcessFunc(AsyncClientMixin._proc_function, fun=jobs.list_job jid=20230705045525659
root     27174  0.0  0.2 713512 81824 ?        S    04:55   0:00 /usr/bin/python3 /usr/bin/salt-master ReqServer MWorker-0 ProcessFunc(AsyncClientMixin._proc_function, fun=jobs.list_job jid=20230705045525707
root     27693  0.0  0.0  15932  1788 ?        R    04:55   0:00  \_ /usr/bin/man zpool-features
root     27230  0.0  0.2 713596 82192 ?        S    04:55   0:00 /usr/bin/python3 /usr/bin/salt-master ReqServer MWorker-2 ProcessFunc(AsyncClientMixin._proc_function, fun=jobs.list_job jid=20230705045525790
root     27237  0.0  0.2 714120 82556 ?        R    04:55   0:00 /usr/bin/python3 /usr/bin/salt-master ReqServer MWorker-1 ProcessFunc(AsyncClientMixin._proc_function, fun=jobs.list_job jid=20230705045525797

So somehow jobs.list_job is very expensive?

Some of them run, IMHO, totally unrelated processes on the salt master, which might be the cause of the load.
(of course this alone would not necessarily produce the duplicate jobs)

@whytewolf
Collaborator

which master_job_cache are you using? the default localfs one?

if so then yes, it is going to be expensive. you're hitting the master with thousands of jobs, and each job is a separate file that gets saved to the filesystem. they are kept for 24 hours. so jobs.list_jobs is going to spend its time looking up each and every job file created on the filesystem.

and this CAN have an effect on the ipc transfer, as it transfers through socket files on the filesystem. if the filesystem is overloaded because it has to work hard to build the jobs list, it potentially could cause an os-level block on the ipc.

see https://docs.saltproject.io/en/latest/topics/tutorials/intro_scale.html
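as a rough way to gauge how big that cache has grown (sketch, assuming the default cachedir):

# Count directories in the default localfs job cache; every jid gets its own
# directory tree under here, kept for keep_jobs hours (24 by default).
import os

jobs_dir = "/var/cache/salt/master/jobs"
count = sum(len(dirs) for _, dirs, _ in os.walk(jobs_dir))
print(f"{count} directories under {jobs_dir}")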

@MartinEmrich
Author

@whytewolf I have not configured anything related to the job cache, so it is the default.
But there are only a few hundred jobs (unless listing/looking up a job counts as a new one).
Storage and I/O are barely used (they peak at 300 write IOPS, 3MB/s on AWS's SSD-based EBS volumes).

I now suspect that "minionMaster" concept you mentioned earlier. If every GET on a job results in running countless child processes on the master (running rpm, looking at local LVM, reading manpages that are not installed, etc.), that would be the point I would look at next. Can I turn this off to compare?

Sadly (or luckily maybe), we had no duplicate jobs during maintenance this time. Nevertheless I put the log files (including DEBUG) aside... If there's something to look for, please let me know.

@whytewolf
Collaborator

so, no, minionMaster is not something that can be turned off; it is central to how pillar within salt works. and the only reason i mentioned it was because you asked about the lvm process. it was never a consideration for the issue, as the minionMaster is a part of pillar, not how jobs are grabbed off the publish bus.

since the logs we care about would be when there are duplicates. let me know if you get some logs with duplicates.

@whytewolf
Collaborator

to note: the 3006.3 version will have some configuration options added that should help with blocking processes, which in turn might help reduce the duplicates. so once that is released we will want you to test that version.

@MartinEmrich
Author

Ok. During the last few maintenances, the issue did not appear (the team always provides a bigger VM for the Salt master). I still have the event_file turned on and debug level enabled.

For another issue, I tried the current 3006 RC yesterday, but got hit by another bug, so I am stuck on 3005 for now.

@whytewolf
Collaborator

3006 has been out of RC for several months now. what version did you try exactly?

@MartinEmrich
Author

Huh, some advanced browser-history-digging :)

I had this issue during testing something else around here: #62851
So I tried these RC builds: https://repo.saltproject.io/salt_rc/salt/py3/amazon/2/x86_64/latest/
This fixed the first issue, but with them, I got this issue: #64118 (fix seems already committed somewhere), so I rolled back, and rolled back that python module.

But don't mind, all of that had nothing to do with this issue here.

@whytewolf
Collaborator


Ahh that issue was fixed in 3006.1, current latest is 3006.2 which is a CVE release, with 3006.3 right around the corner.

@whytewolf
Collaborator


humm. unfortunately, without duplicating the issue, with a larger vm being able to handle the work without duplicates, and with potential load fixes coming in the next 3006.x release which is around the corner, I think I am going to have to close this. when 3006.3 is released please give it a try, and if it still has issues open up a new ticket.
