Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] After Upgrading, Windows Minion RAM Usage 7GB+ #62706

Closed
Silvenga opened this issue Sep 17, 2022 · 16 comments
Closed

[BUG] After Upgrading, Windows Minion RAM Usage 7GB+ #62706

Silvenga opened this issue Sep 17, 2022 · 16 comments
Assignees
Labels
Bug broken, incorrect, or confusing behavior cannot-reproduce cannot be replicated with info/context provided Performance Regression The issue is a bug that breaks functionality known to work in previous releases. Windows

Comments

@Silvenga
Copy link

Description
After upgrading to 3005 from 3002.9 we are seeing heavy Salt minion memory usage across all our Windows servers. On this server with enough memory (96GB), total RAM usage of the Salt Minion is around 7GB. On more memory constrained Windows instances, the Salt Minion eats nearly all the available memory and starts paging. For example, on a 4GB server, Salt RAM usage is around 3GB.

Get-Process | ? ProcessName -eq python| select @{Name="RAMUsage";Expression={$_.WorkingSet64 / 1MB}} | measure -Sum RAMUsage

Count    : 55
Average  :
Sum      : 7331.484375
Maximum  :
Minimum  :
Property : RAMUsage

On this server there are 55 salt-minion processes, each using about 130MB of ram each:

Get-Process | ? ProcessName -eq python| select name, id, starttime, @{Name="RAMUsage";Expression={$_
.WorkingSet64 / 1MB}}

Name      Id StartTime                 RAMUsage
----      -- ---------                 --------
python  1008 9/11/2022 2:49:49 PM    69.3203125
python  1784 9/12/2022 1:58:43 AM   136.7578125
python  2084 9/11/2022 2:20:25 PM  147.26953125
python  2396 9/13/2022 2:54:43 PM   137.4765625
python  3344 9/11/2022 2:40:27 PM      69.15625
python  3908 9/11/2022 7:44:28 PM  136.02734375
python  3980 9/12/2022 11:14:46 AM 136.01171875
python  4632 9/11/2022 2:07:02 PM   147.6015625
python  4964 9/12/2022 5:21:47 PM   136.1171875
python  5052 9/11/2022 1:36:55 PM     136.96875
python  6884 9/13/2022 5:56:07 PM     137.84375
python  7472 9/11/2022 2:30:02 PM     69.234375
python  7824 9/11/2022 1:36:21 PM     96.453125
python  8440 9/11/2022 10:49:07 PM 136.05078125
python  8500 9/12/2022 4:59:36 AM   133.9453125
python  9096 9/11/2022 2:54:43 PM     68.984375
python 10164 9/11/2022 2:44:14 PM   69.44921875
python 11224 9/11/2022 2:56:26 PM    135.015625
python 11256 9/11/2022 4:37:44 PM      135.6875
python 11328 9/14/2022 3:23:16 PM  138.13671875
python 11424 9/14/2022 6:10:12 AM  137.94921875
python 11588 9/12/2022 11:36:54 PM  134.8203125
python 11924 9/14/2022 9:11:45 AM  136.58984375
python 12084 9/12/2022 2:15:28 PM  136.20703125
python 12652 9/14/2022 3:09:44 AM  138.05859375
python 13232 9/12/2022 8:08:16 AM  135.95703125
python 13348 9/12/2022 8:30:22 PM   135.6484375
python 13364 9/13/2022 8:59:03 PM    137.234375
python 13444 9/13/2022 2:40:25 AM  136.33203125
python 13860 9/13/2022 5:47:04 AM   137.9609375
python 14012 9/13/2022 8:47:31 AM   137.5078125
python 14044 9/14/2022 12:05:50 AM   137.703125
python 14560 9/13/2022 11:53:10 AM 137.48046875
python 14832 9/14/2022 12:14:59 PM 137.90234375
python 14964 9/15/2022 9:47:06 AM    139.140625
python 15376 9/15/2022 3:40:28 AM   138.2890625
python 15592 9/15/2022 6:57:40 PM  142.01171875
python 16932 9/14/2022 6:27:59 PM  138.32421875
python 17332 9/14/2022 9:28:52 PM  138.66796875
python 17432 9/15/2022 6:46:26 AM  138.46484375
python 17440 9/15/2022 12:53:42 PM   139.921875
python 17464 9/15/2022 12:31:12 AM 136.81640625
python 17700 9/16/2022 10:25:45 AM  148.3984375
python 18484 9/15/2022 10:01:28 PM 141.93359375
python 18524 9/15/2022 3:55:10 PM  141.26953125
python 19492 9/16/2022 7:35:29 PM     151.53125
python 19716 9/16/2022 7:17:40 AM  145.18359375
python 19808 9/16/2022 1:03:17 AM  143.23828125
python 19904 9/16/2022 1:30:48 PM    148.578125
python 19916 9/16/2022 4:11:25 AM  143.73046875
python 20848 9/17/2022 4:52:24 AM      155.0625
python 20952 9/16/2022 4:31:28 PM   149.3046875
python 21808 9/17/2022 1:44:41 AM  152.70703125
python 22220 9/16/2022 10:41:58 PM 151.81640625
python 22304 9/17/2022 7:56:50 AM    156.234375
Get-Date
Saturday, September 17, 2022 10:18:59 AM

Setup

Windows 2016 instances (both bare metal and virtualized via Hyper-V). Version 3005, via the MSI installer.

Steps to Reproduce the behavior

Nothing interesting in the logs.

Minion configuration is close to defaults:

schedule:
  __mine_interval: {enabled: true, function: mine.update, jid_include: true, maxrunning: 2,
    minutes: 60, return_job: false, run_on_start: true}
grains_cache: true

# Not documented.
# https://github.com/saltstack/salt/issues/48773
disable_grains:
  - esxi

disable_modules:
  - vsphere
  # Just annoying:
  - boto3_elasticsearch
  - victorops
  - ifttt
  - pushbullet
  - zfs

Expected behavior

RAM usage should be constrained.

Screenshots

These servers are Core instances, so not much of a UI. This is a screenshot from a Hypervisor (96GB of ram, 7GB used by Salt).
image

Versions Report

salt --versions-report (Provided by running salt --versions-report. Please also mention any differences in master/minion versions.)
C:\Program Files\Salt Project\Salt\bin\lib\site-packages\_distutils_hack\__init__.py:33: UserWarning: Setuptools is replacing distutils.
  warnings.warn("Setuptools is replacing distutils.")
Salt Version:
          Salt: 3005

Dependency Versions:
          cffi: 1.14.6
      cherrypy: 18.6.1
      dateutil: 2.8.1
     docker-py: Not Installed
         gitdb: 4.0.7
     gitpython: Not Installed
        Jinja2: 3.1.0
       libgit2: Not Installed
      M2Crypto: Not Installed
          Mako: 1.1.4
       msgpack: 1.0.2
  msgpack-pure: Not Installed
  mysql-python: Not Installed
     pycparser: 2.21
      pycrypto: Not Installed
  pycryptodome: 3.10.1
        pygit2: Not Installed
        Python: 3.8.13 (tags/v3.8.13:ea67321, Aug 22 2022, 17:16:10) [MSC v.1929 64 bit (AMD64)]
  python-gnupg: 0.4.8
        PyYAML: 5.4.1
         PyZMQ: 19.0.0
         smmap: 4.0.0
       timelib: 0.2.4
       Tornado: 4.5.3
           ZMQ: 4.3.2

System Versions:
          dist:
        locale: cp1252
       machine: AMD64
       release: 2016Server
        system: Windows
       version: 2016Server 10.0.14393 SP0 Multiprocessor Free

Additional context

All Windows instances appear to be impacted.

@Silvenga Silvenga added Bug broken, incorrect, or confusing behavior needs-triage labels Sep 17, 2022
@Silvenga
Copy link
Author

A little more context before I start restarting minions:

salt-call --local saltutil.running
local:

(Same server as above)

@OrangeDog OrangeDog added Regression The issue is a bug that breaks functionality known to work in previous releases. Performance labels Sep 20, 2022
@OrangeDog
Copy link
Contributor

Did you upgrade through each major version, or jump straight to 3005?

If you install setproctitle you may get better process names to aid investigation.

@Silvenga
Copy link
Author

This was a clean uninstall/install to move to the new directory structure (so complete removal of the C:\salt directory). A direct jump to 3005 from 3002.9. No cache should have carried over, there were new keys, etc. This also happened with a new Windows instance that was just just built out (never had salt installed).

The salt-minion installer broke when attempting to upgrade (spinned for days), so manual removal and reinstall (we also moved from the EXE to MSI, the later is much easier to debug).

How would one use setproctitle under Windows?

https://github.com/dvarrazzo/py-setproctitle#module-status
Note that on Windows there is no way to change the process string: what the module does is to create a Named Object whose value can be read using a tool such as Process Explorer (contribution of a more useful tool to be used together with setproctitle would be well accepted).

@OrangeDog
Copy link
Contributor

How would one use setproctitle under Windows?

As the quote says, probably by using Process Explorer instead of the Windows Process Manager.

@Silvenga
Copy link
Author

Silvenga commented Sep 20, 2022

So setproctitle is already being used by salt?

@Silvenga
Copy link
Author

Ah, found it, it creates a handle via a system mutex.

Schedule(name=Highstate, jid=20220920040540366404)
Schedule(name=Highstate, jid=20220920070650373582)

Looks like the processes used for scheduled runs keep running after completion. I don't see these jid's being reported back to the salt-master.

@OrangeDog
Copy link
Contributor

There was a change in schedule config at some point, though I don't remember specifically when.
Might be worth a check:

Are you setting the schedule via a state or pillar?
Ensure job_args is a list and job_kwargs is a dict, and possibly purge the schedule and re-create.

@Silvenga
Copy link
Author

We are setting schedule via pillar:

schedule:
  Highstate:
    function: state.highstate
    minutes: 180
    splay: 600

(you can tell how old this is by usage of state.highstate - this file hasn't been changed in 5 years, existing before we moved to gitfs! 😸)

I want to say this looks valid per https://docs.saltproject.io/salt/user-guide/en/latest/topics/scheduler.html

Highstate jobs are still executing at the expected 180 minute interval on these nodes (I can see the job return data on the salt-master). I'm not sure where the Schedule(name=Highstate, jid=20220920040540366404) jid is, there's no job return for that id. The job isn't in the master's cache (although, older highstate jobs are for this node, judging when a service.restart salt-minion job was executed last night).

@cmcmarrow
Copy link
Contributor

Hi @Silvenga we are trying to reproduce your process leak and are having no luck. Can you help us by telling what states and modules your using? Do you know what your doing when salt minion is making the process that is hanging? For example are you running some kind of state file around the same time the hanging process spawns?

@Silvenga
Copy link
Author

what states and modules your using

Does this help?

Highstate output for a Windows node (Hyper-V host)
----------
          ID: Ensure enhanced telemetry is disabled
    Function: reg.present
        Name: HKLM\Software\Microsoft\Windows\CurrentVersion\Policies\DataCollection
      Result: True
     Comment: AllowTelemetry in HKLM\Software\Microsoft\Windows\CurrentVersion\Policies\DataCollection is already present
     Started: 20:59:29.078658
    Duration: 15.628 ms
     Changes:
----------
          ID: Ensure Remote management via rdp is enabled
    Function: rdp.enabled
      Result: True
     Comment: RDP is enabled
     Started: 20:59:29.094286
    Duration: 631.372 ms
     Changes:
----------
          ID: Ensure hostname is correct
    Function: system.hostname
        Name: gh2
      Result: True
     Comment: Hostname is already set to 'gh2'
     Started: 20:59:29.741304
    Duration: 79.351 ms
     Changes:
----------
          ID: Ensure disk performance counters are enabled
    Function: cmd.run
        Name: diskperf -Y
      Result: True
     Comment: onlyif condition is false
     Started: 20:59:29.820655
    Duration: 8587.052 ms
     Changes:
----------
          ID: Ensure Server Manager performace counters are enabled at startup
    Function: cmd.run
        Name: Enable-ScheduledTask -TaskName 'Server Manager Performance Monitor' -TaskPath '\Microsoft\Windows\PLA\'
      Result: True
     Comment: unless condition is true
     Started: 20:59:38.407707
    Duration: 1934.249 ms
     Changes:
----------
          ID: Ensure Server Manager performace counters are running
    Function: cmd.run
        Name: logman start "Server Manager Performance Monitor"
      Result: True
     Comment: unless condition is true
     Started: 20:59:40.341956
    Duration: 829.823 ms
     Changes:
----------
          ID: Ensure the pcm Telegraf service grain is set
    Function: grains.list_present
        Name: telegraf
      Result: True
     Comment: Value pcm is already in grain telegraf
     Started: 20:59:41.171779
    Duration: 0.0 ms
     Changes:
----------
          ID: Ensure Timezone
    Function: timezone.system
        Name: America/Chicago
      Result: True
     Comment: Timezone America/Chicago already set
     Started: 20:59:41.171779
    Duration: 15.626 ms
     Changes:
----------
          ID: Ensure minion performance options are set
    Function: file.managed
        Name: C:\ProgramData\Salt Project\Salt\conf\minion.d\performance.conf
      Result: True
     Comment: File C:\ProgramData\Salt Project\Salt\conf\minion.d\performance.conf is in the correct state
     Started: 20:59:41.187405
    Duration: 219.879 ms
     Changes:
----------
          ID: Ensure the SaltStack minion is installed
    Function: pkg.installed
        Name: salt-minion-py3
      Result: True
     Comment: All specified packages are already installed and are at the desired version
     Started: 20:59:41.425375
    Duration: 6475.776 ms
     Changes:
----------
          ID: Ensure MP exclusion for [C:\Program Files\Salt Project\Salt] exists
    Function: cmd.script
        Name: salt://infra/salt/files/update-mp-exclusion.ps1
      Result: True
     Comment:
     Started: 20:59:47.901151
    Duration: 1600.777 ms
     Changes:
----------
          ID: Ensure smartmontools are installed
    Function: pkg.installed
        Name: smartmontools
      Result: True
     Comment: All specified packages are already installed and are at the desired version
     Started: 20:59:49.517557
    Duration: 174.049 ms
     Changes:
----------
          ID: Ensure smartd.conf is correct
    Function: file.managed
        Name: C:\Program Files\smartmontools\bin\smartd.conf
      Result: True
     Comment: File C:\Program Files\smartmontools\bin\smartd.conf is in the correct state
     Started: 20:59:49.691606
    Duration: 220.005 ms
     Changes:
----------
          ID: Ensure the smartd service is installed
    Function: cmd.run
        Name: & "C:\Program Files\smartmontools\bin\smartd.exe" install -c "C:\Program Files\smartmontools\bin\smartd.conf"
      Result: True
     Comment: No changes detected
     Started: 20:59:50.285359
    Duration: 0.0 ms
     Changes:
----------
          ID: Ensure smartd is enabled and is running
    Function: service.running
        Name: smartd
      Result: True
     Comment: The service smartd is already running
     Started: 20:59:50.285359
    Duration: 193.729 ms
     Changes:
----------
          ID: Ensure duplicati dependencies are installed
    Function: pkg.installed
        Name: vcredist-2015
      Result: True
     Comment: All specified packages are already installed
     Started: 20:59:50.479088
    Duration: 109.372 ms
     Changes:
----------
          ID: Ensure duplicati is installed
    Function: pkg.installed
        Name: duplicati
      Result: True
     Comment: All specified packages are already installed and are at the desired version
     Started: 20:59:50.588460
    Duration: 173.041 ms
     Changes:
----------
          ID: Ensure local db directory exists
    Function: file.directory
        Name: C:\Program Files\Duplicati 2\local-configs
      Result: True
     Comment: The directory C:\Program Files\Duplicati 2\local-configs is in the correct state
     Started: 20:59:50.761501
    Duration: 15.624 ms
     Changes:
----------
          ID: Ensure nightly backup script exists
    Function: file.managed
        Name: C:\Program Files\Duplicati 2\local-configs\backup-nightly.ps1
      Result: True
     Comment: File C:\Program Files\Duplicati 2\local-configs\backup-nightly.ps1 is in the correct state
     Started: 20:59:50.777125
    Duration: 214.42 ms
     Changes:
----------
          ID: Ensure .Net 2.0 and 3.5 are installed
    Function: win_servermanager.installed
        Name: Net-Framework-Core
      Result: True
     Comment: Installed the following:
              - Net-Framework-Core
     Started: 20:59:50.991545
    Duration: 3634.698 ms
     Changes:
----------
          ID: Ensure Icinga2 is installed
    Function: pkg.installed
        Name: icinga2
      Result: True
     Comment: All specified packages are already installed and are at the desired version
     Started: 20:59:54.626243
    Duration: 253.537 ms
     Changes:
----------
          ID: Ensure Icinga2 Powershell modules is installed
    Function: file.managed
        Name: C:\Program Files\WindowsPowerShell\Modules\Icinga2Agent\Icinga2Agent.psm1
      Result: True
     Comment: File C:\Program Files\WindowsPowerShell\Modules\Icinga2Agent\Icinga2Agent.psm1 is in the correct state
     Started: 20:59:54.881778
    Duration: 96.994 ms
     Changes:
----------
          ID: Ensure check_windows_updates.ps1 is installed
    Function: file.managed
        Name: C:\Program Files\ICINGA2\sbin\check_windows_updates.ps1
      Result: True
     Comment: File C:\Program Files\ICINGA2\sbin\check_windows_updates.ps1 is in the correct state
     Started: 20:59:54.978772
    Duration: 42.074 ms
     Changes:
----------
          ID: Ensure check_pending_reboot.ps1 is installed
    Function: file.managed
        Name: C:\Program Files\ICINGA2\sbin\check_pending_reboot.ps1
      Result: True
     Comment: File C:\Program Files\ICINGA2\sbin\check_pending_reboot.ps1 is in the correct state
     Started: 20:59:55.020846
    Duration: 187.512 ms
     Changes:
----------
          ID: Remove conflicting icinga2 config file apt.conf
    Function: file.absent
        Name: C:\ProgramData\icinga2\etc\icinga2\conf.d\apt.conf
      Result: True
     Comment: File C:\ProgramData\icinga2\etc\icinga2\conf.d\apt.conf is not present
     Started: 20:59:55.208358
    Duration: 15.769 ms
     Changes:
----------
          ID: Remove conflicting icinga2 config file downtimes.conf
    Function: file.absent
        Name: C:\ProgramData\icinga2\etc\icinga2\conf.d\downtimes.conf
      Result: True
     Comment: File C:\ProgramData\icinga2\etc\icinga2\conf.d\downtimes.conf is not present
     Started: 20:59:55.224127
    Duration: 1.069 ms
     Changes:
----------
          ID: Remove conflicting icinga2 config file hosts.conf
    Function: file.absent
        Name: C:\ProgramData\icinga2\etc\icinga2\conf.d\hosts.conf
      Result: True
     Comment: File C:\ProgramData\icinga2\etc\icinga2\conf.d\hosts.conf is not present
     Started: 20:59:55.225196
    Duration: 0.0 ms
     Changes:
----------
          ID: Remove conflicting icinga2 config file satellite.conf
    Function: file.absent
        Name: C:\ProgramData\icinga2\etc\icinga2\conf.d\satellite.conf
      Result: True
     Comment: File C:\ProgramData\icinga2\etc\icinga2\conf.d\satellite.conf is not present
     Started: 20:59:55.225196
    Duration: 0.0 ms
     Changes:
----------
          ID: Remove conflicting icinga2 config file templates.conf
    Function: file.absent
        Name: C:\ProgramData\icinga2\etc\icinga2\conf.d\templates.conf
      Result: True
     Comment: File C:\ProgramData\icinga2\etc\icinga2\conf.d\templates.conf is not present
     Started: 20:59:55.225196
    Duration: 0.0 ms
     Changes:
----------
          ID: Remove conflicting icinga2 config file users.conf
    Function: file.absent
        Name: C:\ProgramData\icinga2\etc\icinga2\conf.d\users.conf
      Result: True
     Comment: File C:\ProgramData\icinga2\etc\icinga2\conf.d\users.conf is not present
     Started: 20:59:55.225196
    Duration: 0.0 ms
     Changes:
----------
          ID: Remove conflicting icinga2 config file app.conf
    Function: file.absent
        Name: C:\ProgramData\icinga2\etc\icinga2\conf.d\app.conf
      Result: True
     Comment: File C:\ProgramData\icinga2\etc\icinga2\conf.d\app.conf is not present
     Started: 20:59:55.225196
    Duration: 0.0 ms
     Changes:
----------
          ID: Remove conflicting icinga2 config file commands.confgroups.conf
    Function: file.absent
        Name: C:\ProgramData\icinga2\etc\icinga2\conf.d\commands.confgroups.conf
      Result: True
     Comment: File C:\ProgramData\icinga2\etc\icinga2\conf.d\commands.confgroups.conf is not present
     Started: 20:59:55.225196
    Duration: 0.0 ms
     Changes:
----------
          ID: Remove conflicting icinga2 config file notifications.conf
    Function: file.absent
        Name: C:\ProgramData\icinga2\etc\icinga2\conf.d\notifications.conf
      Result: True
     Comment: File C:\ProgramData\icinga2\etc\icinga2\conf.d\notifications.conf is not present
     Started: 20:59:55.225196
    Duration: 0.0 ms
     Changes:
----------
          ID: Remove conflicting icinga2 config file services.conf
    Function: file.absent
        Name: C:\ProgramData\icinga2\etc\icinga2\conf.d\services.conf
      Result: True
     Comment: File C:\ProgramData\icinga2\etc\icinga2\conf.d\services.conf is not present
     Started: 20:59:55.225196
    Duration: 15.637 ms
     Changes:
----------
          ID: Remove conflicting icinga2 config file timeperiods.conf
    Function: file.absent
        Name: C:\ProgramData\icinga2\etc\icinga2\conf.d\timeperiods.conf
      Result: True
     Comment: File C:\ProgramData\icinga2\etc\icinga2\conf.d\timeperiods.conf is not present
     Started: 20:59:55.240833
    Duration: 0.0 ms
     Changes:
----------
          ID: Complete Installation
    Function: file.managed
        Name: C:\ProgramData\icinga2\etc\icinga2\salt-managed
      Result: True
     Comment: File C:\ProgramData\icinga2\etc\icinga2\salt-managed is in the correct state
     Started: 20:59:55.240833
    Duration: 15.645 ms
     Changes:
----------
          ID: Setup Icinga2
    Function: cmd.script
      Result: True
     Comment: State was not run because none of the onchanges reqs changed
     Started: 20:59:55.256478
    Duration: 0.0 ms
     Changes:
----------
          ID: Ensure Icinga2 CA is trusted
    Function: win_pki.import_cert
        Name: salt://infra/icinga2/files/ca.crt
      Result: True
     Comment: Certificate 'ED4B02846DA3ECDFCD1BC2D0D3353E8AACEAFB08' already contained in store: Cert:\LocalMachine\Root
     Started: 20:59:55.256478
    Duration: 3001.838 ms
     Changes:
----------
          ID: Icinga2 failed Installation
    Function: file.managed
        Name: C:\ProgramData\icinga2\etc\icinga2\salt-managed
      Result: True
     Comment: State was not run because onfail req did not change
     Started: 20:59:58.258316
    Duration: 0.0 ms
     Changes:
----------
          ID: Restart Icinga2
    Function: service.running
        Name: icinga2
      Result: True
     Comment: The service icinga2 is already running
     Started: 20:59:58.273943
    Duration: 31.445 ms
     Changes:
----------
          ID: Ensure host definition is correct
    Function: file.serialize
        Name: C:\ProgramData\icinga2\etc\icinga2\director-definition.json
      Result: True
     Comment: File C:\ProgramData\icinga2\etc\icinga2\director-definition.json is in the correct state
     Started: 20:59:58.305388
    Duration: 31.061 ms
     Changes:
----------
          ID: Ensure host has been declared in the Icinga2 director
    Function: cmd.script
        Name: salt://infra/icinga2/files/director-windows.ps1
      Result: True
     Comment:
     Started: 20:59:58.352089
    Duration: 1703.114 ms
     Changes:
----------
          ID: Ensure negative DNS resolutions are not cached
    Function: cmd.run
        Name: C:\Windows\System32\reg.exe add HKLM\SYSTEM\CurrentControlSet\Services\Dnscache\Parameters /v MaxNegativeCacheTtl /t REG_DWORD /d 0 /f

      Result: True
     Comment: unless condition is true
     Started: 21:00:00.055203
    Duration: 107.085 ms
     Changes:
----------
          ID: Ensure DNS Client is restarted
    Function: module.wait
        Name: service.restart
      Result: True
     Comment: State was not run because none of the onchanges reqs changed
     Started: 21:00:00.162288
    Duration: 0.0 ms
     Changes:
----------
          ID: Ensure Salt Managed Config Exists
    Function: file.managed
        Name: C:\Program Files\Consul\Config\salt.json
      Result: True
     Comment: File C:\Program Files\Consul\Config\salt.json is in the correct state
     Started: 21:00:00.162288
    Duration: 46.259 ms
     Changes:
----------
          ID: Ensure Consul is installed
    Function: pkg.installed
        Name: consul
      Result: True
     Comment: All specified packages are already installed and are at the desired version
     Started: 21:00:00.208547
    Duration: 245.748 ms
     Changes:
----------
          ID: Ensure Telegraf is extracted
    Function: archive.extracted
        Name: C:\Program Files\Telegraf\1.15.2
      Result: True
     Comment: Path C:\Program Files\Telegraf\1.15.2\telegraf-1.15.2 exists
     Started: 21:00:00.454295
    Duration: 0.0 ms
     Changes:
----------
          ID: Ensure current Telegraf version exists
    Function: file.symlink
        Name: C:\Program Files\Telegraf\current
      Result: True
     Comment: Set ownership of symlink C:\Program Files\Telegraf\current to mlopez
     Started: 21:00:00.454295
    Duration: 15.619 ms
     Changes:
              ----------
              ownership:
                  mlopez
----------
          ID: Telegraf config
    Function: file.managed
        Name: C:\Program Files\Telegraf\telegraf.conf
      Result: True
     Comment: File C:\Program Files\Telegraf\telegraf.conf is in the correct state
     Started: 21:00:00.469914
    Duration: 282.547 ms
     Changes:
----------
          ID: Telegraf servcies config
    Function: file.directory
        Name: C:\Program Files\Telegraf\telegraf.d
      Result: True
     Comment: The directory C:\Program Files\Telegraf\telegraf.d is in the correct state
     Started: 21:00:00.752461
    Duration: 15.624 ms
     Changes:
----------
          ID: Telegraf Windows config
    Function: file.managed
        Name: C:\Program Files\Telegraf\telegraf.d\inputs.conf
      Result: True
     Comment: File C:\Program Files\Telegraf\telegraf.d\inputs.conf is in the correct state
     Started: 21:00:00.768085
    Duration: 246.454 ms
     Changes:
----------
          ID: Ensure incorrectly configured telegraf services are removed
    Function: cmd.run
        Name: & "C:\Program Files\Telegraf\current\telegraf.exe" -service stop ; & "C:\Program Files\Telegraf\current\telegraf.exe" -service uninstall
      Result: True
     Comment: onlyif condition is true
              unless condition is true
     Started: 21:00:01.019533
    Duration: 1100.691 ms
     Changes:
----------
          ID: Ensure the Telegraf service is installed
    Function: cmd.run
        Name: & "C:\Program Files\Telegraf\current\telegraf.exe" -service install --config "C:\Program Files\Telegraf\telegraf.conf" --config-directory "C:\Program Files\Telegraf\telegraf.d"
      Result: True
     Comment: unless condition is true
     Started: 21:00:02.120224
    Duration: 474.633 ms
     Changes:
----------
          ID: Ensure the Telegraf service is enabled and running
    Function: service.running
        Name: telegraf
      Result: True
     Comment: The service telegraf is already running
     Started: 21:00:02.594857
    Duration: 32.377 ms
     Changes:
----------
          ID: Ensure vcredist-2015 is installed
    Function: pkg.installed
        Name: vcredist-2015
      Result: True
     Comment: All specified packages are already installed
     Started: 21:00:02.638893
    Duration: 93.765 ms
     Changes:
----------
          ID: Ensure PCM is installed
    Function: pkg.installed
        Name: pcm
      Result: True
     Comment: All specified packages are already installed and are at the desired version
     Started: 21:00:02.732658
    Duration: 236.392 ms
     Changes:
----------
          ID: Ensure Telegraf for pcm is configured
    Function: file.managed
        Name: C:\Program Files\Telegraf\telegraf.d\salt-pcm.conf
      Result: True
     Comment: File C:\Program Files\Telegraf\telegraf.d\salt-pcm.conf is in the correct state
     Started: 21:00:02.970052
    Duration: 20.011 ms
     Changes:
----------
          ID: Ensure Telegraf for pcm running
    Function: service.running
        Name: telegraf
      Result: True
     Comment: The service telegraf is already running
     Started: 21:00:02.991114
    Duration: 31.003 ms
     Changes:
----------
          ID: Ensure Telegraf for hyper-v is configured
    Function: file.managed
        Name: C:\Program Files\Telegraf\telegraf.d\salt-hyper-v.conf
      Result: True
     Comment: File C:\Program Files\Telegraf\telegraf.d\salt-hyper-v.conf is in the correct state
     Started: 21:00:03.022117
    Duration: 31.266 ms
     Changes:
----------
          ID: Ensure Telegraf for hyper-v running
    Function: service.running
        Name: telegraf
      Result: True
     Comment: The service telegraf is already running
     Started: 21:00:03.053383
    Duration: 31.245 ms
     Changes:
----------
          ID: Ensure HyperV features are installed
    Function: win_servermanager.installed
      Result: True
     Comment: The following features are already installed:
              - Hyper-V
              - RSAT-Hyper-V-Tools
              - RSAT-AD-PowerShell
     Started: 21:00:03.084628
    Duration: 1146.602 ms
     Changes:
----------
          ID: Ensure migration is enabled
    Function: cmd.run
        Name: Enable-VMMigration
      Result: True
     Comment: unless condition is true
     Started: 21:00:04.231230
    Duration: 1110.606 ms
     Changes:
----------
          ID: Ensure migration auth is kerberos
    Function: cmd.run
        Name: Set-VMHost -VirtualMachineMigrationAuthenticationType Kerberos
      Result: True
     Comment: unless condition is true
     Started: 21:00:05.357467
    Duration: 852.183 ms
     Changes:
----------
          ID: Ensure migration performance is compression
    Function: cmd.run
        Name: Set-VMHost -VirtualMachineMigrationPerformanceOption Compression
      Result: True
     Comment: unless condition is true
     Started: 21:00:06.209650
    Duration: 913.1 ms
     Changes:
----------
          ID: Ensure enchanced session mode is enabled
    Function: cmd.run
        Name: Set-VMHost -EnableEnhancedSessionMode $True
      Result: True
     Comment: unless condition is true
     Started: 21:00:07.122750
    Duration: 997.013 ms
     Changes:
----------
          ID: Ensure schedule type is core
    Function: cmd.run
        Name: bcdedit /set HypervisorSchedulerType core
      Result: True
     Comment: unless condition is true
     Started: 21:00:08.119763
    Duration: 816.798 ms
     Changes:
----------
          ID: Ensure the Telegraf grain is set
    Function: grains.list_present
        Name: telegraf
      Result: True
     Comment: Value hyper-v is already in grain telegraf
     Started: 21:00:08.936561
    Duration: 0.0 ms
     Changes:
----------
          ID: Ensure host allows credential delegration
    Function: cmd.script
        Name: salt://stack/hyper-v/files/ensure-delegration.ps1
      Result: True
     Comment:
     Started: 21:00:08.936561
    Duration: 2799.39 ms
     Changes:
----------
          ID: Ensure machine is domain joined
    Function: system.join_domain
        Name: <blah>
      Result: True
     Comment: Computer already added to '<blah>'
     Started: 21:00:11.735951
    Duration: 110.529 ms
     Changes:

Summary for local
-------------
Succeeded: 69 (changed=1)
Failed:     0
-------------
Total states run:     69
Total run time:   42.277 s

Do you know what your doing when salt minion is making the process that is hanging?

I don't think any processes are hanging, at least there's no impact to any operations. The process details suggest it's a scheduled highstate.

For example are you running some kind of state file around the same time the hanging process spawns?

It appears this occurs just with Highstate, no other execution are normally occurring on these servers.


Aside, I am hoping these servers will be upgraded to Windows Server 2022 soon'ish. There's already a migration to upgrade to Ubuntu 22.04 - while refactoring the existing Salt states (migrating to onedir and removing some legacy workarounds for random Salt problems over the may years).

@cmcmarrow cmcmarrow added the cannot-reproduce cannot be replicated with info/context provided label Apr 26, 2023
@cmcmarrow
Copy link
Contributor

@Silvenga thanks for the ^ info. Can I also see your minion config? Are you sure you did not make any changes to your states? Salt is eating up some much ram because it has a lot of sub processes for some reason. My guess is your cmd states are not exiting right.

@anilsil anilsil removed this from the Sulfur v3006.1 milestone Apr 26, 2023
@anilsil anilsil added this to the Sulfur v3006.2 milestone Apr 26, 2023
@Silvenga
Copy link
Author

Silvenga commented Apr 26, 2023

The combined minion configuration for a Hyper-V Windows node. Nothing super exotic, the disable_modules are from some performance testing done years ago, so they might not be useful anymore. The restart-minion was from the workaround above that's only applied to Windows nodes. I don't remember why grains_cache was enabled.

This is mostly the standard configuration deployed to both Linux and Windows nodes.

master: <blah>
id: <blah>

schedule:
  __mine_interval: {enabled: true, function: mine.update, jid_include: true, maxrunning: 2,
    minutes: 60, return_job: false, run_on_start: true}
  restart-minion:
    args: [salt-minion]
    cron: 28 6 * * *
    enabled: true
    function: service.restart
    jid_include: true
    maxrunning: 1
    name: restart-minion

grains_cache: true

# Not documented.
# https://github.com/saltstack/salt/issues/48773
disable_grains:
  - esxi

disable_modules:
  - vsphere
  # Just annoying:
  - boto3_elasticsearch
  - victorops
  - ifttt
  - pushbullet
  - zfs

My guess is your cmd states are not exiting right.

Oh, that's interesting. I'll see if I can get a process tree before the minion is restarted. I should note that there are no processes hanging around after the service restart (e.g. process count is mostly constant on these nodes as measured by a Telegraf agent). Icinga also monitors the number of processes, and hasn't detected anything unconstrained.

I'm sorry to say that some of these Hyper-V nodes aren't exactly rebooted on a monthly cadence like we should - mostly because we don't have N+1 resources in some of the Hyper-V clusters to facilitate live migrations during rolling reboots. So, what I mean to say, some of these nodes have quite an uptime, suggesting the process count is constrained.

EDIT: 66 days of uptime is the current record across the clusters.

@cmcmarrow
Copy link
Contributor

Thanks for the fast response, I'll keep looking if I can reproduce.

@Silvenga
Copy link
Author

Thanks for the fast response, I'll keep looking if I can reproduce.

TBH, if you can't repo, I think the onus should be on the reporter (me), we should upgrade to 3006 just to confirm this hasn't been accidently fixed or something. From there, I'm hoping that this stuff get's moved over to the new Salt cluster with re-written states, which includes migrating to Windows Server 2022.

@cmcmarrow
Copy link
Contributor

@Silvenga if would be great if you could find that state causing this problem. I would appreciate the help.

@cmcmarrow
Copy link
Contributor

Closing Issues due to inactivity. Feel free to re-open if deemed necessary.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug broken, incorrect, or confusing behavior cannot-reproduce cannot be replicated with info/context provided Performance Regression The issue is a bug that breaks functionality known to work in previous releases. Windows
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants