
[BUG] Maintenance process leaking memory in 3005.1 #63747

Closed · denschub opened this issue Feb 21, 2023 · 16 comments
Labels: Bug (broken, incorrect, or confusing behavior), pending-close, Sulfur (v3006.0 release code name and version)

Comments

@denschub commented Feb 21, 2023

Description
This appears to be a regression in 3005.1, as I didn't see this before. The "maintenance" process grows relatively quickly, until it eventually gets OOM'ed.

Setup
(Please provide relevant configs and/or SLS files (be sure to remove sensitive info). There is no general set-up of Salt.)

Please be as specific as possible and give set-up details.

  • [ ] on-prem machine
  • [x] VM (KVM)
  • [ ] VM running on a cloud service, please be explicit and add details
  • [ ] container (Kubernetes, Docker, containerd, etc. please specify)
  • [ ] or a combination, please be explicit
  • [ ] jails if it is FreeBSD
  • [x] classic packaging (via Archlinux repo)
  • [ ] onedir packaging
  • [ ] used bootstrap to install

Steps to Reproduce the behavior
Nothing special. It's just a master running. Nothing of note in the logs either.

Expected behavior
n/a

Screenshots
[Screenshot: memory-usage graph, 2023-02-21 03:06]

(green is memory in use, yellow is disk caches, rest is.. rest.)

Versions Report

salt --versions-report (Provided by running salt --versions-report. Please also mention any differences in master/minion versions.)
Salt Version:
          Salt: 3005.1

Dependency Versions:
          cffi: 1.15.1
      cherrypy: Not Installed
      dateutil: Not Installed
     docker-py: Not Installed
         gitdb: 4.0.10
     gitpython: 3.1.29
        Jinja2: 3.1.2
       libgit2: 1.5.0
      M2Crypto: 0.38.0
          Mako: Not Installed
       msgpack: 1.0.4
  msgpack-pure: Not Installed
  mysql-python: Not Installed
     pycparser: 2.21
      pycrypto: Not Installed
  pycryptodome: 3.12.0
        pygit2: 1.11.1
        Python: 3.10.9 (main, Dec 19 2022, 17:35:49) [GCC 12.2.0]
  python-gnupg: Not Installed
        PyYAML: 6.0
         PyZMQ: 24.0.1
         smmap: 5.0.0
       timelib: Not Installed
       Tornado: 4.5.3
           ZMQ: 4.3.4

System Versions:
          dist: arch
        locale: utf-8
       machine: x86_64
       release: 6.1.12-1-lts
        system: Linux
       version: Arch Linux

Additional context
This appears to be a regression in 3005.1, but I can't 100% verify this right now :/

denschub added the Bug (broken, incorrect, or confusing behavior) and needs-triage labels Feb 21, 2023

@denschub (Author)

Might be related to #58791, but that one is so old that I felt it was better to file a new one.

@OrangeDog (Contributor)

More likely related to #62706.

OrangeDog added the Regression (a bug that breaks functionality known to work in previous releases) and Phosphorus (v3005.0 release code name and version) labels Feb 21, 2023
garethgreenaway added the Sulfur (v3006.0 release code name and version) label Mar 6, 2023
garethgreenaway self-assigned this Mar 6, 2023
Ch3LL added this to the Sulfur v3006.0 milestone Mar 20, 2023
@whytewolf (Collaborator)

I have not been able to replicate this. What settings are set up in the master?

@whytewolf (Collaborator)

Here is a list of the items handled by the maintenance process; that might help with knowing which settings to share.

  1. AES key rotation: rotates the AES encryption key every 24 hours
  2. key_cache evaluation: creates a msgpack file with a list of accepted keys
  3. git_pillar fetch: runs a fetch of the remotes on git_pillar setups
  4. scheduler: runs the master scheduler
  5. presence: fires presence events, if presence events are set up

This is done on loop_interval; the relevant master options are sketched below.
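
For reference, these are the master options that govern those tasks. This is an illustrative snippet, not taken from the reporter's config; the values shown are the usual defaults (key_cache is off unless explicitly set), so check the docs for your version.

loop_interval: 60                # seconds between maintenance-loop iterations
rotate_aes_key: True             # enable periodic AES key rotation
publish_session: 86400           # rotate the AES key every 24 hours
key_cache: 'sched'               # maintain the msgpack cache of accepted keys
git_pillar_update_interval: 60   # how often git_pillar remotes are fetched (seconds)
presence_events: False           # fire presence events from the maintenance loop when True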

@denschub (Author)

So, there are a couple of things. I'm using ext_pillar, where one source is a local git repo and the other is actually cmd_json. I also have three gitfs_remotes. It's hard to disable any of them, because that effectively breaks everything -- but if you want me to disable something to check whether the memory issue still happens, I can make that happen.

Full master config for reference
default_include: master.d/*.conf
interface: '::'
ipv6: True
state_output: mixed

file_roots:
  base:
    - /srv/salt

fileserver_backend:
  - git
  - roots

gitfs_pubkey: '/root/.ssh/id_ed25519.pub'
gitfs_privkey: '/root/.ssh/id_ed25519'

gitfs_remotes:
  - file:///var/lib/gitolite/repositories/states.git
  - https://github.com/saltstack-formulas/openssh-formula.git
  - https://github.com/saltstack-formulas/sudoers-formula.git

ext_pillar:
  - git:
    - master file:///var/lib/gitolite/repositories/pillar.git
  - cmd_json: /etc/letsencrypt/tools/salty-letsencrypt-pillar %s

git_pillar_pubkey: '/root/.ssh/id_ed25519.pub'
git_pillar_privkey: '/root/.ssh/id_ed25519'

# /etc/salt/master.d/reactor.conf

reactor:
  - 'salt/fileserver/gitfs/update':
    - /srv/reactor/update_fileserver.sls

@whytewolf (Collaborator)

Thank you. I'll see what I can do with these. The only one that should matter for the maintenance process is the git_pillar; all the others are handled through other processes.

Speaking of which, how large is the local repo for your git_pillar?

For clarity, how many minions are connected to the master? A breakdown by acceptance state would be best.

@whytewolf (Collaborator)

Hmm, one thing I am noticing: you are not using ssh for any of your gits, so you might be able to change the git library without much issue.

I noticed you set up pubkey and privkey even though you're not using ssh with gitfs or git_pillar. You can remove those settings, as they only apply to ssh-based git and are ignored otherwise.

Since you are not actually using the more advanced git authentication methods at the moment, can you try switching the git library you use? You might need to install the other library for this and set the config to use it (see the sketch below). If you're using pygit2, switch to GitPython, or vice versa.

If the problem remains, it is most likely something in git_pillar; if it doesn't, it is the library in use. Either way, please update us.
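
A minimal sketch of that provider switch, assuming both libraries are installed (these are the standard master options for choosing the git backend):

gitfs_provider: gitpython        # or pygit2
git_pillar_provider: gitpython   # or pygit2

With pygit2 the fetch happens in-process through libgit2, while GitPython shells out to the git binary, so swapping providers is a clean way to isolate whether the leak lives in the library layer or in Salt's git_pillar code.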

@denschub (Author)

Thanks for your help so far. Much appreciated!

how large is the local repo for your git_pillar?

A clone of the entire repo is ~550KiB. 18 files, 530'ish lines.

how many minions are connected to the master? A breakdown by acceptance state would be best.

18 minions, all accepted. No denied/unaccepted/rejected.

I noticed you set up pubkey and privkey even though you're not using ssh with gitfs or git_pillar.

That is true. I used to use ssh for gitfs, but had an issue with that a while ago, and switched to just pointing it to the local directory. However, that issue is no longer valid as far as I know. I have just pointed the gitfs back to ssh://, and will report if that resolves the issue.

If it doesn't, I'll switch from pygit2 to GitPython and report back the results of that!
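
(For context, an ssh-based remote is what actually exercises those key options. A hypothetical sketch of that form -- the URL is illustrative, not the real gitolite remote:)

gitfs_remotes:
  - ssh://gitolite@localhost/states.git   # hypothetical URL, for illustration only

gitfs_pubkey: /root/.ssh/id_ed25519.pub   # consulted only for ssh:// remotes, and only by pygit2
gitfs_privkey: /root/.ssh/id_ed25519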

@whytewolf (Collaborator)

Hmm, it definitely shouldn't be using that much memory. To test, I set up a local git server to run a small pillar through, then set the git_pillar update interval to 2 seconds (see the snippet below), and I don't seem to be getting any kind of increase in memory. :/ That worries me more, as it might mean it is something else.
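
(The 2-second fetch cadence would presumably have been set with something like the following in the test master's config; this is shown only to illustrate the option involved:)

git_pillar_update_interval: 2   # fetch git_pillar remotes every 2 seconds
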
Are you using any Jinja in those pillars? Doing any file importing?

@denschub (Author)

No Jinja, no file importing. It's all just pretty boring YAML. :/ Looking at my memory graph just now, it's clear that switching back to ssh didn't work:

[Screenshot: memory-usage graph, 2023-03-21 23:28]

I flipped my git_pillar back to file:// and switched to GitPython. Will report back in 12 hours or so!

@denschub (Author)

Okay, I'm flabbergasted. I switched to GitPython yesterday, and memory usage has been stable. To verify, I switched back to pygit2 at 16:00 UTC, and sure enough, it's immediately back to eating memory:

[Screenshot: memory-usage graph, 2023-03-22 19:45]

So the issue is either in libgit2, pygit2, or in the Salt code calling it. :/ There is currently an update to libgit2 1.6.3 in Arch Linux's testing repo, which I'll update to as soon as I can. At the moment, however, libgit2 1.5.1 and pygit2 1.11.1 are the versions this reproduces on, and the newest I have available.

@whytewolf (Collaborator)

Interesting. My own testing system is running Salt 3006rc2 with libgit2 1.5.0 and pygit2 1.11.1, and I can't replicate it.

Looking at the changelog for libgit2, there were a couple of memory-leak fixes in 1.6.1.

But that makes me wonder how I am not seeing it, unless the memory leak was introduced in 1.5.1.

@denschub (Author)

My initial bug report was filed with libgit2 1.5.0, so it's not a 1.5.1 regression, I'm afraid. Maybe 3006 fixed it by accident? That's unlikely, but heh. Given that I can work around this just fine for myself by switching to GitPython, I'm happy to wait until the libgit2 1.6.1 update is available for me to test. I'd love to be able to provide more useful information, but I'm sadly not deep enough into Python to know an approach for getting useful memory traces...

@whytewolf (Collaborator)

:/ Hmm, I don't know then. I doubt 3006 changed anything that would fix it. The git_pillar code hasn't changed in almost two years, and the only change to the utils.gitfs code (which git_pillar uses) was about version information; everything else is much older than that. There has to be another variable in play that we are overlooking.

The fact that it is happening in the maintenance thread means it is localized to the git fetch.

dwoz added a commit to dwoz/salt that referenced this issue Mar 25, 2023
dwoz added the pending-close label and removed the Regression (a bug that breaks functionality known to work in previous releases) label Mar 25, 2023
dwoz added a commit to dwoz/salt that referenced this issue Mar 29, 2023
garethgreenaway removed the Phosphorus (v3005.0 release code name and version) label Apr 10, 2023
cmcmarrow self-assigned this Apr 10, 2023
garethgreenaway linked a pull request Apr 11, 2023 that will close this issue
cmcmarrow mentioned this issue Apr 13, 2023
s0undt3ch pushed a commit that referenced this issue Apr 13, 2023
@cmcmarrow (Contributor) commented Apr 13, 2023

@denschub I believe #64072 should fix your leak. If you find that you still see the leak, please reopen the ticket. Thank you for your patience.
