Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Snimpy Memory Leak #33

Open
hanynowsky opened this issue Jun 12, 2015 · 12 comments
Open

Snimpy Memory Leak #33

hanynowsky opened this issue Jun 12, 2015 · 12 comments

Comments

@hanynowsky
Copy link

Using Snimpy in some Celery Tasks, the server RAM consumption grows non-stop to the point kernel starts killing Celery workers process (oom).
After running Valgrind against a simple Snimpy Script, Memory leak is detected:

valgrind --leak-check=yes --track-origins=yes python sdsl.py

==10762== LEAK SUMMARY:
==10762==    definitely lost: 609 bytes in 45 blocks
==10762==    indirectly lost: 0 bytes in 0 blocks
==10762==      possibly lost: 3,456 bytes in 6 blocks
==10762==    still reachable: 1,192,945 bytes in 5,951 blocks
==10762==         suppressed: 0 bytes in 0 blocks
==10762== Reachable blocks (those to which a pointer was found) are not shown.
==10762== To see them, rerun with: --leak-check=full --show-reachable=yes
==10762== 
==10762== For counts of detected and suppressed errors, rerun with: -v
==10762== ERROR SUMMARY: 4499 errors from 134 contexts (suppressed: 103 from 9)

#!/usr/bin/python
#import traceback
import os
import sys
from snimpy.manager import Manager as M
from snimpy.manager import load

def sdsl_test(hostname, community):
        mib_path = os.path.join('/home/masac/masac', 'scripts', 'mib')
        load(os.path.join(mib_path, 'SNMPv2-SMI.txt'))
        load(os.path.join(mib_path, 'SNMPv2-TC.txt'))
        load(os.path.join(mib_path, 'SNMPv2-CONF.txt'))
        load(os.path.join(mib_path, 'SNMPv2-MIB.txt'))
        load(os.path.join(mib_path, 'IANAifType-MIB.txt'))
        load(os.path.join(mib_path, 'IF-MIB.txt'))
        """ If you put cache to a value other than 0, you get RuntimeError relative to Dictionary size change"""
        m = M(host=hostname, retries=0, community=community, version=2, timeout=1, cache=0)
        uptime = 0
        try:
                with m:
                        uptime = m.sysUpTime
                        try:
                                for ix in m.ifDescr:
                                        desc = m.ifDescr[ix]
                                        metric = m.ifHCInOctets[ix]
                                        print(desc + " " + str(metric))
                        except Exception as e:
                                print(e)
        except Exception as e:
                print(e)
        finally:
                if uptime:
                        print(uptime)
sdsl_test('Router.IP','my-community')
sys.exit()

If needed, I can provide code of the Celery Task using Snimpy. basically the task loop over a list of network routers, fetches metrics and send them to a server.

@vincentbernat
Copy link
Owner

Memory leaks are expected as I did use Snimpy only in one-time scripts. I will look at that and fix them. Thanks for reporting!

I see that you noticed another bug, don't hesitate to open an issue for it.

vincentbernat added a commit that referenced this issue Jun 13, 2015
@vincentbernat
Copy link
Owner

Just before sys.exit(), could you add:

import snimpy.mib
snimpy.mib._smi.smiExit()

This should remove any potential leaks from libsmi (but won't solve your problem).

What are your Celery tasks lifetime? Is it like enclosing your code in a for loop? Or is it doing something like reloading modules?

@hanynowsky
Copy link
Author

Thank you fore your reactiveness.
Well, as you said, the _smi.smiExit() does not help. The nature of Python garbage collection makes this leak happen I guess And celery being an asynchronous task processor, does not help neither :)
The celery Task is crontabbed each minute or each five minutes. Here is the source:

from __future__ import absolute_import
from celery import shared_task
from celery.task import task
from django.core.management.base import BaseCommand, CommandError
from celery.decorators import periodic_task
from celery.schedules import crontab
import platform
import socket
import time
import sys 
import os
import requests
import json
from django.conf import settings
from snimpy.manager import Manager as M
from snimpy.manager import load
from snimpy.snmp import SNMPNoSuchInstance
from snimpy.snmp import SNMPException
from snimpy.mib import SMIException
from socket import gaierror
from pyasn1.type.error import ValueConstraintError
import os
import traceback
from masac.feed.models import CiscoEquipment
from masac.feed.models import Routers
from masac.feed.models import Switches
import pickle
from copy import deepcopy
from copy import copy
import sqlite3 as lite
import inspect
from os.path import expanduser

LOG_DEBUG = settings.LOG_DEBUG
CARBON_SERVER = settings.CARBON_SERVER # Server where Cyanite is running
CARBON_PORT = settings.CARBON_PORT
SDSL_TIMEOUT = settings.SDSL_TIMEOUT
SWITCH_TIMEOUT = settings.SWITCH_TIMEOUT
ROUTER_TIMEOUT = settings.ROUTER_TIMEOUT
@periodic_task(run_every=(crontab(hour="*", minute="*/3", day_of_week="*")), queue='masac.switch', options={'queue': 'masac.switch', 'routing_key' : 'masac.switch'})
def send_switch_metrics():
    try:
        node = platform.node().replace('.', '-')
        timestamp = int(time.time())
        switch_list =  Switches.objects.filter(marque='switch', banned=0)
        DELAY = 1
        #sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        sock = socket.socket()
        sock.connect((CARBON_SERVER, CARBON_PORT))
        message = ''
        mib_path = os.path.join(settings.BASE_DIR, 'scripts', 'mib')
        load(os.path.join(mib_path, 'SNMPv2-SMI.txt'))
        load(os.path.join(mib_path, 'SNMPv2-TC.txt'))
        load(os.path.join(mib_path, 'SNMPv2-CONF.txt'))
        load(os.path.join(mib_path, 'SNMPv2-MIB.txt'))
        load(os.path.join(mib_path, 'IANAifType-MIB.txt'))
        load(os.path.join(mib_path, 'IF-MIB.txt'))

        list_count = len(switch_list) 
        if list_count < 1:
            print("Warning: List of switches is empty. Re-feed....")
            # get_switches()
        for device in switch_list:
            hostname = str(device.login)
            ip = str(device.ip)
            comm = str(device.snmp_community)
            if LOG_DEBUG == True:
                print("Testing Switch: " + hostname + " " + ip + " " + comm)
            try:
                m = M(host=hostname, retries=0, community=comm, version=2, timeout=SWITCH_TIMEOUT, cache=False) 
            except gaierror as ge:
                # print(traceback.format_exc())
                if LOG_DEBUG == True:
                    print("Snimpy Manager Exception: gaierror")

            metrics = {}
            try:
                for index in m.ifDescr:  
                    metrics[m.ifDescr[index]] = {}
                    if "thernet" in str(m.ifDescr[index]) and "unrouted" not in str(m.ifDescr[index]) and "Vlan" not in str(m.ifDescr[index]) and "Null" not in str(m.ifDescr[index]):
                        metrics[m.ifDescr[index]]['ifHCOutOctets'] = m.ifHCOutOctets[index]
                        metrics[m.ifDescr[index]]['ifHCInOctets'] = m.ifHCInOctets[index]
                                # print(index, repr(m.ifDescr[index]), repr(m.ifHCOutOctets[index]), repr(m.ifHCInOctets[index]))
            except SNMPNoSuchInstance as nie:
                # print(traceback.format_exc())
                print("SNMPNoSuchInstance for " + hostname + " " + ip + " " + comm)
            except SNMPException as snmpe:
                Switches.objects.filter(login=hostname, marque='switch').update(banned=1)
                print("SNMPException for " + hostname + " " + ip + " " + comm + " | List Count: " + str(list_count))
                # print(traceback.format_exc())
                pass
            except ValueConstraintError as e:
                print("ValueConstraintError for " +  hostname + " " + ip + " " + comm)
                # print(traceback.format_exc())
            lines= []
            for if_descr in list(metrics.keys()):
                for key in list(metrics[if_descr].keys()):
                    line = 'masac.switch.%s.%s.%s %d %d' % (hostname.replace(".","_"), if_descr.replace("/","_").replace(" ","_"), key, metrics[if_descr][key], timestamp)
                    lines.append(line)                  
            message = '\n'.join(lines) + '\n'
            if LOG_DEBUG == True:
                print("MESSAGE for " + hostname + " : " + message)
            sock.sendall(message.encode('UTF-8'))
        sock.close()
        time.sleep(DELAY)
    except Exception as e:
        print(traceback.format_exc())
  • Django DEBUG is set to False

@hanynowsky
Copy link
Author

Hello @vincentbernat Any news about this issue?

@vincentbernat
Copy link
Owner

I think the valgrind output is misleading and the memory leak doesn't happen in the CFFI module. I have fixed a small memory leak in the CFFI module but I think this is not related to what you are saying. I still need to find some time to check the issue. I suppose that the issue could be tested by running the first snippet you provided into a loop and use some tool to find the memory leak.

vincentbernat added a commit that referenced this issue Jun 18, 2015
The initialization of such an object is expensive (so, Snimpy is slow)
and there seems to trigger a memory leak. Just use a singleton instead.

Related to #33.
@vincentbernat
Copy link
Owner

I have pushed a new commit that should fix the problem. For the record, here is the code I have used to "trace" memory leaks.

#!/usr/bin/python3
from snimpy.manager import Manager as M
from snimpy.manager import load
import tracemalloc
import gc

def sdsl_test(hostname, community):
    load('IF-MIB')
    load('SNMPv2-MIB')
    m = M(host=hostname, retries=0, community=community, version=2,
          timeout=1, cache=0)
    uptime = m.sysUpTime
    print(uptime)
    for ix in m.ifDescr:
        desc = m.ifDescr[ix]
        metric = m.ifHCInOctets[ix]
        print(desc + " " + str(metric))


tracemalloc.start()
sdsl_test('127.0.0.1', 'public')
gc.collect()
snapshot1 = tracemalloc.take_snapshot()
for i in range(1000):
    sdsl_test('127.0.0.1', 'public')
gc.collect()
snapshot2 = tracemalloc.take_snapshot()

top_stats = snapshot2.compare_to(snapshot1, 'lineno')
print("[ Top 10 differences ]")
for stat in top_stats[:10]:
    print(stat)

@hanynowsky
Copy link
Author

Ok great. I am going to test again.
I guess it's 0.8.6 now? Did it land under pip ?

@vincentbernat
Copy link
Owner

I didn't do a release, in case there is something else to fix. You can also just apply the patch manually. It is pretty simple: cce0501

@hanynowsky
Copy link
Author

Sounds like the magic patch reduced the memory leak from 100% to 1%. That's great.
Thanks for your responsiveness and great work.
masac

@vincentbernat
Copy link
Owner

Due to #35, I have reverted the fix. I have asked for help in the PySNMP mailing list and hopefully will come with a proper fix. Otherwise, I'll try to hack around.

@vincentbernat vincentbernat reopened this Jun 26, 2015
@vincentbernat
Copy link
Owner

I forgot about this bug. The current situation with 0.8.8 is that only one command generator per thread will be used for SNMPv2. This is not the case with SNMPv3 as the command generator cannot be reused between sessions in this case. So, you shouldn't get a memory leak with 0.8.8 either in your case.

@hanynowsky
Copy link
Author

That is excellent news. I am gonna put that to test in my next iteration. Thanks for the great job!
Snimpy by the way is still in production in our company and it's rock solid!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants