Clearing out expired ObjectChanges can cause HTTP worker memory consumption to balloon #6056

Closed
mpalmer opened this issue Mar 26, 2021 · 2 comments
Assignees: jeremystretch
Labels: status: accepted (This issue has been accepted for implementation) · type: bug (A confirmed report of unexpected behavior in the application)

Comments

mpalmer (Contributor) commented Mar 26, 2021

NetBox version

v2.10.6

Python version

3.8

Steps to Reproduce

  1. Use NetBox as normal.
  2. Make a significant number of change-producing requests.
  3. On roughly 0.1% of those requests, this line of code will be executed (a sketch of the surrounding pattern follows this list):
    ObjectChange.objects.filter(time__lt=cutoff).delete()
  4. If there are many objects whose time is before the cutoff, the memory required to load all of those objects before deleting them causes the HTTP worker process to use a lot of memory, potentially causing all manner of unpleasantness.
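
For context, the trimming happens inline in the request path, roughly like the sketch below; the retention period, the 1-in-1000 roll, and the import path are assumptions based on the description above, not the exact NetBox source:

import random
from datetime import timedelta

from django.utils import timezone

from extras.models import ObjectChange  # assumed NetBox import path

def prune_changelog(retention_days=90):
    # On roughly 0.1% of requests, purge expired ObjectChange records inline,
    # inside the HTTP worker that is handling the request.
    if random.randint(1, 1000) == 1:
        cutoff = timezone.now() - timedelta(days=retention_days)
        ObjectChange.objects.filter(time__lt=cutoff).delete()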

Expected Behavior

Preferably, log trimming would be moved out of the HTTP service path altogether, because running a potentially expensive query (only sometimes) when you're trying to respond quickly to HTTP requests is a great way to make your p99 stats look really bad. At the very least, though, the query needs to be a straight-up DELETE FROM ... WHERE ..., rather than a load-then-delete.
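
As an illustration of the single-query approach, something like the following would issue one DELETE statement without the ORM collecting objects first. This is a sketch only: the table name follows Django's default app_label_modelname convention and is an assumption, and the 90-day retention is arbitrary.

from datetime import timedelta

from django.db import connection
from django.utils import timezone

# Single DELETE ... WHERE, bypassing the ORM's object collection entirely.
# "extras_objectchange" is the assumed table name for the ObjectChange model.
cutoff = timezone.now() - timedelta(days=90)
with connection.cursor() as cursor:
    cursor.execute(
        "DELETE FROM extras_objectchange WHERE time < %s",
        [cutoff],
    )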

Observed Behavior

Much memory. Very OOM.

@mpalmer mpalmer added the type: bug A confirmed report of unexpected behavior in the application label Mar 26, 2021
davekempe commented Mar 26, 2021

Oh, nice work @mpalmer!

Ahh, I think this explains why we needed this script:

#!/bin/bash
# $1: memory-usage threshold (%MEM, as reported by ps)
# $2: polling interval in seconds
threshold=$1
seconds=$2
re='^[0-9]+([.][0-9]+)?$'
while true; do
        # Highest %MEM among the gunicorn workers
        mem=$(ps aux | grep gunicorn | grep -v grep | awk '{print $4}' | sort -n | tail -n1)
        echo -n "$(date) - "
        if [[ $mem =~ $re ]]; then
                echo "$mem"
                if [ "$(printf "%.0f" "$mem")" -gt "$threshold" ]; then
                        logger "Netbox restarted with memory value $mem"
                        service netbox restart
                        echo "Netbox restarted with memory value $mem"
                        sleep 120
                else
                        echo "Netbox is doing fine ($mem)"
                fi
        else
                echo "Number not found ($mem)"
                ps aux | grep gunicorn | grep -v grep
        fi
        sleep "$seconds"
done

Hopefully we can get this one sorted out and not need to restart NetBox regularly when it gets stuck. Note that ours is a large NetBox instance (120K+ devices, with many, many interfaces), and it limps along with this script keeping it going.

jeremystretch (Member) commented Mar 26, 2021

AFAICT delete() should just be executing a single DELETE SQL query on the matching objects; however, debugging shows that it's loading all matching objects first and then deleting them by unique ID through a series of DELETE queries. Per the Django docs:

Django needs to fetch objects into memory to send signals and handle cascades. However, if there are no cascades and no signals, then Django may take a fast-path and delete objects without fetching into memory. For large deletes this can result in significantly reduced memory usage. The amount of executed queries can be reduced, too.
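
One of those conditions can be probed directly: the fast path is ruled out whenever any receiver is connected to the model's delete signals. A quick check, illustrative only and not NetBox code:

from django.db.models.signals import pre_delete, post_delete

from extras.models import ObjectChange  # assumed NetBox import path

# If either prints True, Django's deletion Collector will not take the
# fast path, mirroring the "no signals" condition quoted above.
print(pre_delete.has_listeners(ObjectChange))
print(post_delete.has_listeners(ObjectChange))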

It's possible to side-step this by fetching only the relevant PKs ourselves (to avoid loading objects into memory) and deleting them directly. However, I'd like to figure out why Django isn't taking the fast-path automatically.
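
A rough sketch of that side-step, with the retention period and batch size as arbitrary assumptions for illustration:

from datetime import timedelta

from django.utils import timezone

from extras.models import ObjectChange  # assumed NetBox import path

# Fetch only the primary keys (no full model instances), then delete in
# manageable batches of 1000.
cutoff = timezone.now() - timedelta(days=90)
pks = list(
    ObjectChange.objects.filter(time__lt=cutoff).values_list('pk', flat=True)
)
for i in range(0, len(pks), 1000):
    ObjectChange.objects.filter(pk__in=pks[i:i + 1000]).delete()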

@jeremystretch jeremystretch added the status: under review Further discussion is needed to determine this issue's scope and/or implementation label Mar 26, 2021
@jeremystretch jeremystretch added status: accepted This issue has been accepted for implementation and removed status: under review Further discussion is needed to determine this issue's scope and/or implementation labels Apr 13, 2021
@jeremystretch jeremystretch self-assigned this Apr 13, 2021
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jul 13, 2021