Clearing out expired ObjectChanges can cause HTTP worker memory consumption to balloon #6056

Closed
mpalmer opened this issue Mar 26, 2021 · 2 comments
Assignees: jeremystretch
Labels: status: accepted (This issue has been accepted for implementation) · type: bug (A confirmed report of unexpected behavior in the application)

Comments

mpalmer (Contributor) commented Mar 26, 2021

NetBox version

v2.10.6

Python version

3.8

Steps to Reproduce

  1. Use NetBox as normal.
  2. Make a significant number of change-producing requests.
  3. On roughly 0.1% of those requests, this line of code will be executed (a sketch of the surrounding pattern follows this list):
    ObjectChange.objects.filter(time__lt=cutoff).delete()
  4. If there are many objects whose time is before the cutoff, the memory required to load all of those objects before deleting them causes the HTTP worker process to use a lot of memory, potentially causing all manner of unpleasantness.
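
For context, the trimming happens inline in the request path, roughly like the sketch below; the retention period, the 1-in-1000 roll, and the import path are assumptions based on the description above, not the exact NetBox source:

import random
from datetime import timedelta

from django.utils import timezone

from extras.models import ObjectChange  # assumed NetBox import path

def prune_changelog(retention_days=90):
    # On roughly 0.1% of requests, purge expired ObjectChange records inline,
    # inside the HTTP worker that is handling the request.
    if random.randint(1, 1000) == 1:
        cutoff = timezone.now() - timedelta(days=retention_days)
        ObjectChange.objects.filter(time__lt=cutoff).delete()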

Expected Behavior

Preferably, log trimming would be moved out of the HTTP service path altogether, because running a potentially expensive query (only sometimes) when you're trying to respond quickly to HTTP requests is a great way to make your p99 stats look really bad. At the very least, though, the query needs to be a straight-up DELETE FROM ... WHERE ..., rather than a load-then-delete.
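
As an illustration of the single-query approach, something like the following would issue one DELETE statement without the ORM collecting objects first. This is a sketch only: the table name follows Django's default app_label_modelname convention and is an assumption, and the 90-day retention is arbitrary.

from datetime import timedelta

from django.db import connection
from django.utils import timezone

# Single DELETE ... WHERE, bypassing the ORM's object collection entirely.
# "extras_objectchange" is the assumed table name for the ObjectChange model.
cutoff = timezone.now() - timedelta(days=90)
with connection.cursor() as cursor:
    cursor.execute(
        "DELETE FROM extras_objectchange WHERE time < %s",
        [cutoff],
    )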

Observed Behavior

Much memory. Very OOM.

@mpalmer mpalmer added the type: bug A confirmed report of unexpected behavior in the application label Mar 26, 2021
davekempe commented Mar 26, 2021

Oh, nice work @mpalmer!

Ahh, I think this explains why we needed this script:

#!/bin/bash
# $1: memory-usage threshold (%MEM, as reported by ps)
# $2: polling interval in seconds
threshold=$1
seconds=$2
re='^[0-9]+([.][0-9]+)?$'
while true; do
        # Highest %MEM among the gunicorn workers
        mem=$(ps aux | grep gunicorn | grep -v grep | awk '{print $4}' | sort -n | tail -n1)
        echo -n "$(date) - "
        if [[ $mem =~ $re ]]; then
                echo "$mem"
                if [ "$(printf "%.0f" "$mem")" -gt "$threshold" ]; then
                        logger "Netbox restarted with memory value $mem"
                        service netbox restart
                        echo "Netbox restarted with memory value $mem"
                        sleep 120
                else
                        echo "Netbox is doing fine ($mem)"
                fi
        else
                echo "Number not found ($mem)"
                ps aux | grep gunicorn | grep -v grep
        fi
        sleep "$seconds"
done

Hopefully we can get this one sorted out and not need to restart NetBox regularly when it gets stuck. Note that ours is a large NetBox instance (120K+ devices, with many, many interfaces), and it limps along with this script keeping it going.

jeremystretch (Member) commented Mar 26, 2021

AFAICT delete() should just be executing a single DELETE SQL query on the matching objects; however, debugging shows that it's loading all matching objects first and then deleting them by unique ID through a series of DELETE queries. Per the Django docs:

Django needs to fetch objects into memory to send signals and handle cascades. However, if there are no cascades and no signals, then Django may take a fast-path and delete objects without fetching into memory. For large deletes this can result in significantly reduced memory usage. The amount of executed queries can be reduced, too.
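
One of those conditions can be probed directly: the fast path is ruled out whenever any receiver is connected to the model's delete signals. A quick check, illustrative only and not NetBox code:

from django.db.models.signals import pre_delete, post_delete

from extras.models import ObjectChange  # assumed NetBox import path

# If either prints True, Django's deletion Collector will not take the
# fast path, mirroring the "no signals" condition quoted above.
print(pre_delete.has_listeners(ObjectChange))
print(post_delete.has_listeners(ObjectChange))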

It's possible to side-step this by fetching only the relevant PKs ourselves (to avoid loading objects into memory) and deleting them directly. However, I'd like to figure out why Django isn't taking the fast-path automatically.
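
A rough sketch of that side-step, with the retention period and batch size as arbitrary assumptions for illustration:

from datetime import timedelta

from django.utils import timezone

from extras.models import ObjectChange  # assumed NetBox import path

# Fetch only the primary keys (no full model instances), then delete in
# manageable batches of 1000.
cutoff = timezone.now() - timedelta(days=90)
pks = list(
    ObjectChange.objects.filter(time__lt=cutoff).values_list('pk', flat=True)
)
for i in range(0, len(pks), 1000):
    ObjectChange.objects.filter(pk__in=pks[i:i + 1000]).delete()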

@jeremystretch jeremystretch added the status: under review Further discussion is needed to determine this issue's scope and/or implementation label Mar 26, 2021
@jeremystretch jeremystretch added status: accepted This issue has been accepted for implementation and removed status: under review Further discussion is needed to determine this issue's scope and/or implementation labels Apr 13, 2021
@jeremystretch jeremystretch self-assigned this Apr 13, 2021
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jul 13, 2021