
[BUG] Machine Tracker DNS searches eat all available file descriptors and crash #2669

Closed
lunkwill42 opened this issue Sep 1, 2023 · 4 comments · Fixed by #2917

@lunkwill42
Member

Describe the bug

When running a Machine Tracker IP search for a subnet with the DNS checkbox checked, Machine Tracker will perform a bunch of DNS lookups in parallel, in order to return the search results as fast as possible.

However, it seems Machine Tracker parallelizes too much: each resolver consumes a UDP socket, and the number of sockets used seems to grow at least linearly with the size of the searched subnet.

Once the number of simultaneously open file descriptors reaches the limit imposed by the OS (typically 1024), further attempts to open a file descriptor will crash with an unhandled OSError. This in turn results in a series of chained errors: Django tries to produce a 500 response, but is unable to open the 500.html template because there are no available file descriptors left, so Django tries to produce a 500 error for that error, and so on.

The end result (at least when running NAV under uwsgi) is that none of the opened file descriptors get closed, and the worker process stays around with too many open file descriptors. The worker process is then unable to respond properly to further requests, and every request it handles results in a 500 Internal Server Error (in our uwsgi setup, this continues until the worker has served its configured maximum number of requests, at which point the uwsgi master process kills it and starts a new worker).
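
For illustration, here is a minimal sketch (not NAV's actual asyncdns code) of why unbounded parallelism translates directly into file descriptors: with twisted.names, each in-flight UDP query binds its own ephemeral socket, so firing a lookup for every address in a /20 at once needs on the order of 4096 sockets. The reverse_name helper is invented for this sketch.

# Minimal sketch, NOT NAV's asyncdns module: each concurrent lookup started
# this way costs one ephemeral UDP socket until its deferred fires.
from twisted.internet import defer
from twisted.names import client


def reverse_name(ip):
    """Build the in-addr.arpa name for an IPv4 address (helper for this sketch)."""
    return ".".join(reversed(ip.split("."))) + ".in-addr.arpa"


def reverse_all(addresses):
    resolver = client.createResolver()  # reads /etc/resolv.conf
    # Starting every query at once: N addresses ~ N open UDP sockets.
    deferreds = [resolver.lookupPointer(reverse_name(ip)) for ip in addresses]
    return defer.DeferredList(deferreds, consumeErrors=True)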

To Reproduce

Steps to reproduce the behavior:

  1. Go to Machine Tracker in the NAV toolbox
  2. Input a value like 10.0.0.0/20 in the IP address search field
  3. Check the Both filter radio button
  4. Check the Dns checkbox in the Columns section.
  5. Click the Search button.
  6. See error.

Expected behavior

Machine Tracker should respond in a timely manner for large subnet searches, just as with small ones, without crashing or leaking a bunch of resources.

Tracebacks

A typical traceback logged by a worker process that crashes may look like this (the exception chain is very long, so this is cut off as soon as it starts repeating itself):

Traceback (most recent call last):
  File "/opt/venvs/nav/lib/python3.7/site-packages/twisted/internet/udp.py", line 195, in _bindSocket
  File "/opt/venvs/nav/lib/python3.7/site-packages/twisted/internet/base.py", line 1199, in createInternetSocket
  File "/usr/lib/python3.7/socket.py", line 151, in __init__
OSError: [Errno 24] Too many open files

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/venvs/nav/lib/python3.7/site-packages/django/core/handlers/exception.py", line 47, in inner
  File "/opt/venvs/nav/lib/python3.7/site-packages/django/core/handlers/base.py", line 181, in _get_response
  File "/opt/venvs/nav/lib/python3.7/site-packages/nav/web/machinetracker/views.py", line 244, in mac_search
  File "/opt/venvs/nav/lib/python3.7/site-packages/nav/web/machinetracker/views.py", line 317, in mac_do_search
  File "/opt/venvs/nav/lib/python3.7/site-packages/nav/web/machinetracker/utils.py", line 162, in track_mac
  File "/opt/venvs/nav/lib/python3.7/site-packages/nav/asyncdns.py", line 56, in reverse_lookup
  File "/opt/venvs/nav/lib/python3.7/site-packages/nav/asyncdns.py", line 86, in resolve
  File "/opt/venvs/nav/lib/python3.7/site-packages/nav/asyncdns.py", line 170, in lookup
  File "/opt/venvs/nav/lib/python3.7/site-packages/twisted/names/common.py", line 122, in lookupPointer
  File "/opt/venvs/nav/lib/python3.7/site-packages/twisted/names/client.py", line 416, in _lookup
  File "/opt/venvs/nav/lib/python3.7/site-packages/twisted/names/client.py", line 317, in queryUDP
  File "/opt/venvs/nav/lib/python3.7/site-packages/twisted/names/client.py", line 281, in _query
  File "/opt/venvs/nav/lib/python3.7/site-packages/twisted/names/client.py", line 226, in _connectedProtocol
  File "/opt/venvs/nav/lib/python3.7/site-packages/twisted/internet/posixbase.py", line 369, in listenUDP
  File "/opt/venvs/nav/lib/python3.7/site-packages/twisted/internet/udp.py", line 178, in startListening
  File "/opt/venvs/nav/lib/python3.7/site-packages/twisted/internet/udp.py", line 198, in _bindSocket
twisted.internet.error.CannotListenError: Couldn't listen on any:39735: [Errno 24] Too many open files.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/venvs/nav/lib/python3.7/site-packages/django/core/handlers/exception.py", line 47, in inner
  File "/opt/venvs/nav/lib/python3.7/site-packages/django/utils/deprecation.py", line 117, in __call__
  File "/opt/venvs/nav/lib/python3.7/site-packages/django/core/handlers/exception.py", line 49, in inner
  File "/opt/venvs/nav/lib/python3.7/site-packages/django/core/handlers/exception.py", line 114, in response_for_exception
  File "/opt/venvs/nav/lib/python3.7/site-packages/django/core/handlers/exception.py", line 153, in handle_uncaught_exception
  File "/opt/venvs/nav/lib/python3.7/site-packages/nav/django/views.py", line 28, in custom_500
  File "/opt/venvs/nav/lib/python3.7/site-packages/django/template/loader.py", line 15, in get_template
  File "/opt/venvs/nav/lib/python3.7/site-packages/django/template/backends/django.py", line 34, in get_template
  File "/opt/venvs/nav/lib/python3.7/site-packages/django/template/engine.py", line 143, in get_template
  File "/opt/venvs/nav/lib/python3.7/site-packages/django/template/engine.py", line 125, in find_template
  File "/opt/venvs/nav/lib/python3.7/site-packages/django/template/loaders/cached.py", line 58, in get_template
  File "/opt/venvs/nav/lib/python3.7/site-packages/django/template/loaders/base.py", line 24, in get_template
  File "/opt/venvs/nav/lib/python3.7/site-packages/django/template/loaders/cached.py", line 27, in get_contents
  File "/opt/venvs/nav/lib/python3.7/site-packages/django/template/loaders/filesystem.py", line 23, in get_contents
OSError: [Errno 24] Too many open files: '/etc/nav/templates/500.html'

During handling of the above exception, another exception occurred:

…

Environment (please complete the following information):

  • NAV version installed: 5.6.0

Additional context

The open file descriptors of a uwsgi process can be inspected using the lsof command line tool:

# lsof -c uwsgi
COMMAND PID   USER      FD      TYPE           DEVICE        SIZE/OFF NODE NAME
uwsgi   28655 navcron   25u     IPv4           74193792      0t0      UDP *:38307
uwsgi   28655 navcron   26u     IPv4           74193793      0t0      UDP *:9768
uwsgi   28655 navcron   27u     IPv4           74193794      0t0      UDP *:9912
uwsgi   28655 navcron   28u     IPv4           74193795      0t0      UDP *:61020
uwsgi   28655 navcron   29u     IPv4           74193796      0t0      UDP *:4058
uwsgi   28655 navcron   30u     IPv4           74193797      0t0      UDP *:35534
uwsgi   28655 navcron   31u     IPv4           74193798      0t0      UDP *:47860
uwsgi   28655 navcron   32u     IPv4           74193799      0t0      UDP *:13710
uwsgi   28655 navcron   33u     IPv4           74193800      0t0      UDP *:13162
uwsgi   28655 navcron   34u     IPv4           74193801      0t0      UDP *:18315
uwsgi   28655 navcron   35u     IPv4           74193802      0t0      UDP *:16694
uwsgi   28655 navcron   36u     IPv4           74193803      0t0      UDP *:2774
uwsgi   28655 navcron   37u     IPv4           74193804      0t0      UDP *:3512
uwsgi   28655 navcron   38u     IPv4           74193805      0t0      UDP *:34299
uwsgi   28655 navcron   39u     IPv4           74193806      0t0      UDP *:3645
uwsgi   28655 navcron   40u     IPv4           74193807      0t0      UDP *:19686
uwsgi   28655 navcron   41u     IPv4           74193808      0t0      UDP *:17175
uwsgi   28655 navcron   42u     IPv4           74193809      0t0      UDP *:36872
uwsgi   28655 navcron   43u     IPv4           74193810      0t0      UDP *:62605
uwsgi   28655 navcron   44u     IPv4           74193811      0t0      UDP *:58840
uwsgi   28655 navcron   45u     IPv4           74193812      0t0      UDP *:63603
uwsgi   28655 navcron   46u     IPv4           74193813      0t0      UDP *:35849
uwsgi   28655 navcron   47u     IPv4           74193814      0t0      UDP *:21509
uwsgi   28655 navcron   48u     IPv4           74193815      0t0      UDP *:4923
uwsgi   28655 navcron   49u     IPv4           74193816      0t0      UDP *:30172
uwsgi   28655 navcron   50u     IPv4           74193817      0t0      UDP *:9860
uwsgi   28655 navcron   51u     IPv4           74193818      0t0      UDP *:11707
uwsgi   28655 navcron   52u     IPv4           74193819      0t0      UDP *:60734
uwsgi   28655 navcron   53u     IPv4           74193820      0t0      UDP *:4932
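
The same information can also be gathered from inside the worker process. The following is a small, Linux-specific sketch (it assumes a /proc filesystem) that compares the number of descriptors currently open to the process limit that `ulimit -n` reports:

# Linux-only sketch: count this process's open file descriptors and compare
# them to the RLIMIT_NOFILE limits (the soft limit is what `ulimit -n` shows).
import os
import resource

soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
# The listdir() call itself briefly uses one extra descriptor.
open_fds = len(os.listdir("/proc/self/fd"))
print(f"{open_fds} open file descriptors (soft limit {soft}, hard limit {hard})")
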
lunkwill42 added the bug label Sep 1, 2023
@lunkwill42
Member Author

There may be problems reproducing this on systems that allow more than 1024 file descriptors per process. When running this in the docker-compose-based development environment, I'm unable to reproduce it, as the container (at least on my machine) seems to allow far more than 1024 file descriptors.

@lunkwill42
Member Author

Inside the web service container on my setup, the limit for file descriptors is insanely high by default:

# ulimit -n
1048576

This can be forced to more "normal" limits by adding something like the following to docker-compose.override.yml and re-running docker compose up -d:

version: '2'
services:
  web:
    ulimits:
      nofile:
        soft: "1024"
        hard: "1024"

@lunkwill42
Member Author

I don't think I put this in writing anywhere, but I guess the most likely course of action to resolve this is to impose some limit on how many simultaneous requests will be in flight at any given time. An unlimited number is untenable, and 1024 would also be too high; 100 simultaneous requests might be a good (although arbitrary) number to ensure we don't bump against the usual limits.
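
A hedged sketch of how such a cap could look with Twisted (this is only an illustration, not the actual change in #2917): a DeferredSemaphore holds each lookup back until fewer than the configured number of queries are in flight.

# Illustrative sketch only, not the fix from #2917: bound the number of
# simultaneous DNS queries (and hence UDP sockets) with a DeferredSemaphore.
from twisted.internet import defer
from twisted.names import client

MAX_IN_FLIGHT = 100  # arbitrary, but comfortably below a 1024 fd limit


def reverse_all_bounded(names):
    resolver = client.createResolver()
    semaphore = defer.DeferredSemaphore(MAX_IN_FLIGHT)
    # semaphore.run() delays each lookupPointer call until a token is free,
    # so at most MAX_IN_FLIGHT queries exist at any given time.
    deferreds = [semaphore.run(resolver.lookupPointer, name) for name in names]
    return defer.DeferredList(deferreds, consumeErrors=True)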

@lunkwill42
Member Author

Fixed by #2917
