[BUG] Search for Devicegroup with many members crashes #2345

Closed
ingeborgoh opened this issue Feb 10, 2022 · 3 comments · Fixed by #2434

@ingeborgoh
Contributor

ingeborgoh commented Feb 10, 2022

NAV 5.2.1

The device group StorGruppe has 2500 members. If I try to search for the group, the page crashes with a GraphiteUnreachableError from the Graphite server.

Any workaround?

Internal Server Error: /search/devicegroup/StorGruppe

GraphiteUnreachableError at /search/devicegroup/StorGruppe
http://192.168.10.143:8000/ is unreachable (HTTP Error 400: Bad Request)

Request Method: GET
Request URL: https://nav.uit.no/search/devicegroup/StorGruppe
Django Version: 2.2.17
Python Executable: 
Python Version: 3.7.12
Python Path: ['/usr/local/lib/python37.zip', '/usr/local/lib/python3.7', '/usr/local/lib/python3.7/lib-dynload', '/usr/local/lib/python3.7/site-packages', '/usr/local/lib/python3.7/site-packages', '/etc/nav/python']
Server time: Thu, 10 Feb 2022 10:50:45 +0100
Installed Applications:
('nav.models',
 'nav.web',
 'nav.django',
 'django.contrib.staticfiles',
 'django.contrib.sessions',
 'django.contrib.humanize',
 'django_filters',
 'crispy_forms',
 'crispy_forms_foundation',
 'rest_framework',
 'nav.auditlog',
 'nav.web.macwatch',
 'nav.web.geomap',
 'nav.portadmin.napalm',
 'nav.web.portadmin',
 'django.contrib.postgres')
Installed Middleware:
('django.middleware.common.CommonMiddleware',
 'django.contrib.sessions.middleware.SessionMiddleware',
 'nav.web.auth.AuthenticationMiddleware',
 'nav.web.auth.AuthorizationMiddleware',
 'nav.django.legacy.LegacyCleanupMiddleware',
 'django.contrib.messages.middleware.MessageMiddleware')


Traceback:

File "/usr/local/lib/python3.7/site-packages/nav/metrics/data.py" in get_metric_data
  123.         response = urlopen(req)

File "/usr/local/lib/python3.7/urllib/request.py" in urlopen
  222.     return opener.open(url, data, timeout)

File "/usr/local/lib/python3.7/urllib/request.py" in open
  531.             response = meth(req, response)

File "/usr/local/lib/python3.7/urllib/request.py" in http_response
  641.                 'http', request, response, code, msg, hdrs)

File "/usr/local/lib/python3.7/urllib/request.py" in error
  569.             return self._call_chain(*args)

File "/usr/local/lib/python3.7/urllib/request.py" in _call_chain
  503.             result = func(*args)

File "/usr/local/lib/python3.7/urllib/request.py" in http_error_default
  649.         raise HTTPError(req.full_url, code, msg, hdrs, fp)

During handling of the above exception (HTTP Error 400: Bad Request), another exception occurred:

File "/usr/local/lib/python3.7/site-packages/django/core/handlers/exception.py" in inner
  34.             response = get_response(request)

File "/usr/local/lib/python3.7/site-packages/django/core/handlers/base.py" in _get_response
  115.                 response = self.process_exception_by_middleware(e, request)

File "/usr/local/lib/python3.7/site-packages/django/core/handlers/base.py" in _get_response
  113.                 response = wrapped_callback(request, *callback_args, **callback_kwargs)

File "/usr/local/lib/python3.7/site-packages/nav/web/info/netboxgroup/views.py" in group_detail
  99.         netboxes, data_sources=['availability'], time_frames=['week', 'month']

File "/usr/local/lib/python3.7/site-packages/nav/metrics/data.py" in get_netboxes_availability
  192.         populate_for_time_frame(result, targets, netboxes, time_frames)

File "/usr/local/lib/python3.7/site-packages/nav/metrics/data.py" in populate_for_time_frame
  219.         avg = get_metric_average(targets, start="-1%s" % time_frame)

File "/usr/local/lib/python3.7/site-packages/nav/metrics/data.py" in get_metric_average
  49.     data = get_metric_data(target, start, end)

File "/usr/local/lib/python3.7/site-packages/nav/metrics/data.py" in get_metric_data
  134.         raise errors.GraphiteUnreachableError("{0} is unreachable".format(base), err)

Exception Type: GraphiteUnreachableError at /search/devicegroup/StorGruppe
Exception Value: http://192.168.10.143:8000/ is unreachable (HTTP Error 400: Bad Request)
Request information:
USER: [unable to retrieve the current user]

GET: No GET data

POST: No POST data
@ingeborgoh
Contributor Author

Possible workaround: If the number of members in the device group is larger than N, do not populate the Availability columns.

@johannaengland johannaengland self-assigned this Jun 10, 2022
@lunkwill42
Member

> Possible workaround: If the number of members in the device group is larger than N, do not populate the Availability columns.

I guess there are multiple possible workarounds. Some considerations:

Front-end fetch vs. back-end fetch

The front-end could populate the availability column dynamically through calls to a back-end API. NAV already does this in some parts of the front-end where we expect data fetches to be a lot slower than the actual page render. I would assume the main point of the device group page is to just list the group members. We don't need the availability data immediately, so it would be ok that this column takes a bit more time to populate.
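A minimal sketch of that split, not NAV's actual code: the initial render lists members with the availability left empty, and a separate back-end call supplies the numbers later. The function names, the row layout, and the `fetch` callable are all hypothetical.

```python
import json
from typing import Callable, Dict, Iterable, List, Optional


def render_member_rows(netboxes: Iterable[Dict]) -> List[Dict]:
    """Initial page render: list the group members only.

    Availability is left as None so the page does not wait for Graphite.
    """
    return [
        {"id": box["id"], "sysname": box["sysname"], "availability": None}
        for box in netboxes
    ]


def availability_response(
    netbox_ids: Iterable[int], fetch: Callable[[int], Optional[float]]
) -> str:
    """Body of a hypothetical JSON endpoint the front-end calls after page load.

    `fetch` is an assumed callable that returns the availability for one netbox.
    """
    return json.dumps({str(netbox_id): fetch(netbox_id) for netbox_id in netbox_ids})
```

The front-end would then fill in the column from the JSON response, possibly requesting only the rows that are currently visible.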

Batched fetches

The Graphite server responds with 400 Bad Request, so this is not simply a timeout issue. The generated request may contain too many metric paths for the Graphite web server to parse. If the request is sent as an HTTP GET, the URL may exceed the server's length limit, and web servers commonly reject or truncate overlong URLs.

Requests like these should be batched, as is already done elsewhere in the code. Instead of sending one mega-request, send batches that fetch maybe 100 metrics each, then combine the results at the end.
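A rough sketch of the batching idea, independent of NAV's real helpers: `chunked` and `fetch_in_batches` are hypothetical names, and `fetch` stands in for whatever function actually queries Graphite for a list of targets and returns a dict keyed by target.

```python
from itertools import islice
from typing import Callable, Dict, Iterable, Iterator, List


def chunked(items: Iterable[str], size: int) -> Iterator[List[str]]:
    """Yield successive lists of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def fetch_in_batches(
    targets: Iterable[str],
    fetch: Callable[[List[str]], Dict[str, float]],
    batch_size: int = 100,
) -> Dict[str, float]:
    """Call `fetch` once per batch and merge the per-batch results."""
    combined: Dict[str, float] = {}
    for batch in chunked(targets, batch_size):
        combined.update(fetch(batch))
    return combined
```

With 2500 targets and a batch size of 100 this issues 25 smaller requests instead of one very large one. The right batch size depends on the metric path lengths and the web server's limits, so 100 is only the starting point suggested above.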

Proper error handling

If the Graphite server does error out, that really shouldn't lead the whole view function to crash. Rather, as is done in other parts of NAV, the availability data should be filled with an error message about Graphite not being reachable or erroring out.

One thing that should be done: verify what happens on the device group page when the Graphite server is unavailable. Is that error handled at all? If so, other types of errors should probably be handled in the same part of the code.
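One way the view could degrade gracefully, sketched under assumptions: the exception class below is a local stand-in for the `GraphiteUnreachableError` named in the traceback, and `availability_or_error` and `fetch` are hypothetical names rather than existing NAV functions.

```python
from typing import Callable, Dict, List


class GraphiteUnreachableError(Exception):
    """Local stand-in for the exception raised in nav/metrics/data.py."""


def availability_or_error(
    netboxes: List[str], fetch: Callable[[List[str]], Dict[str, float]]
) -> Dict[str, object]:
    """Return availability data, or an error message instead of crashing."""
    try:
        return {"availability": fetch(netboxes), "error": None}
    except GraphiteUnreachableError as err:
        # The page can still list the group members and show this message
        # in place of the availability columns.
        return {"availability": {}, "error": "Graphite error: {}".format(err)}
```

The view would pass both keys to the template, which renders the error text in the availability columns when `error` is set.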

@lunkwill42
Member

@johannaengland You might also be able to reproduce the original issue by fuzzing some data into the database. Just having a group of 2500 faked netboxes may still produce the same error.
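Before seeding the database, a quick estimate of how large a single request would get with 2500 members can help confirm the URL-length theory. This sketch assumes the request is a GET against Graphite's `/render` endpoint with one `target` parameter per device. The fake sysnames and the metric path template are invented for illustration and may not match NAV's real naming.

```python
from typing import List
from urllib.parse import urlencode

# Hypothetical metric path template; NAV's real naming may differ.
METRIC_TEMPLATE = "nav.devices.{}.ping.packetLoss"


def fake_sysnames(count: int) -> List[str]:
    """Generate placeholder device names, with dots already escaped."""
    return ["fake-switch-{:04d}_example_org".format(i) for i in range(count)]


def render_url_length(base: str, sysnames: List[str]) -> int:
    """Length of one GET URL that asks for every device's metric at once."""
    params = [("target", METRIC_TEMPLATE.format(name)) for name in sysnames]
    params += [("format", "json"), ("from", "-1week")]
    return len(base + "render?" + urlencode(params))


if __name__ == "__main__":
    for count in (100, 2500):
        length = render_url_length("http://192.168.10.143:8000/", fake_sysnames(count))
        print(count, "devices:", length, "characters")
```

If the length for 2500 devices is far beyond what the web server in front of Graphite accepts, that supports the batching approach.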
