
Upgrade to v3.6-beta2 - migration causes OOM of the database process #13605

Closed
cybarox opened this issue Aug 30, 2023 · 9 comments
Labels
beta Concerns a bug/feature in a beta release severity: medium Results in substantial degraded or broken functionality for specific workflows status: accepted This issue has been accepted for implementation type: bug A confirmed report of unexpected behavior in the application

Comments

@cybarox
Contributor

cybarox commented Aug 30, 2023

NetBox version

v3.5.9

Python version

3.10

Steps to Reproduce

  1. Dump the production database
  2. Start a clean Vagrant VM (2 cores, 4 GB RAM), Ubuntu 22.04
  3. Set up NetBox v3.5.9 per the install instructions
  4. Create the database from the dump (PostgreSQL v14.9)
  5. NetBox functions as normal
  6. Change the repo to beta2
  7. Run sudo /opt/netbox/upgrade.sh
  8. The upgrade gets stuck on the Django dcim migrations and fails because postgresql.service is killed by the OOM killer

We have over 18000 devices in NetBox. Maybe it has something to do with the high number of devices.

Expected Behavior

NetBox upgrades to v3.6-beta2

Observed Behavior

Upgrade fails due to terminated database process:

Operations to perform:
  Apply all migrations: account, admin, auth, circuits, contenttypes, core, dcim, django_rq, extras, ipam, sessions, social_django, taggit, tenancy, users, virtualization, wireless
Running migrations:
  Applying users.0004_netboxgroup_netboxuser... OK
  Applying account.0001_initial... OK
  Applying dcim.0173_remove_napalm_fields... OK
  Applying dcim.0174_device_latitude_device_longitude... OK
  Applying dcim.0174_rack_starting_unit... OK
  Applying dcim.0175_device_oob_ip... OK
  Applying dcim.0176_device_component_counters...Traceback (most recent call last):
  File "/opt/netbox/venv/lib/python3.10/site-packages/django/db/backends/utils.py", line 89, in _execute
    return self.cursor.execute(sql, params)
  File "/opt/netbox/venv/lib/python3.10/site-packages/psycopg/cursor.py", line 737, in execute
    raise ex.with_traceback(None)
psycopg.OperationalError: consuming input failed: EOF detected

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/netbox/venv/lib/python3.10/site-packages/django/db/migrations/executor.py", line 252, in apply_migration
    state = migration.apply(state, schema_editor)
  File "/opt/netbox/venv/lib/python3.10/site-packages/django/db/migrations/migration.py", line 132, in apply
    operation.database_forwards(
  File "/opt/netbox/venv/lib/python3.10/site-packages/django/db/migrations/operations/special.py", line 193, in database_forwards
    self.code(from_state.apps, schema_editor)
  File "/opt/netbox/netbox/dcim/migrations/0176_device_component_counters.py", line 34, in recalculate_device_counts
    Device.objects.bulk_update(devices, [
  File "/opt/netbox/venv/lib/python3.10/site-packages/django/db/models/manager.py", line 87, in manager_method
    return getattr(self.get_queryset(), name)(*args, **kwargs)
  File "/opt/netbox/venv/lib/python3.10/site-packages/django/db/models/query.py", line 892, in bulk_update
    rows_updated += queryset.filter(pk__in=pks).update(**update_kwargs)
  File "/opt/netbox/venv/lib/python3.10/site-packages/django/db/models/query.py", line 1206, in update
    rows = query.get_compiler(self.db).execute_sql(CURSOR)
  File "/opt/netbox/venv/lib/python3.10/site-packages/django/db/models/sql/compiler.py", line 1984, in execute_sql
    cursor = super().execute_sql(result_type)
  File "/opt/netbox/venv/lib/python3.10/site-packages/django/db/models/sql/compiler.py", line 1562, in execute_sql
    cursor.execute(sql, params)
  File "/opt/netbox/venv/lib/python3.10/site-packages/django/db/backends/utils.py", line 67, in execute
    return self._execute_with_wrappers(
  File "/opt/netbox/venv/lib/python3.10/site-packages/django/db/backends/utils.py", line 80, in _execute_with_wrappers
    return executor(sql, params, many, context)
  File "/opt/netbox/venv/lib/python3.10/site-packages/django/db/backends/utils.py", line 84, in _execute
    with self.db.wrap_database_errors:
  File "/opt/netbox/venv/lib/python3.10/site-packages/django/db/utils.py", line 91, in __exit__
    raise dj_exc_value.with_traceback(traceback) from exc_value
  File "/opt/netbox/venv/lib/python3.10/site-packages/django/db/backends/utils.py", line 89, in _execute
    return self.cursor.execute(sql, params)
  File "/opt/netbox/venv/lib/python3.10/site-packages/psycopg/cursor.py", line 737, in execute
    raise ex.with_traceback(None)
django.db.utils.OperationalError: consuming input failed: EOF detected

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/netbox/venv/lib/python3.10/site-packages/django/db/backends/base/base.py", line 289, in ensure_connection
    self.connect()
  File "/opt/netbox/venv/lib/python3.10/site-packages/django/utils/asyncio.py", line 26, in inner
    return func(*args, **kwargs)
  File "/opt/netbox/venv/lib/python3.10/site-packages/django/db/backends/base/base.py", line 270, in connect
    self.connection = self.get_new_connection(conn_params)
  File "/opt/netbox/venv/lib/python3.10/site-packages/django/utils/asyncio.py", line 26, in inner
    return func(*args, **kwargs)
  File "/opt/netbox/venv/lib/python3.10/site-packages/django/db/backends/postgresql/base.py", line 275, in get_new_connection
    connection = self.Database.connect(**conn_params)
  File "/opt/netbox/venv/lib/python3.10/site-packages/psycopg/connection.py", line 729, in connect
    raise ex.with_traceback(None)
psycopg.OperationalError: connection failed: Connection refused
	Is the server running on that host and accepting TCP/IP connections?

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/netbox/netbox/manage.py", line 10, in <module>
    execute_from_command_line(sys.argv)
  File "/opt/netbox/venv/lib/python3.10/site-packages/django/core/management/__init__.py", line 442, in execute_from_command_line
    utility.execute()
  File "/opt/netbox/venv/lib/python3.10/site-packages/django/core/management/__init__.py", line 436, in execute
    self.fetch_command(subcommand).run_from_argv(self.argv)
  File "/opt/netbox/venv/lib/python3.10/site-packages/django/core/management/base.py", line 412, in run_from_argv
    self.execute(*args, **cmd_options)
  File "/opt/netbox/venv/lib/python3.10/site-packages/django/core/management/base.py", line 458, in execute
    output = self.handle(*args, **options)
  File "/opt/netbox/venv/lib/python3.10/site-packages/django/core/management/base.py", line 106, in wrapper
    res = handle_func(*args, **kwargs)
  File "/opt/netbox/venv/lib/python3.10/site-packages/django/core/management/commands/migrate.py", line 356, in handle
    post_migrate_state = executor.migrate(
  File "/opt/netbox/venv/lib/python3.10/site-packages/django/db/migrations/executor.py", line 135, in migrate
    state = self._migrate_all_forwards(
  File "/opt/netbox/venv/lib/python3.10/site-packages/django/db/migrations/executor.py", line 167, in _migrate_all_forwards
    state = self.apply_migration(
  File "/opt/netbox/venv/lib/python3.10/site-packages/django/db/migrations/executor.py", line 249, in apply_migration
    with self.connection.schema_editor(
  File "/opt/netbox/venv/lib/python3.10/site-packages/django/db/backends/base/schema.py", line 168, in __exit__
    self.atomic.__exit__(exc_type, exc_value, traceback)
  File "/opt/netbox/venv/lib/python3.10/site-packages/django/db/transaction.py", line 307, in __exit__
    connection.set_autocommit(True)
  File "/opt/netbox/venv/lib/python3.10/site-packages/django/db/backends/base/base.py", line 483, in set_autocommit
    self.ensure_connection()
  File "/opt/netbox/venv/lib/python3.10/site-packages/django/utils/asyncio.py", line 26, in inner
    return func(*args, **kwargs)
  File "/opt/netbox/venv/lib/python3.10/site-packages/django/db/backends/base/base.py", line 288, in ensure_connection
    with self.wrap_database_errors:
  File "/opt/netbox/venv/lib/python3.10/site-packages/django/db/utils.py", line 91, in __exit__
    raise dj_exc_value.with_traceback(traceback) from exc_value
  File "/opt/netbox/venv/lib/python3.10/site-packages/django/db/backends/base/base.py", line 289, in ensure_connection
    self.connect()
  File "/opt/netbox/venv/lib/python3.10/site-packages/django/utils/asyncio.py", line 26, in inner
    return func(*args, **kwargs)
  File "/opt/netbox/venv/lib/python3.10/site-packages/django/db/backends/base/base.py", line 270, in connect
    self.connection = self.get_new_connection(conn_params)
  File "/opt/netbox/venv/lib/python3.10/site-packages/django/utils/asyncio.py", line 26, in inner
    return func(*args, **kwargs)
  File "/opt/netbox/venv/lib/python3.10/site-packages/django/db/backends/postgresql/base.py", line 275, in get_new_connection
    connection = self.Database.connect(**conn_params)
  File "/opt/netbox/venv/lib/python3.10/site-packages/psycopg/connection.py", line 729, in connect
    raise ex.with_traceback(None)
django.db.utils.OperationalError: connection failed: Connection refused
	Is the server running on that host and accepting TCP/IP connections?
@cybarox cybarox added the type: bug A confirmed report of unexpected behavior in the application label Aug 30, 2023
@abhi1693
Member

Thank you for opening a bug report. I was unable to reproduce the reported behavior on NetBox v3.6-beta2. Please re-confirm the reported behavior on the current stable release and adjust your post above as necessary. Remember to provide detailed steps that someone else can follow using a clean installation of NetBox to reproduce the issue. Remember to include the steps taken to create any initial objects or other data.

@abhi1693 abhi1693 added the status: revisions needed This issue requires additional information to be actionable label Aug 30, 2023
@cybarox
Contributor Author

cybarox commented Aug 30, 2023

I can understand that the steps to reproduce the error are not very meaningful. With the demo data, the migration also works without problems. Our database has over 18k devices, 90k interfaces, 19k front ports, 16k rear ports, and 3k console ports. The DB dump is around 450 MB. I have no idea how to reproduce this.

@abhi1693
Member

It's most likely a configuration issue with your PostgreSQL server. It's possible that it's killing the session when it runs beyond a certain time, maybe 5 minutes.

@cybarox
Contributor Author

cybarox commented Aug 30, 2023

It is not a PostgreSQL issue; the process is killed by the oom-killer:

[ 2529.555076] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=system-postgresql.slice,mems_allowed=0,global_oom,task_memcg=/system.slice/system-postgresql.slice/[email protected],task=postgres,pid=5310,uid=114
[ 2529.555086] Out of memory: Killed process 5310 (postgres) total-vm:5199092kB, anon-rss:3583912kB, file-rss:2580kB, shmem-rss:5500kB, UID:114 pgtables:9416kB oom_score_adj:0

See also:
https://netdev-community.slack.com/files/U01Q2UCRPRP/F05Q42F3L4V/image.png

@abhi1693
Member

I believe it's still a configuration issue if your VM doesn't have the resources to process the data. However, I'll leave this for another maintainer to take a look in case they find an optimisation for the counter migration.

@abhi1693 abhi1693 added status: under review Further discussion is needed to determine this issue's scope and/or implementation beta Concerns a bug/feature in a beta release and removed status: revisions needed This issue requires additional information to be actionable labels Aug 30, 2023
@jeremystretch
Member

This can probably be addressed by defining a batch size for the bulk update operation in migration 0176_device_component_counters.
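Defining a batch size caps the size of any single UPDATE statement: Django's `bulk_update()` accepts a `batch_size` argument for exactly this. A minimal sketch of the batching idea (the `chunked` helper and the batch size of 100 are illustrative assumptions, not necessarily the values used in the fix):

```python
# Why an unbatched bulk_update() can exhaust memory: Django compiles ONE
# UPDATE statement with a CASE WHEN branch per object, so ~18k devices
# produce one enormous statement for the server to parse and plan.
# Passing batch_size=N instead splits the work into ceil(len(objs)/N)
# bounded statements.

def chunked(objects, batch_size):
    """Yield successive slices of at most batch_size items."""
    for start in range(0, len(objects), batch_size):
        yield objects[start:start + batch_size]

# In the migration this corresponds to something like (names per the
# traceback above; field list elided, batch size illustrative only):
#
#   Device.objects.bulk_update(devices, [...], batch_size=100)

# Demonstration with stand-in objects:
batches = list(chunked(list(range(18000)), 100))
```

With a batch size of 100, 18,000 devices become 180 bounded UPDATE statements rather than one statement carrying a CASE branch for every device.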

@jeremystretch jeremystretch self-assigned this Aug 30, 2023
@jeremystretch jeremystretch added status: accepted This issue has been accepted for implementation severity: medium Results in substantial degraded or broken functionality for specific workflows and removed status: under review Further discussion is needed to determine this issue's scope and/or implementation labels Aug 30, 2023
@jeremystretch
Member

@cybarox are you able to test the migration using the 13605-optimize-migration branch I just created? (git checkout 13605-optimize-migration and run upgrade.sh again)

@cybarox
Contributor Author

cybarox commented Aug 30, 2023

The batch_size value solved the problem. All migrations were applied successfully. Thank you @jeremystretch !

@jeremystretch
Member

Excellent! Thanks for the quick confirmation @cybarox.

jeremystretch added a commit that referenced this issue Aug 30, 2023
* Specify batch size for cached counter migrations

* Remove list() casting of querysets
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 29, 2023