Timeout migrations that take too long to run (#11704)
dstufft authored Jun 28, 2022
1 parent 6d39d8d commit 39daea1
Showing 3 changed files with 49 additions and 1 deletion.
30 changes: 30 additions & 0 deletions docs/development/database-migrations.rst
@@ -25,3 +25,33 @@
them in over time (for example, to rename a column you must add the column in
one migration + start writing to that column/reading from both, then you must
make a migration that backfills all of the data, then switch the code to stop
using the old column altogether, then finally you can remove the old column).
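The phased column rename described above can be sketched against an in-memory SQLite database (a stand-in used here only so the sketch is self-contained; PyPI runs PostgreSQL, and the table and column names are hypothetical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO users (name) VALUES (?)", [("alice",), ("bob",)])

# Migration 1: add the new column. Application code then starts writing to
# both columns and reading from whichever is populated.
conn.execute("ALTER TABLE users ADD COLUMN full_name TEXT")

# Migration 2 (a later deploy): backfill rows written before migration 1.
conn.execute("UPDATE users SET full_name = name WHERE full_name IS NULL")

# After another deploy switches the code to use only full_name, a final
# migration would drop the old column:
#     ALTER TABLE users DROP COLUMN name

print(conn.execute("SELECT full_name FROM users ORDER BY id").fetchall())
# → [('alice',), ('bob',)]
```

Each step is backwards compatible with the code running on either side of it, which is what lets the deploys proceed without downtime.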

To help prevent an accidentally long-running migration from taking down
PyPI, by default a migration will time out if it waits more than 4s to
acquire a lock, or if any individual statement takes more than 5s.

The lock timeout protects against the case where a long-running app query
blocks the migration, and the migration in turn ends up blocking
short-running app queries that would otherwise have been able to run
concurrently with the long-running query.

The statement timeout protects against any single statement (often part of
a data migration) holding locks on the database for an extended period of time.
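As an illustration of what a statement timeout buys you, here is a rough client-side analogue built on sqlite3's progress handler (an assumption-laden sketch: PostgreSQL enforces ``statement_timeout`` server-side, and the helper name and values here are invented for the example):

```python
import sqlite3
import time

def execute_with_timeout(conn, sql, timeout_s):
    """Run a statement, aborting it if it exceeds timeout_s seconds.

    A rough analogue of PostgreSQL's statement_timeout: sqlite3 invokes the
    progress handler periodically while a statement runs, and a truthy
    return value interrupts the statement with OperationalError.
    """
    deadline = time.monotonic() + timeout_s
    conn.set_progress_handler(
        lambda: 1 if time.monotonic() > deadline else 0, 1000
    )
    try:
        return conn.execute(sql).fetchall()
    finally:
        conn.set_progress_handler(None, 0)

conn = sqlite3.connect(":memory:")

# A fast statement completes normally.
print(execute_with_timeout(conn, "SELECT 1", timeout_s=1.0))  # → [(1,)]

# A deliberately slow query (counting to ten million) gets interrupted
# instead of holding the connection for seconds.
slow = (
    "WITH RECURSIVE c(n) AS "
    "(SELECT 1 UNION ALL SELECT n + 1 FROM c WHERE n < 10000000) "
    "SELECT count(*) FROM c"
)
try:
    execute_with_timeout(conn, slow, timeout_s=0.05)
except sqlite3.OperationalError:
    print("statement timed out")
```

With Postgres none of this client-side machinery is needed; issuing ``SET statement_timeout`` on the connection is enough, which is exactly what the migration environment below does.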

It is possible to override these values inside of a migration. To do so,
add the following to your migration:

.. code-block:: python

    op.execute("SET statement_timeout = 5000")
    op.execute("SET lock_timeout = 4000")
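A minimal sketch of an ``upgrade()`` that raises both limits before running a known-slow backfill. The ``_FakeOp`` class is a tiny stand-in for alembic's ``op`` so the sketch runs standalone (a real migration would use ``from alembic import op``), and the timeout values are hypothetical:

```python
class _FakeOp:
    """Stand-in for alembic's `op` that just records executed statements."""

    def __init__(self):
        self.statements = []

    def execute(self, sql):
        self.statements.append(sql)

op = _FakeOp()

def upgrade():
    # Hypothetical values: allow a known-slow data migration 5 minutes per
    # statement, while still failing fast (10s) if a lock cannot be acquired.
    # The SET applies to this migration's connection, not the whole database.
    op.execute("SET statement_timeout = 300000")
    op.execute("SET lock_timeout = 10000")
    # ... the slow backfill statements would follow here ...

upgrade()
print(op.statements)
```

Keeping the overrides inside the migration means they expire with its connection, so the conservative defaults still apply to every other migration.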


For more information on which kinds of operations are safe in a
high-availability environment like PyPI, see:

- `PostgreSQL at Scale: Database Schema Changes Without Downtime <https://medium.com/paypal-tech/postgresql-at-scale-database-schema-changes-without-downtime-20d3749ed680>`_
- `Move fast and migrate things: how we automated migrations in Postgres <https://benchling.engineering/move-fast-and-migrate-things-how-we-automated-migrations-in-postgres-d60aba0fc3d4>`_
- `PgHaMigrations <https://github.com/braintree/pg_ha_migrations>`_
3 changes: 3 additions & 0 deletions warehouse/migrations/env.py
@@ -50,6 +50,9 @@
def run_migrations_online():
connectable = create_engine(url, poolclass=pool.NullPool)

with connectable.connect() as connection:
connection.execute("SET statement_timeout = 5000")
connection.execute("SET lock_timeout = 4000")

context.configure(
connection=connection,
target_metadata=db.metadata,
17 changes: 16 additions & 1 deletion warehouse/migrations/script.py.mako
@@ -17,8 +17,9 @@
Revises: ${down_revision}
Create Date: ${create_date}
"""

from alembic import op
import sqlalchemy as sa

from alembic import op
${imports if imports else ""}

revision = ${repr(up_revision)}
@@ -32,6 +33,20 @@
down_revision = ${repr(down_revision)}
# up and running. Thus backwards incompatible changes must be broken up
# over multiple migrations inside of multiple pull requests in order to
# phase them in over multiple deploys.
#
# By default, migrations cannot wait more than 4s on acquiring a lock
# and each individual statement cannot take more than 5s. This helps
# prevent situations where a slow migration takes the entire site down.
#
# If you need to change these timeouts for a migration, you can do so
# by adding:
#
#     op.execute("SET statement_timeout = 5000")
#     op.execute("SET lock_timeout = 4000")
#
# to your migration, using whatever values are reasonable for it.


def upgrade():
${upgrades if upgrades else "pass"}
