-
Notifications
You must be signed in to change notification settings - Fork 1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Replace SQLAlchemy Migrate with Alembic #13108
Conversation
17622a1
to
180ead0
Compare
f557222
to
c5e699f
Compare
I will test this against test.galaxyproject.org and then merge |
@mvdbeek conflicts resolved. |
In its current form this is also a breaking change for the galaxy role:
obviously this can be adapted, but to merge this into 22.01 I think we'd need some time to prepare and work out the up/downgrade. |
This works nicely in my hand. @ic4f can you please resolve the merge conflicts and then I think we should get this in. |
@bgruening Thank you! I'll rebase today. |
If the database is combined, running `./manage_db.sh upgrade` should upgrade BOTH models: gxy and tsi. However, if tsi uses a separate database, only the galaxy model should be updated. Also, see this for context: galaxyproject#13108 (comment) See inline comments for more details. The new tests are examples of new behavior. All refactoring is directly related to the changes in this commit.
If the database is combined, running `./manage_db.sh upgrade` should upgrade BOTH models: gxy and tsi. However, if tsi uses a separate database, only the galaxy model should be updated. Also, see this for context: galaxyproject#13108 (comment) See inline comments for more details. The new tests are examples of new behavior. All refactoring is directly related to the changes in this commit.
If the database is combined, running `./manage_db.sh upgrade` should upgrade BOTH models: gxy and tsi. However, if tsi uses a separate database, only the galaxy model should be updated. Also, see this for context: galaxyproject#13108 (comment) See inline comments for more details. The new tests are examples of new behavior. All refactoring is directly related to the changes in this commit.
If the database is combined, running `./manage_db.sh upgrade` should upgrade BOTH models: gxy and tsi. However, if tsi uses a separate database, only the galaxy model should be updated. Also, see this for context: galaxyproject#13108 (comment) See inline comments for more details. The new tests are examples of new behavior. All refactoring is directly related to the changes in this commit.
If the database is combined, running `./manage_db.sh upgrade` should upgrade BOTH models: gxy and tsi. However, if tsi uses a separate database, only the galaxy model should be updated. Also, see this for context: galaxyproject#13108 (comment) See inline comments for more details. The new tests are examples of new behavior. All refactoring is directly related to the changes in this commit.
If the database is combined, running `./manage_db.sh upgrade` should upgrade BOTH models: gxy and tsi. However, if tsi uses a separate database, only the galaxy model should be updated. Also, see this for context: galaxyproject#13108 (comment) See inline comments for more details. The new tests are examples of new behavior. All refactoring is directly related to the changes in this commit.
If the database is combined, running `./manage_db.sh upgrade` should upgrade BOTH models: gxy and tsi. However, if tsi uses a separate database, only the galaxy model should be updated. Also, see this for context: galaxyproject#13108 (comment) See inline comments for more details. The new tests are examples of new behavior. All refactoring is directly related to the changes in this commit.
If the database is combined, running `./manage_db.sh upgrade` should upgrade BOTH models: gxy and tsi. However, if tsi uses a separate database, only the galaxy model should be updated. Also, see this for context: galaxyproject#13108 (comment) See inline comments for more details. The new tests are examples of new behavior. All refactoring is directly related to the changes in this commit.
If the database is combined, running `./manage_db.sh upgrade` should upgrade BOTH models: gxy and tsi. However, if tsi uses a separate database, only the galaxy model should be updated. Also, see this for context: galaxyproject#13108 (comment) See inline comments for more details. The new tests are examples of new behavior. All refactoring is directly related to the changes in this commit.
If the database is combined, running `./manage_db.sh upgrade` should upgrade BOTH models: gxy and tsi. However, if tsi uses a separate database, only the galaxy model should be updated. Also, see this for context: galaxyproject#13108 (comment) See inline comments for more details. The new tests are examples of new behavior. All refactoring is directly related to the changes in this commit.
If the database is combined, running `./manage_db.sh upgrade` should upgrade BOTH models: gxy and tsi. However, if tsi uses a separate database, only the galaxy model should be updated. Also, see this for context: galaxyproject#13108 (comment) See inline comments for more details. The new tests are examples of new behavior. All refactoring is directly related to the changes in this commit.
If the database is combined, running `./manage_db.sh upgrade` should upgrade BOTH models: gxy and tsi. However, if tsi uses a separate database, only the galaxy model should be updated. Also, see this for context: galaxyproject#13108 (comment) See inline comments for more details. The new tests are examples of new behavior. All refactoring is directly related to the changes in this commit.
If the database is combined, running `./manage_db.sh upgrade` should upgrade BOTH models: gxy and tsi. However, if tsi uses a separate database, only the galaxy model should be updated. Also, see this for context: galaxyproject#13108 (comment) See inline comments for more details. The new tests are examples of new behavior. All refactoring is directly related to the changes in this commit.
If the database is combined, running `./manage_db.sh upgrade` should upgrade BOTH models: gxy and tsi. However, if tsi uses a separate database, only the galaxy model should be updated. Also, see this for context: galaxyproject#13108 (comment) See inline comments for more details. The new tests are examples of new behavior. All refactoring is directly related to the changes in this commit.
If the database is combined, running `./manage_db.sh upgrade` should upgrade BOTH models: gxy and tsi. However, if tsi uses a separate database, only the galaxy model should be updated. Also, see this for context: galaxyproject#13108 (comment) See inline comments for more details. The new tests are examples of new behavior. All refactoring is directly related to the changes in this commit.
If the database is combined, running `./manage_db.sh upgrade` should upgrade BOTH models: gxy and tsi. However, if tsi uses a separate database, only the galaxy model should be updated. Also, see this for context: galaxyproject#13108 (comment) See inline comments for more details. The new tests are examples of new behavior. All refactoring is directly related to the changes in this commit.
If the database is combined, running `./manage_db.sh upgrade` should upgrade BOTH models: gxy and tsi. However, if tsi uses a separate database, only the galaxy model should be updated. Also, see this for context: galaxyproject#13108 (comment) See inline comments for more details. The new tests are examples of new behavior. All refactoring is directly related to the changes in this commit.
If the database is combined, running `./manage_db.sh upgrade` should upgrade BOTH models: gxy and tsi. However, if tsi uses a separate database, only the galaxy model should be updated. Also, see this for context: galaxyproject#13108 (comment) See inline comments for more details. The new tests are examples of new behavior. All refactoring is directly related to the changes in this commit.
This is superseded by #13513. The new branch has a different sequence of commits: (1) some of the old commits have been rearranged and/or squashed, and (2) new commits have been added after a rebase as a manual merge - both as steps in resolving the conflicts above. This branch preserves the original commit sequence, PR description and code review/discussion. |
NOTE: This branch has been superseded by #13513. However, the description of the migration system provided with this PR is accurate and can be used as implementation reference for #13513.
Ref #10369.
Supersedes #12869.
This is a clean branch with a cleaned-up commit history; some commits are large, but all changes are logically grouped together. (the commit history in #12869 shows most of the previous iterations)
Migrations system overview
Galaxy's data model is split into the galaxy model and the install model. These two models may be persisted in one combined database or two separate databases. To accommodate this setup, as well as a seamless transition from a combined database to separate databases and vice-versa, the proposed new Alembic-based migrations system is built on top of branches.
A branch, in this context, is a versioning lineage (or migration stream) that starts at a common root (base revision) and represents some part of the data model. These branches are permanent, are identified by labels, and may share the same Alembic version table (if they share the same database; otherwise, each database has its own version table).
We use 2 branches:
gxy
(galaxy) andtsi
(tool shed install). These IDs were selected for their brevity and visual consistency: they are referenced throughout the system as version directory names and branch labels (by Alembic), and model identifiers. In addition, the term "install" in "install model" may be ambiguous in some contexts (it's easy to mistake it for a verb, etc.).Each branch has its own version history, represented by revision scripts located in the branch version directory (e.g. revisions for the galaxy model are located in the
gxy
version directory, whereas revisions for the install model are located in thetsi
version directory). The version directories contain labeled base revisions, which were created with the corresponding branch labels. When we create a new revision, we only need to indicate the branch label: the system will know where to place the new revision file.Alembic requires its own configuration file (
alembic.ini
); we use it for required settings only: the location of the alembic script and the version directories, and the default Alembic logging configuration. All other configuration information, including how the two models are persisted, is programmatically set at run time based on Galaxy's own configuration, which remains the one source of truth. There are no changes to how we indicate separate vs. combined database storage: if theinstall_database_connection
option is set to a value other than the value ofdatabase_connection
, two databases will be used instead of a combined database. If the setting is changed for an existing database, our system will know how to handle this (see tests for such scenarios).See relevant Alembic documentation:
https://alembic.sqlalchemy.org/en/latest/branches.html#working-with-branches
and especially here:
https://alembic.sqlalchemy.org/en/latest/branches.html#working-with-explicit-branches
https://alembic.sqlalchemy.org/en/latest/branches.html#working-with-branch-labels
https://alembic.sqlalchemy.org/en/latest/branches.html#working-with-multiple-bases
Directory structure
versions_*
: version directories, containing revision scripts; filenames include a GUID-based revision identifier, assigned by Alembic. Ordering is set via directives inside the revision scripts (down_revision
)env.py
: invoked by Alembic whenever a command is executed. Its primary purpose is to determine the target database (based on the indicated branch) and configure the migration environment.script.py.mako
: the template file used to generate revisions.README.md
: basic user documentation on how to use the new migrations system.__init__.py
: system core (see Main classes and entry points).alembic.ini
: Alembic configuration filescripts.py
: helpers used for invoking Alembic via script (create_db/migrate_db
)Main classes and entry points
verify_databases
,verify_databases_via_script
: entry pointsAlembicManager
: handles Alembic interactions with one database; serves as a facade to a subset of Alembic commands.DatabaseStateCache
: holds a snapshot of database state. This is an abstraction layer that simplifies the code and reduces (1 vs. 4) the number of calls to the database.DatabaseStateVerifier
: executes the core migration algorithm.Migration algorithm
When invoked via
create_db.sh
script or as part of Galaxy's startup process (inconfig/__init__.py
), the system follows the same algorithm. (Unlikecreate_db.sh
, themigrate_db.sh
script is a thin wrapper around Alembic, so it does not verify the database state; it determines the target branch, retrieves relevant configuration values and invokes Alembic.)We start at one of the two entry points; then, subsequently, create and run a
DatabaseStateVerifier
for thegxy
model, then for thetsi
model. The system checks if the database exists and is non-empty, and then determines whether it is up-to-date, and if not, whether it can be upgraded automatically. The algorithm tries to minimize database calls by using a cache (DatabaseStateCache
) and by taking shortcuts when possible (e.g.is_new_database
will be set if while processing the first model a new or an empty database was initialized).The algorithm makes "an educated guess" only in one case. If within an existing, nonempty database that is combined, the
gxy
model is up-to-date, but thetsi
model is not, we cannot determine the state of thetsi
model tables with absolute certainty. Such an edge might happen for several reasons (see inline comments inTestPartiallyUpgradedCombinedDatabase
for a detailed description). We cannot determine the state of thistsi
model because we do not store the SQLAlchemy Migrate version of the install model in a combined database. So, the reasoning goes like this: if there areno
tsi
tables in the database, we simply treat it as a new install; but if there is at least one table, we assume it is the same version asgxy
(ifgxy
was manually upgraded, we assume the same fortsi
). I think, there's nothing else we can reasonably do, other than fail unconditionally, which seems unreasonable in this case.The diagram below describes the algorithm.
(UPDATED on 8/21/22: same logic; better image from GCC 2022 talk)
(UPDATED ON 2/18/22; see changes to the paths "has alembic table? > no" and "is sqlalchemy migrate version latest? > yes")
Previous version:
Misc. implementation notes:
Old migration scripts have been discarded: we either start from an empty database, or from a database that has been upgraded to the last SQLAlchemy Migration (version 180), at which point an upgrade can be ran automatically.
We don't automatically upgrade the database; we do so only if the
database_auto_migrate
option has been set`. Otherwise, the systems raises an error with an informative message.SQLAlchemy Migrate has not been deleted from Galaxy's Python dependencies list because it is still used for Tool Shed. However, SQLAlchemy Migrate is not compatible with SQLAlchemy 2.0, so moving Tool Shed to Alembic as well seems a reasonable step, given Galaxy's upcoming migration to SQLAlchemy 2.0. Meanwhile, as a temporary fix, I have created TS-specific scripts to create and migrate the Tool Shed databse using SQLAlchemy Migrate.
The
GALAXY_ALLOW_FUTURE_DATABASE
setting has been removed. Revision ordering is no longer determined by a sequence of ascending integers, so determining whether a given revision head stored in the database is a "future" version is no longer possible (if it's more than one version removed from the head in the code). Since this was used for development instances only, an alternative solution is to simply setcheck_migrate_databases=False
. Another alternative is to keep the new revision scripts in the versions directories, so that the system can identify the head stored in the database as a valid revision. Although, I think, if one is knowingly running a database whose version is out of sync with the code base, verifying the database state programmatically makes little sense, so just unsetcheck_migrate_databases
.Discarded approaches
I considered multiple alternatives, primarily these:
--name
argument to distiguish between the two(ref: https://alembic.sqlalchemy.org/en/latest/cookbook.html#run-multiple-alembic-environments-from-one-ini-file)
Reasons for discarding (apply to some or all of the discarded options):
env.py
and onemako
template (with 2 alembic directories, we get duplicatealembic.ini
as well)--name
argument, so for a combined database would need a 3 values: galaxy, install, both - which doesn't feel right for multiple reasonsDiscarded migration algorithm
In the current version, the gxy and tsi models are verified in sequence. In a previous version, they were verified at the same time. That saved a few trips to the database, but the logic was much less straightforward, the design was awkward, and adding/removing additional models/databases would have required rewriting the algorithm. It wasn't a good design and the minimal optimization gains, in my opinion, did not justify the lack of clarity and flexibility.
https://github.com/ic4f/galaxy/blob/7a34c3388feaef05b2d0de73844ddd855743cfcc/lib/galaxy/model/migrations/__init__.py
Usage
See README.md
Testing
test/data/model/migrations
.9c96664
Following is a description of the main parts of the testing infrastructure.
conftest.py
Contains the following:
Used to compose representations of database state. Example:
The object above is composed of 5 tables defined in fixture factories.
These MetaData objects are used to load a database with a given state. For example, the following test function would load this metadata into the provided database:
The database states are generated in fixtures (see bottom of
test_migrations.py
+ inline comments for more details).Each state has 3 versions, distinguished by suffix:
_gxy
(galaxy database that holds gxy* and migration version tables)_tsi
(tool_shed_install database that holds tsi* and migration version tables)_combined
(combined database that holds gxy*, tsi*, and migration version tables)The following states are represented:
(Most ancient state)
(Oldest state versioned with SQLAlchemy Migrate)
(Last (most recent) state versioned with SQLAlchemy Migrate)
(Oldest state versioned with Alembic)
(Oldest state versioned with Alembic that does not include the SQLAchemy Migrate version table)
(Most recent state versioned with Alembic. This is the current state)
(State 0 assumes an empty database, so it needs no state fixtures.)
common.py
Contains helpers (context managers and fixtures) for generating database connection URLs, creating and dropping databases, and creating self-disposing engines. Handles both sqlite and postgresql databases. (mysql not supported yet)
versions/
Contains revision scripts used in some of the tests.
test_migrations.py
Contains:
common.py
How to test the changes?
Following is a testing plan I wrote for myself and followed to manually test the system. It doesn't have to be repeated in its entirety.
To create a revision:
./migrate_db.sh revision --head=gxy@head -m "revision message..."
(use "gxy" or "tsi")(for examples, see
test/data/model/migrations/versions/db1
)For upgrades/downgrades: also try: append
--sql
to do a dry runTo upgrade to head:
./migrate_db.sh upgrade gxy@head
To upgrade/downgrade to a specific revision:
./migrate_db.sh upgrade rev#
(for rev# use several starting chars from the revision id)1. Initial state: no database, separate database setup
Prep: delete old databases, delete old revisions, edit galaxy.yml
1.1. gxy, sqlite database
upgrade
gxy@headupgrade
to gxy-rev2 (use revision identifier)upgrade
to gxy-rev3 (use head)downgrade
to gxy-rev1 (use revision identifier)downgrade
to gxy-base (use base)current
: should see nothingupgrade
: use alembic relative upgrade: gxy@+2downgrade
: use alembic relative downgrade: gxy@-21.2. tsi, sqlite database
upgrade heads
upgrade
to tsi-rev2 (use +1)upgrade
to tsi-rev3 (use head)downgrade
tsi@-1downgrade
to tsi@base1.3. repeat steps 1.1, 1.2 with postgresql
2. Initial state: no database, combined databases setup
Prep: delete old databases, delete old revisions, edit galaxy.yml
2.1. gxy, sqlite
2.2. tsi, sqlite
2.3. repeat steps 2.1, 2.2 with postgresql
3. Initial state: old database (create state in another git branch), separate databases setup
3.1. sqlite, run Galaxy:
3.2. postgres, run Galaxy:
4. Initial state: old database (create state in another git branch), separate databases setup, sqlite, gxy is with alembic and current
Run Galaxy:
5. Initial state: old database (create state in another git branch), combined database, sqlite
Run Galaxy:
6. Initial state: old database (create state in another git branch), combined database, postgres
Run Galaxy:
License