Merge pull request #1153 from facebookresearch/data-porter-feature
Add Data Porter feature
meta-paul authored May 6, 2024
2 parents 5dbe02d + 09bcff1 commit f58ca7c
Showing 72 changed files with 7,306 additions and 940 deletions.
117 changes: 117 additions & 0 deletions docs/web/docs/guides/how_to_contribute/db_migrations.md
@@ -0,0 +1,117 @@
---

# Copyright (c) Meta Platforms and its affiliates.
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.

sidebar_position: 4
---

# Database migrations

## Overview

We currently do not use any dedicated framework for updating the Mephisto database or provider-specific datastores.
Instead, migrations work as follows:

1. Each database should have a `migrations` table, where all applied and failed migrations are recorded.
2. Every run of any Mephisto command automatically attempts to apply unapplied migrations.
3. Each migration is a Python module that contains one constant (a raw SQL query string).
4. After adding a migration, its constant must be imported and added to the migrations dict
under a readable name (dict key) that will be used in the `migrations` table.
5. Any database implementation must call the `apply_migrations` function in its `init_tables` method (after creating all tables).
NOTE: Migrations must be applied before creating DB indices, as migrations may erase indices without restoring them.
6. When a migration fails, a log message is printed to the console.
The error is also written to the `migrations` table in the `error_message` column, with status `"errored"`.
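
The bookkeeping described in this list can be illustrated with a minimal sketch. It is not Mephisto's actual `apply_migrations` implementation: the function name `apply_migrations_sketch`, the table layout, and the `"completed"` status are assumptions made for illustration. Only the `migrations` table name, the `error_message` column, and the `"errored"` status come from the description above.

```python
import sqlite3

# Hypothetical bookkeeping table. Only the table name, the `error_message`
# column and the "errored" status are taken from the text; the rest is assumed.
CREATE_MIGRATIONS_TABLE = """
CREATE TABLE IF NOT EXISTS migrations (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    name TEXT NOT NULL,
    status TEXT NOT NULL,
    error_message TEXT
);
"""


def apply_migrations_sketch(conn: sqlite3.Connection, migrations: dict) -> None:
    """Apply every migration that has not been completed yet and record the outcome."""
    conn.execute(CREATE_MIGRATIONS_TABLE)
    completed = {
        row[0]
        for row in conn.execute("SELECT name FROM migrations WHERE status = 'completed'")
    }
    for name, sql in migrations.items():
        if name in completed:
            continue
        try:
            conn.executescript(sql)
            status, error = "completed", None
        except sqlite3.Error as e:
            # Failed migrations are reported in the console and recorded in the table.
            print(f"Migration '{name}' failed: {e}")
            status, error = "errored", str(e)
        conn.execute(
            "INSERT INTO migrations (name, status, error_message) VALUES (?, ?, ?)",
            (name, status, error),
        )
        conn.commit()


# Demo on an in-memory database with one valid and one deliberately broken migration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tasks (task_id INTEGER PRIMARY KEY, task_name TEXT)")
demo_migrations = {
    "20240101_add__is_archived__in__tasks": "ALTER TABLE tasks ADD COLUMN is_archived INTEGER DEFAULT 0;",
    "20240102_broken_example": "ALTER TABLE missing_table ADD COLUMN x INTEGER;",
}
apply_migrations_sketch(conn, demo_migrations)
statuses = dict(conn.execute("SELECT name, status FROM migrations").fetchall())
```

In this sketch a failed migration does not stop the remaining ones from running; whether Mephisto behaves the same way should be checked against the real `apply_migrations` code.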

## Details

Let's see how exactly DB migrations should be created.

We'll use the Mephisto DB as an example; the same set of steps is used for provider-specific databases.

### Add migration package

To add a new migration package, follow these steps:

1. Create a Python package `migrations` next to `mephisto/abstractions/databases/local_database.py`.
2. Create a migration module in that package, e.g. `_001_20240101_add__column_name__in__table_name.py`.
Note the leading underscore: Python does not allow importing modules whose names start with a digit.
3. Populate the module with a SQL query constant:
```python
# <copyright notice>

"""
This migration introduces the following changes:
- ...
"""

MY_SQL_MIGRATION_QUERY_NAME = """
<SQL query>
"""
```
4. Include this SQL query constant in `__init__.py` module (located next to the migration module):
```python
# <copyright notice>
from ._001_20240101_add__column_name__in__table_name import *


migrations = {
    "20240101_add__column_name__in__table_name": MY_SQL_MIGRATION_QUERY_NAME,
}
```

5. Note that, for now, only forward migrations are supported.
If you do need a backward migration, add it as a new forward migration that undoes the undesired changes.
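
As an illustration of the last point, here is a small sketch of a "backward" migration written as a second forward migration. The table name, constant names, and dict keys are hypothetical, and the loop at the end is only a stand-in for the real migration runner.

```python
import sqlite3

# Hypothetical migration constants; in the layout described above, each one
# would live in its own module inside the `migrations` package.
ADD_NOTES_TABLE = """
CREATE TABLE IF NOT EXISTS notes (
    note_id INTEGER PRIMARY KEY,
    text TEXT
);
"""

# The "backward" migration is simply a newer forward migration that undoes the first.
REMOVE_NOTES_TABLE = """
DROP TABLE IF EXISTS notes;
"""

migrations = {
    "20240101_add__notes__table": ADD_NOTES_TABLE,
    "20240201_remove__notes__table": REMOVE_NOTES_TABLE,
}

# Stand-in for the migration runner: apply queries in dict (insertion) order.
conn = sqlite3.connect(":memory:")
for name, query in migrations.items():
    conn.executescript(query)

tables = [
    row[0]
    for row in conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
]
```

Both entries stay in the dict, so a database that has already applied the first migration will pick up the second one on the next run.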


### Call `apply_migrations` function

1. Import migrations in `mephisto/abstractions/databases/local_database.py`:
```python
...
from .migrations import migrations
...
```
2. Apply migrations in `LocalMephistoDB`:
```python
class LocalMephistoDB(MephistoDB):
    ...
    def init_tables(self) -> None:
        with self.table_access_condition:
            conn = self.get_connection()
            conn.execute("PRAGMA foreign_keys = on;")

            with conn:
                c = conn.cursor()
                c.execute(tables.CREATE_IF_NOT_EXISTS_PROJECTS_TABLE)
                ...

            apply_migrations(self, migrations)
            ...

            with conn:
                c.executescript(tables.CREATE_IF_NOT_EXISTS_CORE_INDICES)
                ...
```
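
To see why indices are created only after migrations, consider the following standalone sketch. It assumes a hypothetical migration that rebuilds a table (a common SQLite pattern for schema changes); the table, index, and constant names are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE tasks (task_id INTEGER PRIMARY KEY, task_name TEXT);
    CREATE INDEX tasks_name_idx ON tasks (task_name);
    """
)

# Hypothetical migration that rebuilds the table to change its schema.
REBUILD_TASKS_TABLE = """
CREATE TABLE tasks_new (task_id INTEGER PRIMARY KEY, task_name TEXT NOT NULL);
INSERT INTO tasks_new SELECT task_id, task_name FROM tasks;
DROP TABLE tasks;
ALTER TABLE tasks_new RENAME TO tasks;
"""


def index_names(connection: sqlite3.Connection) -> list:
    """List the names of indices currently attached to the `tasks` table."""
    rows = connection.execute(
        "SELECT name FROM sqlite_master WHERE type = 'index' AND tbl_name = 'tasks'"
    )
    return [row[0] for row in rows]


indices_before = index_names(conn)
conn.executescript(REBUILD_TASKS_TABLE)
# Dropping the old table also dropped its index.
indices_after_migration = index_names(conn)

# Creating indices after migrations (with IF NOT EXISTS) restores them.
conn.execute("CREATE INDEX IF NOT EXISTS tasks_name_idx ON tasks (task_name)")
indices_after_init = index_names(conn)
```

Had the index been created before the migration ran, it would be silently missing afterwards, which is the situation the NOTE in the overview warns about.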

## Maintenance of related code

Making changes in databases must be carefully thought through and tested.

This is a list of places that will most likely need to be synced with your DB change:

1. All queries (involving tables that you have updated) in database class, e.g. `LocalMephistoDB`
2. Module with common database queries `mephisto/utils/db.py`
3. Queries in __Review App__ (`mephisto/review_app/server`) - it has its own set of specific queries
4. Names/relationships for tables and columns in __DBDataPorter__ (they're hardcoded in many places there),
within Mephisto DB and provider-specific databases. For instance:
- `mephisto/tools/db_data_porter/constants.py`
- `mephisto/tools/db_data_porter/import_dump.py`
- ...
5. Data processing within Mephisto itself (obviously)

While we did our best to abstract away the particular structure of tables and fields,
they still have to be spelled out in some places.
Please run tests and manually check all Mephisto applications after making database changes.
2 changes: 1 addition & 1 deletion docs/web/docs/guides/how_to_contribute/documentation.md
@@ -4,7 +4,7 @@
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.

-sidebar_position: 4
+sidebar_position: 5
---

# Updating documentation
7 changes: 7 additions & 0 deletions docs/web/docs/guides/how_to_use/data_porter/_category_.yml
@@ -0,0 +1,7 @@
# Copyright (c) Meta Platforms and its affiliates.
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.

label: "Move data around"
collapsed: false
position: 9
@@ -0,0 +1,42 @@
---

# Copyright (c) Meta Platforms and its affiliates.
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.

sidebar_position: 3
---

# Custom conflict resolver

When importing dump data into the local DB, some rows may refer to the same object
(e.g. two Task rows with the same value of the "name" column). The default resolver class contains the logic
to resolve such merging conflicts (implemented for all currently present DBs).

To change this default behavior, you can write your own conflict resolver class:
1. Add a new Python module next to this module (e.g. `my_conflict_resolver`)

2. This module must contain a class (e.g. `MyMergeConflictResolver`)
that inherits from either `BaseMergeConflictResolver`
or default resolver `DefaultMergeConflictResolver` (also in this directory)
```python
from .base_merge_conflict_resolver import BaseMergeConflictResolver

class MyMergeConflictResolver(BaseMergeConflictResolver):
    default_strategy_name = "..."
    strategies_config = {...}
```

3. To use this newly created class, specify its name in import command:
`mephisto db import ... --conflict-resolver MyMergeConflictResolver`

The easiest place to start customization is to modify `strategies_config` property,
and perhaps `default_strategy_name` value (see `DefaultMergeConflictResolver` as an example).

NOTE: All available providers must be present in `strategies_config`.
Table names (under each provider key) are optional, and if missing, `default_strategy_name`
will be used for all conflicts related to this table.

4. There is an example of a working custom conflict resolver in module `mephisto/tools/db_data_porter/conflict_resolvers/example_merge_conflict_resolver.py`. You can launch it like this:

`mephisto db import ... --conflict-resolver ExampleMergeConflictResolver`
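
The fallback rule from the NOTE above (a table missing under a provider key falls back to `default_strategy_name`) can be sketched as a plain lookup. This is only an illustration: the provider keys, table names, strategy names, and the `pick_strategy` helper are all hypothetical, and the real value format of `strategies_config` may differ (see `DefaultMergeConflictResolver` for the actual structure).

```python
from typing import Dict

# Hypothetical values: provider keys, table names and strategy names are
# placeholders, not the ones shipped with Mephisto.
default_strategy_name = "keep_local_row"
strategies_config: Dict[str, Dict[str, str]] = {
    "mephisto": {"tasks": "merge_rows"},
    "mock": {},  # provider is present, but no table-specific strategies
}


def pick_strategy(provider: str, table: str) -> str:
    """Return the strategy configured for a table, or the default one."""
    if provider not in strategies_config:
        # Mirrors the rule that every available provider must be listed.
        raise KeyError(f"Provider '{provider}' is missing from strategies_config")
    return strategies_config[provider].get(table, default_strategy_name)


configured = pick_strategy("mephisto", "tasks")
fallback = pick_strategy("mock", "workers")
```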
133 changes: 133 additions & 0 deletions docs/web/docs/guides/how_to_use/data_porter/reference.md
@@ -0,0 +1,133 @@
---

# Copyright (c) Meta Platforms and its affiliates.
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.

sidebar_position: 2
---

# Reference

This reference describes the set of commands under the `mephisto db` command group.

## Export

This command exports data from Mephisto DB and provider-specific datastores
as an archived combination of (i) a JSON file, and (ii) a `data` catalog with related files.

If no parameters are passed, a full data dump (i.e. backup) will be created.

To pass a list of values for one command option, simply repeat that option name before each value.

Examples:
```
mephisto db export
mephisto db export --verbosity
mephisto db export --export-tasks-by-names "My first Task"
mephisto db export --export-tasks-by-ids 1 --export-tasks-by-ids 2
mephisto db export --export-task-runs-by-ids 3 --export-task-runs-by-ids 4
mephisto db export --export-task-runs-since-date 2024-01-01
mephisto db export --export-task-runs-since-date 2023-01-01T00:00:00
mephisto db export --labels first_dump --labels second_dump
mephisto db export --export-tasks-by-ids 1 --delete-exported-data --randomize-legacy-ids --export-indent 4
```

Options (all optional):

- `-tn/--export-tasks-by-names` - names of Tasks that will be exported
- `-ti/--export-tasks-by-ids` - ids of Tasks that will be exported
- `-tri/--export-task-runs-by-ids` - ids of TaskRuns that will be exported
- `-trs/--export-task-runs-since-date` - only objects created after this ISO8601 datetime will be exported
- `-l/--labels` - only data imported under these labels will be exported
- `-del/--delete-exported-data` - after exporting data, delete it from local DB
- `-r/--randomize-legacy-ids` - replace legacy autoincremented ids with
new pseudo-random ids to avoid conflicts during data merging
- `-i/--export-indent` - make dump easy to read via formatting JSON with indentations (Default 2)
- `-v/--verbosity` - write more informative messages about progress (Default 0. Values: 0, 1)

Note that the following options cannot be used together:
`--export-tasks-by-names`, `--export-tasks-by-ids`, `--export-task-runs-by-ids`, `--export-task-runs-since-date`, `--labels`.
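
When calling the export command from a script, the two rules above (repeat the option name for each value, and use at most one of the selection options) can be enforced while assembling the argument list. The sketch below uses only option names listed in this section; the helper `build_export_command` itself is hypothetical and not part of Mephisto.

```python
from datetime import datetime
from typing import List, Optional, Sequence


def build_export_command(
    task_names: Sequence[str] = (),
    task_ids: Sequence[int] = (),
    task_run_ids: Sequence[int] = (),
    since: Optional[datetime] = None,
    labels: Sequence[str] = (),
) -> List[str]:
    """Build an argument list for `mephisto db export` (hypothetical helper)."""
    selectors = {
        "--export-tasks-by-names": [str(v) for v in task_names],
        "--export-tasks-by-ids": [str(v) for v in task_ids],
        "--export-task-runs-by-ids": [str(v) for v in task_run_ids],
        # ISO8601 datetime, e.g. 2023-01-01T00:00:00
        "--export-task-runs-since-date": [since.isoformat()] if since else [],
        "--labels": [str(v) for v in labels],
    }
    used = [option for option, values in selectors.items() if values]
    if len(used) > 1:
        raise ValueError(f"These options cannot be used together: {', '.join(used)}")

    command = ["mephisto", "db", "export"]
    for option in used:
        for value in selectors[option]:
            # A list of values is passed by repeating the option name.
            command += [option, value]
    return command


by_ids = build_export_command(task_ids=[1, 2])
by_date = build_export_command(since=datetime(2023, 1, 1))
```

The resulting list can be handed to `subprocess.run` as is, which avoids shell quoting issues with Task names that contain spaces.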


## Import

This command imports data from a dump file created by `mephisto db export` command.

Examples:
```
mephisto db import --file <dump_file_name_or_path>
mephisto db import --file 2024_01_01_00_00_01_mephisto_dump.json --verbosity
mephisto db import --file 2024_01_01_00_00_01_mephisto_dump.json --labels my_first_dump
mephisto db import --file 2024_01_01_00_00_01_mephisto_dump.json --conflict-resolver MyCustomMergeConflictResolver
mephisto db import --file 2024_01_01_00_00_01_mephisto_dump.json --keep-import-metadata
```

Options:
- `-f/--file` - location of the `***.zip` dump file (filename if created in
`<MEPHISTO_REPO>/outputs/export` folder, or absolute filepath)
- `-cr/--conflict-resolver` (Optional) - name of Python class to be used for resolving merging conflicts
(when your local DB already has a row with same unique field value as a DB row in the dump data)
- `-l/--labels` - one or more short strings serving as a reference for the ported data (stored in `imported_data` table),
so later you can export the imported data with `--labels` export option
- `-k/--keep-import-metadata` - write data from `imported_data` table of the dump (by default it's not imported)
- `-v/--verbosity` - level of logging (default: 0; values: 0, 1)

Note that before every import we create a full snapshot copy of your local data, by
archiving the content of your `data` directory. If any data gets corrupted during the import,
you can always return to the original state by replacing your `data` folder with the snapshot.
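
The snapshot-and-rollback idea can be sketched with the standard library alone. This is not the Data Porter's own code: the function names and paths are hypothetical, and the demo runs on a throwaway temporary directory rather than a real Mephisto `data` folder.

```python
import shutil
import tempfile
from pathlib import Path


def snapshot_data_dir(data_dir: Path, snapshot_base: Path) -> Path:
    """Archive `data_dir` into `<snapshot_base>.zip` and return the archive path."""
    archive = shutil.make_archive(str(snapshot_base), "zip", root_dir=str(data_dir))
    return Path(archive)


def restore_data_dir(snapshot: Path, data_dir: Path) -> None:
    """Replace `data_dir` with the contents of the snapshot archive."""
    shutil.rmtree(data_dir, ignore_errors=True)
    shutil.unpack_archive(str(snapshot), extract_dir=str(data_dir))


# Demo on a temporary directory (no real Mephisto data is touched).
workdir = Path(tempfile.mkdtemp())
data_dir = workdir / "data"
data_dir.mkdir()
(data_dir / "database.db").write_text("original")

snapshot = snapshot_data_dir(data_dir, workdir / "snapshot")
(data_dir / "database.db").write_text("corrupted")  # simulate a bad import
restore_data_dir(snapshot, data_dir)
restored_text = (data_dir / "database.db").read_text()
```

The snapshot is written outside the directory being archived, so the archive never ends up containing itself.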

## Backup

Creates a full backup of all current data (Mephisto DB, provider-specific datastores, and related files) on the local machine.

```
mephisto db backup
```


## Restore

Restores all data (Mephisto DB, provider-specific datastores, and related files) from a backup archive.

Note that it will erase all current data, and you may want to run command `mephisto db backup` beforehand.

Examples:
```
mephisto db restore --file <backup_file_name_or_path>
mephisto db restore --file 2024_01_01_00_10_01.zip
```

Options:
- `-f/--file` - location of the `***.zip` backup file (filename if created in
`<MEPHISTO_REPO>/outputs/backup` folder, or absolute filepath)
- `-v/--verbosity` - level of logging (default: 0; values: 0, 1)


## Important notes

### Data dump vs backup

Mephisto stores local data in `outputs` and `data` folders. The safest way to back Mephisto up is to create a copy of the `data` folder - and that's what a Mephisto backup contains.

On the other hand, partial data export is written into a data dump that contains:

- a JSON file representing relevant data entries from DB tables
- a folder with all files related to the exported data entries

With the export command, you **can** create a dump of the entire data as well, and here's when it's useful:
- Use `mephisto db backup` as the safest option when you only intend to restore this data later, replacing whatever data exists at that point
- Use `mephisto db export` to dump complete data from a small Mephisto project, so it can be imported into a larger Mephisto project later.


### Legacy PKs

Prior to release `v1.4` of Mephisto, its DB schemas used auto-incremented integer primary keys. While convenient for debugging, it causes problems during data import/export.

As of `v1.4` we have replaced these "legacy" PKs with quasi-random integers (for backward compatibility their values are designed to be above 1,000,000).

If you do wish to use import/export commands with your "legacy" data, include the `--randomize-legacy-ids` option. It prevents data corruption when merging 2 sets of "legacy" data (because they will contain same integer PKs `1, 2, 3,...` for completely unrelated objects).

This handling of legacy PKs ensures that the Data Porter feature is backward compatible, and will work with your existing Mephisto data.
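
To make the idea behind `--randomize-legacy-ids` concrete, here is a rough sketch of remapping legacy PKs and the foreign keys that point at them. It is not the Data Porter's implementation: the function name, the row format, and the upper bound of the random range are assumptions; only the 1,000,000 threshold comes from the text above.

```python
import random
from typing import Dict, List

LEGACY_ID_CEILING = 1_000_000  # from the text: new ids are designed to be above this
MAX_ID = 2**53  # assumed upper bound, chosen only for this example


def randomize_legacy_ids(rows: List[dict], pk: str, rng: random.Random) -> Dict[int, int]:
    """Replace legacy PKs in place and return the old -> new id mapping."""
    used = {row[pk] for row in rows if row[pk] > LEGACY_ID_CEILING}
    mapping: Dict[int, int] = {}
    for row in rows:
        old_id = row[pk]
        if old_id > LEGACY_ID_CEILING:
            continue  # already a new-style id, leave it untouched
        new_id = rng.randrange(LEGACY_ID_CEILING + 1, MAX_ID)
        while new_id in used:
            new_id = rng.randrange(LEGACY_ID_CEILING + 1, MAX_ID)
        used.add(new_id)
        mapping[old_id] = new_id
        row[pk] = new_id
    return mapping


# Two hypothetical tables from a "legacy" dump, linked by `task_id`.
tasks = [{"task_id": 1, "task_name": "first"}, {"task_id": 2, "task_name": "second"}]
task_runs = [{"task_run_id": 1, "task_id": 2}]

task_id_mapping = randomize_legacy_ids(tasks, "task_id", random.Random())
# Foreign keys must follow the same mapping, otherwise relationships break.
for run in task_runs:
    run["task_id"] = task_id_mapping.get(run["task_id"], run["task_id"])
```

After remapping, two dumps that both contained Tasks `1, 2, 3,...` are very unlikely to collide on PKs when merged, which is the kind of conflict the option is meant to avoid.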