Merge pull request #1153 from facebookresearch/data-porter-feature

Add Data Porter feature
facebookresearch · May 6, 2024 · f58ca7c
2 parents 5dbe02d + 09bcff1

Showing 72 changed files with 7,306 additions and 940 deletions.
diff --git a/docs/web/docs/guides/how_to_contribute/db_migrations.md b/docs/web/docs/guides/how_to_contribute/db_migrations.md
@@ -0,0 +1,117 @@
+---
+
+# Copyright (c) Meta Platforms and its affiliates.
+# This source code is licensed under the MIT license found in the
+# LICENSE file in the root directory of this source tree.
+
+sidebar_position: 4
+---
+
+# Database migrations
+
+## Overview
+
+Currently we do not use any special framework for updating the Mephisto database or provider-specific datastores.
+This is how it's done:
+
+1. Each database should have a `migrations` table where we store all applied or failed migrations
+2. Every run of any Mephisto command will automatically attempt to apply unapplied migrations
+3. Each migration is a Python module that contains one constant (a raw SQL query string)
+4. After adding a migration, its constant must be imported and added to the migrations dict
+   under a readable name (dict key) that will be used in the `migrations` table
+5. Any database implementation must call the `apply_migrations` function in its `init_tables` method (after creating all tables).
+   NOTE: Migrations must be applied before creating DB indices, as migrations may erase them without restoring them.
+6. When a migration fails, you will see a log message in the console.
+   The error will also be written to the `migrations` table in the `error_message` column, with status `"errored"`
+
+## Details
+
+Let's see how exactly DB migrations should be created.
+
+We'll use the Mephisto DB as an example; the same set of steps is used for
+provider-specific databases.
+
+### Add migration package
+
+To add a new migration package, follow these steps:
+
+1. Create a Python package `migrations` next to `mephisto/abstractions/databases/local_database.py`.
+2. Create a migration module in that package, e.g. `_001_20240101_add__column_name__in__table_name.py`.
+   Note the leading underscore: a Python `import` statement cannot reference a module whose name starts with a digit.
+3. Populate the module with a SQL query constant:
+    ```python
+    # <copyright notice>
+
+    """
+    This migration introduces the following changes:
+    - ...
+    """
+
+    MY_SQL_MIGRATION_QUERY_NAME = """
+        <SQL query>
+    """
+    ```
+4. Include this SQL query constant in the `__init__.py` module (located next to the migration module):
+    ```python
+    # <copyright notice>
+    from ._001_20240101_add__column_name__in__table_name import *
+
+
+    migrations = {
+        "20240101_add__column_name__in__table_name": MY_SQL_MIGRATION_QUERY_NAME,
+    }
+    ```
+
+5. Note that for now we support only forward migrations.
+If you do need a backward migration, simply add it as a forward migration that would undo the undesired changes.
+
+
+### Call `apply_migrations` function
+
+1. Import migrations in `mephisto/abstractions/databases/local_database.py`:
+    ```python
+    ...
+    from .migrations import migrations
+    ...
+    ```
+2. Apply migrations in `LocalMephistoDB`:
+    ```python
+    class LocalMephistoDB(MephistoDB):
+        ...
+        def init_tables(self) -> None:
+            with self.table_access_condition:
+                conn = self.get_connection()
+                conn.execute("PRAGMA foreign_keys = on;")
+
+                with conn:
+                    c = conn.cursor()
+                    c.execute(tables.CREATE_IF_NOT_EXISTS_PROJECTS_TABLE)
+                    ...
+
+                apply_migrations(self, migrations)
+                ...
+
+                with conn:
+                    c.executescript(tables.CREATE_IF_NOT_EXISTS_CORE_INDICES)
+                ...
+    ```
+
+## Maintenance of related code
+
+Making changes in databases must be carefully thought through and tested.
+
+This is a list of places that will most likely need to be synced with your DB change:
+
+1. All queries (involving tables that you have updated) in database class, e.g. `LocalMephistoDB`
+2. Module with common database queries `mephisto/utils/db.py`
+3. Queries in __Review App__ (`mephisto/review_app/server`) - it has its own set of specific queries
+4. Names/relationships for tables and columns in __DBDataPorter__ (they're hardcoded in many places there),
+   within Mephisto DB and provider-specific databases. For instance:
+      - `mephisto/tools/db_data_porter/constants.py`
+      - `mephisto/tools/db_data_porter/import_dump.py`
+      - ...
+5. Data processing within Mephisto itself (obviously)
+
+While we did our best to abstract away the structure of particular tables and fields,
+it still has to be spelled out in some places.
+Please run the tests and manually check all Mephisto applications after making database changes.
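As an illustration of the flow described in `db_migrations.md` above (a `migrations` table, automatic application of unapplied migrations, and an `"errored"` status with an `error_message`), here is a minimal standalone sketch using the standard-library `sqlite3` module. It is not Mephisto's `apply_migrations` implementation: the function name `apply_migrations_sketch`, the table layout, the `"completed"` status value, and the sample `tasks` table are all assumptions made for the example.

```python
import sqlite3

# Hypothetical schema: only the `error_message` column and the "errored"
# status come from the guide; the rest is assumed for illustration.
CREATE_MIGRATIONS_TABLE = """
    CREATE TABLE IF NOT EXISTS migrations (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        name TEXT NOT NULL,
        status TEXT NOT NULL,
        error_message TEXT
    );
"""


def apply_migrations_sketch(conn, migrations):
    """Run every migration that has not completed yet and record the outcome."""
    conn.execute(CREATE_MIGRATIONS_TABLE)
    applied = {
        row[0]
        for row in conn.execute(
            "SELECT name FROM migrations WHERE status = 'completed'"
        )
    }
    for name, query in migrations.items():
        if name in applied:
            continue  # already applied on an earlier run
        try:
            conn.executescript(query)
            status, error = "completed", None
        except sqlite3.Error as exc:
            # Failures are logged and stored; they are retried on the next run.
            status, error = "errored", str(exc)
            print(f"Migration '{name}' failed: {exc}")
        conn.execute(
            "INSERT INTO migrations (name, status, error_message) VALUES (?, ?, ?)",
            (name, status, error),
        )
        conn.commit()


# Demo on an in-memory database with a made-up `tasks` table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tasks (id INTEGER PRIMARY KEY, name TEXT)")
migrations = {
    "20240101_add__notes__in__tasks": "ALTER TABLE tasks ADD COLUMN notes TEXT;",
    "20240102_broken_example": "ALTER TABLE missing_table ADD COLUMN x TEXT;",
}
apply_migrations_sketch(conn, migrations)
apply_migrations_sketch(conn, migrations)  # second run skips the applied migration
```

In this sketch a failed migration is attempted again on every run and adds one more `"errored"` row each time; whether the real implementation retries or deduplicates such rows is not stated in the guide.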
diff --git a/docs/web/docs/guides/how_to_contribute/documentation.md b/docs/web/docs/guides/how_to_contribute/documentation.md
@@ -4,7 +4,7 @@
 # This source code is licensed under the MIT license found in the
 # LICENSE file in the root directory of this source tree.
 
-sidebar_position: 4
+sidebar_position: 5
 ---
 
 # Updating documentation

diff --git a/docs/web/docs/guides/how_to_use/data_porter/_category_.yml b/docs/web/docs/guides/how_to_use/data_porter/_category_.yml
@@ -0,0 +1,7 @@
+# Copyright (c) Meta Platforms and its affiliates.
+# This source code is licensed under the MIT license found in the
+# LICENSE file in the root directory of this source tree.
+
+label: "Move data around"
+collapsed: false
+position: 9
diff --git a/docs/web/docs/guides/how_to_use/data_porter/custom_conflict_resolver.md b/docs/web/docs/guides/how_to_use/data_porter/custom_conflict_resolver.md
@@ -0,0 +1,42 @@
+---
+
+# Copyright (c) Meta Platforms and its affiliates.
+# This source code is licensed under the MIT license found in the
+# LICENSE file in the root directory of this source tree.
+
+sidebar_position: 3
+---
+
+# Custom conflict resolver
+
+When importing dump data into the local DB, some rows may refer to the same object
+(e.g. two Task rows with the same value in the "name" column). The `DefaultMergeConflictResolver` class
+contains default logic to resolve such merging conflicts (implemented for all currently present DBs).
+
+To change this default behavior, you can write your own conflict resolver class:
+1. Add a new Python module next to the default resolver's module (e.g. `my_conflict_resolver`)
+
+2. This module must contain a class (e.g. `MyMergeConflictResolver`)
+    that inherits from either `BaseMergeConflictResolver`
+    or default resolver `DefaultMergeConflictResolver` (also in this directory)
+    ```python
+    from .base_merge_conflict_resolver import BaseMergeConflictResolver
+
+    class MyMergeConflictResolver(BaseMergeConflictResolver):
+        default_strategy_name = "..."
+        strategies_config = {...}
+    ```
+
+3. To use this newly created class, specify its name in import command:
+    `mephisto db import ... --conflict-resolver MyMergeConflictResolver`
+
+The easiest way to start customizing is to modify the `strategies_config` property,
+and perhaps the `default_strategy_name` value (see `DefaultMergeConflictResolver` as an example).
+
+NOTE: All available providers must be present in `strategies_config`.
+Table names (under each provider key) are optional, and if missing, `default_strategy_name`
+will be used for all conflicts related to this table.
+
+4. There is an example of a working custom conflict resolver in module `mephisto/tools/db_data_porter/conflict_resolvers/example_merge_conflict_resolver.py`. You can launch it like this:
+
+`mephisto db import ... --conflict-resolver ExampleMergeConflictResolver`
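The fallback rule stated in `custom_conflict_resolver.md` above (every provider must appear in `strategies_config`, while table entries are optional and fall back to `default_strategy_name`) can be sketched in isolation. The class below is a stand-in and does not inherit from Mephisto's `BaseMergeConflictResolver`; the provider keys, table names, strategy names, and the `strategy_for` helper are hypothetical, and the real `strategies_config` values may have a richer shape than plain strings.

```python
class SketchMergeConflictResolver:
    """Stand-in resolver; NOT Mephisto's BaseMergeConflictResolver."""

    # Hypothetical strategy names, used only to show the lookup.
    default_strategy_name = "keep_local_row"

    # Every provider has a key; tables under a provider are optional.
    strategies_config = {
        "mephisto": {"tasks": "keep_older_row"},
        "mock": {},
    }

    def strategy_for(self, provider, table):
        """Return the strategy configured for a table, or the default one."""
        if provider not in self.strategies_config:
            # Mirrors the note that all providers must be present.
            raise KeyError(f"Provider '{provider}' is missing from strategies_config")
        return self.strategies_config[provider].get(table, self.default_strategy_name)


resolver = SketchMergeConflictResolver()
tasks_strategy = resolver.strategy_for("mephisto", "tasks")
units_strategy = resolver.strategy_for("mephisto", "units")  # falls back to the default
```

Keeping the default in one attribute means a subclass can change the behavior for all unlisted tables by overriding a single value.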
diff --git a/docs/web/docs/guides/how_to_use/data_porter/reference.md b/docs/web/docs/guides/how_to_use/data_porter/reference.md
@@ -0,0 +1,133 @@
+---
+
+# Copyright (c) Meta Platforms and its affiliates.
+# This source code is licensed under the MIT license found in the
+# LICENSE file in the root directory of this source tree.
+
+sidebar_position: 2
+---
+
+# Reference
+
+This is a reference describing the set of commands under the `mephisto db` command group.
+
+## Export
+
+This command exports data from the Mephisto DB and provider-specific datastores
+as an archived combination of (i) a JSON file, and (ii) a `data` directory with related files.
+
+If no parameters are passed, a full data dump (i.e. backup) will be created.
+
+To pass a list of values for one command option, simply repeat that option name before each value.
+
+Examples:
+```
+mephisto db export
+mephisto db export --verbosity
+mephisto db export --export-tasks-by-names "My first Task"
+mephisto db export --export-tasks-by-ids 1 --export-tasks-by-ids 2
+mephisto db export --export-task-runs-by-ids 3 --export-task-runs-by-ids 4
+mephisto db export --export-task-runs-since-date 2024-01-01
+mephisto db export --export-task-runs-since-date 2023-01-01T00:00:00
+mephisto db export --labels first_dump --labels second_dump
+mephisto db export --export-tasks-by-ids 1 --delete-exported-data --randomize-legacy-ids --export-indent 4
+```
+
+Options (all optional):
+
+- `-tn/--export-tasks-by-names` - names of Tasks that will be exported
+- `-ti/--export-tasks-by-ids` - ids of Tasks that will be exported
+- `-tri/--export-task-runs-by-ids` - ids of TaskRuns that will be exported
+- `-trs/--export-task-runs-since-date` - only objects created after this ISO8601 datetime will be exported
+- `-l/--labels` - only data imported under these labels will be exported
+- `-del/--delete-exported-data` - after exporting data, delete it from local DB
+- `-r/--randomize-legacy-ids` - replace legacy autoincremented ids with
+        new pseudo-random ids to avoid conflicts during data merging
+- `-i/--export-indent` - make the dump easy to read by formatting JSON with indentation (default: 2)
+- `-v/--verbosity` - write more informative messages about progress (default: 0; values: 0, 1)
+
+Note that the following options cannot be used together:
+`--export-tasks-by-names`, `--export-tasks-by-ids`, `--export-task-runs-by-ids`, `--export-task-runs-since-date`, `--labels`.
+
+
+## Import
+
+This command imports data from a dump file created by the `mephisto db export` command.
+
+Examples:
+```
+mephisto db import --file <dump_file_name_or_path>
+
+mephisto db import --file 2024_01_01_00_00_01_mephisto_dump.json --verbosity
+mephisto db import --file 2024_01_01_00_00_01_mephisto_dump.json --labels my_first_dump
+mephisto db import --file 2024_01_01_00_00_01_mephisto_dump.json --conflict-resolver MyCustomMergeConflictResolver
+mephisto db import --file 2024_01_01_00_00_01_mephisto_dump.json --keep-import-metadata
+```
+
+Options:
+- `-f/--file` - location of the `***.zip` dump file (filename if created in
+    `<MEPHISTO_REPO>/outputs/export` folder, or absolute filepath)
+- `-cr/--conflict-resolver` (Optional) - name of Python class to be used for resolving merging conflicts
+    (when your local DB already has a row with same unique field value as a DB row in the dump data)
+- `-l/--labels` - one or more short strings serving as a reference for the ported data (stored in `imported_data` table),
+    so later you can export the imported data with `--labels` export option
+- `-k/--keep-import-metadata` - write data from `imported_data` table of the dump (by default it's not imported)
+- `-v/--verbosity` - level of logging (default: 0; values: 0, 1)
+
+Note that before every import we create a full snapshot copy of your local data, by
+archiving the content of your `data` directory. If any data gets corrupted during the import,
+you can always return to the original state by replacing your `data` folder with the snapshot.
+
+## Backup
+
+Creates a full backup of all current data (Mephisto DB, provider-specific datastores, and related files) on the local machine.
+
+```
+mephisto db backup
+```
+
+
+## Restore
+
+Restores all data (Mephisto DB, provider-specific datastores, and related files) from a backup archive.
+
+Note that it will erase all current data, and you may want to run command `mephisto db backup` beforehand.
+
+Examples:
+```
+mephisto db restore --file <backup_file_name_or_path>
+
+mephisto db restore --file 2024_01_01_00_10_01.zip
+```
+
+Options:
+- `-f/--file` - location of the `***.zip` backup file (filename if created in
+    `<MEPHISTO_REPO>/outputs/backup` folder, or absolute filepath)
+- `-v/--verbosity` - level of logging (default: 0; values: 0, 1)
+
+
+## Important notes
+
+### Data dump vs backup
+
+Mephisto stores local data in `outputs` and `data` folders. The safest way to back Mephisto up is to create a copy of the `data` folder - and that's what a Mephisto backup contains.
+
+On the other hand, partial data export is written into a data dump that contains:
+
+- a JSON file representing relevant data entries from DB tables
+- a folder with all files related to the exported data entries
+
+With the export command, you **can** create a dump of the entire data as well. Here is when each option is useful:
+- Use `mephisto db backup` as the safest option, when you only intend to restore this data later in place of whatever data exists at that time.
+- Use `mephisto db export` to dump complete data from a small Mephisto project, so it can be imported into a larger Mephisto project later.
+
+
+### Legacy PKs
+
+Prior to release `v1.4` of Mephisto, its DB schemas used auto-incremented integer primary keys. While convenient for debugging, they cause problems during data import/export.
+
+As of `v1.4` we have replaced these "legacy" PKs with quasi-random integers (for backward compatibility their values are designed to be above 1,000,000).
+
+If you do wish to use import/export commands with your "legacy" data, include the `--randomize-legacy-ids` option. It prevents data corruption when merging 2 sets of "legacy" data (because they will contain same integer PKs `1, 2, 3,...` for completely unrelated objects).
+
+This handling of legacy PKs ensures that the Data Porter feature is backward compatible, and will work with your existing Mephisto data.
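To make the "Legacy PKs" note in `reference.md` above more concrete, the sketch below remaps auto-incremented ids to pseudo-random integers above 1,000,000 and rewrites foreign keys with the same mapping, so references between rows survive. It only illustrates the idea and is not the Data Porter code behind `--randomize-legacy-ids`: the helper `build_id_mapping`, the upper bound of `2**31`, and the sample rows and column names are assumptions.

```python
import random

# Per the note above, new ids are designed to be above this value.
LEGACY_ID_CEILING = 1_000_000
# Assumed upper bound for the example; the real range is not documented here.
MAX_NEW_ID = 2**31


def build_id_mapping(legacy_ids, rng):
    """Map each legacy id to a unique pseudo-random id above the ceiling."""
    mapping = {}
    used = set()
    for old_id in legacy_ids:
        new_id = rng.randrange(LEGACY_ID_CEILING + 1, MAX_NEW_ID)
        while new_id in used:  # re-draw on the rare collision
            new_id = rng.randrange(LEGACY_ID_CEILING + 1, MAX_NEW_ID)
        used.add(new_id)
        mapping[old_id] = new_id
    return mapping


# Made-up rows: two tasks and one task run that points at the second task.
tasks = [
    {"task_id": 1, "task_name": "first"},
    {"task_id": 2, "task_name": "second"},
]
task_runs = [{"task_run_id": 1, "task_id": 2}]

rng = random.Random(0)
task_id_map = build_id_mapping([t["task_id"] for t in tasks], rng)
run_id_map = build_id_mapping([r["task_run_id"] for r in task_runs], rng)

# Primary keys and foreign keys are rewritten with the same mappings.
new_tasks = [{**t, "task_id": task_id_map[t["task_id"]]} for t in tasks]
new_task_runs = [
    {
        **r,
        "task_run_id": run_id_map[r["task_run_id"]],
        "task_id": task_id_map[r["task_id"]],
    }
    for r in task_runs
]
```

After such a remap, two dumps that both started from ids `1, 2, 3, ...` are very unlikely to share a primary key, which is the conflict the option is described as preventing.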