Unittest coverage for Data Porter feature
meta-paul committed May 6, 2024
1 parent fe83237 commit 09bcff1
Showing 31 changed files with 1,477 additions and 215 deletions.
@@ -2,6 +2,6 @@
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.

label: "Merge databases"
label: "Move data around"
collapsed: false
position: 9
@@ -106,10 +106,28 @@ Options:
- `-v/--verbosity` - level of logging (default: 0; values: 0, 1)


## Note on legacy PKs
## Important notes

### Data dump vs backup

Mephisto stores local data in `outputs` and `data` folders. The safest way to back Mephisto up is to create a copy of the `data` folder - and that's what a Mephisto backup contains.

On the other hand, partial data export is written into a data dump that contains:

- a JSON file representing relevant data entries from DB tables
- a folder with all files related to the exported data entries

With the export command, you **can** also create a dump of the entire dataset. Which command to choose depends on your goal:
- Use `mephisto db backup` as the safest option, when you only intend to restore this data later in place of whatever data exists at that point.
- Use `mephisto db export` to dump the complete data of a small Mephisto project, so it can be imported into a larger Mephisto project later.


### Legacy PKs

Prior to release `v1.4` of Mephisto, its DB schemas used auto-incremented integer primary keys. While convenient for debugging, these keys cause problems during data import/export.

As of `v1.4` we have replaced these "legacy" PKs with quasi-random integers (for backward compatibility, their values are designed to be above 1,000,000).

If you do wish to use import/export commands with your "legacy" data, include the `--randomize-legacy-ids` option. It prevents data corruption when merging two sets of "legacy" data (because they will contain the same integer PKs `1, 2, 3,...` for completely unrelated objects).
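
To illustrate the collision described above, here is a minimal sketch using two in-memory SQLite databases. The `tasks` table and its columns are illustrative assumptions, not Mephisto's actual schema; the point is only that two independently created sets of auto-incremented PKs overlap.

```python
import sqlite3


def make_legacy_db(task_names):
    """Create an in-memory DB with auto-incremented PKs (illustrative schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE tasks (task_id INTEGER PRIMARY KEY AUTOINCREMENT, task_name TEXT)"
    )
    conn.executemany(
        "INSERT INTO tasks (task_name) VALUES (?)", [(name,) for name in task_names]
    )
    conn.commit()
    return conn


# Two unrelated "legacy" datasets; both get PKs 1 and 2.
db_a = make_legacy_db(["Task A1", "Task A2"])
db_b = make_legacy_db(["Task B1", "Task B2"])

# Naively copying rows from B into A, keeping their PKs, hits existing keys.
collisions = []
for task_id, task_name in db_b.execute("SELECT task_id, task_name FROM tasks").fetchall():
    try:
        db_a.execute(
            "INSERT INTO tasks (task_id, task_name) VALUES (?, ?)", (task_id, task_name)
        )
    except sqlite3.IntegrityError:
        collisions.append(task_id)

print(collisions)  # [1, 2]
```

Every row from the second dataset is rejected here. A merge that overwrote rows instead of rejecting them would silently attach unrelated objects to each other, which is the corruption the option is meant to prevent.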

This handling of legacy PKs ensures that the Data Porter feature is backward compatible and will work with your existing Mephisto data.
89 changes: 89 additions & 0 deletions docs/web/docs/guides/how_to_use/data_porter/simple_usage.md
@@ -0,0 +1,89 @@
---

# Copyright (c) Meta Platforms and its affiliates.
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.

sidebar_position: 1
---

# Simple usage


## Introduction

Sometimes you may want to run Mephisto on remote server(s) for data collection, since that stage takes a while.
The Data Porter feature lets you move around the data collected by different Mephisto instances, for ease of review and record keeping.

Data Porter can do the following for you:

- Back up your local data
- Restore your local data
- Export part of your local data (into a data dump)
- Import data from a data dump (into your local data)

Before making any changes to data, we recommend creating a backup of your local data
(so you can roll back the changes if anything goes wrong).

---

## Common use scenarios

### Backing up data

The backup command below creates an archive of your local `data` directory
(which contains all data for the project) and places it in the `outputs/backup/` directory:

```shell
mephisto db backup
```
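
Conceptually, this kind of backup amounts to archiving one directory into another. The sketch below shows that idea with the Python standard library only; the function name, the zip format, and the timestamped file name are assumptions made for illustration, not a description of how `mephisto db backup` is implemented.

```python
import shutil
import tempfile
from datetime import datetime
from pathlib import Path


def backup_data_dir(data_dir: Path, backup_dir: Path) -> Path:
    """Archive `data_dir` into a timestamped zip file inside `backup_dir`."""
    backup_dir.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    archive_base = backup_dir / f"backup_{stamp}"  # hypothetical naming scheme
    # make_archive appends the ".zip" suffix and returns the final path.
    return Path(shutil.make_archive(str(archive_base), "zip", root_dir=str(data_dir)))


# Demo on a throwaway project layout.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "data").mkdir()
    (root / "data" / "database.db").write_text("placeholder")
    archive = backup_data_dir(root / "data", root / "outputs" / "backup")
    archive_created = archive.is_file()
    print(archive.name, archive_created)
```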

### Restoring a backup

You can restore a snapshot of your local data from a backup archive (created with the `mephisto db backup` command):

```shell
mephisto db restore --file <FILE_PATH>
```

where `<FILE_PATH>` can be either the full path to a file, or just the filename (if the file is located in the `outputs/backup/` directory).
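
The "full path or bare filename" rule can be sketched as a small helper. The function `resolve_archive_path` is hypothetical and only illustrates the lookup logic described above; it is not taken from the Mephisto codebase.

```python
from pathlib import Path


def resolve_archive_path(file_arg: str, default_dir: Path) -> Path:
    """Hypothetical helper: accept a full path or a bare filename."""
    candidate = Path(file_arg)
    # A bare filename has no directory component, so look it up in the default directory.
    if candidate.name == file_arg and not candidate.is_absolute():
        return default_dir / candidate
    return candidate


backup_dir = Path("outputs") / "backup"
print(resolve_archive_path("my_backup.zip", backup_dir))
print(resolve_archive_path("/tmp/my_backup.zip", backup_dir))
```

The first call resolves inside `outputs/backup/`, while the second is returned unchanged because it already contains a directory component.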

Important notes:

- Your current local data will be erased (to make room for the restored data)
- If the DB schema of the backup is outdated, Mephisto will automatically try to apply all existing migrations when it is launched
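
To give a rough idea of what "applying all existing migrations" can look like, here is a minimal sketch of a runner over a name-to-SQL mapping. The migration names, the `applied_migrations` bookkeeping table, and the function itself are assumptions for illustration; Mephisto's actual migration mechanism may differ.

```python
import sqlite3

# Hypothetical migrations mapping: migration name -> SQL script.
migrations = {
    "0001_create_units": "CREATE TABLE units (unit_id INTEGER PRIMARY KEY, status TEXT);",
    "0002_add_creation_date": "ALTER TABLE units ADD COLUMN creation_date TEXT;",
}


def apply_pending_migrations(conn, migrations):
    """Apply each migration once, in insertion order, recording applied names."""
    conn.execute("CREATE TABLE IF NOT EXISTS applied_migrations (name TEXT PRIMARY KEY)")
    applied = {row[0] for row in conn.execute("SELECT name FROM applied_migrations")}
    newly_applied = []
    for name, script in migrations.items():
        if name in applied:
            continue  # already applied on a previous launch
        conn.executescript(script)
        conn.execute("INSERT INTO applied_migrations (name) VALUES (?)", (name,))
        conn.commit()
        newly_applied.append(name)
    return newly_applied


conn = sqlite3.connect(":memory:")
first_run = apply_pending_migrations(conn, migrations)
second_run = apply_pending_migrations(conn, migrations)
print(first_run)   # ['0001_create_units', '0002_add_creation_date']
print(second_run)  # []
```

Recording applied names makes the runner safe to call on every launch: an up-to-date database is left untouched, while a restored older database catches up.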


### Exporting data

To export all local data (and make it importable later), run:

```shell
mephisto db export
```

To export only part of the data (i.e. from a few selected Task Runs), you have several options for identifying the data to export. The simplest method is to use Task name(s):

```shell
mephisto db export --export-tasks-by-names "My first Task" --export-tasks-by-names "My second Task"
```

This will generate an archive file in the `outputs/export/` directory.

#### Legacy PKs note

If you're exporting "legacy" data entries (i.e. created before May 2024), you should add the `--randomize-legacy-ids` parameter to your export command. This avoids conflicts when importing the dump into a database that also contains "legacy" data.
All this parameter does is replace (within the dump) sequential integer PKs with random integer PKs, while preserving all object relations.
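
The remapping idea can be sketched on a toy dump as follows. The dump layout (table name mapped to a list of row dicts), the convention that a foreign key column shares the name of the PK column it points to, and the `randomize_legacy_ids` function are all assumptions made for this example, not Mephisto's actual dump format or code.

```python
import random

LEGACY_PK_CEILING = 1_000_000  # assumption: legacy PKs lie below this value


def randomize_legacy_ids(dump, pk_columns, rng):
    """Replace sequential PKs with random ones and rewrite matching FK columns."""
    mappings = {}  # PK column name -> {old id: new id}
    used = set()
    for table, pk in pk_columns.items():
        mapping = {}
        for row in dump[table]:
            old = row[pk]
            if old >= LEGACY_PK_CEILING:
                continue  # already a non-legacy id, keep as-is
            new = rng.randrange(LEGACY_PK_CEILING, 2**31)
            while new in used:
                new = rng.randrange(LEGACY_PK_CEILING, 2**31)
            used.add(new)
            mapping[old] = new
        mappings[pk] = mapping
    # Rewrite PKs and every column that shares a PK column's name (treated as an FK).
    return {
        table: [
            {
                col: (mappings[col].get(val, val) if col in mappings else val)
                for col, val in row.items()
            }
            for row in rows
        ]
        for table, rows in dump.items()
    }


dump = {
    "tasks": [{"task_id": 1, "task_name": "A"}, {"task_id": 2, "task_name": "B"}],
    "units": [{"unit_id": 1, "task_id": 2}, {"unit_id": 2, "task_id": 2}],
}
result = randomize_legacy_ids(dump, {"tasks": "task_id", "units": "unit_id"}, random.Random(0))
print(result)
```

After the rewrite, both units still point at task "B", but under its new random id, so the relations survive while the sequential ids disappear.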


### Importing data

You can import the contents of a data dump (created with the `mephisto db export` command) into your local data like so:

```shell
mephisto db import --file <FILE_PATH>
```

where `<FILE_PATH>` can be either the full path to a file, or just the filename (if the file is located in the `outputs/export/` directory).

Note that before the import starts, a full backup of your local data will be automatically created and saved to the `outputs/backup/` directory.
135 changes: 0 additions & 135 deletions docs/web/docs/guides/how_to_use/merge_dbs/simple_usage.md

This file was deleted.

@@ -5,6 +5,7 @@
# LICENSE file in the root directory of this source tree.

"""
List of changes:
1. Rename `unit_review.created_at` -> `unit_review.creation_date`
2. Remove autoincrement parameter for all Primary Keys
3. Add missed Foreign Keys in `agents` table
@@ -13,7 +14,7 @@
"""


PREPARING_DB_FOR_MERGE_DBS_COMMAND = """
MODIFICATIONS_FOR_DATA_PORTER = """
ALTER TABLE unit_review RENAME COLUMN created_at TO creation_date;
/* Disable FK constraints */
4 changes: 2 additions & 2 deletions mephisto/abstractions/databases/migrations/__init__.py
@@ -4,9 +4,9 @@
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.

from ._001_20240325_preparing_db_for_merge_dbs_command import *
from ._001_20240325_data_porter_feature import *


migrations = {
"20240418_preparing_db_for_merge_dbs_command": PREPARING_DB_FOR_MERGE_DBS_COMMAND,
"20240418_data_porter_feature": MODIFICATIONS_FOR_DATA_PORTER,
}
@@ -5,11 +5,12 @@
# LICENSE file in the root directory of this source tree.

"""
1. Modified default value for `creation_date`
List of changes:
1. Modify default value for `creation_date`
"""


PREPARING_DB_FOR_MERGE_DBS_COMMAND = """
MODIFICATIONS_FOR_DATA_PORTER = """
/* Disable FK constraints */
PRAGMA foreign_keys = off;
@@ -36,8 +37,8 @@
INSERT INTO _run_mappings SELECT * FROM run_mappings;
DROP TABLE run_mappings;
ALTER TABLE _run_mappings RENAME TO run_mappings;
/* Runs */
CREATE TABLE IF NOT EXISTS _runs (
run_id TEXT PRIMARY KEY UNIQUE,
@@ -50,8 +51,8 @@
INSERT INTO _runs SELECT * FROM runs;
DROP TABLE runs;
ALTER TABLE _runs RENAME TO runs;
/* Qualifications */
CREATE TABLE IF NOT EXISTS _qualifications (
qualification_name TEXT PRIMARY KEY UNIQUE,
4 changes: 2 additions & 2 deletions mephisto/abstractions/providers/mturk/migrations/__init__.py
@@ -4,9 +4,9 @@
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.

from ._001_20240325_preparing_db_for_merge_dbs_command import *
from ._001_20240325_data_porter_feature import *


migrations = {
"20240418_preparing_db_for_merge_dbs_command": PREPARING_DB_FOR_MERGE_DBS_COMMAND,
"20240418_data_porter_feature": MODIFICATIONS_FOR_DATA_PORTER,
}
@@ -5,16 +5,17 @@
# LICENSE file in the root directory of this source tree.

"""
List of changes:
1. Remove autoincrement parameter for all Primary Keys
2. Added `update_date` and `creation_date` in `workers` table
3. Added `creation_date` in `units` table
2. Add `update_date` and `creation_date` in `workers` table
3. Add `creation_date` in `units` table
4. Rename field `run_id` -> `task_run_id`
5. Remove table `requesters`
6. Modified default value for `creation_date`
6. Modify default value for `creation_date`
"""


PREPARING_DB_FOR_MERGE_DBS_COMMAND = """
MODIFICATIONS_FOR_DATA_PORTER = """
/* Disable FK constraints */
PRAGMA foreign_keys = off;
@@ -4,9 +4,9 @@
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.

from ._001_20240325_preparing_db_for_merge_dbs_command import *
from ._001_20240325_data_porter_feature import *


migrations = {
"20240418_preparing_db_for_merge_dbs_command": PREPARING_DB_FOR_MERGE_DBS_COMMAND,
"20240418_data_porter_feature": MODIFICATIONS_FOR_DATA_PORTER,
}