diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md index c0bda94df..58958c8cd 100644 --- a/CONTRIBUTING.md +++ b/CONTRIBUTING.md @@ -15,7 +15,7 @@ We actively welcome your pull requests. 6. If you haven't already, complete the Contributor License Agreement ("CLA"). ### Task Contributions -TODO TODO TODO +Generally we encourage people to provide their own blueprints as part of the repo in which they release their code, though if someone creates a strong case for an abstract `Blueprint` that is generally applicable we'd be happy to review it. ## Contributor License Agreement ("CLA") In order to accept your pull request, we need you to submit a CLA. You only need diff --git a/mephisto/README.md b/mephisto/README.md index 5ee0e7618..bf6523f2f 100644 --- a/mephisto/README.md +++ b/mephisto/README.md @@ -1,16 +1,9 @@ # Mephisto -This is the main package directory, containing all of the core workings of Mephisto. The breakdown is as following: +This is the main package directory, containing all of the core workings of Mephisto. They roughly follow the divisions noted in the [architecture overview doc](https://github.com/facebookresearch/Mephisto/blob/master/docs/architecture_overview.md#agent). The breakdown is as follows: -- `client`: Contains interfaces for using Mephisto at a very high level. Primarily comprised of the python code for the cli and -- `core`: Contains components that operate on top of the data_model layer -- `data_model`: Contains the data model components as described in the architecture document, as well as the base classes for all the core abstractions. -- `providers`: contains implementations of the `CrowdProvider` abstraction -- `scripts`: contains commonly executed convenience scripts for Mephisto users -- `server`: contains implementations of the `Architect` and `Blueprint` abstractions.
-- `tasks`: an empty default directory to work on your own tasks -- `utils`: unorganized utility classes that are useful in scripts and other places -- `webapp`: contains the frontend that is deployed by the main client - -## Discussions - -Changes to this structure for clarity are being discussed in [#285](https://github.com/facebookresearch/Mephisto/issues/285). \ No newline at end of file +- `abstractions`: Contains the interface classes for the core abstractions in Mephisto, as well as implementations of those interfaces. These are the Architects, Blueprints, Crowd Providers, and Databases. +- `client`: Contains user interfaces for using Mephisto at a very high level. Primarily comprised of the python code for the cli and the web views. +- `data_model`: Contains the data model components as described in the architecture document. These are the relevant data structures that build upon the underlying MephistoDB, and are utilized throughout the Mephisto codebase. +- `operations`: Contains low-level operational code that performs more complex functionality on top of the Mephisto data model. +- `scripts`: Contains commonly executed convenience scripts for Mephisto users. +- `tools`: Contains helper methods and modules that allow for lower-level access to the Mephisto data model than the clients provide. Useful for creating custom workflows and scripts that are built on Mephisto. diff --git a/mephisto/abstractions/README.md b/mephisto/abstractions/README.md new file mode 100644 index 000000000..af25cbf66 --- /dev/null +++ b/mephisto/abstractions/README.md @@ -0,0 +1,18 @@ +# Mephisto Core Abstractions +This directory contains the interfaces for the four core Mephisto abstractions (as well as subcomponents of those abstractions). Those abstractions are discussed at a high level in the [architecture overview doc](https://github.com/facebookresearch/Mephisto/blob/master/docs/architecture_overview.md).
+ +Specific implementations can be made to extend the Mephisto data model to work with new crowd providers, new task types, and new backend server architectures. These four primary abstractions are summarized below, but other sections go more in-depth. + +### `Architect` +An [`Architect`](https://github.com/facebookresearch/Mephisto/blob/master/mephisto/abstractions/architects/README.md#architect) is an abstraction that allows Mephisto to manage setup and maintenance of task servers for you. When launching a task, Mephisto uses an `Architect` to build required server files, launch that server, deploy the task files, and then later shut it down when the task is complete. More details are found in the `abstractions/architects` folder, along with the existing `Architects`. + +Architects also require a `Channel` to allow the `Supervisor` to communicate with the server, and are expected to define their own or select a compatible one from the ones already present. + +### `Blueprint` +A [`Blueprint`](https://github.com/facebookresearch/Mephisto/blob/master/mephisto/abstractions/blueprints/README.md#overview) is the essential formula for running a task on Mephisto. It accepts some number of parameters and input data, and that should be sufficient content to be able to display a frontend to the crowdworker, process their responses, and then save them somewhere. It comprises extensions of the `AgentState` (data storage), `TaskRunner` (actual steps to complete the task), and `TaskBuilder` (resources to display a frontend) classes. More details are provided in the `abstractions/blueprints` folder, where all the existing `Blueprint`s live. + +### `CrowdProvider` +A [`CrowdProvider`](https://github.com/facebookresearch/Mephisto/blob/master/mephisto/abstractions/providers/README.md#implementation-details) is a wrapper around any of the required functionality that Mephisto will need to utilize to accept work from workers on a specific service.
Ultimately it comprises an extension of each of `Worker`, `Agent`, `Unit`, and `Requester`. More details can be found in the `abstractions/providers` folder, where all of the existing `CrowdProvider`s live. + +### `MephistoDB` +The [`MephistoDB`](https://github.com/facebookresearch/Mephisto/blob/master/mephisto/abstractions/databases/README.md) is an abstraction around the storage for the Mephisto data model, such that it could be possible to create alternate methods for storing and loading the kind of data that Mephisto requires without breaking functionality. \ No newline at end of file diff --git a/mephisto/abstractions/blueprints/README.md b/mephisto/abstractions/blueprints/README.md index f0337c116..3319d4c95 100644 --- a/mephisto/abstractions/blueprints/README.md +++ b/mephisto/abstractions/blueprints/README.md @@ -8,36 +8,28 @@ The agent state is responsible for defining the data that is important to store - `set_init_state(data)`: given data provided by the `get_init_data_for_agent` method, initialize this agent state to whatever starting state is relevant for this `Unit`. - `get_init_state()`: Return the initial state to be sent to the agent for use in the frontend. - `load_data()`: Load data that is saved to file to re-initialize the state for this `AgentState`. Generally data should be stored in `self.agent.get_data_dir()`, however any storage solution will work as long as it remains consistent. -- `get_data()`: Return the stored data for this task in the format expected to render a completed task in the frontend. +- `get_data()`: Return the stored data for this task in the format containing everything the frontend needs to render and run the task. +- `get_parsed_data()`: Return the stored data for this task in the format that is relevant for review or packaging the data. - `save_data()`: Save data to a file such that it can be re-initialized later.
Generally data should be stored in `self.agent.get_data_dir()`, however any storage solution will work as long as it remains consistent, and `load_data()` will be able to find it. - `update_data()`: Update the local state stored in this `AgentState` given the data sent from the frontend. Given your frontend is what packages data to send, this is entirely customizable by the task creator. -(TODO) Specify a format for data to be sent to the frontend for review. - ### `TaskBuilder` `TaskBuilder`s exist to abstract away the portion of building a frontend to however one would want to, allowing Mephisto users to design tasks however they'd like. They also can take build options to customize what ends up built. They must implement the following: - `build_in_dir(build_dir)`: Take any important source files and put them into the given build dir. This directory will be deployed to the frontend and will become the static target for completing the task. - `get_extra_options()`: Return the specific task options that are relevant to customize the frontend when `build_in_dir` is called. -(TODO) Remove all references to the below functon -- `task_dir_is_valid(task_dir)`: Originally this was intended to specify whether the task directory supplied outside of the task for this task to use was properly formatted, however when `Blueprint`s were finalized, the gallery no longer existed and this route of customization is no longer supported. ### `TaskRunner` The `TaskRunner` component of a blueprint is responsible for actually stepping `Agent`s through the task when it is live. It is, in short, able to set up task control. A `TaskRunner` needs to implement the following: - `get_init_data_for_agent`: Provide initial data for an assignment. If this agent is reconnecting (and as such attached to an existing task), update that task to point to the new agent (as the old agent object will no longer receive data from the frontend). 
- `run_assignment`: Handle setup for any resources required to get this assignment running. It will be launched in a background thread, and should be tolerant to being interrupted by cleanup_assignment. - `cleanup_assignment`: Send any signals to the required thread for the given assignment to tell it to terminate, then clean up any resources that were set within it. -- `get_data_for_assignment` (optional): Get the data that an assignment is going to use when run. By default, this pulls from `assignment.get_assignment_data()` however if a task has a special storage mechanism or data type, the assignment data can be fetched here. (TODO) make this optional by having the base class use the `StaticTaskRunner`'s implementation. -(TODO) task launching management at the moment is really sloppy, and the API for it is unclear. Something better needs to be picked, as at the moment `get_init_data_for_assignment` is responsible for ensuring that `run_assignment` is set up in a thread. Perhaps this responsibility should be consolidated into the `TaskLauncher` class. +- `get_data_for_assignment` (optional): Get the data that an assignment is going to use when run. By default, this pulls from `assignment.get_assignment_data()` however if a task has a special storage mechanism or data type, the assignment data can be fetched here. ## Implementations ### `StaticBlueprint` The `StaticBlueprint` class allows a replication of the interface that MTurk provides, being able to take a snippet of `HTML` and a `.csv` file and deploy tasks that fill templates of the `HTML` with values from the `.csv`. -(TODO) support other sources than a .csv +You can also specify the task data in a `.json` file, or by passing the data array or a generator to `SharedStaticTaskState.static_task_data`. ### `MockBlueprint` The `MockBlueprint` exists to test other parts of the Mephisto architecture, and doesn't actually provide a real task. 
- -## Future work -(TODO) - Clean up the notion of galleries and parent task ids, as we're consolidating into blueprints -(TODO) - Allow for using user blueprints diff --git a/mephisto/abstractions/databases/README.md b/mephisto/abstractions/databases/README.md new file mode 100644 index 000000000..5f31ef4e2 --- /dev/null +++ b/mephisto/abstractions/databases/README.md @@ -0,0 +1,5 @@ +# MephistoDB implementations +This folder contains implementations of the `MephistoDB` abstraction. + +## `LocalMephistoDB` +An implementation of the Mephisto Data Model outlined in `MephistoDB`. This database stores all of the information locally via SQLite. Some helper functions are included to make the implementation cleaner by abstracting away SQLite error parsing and string formatting, however it's pretty straightforward from the requirements of MephistoDB. diff --git a/mephisto/abstractions/providers/README.md b/mephisto/abstractions/providers/README.md index 180d024ad..8e2bbc30e 100644 --- a/mephisto/abstractions/providers/README.md +++ b/mephisto/abstractions/providers/README.md @@ -29,11 +29,9 @@ A specific interface for launching tasks on the MTurk sandbox (TODO) Can we bundle this into the `MTurkProvider` and make it so that providers have a TEST/SANDBOX mode bundled in? This would clarify how the testing utilities work, without needing to publish real tasks. -### LocalProvider +### LocalProvider (TODO) An interface that allows for launching tasks on your local machine, allowing for ip-address based workers to submit work. -(TODO) IMPLEMENT THIS - ### MockProvider An implementation of a provider that allows for robust testing by exposing all of the underlying state to a user. 
@@ -71,7 +69,7 @@ The `Unit` implementation needs to be able to handle the following intera ### `Requester` The `Requester` mostly just needs to abstract the registration process, but the full list of functions are below: - `register`: Given arguments, register this requester -- `get_register_args`: Return the arguments required to register one of these requesters. (TODO) can we turn this into an argparse group somehow? And then later extract from the argparse group to send to the frontend. +- `get_register_args`: Return the arguments required to register one of these requesters. - `is_registered`: Determine if the current credentials for a `Requester` are valid. - `get_available_budget` (Optional): return the available budget for this requester. diff --git a/mephisto/abstractions/providers/mturk/utils/script_utils.py b/mephisto/abstractions/providers/mturk/utils/script_utils.py index aba8ebfca..c8a66e236 100644 --- a/mephisto/abstractions/providers/mturk/utils/script_utils.py +++ b/mephisto/abstractions/providers/mturk/utils/script_utils.py @@ -9,6 +9,7 @@ from mephisto.abstractions.providers.mturk.mturk_utils import give_worker_qualification from mephisto.data_model.requester import Requester from mephisto.data_model.unit import Unit +from tqdm import tqdm if TYPE_CHECKING: from mephisto.abstractions.database import MephistoDB @@ -42,9 +43,7 @@ def direct_soft_block_mturk_workers( ) mturk_client = requester._get_client(requester._requester_name) - for idx, worker_id in enumerate(worker_list): - if idx % 50 == 0: - print(f"Blocked {idx + 1} workers so far.") + for worker_id in tqdm(worker_list): try: give_worker_qualification( mturk_client, worker_id, qualification_id, value=1 diff --git a/mephisto/data_model/test/README.md b/mephisto/abstractions/test/README.md similarity index 56% rename from mephisto/data_model/test/README.md rename to mephisto/abstractions/test/README.md index 8779cd2c1..6665cbaa2 100644 --- a/mephisto/data_model/test/README.md +++ 
b/mephisto/abstractions/test/README.md @@ -1,6 +1,7 @@ -# data_model/test -## Testers +# Abstraction testers This folder contains a number of Mephisto Data Model "test benches", which serve to be the standard tests that Mephisto Abstractions need to be able to pass in order for the system to be able to use them. As such, they define a number of tests, and then new classes can be tested against the bench by making a subclass that implements the required setup functions. See the `test/server/architects/test_heroku_architect` implementation for an example. +Implementations can add their own additional test methods after extending the baseline test benches in order to ensure that they have a common place to test their complete functionality. + ## Utils -Any utility functions that can be used for creating useful mocks, DB setups, or other such prerequisites for a test. +The `utils.py` module is set up with utility functions that can be used for creating useful mocks, DB setups, or other such prerequisites for a test. 
diff --git a/mephisto/data_model/test/__init__.py b/mephisto/abstractions/test/__init__.py similarity index 100% rename from mephisto/data_model/test/__init__.py rename to mephisto/abstractions/test/__init__.py diff --git a/mephisto/data_model/test/architect_tester.py b/mephisto/abstractions/test/architect_tester.py similarity index 98% rename from mephisto/data_model/test/architect_tester.py rename to mephisto/abstractions/test/architect_tester.py index 834e68c10..9540a35fc 100644 --- a/mephisto/data_model/test/architect_tester.py +++ b/mephisto/abstractions/test/architect_tester.py @@ -13,7 +13,7 @@ import requests from mephisto.abstractions.architect import Architect from mephisto.data_model.task_run import TaskRun -from mephisto.data_model.test.utils import get_test_task_run +from mephisto.abstractions.test.utils import get_test_task_run from mephisto.abstractions.database import MephistoDB from mephisto.abstractions.blueprint import SharedTaskState from mephisto.abstractions.blueprints.mock.mock_task_builder import MockTaskBuilder diff --git a/mephisto/data_model/test/blueprint_tester.py b/mephisto/abstractions/test/blueprint_tester.py similarity index 99% rename from mephisto/data_model/test/blueprint_tester.py rename to mephisto/abstractions/test/blueprint_tester.py index 222806aa0..bc00b6199 100644 --- a/mephisto/data_model/test/blueprint_tester.py +++ b/mephisto/abstractions/test/blueprint_tester.py @@ -21,7 +21,7 @@ from mephisto.abstractions.databases.local_database import LocalMephistoDB from mephisto.data_model.assignment import Assignment from mephisto.data_model.task_run import TaskRun -from mephisto.data_model.test.utils import get_test_task_run +from mephisto.abstractions.test.utils import get_test_task_run from mephisto.abstractions.providers.mock.mock_agent import MockAgent from mephisto.data_model.agent import Agent from mephisto.operations.hydra_config import MephistoConfig diff --git a/mephisto/data_model/test/crowd_provider_tester.py 
b/mephisto/abstractions/test/crowd_provider_tester.py similarity index 100% rename from mephisto/data_model/test/crowd_provider_tester.py rename to mephisto/abstractions/test/crowd_provider_tester.py diff --git a/mephisto/data_model/test/data_model_database_tester.py b/mephisto/abstractions/test/data_model_database_tester.py similarity index 99% rename from mephisto/data_model/test/data_model_database_tester.py rename to mephisto/abstractions/test/data_model_database_tester.py index e9cac8845..6066dfb22 100644 --- a/mephisto/data_model/test/data_model_database_tester.py +++ b/mephisto/abstractions/test/data_model_database_tester.py @@ -7,7 +7,7 @@ import unittest from typing import Optional, Tuple -from mephisto.data_model.test.utils import ( +from mephisto.abstractions.test.utils import ( get_test_assignment, get_test_project, get_test_requester, diff --git a/mephisto/data_model/test/utils.py b/mephisto/abstractions/test/utils.py similarity index 100% rename from mephisto/data_model/test/utils.py rename to mephisto/abstractions/test/utils.py diff --git a/mephisto/data_model/README.md b/mephisto/data_model/README.md index bbd3482f8..4a1d90df0 100644 --- a/mephisto/data_model/README.md +++ b/mephisto/data_model/README.md @@ -10,12 +10,12 @@ Note: This abstraction is broken specifically in the case of `Agent`s that are c The following classes are the units of Mephisto, in that they keep track of what mephisto is doing, where things are stored, history of workers, etc. The rest of the system in general should only be utilizing these classes to make any operations, allowing them to be a strong abstraction layer. ### `Project` -High level project that many crowdsourcing tasks may be related to. Useful for budgeting and grouping tasks for a review perspective. They are primarily a bookkeeping tool. +High level project that many crowdsourcing tasks may be related to. Useful for budgeting and grouping tasks for a review perspective. They are primarily a bookkeeping tool. 
At the moment they are fairly under-implemented, but can certainly be extended. ### `Task` -This class contains all of the required tidbits for launching a set of assignments, including where to find the frontend files to deploy (based on the `Blueprint`), possible arguments for configuring the assignments more exactly (a set of `TaskParam`s), the associated project (if supplied). +The `Task` class is required to create a group of `TaskRun`s for the purpose of aggregation and bookkeeping. Much of what is present in the current `Task` implementation can be deprecated. Much of the functionality here for ensuring that a task has common arguments and correct components is now stored in the `Blueprint` concept. -(TODO) at the moment, the required state creation for bookkeeping for a task (creating the directory to store assignment information and such) is handled in the `new` method for `Task`. This should really be hidden away behind the `MephistoDB` or `core.utils`. Much of the complexity for task creation is now hidden behind `task_type` (`Blueprint`s) though, so it's possible this needs to be totally removed. +Eventually the `Task` code can be deprecated and replaced with useful aggregation functionality across `TaskRun`s within. ### `TaskRun` This class keeps track of the configuration options and all assignments associated with an individual launch of a task. It also provides utility functions for ensuring workers can be assigned units (`get_valid_units_for_worker`, `reserve_unit`). @@ -25,8 +25,7 @@ Generally has 3 states: - In Flight (launched, `_has_assignments=True`, `_is_completed=False`): After launch, when tasks are still in flight and may still be updating statuses. - Completed (all tasks done/expired, `_has_assignments=True`, `_is_completed=True`): Once a task run is fully complete and no tasks will be launched anymore, it's ready for review. -(TODO) Definitely needs a way to keep track of parameters. 
-(TODO) Responsible for determining worker eligibility? Perhaps this is better placed into the `Supervisor`? As of now there's an `is_eligible` function in `Worker` that should probably be assigned with this. +Configuration parameters for launching a specific run are stored in the form of a json dump of the configuration file provided for the launch. ### `Assignment` This class represents a single unit of work, or a thing that needs to be done. This can be something like "annotate this specific image" or "Have a conversation between two specified characters." It can be seen as an individual instantiation of the more general `Task` described above. As it is mostly captured by the `Blueprint` running the task, the only remaining components are knowing where the data is stored (`get_assignment_data`), tracking the assignment status (`get_status`) and knowing which `Worker`s and `Unit`s are associated with that progress. @@ -40,24 +39,11 @@ This class represents an individual - namely a person. It maintains components o ### `Agent` This class encompasses a worker as they are working on an individual assignment. It maintains details for the current task at hand such as start and end time, connection status, etc. Generally this is an abstraction the worker operating at a frontend and the backend interactions. The `Supervisor` class is responsible for maintaining most of that abstraction, so this class mostly needs to implement ways to approve and reject work, as well as get a work's status or mark it as done when the final work is received. -(TODO) actually implement end time of an assignment, perhaps by leveraging `mark_done`? - ### `Requester` This class encompasses your identity as it is known by a `CrowdProvider` you intend to launch tasks on. It keeps track of some metadata on your account (such as your budget) but also some Mephisto usage statistics (such as amount spent in total from that requester). 
-## Mephisto Abstraction Interfaces -Specific implementations can be made to extend the Mephisto data model to work with new crowd providers, new task types, and new backend server architectures. These are summarized below, but other sections go more in-depth. - -(TODO) link other READMEs here. - -### `CrowdProvider` -A crowd provider is a wrapper around any of the required functionality that Mephisto will need to utilize to accept work from workers on a specific service. Ultimately it comprises of an extension of each of `Worker`, `Agent`, `Unit`, and `Requester`. More details can be found in the `providers` folder, where all of the existing `CrowdProvider`s live. - -### `Blueprint` -A blueprint is the essential formula for running a task on Mephisto. It accepts some number of parameters and input data, and that should be sufficient content to be able to display a frontend to the crowdworker, process their responses, and then save them somewhere. It comprises of extensions of the `AgentState` (data storage), `TaskRunner` (actual steps to complete the task), and `TaskBuilder` (resources to display a frontend) classes. More details are provided in the `server/blueprints` folder, where all the existing `Blueprint`s live. - -### `Architect` -An `Architect` is an abstraction that allows Mephisto to manage setup and maintenance of task servers for you. When launching a task, Mephisto uses an `Architect` to build required server files, launch that server, deploy the task files, and then later shut it down when the task is complete. More details are found in the `server/architects` folder, along with the existing `Architects`. +### Qualification and GrantedQualification +These classes act as a method for assigning Mephisto-backed qualifications to workers in a manner such that the same qualifications can be used across multiple different crowd providers, or with crowd providers that don't normally provide a method for granting qualifications before a worker connects. 
## Non-Database backed abstractions Some classes in the data model aren't backed by the data model because they are generally lightweight views of existing data or transient containers. @@ -69,13 +55,4 @@ Encapsulates messages being sent from the `Supervisor` to any Mephisto server. Keeps track of specific parameters that are necessary to launch a task on any crowd provider, like `title`, `description`, `tags`, `quantity`, `pay_amount`, etc. `TaskRuns` leverage the `TaskConfig` to know what they're doing. ## Constants -Some Mephisto constants that are able to standardize values across multiple classes live in the data model - -(TODO) Does it make sense to move these into a `constants`, folder, and have them be pure? - -### `AssignmentState` -These track the possible valid assignment states that Mephisto is aware of -### Blueprint.AgentState -These track the possible states that an individual agent may be in -### `constants.py` -This file is the catch-all for any other shared constants when there aren't enough in a similar category to make its own file. +Some Mephisto constants that are able to standardize values across multiple classes live in the data model within the constants folder. diff --git a/mephisto/data_model/constants/README.md b/mephisto/data_model/constants/README.md new file mode 100644 index 000000000..66cfed1b0 --- /dev/null +++ b/mephisto/data_model/constants/README.md @@ -0,0 +1,2 @@ +# Data model constants +This folder contains constants modules for constants that are used in multiple places in the data model. As they are not tied to just one place, they need a common import location.
\ No newline at end of file diff --git a/mephisto/operations/README.md b/mephisto/operations/README.md new file mode 100644 index 000000000..11cded94e --- /dev/null +++ b/mephisto/operations/README.md @@ -0,0 +1,80 @@ +# Mephisto Operations +The contents of the operations folder comprise controllers for launching and monitoring tasks, as well as other classes that either operate on the data model or support the mid level of Mephisto's design. Each has a high level responsibility, detailed below. Within these classes there's something of a hierarchy depending on the amount of underlying infrastructure that a class is built upon. + +### High-level controller components +This level of components is reserved for modules that operate on the highest level of the Mephisto hierarchy. These should be either directly usable, or easy to bundle into scripts for the client/api. + +- `Operator`: High-level class responsible for launching and monitoring a `TaskRun`. Generally initialized using a `RunScriptConfig` and the `validate_and_run_config` method. + +At the moment only the `Operator` exists in this level, as the module that manages the process of launching and monitoring a complete data collection job. Modules on a similar level of complexity may be written for the review flow, and for packaging data for release. + +### Mid-level connecting components +These components are responsible for tying some of the underlying data model components to the reality of what they represent. They ensure that tasks remain in sync with what is actually happening, such that the content on Mephisto matches what is present on crowd providers and architects, and to some degree to blueprints. + +- `Supervisor`: Responsible for following the status of a worker from the point they attempt to accept a `Unit` until the `Unit` is either completed or returned.
This includes spawning the threads that watch specific `Assignment`s or `Unit`s, evaluating onboarding and qualifications, and ensuring that reconnecting workers are directed to the correct agents. The supervisor acts as the bridge between `Architect`s and `Blueprint`s. +- `registry.py`: Responsible for keeping track of instances of all of the Mephisto core abstractions, such that the system is able to refer to them just by name. +- `TaskLauncher`: Responsible for moving through an iterator or generator of assignments and units to be created, first creating the local Mephisto state to represent them and then later leveraging the `CrowdProvider` to launch them. Also ensures certain configuration invariants hold, such as a maximum number of concurrent tasks available. + +### Low-level Mephisto infra +These modules contain functionality that is used to condense shared behavior from various parts of the Mephisto codebase into standard functionality and utilities. + +- `config_handler.py`: Functions responsible for providing a consistent interface into a user's configuration file for Mephisto, stored at `~/.mephisto/config.yml`. +- `hydra_config.py`: Classes and functionality responsible for ensuring that Mephisto operates well using Hydra, including base classes to build Hydra structured configs from (such as the `RunScriptConfig`) and methods to simplify interacting with Hydra in the codebase. +- `logger_core.py`: Helpers to simplify the process of generating unique loggers and logging configuration for various parts of Mephisto. (Much still to be done here). +- `utils.py`: Various smaller utility functions that are used in many places within the Mephisto codebase. (Likely getting to a point where these should be grouped). + + +## `Operator` +The Operator is responsible for actually coordinating launching tasks. This is managed using the `validate_and_run_config` function.
It takes in a Hydra `DictConfig` of arguments corresponding to the `Blueprint`, `Architect`, and `CrowdProvider` of choice. It can also take a `SharedTaskState` object to pass information that wouldn't normally be able to be parsed on the command line, or where it can only be extracted at runtime. + +One important extra argument is `SharedTaskState.qualifications`, which allows configuring a task with requirements for workers to be eligible to work on the task. Functionality for this can be seen in `data_model.qualifications`, with examples in how `operator` handles the `block_qualification`. + +The lifecycle of an operator is to launch as many Jobs as desired using the `validate_and_run_config` function, and then to watch over their progress using the `wait_for_runs_then_shutdown` function. These methods cover the full requirements for launching and monitoring a job. + +If `wait_for_runs_then_shutdown` is not used, it's always important to call the `shutdown` methods whenever an operator has been created. While tasks are underway, a user can use `get_running_task_runs` to see the status of things that are currently running. Once there are no running task runs, the `Operator` can be told to shut down. + + +## `Supervisor` +The supervisor is responsible for interfacing between human agents and the rest of the mephisto system. In short, it is the layer that abstracts humans and human work into `Worker`s and `Agent`s that take actions. To that end, it has to set up a socket to connect to the task server, poll status on any agents currently working on tasks, and process incoming agent actions over the socket to put them into the `Agent` so that a task can use the data. It also handles the initialization of an `Agent` from a `Worker`, which is the operation that occurs when a human connecting to the service is accepting a task.
+ +At a high level, the supervisor manages establishing the abstraction by keeping track of `Job`s (a triple of `Architect`, `Blueprint`, and `CrowdProvider`). The supervisor uses them for the following: +- The `Architect` tells the `Supervisor` where the server(s) that agents communicate with are running. In `register_job`, a socket is opened for each of these servers. +- The `Blueprint` contains details about the relevant task run, and is used for properly registering a new `Agent` for the correct `Unit`. For this, in `_register_agent` it gets all `Unit`s that a worker is eligible for, reserves one, and then handles creating a new `Agent` out of the given `Worker`-`Unit` pair. +- The `CrowdProvider` is also used during the registration process. In the first part it ensures that upon the first connection by a new person, a corresponding `Worker` is created to keep records for that worker (`_register_worker`). Later it is used during `_register_agent` to ensure that the `Agent` class used is associated with the correct `CrowdProvider` and has its relevant abstractions. + +From this point, all interactions are handled from the perspective of pure Mephisto `Agent`s, and the remaining responsibilities of the `Supervisor` are to ensure that, from the perspective of a `Blueprint`'s `TaskRunner`, the `Agent`'s local Python state is entirely representative of the actual state of the human worker in the task. To handle that, it has three primary functions: +- Incoming messages from the server (which represent actions taken by human agents) are passed to the `pending_actions` queue of the `Agent` that corresponds with that human agent. Future calls to `Agent.act()` will pop off from this queue. +- Calls to `Agent.observe()` will add messages to that `Agent`'s `pending_observations` list. The `Supervisor` should periodically send messages from all `Agent`s through to the server, such that the person is able to receive them.
+- The `Supervisor` should also be querying for `Agent`'s state and putting any updates into the `Agent` itself, thus allowing tasks to know if an `Agent` has disconnected, returned a task, etc. + +## `registry` +The `registry.py` file contains functions required for establishing a registry of abstraction modules for Mephisto to refer to. This allows Mephisto to properly re-initialize classes and get information for data stored in the MephistoDB without needing to store pickled modules, or information beyond the registration key. + +The file exposes the `register_mephisto_abstraction` class decorator, which ensures that on import a specific module will be added to the given registry. The `fill_registries` function automatically populates the registry dicts with all of the relevant modules in Mephisto, though this should likely be expanded to allow users or repositories to mark or register their own Mephisto implementations. + +Mephisto classes can then use the `get__from_type` methods from the file to retrieve the specific modules to be initialized for the given abstraction type string. + +## `TaskLauncher` +The `TaskLauncher` class is a fairly lightweight class responsible for handling the process of launching units. A `TaskLauncher` is created for a specific `TaskRun`, and provided with `assignment_data` for that full task run. It creates `Assignment`s and `Unit`s for the `TaskRun`, and packages the expected data into the `Assignment`. When a task is ready to go live, one calls `launch_units(url)` with the `url` that the task should be pointed to. If units need to be expired (such as during a shutdown), `expire_units` handles this for all units created for the given `TaskRun`. + +`TaskLauncher`s will parse the `TaskRun`'s `TaskConfig` to know what parameters to set. This info should be used to initialize the assignments and the units as specified. 
The `TaskLauncher` can also be used to limit the number of currently available tasks using the `max_num_concurrent_units` argument, which prevents too many tasks from running at the same time, potentially overrunning the `TaskRunner` that the `Blueprint` has provided. + + +## `config_handler.py` +The methods in this module standardize how Mephisto interacts with the user configuration options for the whole system. These are stored in `~/.mephisto/config.yml` at the moment. The config file is divided into sections, each containing keys. Those keys can contain any value, but writing and reading data is done by referring to the `section` and the `key` for the data being written or read. + +The following methods are defined: +- `get_config`: Loads all of the contents of the Mephisto config file. +- `write_config`: Writes an entirely new config to file. +- `init_config`: Tries to create an initial configuration file if none exists. +- `add_config_arg`: Sets the value for just one configuration arg in the whole set. +- `get_config_arg`: Returns a specific argument value from a section of the config. + +## `hydra_config.py` +The Hydra config module contains a number of classes and methods to make interfacing with Hydra a little more convenient for Mephisto and its users. It defines common structured config types, currently the `MephistoConfig` and the `RunScriptConfig`, for use in user code. It also defines methods for registering those structured configs under the expected names, which the `registry` relies on. Lastly, it provides the `register_script_config` method, which lets a user define a structured config for use in their scripts without needing to initialize a Hydra `ConfigStore`. + +## `logger_core.py` +This module contains helpers to simplify the process of generating unique loggers and logging configuration for various parts of Mephisto.
At the moment this only outlines the basic logging style that Mephisto uses, though much remains to be done: setting up logging throughout Mephisto, adding simplified controls for getting debug information across certain files, and supporting user configuration of Mephisto logs. + +## Utils +The `utils.py` file contains a number of helper utils that (at the moment) rely on the local-storage implementation of Mephisto. These utils help navigate the files present in the Mephisto architecture, identify task files, link classes, etc. Docstrings in this file explain each in more detail. \ No newline at end of file diff --git a/mephisto/scripts/README.md b/mephisto/scripts/README.md new file mode 100644 index 000000000..de5e33d4e --- /dev/null +++ b/mephisto/scripts/README.md @@ -0,0 +1,4 @@ +# scripts +This directory is for convenience scripts that all Mephisto users may find useful. They should be runnable, polished, and have some kind of API or user interface, as opposed to being methods or modules like those present in the `tools` directory. + +Scripts in this directory should be grouped into folders by the abstractions or tasks they relate to. \ No newline at end of file diff --git a/mephisto/scripts/mturk/README.md b/mephisto/scripts/mturk/README.md new file mode 100644 index 000000000..956c0687c --- /dev/null +++ b/mephisto/scripts/mturk/README.md @@ -0,0 +1,12 @@ +# MTurk Scripts +This directory contains scripts that may be useful for Mephisto users that use MTurk as a crowd provider. Descriptions of the scripts and what they do can be found here: + +# Cleanup +The cleanup script `cleanup.py` is to be used when a run exits due to a catastrophic failure, such as a power outage, sudden reboot, or series of eager Ctrl-C presses. It will search through any tasks that seem to be active and running, and allow users to select which ones to take down. + +When run, the script will ask which requester you want to clean up from.
It will try to find all of the HITs currently associated with that requester, and see if any of them appear to be broken or active. (If you have an active job running, there's currently no clear way to distinguish between its HITs and those from a previously failed run.) After this the script will ask whether you want to clean up by title, or just clean up all of the tasks. + +# Soft-block Workers by MTurk ID +The script `soft_block_workers_by_mturk_id.py` exists to allow a smooth transition into using Mephisto for users that may have blocklists in other locations. Mephisto doesn't directly allow granting Mephisto-backed qualifications to workers that are not in the MephistoDB, but this script can be used to assign such a qualification to MTurk workers just by their ID. + +To use the script, enter the requester name that you would like to assign the block from, the Mephisto qualification name you will be using to block, and then a newline-separated list of the MTurk IDs you want to block. After this, entering a blank line will block all of the given workers. \ No newline at end of file diff --git a/mephisto/tools/README.md b/mephisto/tools/README.md new file mode 100644 index 000000000..471c83a80 --- /dev/null +++ b/mephisto/tools/README.md @@ -0,0 +1,20 @@ +# Tools +The tools directory contains helper methods and modules that allow for lower-level access to the Mephisto data model than the clients provide. These may be useful for creating custom workflows and scripts that are built on Mephisto. + +At the moment this folder contains the following: +- `MephistoDataBrowser`: The `MephistoDataBrowser` is a convenience tool for accessing all of the units and data associated with a specific task run or task name. It is generally used when reviewing or compiling data. +- `scripts.py`: The methods available in `scripts.py` are to be used in user scripts that rely on Mephisto.
At the moment, these methods allow for easy configuration of a database as well as augmentation of a script config for use in a Mephisto `TaskRun`. + +## `MephistoDataBrowser` +The `MephistoDataBrowser` at the moment can handle the job of getting all `Unit`s that are associated with a given task or task run. It can also retrieve the relevant data about a `Unit`, including the work done for that `Unit`, if the `Unit` is completed. + +It has three usable methods at the moment: +- `get_units_for_run_id`: This will return a list of all final `Unit`s associated with the given `task_run_id`. These will all be in a terminal state, such as `COMPLETED`, `ACCEPTED` or `REJECTED`. Units that are still in flight will not appear using this method. +- `get_units_for_task_name`: This will go through all task runs that share the given `task_name`, and collect their units in the same manner as `get_units_for_run_id`. +- `get_data_from_unit`: When given a `Unit` that is in a terminal state, this method will return data about that `Unit`, including the Mephisto id of the worker, the status of the work, the data saved by this `Unit`, and the start and end times for when the worker produced the data. + +## `scripts.py` +This file contains a few helper methods for running scripts relying on the `MephistoDB`. They are as follows: +- `get_db_from_config`: This method takes in a Hydra-produced `DictConfig` containing a `MephistoConfig` (such as a `RunScriptConfig`), and returns an initialized `MephistoDB` compatible with the configuration. Right now this exclusively leverages the `LocalMephistoDB`. +- `augment_config_from_db`: This method takes in a `RunScriptConfig` and a `MephistoDB`, parses the content to ensure that a valid requester and architect setup exists, and then updates the config. It also has validation steps that require user confirmation for live runs. It returns the updated config.
+- `load_db_and_process_config`: This is a convenience method that wraps the above two methods, loading in the appropriate `MephistoDB` and using it to process the script. It returns the db and a valid config. \ No newline at end of file diff --git a/mephisto/tools/data_browser.py b/mephisto/tools/data_browser.py index ef85710ce..aa7389005 100644 --- a/mephisto/tools/data_browser.py +++ b/mephisto/tools/data_browser.py @@ -28,6 +28,10 @@ def __init__(self, db=None): self.db = db def _get_units_for_task_runs(self, task_runs: List[TaskRun]) -> List[Unit]: + """ + Return a list of all Units in a terminal completed state from all + the provided TaskRuns. + """ units = [] for task_run in task_runs: assignments = task_run.get_assignments() @@ -44,16 +48,31 @@ def _get_units_for_task_runs(self, task_runs: List[TaskRun]) -> List[Unit]: return units def get_units_for_task_name(self, task_name: str) -> List[Unit]: + """ + Return a list of all Units in a terminal completed state from all + task runs with the given task_name + """ tasks = self.db.find_tasks(task_name=task_name) assert len(tasks) >= 1, f"No task found under name {task_name}" task_runs = self.db.find_task_runs(task_id=tasks[0].db_id) return self._get_units_for_task_runs(task_runs) def get_units_for_run_id(self, run_id: str) -> List[Unit]: + """ + Return a list of all Units in a terminal completed state from the + task run with the given run_id + """ task_run = TaskRun(self.db, run_id) return self._get_units_for_task_runs([task_run]) def get_data_from_unit(self, unit: Unit) -> Dict[str, Any]: + """ + Return a dict containing all data associated with the given + unit, including its status, data, and start and end time. + + Also includes the DB ids for the worker, the unit, and the + relevant assignment this unit was a part of. 
+ """ agent = unit.get_assigned_agent() assert ( agent is not None diff --git a/test/abstractions/architects/test_heroku_architect.py b/test/abstractions/architects/test_heroku_architect.py index faa348e0f..510d570fe 100644 --- a/test/abstractions/architects/test_heroku_architect.py +++ b/test/abstractions/architects/test_heroku_architect.py @@ -9,7 +9,7 @@ import pytest from typing import Type, ClassVar, Optional -from mephisto.data_model.test.architect_tester import ArchitectTests +from mephisto.abstractions.test.architect_tester import ArchitectTests from mephisto.abstractions.architects.heroku_architect import ( HerokuArchitect, HerokuArchitectArgs, diff --git a/test/abstractions/architects/test_local_architect.py b/test/abstractions/architects/test_local_architect.py index dfaafefd4..f4c03086f 100644 --- a/test/abstractions/architects/test_local_architect.py +++ b/test/abstractions/architects/test_local_architect.py @@ -12,7 +12,7 @@ import shlex from typing import Type, ClassVar, Optional -from mephisto.data_model.test.architect_tester import ArchitectTests +from mephisto.abstractions.test.architect_tester import ArchitectTests from mephisto.abstractions.architects.local_architect import ( LocalArchitect, LocalArchitectArgs, diff --git a/test/abstractions/architects/test_mock_architect.py b/test/abstractions/architects/test_mock_architect.py index a7353a3e3..cef43faf2 100644 --- a/test/abstractions/architects/test_mock_architect.py +++ b/test/abstractions/architects/test_mock_architect.py @@ -10,7 +10,7 @@ import tempfile from typing import Type, ClassVar -from mephisto.data_model.test.architect_tester import ArchitectTests +from mephisto.abstractions.test.architect_tester import ArchitectTests from mephisto.abstractions.architects.mock_architect import ( MockArchitect, MOCK_DEPLOY_URL, diff --git a/test/abstractions/blueprints/test_mock_blueprint.py b/test/abstractions/blueprints/test_mock_blueprint.py index 8992de30b..2be913c9e 100644 --- 
a/test/abstractions/blueprints/test_mock_blueprint.py +++ b/test/abstractions/blueprints/test_mock_blueprint.py @@ -10,7 +10,7 @@ import tempfile from typing import Type, ClassVar -from mephisto.data_model.test.blueprint_tester import BlueprintTests +from mephisto.abstractions.test.blueprint_tester import BlueprintTests from mephisto.data_model.constants.assignment_state import AssignmentState from mephisto.abstractions.blueprints.mock.mock_blueprint import MockBlueprint from mephisto.abstractions.blueprints.mock.mock_task_builder import MockTaskBuilder @@ -25,7 +25,7 @@ ) from mephisto.data_model.assignment import Assignment from mephisto.data_model.task_run import TaskRun -from mephisto.data_model.test.utils import get_test_task_run +from mephisto.abstractions.test.utils import get_test_task_run # TODO(#97) Update supervisor to be able to provide mock setups to test against a blueprint from mephisto.abstractions.providers.mock.mock_agent import MockAgent diff --git a/test/abstractions/providers/mturk_sandbox/test_mturk_provider.py b/test/abstractions/providers/mturk_sandbox/test_mturk_provider.py index ee3e74b0d..45126c497 100644 --- a/test/abstractions/providers/mturk_sandbox/test_mturk_provider.py +++ b/test/abstractions/providers/mturk_sandbox/test_mturk_provider.py @@ -12,8 +12,8 @@ import pytest from typing import Type -from mephisto.data_model.test.utils import get_test_requester -from mephisto.data_model.test.crowd_provider_tester import CrowdProviderTests +from mephisto.abstractions.test.utils import get_test_requester +from mephisto.abstractions.test.crowd_provider_tester import CrowdProviderTests from mephisto.abstractions.crowd_provider import CrowdProvider from mephisto.abstractions.providers.mturk_sandbox.sandbox_mturk_provider import ( SandboxMTurkProvider, diff --git a/test/core/test_database.py b/test/core/test_database.py index d871e52ff..b14ddef55 100644 --- a/test/core/test_database.py +++ b/test/core/test_database.py @@ -9,7 +9,7 @@ import os 
import tempfile -from mephisto.data_model.test.data_model_database_tester import BaseDatabaseTests +from mephisto.abstractions.test.data_model_database_tester import BaseDatabaseTests from mephisto.abstractions.databases.local_database import LocalMephistoDB diff --git a/test/core/test_operator.py b/test/core/test_operator.py index c61263cb0..15fe52bf6 100644 --- a/test/core/test_operator.py +++ b/test/core/test_operator.py @@ -13,7 +13,7 @@ import time import threading -from mephisto.data_model.test.utils import get_test_requester +from mephisto.abstractions.test.utils import get_test_requester from mephisto.data_model.constants.assignment_state import AssignmentState from mephisto.abstractions.databases.local_database import LocalMephistoDB from mephisto.operations.operator import Operator diff --git a/test/core/test_supervisor.py b/test/core/test_supervisor.py index 17170547f..4043da608 100644 --- a/test/core/test_supervisor.py +++ b/test/core/test_supervisor.py @@ -19,7 +19,7 @@ from mephisto.abstractions.providers.mock.mock_provider import MockProvider from mephisto.abstractions.databases.local_database import LocalMephistoDB from mephisto.operations.task_launcher import TaskLauncher -from mephisto.data_model.test.utils import get_test_task_run +from mephisto.abstractions.test.utils import get_test_task_run from mephisto.data_model.assignment import InitializationData from mephisto.data_model.task_run import TaskRun from mephisto.operations.supervisor import Supervisor, Job diff --git a/test/core/test_task_launcher.py b/test/core/test_task_launcher.py index 4c5c36358..e33e4525e 100644 --- a/test/core/test_task_launcher.py +++ b/test/core/test_task_launcher.py @@ -11,7 +11,7 @@ from typing import List, Iterable import time -from mephisto.data_model.test.utils import get_test_task_run +from mephisto.abstractions.test.utils import get_test_task_run from mephisto.abstractions.databases.local_database import LocalMephistoDB from mephisto.operations.task_launcher 
import TaskLauncher from mephisto.data_model.assignment import InitializationData