From b888068fc8e6ab1a195c2bc316ed3c63ae4acdcd Mon Sep 17 00:00:00 2001 From: Sean Kim <33474168+seankim658@users.noreply.github.com> Date: Sat, 16 Mar 2024 22:32:53 -0400 Subject: [PATCH] refactor documentation --- README.md | 206 +------------------------------------- docs/config_file.md | 36 +++++++ docs/id_assign_process.md | 67 +++++++++++++ docs/id_implementation.md | 196 ++++++++++++++++++++++++++++++++++++ docs/initial_setup.md | 85 ++++++++++++++++ 5 files changed, 389 insertions(+), 201 deletions(-) create mode 100644 docs/config_file.md create mode 100644 docs/id_assign_process.md create mode 100644 docs/id_implementation.md create mode 100644 docs/initial_setup.md diff --git a/README.md b/README.md index 6cde5a1..0e6cc5f 100644 --- a/README.md +++ b/README.md @@ -1,205 +1,9 @@ # Biomarker Backend API -- [Server Requirements](#server-requirements) -- [Getting Started](#getting-started) - - [Clone the Repository](#clone-the-repository) - - [Creating and Starting Docker Container for MongoDB](#creating-and-starting-docker-container-for-mongodb) - - [Initialize MongoDB User](#initialize-mongodb-user) - - [Assign Biomarker IDs](#assign-biomarker-ids) - - [Populate Database](#populate-database) - - [Copy Files](#copy-files) - - [Creating and Starting Docker Container for the APIs](#creating-and-starting-docker-container-for-the-apis) -- [Config File Definitions](#config-file-definitions) -- [Internal Backend Documentation](#internal-backend-documentation) - - [ID Assignment System](#id-assignment-system) +Usage Guides: +- [Initial Server Setup](/docs/initial_setup.md) +- [Biomarker ID Assignment Process](/docs/id_assign_process.md) +- [ID Backend Implementation Documentation](/docs/id_implementation.md) +- [Config File Definitions](/docs/config_file.md) API documentation can be found [here](./api/biomarker/README.md). 
- -# Server Requirements - -The following must be available on your server: -- wheel -- pymongo -- docker - -# Getting Started - -## Clone the Repository - -Clone the repository onto your host machine using `git clone`. - -## Creating and Starting Docker Container for MongoDB - -Navigate to the `/api` subdirectory and run the `create_mongodb_container.py` script: - -```bash -cd api -python create_mongodb_container.py -s $SER -docker ps --all -``` - -The first command will navigate you into the api directory. The second command will run the script. The `$SER` argument should be replaced with the server you are running on (dev, tst, beta, prd). The last command lists all docker containers. You should see the docker mongodb docker container that the script created, in the format of `running_biomarker-api_mongo_$SER` where `$SER` is the specified server. - -Expected output should look something like this: - -```bash -Found container: running_biomarker-api_mongo_{SER} -Found network: biomarker-api_backend_network_{SER} -e6c50502da1b - -5e1146780c4fa96a6af6e4555cd119368e9907c4d50ad4790f9f5e54e13bf043 -7baa10fed7e89181c902b24f7af9991e07b00c0f3f31f7df58cccba80aef1a2c -``` - -The first two print statements indicate that an old instance of the container and docker network were found. These will be removed by the script. The `e6c50502da1b` is the ID of the removed container. This indicates that the `docker rm -f ...` command executed successfully and removed the existing container. The second to last line is the ID of the newly created docker network. The last line is the ID of the newly created docker container. - -Start the MongoDB container using the `docker start {container}` command or by creating a service file. The service file should be located at `/usr/lib/systemd/system/` and named something along the lines of `docker-biomarker-api-mongo-{SER}.service`. 
Place the following content in it: - -``` -[Unit] -Description=Biomarker Backend API MongoDB Container -Requires=docker.service -After=docker.service - -[Service] -Restart=always -ExecStart=/usr/bin/docker start -a running_biomarker-api_mongo_$SER -ExecStop=/usr/bin/docker stop -t 2 running_biomarker-api_mongo_$SER - -[Install] -WantedBy=default.target -``` - -This will ensure the container is automatically restarted in case of server reboot. You can start/stop the container with the following commands: - -``` -$ sudo systemctl daemon-reload -$ sudo systemctl enable docker-biomarker-api-mongo-{SER}.service -$ sudo systemctl start docker-biomarker-api-mongo-{SER}.service -$ sudo systemctl stop docker-biomarker-api-mongo-{SER}.service -``` - -## Initialize MongoDB User - -Stay in the `/api` subdirectory and run the `init_mongodb.py` script: - -```bash -python init_mongodb.py -s $SER -``` - -Where the `$SER` argument is the specified server. This should only be run once on initial setup for each server. - -## Assign Biomarker IDs - -To assign biomarker IDs to your new data, run the `id_assign.py` script from the `/api` directory. This script can only be run from the `tst` server. More information about the under the hood implementation of the ID generation is available in the [ID Assignment System](#id-assignment-system) section. Whether or not a collision is found, the record will be assigned its corresponding `biomarker_id`. An additional key, `collision` is added to the record, with a value of `0` indicating no collision and `1` indicating a collision. This will be used during the data load process and removed. If a collision is found, an entry will be created for it in the file's corresponding collision report. - -```bash -python id_assign.py -s $SER -``` - -## Populate Database - -To load data, run the `load_data.py` script from the `/api` directory. You have to complete the ID assignment steps and handle collisions before this step can be completed. 
The data should be in the filepath `/data/shared/biomarkerdb/generated/datamodel/new_data/current` where `current/` is a symlink to the current version directory. - -```bash -python load_data.py -s $SER -``` - -Where the `$SER` argument is the specified server. - -The code will do some preliminary checks on the data that is to be loaded. It will make sure that each record has a valid formatted biomarker ID. - -## Copy Files - -After the data has been properly ID assigned, collisions have been handled, and the `tst` and `prd` databases have been loaded, run the `copy_files.py` script to copy the files into the `existing_data` directory. This is the master directory which holds all the data files that have been loaded into the backend API. This must be run from the `tst` server. - -```bash -python copy_files.py -s tst -``` - -## Creating and Starting Docker Container for the APIs - -To create the API container, run the `create_api_container.py` script from the `/api` directory. - -```bash -python create_api_container.py -s $SER -docker ps --all -``` - -The first command will run the script. The `$SER` argument should be replaced with the server you are running on (tst, prd). The last command lists all docker containers. You should see the api container that the script created, in the format of `running_biomarker-api_api_$SER` where `$SER` is the specified server. Start the docker container with the `docker start` command or create a service file as shown above (recommended). - -API documentation can be found [here](./api/biomarker/README.md). 
- -# Config File Definitions - -```json - { - "project": "project name", - "api_port": { - "prd": "production server api port", - "beta": "beta server api port", - "tst": "test server api port", - "dev": "development server api port" - }, - "data_path": "prefix filepath for the bind-mounted directory", - "dbinfo": { - "dbname": "database name", - "port": { - "prd": "production server database port", - "beta": "beta server database port", - "tst": "test server database port", - "dev": "development server database port" - }, - "bridge_network": "docker bridge network name", - "admin": { - "db": "admin database name (admin)", - "user": "admin username", - "password": "admin password" - }, - "biomarkerdb_api": { - "db": "database name", - "collection": "data collection", - "id_collection": "ID map", - "user": "biomarker database username", - "password": "biomarker database password" - } - } - } -``` - -# Internal Backend Documentation - -## ID Assignment System - -The high level workflow for the ID assignment system is as follows: - -```mermaid -flowchart TD - A[Data Release with JSON Data] --> B{id_assign.py} - B --> C[Extracts the core field elements] - C --> D[Preprocesses/cleans core field values] - D --> E[Concatenates core fields in alphabetical order] - E --> F[Resulting string is hashed] - F --> G[Check the id_map collection for potential collision] - G --> H[If collision:\nMark as collision and add to collision report] - H --> K[Investigate collisions and handle accordingly] - G --> I[If no collision:\nMap and increment new ordinal ID mapped to hash value] - I --> J[Assign new ordinal ID to record] -``` - -The core fields are defined as in the Biomarker-Partnership RFC (which can be found in [this](https://github.com/biomarker-ontology/biomarker-partnership) repository). - -When loading data into the project, the core field values are extracted, cleaned, sorted, and concatenated. 
The resulting string is hashed and that hash value is checked for a potential collision in the MongoDB `id_map_collection`. If no collision is found, a new entry is added to the `id_map_collection` which stores the hash value and maps an incremented human readable ordinal ID. The core values string that generated the hash value is also stored with each entry for potential debugging purposes. - -Example: -```json -{ - "hash_value": "", - "ordinal_id": "", - "core_values_str": "" -} -``` - -The ordinal ID format is two letters followed by four digits. The ID space goes from `AA0000` to `ZZ9999`. - -This hash system combined with a MongoDB unique field index allows for scalable and fast ID assignments. \ No newline at end of file diff --git a/docs/config_file.md b/docs/config_file.md new file mode 100644 index 0000000..18dd0c6 --- /dev/null +++ b/docs/config_file.md @@ -0,0 +1,36 @@ +# Config File Definitions + +```json + { + "project": "project name", + "api_port": { + "prd": "production server api port", + "beta": "beta server api port", + "tst": "test server api port", + "dev": "development server api port" + }, + "data_path": "prefix filepath for the bind-mounted directory", + "dbinfo": { + "dbname": "database name", + "port": { + "prd": "production server database port", + "beta": "beta server database port", + "tst": "test server database port", + "dev": "development server database port" + }, + "bridge_network": "docker bridge network name", + "admin": { + "db": "admin database name (admin)", + "user": "admin username", + "password": "admin password" + }, + "biomarkerdb_api": { + "db": "database name", + "collection": "data collection", + "id_collection": "ID map", + "user": "biomarker database username", + "password": "biomarker database password" + } + } + } +``` diff --git a/docs/id_assign_process.md b/docs/id_assign_process.md new file mode 100644 index 0000000..bb92163 --- /dev/null +++ b/docs/id_assign_process.md @@ -0,0 +1,67 @@ +# Biomarker ID 
Assignment Process + +This guide walks you through how to assign biomarker IDs to new, incoming data, how to load the processed data into the MongoDB instance, and how to prepare the data release version of the JSON data model formatted data. + +- [Assign Biomarker IDs](#assign-biomarker-ids) +- [Populate the Database](#populate-the-database) +- [Copy Files](#copy-files) + +## Assign Biomarker IDs + +To assign biomarker IDs to your new data, run the `id_assign.py` script from the `/id` directory. This script can only be run from the `tst` server. More information about the under-the-hood implementation of the ID generation is available in the [ID Implementation Documentation](/docs/id_implementation.md). + +While being processed, each data record is assigned its corresponding `biomarker_canonical_id`. Once the aggregate canonical ID is assigned, the record is assigned a second level identifier. Whether or not a collision is found, the record is also given an additional key called `collision`, with a value of `0` indicating no collision or `1` indicating a collision. If a value of `1` is assigned, additional information will be added to that source file's collision report (which is saved into the `id/collision_reports` subdirectory). The `collision` value determines which MongoDB collection the data record will be loaded into; the key is used during the data load process and removed before the record is loaded. + +```bash +cd id +python id_assign.py -s $SER +``` + +## Populate the Database + +To load the processed data, run the `load_data.py` script from the `/id` directory. You must complete the ID assignment steps before running this step. The data should be located at `/data/shared/biomarkerdb/generated/datamodel/new_data/current`, where `current/` is a symlink to the current version directory. 
+ +Optionally, you can also create a load map that indicates whether any files should be loaded entirely into the `unreviewed` MongoDB collection (bypassing/overriding any checks on the `collision` key). If included, this file should be called `load_map.json` and should be placed in the same directory as the data. The file lets you specify which files should be loaded into which collection. For example: + +```json +{ + "unreviewed": ["file_1.json", "file_2.json", ..., "file_n.json"], + "reviewed": ["file_1.json", "file_2.json", ..., "file_n.json"] +} +``` + +If you only include one of the keys, the remaining files are assumed to belong to the other collection. For example, if you are loading three files called `file_1.json`, `file_2.json`, and `file_3.json`, you can specify in the load map: + +```json +{ + "unreviewed": ["file_1.json"] +} +``` + +In this case, `file_1.json` will be loaded into the unreviewed collection and the other two files will be loaded into the reviewed collection. Alternatively, you can specify: + +```json +{ + "reviewed": ["file_2.json", "file_3.json"] +} +``` + +This has the same result as the example above. You can also explicitly include both the `unreviewed` and `reviewed` keys and list out all of the files, but that can become quite verbose in large data releases. To help prevent errors, the `load_data.py` script will prompt the user to confirm their choices before continuing to the data load. + +```bash +python load_data.py -s $SER +``` + +Where the `$SER` argument is the specified server. + +The code will do some preliminary checks on the data that is to be loaded. It will make sure that each record has a validly formatted biomarker ID. + +## Copy Files + +After the data has been assigned IDs, collisions have been handled, and the `tst` and `prd` databases have been loaded, run the `copy_files.py` script to copy the files into the `existing_data` directory. 
This is the master directory which holds all the data files that have been loaded into the backend API over the history of the project. This must be run from the `tst` server. + +```bash +python copy_files.py -s tst +``` + +After all these steps have been completed, the data has been assigned its unique IDs and is prepared for a new data release. diff --git a/docs/id_implementation.md b/docs/id_implementation.md new file mode 100644 index 0000000..c4d953a --- /dev/null +++ b/docs/id_implementation.md @@ -0,0 +1,196 @@ +# ID Backend Implementation Details + +- [Background](#background) + - [Data Model Structure](#data-model-structure) + - [ID Structure](#id-structure) + - [Canonical ID](#canonical-id-biomarkercanonicalid) + - [Second Level ID](#second-level-id-biomarkerid) +- [Implementation](#backend-implementation) + - [MongoDB Collections](#mongodb-collections) + - [Canonical ID Map Collection](#canonical-id-map-collection) + - [Second Level ID Map Collection](#second-level-id-map-collection) + +# Background + +## Data Model Structure + +```json +{ + "biomarker_canonical_id": "", + "biomarker_id": "", + "biomarker_component": [ + { + "biomarker": "", + "assessed_biomarker_entity": { + "recommended_name": "", + "synonyms": [ + { + "synonym": "" + } + ] + }, + "assessed_biomarker_entity_id": "", + "assessed_entity_type": "", + "specimen": [ + { + "name": "", + "specimen_id": "", + "name_space": "", + "url": "", + "loinc_code": "" + } + ], + "evidence_source": [ + { + "evidence_id": "", + "database": "", + "url": "", + "evidence_list": [ + { + "evidence": "" + } + ], + "tags": [ + { + "tag": "" + } + ] + } + ] + } + ], + "best_biomarker_role": [ + { + "role": "" + } + ], + "condition": { + "condition_id": "", + "recommended_name": { + "condition_id": "", + "name": "", + "description": "", + "resource": "", + "url": "" + }, + "synonyms": [ + { + "synonym_id": "", + "name": "", + "resource": "", + "url": "" + } + ] + }, + "exposure_agent": { + 
"exposure_agent_id": "", + "recommended_name": { + "exposure_agent_id": "", + "name": "", + "description": "", + "resource": "", + "url": "" + } + }, + "evidence_source": [ + { + "evidence_id": "", + "database": "", + "url": "", + "evidence_list": [ + { + "evidence": "" + } + ], + "tags": [ + { + "tag": "" + } + ] + } + ], + "citation": [ + { + "citation_title": "", + "journal": "", + "authors": "", + "date": "", + "reference": [ + { + "reference_id": "", + "type": "", + "url": "" + } + ], + "evidence_source": { + "evidence_id": "", + "database": "", + "url": "" + } + } + ] +} +``` + +## ID Structure + +### Canonical ID (biomarker_canonical_id) + +- The canonical ID is based on the `biomarker` and `assessed_biomarker_entity` fields. +- A unique pair of `biomarker` and `assessed_biomarker_entity` will be assigned a new `biomarker_canonical_id`. + +### Second Level ID (biomarker_id) + +- The second level ID is based upon the combination of the `biomarker_canonical_id` and `condition_id` fields. + +# Backend Implementation + +- When processing data for ID assignment, the `biomarker` and `assessed_biomarker_entity` fields will be normalized, sorted, and concatenated. +- The resulting string will be hashed and compared against our existing ID collection. +- If no collision is found: + - The new hash value will be added to the canonical ID map collection and the record will be assigned a new `biomarker_canonical_id`. + - The second level `biomarker_id` will be assigned with a value in the format of `{biomarker_canonical_id}-1`. + - The second level ID will be added to the second level ID map collection. +- If a collision is found: + - The record will be assigned the existing `biomarker_canonical_id` that caused the collision. 
+ - The second level ID map collection will be queried on the `biomarker_canonical_id`, and its `existing_entries` list (the condition entries already recorded under that canonical ID) will be checked for the record's condition value. + - If no collision is found: + - The current index will be incremented and the `n + 1` value will be assigned in the format of `{biomarker_canonical_id}-{n + 1}`. + - The new entry will be added to the second level ID map collection. + - If a collision is found: + - The data record will be marked as a collision so that it is loaded into the collision data collection. + +## MongoDB Collections + +### Canonical ID Map Collection + +```json +{ + "hash_value": , + "biomarker_canonical_id": , // example: AA0001 + "core_values_str": // the string that was used to create the hash value +} +``` + +- There is a unique field index on the `hash_value` key. + +### Second Level ID Map Collection + +```json +{ + "biomarker_canonical_id": "", + "values": { + "cur_index": "{n}", // example: 2 + "existing_entries": [ + { + "{condition_id}": "{biomarker_canonical_id}-1" + }, + ..., + { + "{condition_id}": "{biomarker_canonical_id}-n" + } + ] + } +} +``` +- There is a unique field index on the `biomarker_canonical_id` key. diff --git a/docs/initial_setup.md b/docs/initial_setup.md new file mode 100644 index 0000000..71405e9 --- /dev/null +++ b/docs/initial_setup.md @@ -0,0 +1,85 @@ +# Initial Server Setup + +This usage guide should be followed when setting up this repository on a server from scratch. 
+ +- [Server Dependencies](#server-dependencies) +- [Clone the Repository](#clone-the-repository) +- [Creating and Starting the Docker Container for the MongoDB](#creating-and-starting-the-docker-container-for-the-mongodb) +- [Managing the Docker Containers with a Service File](#managing-the-docker-containers-with-a-service-file) +- [Initialize MongoDB User](#initialize-mongodb-user) +- [Creating and Starting Docker Container for the API](#creating-and-starting-docker-container-for-the-api) + +## Server Dependencies + +The following must be available on your server: +- wheel +- pymongo +- docker + +## Clone the Repository + +Clone the repository onto the target machine using `git clone`. + +## Creating and Starting the Docker Container for the MongoDB + +Navigate to the `/api` subdirectory and run the `create_mongodb_container.py` script: + +```bash +cd api +python create_mongodb_container.py -s $SER +docker ps --all +``` + +The first command will navigate you into the api directory. The second command will run the script. The `$SER` argument should be replaced with the server you are running on (dev, tst, beta, prd). The last command lists all docker containers. You should see the MongoDB docker container that the script created, named in the format `running_biomarker-api_mongo_$SER` where `$SER` is the specified server (e.g. `tst` or `prd`). + +Start the MongoDB container using the `docker start {container}` command or by creating a service file (instructions for this are in the [Managing the Docker Containers with a Service File](#managing-the-docker-containers-with-a-service-file) section). + +## Initialize MongoDB User + +Stay in the `/api` subdirectory and run the `init_mongodb.py` script: + +```bash +python init_mongodb.py -s $SER +``` + +Where the `$SER` argument is the specified server. This should only be run once on initial setup for each server. 
+ +## Creating and Starting Docker Container for the API + +To create the API container, run the `create_api_container.py` script from the `/api` directory. + +```bash +python create_api_container.py -s $SER +docker ps --all +``` + +The first command will run the script. The `$SER` argument should be replaced with the server you are running on (tst, prd). The last command lists all docker containers. You should see the api container that the script created, in the format of `running_biomarker-api_api_$SER` where `$SER` is the specified server. Start the docker container with the `docker start` command or create a service file (recommended). + +## Managing the Docker Containers with a Service File + +The service files should be located at `/usr/lib/systemd/system/` and named something along the lines of `docker-biomarker-api-mongo-{SER}.service` (using the MongoDB container as an example) where `{SER}` indicates the server. Place the following content in it: + +``` +[Unit] +Description=Biomarker Backend API MongoDB Container +Requires=docker.service +After=docker.service + +[Service] +Restart=always +ExecStart=/usr/bin/docker start -a running_biomarker-api_mongo_$SER +ExecStop=/usr/bin/docker stop -t 2 running_biomarker-api_mongo_$SER + +[Install] +WantedBy=default.target +``` + +This will ensure the container is automatically restarted in case of server reboot. You can start/stop the container with the following commands: + +``` +$ sudo systemctl daemon-reload +$ sudo systemctl enable docker-biomarker-api-mongo-{SER}.service +$ sudo systemctl start docker-biomarker-api-mongo-{SER}.service +$ sudo systemctl stop docker-biomarker-api-mongo-{SER}.service +``` +
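Reviewer note: the two-level ID assignment flow described in `docs/id_implementation.md` can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code in `id_assign.py`. In-memory dictionaries stand in for the two MongoDB map collections. The SHA-256 hash, the `|` separator, the lowercase/whitespace normalization, starting the ordinal at `AA0000`, and returning the existing second level ID on a collision are all assumptions. The `IdAssigner`, `ordinal_to_id`, and `core_values_str` names and the sample values are hypothetical.

```python
import hashlib


def ordinal_to_id(n: int) -> str:
    """Map an ordinal to two letters plus four digits (AA0000 through ZZ9999)."""
    if not 0 <= n < 26 * 26 * 10_000:
        raise ValueError("ordinal outside the AA0000-ZZ9999 ID space")
    letters, digits = divmod(n, 10_000)
    first, second = divmod(letters, 26)
    return f"{chr(65 + first)}{chr(65 + second)}{digits:04d}"


def core_values_str(biomarker: str, entity: str) -> str:
    """Normalize, sort, and concatenate the core fields.

    Assumed normalization: lowercase and collapse runs of whitespace.
    """
    cleaned = sorted(" ".join(v.lower().split()) for v in (biomarker, entity))
    return "|".join(cleaned)


class IdAssigner:
    """Hypothetical in-memory stand-in for the two ID map collections."""

    def __init__(self) -> None:
        # hash_value -> biomarker_canonical_id
        self.canonical_map: dict[str, str] = {}
        # biomarker_canonical_id -> {"cur_index": n, "existing_entries": {...}}
        self.second_level: dict[str, dict] = {}

    def assign(self, biomarker: str, entity: str, condition_id: str):
        """Return (biomarker_canonical_id, biomarker_id, collision flag)."""
        core = core_values_str(biomarker, entity)
        hash_value = hashlib.sha256(core.encode("utf-8")).hexdigest()

        canonical_id = self.canonical_map.get(hash_value)
        if canonical_id is None:
            # New biomarker/entity pair: allocate the next canonical ID.
            canonical_id = ordinal_to_id(len(self.canonical_map))
            self.canonical_map[hash_value] = canonical_id
            self.second_level[canonical_id] = {
                "cur_index": 0,
                "existing_entries": {},
            }

        entry = self.second_level[canonical_id]
        if condition_id in entry["existing_entries"]:
            # Same canonical ID and same condition: mark as a collision.
            return canonical_id, entry["existing_entries"][condition_id], 1

        # New condition under this canonical ID: increment the index.
        entry["cur_index"] += 1
        biomarker_id = f"{canonical_id}-{entry['cur_index']}"
        entry["existing_entries"][condition_id] = biomarker_id
        return canonical_id, biomarker_id, 0
```

With this sketch, a first record for a biomarker/entity pair receives a new canonical ID and the `-1` suffix, a record with the same pair but a different condition receives the next suffix, and a record that repeats both is flagged with `collision` set to `1`. In the real backend, the unique field indexes on `hash_value` and `biomarker_canonical_id` would play the role of the dictionary keys here.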