# Biomarker id assignment system #4

Merged: 37 commits, Jan 19, 2024
## Commits

- `d6c7bc2` add id collection and db name fix (seankim658, Jan 5, 2024)
- `5de77d6` refactor db values (seankim658, Jan 5, 2024)
- `cdbe7cd` clean slate (seankim658, Jan 5, 2024)
- `c39b3a8` update documentation (seankim658, Jan 5, 2024)
- `10b6402` format fix (seankim658, Jan 5, 2024)
- `3e09e63` change persistent api_db path (seankim658, Jan 5, 2024)
- `013e51d` fix mongodb storage path (seankim658, Jan 5, 2024)
- `8911808` id assignment and mapping functions (seankim658, Jan 5, 2024)
- `94a9dad` handle hash id generation and ordinal id assignment (seankim658, Jan 5, 2024)
- `72b84e2` update id assignment documentation (seankim658, Jan 5, 2024)
- `cc4165e` update readme title (seankim658, Jan 9, 2024)
- `6e84825` update ports (seankim658, Jan 9, 2024)
- `2d04dbb` refactor (seankim658, Jan 18, 2024)
- `5799df5` fix init entry ordinal id extraction (seankim658, Jan 18, 2024)
- `7b96c6d` update load data documentation (seankim658, Jan 18, 2024)
- `60fa1e4` update collision message and add logic to write updated data back to … (seankim658, Jan 18, 2024)
- `08b168e` added data models and removed marshaling (seankim658, Jan 18, 2024)
- `3c6b069` added collision_reports directory (seankim658, Jan 19, 2024)
- `f2be913` add biomarker id endpoint (seankim658, Jan 19, 2024)
- `052db8f` add collision report output file handling (seankim658, Jan 19, 2024)
- `e925fb4` add biomarker_id api.param (seankim658, Jan 19, 2024)
- `9b3136d` add better error handling (seankim658, Jan 19, 2024)
- `e2a26a7` change log path (seankim658, Jan 19, 2024)
- `04037f8` testing (seankim658, Jan 19, 2024)
- `6f9a30a` add document deepcopy for mongodb insertion and added timestamp for c… (seankim658, Jan 19, 2024)
- `a835956` update package name to match dockerfile (seankim658, Jan 19, 2024)
- `21fd6e0` add alternative commands for server usage (seankim658, Jan 19, 2024)
- `af349ac` fixed relative imports (seankim658, Jan 19, 2024)
- `22ff434` fixed data evidence list and tag data models (seankim658, Jan 19, 2024)
- `4d0f9ab` manage mongodb connection with environment variable (seankim658, Jan 19, 2024)
- `60f7c80` add db collection environment variable (seankim658, Jan 19, 2024)
- `282e5e7` fixed gunicorn entrypoint (seankim658, Jan 19, 2024)
- `ba2f777` using mongoclient directly instead of flask-pymongo (seankim658, Jan 19, 2024)
- `4e72ff2` ignore internal mongo objectid (seankim658, Jan 19, 2024)
- `bb3202d` update endpoint documentation (seankim658, Jan 19, 2024)
- `679002f` update id namespace (seankim658, Jan 19, 2024)
- `7388319` update documentation (seankim658, Jan 19, 2024)
4 changes: 3 additions & 1 deletion .gitignore
Original file line number Diff line number Diff line change
@@ -9,4 +9,6 @@ dist/
build/
*egg-info/
test.py
misc_documentation.md
misc_documentation.md
*.log
collision_reports/
77 changes: 54 additions & 23 deletions README.md
@@ -1,4 +1,4 @@
# Biomarkerkb Backend Dataset Viewer
# Biomarker Backend API

Work in progress.

@@ -10,6 +10,8 @@ Work in progress.
- [Populate Database](#populate-database)
- [Creating and Starting Docker Container for the APIs](#creating-and-starting-docker-container-for-the-apis)
- [Config File Definitions](#config-file-definitions)
- [Internal Backend Documentation](#internal-backend-documentation)
- [ID Assignment System](#id-assignment-system)

API documentation can be found [here](./api/biomarker/README.md).

@@ -44,13 +46,13 @@ python create_mongodb_container.py -s $SER
docker ps --all
```

The first command will navigate you into the api directory. The second command will run the script. The `$SER` argument should be replaced with the server you are running on (dev, tst, beta, prd). The last command lists all docker containers. You should see the docker mongodb docker container that the script created, in the format of `running_biomarkerkb_mongo_$SER` where `$SER` is the specified server.
The first command will navigate you into the api directory. The second command will run the script. The `$SER` argument should be replaced with the server you are running on (dev, tst, beta, prd). The last command lists all docker containers. You should see the MongoDB docker container that the script created, in the format of `running_biomarker-api_mongo_$SER` where `$SER` is the specified server.

Expected output should look something like this:

```bash
Found container: running_biomarkerkb_mongo_dev
Found network: biomarkerkb_backend_network_dev
Found container: running_biomarker-api_mongo_dev
Found network: biomarker-api_backend_network_dev
e6c50502da1b

5e1146780c4fa96a6af6e4555cd119368e9907c4d50ad4790f9f5e54e13bf043
@@ -59,6 +61,8 @@ e6c50502da1b

The first two print statements indicate that an old instance of the container and docker network were found. These will be removed by the script. The `e6c50502da1b` is the ID of the removed container. This indicates that the `docker rm -f ...` command executed successfully and removed the existing container. The second to last line is the ID of the newly created docker network. The last line is the ID of the newly created docker container.

Start the MongoDB container using the `docker start {container}` command.

## Initialize MongoDB User

Stay in the `/api` subdirectory and run the `init_mongodb.py` script:
@@ -74,17 +78,19 @@ Where the `$SER` argument is the specified server. This should only be run once.
To load data, run the `load_data.py` script from the `/api` directory.

```bash
python load_data.py -s $SER -f $FP
python load_data.py -s $SER -v $VER
```

Where the `$SER` argument is the specified server and `$FP` is the filepath to the seed csv data.
Where the `$SER` argument is the specified server and `$VER` is the filepath to the data release to load.

If testing on a local machine, you can test using code or a GUI option such as MongoDB Compass. The connection string should look something along the lines of:

```bash
mongodb://biomarkeradmin:biomarkerpass@localhost:27017/?authMechanism=SCRAM-SHA-1&authSource=biomarkerkbdb
mongodb://biomarkeradmin:biomarkerpass@localhost:27017/?authMechanism=SCRAM-SHA-1&authSource=biomarkerdb_api
```
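As a minimal, hypothetical sketch, the connection string above can be assembled in Python and handed to a driver such as `pymongo` (the credentials and host here are the local-testing values from this README, not production values):

```python
from urllib.parse import quote_plus

def build_conn_string(user: str, password: str, host: str, auth_db: str) -> str:
    # quote_plus escapes any special characters in the credentials
    return (
        f"mongodb://{quote_plus(user)}:{quote_plus(password)}@{host}"
        f"/?authMechanism=SCRAM-SHA-1&authSource={auth_db}"
    )

uri = build_conn_string(
    "biomarkeradmin", "biomarkerpass", "localhost:27017", "biomarkerdb_api"
)

# To actually connect (requires the running MongoDB container):
#   from pymongo import MongoClient
#   client = MongoClient(uri)
#   db = client["biomarkerdb_api"]
```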

The `load_data.py` script handles the biomarker ID assignment. More information about the under-the-hood implementation is available in the [ID Assignment System](#id-assignment-system) section. If any collisions are detected during the ID assignment process, an output message is printed indicating the file, document, core values string, and resulting hash value that caused the collision; in this case, the record is NOT added to the MongoDB instance. If no collision is found, the record is added to the biomarker collection with the new ordinal ID assigned, its `biomarker_id` value is replaced, and the updated JSON is written back out to overwrite the input file.

## Creating and Starting Docker Container for the APIs

To create the API container, run the `create_api_container.py` script from the `/api` directory.
@@ -96,13 +102,7 @@ docker ps --all

The first command will run the script. The `$SER` argument should be replaced with the server you are running on (dev, tst, beta, prd). The last command lists all docker containers. You should see the api container that the script created, in the format of `running_biomarkerkb_api_$SER` where `$SER` is the specified server.

After the container is up and running, you can manually test the API using Python's `request` library, curl, or in the web browser. An example API call:

```bash
http://localhost:8081/dataset/randomsample?sample=5
```

API documentation can be found [here](https://github.com/biomarker-ontology/biomarkerkb-backend-datasetviewer/tree/main/api/biomarkerkb#endpoints).
API documentation can be found [here](./api/biomarker/README.md).

# Config File Definitions

@@ -115,11 +115,6 @@ API documentation can be found [here](https://github.com/biomarker-ontology/biom
"tst": "test server api port",
"dev": "development server api port"
},
"mail":{
"server": "not used for now",
"port": "not used for now",
"sender": "not used for now"
},
"data_path": "prefix filepath for the bind-mounted directory",
"dbinfo": {
"dbname": "database name",
@@ -135,11 +130,47 @@ API documentation can be found [here](https://github.com/biomarker-ontology/biom
"user": "admin username",
"password": "admin password"
},
"biomarkerkb": {
"db": "biomarkerkbdb database",
"user": "biomarkerkb database username",
"password": "biomarkerkb database password"
"biomarkerdb_api": {
"db": "database name",
"collection": "data collection",
"id_collection": "ID map",
"user": "biomarker database username",
"password": "biomarker database password"
}
}
}
```

# Internal Backend Documentation

## ID Assignment System

The high level workflow for the ID assignment system is as follows:

```mermaid
flowchart TD
A[Data Release with JSON Data] --> B{load_data.py}
B --> C[Extracts the core field elements]
C --> D[Preprocesses core field values]
D --> E[Concatenates core fields in alphabetical order]
E --> F[Resulting string is hashed]
F --> G[Check the id_map collection for potential collision]
G --> H[If collision:\nDon't load and add to output message]
G --> I[If no collision:\nAssign new ordinal ID in id_map collection]
I --> J[Assign new ordinal ID to record and load into MongoDB]
```
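The hashing steps in the flowchart (extract, preprocess, concatenate, hash) can be sketched in Python. Note the details below are illustrative assumptions: the source does not specify the preprocessing rules or hash algorithm, so SHA-256 and lowercase/strip cleaning are used here only as a stand-in, and the field names are hypothetical:

```python
import hashlib

def core_values_hash(core_fields: dict) -> str:
    # Preprocess: strip and lowercase each core field value (assumed cleaning)
    cleaned = [str(v).strip().lower() for v in core_fields.values()]
    # Concatenate the cleaned values in alphabetical order, then hash the string
    core_str = "_".join(sorted(cleaned))
    return hashlib.sha256(core_str.encode("utf-8")).hexdigest()

# Hypothetical record with made-up core field names
record = {
    "biomarker": "increased IL6 level",
    "assessed_entity_type": "protein",
    "condition": "prostate cancer",
}
h = core_values_hash(record)
```

Because the values are sorted before concatenation, the same core fields always produce the same hash regardless of the order they appear in the input document.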

The core fields are defined in the Biomarker-Partnership RFC (which can be found in [this](https://github.com/biomarker-ontology/biomarker-partnership) repository).

When loading data into the project, the core field values are extracted, cleaned, and concatenated. The resulting string is hashed and the hash value is checked for a potential collision in the MongoDB `id_map_collection`. If no collision is found, a new entry is added to the `id_map_collection` storing the hash value and a human-readable ordinal ID. The core values string that generated the hash value is also stored with each entry for debugging purposes.

Example:
```json
{
"hash_value": "<VALUE>",
"ordinal_id": "<VALUE>",
"core_values_str": "<VALUE>"
}
```

The ordinal ID format is two letters followed by four digits. The ID space goes from `AA0000` to `ZZ9999`.
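A sketch of how the next ID in that space could be computed (`next_ordinal_id` is a hypothetical helper, not necessarily how the PR's `id.py` implements the increment):

```python
import string

def next_ordinal_id(current: str) -> str:
    """Return the ID following `current` in the AA0000..ZZ9999 space."""
    letters, digits = current[:2], int(current[2:])
    if digits < 9999:
        # Only the four-digit suffix advances
        return f"{letters}{digits + 1:04d}"
    # Digit suffix rolls over; advance the two-letter prefix in base 26
    idx = (string.ascii_uppercase.index(letters[0]) * 26
           + string.ascii_uppercase.index(letters[1]))
    if idx == 26 * 26 - 1:
        raise ValueError("ID space exhausted (ZZ9999)")
    idx += 1
    return f"{string.ascii_uppercase[idx // 26]}{string.ascii_uppercase[idx % 26]}0000"
```

This gives 26 × 26 × 10,000 = 6,760,000 assignable IDs in total.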
8 changes: 4 additions & 4 deletions api/Dockerfile
@@ -2,16 +2,16 @@ FROM python:3.10.4

WORKDIR /app

ENV FLASK_APP=biomarkerkb
ENV FLASK_APP=biomarker
ENV FLASK_ENV=production

COPY ./requirements.txt .
RUN pip install -r requirements.txt

# copy wheel distribution and install it
COPY ./dist/biomarkerkb-1.0-py3-none-any.whl .
RUN pip install biomarkerkb-1.0-py3-none-any.whl
COPY ./dist/biomarker-1.0-py3-none-any.whl .
RUN pip install biomarker-1.0-py3-none-any.whl

COPY . .

ENTRYPOINT ["gunicorn", "-b", ":80", "biomarkerkb:create_app()"]
ENTRYPOINT FLASK_APP=biomarker gunicorn -b :80 'biomarker:create_app()' --timeout 120 --graceful-timeout 60
5 changes: 3 additions & 2 deletions api/README.md
@@ -4,12 +4,13 @@

| Directory/File | |
|-------------------------------|-------------------------------------------------------------------|
| `biomarkerkb/` | The biomarkerkb data api. |
| `biomarker/` | The biomarker api. |
| `config.json` | Config file for the api setup. |
| `create_api_container.py` | Creates the api container. |
| `create_mongodb_container.py` | Creates the initial MongoDB container. |
| `Dockerfile` | Dockerfile for the api image (used in `create_api_container.py`) |
| `id.py` | Defines the logic for the ID assignment system. |
| `init_mongodb.py` | Creates the database user scoped to the biomarkerkbdb. |
| `load_data.py` | Loads the MongoDB collection (`biomarker_collection`) with the seed data (from a csv file). |
| `requirements.txt` | Requirements file for the api image. |
| `setup.py` | Setup script for packaging the biomarkerkb project. |
| `setup.py` | Setup script for packaging the biomarker project. |
58 changes: 34 additions & 24 deletions api/biomarker/README.md
@@ -1,41 +1,51 @@
# API

- [Documentation](#documentation)
- [Endpoints](#endpoints)
- [Models](#models)
All endpoints are hosted at the root URL https://hivelab.biochemistry.gwu.edu/biomarker/api/.

- [Endpoints](#endpoints)
- [Dataset Endpoints](#dataset-endpoints)
- [ID Endpoints](#id-endpoints)
- [Models](#models)
- [Directory Structure](#directory-structure)

## Documentation
## Endpoints

### Endpoints
### Dataset Endpoints

`GET /dataset/getall`
Returns the entire dataset.
- Example call: `http://{HOST}:8081/dataset/getall`
- Parameters:
- x-fields (optional): optional fields mask
- Return schema: [`data_model`](#models)
`GET /dataset/getall?page={page}&per_page={per_page}`
- Parameters:
- `page`: The page number to return (default = 1).
- `per_page`: The number of records to return per page (default = 50).
- Returns:
- `200 Success`: The biomarker records.


`GET /dataset/randomsample?sample={sample}`
- Parameters:
- `sample`: The number of samples to return (default = 1).
- Returns:
- `200 Success`: The random subset of biomarker records.
- `400 Bad Request`: Error indicating an invalid sample size was provided (sample must be positive integer).

---
### ID Endpoints

`GET /dataset/randomsample`
Returns a random subset of the dataset.
- Example call: `http://{HOST}:8081/dataset/randomsample?sample={NUMBER}`
`GET /id/getbiomarker?biomarker_id={biomarker_id}`
- Parameters:
- sample (optional, default = 1): number of samples to return
- x-fields (optional): optional fields mask
- Return schema: [`data_model`](#models)
- `biomarker_id`: The biomarker ID to query for.
- Returns:
- `200 Success`: A single biomarker record corresponding to the `biomarker_id` param.
- `400 No biomarker ID provided`: Error indicating param was not included.
- `404 Not Found`: Error indicating biomarker ID was not found.
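A call against the ID endpoint can be sketched as follows; the `localhost:8081` base URL and the `AA0001` ID are assumptions for local testing, not guaranteed values:

```python
from urllib.parse import urlencode, urljoin

# Hypothetical local base URL; the hosted root is
# https://hivelab.biochemistry.gwu.edu/biomarker/api/
BASE = "http://localhost:8081/"

def biomarker_url(biomarker_id: str) -> str:
    # Build the query string for GET /id/getbiomarker
    return urljoin(BASE, "id/getbiomarker") + "?" + urlencode(
        {"biomarker_id": biomarker_id}
    )

url = biomarker_url("AA0001")

# To fetch the record (requires the running API container):
#   import urllib.request
#   with urllib.request.urlopen(url) as resp:
#       record = resp.read()
```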

### Models
## Models

`data_model`:
| Field | Type | Description |
|-----------------------|-----------|-----------------------------------|
The data models can be seen [here](data_models.py).

## Directory Structure

| Directory/File | |
|-------------------------------|-------------------------------------------------------------------|
| `__init__.py` | Entry point for the api module. |
| `dataset.py` | The local dataset module, which defines the dataset API. |
| `config/` | Config files for flask instance. |
| `dataset.py` | The general dataset API endpoints. |
| `id.py` | The biomarker ID specific API endpoints. |
| `data_models.py` | Defines the data models for the API documentation. |
24 changes: 14 additions & 10 deletions api/biomarker/__init__.py
@@ -2,26 +2,30 @@
from flask_cors import CORS
from flask_restx import Api
from .dataset import api as dataset_api
from flask_pymongo import PyMongo
from .id import api as id_api
from pymongo import MongoClient
import os

MONGO_URI = os.getenv('MONGODB_CONNSTRING')
DB_NAME = 'biomarkerdb_api'
DB_COLLECTION = 'biomarker_collection'

def create_app():

# create flask instance
app = Flask(__name__)

app.config['ENV'] = 'development'

if app.config['ENV'] == 'production':
app.config.from_pyfile('./config/config.py')
else:
app.config.from_pyfile('./config/config_dev.py')

CORS(app)
mongo = PyMongo(app)
app.mongo = mongo

# initialize mongo client
mongo_client = MongoClient(MONGO_URI)
mongo_db = mongo_client[DB_NAME]
app.mongo_db = mongo_db
app.config['DB_COLLECTION'] = DB_COLLECTION

# setup the api using the flask_restx library
api = Api(app, version = '1.0', title = 'Biomarker APIs', description = 'Biomarker Knowledgebase API')
api.add_namespace(dataset_api)
api.add_namespace(id_api)

return app
10 changes: 0 additions & 10 deletions api/biomarker/config/config.py

This file was deleted.

10 changes: 0 additions & 10 deletions api/biomarker/config/config_dev.py

This file was deleted.
