# Biomarker id assignment system #4

Merged: 37 commits, Jan 19, 2024
## Commits

- `d6c7bc2` add id collection and db name fix (seankim658, Jan 5, 2024)
- `5de77d6` refactor db values (seankim658, Jan 5, 2024)
- `cdbe7cd` clean slate (seankim658, Jan 5, 2024)
- `c39b3a8` update documentation (seankim658, Jan 5, 2024)
- `10b6402` format fix (seankim658, Jan 5, 2024)
- `3e09e63` change persistent api_db path (seankim658, Jan 5, 2024)
- `013e51d` fix mongodb storage path (seankim658, Jan 5, 2024)
- `8911808` id assignment and mapping functions (seankim658, Jan 5, 2024)
- `94a9dad` handle hash id generation and ordinal id assignment (seankim658, Jan 5, 2024)
- `72b84e2` update id assignment documentation (seankim658, Jan 5, 2024)
- `cc4165e` update readme title (seankim658, Jan 9, 2024)
- `6e84825` update ports (seankim658, Jan 9, 2024)
- `2d04dbb` refactor (seankim658, Jan 18, 2024)
- `5799df5` fix init entry ordinal id extraction (seankim658, Jan 18, 2024)
- `7b96c6d` update load data documentation (seankim658, Jan 18, 2024)
- `60fa1e4` update collision message and add logic to write updated data back to … (seankim658, Jan 18, 2024)
- `08b168e` added data models and removed marshaling (seankim658, Jan 18, 2024)
- `3c6b069` added collision_reports directory (seankim658, Jan 19, 2024)
- `f2be913` add biomarker id endpoint (seankim658, Jan 19, 2024)
- `052db8f` add collision report output file handling (seankim658, Jan 19, 2024)
- `e925fb4` add biomarker_id api.param (seankim658, Jan 19, 2024)
- `9b3136d` add better error handling (seankim658, Jan 19, 2024)
- `e2a26a7` change log path (seankim658, Jan 19, 2024)
- `04037f8` testing (seankim658, Jan 19, 2024)
- `6f9a30a` add document deepcopy for mongodb insertion and added timestamp for c… (seankim658, Jan 19, 2024)
- `a835956` update package name to match dockerfile (seankim658, Jan 19, 2024)
- `21fd6e0` add alternative commands for server usage (seankim658, Jan 19, 2024)
- `af349ac` fixed relative imports (seankim658, Jan 19, 2024)
- `22ff434` fixed data evidence list and tag data models (seankim658, Jan 19, 2024)
- `4d0f9ab` manage mongodb connection with environment variable (seankim658, Jan 19, 2024)
- `60f7c80` add db collection environment variable (seankim658, Jan 19, 2024)
- `282e5e7` fixed gunicorn entrypoint (seankim658, Jan 19, 2024)
- `ba2f777` using mongoclient directly instead of flask-pymongo (seankim658, Jan 19, 2024)
- `4e72ff2` ignore internal mongo objectid (seankim658, Jan 19, 2024)
- `bb3202d` update endpoint documentation (seankim658, Jan 19, 2024)
- `679002f` update id namespace (seankim658, Jan 19, 2024)
- `7388319` update documentation (seankim658, Jan 19, 2024)
4 changes: 3 additions & 1 deletion .gitignore
Original file line number Diff line number Diff line change
@@ -9,4 +9,6 @@ dist/
build/
*egg-info/
test.py
misc_documentation.md
misc_documentation.md
*.log
collision_reports/
77 changes: 54 additions & 23 deletions README.md
@@ -1,4 +1,4 @@
# Biomarkerkb Backend Dataset Viewer
# Biomarker Backend API

Work in progress.

@@ -10,6 +10,8 @@ Work in progress.
- [Populate Database](#populate-database)
- [Creating and Starting Docker Container for the APIs](#creating-and-starting-docker-container-for-the-apis)
- [Config File Definitions](#config-file-definitions)
- [Internal Backend Documentation](#internal-backend-documentation)
- [ID Assignment System](#id-assignment-system)

API documentation can be found [here](./api/biomarker/README.md).

@@ -44,13 +46,13 @@ python create_mongodb_container.py -s $SER
docker ps --all
```

The first command will navigate you into the api directory. The second command will run the script. The `$SER` argument should be replaced with the server you are running on (dev, tst, beta, prd). The last command lists all docker containers. You should see the docker mongodb docker container that the script created, in the format of `running_biomarkerkb_mongo_$SER` where `$SER` is the specified server.
The first command will navigate you into the api directory. The second command will run the script. The `$SER` argument should be replaced with the server you are running on (dev, tst, beta, prd). The last command lists all docker containers. You should see the MongoDB docker container that the script created, in the format of `running_biomarker-api_mongo_$SER` where `$SER` is the specified server.

Expected output should look something like this:

```bash
Found container: running_biomarkerkb_mongo_dev
Found network: biomarkerkb_backend_network_dev
Found container: running_biomarker-api_mongo_dev
Found network: biomarker-api_backend_network_dev
e6c50502da1b

5e1146780c4fa96a6af6e4555cd119368e9907c4d50ad4790f9f5e54e13bf043
@@ -59,6 +61,8 @@ e6c50502da1b

The first two print statements indicate that an old instance of the container and docker network were found. These will be removed by the script. The `e6c50502da1b` is the ID of the removed container. This indicates that the `docker rm -f ...` command executed successfully and removed the existing container. The second to last line is the ID of the newly created docker network. The last line is the ID of the newly created docker container.

Start the MongoDB container using the `docker start {container}` command.

## Initialize MongoDB User

Stay in the `/api` subdirectory and run the `init_mongodb.py` script:
@@ -74,17 +78,19 @@ Where the `$SER` argument is the specified server. This should only be run once.
To load data, run the `load_data.py` script from the `/api` directory.

```bash
python load_data.py -s $SER -f $FP
python load_data.py -s $SER -v $VER
```

Where the `$SER` argument is the specified server and `$FP` is the filepath to the seed csv data.
Where the `$SER` argument is the specified server and `$VER` is the filepath to the data release to load.

If testing on a local machine, you can test using code or a GUI option such as MongoDB Compass. The connection string should look something along the lines of:

```bash
mongodb://biomarkeradmin:biomarkerpass@localhost:27017/?authMechanism=SCRAM-SHA-1&authSource=biomarkerkbdb
mongodb://biomarkeradmin:biomarkerpass@localhost:27017/?authMechanism=SCRAM-SHA-1&authSource=biomarkerdb_api
```
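As a minimal, hypothetical sketch, the connection string above can be assembled in Python and handed to a driver such as `pymongo` (the credentials and host here are the local-testing values from this README, not production values):

```python
from urllib.parse import quote_plus

def build_conn_string(user: str, password: str, host: str, auth_db: str) -> str:
    # quote_plus escapes any special characters in the credentials
    return (
        f"mongodb://{quote_plus(user)}:{quote_plus(password)}@{host}"
        f"/?authMechanism=SCRAM-SHA-1&authSource={auth_db}"
    )

uri = build_conn_string(
    "biomarkeradmin", "biomarkerpass", "localhost:27017", "biomarkerdb_api"
)

# To actually connect (requires the running MongoDB container):
#   from pymongo import MongoClient
#   client = MongoClient(uri)
#   db = client["biomarkerdb_api"]
```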

The `load_data.py` script handles the biomarker ID assignment. More information about the under-the-hood implementation is available in the [ID Assignment System](#id-assignment-system) section. If any collisions are detected during the ID assignment process, an output message is printed indicating the file, document, core values string, and resulting hash value that caused the collision; in this case, the record is NOT added to the MongoDB instance. If no collision is found, the record is added to the biomarker collection with the new ordinal ID assigned, its `biomarker_id` value is replaced, and the updated JSON is written back out to overwrite the input file.

## Creating and Starting Docker Container for the APIs

To create the API container, run the `create_api_container.py` script from the `/api` directory.
@@ -96,13 +102,7 @@ docker ps --all

The first command will run the script. The `$SER` argument should be replaced with the server you are running on (dev, tst, beta, prd). The last command lists all docker containers. You should see the api container that the script created, in the format of `running_biomarkerkb_api_$SER` where `$SER` is the specified server.

After the container is up and running, you can manually test the API using Python's `request` library, curl, or in the web browser. An example API call:

```bash
http://localhost:8081/dataset/randomsample?sample=5
```

API documentation can be found [here](https://github.com/biomarker-ontology/biomarkerkb-backend-datasetviewer/tree/main/api/biomarkerkb#endpoints).
API documentation can be found [here](./api/biomarker/README.md).

# Config File Definitions

@@ -115,11 +115,6 @@ API documentation can be found [here](https://github.com/biomarker-ontology/biom
"tst": "test server api port",
"dev": "development server api port"
},
"mail":{
"server": "not used for now",
"port": "not used for now",
"sender": "not used for now"
},
"data_path": "prefix filepath for the bind-mounted directory",
"dbinfo": {
"dbname": "database name",
@@ -135,11 +130,47 @@ API documentation can be found [here](https://github.com/biomarker-ontology/biom
"user": "admin username",
"password": "admin password"
},
"biomarkerkb": {
"db": "biomarkerkbdb database",
"user": "biomarkerkb database username",
"password": "biomarkerkb database password"
"biomarkerdb_api": {
"db": "database name",
"collection": "data collection",
"id_collection": "ID map",
"user": "biomarker database username",
"password": "biomarker database password"
}
}
}
```

# Internal Backend Documentation

## ID Assignment System

The high level workflow for the ID assignment system is as follows:

```mermaid
flowchart TD
A[Data Release with JSON Data] --> B{load_data.py}
B --> C[Extracts the core field elements]
C --> D[Preprocesses core field values]
D --> E[Concatenates core fields in alphabetical order]
E --> F[Resulting string is hashed]
F --> G[Check the id_map collection for potential collision]
G --> H[If collision:\nDon't load and add to output message]
G --> I[If no collision:\nAssign new ordinal ID in id_map collection]
I --> J[Assign new ordinal ID to record and load into MongoDB]
```
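The hashing steps in the flowchart (extract, preprocess, concatenate, hash) can be sketched in Python. Note the details below are illustrative assumptions: the source does not specify the preprocessing rules or hash algorithm, so SHA-256 and lowercase/strip cleaning are used here only as a stand-in, and the field names are hypothetical:

```python
import hashlib

def core_values_hash(core_fields: dict) -> str:
    # Preprocess: strip and lowercase each core field value (assumed cleaning)
    cleaned = [str(v).strip().lower() for v in core_fields.values()]
    # Concatenate the cleaned values in alphabetical order, then hash the string
    core_str = "_".join(sorted(cleaned))
    return hashlib.sha256(core_str.encode("utf-8")).hexdigest()

# Hypothetical record with made-up core field names
record = {
    "biomarker": "increased IL6 level",
    "assessed_entity_type": "protein",
    "condition": "prostate cancer",
}
h = core_values_hash(record)
```

Because the values are sorted before concatenation, the same core fields always produce the same hash regardless of the order they appear in the input document.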

The core fields are defined in the Biomarker-Partnership RFC (which can be found in [this](https://github.com/biomarker-ontology/biomarker-partnership) repository).

When loading data into the project, the core field values are extracted, cleaned, and concatenated. The resulting string is hashed and the hash value is checked for a potential collision in the MongoDB `id_map_collection`. If no collision is found, a new entry is added to the `id_map_collection` storing the hash value and a human-readable ordinal ID. The core values string that generated the hash value is also stored with each entry for debugging purposes.

Example:
```json
{
"hash_value": "<VALUE>",
"ordinal_id": "<VALUE>",
"core_values_str": "<VALUE>"
}
```

The ordinal ID format is two letters followed by four digits. The ID space goes from `AA0000` to `ZZ9999`.
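A sketch of how the next ID in that space could be computed (`next_ordinal_id` is a hypothetical helper, not necessarily how the PR's `id.py` implements the increment):

```python
import string

def next_ordinal_id(current: str) -> str:
    """Return the ID following `current` in the AA0000..ZZ9999 space."""
    letters, digits = current[:2], int(current[2:])
    if digits < 9999:
        # Only the four-digit suffix advances
        return f"{letters}{digits + 1:04d}"
    # Digit suffix rolls over; advance the two-letter prefix in base 26
    idx = (string.ascii_uppercase.index(letters[0]) * 26
           + string.ascii_uppercase.index(letters[1]))
    if idx == 26 * 26 - 1:
        raise ValueError("ID space exhausted (ZZ9999)")
    idx += 1
    return f"{string.ascii_uppercase[idx // 26]}{string.ascii_uppercase[idx % 26]}0000"
```

This gives 26 × 26 × 10,000 = 6,760,000 assignable IDs in total.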
8 changes: 4 additions & 4 deletions api/Dockerfile
@@ -2,16 +2,16 @@ FROM python:3.10.4

WORKDIR /app

ENV FLASK_APP=biomarkerkb
ENV FLASK_APP=biomarker
ENV FLASK_ENV=production

COPY ./requirements.txt .
RUN pip install -r requirements.txt

# copy wheel distribution and install it
COPY ./dist/biomarkerkb-1.0-py3-none-any.whl .
RUN pip install biomarkerkb-1.0-py3-none-any.whl
COPY ./dist/biomarker-1.0-py3-none-any.whl .
RUN pip install biomarker-1.0-py3-none-any.whl

COPY . .

ENTRYPOINT ["gunicorn", "-b", ":80", "biomarkerkb:create_app()"]
ENTRYPOINT FLASK_APP=biomarker gunicorn -b :80 'biomarker:create_app()' --timeout 120 --graceful-timeout 60
5 changes: 3 additions & 2 deletions api/README.md
@@ -4,12 +4,13 @@

| Directory/File | |
|-------------------------------|-------------------------------------------------------------------|
| `biomarkerkb/` | The biomarkerkb data api. |
| `biomarker/` | The biomarker api. |
| `config.json` | Config file for the api setup. |
| `create_api_container.py` | Creates the api container. |
| `create_mongodb_container.py` | Creates the initial MongoDB container. |
| `Dockerfile` | Dockerfile for the api image (used in `create_api_container.py`) |
| `id.py` | Defines the logic for the ID assignment system. |
| `init_mongodb.py` | Creates the database user scoped to the biomarkerkbdb. |
| `load_data.py` | Loads the MongoDB collection (`biomarker_collection`) with the seed data (from a csv file). |
| `requirements.txt` | Requirements file for the api image. |
| `setup.py` | Setup script for packaging the biomarkerkb project. |
| `setup.py` | Setup script for packaging the biomarker project. |
58 changes: 34 additions & 24 deletions api/biomarker/README.md
@@ -1,41 +1,51 @@
# API

- [Documentation](#documentation)
- [Endpoints](#endpoints)
- [Models](#models)
All endpoints are hosted at the root URL https://hivelab.biochemistry.gwu.edu/biomarker/api/.

- [Endpoints](#endpoints)
- [Dataset Endpoints](#dataset-endpoints)
- [ID Endpoints](#id-endpoints)
- [Models](#models)
- [Directory Structure](#directory-structure)

## Documentation
## Endpoints

### Endpoints
### Dataset Endpoints

`GET /dataset/getall`
Returns the entire dataset.
- Example call: `http://{HOST}:8081/dataset/getall`
- Parameters:
- x-fields (optional): optional fields mask
- Return schema: [`data_model`](#models)
`GET /dataset/getall?page={page}&per_page={per_page}`
- Parameters:
- `page`: The page number to return (default = 1).
- `per_page`: The number of records to return per page (default = 50).
- Returns:
- `200 Success`: The biomarker records.


`GET /dataset/randomsample?sample={sample}`
- Parameters:
- `sample`: The number of samples to return (default = 1).
- Returns:
- `200 Success`: The random subset of biomarker records.
- `400 Bad Request`: Error indicating an invalid sample size was provided (sample must be positive integer).

---
### ID Endpoints

`GET /dataset/randomsample`
Returns a random subset of the dataset.
- Example call: `http://{HOST}:8081/dataset/randomsample?sample={NUMBER}`
`GET /id/getbiomarker?biomarker_id={biomarker_id}`
- Parameters:
- sample (optional, default = 1): number of samples to return
- x-fields (optional): optional fields mask
- Return schema: [`data_model`](#models)
- `biomarker_id`: The biomarker ID to query for.
- Returns:
- `200 Success`: A single biomarker record corresponding to the `biomarker_id` param.
- `400 No biomarker ID provided`: Error indicating param was not included.
- `404 Not Found`: Error indicating biomarker ID was not found.
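A call against the ID endpoint can be sketched as follows; the `localhost:8081` base URL and the `AA0001` ID are assumptions for local testing, not guaranteed values:

```python
from urllib.parse import urlencode, urljoin

# Hypothetical local base URL; the hosted root is
# https://hivelab.biochemistry.gwu.edu/biomarker/api/
BASE = "http://localhost:8081/"

def biomarker_url(biomarker_id: str) -> str:
    # Build the query string for GET /id/getbiomarker
    return urljoin(BASE, "id/getbiomarker") + "?" + urlencode(
        {"biomarker_id": biomarker_id}
    )

url = biomarker_url("AA0001")

# To fetch the record (requires the running API container):
#   import urllib.request
#   with urllib.request.urlopen(url) as resp:
#       record = resp.read()
```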

### Models
## Models

`data_model`:
| Field | Type | Description |
|-----------------------|-----------|-----------------------------------|
The data models can be seen [here](data_models.py).

## Directory Structure

| Directory/File | |
|-------------------------------|-------------------------------------------------------------------|
| `__init__.py` | Entry point for the api module. |
| `dataset.py` | The local dataset module, which defines the dataset API. |
| `config/` | Config files for flask instance. |
| `dataset.py` | The general dataset API endpoints. |
| `id.py` | The biomarker ID specific API endpoints. |
| `data_models.py` | Defines the data models for the API documentation. |
24 changes: 14 additions & 10 deletions api/biomarker/__init__.py
@@ -2,26 +2,30 @@
from flask_cors import CORS
from flask_restx import Api
from .dataset import api as dataset_api
from flask_pymongo import PyMongo
from .id import api as id_api
from pymongo import MongoClient
import os

MONGO_URI = os.getenv('MONGODB_CONNSTRING')
DB_NAME = 'biomarkerdb_api'
DB_COLLECTION = 'biomarker_collection'

def create_app():

# create flask instance
app = Flask(__name__)

app.config['ENV'] = 'development'

if app.config['ENV'] == 'production':
app.config.from_pyfile('./config/config.py')
else:
app.config.from_pyfile('./config/config_dev.py')

CORS(app)
mongo = PyMongo(app)
app.mongo = mongo

# initialize mongo client
mongo_client = MongoClient(MONGO_URI)
mongo_db = mongo_client[DB_NAME]
app.mongo_db = mongo_db
app.config['DB_COLLECTION'] = DB_COLLECTION

# setup the api using the flask_restx library
api = Api(app, version = '1.0', title = 'Biomarker APIs', description = 'Biomarker Knowledgebase API')
api.add_namespace(dataset_api)
api.add_namespace(id_api)

return app
10 changes: 0 additions & 10 deletions api/biomarker/config/config.py

This file was deleted.

10 changes: 0 additions & 10 deletions api/biomarker/config/config_dev.py

This file was deleted.
