The main goal of HDM is to help assess data quality by running ad-hoc programs that regularly "scan" databases to compute metrics and detect divergence in either the structure or the content of those databases. It then generates alerts that give Data Engineers insight into what broke.
To do this we have developed the following features:
- Calculate metrics on the data from our warehouses.
- Set up rules to be able to apply operational / business constraints on the databases in connection with the calculated metrics.
- Detect breaks and regressions in the database structure or in the data itself by generating alerts using business rules.
- Allow constraints to be centralized and create a unified HUB to manage data quality in order to deliver the best possible quality data to doctors and researchers.
- Create dashboards on metrics to be able to visualize and explore them.
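To make the chain of metric, rule, and alert described above more concrete, here is a minimal shell sketch. It is an illustration only and not HDM's implementation: the CSV input, the `null_rate` and `check_rule` helpers, and the threshold used in the usage note are all assumptions introduced for this example.

```shell
#!/bin/sh
# Illustrative sketch only: null_rate, check_rule and any sample data are
# hypothetical and are not part of HDM.

# null_rate FILE COLUMN
# Metric: integer percentage of empty values in the 1-based COLUMN of a
# comma-separated FILE. The first line is treated as a header and skipped.
# Naive CSV parsing: quoted fields that contain commas are not handled.
null_rate() {
    awk -F',' -v col="$2" '
        NR > 1 { total++; if ($col == "") empty++ }
        END { printf "%d\n", (total ? (empty * 100) / total : 0) }
    ' "$1"
}

# check_rule NAME VALUE MAX
# Rule: VALUE must not exceed MAX. Prints an alert line and returns 1 when
# the rule is violated, otherwise prints an OK line and returns 0.
check_rule() {
    if [ "$2" -gt "$3" ]; then
        echo "ALERT: $1 = $2 exceeds maximum $3"
        return 1
    fi
    echo "OK: $1 = $2"
}
```

For a hypothetical `patients.csv` whose second column is empty in 2 of its 4 data rows, `null_rate patients.csv 2` prints `50`, and `check_rule "null_rate(diagnosis)" 50 20` prints an alert line because 50 is above the assumed maximum of 20.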
Health Data Metrics relies on an ecosystem of applications in order to work:
- Elasticsearch >=v7.10.0
    - Elasticsearch Installed and API Endpoint accessible.
- Kibana >=v7.10.0
    - Kibana Installed and API Endpoint accessible.
- Airflow >=v2.1.0
    - Airflow Installed and API Endpoint accessible.
    - HDM Pipeline imported, setup & running (See More on Airflow Pipeline).
- Nexus >=3.29.2-02
    - Nexus Installed and API Endpoint accessible.
    - Default Repository
    - User / Password with rights to [Read artifacts, Search Queries]
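The minimum versions listed above can be checked with a small helper. The sketch below is only a suggestion: `version_ge` and `check_min` are hypothetical names, and it assumes a `sort` that supports `-V` (version sort), as GNU coreutils does. You have to supply the installed version yourself.

```shell
#!/bin/sh
# Illustrative sketch: version_ge and check_min are hypothetical helpers,
# not part of HDM. Assumes `sort -V` (version sort) is available.

# version_ge ACTUAL MINIMUM
# Succeeds when ACTUAL is at least MINIMUM. A leading "v" is ignored.
version_ge() {
    actual=${1#v}
    minimum=${2#v}
    # The lower of the two versions sorts first; ACTUAL is acceptable
    # when that lowest version is MINIMUM itself.
    lowest=$(printf '%s\n%s\n' "$actual" "$minimum" | sort -V | head -n 1)
    [ "$lowest" = "$minimum" ]
}

# check_min NAME ACTUAL MINIMUM
# Prints a one-line verdict and returns 1 when ACTUAL is too old.
check_min() {
    if version_ge "$2" "$3"; then
        echo "OK: $1 $2 (minimum $3)"
    else
        echo "TOO OLD: $1 $2 (minimum $3)"
        return 1
    fi
}
```

For example, `check_min Elasticsearch 7.17.9 v7.10.0` reports an acceptable version. Elasticsearch exposes its version in the `version.number` field of the JSON returned by its root endpoint, which is one place to read the actual value from.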
HDM reads its configuration from the following files:
- /var/www/html/conf/appli/conf-appli.json : see default file docs/templates/conf-appli.json
- /var/www/html/conf/db/conf-db.json : see default file docs/templates/conf-db.json
- /var/www/html/conf/ldap/conf-ldap.json : see default file docs/templates/conf-ldap.json
- /var/www/html/conf/mail/msmtprc : see default file docs/templates/msmtprc
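One possible way to prepare these files is to copy each default template into a local `conf/` directory that mirrors the container layout. The sketch below assumes exactly that mirroring (a local `conf/` mounted at `/var/www/html/conf/`); `scaffold_conf` is a hypothetical helper, not a script shipped with HDM.

```shell
#!/bin/sh
# Illustrative sketch: scaffold_conf is a hypothetical helper.

# scaffold_conf TEMPLATE_DIR CONF_DIR
# Copies each default template into CONF_DIR/<subfolder>/, creating the
# subfolders as needed. Existing files are never overwritten.
scaffold_conf() {
    template_dir=$1
    conf_dir=$2
    for entry in appli/conf-appli.json db/conf-db.json \
                 ldap/conf-ldap.json mail/msmtprc; do
        target="$conf_dir/$entry"
        mkdir -p "$(dirname "$target")"
        if [ -e "$target" ]; then
            echo "kept    $target"
        else
            cp "$template_dir/$(basename "$entry")" "$target"
            echo "created $target"
        fi
    done
}
```

From the repository root this would be called as `scaffold_conf docs/templates conf`. The copied files still need to be edited with your own values.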
You can run HDM in 3 different ways:
To run anywhere :
docker run -p 80:80 -v "$PWD/conf/:/var/www/html/conf/" ghcr.io/curie-data-factory/hdm:latest
To deploy in production environments :
helm repo add curiedfcharts https://curie-data-factory.github.io/helm-charts
helm repo update
helm upgrade --install --namespace default --values ./my-values.yaml my-release curiedfcharts/hdm
More info Here
For dev purposes :
- Clone git repository :
git clone https://github.com/curie-data-factory/health-data-metrics.git
cd health-data-metrics/
- Create Conf files & folders :
mkdir -p conf/ldap
touch conf/ldap/conf-ldap.json
- Set the configuration variables (see the templates above).
- Then run the Docker Compose stack.
docker-compose up -d
- http://localhost:80 HDM front end
- http://localhost:8081 Nexus
- http://localhost:5601 Kibana
- http://localhost:9200 Elasticsearch
- tcp://127.0.0.1:3306 MySQL Endpoint
(host: 127.0.0.1 Port: 3306 User: hdm Password: password Database: dbhdm)
- Resolve composer package dependencies. See Here for installing and using composer.
docker exec -ti hdm sh -c "composer install --no-dev --optimize-autoloader"
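Once the Docker Compose stack from the steps above is running, a quick smoke check of the listed HTTP endpoints can be sketched as follows. `probe` and `report_endpoints` are hypothetical helpers, and the check assumes `curl` is installed. Any HTTP response, including an error status, counts as "UP" here, because the sketch only verifies that something is listening.

```shell
#!/bin/sh
# Illustrative sketch: probe and report_endpoints are hypothetical helpers.

# probe URL
# Succeeds when the URL answers with any HTTP response within 5 seconds.
# curl is used without --fail, so HTTP error statuses still count as a
# response.
probe() {
    curl --silent --output /dev/null --max-time 5 "$1"
}

# report_endpoints URL...
# Prints one "UP" or "DOWN" line per URL, in the order given.
report_endpoints() {
    for url in "$@"; do
        if probe "$url"; then
            echo "UP   $url"
        else
            echo "DOWN $url"
        fi
    done
}
```

With the default ports listed above, the call would be `report_endpoints http://localhost:80 http://localhost:8081 http://localhost:5601 http://localhost:9200`. The MySQL endpoint is not HTTP and is not covered by this check.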
You can install Airflow and run the entire stack locally if you have enough RAM & CPU (4 cores & 16 GB RAM recommended). To see how, go Here
The documentation is compiled from markdown sources using Material for MkDocs. To compile the documentation :
- Go to your source directory :
cd health-data-metrics
- Run the docker build command :
docker run --rm -i -v "$PWD:/docs" squidfunk/mkdocs-material:latest build
See https://airflow.apache.org/docs/apache-airflow/stable/index.html
Data Factory - Institut Curie - 2021