LODQuA is a large-scale automated quality assessment pipeline specifically for Linked Open Data. This repository links three containers, each of which measures a different set of quality metrics:
- https://github.com/MaastrichtU-IDS/dqa_descriptive_statistics: The descriptive statistics are metrics from eight queries defined by the Health Care and Life Sciences (HCLS) group on the description of datasets using the Resource Description Framework: the number of triples, entities, subjects, properties, objects and graphs of the dataset are reported (see the example query after this list).
- https://github.com/MaastrichtU-IDS/fairsharing-metrics: The FAIRSharing (https://fairsharing.org/) metrics extract information from the FAIRSharing resource, which covers standards (terminologies, formats, models and reporting guidelines), databases, and data policies in the life sciences, broadly encompassing the biological, environmental and biomedical sciences. The license information, terminologies used, and scope and datatypes of the specified resource are extracted.
- https://github.com/MaastrichtU-IDS/RDFUnit: RDFUnit is a tool that computes several metrics to analyze the syntactic validity and consistency of the datasets.
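For illustration, each descriptive statistic boils down to a single SPARQL query against the input endpoint. Below is a minimal sketch of the triple count using curl; the endpoint is the WikiPathways endpoint used in the examples further down, and the actual eight queries live in the dqa_descriptive_statistics repository.
# Count the triples of the dataset, one of the HCLS-style descriptive statistics
curl -s -H "Accept: application/sparql-results+json" \
  --data-urlencode "query=SELECT (COUNT(*) AS ?triples) WHERE { ?s ?p ?o }" \
  http://sparql.wikipathways.org/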
The https://github.com/MaastrichtU-IDS/dqa_combine_statistics module then combines the outputs of all three containers and adds a timestamp, after which the https://github.com/MaastrichtU-IDS/RdfUpload container uploads the output file to the specified SPARQL endpoint.
The DQA pipeline can be run using the Common Workflow Language (CWL); see the d2s-core repository.
The workflow is defined in workflow-dqa.cwl, with the corresponding config file config-cwl-dqa.yml.
See the documentation to install a CWL runner.
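The reference runner can typically be installed with pip (assuming a Python 3 environment; check the linked documentation for the recommended setup):
# Install the reference CWL runner, which provides the cwl-runner executable
pip3 install cwlref-runner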
# Create workspace
mkdir -p /data/dqa-workspace/output/tmp-outdir
sudo chown -R ${USER}:${USER} /data/dqa-workspace
# Clone the CWL workflows repository
git clone https://github.com/MaastrichtU-IDS/d2s-core
cd d2s-core
# Run the CWL workflow, providing the config YAML file
cwl-runner --custom-net d2s-core_network \
--outdir /data/dqa-workspace/output \
--tmp-outdir-prefix=/data/dqa-workspace/output/tmp-outdir/ \
--tmpdir-prefix=/data/dqa-workspace/output/tmp-outdir/tmp- \
cwl/workflows/workflow-dqa.cwl \
support/config-cwl-dqa.yml
Workflow files go to /data/dqa-workspace.
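Once the run completes, the generated files can be inspected in the output directory specified with --outdir:
# List the files produced by the workflow
ls -lh /data/dqa-workspace/output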
The DQA pipeline can also be run as an Argo workflow; see the d2s-argo-workflows repository.
See the documentation to run Argo workflows on the DSRI or on a single machine.
# Clone the Argo workflows repository
git clone https://github.com/MaastrichtU-IDS/d2s-argo-workflows
cd d2s-argo-workflows
# Run the Argo workflow, using a config file
argo submit cwl/workflows/dqa-workflow-argo.yaml -f support/config-dqa-pipeline.yml
You might need to specify the service account:
argo submit --serviceaccount argo cwl/workflows/dqa-workflow-argo.yaml -f support/config-dqa-pipeline.yml
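The status of the submitted workflow can then be followed with the standard Argo CLI commands:
# List submitted workflows and their status
argo list
# Show the details of a specific workflow (name as printed by argo submit)
argo get <workflow-name>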
Alternatively, the pipeline can be built and run directly from this repository using the helper scripts:
# Clone the pipeline repository together with its submodules
git clone --recursive https://github.com/MaastrichtU-IDS/dqa-pipeline.git
cd dqa-pipeline
# Make the helper scripts executable and build the Docker images
chmod +x *.sh && ./build.sh
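Roughly, build.sh builds one Docker image per module; a sketch of what this amounts to is shown below (the image names here are illustrative, see the script itself for the actual tags):
# Illustrative sketch only: one docker build per module checked out as a submodule
docker build -t dqa-descriptive-statistics ./dqa_descriptive_statistics
docker build -t fairsharing-metrics ./fairsharing-metrics
docker build -t rdfunit ./RDFUnit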
Parameters of run.sh:
- wd (workDirectory): path where intermediate files will be stored
- fsu (FAIRSharingURL): the FAIRSharing URL of the input dataset
- iep (input endpoint): SPARQL URL of the input endpoint
- iun (input user name) [optional]: username for the input SPARQL endpoint
- ipw (input password) [optional]: password for the input SPARQL endpoint
- sch (schema): URL of the schema of the dataset (see the examples below)
- oep (output endpoint): URL of the output SPARQL endpoint
- ouep (output update endpoint): URL of the update SPARQL endpoint
- ogr (output GraphDB repository): name of the GraphDB repository the output is loaded into (see the examples below)
- oun (output user name) [optional]: username for the output SPARQL endpoint
- opw (output password) [optional]: password for the output SPARQL endpoint
The pipeline takes an input endpoint, the corresponding FAIRSharing URL for the input dataset, and the output endpoint where the triples will be loaded (GraphDB at the moment):
./run.sh \
-wd <work-directory> \
-fsu <fairsharing url> \
-iep <input-endpoint> \
-sch <your schema> \
-oep <your sparql endpoint> \
-ouep <optional sparql update endpoint> \
-ogr <output graphdb repository> \
-oun <optional sparql endpoint username> \
-opw <optional sparql endpoint password>
# For WikiPathways (using HTTP repository for RdfUpload)
./run.sh \
-wd /data/dqa-pipeline/wikipathways/ \
-fsu https://fairsharing.org/FAIRsharing.1x53qk \
-iep http://sparql.wikipathways.org/ \
-sch https://www.w3.org/2012/pyRdfa/extract?uri=http://vocabularies.wikipathways.org/wp# \
-oep http://graphdb.dumontierlab.com \
-ogr test2 \
-oun import_user \
-opw password
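After a successful run, the uploaded statistics can be verified with a SPARQL query against the output repository (assuming a standard GraphDB/RDF4J repository endpoint, as in the example above):
# Retrieve a sample of the uploaded triples from the output repository
curl -s -H "Accept: application/sparql-results+json" \
  --data-urlencode "query=SELECT * WHERE { ?s ?p ?o } LIMIT 10" \
  http://graphdb.dumontierlab.com/repositories/test2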
# Legacy RdfUpload (using a SPARQL repository). No longer working
./run.sh \
-wd /data/dqa-pipeline/wikipathways/2018-03-29-1330/ \
-fsu https://fairsharing.org/FAIRsharing.1x53qk \
-iep http://sparql.wikipathways.org/ \
-oep http://graphdb.dumontierlab.com/repositories/test2 \
-ouep http://graphdb.dumontierlab.com/repositories/test2/statements \
-oun import_user \
-opw password
TODO: to be implemented in Java using jsoup.