LODQuA: Large-scale RDF-based Data Quality Assessment Pipeline

LODQuA is a large-scale automated quality assessment pipeline specifically for Linked Open Data. This repository links three containers, each of which measures a different set of quality metrics.

The https://github.com/MaastrichtU-IDS/dqa_combine_statistics module then combines the outputs of all three containers and adds a timestamp, after which the https://github.com/MaastrichtU-IDS/RdfUpload container uploads the output file to the specified SPARQL endpoint.
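The combine-and-timestamp step can be pictured with a minimal shell sketch. The file names, metric predicates, and timestamp predicate below are illustrative assumptions, not the module's actual output:

```shell
# Sketch of combining container outputs and timestamping the result.
# File names and predicates are hypothetical, not the module's real ones.
set -e

# Pretend N-Triples output from two of the metric containers
echo '<http://example.org/dataset> <http://example.org/metricA> "0.95" .' > metricsA.nt
echo '<http://example.org/dataset> <http://example.org/metricB> "0.80" .' > metricsB.nt

# Combine the outputs into a single file
cat metricsA.nt metricsB.nt > combined.nt

# Append a timestamp triple so each assessment run is dated
echo "<http://example.org/dataset> <http://purl.org/dc/terms/created> \"$(date -u +%Y-%m-%dT%H:%M:%SZ)\" ." >> combined.nt

cat combined.nt
```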

Installation and usage

Using CWL workflows

The DQA pipeline can be run using the Common Workflow Language. See the d2s-core repository.

See the workflow file workflow-dqa.cwl and the config file config-cwl-dqa.yml.

See the documentation to install the CWL runner.

# Create workspace
mkdir -p /data/dqa-workspace/output/tmp-outdir
sudo chown -R ${USER}:${USER} /data/dqa-workspace

# Clone the CWL workflows repository
git clone https://github.com/MaastrichtU-IDS/d2s-core
cd d2s-core

# Run the CWL workflow, providing the config YAML file
cwl-runner --custom-net d2s-core_network \
  --outdir /data/dqa-workspace/output \
  --tmp-outdir-prefix=/data/dqa-workspace/output/tmp-outdir/ \
  --tmpdir-prefix=/data/dqa-workspace/output/tmp-outdir/tmp- \
  cwl/workflows/workflow-dqa.cwl \
  support/config-cwl-dqa.yml

Workflow output files go to /data/dqa-workspace.

Using Argo workflows

The DQA pipeline can be run using Argo workflows. See the d2s-argo-workflows repository.

See documentation to run Argo workflows on the DSRI or on a single machine.

# Clone the Argo workflows repository
git clone https://github.com/MaastrichtU-IDS/d2s-argo-workflows
cd d2s-argo-workflows

# Run the Argo workflow, using a config file
argo submit cwl/workflows/dqa-workflow-argo.yaml -f support/config-dqa-pipeline.yml

You might need to specify the service account:

argo submit --serviceaccount argo cwl/workflows/dqa-workflow-argo.yaml -f support/config-dqa-pipeline.yml

Deprecated installation and usage

Download

git clone --recursive https://github.com/MaastrichtU-IDS/dqa-pipeline.git

Build

chmod +x *.sh && ./build.sh

Command

  • wd (work directory): path where intermediate files will be stored
  • fsu (FAIRSharing URL): a FAIRSharing URL for the input dataset
  • iep (input SPARQL endpoint): SPARQL URL of the input endpoint
  • iun (input user name) [optional]: username for the input SPARQL endpoint
  • ipw (input password) [optional]: password for the input SPARQL endpoint
  • sch (schema): schema URL for the input dataset
  • oep (output endpoint): URL of the output SPARQL endpoint
  • ouep (output update endpoint): URL of the SPARQL update endpoint
  • ogr (output GraphDB repository): name of the output GraphDB repository
  • oun (output user name) [optional]: username for the output SPARQL endpoint
  • opw (output password) [optional]: password for the output SPARQL endpoint

Run

The pipeline takes an input endpoint, the corresponding FAIRsharing URL for the input dataset, and the output endpoint where the triples will be loaded (currently GraphDB).

./run.sh \
-wd <work-directory> \
-fsu <fairsharing url> \
-iep <input-endpoint> \
-sch <your schema> \
-oep <your sparql endpoint> \
-ouep <optional sparql update endpoint> \
-ogr <output graphdb repository> \
-oun <optional sparql endpoint username> \
-opw <optional sparql endpoint password>

Example

# For WikiPathways (using HTTP repository for RdfUpload) 
./run.sh \
-wd /data/dqa-pipeline/wikipathways/ \
-fsu https://fairsharing.org/FAIRsharing.1x53qk \
-iep http://sparql.wikipathways.org/ \
-sch https://www.w3.org/2012/pyRdfa/extract?uri=http://vocabularies.wikipathways.org/wp# \
-oep http://graphdb.dumontierlab.com \
-ogr test2 \
-oun import_user \
-opw password


# Legacy RdfUpload (using SPARQL repository). No longer works
./run.sh \
-wd /data/dqa-pipeline/wikipathways/2018-03-29-1330/ \
-fsu https://fairsharing.org/FAIRsharing.1x53qk \
-iep http://sparql.wikipathways.org/ \
-oep http://graphdb.dumontierlab.com/repositories/test2 \
-ouep http://graphdb.dumontierlab.com/repositories/test2/statements \
-oun import_user \
-opw password
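After a run, one way to sanity-check the upload is to query the output repository. The query below is a generic sketch; the repository URL matches the example above and the REST path follows GraphDB's repository endpoint convention. The curl call is left commented out because it requires network access to the endpoint:

```shell
# Write a simple SPARQL query that lists a few of the uploaded triples.
cat > check-upload.rq <<'EOF'
SELECT ?s ?p ?o
WHERE { ?s ?p ?o }
LIMIT 10
EOF

# Send it to the GraphDB repository (requires access to the endpoint):
# curl -G http://graphdb.dumontierlab.com/repositories/test2 \
#   -H "Accept: application/sparql-results+json" \
#   --data-urlencode "query@check-upload.rq"

cat check-upload.rq
```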

TODO

To be implemented in Java using jsoup.