RDF2X

Convert big Linked Data RDF datasets to a relational database model, CSV, JSON and ElasticSearch using Spark.

Disclaimer

⚠️ RDF2X uses Spark 1.6, which has a known high-severity deserialization vulnerability (https://github.com/advisories/GHSA-8rhc-48pp-52gr). Use at your own risk.

Architecture

[Architecture diagram]

Tutorials

Visualizing ClinicalTrials.gov RDF data in Tableau using RDF2X

Querying Wikidata RDF with SQL using RDF2X

Distributed Conversion of RDF Data to the Relational Model (thesis)

Get started

RDF2X can be run from source with Maven or from a packaged JAR file.

Running from source

To launch from source using Maven:

# First build the project
mvn compile

# Save to CSV
mvn exec:java -Dexec.args="convert \
--input.file /path/to/input/ \
--output.target CSV \
--output.folder /path/to/output/folder"

# Save to JSON
mvn exec:java -Dexec.args="convert \
--input.file /path/to/input/ \
--output.target JSON \
--output.folder /path/to/output/folder"

# Save to a database
mvn exec:java -Dexec.args="convert \
--input.file /path/to/input/ \
--output.target DB \
--db.url 'jdbc:postgresql://localhost:5432/database_name' \
--db.user user \
--db.password 123456 \
--db.schema public"

# More config options
mvn \
-Dspark.app.name="RDF2X My file" \
-Dspark.master=local[2] \
-Dspark.driver.memory=3g \
exec:java  \
-Dexec.args="convert \
--input.file /path/to/input/ \
--input.lineBasedFormat true \
--input.batchSize 500000 \
--output.saveMode Overwrite \
--output.target DB \
--db.url \"jdbc:postgresql://localhost:5432/database_name\" \
--db.user user \
--db.password 123456 \
--db.schema public \
--db.batchSize 1000"

Refer to the Configuration section below for all config parameters.

Running JAR using spark-submit

To launch locally via spark-submit:

  • Download the packaged JAR from our releases page
  • Install JDK 1.8
  • Download Spark 1.6
  • Add the Spark bin directory to your system PATH variable
  • Refer to the Configuration section below for all config parameters.
  • Run this command from the project target directory (or anywhere you have put your packaged JAR):
spark-submit \
--name "RDF2X ClinicalTrials.gov" \
--class com.merck.rdf2x.main.Main \
--master 'local[2]' \
--driver-memory 2g \
--packages postgresql:postgresql:9.1-901-1.jdbc4,org.eclipse.rdf4j:rdf4j-runtime:2.1.4,org.apache.jena:jena-core:3.1.1,org.apache.jena:jena-elephas-io:3.1.1,org.apache.jena:jena-elephas-mapreduce:0.9.0,com.beust:jcommander:1.58,com.databricks:spark-csv_2.10:1.5.0,org.elasticsearch:elasticsearch-spark_2.10:2.4.4,org.jgrapht:jgrapht-core:1.0.1 \
rdf2x-1.0-SNAPSHOT.jar \
convert \
--input.file /path/to/clinicaltrials \
--input.lineBasedFormat true \
--cacheLevel DISK_ONLY \
--input.batchSize 1000000 \
--output.target DB \
--db.url "jdbc:postgresql://localhost:5432/database_name?tcpKeepAlive=true" \
--db.user user \
--db.password 123456 \
--db.schema public \
--db.batchSize 1000 \
--output.saveMode Overwrite

To run the stats job via spark-submit:

spark-submit --name "RDF2X ClinicalTrials.gov" --class com.merck.rdf2x.main.Main --master 'local' \
--driver-memory 2g \
--packages postgresql:postgresql:9.1-901-1.jdbc4,org.eclipse.rdf4j:rdf4j-runtime:2.1.4,org.apache.jena:jena-core:3.1.1,org.apache.jena:jena-elephas-io:3.1.1,org.apache.jena:jena-elephas-mapreduce:0.9.0,com.beust:jcommander:1.58,com.databricks:spark-csv_2.10:1.5.0,org.elasticsearch:elasticsearch-spark_2.10:2.4.4,org.jgrapht:jgrapht-core:1.0.1 \
rdf2x-1.0-SNAPSHOT.jar \
stats \
--input.file bio2rdf-clinicaltrials.nq \
--input.batchSize 1000000 \
--stat SUBJECT_URI_COUNT

Running on YARN

To launch on a cluster:

  • Copy the JAR you packaged earlier to your server
  • Optionally, configure the driver log level by referencing a custom log4j.properties file. You can copy and modify the existing ones in the src/main/resources/ folder.

Run on YARN: Save to DB

spark-submit \
--name "RDF2X ClinicalTrials.gov" \
--class com.merck.rdf2x.main.Main \
--master yarn \
--deploy-mode client \
--driver-memory 4g \
--queue default \
--executor-memory 6g \
--executor-cores 1 \
--num-executors 5 \
--conf spark.yarn.executor.memoryOverhead=2048 \
--conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=file:///path/to/your/log4j.properties" \
--packages postgresql:postgresql:9.1-901-1.jdbc4,org.eclipse.rdf4j:rdf4j-runtime:2.1.4,org.apache.jena:jena-core:3.1.1,org.apache.jena:jena-elephas-io:3.1.1,org.apache.jena:jena-elephas-mapreduce:0.9.0,com.beust:jcommander:1.58,com.databricks:spark-csv_2.10:1.5.0,org.elasticsearch:elasticsearch-spark_2.10:2.4.4,org.jgrapht:jgrapht-core:1.0.1 \
rdf2x-1.0-SNAPSHOT.jar \
convert \
--input.file hdfs:///path/to/clinicaltrials/ \
--input.lineBasedFormat true \
--input.batchSize 1000000 \
--output.saveMode Overwrite \
--output.target DB \
--db.url "jdbc:postgresql://your.db.server.com/database_name" \
--db.user user \
--db.password 123456 \
--db.schema public \
--db.batchSize 1000

Run on YARN: Save to CSV

...
--output.target CSV \
--output.folder hdfs:///path/to/clinicaltrials-csv/ 

Run on YARN: Save to JSON

...
--output.target JSON \
--output.folder hdfs:///path/to/clinicaltrials-json/ 

Run on YARN: Save to ElasticSearch

Note:

  • Currently the data is saved to ElasticSearch in a relational format - entity and relation tables.
  • --output.saveMode is ignored when saving to ElasticSearch (data is always appended).
  • Connection parameters and other ES config can be specified as System properties via Spark conf: --conf spark.es.nodes=localhost --conf spark.es.port=9200
spark-submit \
--name "RDF2X ClinicalTrials.gov" \
--class com.merck.rdf2x.main.Main \
--master yarn \
--deploy-mode client \
--driver-memory 4g \
--queue default \
--executor-memory 6g \
--executor-cores 1 \
--num-executors 5 \
--conf spark.es.nodes=localhost \
--conf spark.es.port=9200 \
--conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=file:///path/to/your/log4j.properties" \
--packages postgresql:postgresql:9.1-901-1.jdbc4,org.eclipse.rdf4j:rdf4j-runtime:2.1.4,org.apache.jena:jena-core:3.1.1,org.apache.jena:jena-elephas-io:3.1.1,org.apache.jena:jena-elephas-mapreduce:0.9.0,com.beust:jcommander:1.58,com.databricks:spark-csv_2.10:1.5.0,org.elasticsearch:elasticsearch-spark_2.10:2.4.4,org.jgrapht:jgrapht-core:1.0.1 \
rdf2x-1.0-SNAPSHOT.jar \
convert \
--input.file hdfs:///path/to/clinicaltrials/ \
--input.lineBasedFormat true \
--input.batchSize 1000000 \
--output.target ES \
--es.index clinicaltrials

Refer to the Configuration section below for all config parameters.

Data sources

Download your RDF dataset, e.g. ClinicalTrials.gov:

wget http://download.bio2rdf.org/release/4/clinicaltrials/clinicaltrials.nq.gz

If you plan on using a cluster, add the data to HDFS:

# Single file
hadoop fs -put clinicaltrials.nq.gz /path/to/datasets/

# Multiple files
hadoop fs -mkdir /path/to/datasets/clinicaltrials
hadoop fs -put * /path/to/datasets/clinicaltrials

Building RDF2X JAR

Use Maven to get a packaged JAR file:

# compile, run tests and create JAR
mvn package

# or without running tests
mvn package -Dmaven.test.skip=true

Example

Consider this simple example from the W3C Turtle specification:

@base <http://example.org/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix rel: <http://www.perceive.net/schemas/relationship/> .

<#green-goblin>
        rel:enemyOf <#spiderman> ;
        a foaf:Person ;    # in the context of the Marvel universe
        foaf:name "Green Goblin" .

<#spiderman>
        rel:enemyOf <#green-goblin> ;
        a foaf:Person ;
        foaf:name "Spiderman", "Человек-паук"@ru .

Database output

Converting to a relational database will produce the following tables:

Person

| ID | URI | name_ru_string | name_string |
|----|-----|----------------|-------------|
| 1 | #green-goblin | null | Green Goblin |
| 2 | #spiderman | Человек-паук | Spiderman |

Person_Person

| person_ID_from | person_ID_to | predicate |
|----------------|--------------|-----------|
| 2 | 1 | 3 |
| 1 | 2 | 3 |

Along with the entities and relationships, metadata is persisted:

_META_Entities:

| URI | name | label | num_rows |
|-----|------|-------|----------|
| http://xmlns.com/foaf/0.1/Person | Person | null | 2 |

_META_Columns

| name | predicate | type | multivalued | language | non_null | entity_name |
|------|-----------|------|-------------|----------|----------|-------------|
| name_ru_string | 2 | STRING | false | ru | 0.5 | Person |
| name_string | 2 | STRING | false | null | 1 | Person |

_META_Relations

| name | from_name | to_name |
|------|-----------|---------|
| person_person | person | person |

_META_Predicates

| predicate | URI | name | label |
|-----------|-----|------|-------|
| 1 | http://www.w3.org/1999/02/22-rdf-syntax-ns#type | type | null |
| 2 | http://xmlns.com/foaf/0.1/name | name | null |
| 3 | http://www.perceive.net/schemas/relationship/enemyOf | enemyof | null |
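
Once the output is in a relational database, relations can be resolved with ordinary joins. The following is a minimal sketch for a PostgreSQL target, assuming table and column names are lower-cased as listed in _META_Relations; the database name is a placeholder and the exact identifiers may differ in your schema:

# Sketch: list who is whose enemy by joining the relation table with both entity rows
psql -d database_name -c "
  SELECT f.name_string AS from_person,
         p.name        AS predicate,
         t.name_string AS to_person
  FROM person_person r
  JOIN person f ON f.id = r.person_id_from   -- source entity
  JOIN person t ON t.id = r.person_id_to     -- target entity
  JOIN _meta_predicates p ON p.predicate = r.predicate;"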

Tested datasets

ClinicalTrials.gov

| Property | Value |
|----------|-------|
| Number of quads | 159,001,344 |
| Size (gzipped) | 1.8 GB |
| Size (uncompressed) | 45.1 GB |
| Output entity table rows | 17,855,687 (9,859,796 rows in largest table) |
| Output relation table rows | 74,960,010 (19,084,633 rows in largest table) |
| Cluster setup | 5 executors, 6 GB RAM each |
| Run time | 2.7 hours |

Configuration

Convert

| Name | Default | Description |
|------|---------|-------------|
| --input.file | required | Path to input file or folder |
| --flavor | null | Specify a flavor to be used. Flavors modify the behavior of RDF2X, applying default settings and providing custom methods for data-source-specific modifications. |
| --filter.cacheLevel | StorageLevel(false, false, false, false, 1) | Level of caching of the input dataset after filtering (None, DISK_ONLY, MEMORY_ONLY, MEMORY_AND_DISK, ...). See the Spark StorageLevel class for more info. |
| --cacheLevel | StorageLevel(true, false, false, false, 1) | Level of caching of the instances before collecting schema and persisting (None, DISK_ONLY, MEMORY_ONLY, MEMORY_AND_DISK, ...). See the Spark StorageLevel class for more info. |
| --instancePartitions | null | Repartition instances into this number of partitions before aggregating. |
| --help | false | Show the usage page |

Currently supported flavors:

  • Wikidata: Applies settings and methods for converting the Wikidata RDF dumps.
  • Bio2RDF: Generates nicer names for Resource entities (with vocabulary prefix).
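
For example, a local run that applies the Wikidata flavor and only previews the output could look like this sketch (the input path is a placeholder):

# Apply the Wikidata flavor and preview the result instead of persisting it
mvn exec:java -Dexec.args="convert \
--input.file /path/to/wikidata/dump \
--flavor Wikidata \
--output.target Preview"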

Parsing

| Name | Default | Description |
|------|---------|-------------|
| --input.lineBasedFormat | null | Whether the input files can be read line by line (e.g. true for NTriples or NQuads, false for Turtle). By default, this is guessed from the file extension. Line-based formats can be parsed by multiple nodes at the same time; other formats are read by the master node and repartitioned after parsing. |
| --input.repartition | null | Repartition into this number of partitions after parsing. |
| --input.batchSize | 500000 | Batch size for parsing line-based formats (number of quads per partition) |
| --input.errorHandling | Ignore | How to handle RDF parsing errors (Ignore, Throw). |
| --input.acceptedLanguage | [] | Accepted language. Literals in other languages are ignored. You can specify more languages by repeating this parameter. |
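
As a sketch of the parsing options, the following accepts only English and German literals and explicitly marks the input as a non-line-based format (paths are placeholders):

# Turtle is not line-based, so it is parsed on the master node and repartitioned afterwards
mvn exec:java -Dexec.args="convert \
--input.file /path/to/input.ttl \
--input.lineBasedFormat false \
--input.acceptedLanguage en \
--input.acceptedLanguage de \
--output.target Preview"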

Filtering

| Name | Default | Description |
|------|---------|-------------|
| --filter.resource | [] | Accept resources with the specified URI. More resource URIs can be specified by repeating this parameter. |
| --filter.resourceBlacklist | [] | Ignore resources with the specified URI. More resource URIs can be specified by repeating this parameter. |
| --filter.relatedDepth | 0 | Also accept resources related to the original set within relatedDepth directed steps. Uses an in-memory set of subject URIs, therefore it can only be used for small results (e.g. fewer than 1 million selected resources). |
| --filter.directed | true | Whether to traverse only in the subject->object direction of relations when retrieving related resources. |
| --filter.type | [] | Accept only resources of the specified type. More type URIs can be specified by repeating this parameter. |
| --filter.ignoreOtherTypes | true | Whether to ignore instance types that were not selected. If true, only the tables for the specified types are created. If false, all additional types and supertypes of selected instances are considered as well. |
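
For instance, converting only foaf:Person resources plus everything they directly point to (one directed step) could combine these flags as in this sketch (paths are placeholders):

# Select Person resources and resources reachable from them in one subject->object step
mvn exec:java -Dexec.args="convert \
--input.file /path/to/input/ \
--filter.type http://xmlns.com/foaf/0.1/Person \
--filter.relatedDepth 1 \
--output.target CSV \
--output.folder /path/to/output/folder"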

Output

| Name | Default | Description |
|------|---------|-------------|
| --output.target | required | Where to output the result (DB, CSV, JSON, ES, Preview). |
| --output.saveMode | ErrorIfExists | How to handle existing tables (Append, Overwrite, ErrorIfExists, Ignore). |

Based on --output.target, you have to specify additional parameters:

Output to DB

| Name | Default | Description |
|------|---------|-------------|
| --db.url | required | Database JDBC string |
| --db.user | required | Database user |
| --db.password | null | Database password |
| --db.schema | null | Database schema name |
| --db.batchSize | 5000 | Insert batch size |
| --db.bulkLoad | true | Use CSV bulk load if possible (PostgreSQL COPY) |

Output to JSON, CSV

| Name | Default | Description |
|------|---------|-------------|
| --output.folder | required | Folder to output the files to |

Output to ElasticSearch (ES)

| Name | Default | Description |
|------|---------|-------------|
| --es.index | null | ElasticSearch index to save the output to |
| --es.createIndex | true | Whether to create the index in case it does not exist, overrides the es.index.auto.create property |

Connection parameters can be specified as system properties:

  • Via Spark conf: --conf spark.es.nodes=localhost --conf spark.es.port=9200
  • In standalone mode: -Dspark.es.nodes=localhost -Dspark.es.port=9200
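
Putting this together, a local standalone run writing to ElasticSearch could look like the following sketch (host, port, index name and input path are placeholders):

# Pass ES connection properties as system properties, then convert into the given index
mvn \
-Dspark.es.nodes=localhost \
-Dspark.es.port=9200 \
exec:java \
-Dexec.args="convert \
--input.file /path/to/input/ \
--output.target ES \
--es.index my_index"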

RDF Schema

| Name | Default | Description |
|------|---------|-------------|
| --rdf.typePredicate | [] | Additional URI apart from rdf:type to treat as a type predicate. You can specify more predicates by repeating this parameter. |
| --rdf.subclassPredicate | [] | Additional URI apart from rdfs:subClassOf to treat as a subClassOf predicate. You can specify more predicates by repeating this parameter. |
| --rdf.collectSubclassGraph | true | Whether to collect the graph of subClass predicates. |
| --rdf.collectLabels | true | Whether to collect type and predicate labels (saved in meta tables and used for name formatting if requested). |
| --rdf.cacheFile | null | File for saving and loading the cached schema. |

Creating instances

| Name | Default | Description |
|------|---------|-------------|
| --instances.defaultLanguage | null | Treat all values in this language as if no language were specified; no language suffix is added to their columns. |
| --instances.addSuperTypes | true | Automatically add all supertypes to each instance; the instance will be persisted in all parent type tables. |
| --instances.repartitionByType | false | Whether to repartition instances by type. Profitable in local mode, but causes an expensive shuffle in cluster mode. |

Writing entities and relations

| Name | Default | Description |
|------|---------|-------------|
| --formatting.entityTablePrefix | | String to prepend to entity table names |
| --formatting.relationTablePrefix | | String to prepend to relation table names |
| --relations.storePredicate | true | Store predicate (relationship type) as a third column of entity relation tables. |

Entities

| Name | Default | Description |
|------|---------|-------------|
| --entities.maxNumColumns | null | Maximum number of columns for one table. |
| --entities.minColumnNonNullFraction | 0.0 | Properties require at least this fraction of non-null values to be stored as columns; the rest are stored in the Entity-Attribute-Value (EAV) table (e.g. 0.4 = properties with less than 40% of values present are stored only in the EAV table, 0 = store all as columns, 1 = store all only in the EAV table). |
| --entities.redundantEAV | false | Store all properties in the EAV table, including values that are already stored in columns. |
| --entities.redundantSubclassColumns | false | Store all columns in subclass tables, even if they are also present in a superclass table. If false (the default), columns present in superclasses are removed and their superclass location is marked in the column meta table. |
| --entities.minNumRows | 1 | Minimum number of rows required for an entity table. Tables with fewer rows will not be included. |
| --entities.sortColumnsAlphabetically | false | Sort columns alphabetically. Otherwise, columns are sorted by non-null ratio, most frequent first. |
| --entities.forceTypeSuffix | false | Whether to always add a type suffix to columns, even if only one datatype is present. |
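
For example, to keep only columns that are at least 40% non-null (pushing sparser properties into the EAV table) and to skip tiny entity tables, a run could add these flags, as in this sketch (paths are placeholders):

# Columns below 40% non-null values go to the EAV table only; tables with fewer than 10 rows are dropped
mvn exec:java -Dexec.args="convert \
--input.file /path/to/input/ \
--entities.minColumnNonNullFraction 0.4 \
--entities.minNumRows 10 \
--output.target CSV \
--output.folder /path/to/output/folder"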

Relations

| Name | Default | Description |
|------|---------|-------------|
| --relations.schema | Types | How to create relation tables (SingleTable, Types, Predicates, TypePredicates, None) |
| --relations.rootTypesOnly | true | When creating relation tables between two instances of multiple types, create the relation table only for the root type pair. If false, relation tables are created for all combinations of types. |

Supported relation table strategies:

  • SingleTable Store all relations in a single table
  • Types Create relation tables between related pairs of entity tables (for example Person_Address)
  • Predicates Create one relation table for each predicate (for example livesAt)
  • TypePredicates Create one relation table for each predicate between two entity tables (for example Person_livesAt_Address)
  • None Do not extract relations
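
Switching strategies only requires the corresponding flag, for example one relation table per predicate between each pair of entity tables, as in this sketch (paths are placeholders):

# Produces tables such as Person_livesAt_Address instead of Person_Address
mvn exec:java -Dexec.args="convert \
--input.file /path/to/input/ \
--relations.schema TypePredicates \
--output.target CSV \
--output.folder /path/to/output/folder"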

Formatting

| Name | Default | Description |
|------|---------|-------------|
| --formatting.maxTableNameLength | 25 | Maximum length of entity table names |
| --formatting.maxColumnNameLength | 50 | Maximum length of column names |
| --formatting.uriSuffixPattern | [/:#=] | When collecting a name from a URI, use the segment after the last occurrence of this regex |
| --formatting.useLabels | false | Try to use rdfs:label for formatting names. URIs are used if no label is present. |
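
A sketch combining the formatting options (the rdf_ prefix is just an illustrative value; paths are placeholders):

# Name tables and columns from rdfs:label where available and prefix entity tables with rdf_
mvn exec:java -Dexec.args="convert \
--input.file /path/to/input/ \
--formatting.useLabels true \
--formatting.entityTablePrefix rdf_ \
--output.target CSV \
--output.folder /path/to/output/folder"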

RDF Concepts implementation

Instances with multiple types

An instance with multiple types is saved in each specified type's entity table. For example, this input:

@base <http://example.org/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .

<#spiderman>
        a foaf:Person, foaf:Agent ;
        foaf:name "Spiderman".

<#lexcorp>
        a foaf:Organization, foaf:Agent ;
        foaf:name "LexCorp" ;
        foaf:homepage "https://www.lexcorp.io/".

will result in three entity tables:

Agent

| ID | URI | homepage_string | name_string |
|----|-----|-----------------|-------------|
| 1 | #lexcorp | https://www.lexcorp.io/ | LexCorp |
| 2 | #spiderman | null | Spiderman |

Organization

| ID | URI | homepage_string |
|----|-----|-----------------|
| 1 | #lexcorp | https://www.lexcorp.io/ |

Person

| ID | URI |
|----|-----|
| 2 | #spiderman |

Whether inherited columns (such as name here) are also duplicated in subclass tables is controlled by the --entities.redundantSubclassColumns option.

Literal data types

Supported datatypes depend on Jena.

The following types will be stored: STRING, INTEGER, DOUBLE, FLOAT, LONG, BOOLEAN. Other types will be converted to STRING.

DATETIME values are stored as STRING; the column type has to be converted in post-processing, e.g. in PostgreSQL:

ALTER TABLE public.en_thing
ALTER COLUMN start_date_datetime TYPE timestamp
USING start_date_datetime::timestamp without time zone;

Multi-valued properties

Multi-valued properties occur when multiple different values are specified for a single predicate, data type and language:

@base <http://example.org/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .

<#spiderman>
        a foaf:Person;
        foaf:name "Spiderman", "Spider-Man", "Spider man", "Человек-паук"@ru.

In that case, only one of the values is saved in the column (which one is not deterministic).

Additionally, the column is added to the EAV set, which means that all of the column's values (even those of other instances that have only a single value) are saved in the Entity-Attribute-Value table.

Person

| ID | URI | name_string | name_ru_string |
|----|-----|-------------|----------------|
| 1 | #green-goblin | Green Goblin | null |
| 2 | #spiderman | Spider-Man | Человек-паук |

EAV table

| ID | PREDICATE | datatype | language | value |
|----|-----------|----------|----------|-------|
| 1 | 3 | STRING | null | Green Goblin |
| 2 | 3 | STRING | null | Spider-Man |
| 2 | 3 | STRING | null | Spider man |
| 2 | 3 | STRING | null | Spiderman |
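
All values, including the ones not visible in the column, can then be read back from the EAV table. The following is a sketch for a PostgreSQL target; the table name eav is an assumption (check your schema or the meta tables for the actual name), and lower-cased identifiers are assumed as well:

# Fetch every stored name value for #spiderman from the Entity-Attribute-Value table
psql -d database_name -c "
  SELECT e.predicate, e.value
  FROM eav e                      -- assumed EAV table name
  JOIN person p ON p.id = e.id    -- link back to the entity row
  WHERE p.uri = '#spiderman';"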

Blank nodes

Not implemented yet. Triples with blank nodes are ignored.