Graph and RDF Examples
This page shows brief examples to set up consumers for the various graph and RDF output options for Baleen. To generate a graph it is necessary to include annotators which extract relations, co-references or events.
These examples use Docker to host the external dependencies. Note that on Windows the Docker machines may not run on localhost, so references to localhost in this document and in the pipeline may need to be replaced with the Docker machine IP.
This example writes the document and entity graphs to files as GraphML, and as RDF in RDF_XML format.
    consumers:
    - class: file.DocumentGraph
      outputDirectory: ./output_document_graph
      format: GRAPHML
    - class: file.EntityGraph
      outputDirectory: ./output_entity_graph
      format: GRAPHML
    - class: file.Rdf
      outputDirectory: ./output_rdf
      format: RDF_XML
    - class: file.RdfEntityGraph
      outputDirectory: ./output_entity_rdf
      format: RDF_XML
Alternative formats for the graph outputs are:
- GRAPHML - XML-based format
- GRAPHSON - JSON-based format
- GRYO - Gryo, a Kryo-based binary format (uses JVM object graphs)
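As a quick sanity check on a GraphML file written by the file consumers, the nodes and edges can be counted with a few lines of Python. This is a minimal sketch and not part of Baleen: the function name `count_graphml_elements` and the sample document are hypothetical, and it assumes the output uses the standard GraphML namespace.

```python
import xml.etree.ElementTree as ET

# Standard GraphML namespace (assumed to be the one used in the output files).
GRAPHML_NS = {"g": "http://graphml.graphdrawing.org/xmlns"}


def count_graphml_elements(xml_text):
    """Return (node_count, edge_count) for a GraphML document given as a string."""
    root = ET.fromstring(xml_text)
    nodes = root.findall(".//g:node", GRAPHML_NS)
    edges = root.findall(".//g:edge", GRAPHML_NS)
    return len(nodes), len(edges)


# A tiny hand-written GraphML sample, used here in place of real Baleen output.
SAMPLE = """<graphml xmlns="http://graphml.graphdrawing.org/xmlns">
  <graph id="G" edgedefault="directed">
    <node id="1"/>
    <node id="2"/>
    <edge id="3" source="1" target="2"/>
  </graph>
</graphml>"""

print(count_graphml_elements(SAMPLE))  # (2, 1)
```

To inspect a real output file, read its contents and pass the text to the same function.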
Alternative formats for RDF output are:
- RDF_XML - Standard RDF XML serialisation
- TURTLE - Terse RDF Triple Language. Output is similar in form to SPARQL
- RDF_XML_ABBREV - Abbreviated RDF XML serialisation
- N_TRIPLES - Each line is a triple in the form "Subject Predicate Object ."
- RDF_JSON - A JSON representation of the RDF, see https://jena.apache.org/documentation/io/rdf-json.html
- JSONLD - JSON for Linked Data, see https://json-ld.org/
- N3 - Notation3, a human-readable triple format.
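To make the N_TRIPLES description concrete, the sketch below formats a single triple as one N-Triples line. The helper name `to_ntriple` and the example.org URIs are made up for illustration and are not taken from Baleen's schema.

```python
def to_ntriple(subject, predicate, obj, literal=False):
    """Format one triple as an N-Triples line: 'Subject Predicate Object .'"""

    def iri(value):
        return "<" + value + ">"

    if literal:
        # Escape backslashes first, then quotes and line breaks.
        escaped = (
            obj.replace("\\", "\\\\")
            .replace('"', '\\"')
            .replace("\n", "\\n")
            .replace("\r", "\\r")
        )
        obj_term = '"' + escaped + '"'
    else:
        obj_term = iri(obj)
    return iri(subject) + " " + iri(predicate) + " " + obj_term + " ."


# Hypothetical URIs, chosen only to show the shape of a line.
line = to_ntriple(
    "http://example.org/entity/1",
    "http://example.org/schema#value",
    "London",
    literal=True,
)
print(line)
# <http://example.org/entity/1> <http://example.org/schema#value> "London" .
```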
This example outputs the graph representation to the Neo4J graph database. You can run the service in Docker with the following command.
docker run -d -p 7474:7474 -p 7687:7687 neo4j:3.0
You must set a password for the root user neo4j in the UI at http://localhost:7474. The example assumes you set it to neopass, but this can be changed in the configuration below. Run the example with:
    consumers:
    #- class: graph.Neo4JDocumentGraphConsumer
    - class: graph.Neo4JEntityGraphConsumer
      #closeAfterEveryDocument: true
      #url: bolt://localhost:7687
      #username: neo4j
      password: neopass
      filterFeatures:
      - isNormalised
      valueStrategy:
      - gender
      - Mode
      - geoJson
      - Mode
      - type
      - Mode
      - relationshipType
      - Mode
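The valueStrategy entries appear to pair a property name with the strategy used to choose one value when an entity has several candidates for that property. Assuming Mode means "take the most frequent value" (an assumption based on the name, not something this page confirms), the idea can be sketched in Python as follows. The function name `mode_value` is hypothetical and this is not Baleen's implementation.

```python
from collections import Counter


def mode_value(values):
    """Return the most frequent value, or None if there are no values.

    Ties go to the value encountered first, following Counter's ordering.
    """
    if not values:
        return None
    return Counter(values).most_common(1)[0][0]


# Three mentions of the same entity disagree on gender; the mode wins.
print(mode_value(["female", "male", "female"]))  # female
```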
This example outputs the higher-level entity graph to a graph database using the Apache TinkerPop abstraction layer on top of OrientDB. It currently works only with the OrientDB version 3 release candidate, and may become obsolete soon. To run OrientDB 3.0.0RC1 in Docker use:
docker run -d -p 2424:2424 -p 2480:2480 -e ORIENTDB_ROOT_PASSWORD=rootpwd orientdb:3.0.0RC1
You must create a database named baleen from the user interface at http://localhost:2480.
Note that this example requires the graph drivers to be on the classpath. If running from the code, this can be done by adding a Maven dependency on:
    <dependency>
      <groupId>com.orientechnologies</groupId>
      <artifactId>orientdb-gremlin</artifactId>
      <version>3.0.0RC1</version>
    </dependency>
For convenience, this dependency is included, commented out, in baleen-graph/pom.xml.
When running from the command line, you must download the jars from:
- http://central.maven.org/maven2/com/orientechnologies/orientdb-core/3.0.0RC1/orientdb-core-3.0.0RC1.jar
- http://central.maven.org/maven2/com/orientechnologies/orientdb-client/3.0.0RC1/orientdb-client-3.0.0RC1.jar
- http://central.maven.org/maven2/com/orientechnologies/orientdb-gremlin/3.0.0RC1/orientdb-gremlin-3.0.0RC1.jar
- http://central.maven.org/maven2/net/java/dev/jna/jna/4.5.0/jna-4.5.0.jar
- http://central.maven.org/maven2/net/java/dev/jna/jna-platform/4.5.0/jna-platform-4.5.0.jar
- http://central.maven.org/maven2/org/xerial/snappy/snappy-java/1.1.0.1/snappy-java-1.1.0.1.jar
and include them on the classpath (for example in a folder named "orient").
Baleen can then be run with:
java -cp "baleen.jar:orient/*" uk.gov.dstl.baleen.runner.Baleen
or on Windows
java -cp "baleen.jar;orient/*" uk.gov.dstl.baleen.runner.Baleen
(See Using-Third-Party-Components for more information on running Baleen with third party jars.)
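The two commands above differ only in the classpath separator. If Baleen is launched from a script, the separator can be taken from the platform instead of being hard-coded. This is a minimal sketch: the function name `baleen_command` is hypothetical, and the jar name and the "orient" folder are the ones assumed in the example above.

```python
import os


def baleen_command(jar="baleen.jar", lib_dir="orient", pathsep=os.pathsep):
    """Build the java command line, joining classpath entries with pathsep.

    os.pathsep is ':' on Linux and macOS and ';' on Windows.
    """
    classpath = pathsep.join([jar, lib_dir + "/*"])
    return ["java", "-cp", classpath, "uk.gov.dstl.baleen.runner.Baleen"]


print(" ".join(baleen_command(pathsep=":")))
# java -cp baleen.jar:orient/* uk.gov.dstl.baleen.runner.Baleen
```

Passing the returned list to `subprocess.run` avoids shell expansion of `orient/*`, so the wildcard reaches the JVM unchanged and Java expands it itself.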
The OrientDB consumer is added to the pipeline file as follows:
    consumers:
    - class: graph.EntityGraphConsumer
      graphConfig: ./graph/orient.properties
or
    consumers:
    - class: graph.DocumentGraphConsumer
      graphConfig: ./graph/orient.properties
where ./graph/orient.properties is a text file containing:
    gremlin.graph=org.apache.tinkerpop.gremlin.orientdb.OrientGraph
    orient-url=remote:localhost/baleen
    orient-user=root
    orient-pass=rootpwd
Note that on Windows "localhost" may need to be replaced with the Docker machine IP address.
Baleen's output can be written as RDF using a simple OWL schema based on the document and entity graph structures described above, using the file.Rdf or file.RdfEntityGraph consumers as follows:
    consumers:
    - class: file.Rdf
      outputDirectory: ./output_rdf
      format: RDF_XML
    - class: file.RdfEntityGraph
      outputDirectory: ./output_entity_rdf
      format: RDF_XML
The supported formats are:
- RDF_XML - Standard RDF XML serialisation
- TURTLE - Terse RDF Triple Language. Output is similar in form to SPARQL
- RDF_XML_ABBREV - Abbreviated RDF XML serialisation
- N_TRIPLES - Each line is a triple in the form "Subject Predicate Object ."
- RDF_JSON - A JSON representation of the RDF, see https://jena.apache.org/documentation/io/rdf-json.html
- JSONLD - JSON for Linked Data, see https://json-ld.org/
- N3 - Notation3, a human-readable triple format.
This example represents the extracted information as RDF and stores it in a triple store.
To run this example you must have an instance of Fuseki running with the admin password pw123, and you must create datasets named baleen_entity and baleen_document through the user interface, which can be accessed at localhost:3030 with the credentials admin:pw123. You can run Fuseki with Docker:
docker run -d -p 3030:3030 -e ADMIN_PASSWORD=pw123 stain/jena-fuseki:3.6.0
    consumers:
    - class: rdf.RdfEntityGraphConsumer
      query: http://localhost:3030/baleen_entity/query
      update: http://localhost:3030/baleen_entity/update
      store: http://localhost:3030/baleen_entity/data
      filterFeatures:
      - isNormalised
    - class: rdf.RdfDocumentGraphConsumer
      query: http://localhost:3030/baleen_document/query
      update: http://localhost:3030/baleen_document/update
      store: http://localhost:3030/baleen_document/data
      filterFeatures:
      - isNormalised
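Once the pipeline has run, one way to confirm that triples reached Fuseki is to send a SPARQL count query to the query endpoint configured above. The sketch below uses only the Python standard library and the standard SPARQL protocol (a GET request with a `query` parameter). The function names are hypothetical, and it assumes that the query endpoint can be read without authentication and that the data is in the default graph.

```python
import json
import urllib.parse
import urllib.request

# Counts every triple in the default graph of the dataset.
COUNT_QUERY = "SELECT (COUNT(*) AS ?n) WHERE { ?s ?p ?o }"


def build_count_request(query_endpoint):
    """Build a SPARQL protocol GET request asking for JSON results."""
    url = query_endpoint + "?" + urllib.parse.urlencode({"query": COUNT_QUERY})
    return urllib.request.Request(
        url, headers={"Accept": "application/sparql-results+json"}
    )


def count_triples(query_endpoint):
    """Send the count query and return the number of triples as an int."""
    with urllib.request.urlopen(build_count_request(query_endpoint)) as response:
        results = json.load(response)
    return int(results["results"]["bindings"][0]["n"]["value"])


# Example call (requires the Fuseki container from this page to be running):
# count_triples("http://localhost:3030/baleen_entity/query")
```

A count of zero after processing documents suggests that the consumer did not write to that dataset, or that the data was stored in a named graph.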