FOAF Site Integration In Stanbol

This project contains the results of the FOAF (Friend-of-a-Friend) ReferenceSite created for Stanbol. It also describes the steps needed to configure the foaf-site in the Stanbol Entityhub and to use it during the enhancement phase, through an EntityhubLinking engine, to enhance content. The sections below give a step-by-step guide to the FOAF site integration process.

Selection of a FOAF Datasource

FOAF data is mainly provided by Linked Data projects. The FOAF project wiki [1] lists several datasources, most of them social networking sites that offer their data in FOAF format. However, most of these projects are out of date, so they were not recommended as datasources for my project. The 2 best options were:

  1. The Billion Triple Challenge (BTC) 2012 project [2]:
    A web-crawled dataset that includes data from the dbpedia, freebase, datahub, timbl and rest datasources. It contains a sufficient amount of data (1436545545 quads), including FOAF data, and is fairly up to date.
  2. WebDataCommons project [3]:
    A linked-data project with a dataset (1079175202 quads) created in August 2012. However, the project does not specify the sources of the data.

After a discussion with the Stanbol community and other related FOAF communities, I selected the btc2012 dataset because it contains a sufficiently up-to-date FOAF dataset. The following section describes how I developed a ReferenceSite in Stanbol with the selected dataset.
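To get a rough feel for how much FOAF data a given N-Quads file contains before committing to it, a small script can count the statements whose predicate lies in the FOAF namespace. The following is only a sketch: the function names are my own, the sample statements are invented, and the line splitting is a simplification that is not a full N-Quads parser.

```python
import gzip

FOAF_NS = "http://xmlns.com/foaf/0.1/"


def count_foaf_quads(lines):
    """Return (total, foaf) statement counts for an iterable of N-Quads lines."""
    total = foaf = 0
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        total += 1
        # A statement is: subject predicate object [graph] .
        # Subject and predicate contain no spaces, so two splits isolate them.
        parts = line.split(None, 2)
        if len(parts) == 3 and parts[1].startswith("<" + FOAF_NS):
            foaf += 1
    return total, foaf


def count_foaf_quads_in_file(path):
    """Same count, streaming a gzip-compressed .nq.gz file line by line."""
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as handle:
        return count_foaf_quads(handle)


# Invented sample statements, for illustration only.
sample = [
    '<http://example.org/a> <http://xmlns.com/foaf/0.1/name> "Alice" <http://example.org/g> .',
    '<http://example.org/a> <http://purl.org/dc/terms/title> "Page" <http://example.org/g> .',
]
print(count_foaf_quads(sample))  # prints (2, 1)
```

For a downloaded dump, `count_foaf_quads_in_file("data-4.nq.gz")` would stream the file without loading it into memory.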

Creating a ReferenceSite with FOAF dataset

For this purpose I used the generic-rdf indexing tool in the Stanbol project. Some tasks, such as FOAF filtering, required additional configuration files to be copied into the tool from other sources. The guide below explains how to develop a FOAF datasite as a custom vocabulary integration in Stanbol.

### Building the indexing tool

The generic-rdf indexing tool can be found in the Stanbol trunk at [4]. Build it from source using mvn clean install. This creates the org.apache.stanbol.entityhub.indexing.genericrdf-0.12.0-SNAPSHOT.jar file in the target directory. Then initialize the tool with the command below:

java -jar org.apache.stanbol.entityhub.indexing.genericrdf-0.12.0-SNAPSHOT.jar init

The initialization command above creates the directories that the tool uses during the indexing process. The main directories are:

/indexing
	/config {the main configuration directory}
	/destination {the target directory of Solr indexing files and extracted entity data}
	/dist  {the results of the indexing process including a reference-site data-file and solr-index}
	/resources {the rdf datasources to be used for the indexing process}
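
After running init it can be useful to confirm that the layout above is in place before editing any configuration. The sketch below checks only the four directory names listed above; the function name and the idea of checking relative to a base directory are my own assumptions.

```python
from pathlib import Path

# Directory names taken from the layout listed above.
EXPECTED_DIRS = ("config", "destination", "dist", "resources")


def missing_indexing_dirs(base="."):
    """Return the expected sub-directories of <base>/indexing that do not exist."""
    root = Path(base) / "indexing"
    return [name for name in EXPECTED_DIRS if not (root / name).is_dir()]


# An empty list means every expected directory is present.
print(missing_indexing_dirs("."))
```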

For demo purposes I have uploaded the pre-built jar file and the indexing directories as they are after running the init command. The files uploaded here under the generic-rdf/indexing directory are pre-configured with the settings required for FOAF filtering and indexing. The steps below describe each configuration change made to achieve FOAF filtering on the btc2012 dataset.

### Configuring the tool to filter foaf entities

indexing/config is the main configuration directory of the tool, and the main configuration file is indexing.properties.
To give the EntityHub site a unique name, set the 'name' value in indexing.properties to a suitable unique site name (e.g. foaf-site).
FOAF filtering requires editing the entityDataIterable field to support iteration over FOAF entities, as below:
entityDataIterable=org.apache.stanbol.entityhub.indexing.source.jenatdb.RdfIndexingSource,config:indexingsource,bnode:true
(Note that the additional bnode:true parameter above is enabled so that blank nodes in the dataset are processed.)
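
If you prefer to script these two edits rather than make them by hand, a small helper can set a key in the properties text. This is a minimal sketch that assumes simple `key=value` lines; it does not handle line continuations or the `:` separator that Java properties files also allow, and the starting content in the example is invented.

```python
def set_property(text, key, value):
    """Return properties text with `key` set to `value`, replacing or appending it."""
    out, found = [], False
    for line in text.splitlines():
        stripped = line.lstrip()
        is_comment = stripped.startswith(("#", "!"))
        name = stripped.split("=", 1)[0].strip()
        if not is_comment and "=" in stripped and name == key:
            out.append(f"{key}={value}")  # replace the existing entry
            found = True
        else:
            out.append(line)  # keep comments and unrelated entries untouched
    if not found:
        out.append(f"{key}={value}")
    return "\n".join(out) + "\n"


# Invented starting content, for illustration only.
original = "name=changeme\n# entityDataIterable is set below\n"
updated = set_property(original, "name", "foaf-site")
updated = set_property(
    updated,
    "entityDataIterable",
    "org.apache.stanbol.entityhub.indexing.source.jenatdb.RdfIndexingSource,"
    "config:indexingsource,bnode:true",
)
print(updated)
```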

The entityDataIterable configuration above requires 2 additional configuration files: indexingsource.properties and propertyfilter.config. These files are not included in the generic-rdf indexing tool by default. For filtering, you can use the 2 files from the freebase indexing tool at [5]. Copy the 2 files into indexing/config and add the entry below to propertyfilter.config:
foaf:*
This entry instructs the tool to filter entities from the datasource that define at least one property in the foaf namespace.
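
One way to read the `foaf:*` entry is as a namespace prefix test on property URIs. The sketch below illustrates that reading only; it is not Stanbol's filter implementation, the function names are hypothetical, and it assumes that `foaf` maps to the standard FOAF namespace URI.

```python
FOAF_NS = "http://xmlns.com/foaf/0.1/"
PREFIXES = {"foaf": FOAF_NS}  # assumed prefix mapping


def pattern_to_uri_prefix(pattern, prefixes=PREFIXES):
    """Expand a wildcard entry such as 'foaf:*' to the namespace URI it covers."""
    prefix, sep, local = pattern.partition(":")
    if sep != ":" or local != "*" or prefix not in prefixes:
        raise ValueError(f"unsupported pattern: {pattern!r}")
    return prefixes[prefix]


def defines_matching_property(predicates, patterns):
    """True if at least one predicate URI falls under one of the patterns."""
    uri_prefixes = tuple(pattern_to_uri_prefix(p) for p in patterns)
    return any(predicate.startswith(uri_prefixes) for predicate in predicates)


person = [FOAF_NS + "name", "http://purl.org/dc/terms/title"]
page = ["http://purl.org/dc/terms/title"]
print(defines_matching_property(person, ["foaf:*"]))
print(defines_matching_property(page, ["foaf:*"]))
```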

To index only entities of type foaf:Person and foaf:Organization, activate 'values' in the entityTypes.properties file as below:
values=foaf:Person;foaf:Organization
Check that the entity filtering defined in entityTypes.properties is enabled as an entityProcessor in indexing.properties by searching for the entry below:
entityProcessor=org.apache.stanbol.entityhub.indexing.core.processor.FieldValueFilter,config:entityTypes;
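
The effect of the `values` line can be pictured as a set-membership test on an entity's types. The following is an illustrative sketch, not the FieldValueFilter code: the function names are hypothetical, and the prefix mapping and the example type URIs are assumptions.

```python
PREFIXES = {"foaf": "http://xmlns.com/foaf/0.1/"}  # assumed prefix mapping


def parse_type_values(line, prefixes=PREFIXES):
    """Parse 'values=foaf:Person;foaf:Organization' into a set of full type URIs."""
    key, _, raw = line.partition("=")
    if key.strip() != "values":
        raise ValueError("expected a 'values=' line")
    uris = set()
    for item in raw.split(";"):
        item = item.strip()
        if not item:
            continue
        prefix, _, local = item.partition(":")
        uris.add(prefixes[prefix] + local)
    return uris


def keep_entity(entity_types, allowed_types):
    """Keep an entity when at least one of its types is in the allowed set."""
    return not allowed_types.isdisjoint(entity_types)


allowed = parse_type_values("values=foaf:Person;foaf:Organization")
print(keep_entity(["http://xmlns.com/foaf/0.1/Person"], allowed))
print(keep_entity(["http://xmlns.com/foaf/0.1/Document"], allowed))
```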

To match entity mentions in the content and link them to entities in the FOAF dataset, certain FOAF properties must be chosen as the fields used to match entities, and their values copied to label fields in the Entityhub. For this purpose I used FOAF fields such as foaf:name, foaf:firstName and foaf:givenName as label fields. Configure these entries in mappings.txt as below:

foaf:name > rdfs:label
foaf:nick > rdfs:label
foaf:givenName > rdfs:label
foaf:familyName > rdfs:label
foaf:firstName > rdfs:label
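
To see what these copy rules achieve, the sketch below applies `source > target` lines to a single entity represented as a dictionary of field values. It only illustrates the idea of copying several FOAF name fields into one label field; the function names and the sample entity are invented, and keeping the source field alongside the copy is an assumption of this sketch rather than a statement about Stanbol's mapper.

```python
def parse_mappings(text):
    """Parse 'source > target' lines into (source, target) pairs."""
    pairs = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or ">" not in line:
            continue
        source, target = (part.strip() for part in line.split(">", 1))
        pairs.append((source, target))
    return pairs


def apply_mappings(entity, pairs):
    """Copy the values of each source field into its target field."""
    result = {field: list(values) for field, values in entity.items()}
    for source, target in pairs:
        for value in entity.get(source, []):
            bucket = result.setdefault(target, [])
            if value not in bucket:  # avoid duplicate labels
                bucket.append(value)
    return result


mappings = parse_mappings("foaf:name > rdfs:label\nfoaf:nick > rdfs:label\n")
entity = {"foaf:name": ["Tim Berners-Lee"], "foaf:nick": ["timbl"]}
print(apply_mappings(entity, mappings)["rdfs:label"])
```

With all name variants collected under rdfs:label, a linking engine only needs to search one label field.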

In the enhancement phase, the Stanbol engine uses the redirect field to traverse between entities. FOAF has 2 main fields for linking similar or related entities: rdfs:seeAlso and owl:sameAs. Stanbol allows only 1 redirect field, so to use both they have to be merged. I therefore merge both fields into fise:redirects, a field used internally by Stanbol, and use it as the single redirect field in the linking engine configuration explained later.

Following are the extra configurations to be added to mappings.txt in the indexing tool:

rdfs:seeAlso | d=entityhub:ref
owl:sameAs | d=entityhub:ref

rdfs:seeAlso > fise:redirects
owl:sameAs > fise:redirects

Running the Indexing Tool and Deploying the FOAF dataset to Stanbol

All the configurations needed to filter and index a FOAF dataset are now done. Place the FOAF dataset files to be indexed in indexing/resources/rdfdata. For this I used the datahub/data-4.nq [6] and timbl/data-6.nq [7] datasets available at the btc2012 project site. Download the data files from the given links and copy them to the indexing/resources/rdfdata directory before indexing.
Now you can run the indexing tool using the command below:
java -Xmx1024m -jar org.apache.stanbol.entityhub.indexing.genericrdf-0.12.0-SNAPSHOT.jar index

This runs the entity extraction and indexing process and creates 2 files in the indexing/dist directory. Copy the generated org.apache.stanbol.data.site.foaf-site-1.0.0.jar to the ${stanbol-server}/fileinstall directory, and copy the generated foaf-site.solrindex.zip to the ${stanbol-server}/datafiles directory.
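
The two copy steps can be scripted so that a missing artifact is reported before anything is deployed. This is a sketch under assumptions: the function name is my own, and the Stanbol server directory in the usage note is a placeholder for your own `${stanbol-server}` location.

```python
import shutil
from pathlib import Path

# Generated file name -> sub-directory of the Stanbol server directory.
ARTIFACTS = {
    "org.apache.stanbol.data.site.foaf-site-1.0.0.jar": "fileinstall",
    "foaf-site.solrindex.zip": "datafiles",
}


def deploy(dist_dir, stanbol_home):
    """Copy the indexing results into the Stanbol server and return the new paths."""
    sources = {Path(dist_dir) / name: subdir for name, subdir in ARTIFACTS.items()}
    # Check everything first so a partial deployment is avoided.
    for source in sources:
        if not source.is_file():
            raise FileNotFoundError(source)
    copied = []
    for source, subdir in sources.items():
        target_dir = Path(stanbol_home) / subdir
        target_dir.mkdir(parents=True, exist_ok=True)
        copied.append(Path(shutil.copy2(source, target_dir)))
    return copied
```

A call such as `deploy("indexing/dist", "/path/to/stanbol-server")` would then perform both copies, where the second argument is a placeholder path.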

Launch the Stanbol server using the full launcher and access the foaf-site at: localhost:8080/entityhub/site/foaf-site

The next step is to create an Enhancement Engine in Stanbol that uses the FOAF site created above.

Configuring a FOAF Linking Engine & an Enhancement Chain

After successfully deploying the foaf-site, I configured an enhancement chain to perform content enhancements using the foaf-site created above. Most of these configurations can be made via the OSGi console configuration manager of Apache Stanbol, accessible at: http://localhost:8080/system/console/configMgr

Following are the enhancement engine configurations required to create a FOAF site linking engine.

  • Configure a new entityhub-linking-engine [8] with the configuration changes below:
Name : foaf-site-linking
Referenced site : foaf-site
Redirect field : fise:redirects
Case sensitivity : disabled

  • Configure a weighted enhancement chain [9] that uses the foaf-site-linking engine created above, with the configuration changes below. I added several available engines to the chain to perform language detection and natural language processing before FOAF linking:
Name : foaf-site-chain
Engines : langdetect, opennlp-sentence, opennlp-token, opennlp-pos, foaf-site-linking

Now you can invoke the new foaf-site-chain by going to http://localhost:8080/enhancer/chain/foaf-site-chain and entering test content such as: "Tim Berners-Lee is the inventor of World Wide Web".

You can also try it without the Stanbol web interface, using a REST client such as curl:
curl -X POST -H "Accept: text/turtle" -H "Content-type: text/plain" --data "Tim Berners-Lee is the inventor of World Wide Web" http://localhost:8080/enhancer/chain/foaf-site-chain
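
The same request can be sent from a script. The sketch below mirrors the curl call above using only the Python standard library; the function names are my own, and it assumes Stanbol is running locally on port 8080 with the foaf-site-chain configured as described.

```python
import urllib.request

CHAIN_URL = "http://localhost:8080/enhancer/chain/foaf-site-chain"


def build_request(text, url=CHAIN_URL):
    """Build the POST request: plain text in, Turtle out."""
    return urllib.request.Request(
        url,
        data=text.encode("utf-8"),
        headers={"Accept": "text/turtle", "Content-Type": "text/plain"},
        method="POST",
    )


def enhance(text, url=CHAIN_URL):
    """Send the text to the enhancement chain and return the Turtle response."""
    with urllib.request.urlopen(build_request(text, url), timeout=30) as response:
        return response.read().decode("utf-8")


request = build_request("Tim Berners-Lee is the inventor of World Wide Web")
print(request.get_method(), request.full_url)
```

Calling `enhance(...)` with the same text performs the actual request and needs the server to be up; `build_request` alone does not touch the network.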

If the configuration is correct, Tim Berners-Lee and World Wide Web should be identified as entities from the foaf-site dataset. Please refer to the screenshot attached here for the demo results. This foaf-site-linking engine will serve as the base of the foaf-disambiguation engine to be created in the 2nd phase of the GSoC project.

[1] http://www.w3.org/wiki/FoafSites
[2] http://km.aifb.kit.edu/projects/btc-2012/
[3] http://webdatacommons.org/
[4] https://svn.apache.org/repos/asf/stanbol/trunk/entityhub/indexing/genericrdf
[5] https://svn.apache.org/repos/asf/stanbol/trunk/entityhub/indexing/freebase
[6] http://km.aifb.kit.edu/projects/btc-2012/datahub/data-4.nq.gz
[7] http://km.aifb.kit.edu/projects/btc-2012/timbl/data-6.nq.gz
[8] https://stanbol.apache.org/docs/trunk/components/enhancer/engines/entityhublinking
[9] http://stanbol.apache.org/docs/trunk/components/enhancer/chains/weightedchain.html