Relation Extraction
Additional relationship extraction functionality has been added in Baleen 2.6. This includes some simple high-recall, low-precision relationship extraction annotators based on the co-occurrence of entities in sentences or documents, a number of pattern-based algorithms, and a more complex annotator based on the ReNoun paper.
Two new annotators have been added which assign a relationship between entities based on co-occurrence in a sentence or document.
- `DocumentRelationshipAnnotator` assigns a relationship to entities which occur in the same document but in different sentences, within a configurable sentence distance. Additionally, it may be configured to only create relationships between specific entity types. It adds a "sentence distance" to the annotation, which may later be used to assign a confidence to the relation.
- `SentenceRelationshipAnnotator` assigns relationships to entities of configurable types which appear in the same sentence. These relations have a sentence distance of 0, but may also have values set for the word distance (the number of words between the entities) or the dependency distance (the length of the shortest path between the two entities in the dependency graph).
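To make the word distance measure concrete, here is a minimal sketch in plain Python (not Baleen code — Baleen computes these distances internally from its UIMA annotations); the tokenisation and entity spans are illustrative assumptions:

```python
# Illustrative sketch only: computes the number of tokens strictly
# between two entity mentions, given their token spans.

def word_distance(tokens, span_a, span_b):
    """span_a and span_b are (start, end) token indices, end exclusive."""
    first, second = sorted([span_a, span_b])
    return max(0, second[0] - first[1])

tokens = "Alice went to Paris last June".split()
person = (0, 1)    # "Alice"
location = (3, 4)  # "Paris"

print(word_distance(tokens, person, location))  # 2 ("went", "to")
```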
A specialised `MongoRelations` consumer has been developed to quickly analyse the relationships derived from sentence and document co-occurrence. It may be used as in the simple example pipeline below, which uses OpenNLP to detect people and locations before assigning relationships and writing to MongoRelations for analysis.
```yaml
mongo:
  db: baleen_simple_relations
  host: localhost

collectionreader:
  class: FolderReader
  folders: ..\corpora\re3d

annotators:
  - language.OpenNLP
  - class: stats.OpenNLP
    model: ..\models\en-ner-person.bin
    type: uk.gov.dstl.baleen.types.common.Person
  - class: stats.OpenNLP
    model: ..\models\en-ner-location.bin
    type: uk.gov.dstl.baleen.types.semantic.Location
  - relations.DocumentRelationshipAnnotator
  - relations.SentenceRelationshipAnnotator

consumers:
  - MongoRelations
```
The existing `NPVNP` and `SimpleInteraction` relationship annotators have been complemented by three new relationship annotators.
- `DependencyRelationshipAnnotator` is a restricted version of the `SentenceRelationshipAnnotator` which additionally requires that there is a dependency path between the two entities in the sentence.
- `RegExRelationAnnotator` captures simple cases using regular expressions applied to the words between entities. For example, `( :Person: )\s+(?:visit\w*|went to)\s+( :Location: )` would create a relationship between a person and a location that they visit(ed) or went to.
- `PartOfSpeechRelationshipAnnotator` uses parts of speech in a customised regular expression system. For example, `( NNP ).*( VBD ).*( NNP )` will extract a proper noun, followed by a past tense verb, followed by another proper noun, with any text in between.
See the relevant Javadoc for further information, including the list of extended Penn Treebank tags that may be used with the `PartOfSpeechRelationshipAnnotator`.
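Baleen applies these patterns to its own annotated representation, but the flavour of a `RegExRelationAnnotator` pattern can be illustrated with a plain Python regular expression. In this sketch the entity mentions have already been replaced by `:Person:`/`:Location:` placeholder markers, standing in for the annotator's real entity matching:

```python
import re

# Illustration only: the placeholders stand in for real Person/Location
# annotations that RegExRelationAnnotator resolves in the document.
pattern = re.compile(r"(:Person:)\s+(?:visit\w*|went to)\s+(:Location:)")

text = ":Person: visited :Location: last summer"
print(bool(pattern.search(text)))  # True
```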
Note that these relationship extraction annotators require dependency parsing to function. For example, the pipeline below will generate relationships between people and/or locations which depend on each other within a sentence.
```yaml
mongo:
  db: baleen_dependency_relations
  host: localhost

collectionreader:
  class: FolderReader
  folders: ..\corpora\re3d

annotators:
  - language.OpenNLP
  - language.MaltParser
  - class: stats.OpenNLP
    model: ..\models\en-ner-person.bin
    type: uk.gov.dstl.baleen.types.common.Person
  - class: stats.OpenNLP
    model: ..\models\en-ner-location.bin
    type: uk.gov.dstl.baleen.types.semantic.Location
  - relations.DependencyRelationshipAnnotator

consumers:
  - MongoRelations
```
Baleen 2.6 contains an implementation of a system based on the ReNoun paper: https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/42849.pdf
The aim of the ReNoun system is to extract facts in the form (subject, attribute, object) where the attribute is expressed in noun form. For example,
- (NPR, legal affairs correspondent, Nina Totenberg)
- (Princeton, economist, Paul Krugman)
- (Google, CEO, Larry Page)

In Baleen these are expressed as relations of the form (subject, value, target).
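The mapping from ReNoun's triples onto Baleen's relation form can be sketched as follows (the field names here are illustrative, not Baleen's actual type system):

```python
from typing import NamedTuple

class Relation(NamedTuple):
    subject: str  # ReNoun subject
    value: str    # ReNoun attribute, expressed as a noun phrase
    target: str   # ReNoun object

facts = [
    Relation("NPR", "legal affairs correspondent", "Nina Totenberg"),
    Relation("Princeton", "economist", "Paul Krugman"),
    Relation("Google", "CEO", "Larry Page"),
]

for f in facts:
    print(f"({f.subject}, {f.value}, {f.target})")
```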
The system has four stages to produce facts:
- Seed Fact Extraction - this takes a list of valid attribute types and uses ‘handcrafted’ dependency patterns to extract a set of ‘known facts’ for bootstrapping.
- Extract Pattern Learning - uses the ‘known facts’ to extract further patterns that would also extract the known facts from the corpus.
- Fact Extraction - uses the learned patterns to extract more facts from the corpus.
- Fact Scoring - assigns a measure of confidence in the extracted fact based on a scoring of the patterns used.
Each of these stages can be run as its own Baleen pipeline or job, using Mongo to facilitate communication of data between each stage.
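The bootstrapping idea behind the first three stages can be sketched in a few lines of illustrative Python. Note this is a toy: the real system learns dependency-parse patterns, whereas the sketch below uses flat string templates purely to show how known facts yield patterns, which in turn yield new facts:

```python
import re

# Stage 1 output: a (verified) seed fact.
seed_facts = {("Google", "CEO", "Larry Page")}

corpus = [
    "Google CEO Larry Page announced a reorganisation.",
    "Princeton economist Paul Krugman announced a reorganisation.",
]

# Stage 2 (pattern learning): blank out the fact slots in any sentence
# that expresses a seed fact, keeping the rest as a template.
templates = []
for subj, attr, obj in seed_facts:
    fragment = f"{subj} {attr} {obj}"
    for sentence in corpus:
        if fragment in sentence:
            templates.append(sentence.replace(fragment, "SLOT"))

# Stage 3 (fact extraction): apply each template as a non-greedy regex
# to recover new (subject, attribute, object) candidates from the corpus.
facts = set()
for template in templates:
    regex = re.escape(template).replace("SLOT", r"(.+?) (.+?) (.+?)")
    for sentence in corpus:
        match = re.fullmatch(regex, sentence)
        if match:
            facts.add(match.groups())

print(sorted(facts))
```

Stage 4 then scores each extracted fact by how reliable its extraction patterns are.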
This example assumes that the corpus is stored as text files in a folder ./files/ relative to the location of the baleen.jar file.
This example requires Mongo to be running on localhost:27017.
The scoring requires the GloVe word vector model, which can be downloaded from https://nlp.stanford.edu/projects/glove/ . For example, download glove.6B.zip and unzip it to the folder ./models/ relative to baleen.jar.
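The GloVe model is a plain text file of word vectors, which the scoring stage uses to compare words. The sketch below shows the file format and a cosine similarity between entries; a tiny inline sample with made-up vectors stands in for glove.6B.300d.txt, and this is not Baleen's actual scoring code:

```python
import io
import math

# A tiny inline sample in GloVe's text format: word followed by the
# components of its vector, space-separated, one word per line.
sample = io.StringIO(
    "ceo 0.9 0.1 0.0\n"
    "chairman 0.8 0.2 0.1\n"
    "banana 0.0 0.1 0.9\n"
)

vectors = {}
for line in sample:
    word, *values = line.split()
    vectors[word] = [float(v) for v in values]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

# Semantically related words should score higher than unrelated ones.
print(cosine(vectors["ceo"], vectors["chairman"]) >
      cosine(vectors["ceo"], vectors["banana"]))  # True
```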
If required, an 'attributes' file may be provided to limit the attributes to a specific set of nouns. An example is given below, but this should be tuned to the corpus, or omitted to match all nouns.
```
CEO
COO
CFO
Secretary
chief executive officer
chief operating officer
chief financial officer
Chief administrative officer
Chief analytics officer
Chief brand officer
Chief business development officer
Chief business officer
Chief commercial officer
Chief communications officer
Chief compliance officer
Chief creative officer
Chief customer officer
Chief data officer
Chief design officer
Chief digital officer
Chief diversity officer
Chief content officer
Chief events officer
Chief executive officer
Chief experience officer
Chief financial officer
Chief gaming officer
Chief genealogical officer
Chief human resources officer
Chief information officer
Chief information officer (higher education)
Chief information security officer
Chief innovation officer
Chief investment officer
Chief knowledge officer
Chief learning officer
Chief legal officer
Chief marketing officer
Chief operating officer
Chief privacy officer
Chief process officer
Chief product officer
Chief reputation officer
Chief research officer
Chief restructuring officer
Chief Revenue Officer
Chief risk officer
Chief science officer
Chief Scientific Officer
Chief security officer
Chief services officer
Chief strategy officer
Chief sustainability officer
Chief technology officer
Chief visibility officer
Chief visionary officer
Chief web officer
director
chairman
chairperson
president
owner
treasurer
board member
father
mother
brother
sister
wife
husband
partner
captain
chief
producer
coach
```
The seed generation step can be used if the dependency parser or the model is changed from the default MaltParser. The seed generation pipeline should be configured in 0_seed_generation.yml.
```yaml
mongo:
  db: baleen-renoun
  host: localhost

# Supply the default document of fact sentences
collectionreader:
  class: renoun.ReNounSeedDocument

annotators:
  # Ensure the language parsing is done in the pipeline
  - language.OpenNLP
  - language.MaltParser
  # ReNoun Seed Fact Extraction
  - class: renoun.ReNounSeedGenerator
    outputCollection: seedPatterns

# Save relations to Mongo
consumers:
  - class: MongoRelations
    collection: seeds
```
This pipeline is run using:
java -jar baleen.jar -p 0_seed_generation.yml
This pipeline extracts seed facts using a set of hand-crafted patterns for the given attributes. If a target attribute list is not known, there is also an option to use as attributes all nouns that match the patterns. The seed facts are stored as relations in Mongo. These should be sanity-checked and verified, removing any that are not valid, before moving on to the next stage.
The seed extraction pipeline should be configured in 1_generated_seed_extraction.yml if the seed generation step was run.
```yaml
mongo:
  db: baleen-renoun
  host: localhost

# Read your corpus here
collectionreader:
  - class: FolderReader
    folders:
      - ./files/

annotators:
  # Ensure the language parsing is done in the pipeline
  - language.OpenNLP
  - language.MaltParser
  # Perform your usual entity extraction here e.g.
  # ...
  # ...
  # ReNoun Seed Fact Extraction
  - class: renoun.ReNounGeneratedSeedsRelationshipAnnotator
    collection: seedPatterns
    # attributesFile: attributes.txt

# Save relations to Mongo
consumers:
  - class: MongoRelations
    collection: seedFacts
```
This pipeline is run using:
java -jar baleen.jar -p 1_generated_seed_extraction.yml
If the seed generation step was skipped, the default seeds can be used via the 1_default_seed_extraction.yml pipeline file.
```yaml
mongo:
  db: baleen-renoun
  host: localhost

# Read your corpus here
collectionreader:
  - class: FolderReader
    folders:
      - ./files/

annotators:
  # Ensure the language parsing is done in the pipeline
  - language.OpenNLP
  - language.MaltParser
  # Perform your usual entity extraction here e.g.
  # ...
  # ...
  # ReNoun Seed Fact Extraction
  - class: renoun.ReNounDefaultSeedsRelationshipAnnotator
    collection: seedPatterns
    # attributesFile: attributes.txt

# Save relations to Mongo
consumers:
  - class: MongoRelations
    collection: seedFacts
```
This pipeline is run using:
java -jar baleen.jar -p 1_default_seed_extraction.yml
The attribute list (if supplied) and the (refined) seed facts are used by this pipeline to generate more patterns that would have extracted these facts. These patterns are stored in Mongo.
Pattern learning can be configured in 2_pattern_learning.yml.
```yaml
mongo:
  db: baleen-renoun
  host: localhost

# Read your corpus here
collectionreader:
  - class: FolderReader
    folders:
      - ./files/

annotators:
  - language.OpenNLP
  - language.MaltParser
  # ReNoun Pattern Learning
  - class: renoun.ReNounPatternDataGenerator
    collection: seedFacts
    # outputCollection: custom
```
This pipeline is run using:
java -jar baleen.jar -p 2_pattern_learning.yml
Using the extended set of patterns, more facts/relations are extracted from the corpus to give the noun-based relations. These are stored as relations in Mongo and (optionally) in a specific collection for scoring.
This pipeline is configured in 3_fact_extraction.yml.
```yaml
mongo:
  db: baleen-renoun
  host: localhost

# Read your corpus here
collectionreader:
  - class: FolderReader
    folders:
      - ./files/

annotators:
  # Ensure the language parsing is done in the pipeline (done in default here)
  - language.OpenNLP
  - language.MaltParser
  # Perform your usual entity extraction here e.g.
  # ...
  # ...
  # ReNoun Fact Extraction
  - class: renoun.ReNounRelationshipAnnotator
    factCollection: renoun_facts
    # attributeFile: ./renoun/attributes

# Save relations to Mongo
consumers:
  - class: Mongo
    outputHistory: true
  - class: MongoRelations
```
This pipeline is run using:
java -jar baleen.jar -p 3_fact_extraction.yml
This optional post-process scores the facts, giving you more information about the confidence you should have in each extracted fact. It is configured in 4_fact_scoring.yml.
```yaml
mongo:
  db: baleen-renoun
  host: localhost

tasks:
  - class: renoun.ReNounScoring
    factCollection: renoun_facts
    model: ./models/glove.6B.300d.txt
```
which can be run as a Baleen job using:
java -jar baleen.jar -j 4_fact_scoring.yml