Skip to content

read CoreNLP and AIDA files into a Python class

Notifications You must be signed in to change notification settings

thuwyq/corenlp-xml-reader

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

52 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Corenlp-xml-reader documentation

.. py:module:: corenlp_xml_reader

Purpose

Stanford's CoreNLP tool suite is a full-featured tool for generating annotations in text like POS (part-of-speech) tags and the dependency parse.

The CoreNLP tool can output the annotations to xml files. Working with these files is a bit tricky: it is up to the reading program to rebuild the logical links between the various kinds of information (e.g. POS, parse, and coreference information, etc).

The format also has some questionable aspects. It uses one-based indexing for sentence and token ids, while character offsets are zero-based. Also, named entities and coreference chains don't have a consistent relationship to one another.

The corenlp_xml_reader provides an API in Python that simplifies access to CoreNLP's annotations and traversal of the document, while ironing out some of the inconsistencies.

Install

Basic install: pip install corenlp-xml-reader

Hackable install:

git clone https://github.com/enewe101/corenlp-xml-reader.git
cd corenlp-xml-reader
python setup.py develop

Example

Suppose we have the one-sentence document:

President Obama cannot run for a third term (but I think he wants to).

Let's assume that it has been processed by CoreNLP, creating the output file obama.txt.xml.

Instantiation

The first thing we do is import the module and get an AnnotatedText object.

>>> from corenlp_xml_reader import AnnotatedText as A
>>> xml = open('obama.txt.xml').read()
>>> annotated_text = A(xml)

Sentences

Usually you'll access parts of the document using the sentences list.

>>> len(annotated_text.sentences)
1
>>> sentence = annotated_text.sentences[0]
>>> sentence.keys()
['tokens', 'entities', 'references', 'mentions', 'root', 'id']

A Sentence is a special class that, for the most part, feels like a simple dict.

The tokens property is a list of the sentence's tokens:

>>> obama = sentence['tokens'][1]
>>> obama
' 0: Obama (10,14) NNP PERSON'
>>> term = sentence['tokens'][7]
>>> term
' 7: term (39,42) NN -'

Tokens

Tokens have properties corresponding to CoreNLP's annotations, plus some other stuff:

>>> obama.keys()
['word', 'character_offset_begin', 'character_offset_end', 'pos',
'lemma', 'sentence_id', 'entity_idx', 'speaker', 'mentions', 'parents',
'ner', 'id']

Named Entities

"Obama" is the name of a person, so, if CoreNLP is working well, it should pick that up. Named entity information is found in the ner property:

>>> obama['ner']
'PERSON'
>>> term['ner'] is None
True

POS Tags

Similarly we can check the part-of-speech:

>>> obama['pos']
'NNP'
>>> term['pos']
'NN'

Dependency Tree

We can traverse the dependency tree using the parents and children properties. In our example, "run" is the parent of "Obama" (because "Obama" is the subject (nsubj) of "run"):

>>> relation, parent = obama['parents'][0]
>>> relation
u'nsubj'
>>> parent
' 3: run (23,25) -'

If you're processing dependency trees, you'll often want to start with the head word (which is like the root of the sentence). Sentences have a special root property that stores the head word. Usually it's a verb:

>>> sentence['root']
' 3: run (23,25) -'

Coreference Chains

A coreference chain is a series of references to the same entity. In our example, "President Obama" and "he" are each mentions from the same coreference chain. We can access all the mentions of a coreference chain.

First, we can get the mention that "Obama" is part of:

>>> first_mention = obama['mentions'][0]
>>> first_mention['tokens']
[' 0: President (0,8) -', ' 1: Obama (10,14) PERSON']

Note that a token can be part of multiple mentions. For example, consider the phrase "Obama's pyjamas". If his pyjamas are mentioned multiple times, then there will be a coreference chain made for it, as well as for Obama himself. And in the phrase "Obama's pyjamas", the token "Obama" is both part of a mention corresponding to the 44th President of the United States, and part of a mention corresponding to some garments for sleeping.

Once we have gotten ahold of a mention, we can access the coreference chain that it belongs to, which is found in the mention's 'reference' property. Conversely, if we have accessed a coreference chain, we can find all of its mentions by looking at its 'mentions' property.

So, starting from the mention containing the token "Obama", we can get to the other mention ("he") like this:

>>> reference = first_mention['reference']
>>> len(reference['mentions'])
2
>>> second_mention = reference['mentions'][1]
>>> second_mention['tokens']
['12: he (57,58) -']

Mentions have various properties:

>>> first_mention.keys()
['head', 'end', 'reference', 'tokens', 'start', 'sentence_id']

In addition to the coreference chain ('reference'), we get the id of the sentence in which the mention is found, the list of token objects in the mention, the slice indices ('start' and 'end') for those tokens as they occur in the sentence's token list, and the head token of the mention.

References have various properties too:

>>> reference.keys()
['mentions', 'id', 'representative']

In addition to the mentions that are part of the coreference chain, we get an id for the coreference chain (unique on a per-article-basis), and a reference to the "representative" mention. The representative mention is the one that is deemed to have the fullest realization of the object's name. So in our example, the representative reference would be "President Obama", not "he". This is useful for getting the human-readable name to represent the coreference chain.

We can access all of the mentions or all of the coreference chains, for a given sentence, using its mentions and references properties.

>>> len(sentence['mentions'])
2
>>> len(sentence['references'])
1

One thing to note is that mentions and references aren't necessarily anchored to any named entity (though they often are). For example, consider this sentence:

The police are yet to find any suspects. They say they will continue their search.

Here, "The police", "they" (which occurs twice), and "their" are all part of one coreference chain, yet none is a named entity.

To access only mentions that are named entities, use the entities property of the sentence.

The document as a whole also provides global mentions, references, and entities properties which can be iterated over directly..

Reference

.. py:class:: AnnotatedText(corenlp_xml, **kwargs)

   Create a new AnnotatedText object.  Only the first parameter is normally
   needed.  The remaining parameters enable adding entity linking data from
   the AIDA software, controlling the kind of dependency parse
   used, and filtering the kinds of named entities, coreference chains,
   and mentions that are included (by default all those provided by CoreNLP
   are are included).

   :param str corenlp_xml: An xml string output by CoreNLP.
   :param str aida_json=None: A JSON string output by AIDA.  AIDA is a program that disambiguates named entities, linking them to the YAGO knowledge base.  If the JSON output of AIDA is provided, then ``entities``, ``mentions`` and ``references`` entries will be augmented with entity linking information.
   :param str dependencies='collapsed-ccprocessed': Determines which kind of dependencies will be used in constructing dependency trees.  Three options are available: ``'collapsed-ccprocessed'`` (the default), ``'collapsed'``, and ``'basic'``.
   :param bool exclude_ordinal_NERs=False: Whether to recognize ordinal named entities.  If ``True``, named entities of the following types will be ignored: ``'TIME'``, ``'DATE'``, ``'NUMBER'``, ``'DURATION'``, ``'PERCENT'``, ``'SET'``, ``'ORDINAL'``, and ``'MONEY'``.
   :param bool exclude_long_mentions=False: CoreNLP occaisionally includes mentions, as part of coreference chains, that are very long noun phrases.  These mentions can be surprising and are often not useful.  Setting this option to ``True`` causes any mentions longer that the value specified by ``long_mention_threshold`` to be discarded (default length is 5 tokens).
   :param int long_mention_threshold=5: Maximum number of tokens allowed in a coreference chain mention, above which the mention will be ignored if ``exclude_long_mentions`` is ``True``.
   :param bool exclude_non_ner_coreferences=False: In some cases, it is only desirable to consider those coreference chains that have at least one named entity as a mention.  Setting this option to ``True`` will exclude references and their mentions if the reference includes no named entities.

About

read CoreNLP and AIDA files into a Python class

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • Python 100.0%