naruhodo
is a python library for automatic semantic graph generation from human-readable text. Graphs generated by naruhodo
are networkx objects, so you can apply furthur garph analysis/processing conveniently.
Supported languages:
- Japanese
- English(WIP)
- Chinese(WIP)
Supported semantic graph types:
- knowledge-structure-graph(KSG): A directed graph based on entity-predicate model of knowledge representation.
- dependency-structure-graph(DSG): A directed graph based on dependency structure.
naruhodo
provides basic visulization utilities using nxpd. A full-fledged visualization webapp naruhodo-viewer is also available. This webapp provides faster and interactive visualization of large graphs.
Knowledge structure graph(KSG) tries to capture meaningful relationships between different entities. It is generated based on the dependency structure of the text.
In KSG, there are primarily two kinds of nodes:
- nodes that represent entities (entities)
- nodes that represent properties or actions (predicates)
Entity nodes are primarily connected by predicate nodes. Edges represent the relationship between these nodes. If an entity has a property or performs an action, it has an edge pointing to corresponding predicate nodes. If an entity is part of the property of another entity or the object of an action, it has an edge pointing from the predicate node to it.
This kind of graph structure is inspired by the way human brains store knowledge/information. Accurate and comprehensive KSG parsing is the main focus of naruhodo
.
Below is an example of KSG generated by naruhodo
using the following texts.
"田中一郎は田中次郎が描いた絵を田中三郎に贈った。"
"三郎はこの絵を持って市場に行った。"
"彼はしばらく休む事にした。"
Dependency parsing is the analysis of dependency grammar on a block of text using computer programs.
The directed linking nature of dependency grammar makes the result of dependency parsing directed graphs.
naruhodo
generates denpendency structure graphs(DSG) directly from the output of dependency parsing programs.
Below is an example of DSG generated by naruhodo using the following texts.
"田中一郎は田中次郎が描いた絵を田中三郎に贈った。"
"三郎はこの絵を持って市場に行った。"
"彼はしばらく休む事にした。"
naruhodo
supports python
version 3.4 and above.
You can install the library directly using pip:
pip install naruhodo
This will install the latest release version of naruhodo
.
The current development version of naruhodo
can be installed from github repository directly using the following command:
pip install https://github.com/superkerokero/naruhodo/archive/dev.zip
naruhodo
relies on external programs to do Japanese word and dependency parsing, so you need to have corresponding programs installed as well.
naruhodo
is designed to support multiple backend parsers, but currently only the support for mecab
+ cabocha
is implemented.
For guide on installing mecab
and cabocha
, please refer to this page:
Amazon Linux に MeCab と CaboCha をインストール
Support for other parsers such as KNP
is planned in the future.
naruhodo
stores graph information in a networkx DiGraph
object. The properties of nodes and edges provided by naruhodo are listed in the following table.
-
Node properties
Property Description name A string that stores the name of the node stored in the graph. This is what you use to refer to the node from graph object. count An integer representing the number of this node being referred to. Can be used as an indicator of node's significance. type An integer representing the type of the node. For meanings of integers, refer to the table of node types below. label A string that stores the normalized representation of the node. This is what you see from the visualizations. pro An integer representing the pronoun type of this node. For meanings of integers, refer to the table of pronoun types below. NE An integer representing the named-entity(NE) type of this node. For meanings of integers, refer to the table of NE types below. negative If chunk is negative 1, elif chunk double negtive(strongly positive) -1, else 0 question If chunk contains ? 1, else 0. passive If chunk is passive 1, else 0. compulsory If chunk is compulsory 1, else 0. tense If chunk has no tense or present 0, elif past -1, elif present continuous 1 pos[0:n-1] A list of integers representing the id of sentences where this node appears. lpos[0:n-1] A list of integers representing the id of chunk in the sentence it appears. surface[0:n-1] A list of strings that stores the surfaces of this node(original form as it appears in the text). yomi[0:n-1] A list of strings that stores the yomi of the corresponding surface of this node. sub A string that stores the subject of this node(if none it will be an empty string). -
Node types
Type ID Description -1 Unknown type 0 Noun 1 Adjective 2 Verb 3 Conjective 4 Interjection 5 Adverb 6 Connect -
Pronoun types
Pronoun ID Description -1 Not a pronoun(or unknown pronoun) 0 Demonstrative-location 1 Demonstrative-object 2 Personal-1st 3 Personal-2nd 4 Personal-3rd 5 Indefinite 6 Inclusive 7 Omitted subject -
Named-entity types
NE ID Description 0 Not named-entity(or unknown) 1 Person 2 Location 3 Organization 4 Number/Date 5 General -
Edge properties
Property Description weight An integer representing the number of appearance of this edge. Can be used as an indicator of edge's significance. label A string that stores the label of this edge. type A string that stores the type of this edge. For details, refer to the table of edge types below. -
Edge types
Type Description none Unknown type(also used as DSG edges) sub Edge from a subject to predicate autosub Edge from a potential subject to predicate obj Edge from a predicate to object aux Edge from auxiliary to predicate cause Edge from potential cause to result coref Edge from potential antecedent to pronoun synonym Edge from potential synonym to an entity para Edge between parallel entities attr Edge from potential attribute to an entity stat Edge from potential subject to a statement
The tutorial of naruhodo
is provided as ipynb
files in the tutorial folder. You can view it directly in your browser.
Tutorial notebook for Japanese text parsing
The complete python API document for naruhodo
can be found here:
naruhodo
Python API Reference.
This document is generated automatically from source code using pdoc, so it is always up-to-date.
You can find the change log of naruhodo
here.
naruhodo
is still in development state(especially KSG related part), so you might find it outputs weird results sometimes. If you like the idea and want to help improve the library, feel free to create a pull request on github.
Here are some of my thoughts on the development of naruhodo
:
-
As you can see from the source code,
naruhodo
relies mostly on rule-based system to parse given information. And for a subject as large and complex as a language, long-term testing and procedural improvement of the program is neccessary before it can go anywhere.Currently the knowledge structure graph(KSG) generated by
naruhodo
is below my expectation for large amount of input texts. Improvement will come from furthur examination on varieties of input text and corresponding refinement of parsing logic.As a rule-based system, it certainly has its limitations such as completely resolving coreferences. But I believe in the realm of NLP, especially in rudimentary information parsing tasks, rule-based system can be used to make practical applications. Recent advances in statitics-based techniques such as deep learning seem promising. But almost all of these techniques require large amount of labelled data, which is hard to retrieve. The rule-based approach taken here is more or less an Ab Initio way of looking at some NLP problems(which doesn't take any training data before making useful predictions). My hope is that applications like this may at least alleviate the pain of collecting large amount of labelled data by automating some of the tedious tasks.
naruhodo
is my personal experiment on how far rule-based system can go in the world of NLP. It may fail to be practically useful, but I am sure it is going to be an interesting journey. -
Coreference resolution is the task of finding all expressions that refer to the same entity in a text. Without proper coreference resolution, generated KSG does not capture all meaningful information, and its usability will be quite limited. Currently
naruhodo
has a primitive coreference resolution added from 0.2.1, but the performance is quite limited. I am experimenting with some published works on this topic. A method based on word embeddings and reinforcement learning might be added tonaruhodo
first starting from version 0.5 ~. -
There aren't many Japanese parsing programs available on the internet yet. Aside of
mecab
+cabocha
, the most usable parsing program seems to bejuman(++)
+knp
. The output format ofknp
does contain extra useful information and can be more accurate thancabocha
in some situations. But its output lacks a unified scheme, making it difficult to use. Another important fact is thatjuman(++)
+knp
parsing can be very time consuming compared tomecab
+cabocha
, which limits its use cases.I am looking into some fast generic libraries like
spaCy
as well. Though Japanese is not the officially supported language for the moment.To summarize, though
naruhodo
is designed to support multiple backends, since its current focus is Japanese only, adding support for other backends is not a priority. -
Japanese is the only language
naruhodo
supports now. Besides my personal interest, I chosen Japanese because it has some unique characteristics that make it both challenging and rewarding. In my opnion, the difficulty regarding Japanese mostly comes from its ambiguity in the expression(for example, the subject is frequently ommited in Japanese) and large amount of word transformations(the same verb can have as many as 10+ forms).From a practical point of view, languages such as English and Chinese are in potentially large demand. So I am thinking about expanding the library to these popular languages in the future, if the rule-based approach taken by
naruhodo
proves to be usable afterall. -
It seems that everybody is excited about machine learning these days. And I do see huge application potential in techniques like reinforcement learning and generative adversarial models. I do have some thoughts about the applications of these techniques to some specific knowledge retrieval problems. For example, the coreference problem is obviously outside the reach of any rule-based systems, and a reinforcement learning based approach seems quite attractive in this case(provided that we have a real-time feedback system from users).
As my understanding of machine learning techniques improves, some statistics-based approaches may be added in the future.
-
I think information of DSG and KSG is especially useful in the realm of automating information retrieval processes. This includes, but not limited to,
- automatic text summarization
- knowledge base generation for Q&A system and translation system
- generic sentiment analysis
As the quality of KSG generated by
naruhodo
improves, I will try to apply it to some of these areas.