Xiang Ren*, Ahmed El-Kishky, Chi Wang, Fangbo Tao, Clare R. Voss, Heng Ji, Jiawei Han, "ClusType: Effective Entity Recognition and Typing by Relation Phrase-Based Clustering", Proc. of 2015 ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining (KDD'15), Sydney, Australia, August 2015. (Slides)
Xiang Ren*, Ahmed El-Kishky, Chi Wang, Jiawei Han, "Automatic Entity Recognition and Typing from Massive Text Corpora: A Phrase and Network Mining Approach", Proc. of 2015 ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining (KDD'15 Conference Tutorial), Sydney, Australia, August 2015. (Website) (Slides)
"./result" folder contains results on a sample of 50k Yelp reviews.
The instructions below use Ubuntu as an example.
- python 2.7
$ sudo apt-get install python
- numpy
$ sudo apt-get install python-pip
$ sudo pip install numpy
- scipy
$ sudo pip install scipy
- scikit-learn
$ sudo pip install scikit-learn
- TextBlob and data for its POS tagger
$ sudo pip install textblob
$ sudo python -m textblob.download_corpora
- lxml
$ sudo pip install lxml
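After installing the packages above, a quick import check can catch a missing dependency before running the pipeline. The following is a minimal sketch and not part of the ClusType code; the helper name `missing_modules` is hypothetical, and the list uses the import names of the packages (scikit-learn is imported as `sklearn`).

```python
import importlib

# Import names of the packages listed above (scikit-learn imports as "sklearn").
REQUIRED_MODULES = ["numpy", "scipy", "sklearn", "textblob", "lxml"]


def missing_modules(names):
    """Return the names from `names` that cannot be imported."""
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing modules: " + ", ".join(missing))
    else:
        print("All required modules are importable.")
```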
$ ./run.sh
We use the Yelp dataset as an example.
Input: dataset folder. Two sample datasets are provided: a Yelp review dataset (yelp) and an NYT news dataset (nyt).
Input: data file path.
Input: type mapping file path.
- Format: "type name \TAB typeId \n". "NIL" means "Not-of-Interest".
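The type mapping format can be read with a few lines of Python. This is a sketch only: the function name `load_type_mapping` and the sample type names are hypothetical, and type IDs are kept as strings because the format description does not state their data type.

```python
def load_type_mapping(lines):
    """Parse lines of the form 'type name<TAB>typeId' into a dict."""
    mapping = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        type_name, type_id = line.split("\t")
        mapping[type_name.strip()] = type_id.strip()
    return mapping


# Hypothetical sample content; "NIL" stands for "Not-of-Interest".
sample = ["person\t0\n", "location\t1\n", "NIL\t2\n"]
mapping = load_type_mapping(sample)
print(mapping["NIL"])
```

To read a real file, pass an open file object, which iterates over its lines.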
Input: stopword list.
Output: output file from candidate generation.
- Format: "docId \TAB segmented sentence \n".
- Segments are separated by ",". Entity mention candidates are marked with ":EP". Relation phrases are marked with ":RP".
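A line of the candidate generation output can be split into labeled segments roughly as follows. This sketch assumes that the ":EP" and ":RP" markers appear as suffixes of a segment and that segment text itself contains no commas; the function name `parse_segmented_line` and the sample sentence are hypothetical.

```python
def parse_segmented_line(line):
    """Split 'docId<TAB>segmented sentence' into (doc_id, [(text, label), ...]).

    label is "EP" for entity mention candidates, "RP" for relation
    phrases, and None for unmarked segments.
    """
    doc_id, sentence = line.rstrip("\n").split("\t", 1)
    segments = []
    for raw in sentence.split(","):
        text = raw.strip()
        if not text:
            continue
        if text.endswith(":EP"):
            segments.append((text[:-3].strip(), "EP"))
        elif text.endswith(":RP"):
            segments.append((text[:-3].strip(), "RP"))
        else:
            segments.append((text, None))
    return doc_id, segments


# Hypothetical example line.
doc_id, segments = parse_segmented_line(
    "12\tthe pizza:EP,was served with:RP,fresh basil:EP,yesterday\n"
)
entity_candidates = [text for text, label in segments if label == "EP"]
print(entity_candidates)
```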
Output: entity linking output file.
- Format: "docId \TAB entity name \TAB Original Freebase Type \TAB Refined Type \TAB Freebase EntityID \TAB Similarity Score \TAB Relative Rank \n".
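The seven-column entity linking output maps naturally onto a named tuple. The sketch below assumes the similarity score is a decimal number and leaves the relative rank as a string, since the format description does not specify it; `LinkedEntity`, `parse_linking_line`, and the sample values are hypothetical.

```python
from collections import namedtuple

LinkedEntity = namedtuple(
    "LinkedEntity",
    ["doc_id", "name", "original_type", "refined_type",
     "entity_id", "similarity", "relative_rank"],
)


def parse_linking_line(line):
    """Parse one tab-separated line of the entity linking output."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 7:
        raise ValueError("expected 7 tab-separated fields, got %d" % len(fields))
    doc_id, name, original_type, refined_type, entity_id, score, rank = fields
    # Assumption: the similarity score parses as a float.
    return LinkedEntity(doc_id, name, original_type, refined_type,
                        entity_id, float(score), rank)


# Hypothetical sample line with placeholder values.
record = parse_linking_line(
    "3\tPhoenix\t/location/citytown\tlocation\tm.xxxx\t0.87\t1\n"
)
print(record.name, record.similarity)
```

Keeping the score numeric makes it easy to filter links below a chosen similarity threshold before using them as seeds.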
- Download the seed file for the Yelp dataset.
- Download the seed file for the NYT dataset.
NOTE: Our entity linking module calls the DBpedia Spotlight Web service, which limits query speed. This step can be greatly accelerated by installing the tool on your local machine (Link).
Output: data statistics on graph construction.
Output: Typed entity mentions.
- Format: "docId \TAB entity mention \TAB entity type \n".
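As a quick sanity check on the typed mentions file, the number of mentions per predicted type can be tallied. This is a sketch under the three-column format described above; the function name `count_types` and the sample rows are hypothetical.

```python
from collections import Counter


def count_types(lines):
    """Count mentions per entity type in 'docId<TAB>mention<TAB>type' lines."""
    counts = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        doc_id, mention, entity_type = line.split("\t")
        counts[entity_type] += 1
    return counts


# Hypothetical sample rows.
rows = ["1\tpad thai\tfood\n", "1\tPhoenix\tlocation\n", "2\tgreen curry\tfood\n"]
print(count_types(rows).most_common())
```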
Output: Typed mentions annotated in the segmented text.
Threshold on significance score for candidate generation.
Switch on capitalization feature for candidate generation.
Maximal phrase length for candidate generation.
Minimal support of phrases for candidate generation.
Number of relation phrase clusters.