-
Xiang Ren*, Ahmed El-Kishky, Chi Wang, Fangbo Tao, Clare R. Voss, Heng Ji, Jiawei Han, "ClusType: Effective Entity Recognition and Typing by Relation Phrase-Based Clustering”, Proc. of 2015 ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining (KDD'15), Sydney, Australia, August 2015. (Slides)
-
Xiang Ren*, Ahmed El-Kishky, Chi Wang, Jiawei Han, "Automatic Entity Recognition and Typing from Massive Text Corpora: A Phrase and Network Mining Approach”, Proc. of 2015 ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining (KDD'15 Conference Tutorial), Sydney, Australia, August 2015. (Website) (Slides)
"./result" folder contains results on a sample of 50k Yelp reviews.
We will take Ubuntu for example.
- python 2.7
$ sudo apt-get install python
- numpy
$ sudo apt-get install pip
$ sudo pip install numpy
- scipy
$ sudo pip install scipy
- scikit-learn
$ sudo pip install sklearn
- TextBlob and data for its POS tagger
$ sudo pip install textblob
$ sudo python -m textblob.download_corpora
- lxml
$ sudo pip install lxml
$ ./run.sh
We will take Yelp dataset as an example.
Input: dataset folder. There are one sample Yelp review dataset (yelp) and one NYT news dataset (nyt).
DataPath='data/yelp'
Input data file path.
RawText='data/yelp/yelp_sample50k.txt'
Input: type mapping file path.
- Format: "type name \TAB typeId \n". "NIL" means "Not-of-Interest".
TypeFile='data/yelp/type_tid.txt'
Input: stopword list.
StopwordFile='data/stopwords.txt'
Output: output file from candidate generation.
- Format: "docId \TAB segmented sentence \n".
- Segments are separated by ",". Entity mention candidates are marked with ":EP". Relation phrases are marked with ":RP".
SegmentOutFile='result/segment.txt'
Output: entity linking output file.
- Format: "docId \TAB entity name \TAB Original Freebase Type \TAB Refined Type \TAB Freebase EntityID \TAB Similarity Score \TAB Relative Rank \n".
- Download Seed file for Yelp dataset.
- Download Seed file for NYT dataset.
NOTE: Our entity linking module calls DBpediaSpotLight Web service, which has limited querying speed. This process can be largely accelarated by installing the tool on your local machine Link.
SeedFile='result/seed.txt'
Output: data statistics on graph construction.
DataStatsFile='result/data_model_stats.txt'
Output: Typed entity mentions.
- Format: "docId \TAB entity mention \TAB entity type \n".
ResultFile='result/results.txt'
Output: Typed mentions annotated in the segmented text.
ResultFileInText='result/resultsInText.txt'
Threshold on significance score for candidate generation.
significance="1"
Switch on capitalization feature for candidate generation.
capitalize="1"
Maximal phrase length for candidate generation.
maxLength='4'
Minimal support of phrases for candidate generation.
minSup='10'
Number of relation phrase clusters.
NumRelationPhraseClusters='50'