ClusType

Publication

Xiang Ren*, Ahmed El-Kishky, Chi Wang, Fangbo Tao, Clare R. Voss, Heng Ji, Jiawei Han, "ClusType: Effective Entity Recognition and Typing by Relation Phrase-Based Clustering", Proc. of 2015 ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining (KDD'15), Sydney, Australia, August 2015. (Slides)
Xiang Ren*, Ahmed El-Kishky, Chi Wang, Jiawei Han, "Automatic Entity Recognition and Typing from Massive Text Corpora: A Phrase and Network Mining Approach", Proc. of 2015 ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining (KDD'15 Conference Tutorial), Sydney, Australia, August 2015. (Website) (Slides)

Note

The "./result" folder contains results on a sample of 50k Yelp reviews.

Requirements

The commands below use Ubuntu as an example.

python 2.7

$ sudo apt-get install python

numpy

$ sudo apt-get install python-pip
$ sudo pip install numpy

scipy

$ sudo pip install scipy

scikit-learn

$ sudo pip install scikit-learn

TextBlob and data for its POS tagger

$ sudo pip install textblob
$ sudo python -m textblob.download_corpora

lxml

$ sudo pip install lxml

Default Run

$ ./run.sh

File path setting - run.sh

We will take the Yelp dataset as an example.

Input: dataset folder. Two sample datasets are provided: a Yelp review dataset (yelp) and an NYT news dataset (nyt).

DataPath='data/yelp'

Input: data file path.

RawText='data/yelp/yelp_sample50k.txt'

Input: type mapping file path.

Format: "type name \TAB typeId \n". "NIL" means "Not-of-Interest".

TypeFile='data/yelp/type_tid.txt'
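
To illustrate the type mapping format, here is a minimal Python sketch that reads "type name \TAB typeId" lines into a dictionary. The function name `parse_type_map` and the sample type names are made up for illustration and are not part of this repository. The typeId is kept as a string, since its exact value range is not stated here.

```python
def parse_type_map(lines):
    """Parse 'type name<TAB>typeId' lines into {type name: typeId}."""
    type_map = {}
    for line in lines:
        line = line.rstrip("\r\n")
        if not line.strip():
            continue  # skip blank lines
        name, type_id = line.split("\t")
        type_map[name.strip()] = type_id.strip()
    return type_map


# Hypothetical sample lines; the real file is data/yelp/type_tid.txt.
sample = ["food\t0\n", "location\t1\n", "NIL\t2\n"]
type_map = parse_type_map(sample)

# "NIL" means "Not-of-Interest", so drop it to get the target types.
types_of_interest = sorted(n for n in type_map if n != "NIL")
print(types_of_interest)
```

To read the actual file, pass an open file handle instead of the sample list, e.g. `parse_type_map(open('data/yelp/type_tid.txt'))`.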

Input: stopword list.

StopwordFile='data/stopwords.txt'

Output: output file from candidate generation.

Format: "docId \TAB segmented sentence \n".
Segments are separated by ",". Entity mention candidates are marked with ":EP". Relation phrases are marked with ":RP".

SegmentOutFile='result/segment.txt'
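
The segmented output can be inspected with a few lines of Python. The sketch below assumes that the ":EP" and ":RP" markers appear as suffixes of the comma-separated segments; check this against your own result/segment.txt before relying on it. The function name `parse_segment_line` and the sample sentence are hypothetical.

```python
def parse_segment_line(line):
    """Split one 'docId<TAB>segmented sentence' line into its parts.

    Returns (doc_id, entity_candidates, relation_phrases).
    Assumes each marker (:EP / :RP) is a suffix of its segment.
    """
    doc_id, sentence = line.rstrip("\r\n").split("\t", 1)
    entities, relations = [], []
    for seg in sentence.split(","):
        seg = seg.strip()
        if seg.endswith(":EP"):
            entities.append(seg[:-3].strip())
        elif seg.endswith(":RP"):
            relations.append(seg[:-3].strip())
        # unmarked segments are neither candidates nor relation phrases
    return doc_id, entities, relations


# Made-up example line in the documented format.
example = "12\tthe pizza:EP,was served at:RP,the bar:EP,quickly\n"
print(parse_segment_line(example))
```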

Output: entity linking output file.

Format: "docId \TAB entity name \TAB Original Freebase Type \TAB Refined Type \TAB Freebase EntityID \TAB Similarity Score \TAB Relative Rank \n".
Download Seed file for Yelp dataset.
Download Seed file for NYT dataset.

NOTE: Our entity linking module calls the DBpedia Spotlight Web service, which has limited querying speed. This process can be greatly accelerated by installing the tool on your local machine (Link).

SeedFile='result/seed.txt'
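
As a sketch of how the seed file could be consumed, the Python snippet below splits each line into the seven tab-separated fields listed above. The record type `SeedRecord`, the helper names, and the sample values (including the placeholder entity ID) are invented for illustration. It also assumes the similarity score parses as a float, which is not guaranteed by the format description.

```python
from collections import namedtuple

# Field order follows the documented seed file format.
SeedRecord = namedtuple("SeedRecord", [
    "doc_id", "entity_name", "freebase_type", "refined_type",
    "freebase_id", "similarity", "relative_rank",
])


def parse_seed_line(line):
    """Parse one seed line; all fields are kept as strings."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) != 7:
        raise ValueError("expected 7 tab-separated fields, got %d" % len(fields))
    return SeedRecord(*fields)


def filter_by_similarity(records, min_score):
    """Keep records whose similarity (assumed numeric) is >= min_score."""
    return [r for r in records if float(r.similarity) >= min_score]


# Made-up sample line; the ID is a placeholder, not a real Freebase ID.
sample = "3\tChicago\t/location/citytown\tLOC\tm.placeholder\t0.98\t0.1\n"
record = parse_seed_line(sample)
print(record.entity_name, record.refined_type)
```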

Output: data statistics on graph construction.

DataStatsFile='result/data_model_stats.txt'

Output: Typed entity mentions.

Format: "docId \TAB entity mention \TAB entity type \n".

ResultFile='result/results.txt'
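
A quick way to sanity-check the results is to count typed mentions per entity type. The following Python sketch assumes exactly three tab-separated fields per line, as in the format above. The function name `count_types` and the sample lines are hypothetical.

```python
from collections import Counter


def count_types(lines):
    """Count mentions per entity type from 'docId<TAB>mention<TAB>type' lines."""
    counts = Counter()
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue  # skip blank lines
        doc_id, mention, entity_type = line.split("\t")
        counts[entity_type] += 1
    return counts


# Made-up sample lines in the documented format.
sample = ["1\tpizza\tfood\n", "1\tChicago\tlocation\n", "2\tburger\tfood\n"]
for entity_type, n in count_types(sample).most_common():
    print(entity_type, n)
```

For the real output, pass an open handle to result/results.txt instead of the sample list.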

Output: Typed mentions annotated in the segmented text.

ResultFileInText='result/resultsInText.txt'

Parameters - run.sh

Threshold on significance score for candidate generation.

significance="1"

Switch on capitalization feature for candidate generation.

capitalize="1"

Maximal phrase length for candidate generation.

maxLength='4'

Minimal support of phrases for candidate generation.

minSup='10'

Number of relation phrase clusters.

NumRelationPhraseClusters='50'
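
To give an intuition for how maxLength and minSup bound candidate generation, here is a simplified Python sketch that counts contiguous word sequences up to a maximum length and keeps those meeting a minimum support. This is not the code in candidate_generation, which also applies the significance score and the capitalization feature; here support is assumed to mean the raw occurrence count, and the function name `frequent_phrases` is made up.

```python
from collections import Counter


def frequent_phrases(sentences, max_length=4, min_sup=10):
    """Return {phrase tuple: count} for n-grams with n <= max_length
    that occur at least min_sup times across all tokenized sentences."""
    counts = Counter()
    for tokens in sentences:
        for n in range(1, max_length + 1):
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    return {p: c for p, c in counts.items() if c >= min_sup}


# Tiny made-up corpus with lowered thresholds so something survives.
corpus = [["great", "pizza"], ["great", "pizza", "place"], ["pizza"]]
print(frequent_phrases(corpus, max_length=2, min_sup=2))
```

Raising minSup shrinks the candidate set and lowering maxLength limits how long a phrase can be, which is the trade-off the two parameters control.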
