
Automatic Entity Recognition and Typing for Massive, Domain-Specific Corpora


UIUC-data-mining/ClusType

 
 


ClusType

Publication

Note

The "./result" folder contains results on a sample of 50k Yelp reviews.

Requirements

The commands below use Ubuntu as an example.

  • python 2.7
$ sudo apt-get install python
  • numpy
$ sudo apt-get install python-pip
$ sudo pip install numpy
  • scipy
$ sudo pip install scipy
  • scikit-learn
$ sudo pip install scikit-learn
  • TextBlob and data for its POS tagger
$ sudo pip install textblob
$ sudo python -m textblob.download_corpora
  • lxml
$ sudo pip install lxml
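
Before running the pipeline, it can help to confirm that the packages above are importable. The following is a minimal sketch, not part of the repository; the helper name `missing_modules` is hypothetical, and the module names are the standard import names of the packages listed above.

```python
import importlib

# Import names of the packages listed in the requirements above.
REQUIRED_MODULES = ["numpy", "scipy", "sklearn", "textblob", "lxml"]


def missing_modules(names=REQUIRED_MODULES):
    """Return the subset of `names` that cannot be imported (hypothetical helper)."""
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    missing = missing_modules()
    if missing:
        print("Missing packages: " + ", ".join(missing))
    else:
        print("All required packages are importable.")
```

An empty list means every listed package was found by the interpreter used to run the script.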

Default Run

$ ./run.sh  

File path setting - run.sh

We will take the Yelp dataset as an example.

Input: dataset folder. The repository includes one sample Yelp review dataset (yelp) and one NYT news dataset (nyt).

DataPath='data/yelp'

Input data file path.

RawText='data/yelp/yelp_sample50k.txt'

Input: type mapping file path.

  • Format: "type name \TAB typeId \n". "NIL" means "Not-of-Interest".
TypeFile='data/yelp/type_tid.txt'
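
As an illustration of the type mapping format, here is a minimal sketch that reads "type name \TAB typeId" lines into a dictionary. The function name `parse_type_map` is hypothetical and not part of the repository; the type names in the usage note are made up, and ids are kept as strings because the README does not state their numeric format.

```python
def parse_type_map(lines):
    """Map each type name to its type id (kept as a string).

    `lines` is any iterable of "type name<TAB>typeId" lines,
    for example an open file handle on the type mapping file.
    """
    type_to_id = {}
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue  # skip blank lines
        name, type_id = line.split("\t")
        type_to_id[name.strip()] = type_id.strip()
    return type_to_id
```

For example, `parse_type_map(["food\t0\n", "NIL\t1\n"])` returns `{"food": "0", "NIL": "1"}`. To read the real file, pass `open('data/yelp/type_tid.txt')` instead of a list.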

Input: stopword list.

StopwordFile='data/stopwords.txt'

Output: candidate generation output file.

  • Format: "docId \TAB segmented sentence \n".
  • Segments are separated by ",". Entity mention candidates are marked with ":EP". Relation phrases are marked with ":RP".
SegmentOutFile='result/segment.txt'
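
The segment format above can be read with a few lines of Python. This is a sketch under one assumption that the README does not spell out: that ":EP" and ":RP" appear as suffixes of the marked segment. The function name `parse_segment_line` is hypothetical.

```python
def parse_segment_line(line):
    """Split one "docId<TAB>segmented sentence" line into its parts.

    Returns (doc_id, entity_candidates, relation_phrases, other_segments).
    Assumes the ":EP" / ":RP" markers are suffixes of a segment.
    """
    doc_id, sentence = line.rstrip("\r\n").split("\t", 1)
    entities, relations, others = [], [], []
    for segment in sentence.split(","):
        segment = segment.strip()
        if not segment:
            continue  # ignore empty segments
        if segment.endswith(":EP"):
            entities.append(segment[:-len(":EP")])
        elif segment.endswith(":RP"):
            relations.append(segment[:-len(":RP")])
        else:
            others.append(segment)
    return doc_id, entities, relations, others
```

With the made-up line `"12\tthe,pizza:EP,was served at:RP,luigi s:EP\n"`, the entity candidates are `["pizza", "luigi s"]` and the relation phrases are `["was served at"]`.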

Output: entity linking output file.

  • Format: "docId \TAB entity name \TAB Original Freebase Type \TAB Refined Type \TAB Freebase EntityID \TAB Similarity Score \TAB Relative Rank \n".
  • Download Seed file for Yelp dataset.
  • Download Seed file for NYT dataset.

NOTE: Our entity linking module calls the DBpedia Spotlight web service, which has limited querying speed. This process can be greatly accelerated by installing the tool on your local machine Link.

SeedFile='result/seed.txt'
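
To inspect the seed file, each line can be unpacked into its seven tab-separated columns. The sketch below is not from the repository; `SeedRecord` and `parse_seed_line` are hypothetical names, and every field is kept as a string because the README does not describe the numeric formats of the score and rank columns.

```python
from collections import namedtuple

# Field order follows the entity linking output format described above.
SeedRecord = namedtuple("SeedRecord", [
    "doc_id",
    "entity_name",
    "freebase_type",
    "refined_type",
    "freebase_id",
    "similarity",
    "relative_rank",
])


def parse_seed_line(line):
    """Turn one seed file line into a SeedRecord of strings."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) != len(SeedRecord._fields):
        raise ValueError("expected 7 tab-separated fields, got %d" % len(fields))
    return SeedRecord(*fields)
```

A record parsed this way exposes its columns by name, for example `record.refined_type` or `record.similarity`.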

Output: data statistics on graph construction.

DataStatsFile='result/data_model_stats.txt'

Output: Typed entity mentions.

  • Format: "docId \TAB entity mention \TAB entity type \n".
ResultFile='result/results.txt'
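
A quick way to summarize the typed mentions is to count how many were assigned to each type. This is a minimal sketch assuming the three-column format above; `count_mentions_by_type` is a hypothetical helper, and the type names in the usage note are made up.

```python
from collections import Counter


def count_mentions_by_type(lines):
    """Count typed mentions per entity type.

    `lines` is any iterable of "docId<TAB>entity mention<TAB>entity type"
    lines, for example an open file handle on the result file.
    """
    counts = Counter()
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue  # skip blank lines
        doc_id, mention, entity_type = line.split("\t")
        counts[entity_type] += 1
    return counts
```

For instance, the lines `"1\tpizza\tfood\n"`, `"1\tvegas\tlocation\n"` and `"2\tburger\tfood\n"` give a count of 2 for `food` and 1 for `location`.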

Output: Typed mentions annotated in the segmented text.

ResultFileInText='result/resultsInText.txt'

Parameters - run.sh

Threshold on significance score for candidate generation.

significance="1"

Switch on capitalization feature for candidate generation.

capitalize="1"

Maximal phrase length for candidate generation.

maxLength='4'

Minimal support of phrases for candidate generation.

minSup='10'
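
To make the roles of maxLength and minSup concrete, the sketch below counts all word n-grams of up to `max_length` tokens and keeps those that occur at least `min_sup` times. It is only an illustration of what the two parameters bound, not ClusType's candidate generation: it ignores the significance score and capitalization feature, uses plain whitespace tokenization, and `frequent_phrases` is a hypothetical name.

```python
from collections import Counter


def frequent_phrases(sentences, max_length=4, min_sup=10):
    """Return {phrase: count} for n-grams with n <= max_length and count >= min_sup."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.split()  # simplistic whitespace tokenization
        for n in range(1, max_length + 1):
            for start in range(len(tokens) - n + 1):
                counts[" ".join(tokens[start:start + n])] += 1
    return dict((phrase, c) for phrase, c in counts.items() if c >= min_sup)
```

Raising `min_sup` removes rare phrases, while lowering `max_length` limits how long a candidate phrase can be.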

Number of relation phrase clusters.

NumRelationPhraseClusters='50'
