Source code and data for EMNLP'16 paper AFET: Automatic Fine-Grained Entity Typing by Hierarchical Partial-Label Embedding.
Given a text corpus with entity mentions detected and heuristically labeled by distant supervision, this code trains a rank-based loss over the distant supervision and predicts the fine-grained entity types of each test entity mention. For example, check out AFET's output on WSJ news articles.
An end-to-end tool (corpus to typed entities) is under development. Please keep track of our updates.
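As a rough illustration of the "rank-based loss over distant supervision" mentioned above, the sketch below implements a generic margin-based partial-label ranking loss that pushes a mention's noisy candidate types above all other types. It is a simplified stand-in for exposition, not the hierarchy-aware objective from the paper; all function and variable names are illustrative.

import numpy as np

def partial_label_rank_loss(mention_vec, type_embs, candidate_ids, margin=1.0):
    # mention_vec: (d,) feature/embedding vector of one mention
    # type_embs:   (T, d) matrix with one embedding per type
    # candidate_ids: noisy candidate types given by distant supervision
    scores = type_embs.dot(mention_vec)              # score for every type
    negatives = [t for t in range(len(type_embs)) if t not in set(candidate_ids)]
    loss = 0.0
    for y in candidate_ids:                          # candidate ("partial") labels
        for y_neg in negatives:                      # all non-candidate types
            # hinge term: candidates should outrank non-candidates by a margin
            loss += max(0.0, margin - scores[y] + scores[y_neg])
    return loss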
Performance of fine-grained entity type classification on the Wiki (Ling & Weld, 2012) dataset.
Method | Accuracy | Macro-F1 | Micro-F1 |
---|---|---|---|
HYENA (Yosef et al., 2012) | 0.288 | 0.528 | 0.506 |
FIGER (Ling & Weld, 2012) | 0.474 | 0.692 | 0.655 |
FIGER + All Filter (Gillick et al., 2014) | 0.453 | 0.648 | 0.582 |
HNM (Dong et al., 2015) | 0.237 | 0.409 | 0.417 |
WSABIE (Yogatama et al., 2015) | 0.480 | 0.679 | 0.657 |
AFET (Ren et al., 2016) | 0.533 | 0.693 | 0.664 |
The output on the BBN dataset can be found here. Each line is a sentence in the BBN test data, with entity mentions and their fine-grained entity types identified.
- python 2.7, g++
- Python library dependencies
$ pip install pexpect unidecode six requests protobuf
- Set up Stanford CoreNLP and its Python wrapper.
$ cd DataProcessor/
$ git clone git@github.com:stanfordnlp/stanza.git
$ cd stanza
$ pip install -e .
$ wget http://nlp.stanford.edu/software/stanford-corenlp-full-2016-10-31.zip
$ unzip stanford-corenlp-full-2016-10-31.zip
$ rm stanford-corenlp-full-2016-10-31.zip
We pre-processed three public datasets (train/test sets) into our JSON format. We ran Stanford NER on the training sets to detect entity mentions, and performed distant supervision using DBpedia Spotlight to assign type labels (see the snippet after the list for how to inspect the format):
- Wiki (Ling & Weld, 2012): 1.5M sentences sampled from 780k Wikipedia articles. 434 news sentences are manually annotated for evaluation. 113 entity types are organized into a 2-level hierarchy (download JSON)
- OntoNotes (Weischedel et al., 2011): 13k news articles, 77 of which are manually labeled for evaluation. 89 entity types are organized into a 3-level hierarchy. (download JSON)
- BBN (Weischedel et al., 2005): 2,311 WSJ articles that are manually annotated using 93 types in a 2-level hierarchy. (download JSON)
Type hierarchies for each dataset are included.
- Please put the data files in the corresponding subdirectories under AFET/Data/.
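For reference, each line of the train/test JSON files is one sentence serialized as a JSON object. The snippet below shows how such a line might be inspected, assuming a FIGER-style schema with "tokens" and "mentions" (token-offset "start"/"end" plus a "labels" list); the path and field names here are assumptions, so check the downloaded files for the exact schema.

import json

# Print the labeled mentions of the first sentence in the training file
# (path and field names are illustrative; adjust them to the actual schema).
with open('Data/BBN/train.json') as f:
    sent = json.loads(f.readline())
    tokens = sent['tokens']
    for m in sent.get('mentions', []):
        surface = ' '.join(tokens[m['start']:m['end']])
        print('%s -> %s' % (surface, m['labels']))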
$ cd AFET/Model; make
Run AFET for fine-grained entity typing on the BBN dataset:
$ java -mx4g -cp "DataProcessor/stanford-corenlp-full-2016-10-31/*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer
$ ./run.sh
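If the pipeline fails at the preprocessing stage, it helps to confirm that the CoreNLP server started above is actually reachable. A minimal check with the requests library (already installed as a dependency) might look like this, assuming the server listens on CoreNLP's default port 9000:

import requests

# Annotate one sentence via the CoreNLP server and print each token's NER tag.
props = '{"annotators": "tokenize,ssplit,pos,ner", "outputFormat": "json"}'
resp = requests.post('http://localhost:9000', params={'properties': props},
                     data='Barack Obama visited Paris .')
resp.raise_for_status()
for tok in resp.json()['sentences'][0]['tokens']:
    print('%s\t%s' % (tok['word'], tok['ner']))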
Dataset to run on (set in run.sh):
Data="BBN"
- Concrete parameters for running each dataset can be found in the README in the corresponding data folder under AFET/Data/.
Evaluate prediction results (from the classifier trained on de-noised data) on the test data:
python Evaluation/emb_prediction.py $Data pl_warp bipartite maximum cosine 0.25
python Evaluation/evaluation.py $Data pl_warp bipartite
- Usage: python Evaluation/evaluation.py -DATA (BBN/ontonotes/FIGER) -METHOD (hple/...) -EMB_MODE (hete_feature)
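For context, the metrics reported in the table above (Accuracy, Macro-F1, Micro-F1) are the standard fine-grained typing metrics from Ling & Weld (2012): strict accuracy (exact match of predicted and gold type sets), macro-averaged F1 (precision/recall averaged per mention), and micro-averaged F1 (precision/recall pooled over all mentions). The sketch below is an independent reference implementation of those metrics, not the repo's evaluation code.

def _f1(p, r):
    return 2.0 * p * r / (p + r) if p + r > 0 else 0.0

def fg_typing_metrics(gold, pred):
    # gold, pred: lists of type sets, one entry per test mention
    n = float(len(gold))
    strict_acc = sum(1 for g, p in zip(gold, pred) if g == p) / n
    macro_p = sum(len(g & p) / float(len(p)) if p else 0.0 for g, p in zip(gold, pred)) / n
    macro_r = sum(len(g & p) / float(len(g)) for g, p in zip(gold, pred)) / n
    overlap = float(sum(len(g & p) for g, p in zip(gold, pred)))
    micro_p = overlap / sum(len(p) for p in pred)
    micro_r = overlap / sum(len(g) for g in gold)
    return strict_acc, _f1(macro_p, macro_r), _f1(micro_p, micro_r)

For example, a mention predicted as {/person} against gold {/person, /person/artist} contributes full precision but only half recall to the macro and micro averages.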
Please cite the following paper if you find the code and datasets helpful:
@inproceedings{Ren2016AFETAF,
title={AFET: Automatic Fine-Grained Entity Typing by Hierarchical Partial-Label Embedding},
author={Xiang Ren and Wenqi He and Meng Qu and Lifu Huang and Heng Ji and Jiawei Han},
booktitle={EMNLP},
year={2016}
}