Note: This project is mostly based on https://github.com/yuhaozhang/sentence-convnet
- Python 2.7
- Tensorflow (tested with versions 0.10.0rc0 -> 1.0.1)
- Numpy
To download Wikipedia articles (`distant_supervision.py`)
- Beautifulsoup
- Pandas
- Stanford NER
  *The path to Stanford NER is specified in the `ner_path` variable in `distant_supervision.py`.
To visualize the results (`visualize.ipynb`)
- The `data` directory includes preprocessed data:
  cnn-re-tf
  ├── ...
  ├── word2vec
  └── data
      ├── er                  # binary-classification dataset
      │   ├── source.txt      # source sentences
      │   └── target.txt      # target labels
      └── mlmi                # multi-label multi-instance dataset
          ├── source.att      # attention
          ├── source.left     # left context
          ├── source.middle   # middle context
          ├── source.right    # right context
          ├── source.txt      # source sentences
          └── target.txt      # target labels
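As a quick way to look at one of these datasets, the sketch below pairs each line of `source.txt` with the same-numbered line of `target.txt`. It assumes the two files are line-aligned (one example per line), which the layout suggests but this README does not state; `read_parallel` is a hypothetical helper, not part of the project.

```python
import io
import os


def read_parallel(data_dir, source_name="source.txt", target_name="target.txt"):
    """Return a list of (sentence, label) string pairs from a dataset directory.

    Assumption (not confirmed by the project): line i of the source file
    corresponds to line i of the target file.
    """
    with io.open(os.path.join(data_dir, source_name), encoding="utf-8") as f:
        sources = [line.rstrip("\n") for line in f]
    with io.open(os.path.join(data_dir, target_name), encoding="utf-8") as f:
        targets = [line.rstrip("\n") for line in f]
    if len(sources) != len(targets):
        # A length mismatch means the line-alignment assumption does not hold.
        raise ValueError("source has %d lines but target has %d"
                         % (len(sources), len(targets)))
    return list(zip(sources, targets))
```

For example, `read_parallel("./data/er")[:3]` would show the first three sentence/label pairs if the assumption holds; a `ValueError` is a sign that the files are organized differently.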
To reproduce:
python ./distant_supervision.py
- The `word2vec` directory is empty. Please download the Google News pretrained vector data from this Google Drive link and unzip it into that directory. It will be a `.bin` file.
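To confirm that the download unzipped correctly before preprocessing, one option is to read the header of the `.bin` file. The sketch below relies on the word2vec binary format, which begins with a single ASCII line holding the vocabulary size and the vector dimension; `read_word2vec_header` is a hypothetical helper and the file path is whatever your unzipped file is called.

```python
def read_word2vec_header(path):
    """Return (vocab_size, vector_dim) from a word2vec binary file.

    The binary format starts with one ASCII line "<vocab_size> <vector_dim>",
    followed by the words and their raw float vectors.
    """
    with open(path, "rb") as f:
        header = f.readline().decode("ascii").split()
    if len(header) != 2:
        raise ValueError("unexpected header: %r" % (header,))
    return int(header[0]), int(header[1])
```

The Google News vectors are distributed as 300-dimensional embeddings, so a different second value would suggest a wrong or corrupted file.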
python ./util.py
It creates the `vocab.txt`, `ids.txt` and `emb.npy` files.
- Binary classification (ER-CNN):
python ./train.py --sent_len=3 --vocab_size=11208 --num_classes=2 --train_size=15000 \
    --data_dir=./data/er --attention=False --multi_label=False --use_pretrain=False
- Multi-label multi-instance learning (MLMI-CNN):
python ./train.py --sent_len=255 --vocab_size=36112 --num_classes=23 --train_size=10000 \
    --data_dir=./data/mlmi --attention=True --multi_label=True --use_pretrain=True
- Multi-label multi-instance context-wise learning (MLMI-CONT):
python ./train_context.py --sent_len=102 --vocab_size=36112 --num_classes=23 --train_size=10000 \
    --data_dir=./data/mlmi --attention=True --multi_label=True --use_pretrain=True
Caution: A wrong value for any of the input-data-dependent options (`sent_len`, `vocab_size` and `num_classes`) may cause an error. If you want to train the model on another dataset, please check these values.
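When moving to another dataset, a rough sanity check on two of these values can be scripted. The sketch below is only an approximation under stated assumptions: it assumes one example per line with whitespace-separated tokens, and a vocabulary file with one token per line. The file formats are not documented here and the project's own preprocessing may tokenize or pad differently, so both helper functions are hypothetical.

```python
import io


def max_tokens_per_line(path):
    """Length of the longest line in `path`, in whitespace-separated tokens."""
    longest = 0
    with io.open(path, encoding="utf-8") as f:
        for line in f:
            longest = max(longest, len(line.split()))
    return longest


def count_nonempty_lines(path):
    """Number of non-empty lines, e.g. entries of a one-token-per-line vocabulary."""
    with io.open(path, encoding="utf-8") as f:
        return sum(1 for line in f if line.strip())
```

Comparing `max_tokens_per_line` on a source file with `--sent_len`, and `count_nonempty_lines` on the generated `vocab.txt` with `--vocab_size`, gives a first hint. A mismatch may simply mean that the assumptions above do not match the real file format, so treat it as a prompt to inspect the files rather than as proof of a wrong flag.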
python ./eval.py --train_dir=./train/1473898241
Replace the `--train_dir` value with the output directory from training.
tensorboard --logdir=./train/1473898241
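Both commands above take a run directory such as `./train/1473898241`. The name looks like a Unix timestamp, which suggests (this is an assumption, not something the README states) that each training run writes to a directory named after its start time. Under that assumption, the most recent run can be located as sketched below; `latest_run_dir` is a hypothetical helper.

```python
import os


def latest_run_dir(train_root="./train"):
    """Return the sub-directory of `train_root` with the largest numeric name.

    Assumes run directories are named with integer timestamps; returns None
    if no such directory exists.
    """
    runs = [name for name in os.listdir(train_root)
            if name.isdigit() and os.path.isdir(os.path.join(train_root, name))]
    if not runs:
        return None
    # Compare as integers so that "100" sorts after "99".
    return os.path.join(train_root, max(runs, key=int))
```

The returned path can then be passed as `--train_dir` to `eval.py` or as `--logdir` to `tensorboard`.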
Model | P | R | F | AUC | init_lr | l2_reg |
---|---|---|---|---|---|---|
ER-CNN | 0.9410 | 0.8630 | 0.9003 | 0.9303 | 0.005 | 0.05 |
MLMI-CNN | 0.8205 | 0.6406 | 0.7195 | 0.7424 | 1e-3 | 1e-4 |
MLMI-CONT | 0.8819 | 0.7158 | 0.7902 | 0.8156 | 1e-3 | 1e-4 |
*As you see above, these models somewhat suffer from overfitting ...
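The table does not say how the F column is computed. The reported values are consistent with the standard F1 score, the harmonic mean of precision and recall, which the short check below recomputes from the P and R columns. Whether P and R are micro- or macro-averaged in the multi-label settings is not stated and is not assumed here.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


# (P, R, reported F) copied from the results table.
reported = {
    "ER-CNN": (0.9410, 0.8630, 0.9003),
    "MLMI-CNN": (0.8205, 0.6406, 0.7195),
    "MLMI-CONT": (0.8819, 0.7158, 0.7902),
}

for model, (p, r, f) in sorted(reported.items()):
    print("%-9s computed F1 = %.4f, reported F = %.4f" % (model, f1(p, r), f))
```

For all three models the recomputed value agrees with the reported F to four decimal places.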
- http://github.com/yuhaozhang/sentence-convnet
- http://github.com/dennybritz/cnn-text-classification-tf
- http://www.wildml.com/2015/12/implementing-a-cnn-for-text-classification-in-tensorflow/
- http://tkengo.github.io/blog/2016/03/14/text-classification-by-cnn/
- Adel et al. Comparing Convolutional Neural Networks to Traditional Models for Slot Filling. NAACL 2016.
- Nguyen and Grishman. Relation Extraction: Perspective from Convolutional Neural Networks. NAACL 2015.
- Lin et al. Neural Relation Extraction with Selective Attention over Instances. ACL 2016.