This is a TensorFlow implementation of TextCNN, proposed by Yoon Kim in the paper Convolutional Neural Networks for Sentence Classification.
There are two implementations here. In the earlier one, in the old directory, I structured the model as a class exposing interfaces such as inference, training, and loss. Later I found that training from TFRecord datasets is more efficient, so I reimplemented the project around tf.data. The new version is in the new directory.
There is an excellent tutorial here. That blog post and the implementation it walks through were a great help to me.
- Download the Google (Mikolov) word2vec file.
- Preprocess the movie review data, build the vocabulary, create datasets for training and validation, and store them in TFRecord files:
```shell
cd new
python preprocess_dataset.py --pos_input_file /path/to/positive/examples/file --neg_input_file /path/to/negative/examples/file --output_dir /path/to/save/tfrecords
```
Note: the cleaned movie review dataset (rt-polarity.pos and rt-polarity.neg) originally comes from Yoon Kim's repository. You can use these files directly to generate the TFRecords.
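The vocabulary-building part of the preprocessing step can be sketched in plain Python. This is a simplified illustration, not the exact logic of preprocess_dataset.py: the cleanup regex, the special tokens, and the helper names (`clean_str`, `build_vocab`, `to_ids`) are assumptions for the example.

```python
import re
from collections import Counter

def clean_str(text):
    # Simplified token cleanup for the rt-polarity data; the repository's
    # preprocess_dataset.py may differ in detail.
    text = re.sub(r"[^a-z0-9(),!?'`]", " ", text.lower())
    return text.split()

def build_vocab(sentences, min_count=1):
    # Reserve id 0 for padding and id 1 for out-of-vocabulary words
    # (an assumed convention, not necessarily the repository's).
    counts = Counter(tok for s in sentences for tok in clean_str(s))
    vocab = {"<pad>": 0, "<unk>": 1}
    for tok, c in counts.most_common():
        if c >= min_count:
            vocab[tok] = len(vocab)
    return vocab

def to_ids(sentence, vocab):
    # Map a raw sentence to word ids, falling back to <unk> for unseen words.
    return [vocab.get(tok, vocab["<unk>"]) for tok in clean_str(sentence)]
```

The resulting id sequences are what get serialized into the TFRecord files, with the vocabulary saved separately so train.py can align it with the word2vec embeddings.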
- Train the TextCNN:
```shell
python train.py --input_train_file_pattern "/path/to/save/tfrecords/train-?????-of-?????" --input_valid_file_pattern "/path/to/save/tfrecords/valid-?????-of-?????" --w2v_file /path/to/google/word2vec/file --vocab_file /path/to/vocab/file --train_dir /path/to/save/checkpoints
```
With the default settings in configuration.py, the model reaches a dev accuracy of about 78% without any hyperparameter tuning.
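The training script consumes the sharded TFRecord files through a tf.data input pipeline. Below is a minimal sketch of what such a pipeline can look like; the feature names ("sentence", "label") and the use of the TF 2 API are assumptions for illustration and may not match the repository's code exactly.

```python
import tensorflow as tf

def make_example(word_ids, label):
    # Serialize one review as a tf.train.Example. The feature names here
    # ("sentence", "label") are hypothetical; the real ones are fixed in
    # preprocess_dataset.py.
    return tf.train.Example(features=tf.train.Features(feature={
        "sentence": tf.train.Feature(int64_list=tf.train.Int64List(value=word_ids)),
        "label": tf.train.Feature(int64_list=tf.train.Int64List(value=[label])),
    }))

def parse_fn(serialized):
    # Parse one serialized Example back into a (word_ids, label) pair.
    features = tf.io.parse_single_example(serialized, {
        "sentence": tf.io.VarLenFeature(tf.int64),
        "label": tf.io.FixedLenFeature([], tf.int64),
    })
    return tf.sparse.to_dense(features["sentence"]), features["label"]

def make_dataset(file_pattern, batch_size=64):
    # Read the sharded files matching e.g. "train-?????-of-?????",
    # parse each record, and pad variable-length sentences per batch.
    files = tf.data.Dataset.list_files(file_pattern)
    ds = tf.data.TFRecordDataset(files)
    ds = ds.map(parse_fn)
    return ds.padded_batch(batch_size, padded_shapes=([None], []))
```

Padding per batch (rather than to a global maximum length) keeps batches small, which is one of the efficiency gains over feeding pre-padded arrays.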