PyTorch Implementation of Improving Reinforcement Learning Based Image Captioning with Natural Language Prior
Python 2.7
PyTorch 0.4 (along with torchvision)
cider package (copy from Here and dump them to cider/)
pycoco package (copy from Here and extract them to pycoco/)
You need to download pretrained ResNet models for both training and evaluation. The models can be downloaded from here and should be placed in data/imagenet_weights.
Download the preprocessed COCO captions from link, following Karpathy's split. Copy dataset_coco.json, captions_train.json, captions_val.json and captions_test.json into data/features.
Then do:
$ python scripts/prepro_labels.py --input_json data/dataset_coco.json --output_json data/cocotalk.json --output_h5 data/cocotalk
prepro_labels.py maps all words that occur <= 5 times to a special UNK token and creates a vocabulary for all the remaining words. The image information and vocabulary are dumped into data/cocotalk.json, and the discretized caption data are dumped into data/cocotalk_label.h5.
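For reference, the core of that vocabulary step looks roughly like this (a minimal sketch; the actual thresholds, tokenization, and I/O live in scripts/prepro_labels.py):

from collections import Counter

def build_vocab(tokenized_captions, threshold=5):
    # Count word frequencies over all training captions.
    counts = Counter(w for cap in tokenized_captions for w in cap)
    # Keep words seen more than `threshold` times; everything else maps to UNK.
    vocab = [w for w, n in counts.items() if n > threshold]
    vocab.append('UNK')
    return vocab

def encode(cap, vocab):
    # Replace rare words with the UNK token before discretization.
    known = set(vocab)
    return [w if w in known else 'UNK' for w in cap]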
Download the COCO images from link. We need the 2014 training images and the 2014 validation images. Put the train2014/ and val2014/ folders in the same directory, denoted as $IMAGE_ROOT.
Then:
$ python scripts/prepro_feats.py --input_json data/dataset_coco.json --output_dir data/cocotalk --images_root $IMAGE_ROOT
prepro_feats.py extracts the ResNet-101 features (both the fc feature and the last conv feature) of each image. The features are saved in data/cocotalk_fc and data/cocotalk_att, and the resulting files are about 200GB.
(Check the prepro scripts for more options, like other resnet models or other attention sizes.)
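Conceptually, the script runs each image through ResNet-101 and keeps two things: the last convolutional feature map and its global average pool. A minimal sketch using torchvision (the real script also handles resizing, normalization, and saving to disk):

import torch
import torchvision.models as models

resnet = models.resnet101(pretrained=True)
resnet.eval()

def extract_feats(img):
    # img: (1, 3, H, W) tensor, ImageNet-normalized
    with torch.no_grad():
        x = resnet.conv1(img)
        x = resnet.bn1(x)
        x = resnet.relu(x)
        x = resnet.maxpool(x)
        x = resnet.layer1(x)
        x = resnet.layer2(x)
        x = resnet.layer3(x)
        x = resnet.layer4(x)        # last conv feature map: (1, 2048, h, w)
        fc = x.mean(3).mean(2)      # global average pool -> fc feature: (1, 2048)
    return fc.squeeze(0), x.squeeze(0)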
In order to help the CIDEr-based REINFORCE algorithm converge faster and more stably, we need to warm-start the captioning model by running the script below:
$ python train_warm.py --caption_model fc
If you want to use the attention model, run:
$ python train_warm.py --caption_model att
Alternatively, you can download our pretrained warm-start model from this link. The best CIDEr scores on the validation set are 90.1 for FC and 94.2 for Attention. Then run self-critical training with the CIDEr reward:
$ python train_sc_cider.py --caption_model att
You will see a large boost in the CIDEr score, but with lots of bad endings.
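For orientation, the core idea of self-critical sequence training is to use the reward of the greedily decoded caption as a REINFORCE baseline. A minimal sketch, assuming illustrative tensor names rather than the repo's actual API:

import torch

def scst_loss(sample_logprobs, sample_cider, greedy_cider, mask):
    # Self-critical baseline: the greedy decode's CIDEr score.
    # sample_logprobs: (batch, seq_len) log-probs of the sampled caption
    # sample_cider, greedy_cider: (batch,) CIDEr rewards; mask: (batch, seq_len)
    advantage = (sample_cider - greedy_cider).unsqueeze(1)
    advantage = advantage.expand_as(sample_logprobs)
    # REINFORCE: push up log-probs of samples that beat the baseline.
    loss = -(sample_logprobs * advantage * mask).sum() / mask.sum()
    return loss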
First, you should preprocess the dataset and get the n-gram data:
$ python get_ngram.py
This will generate fourgram.pkl and trigram.pkl in data/.
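These pickles presumably store n-gram count tables of this kind; a minimal sketch (function and variable names here are illustrative, not the repo's exact format):

import pickle
from collections import defaultdict

def count_ngrams(tokenized_captions, n):
    # Count how often each n-gram appears in the training captions.
    counts = defaultdict(int)
    for cap in tokenized_captions:
        for i in range(len(cap) - n + 1):
            counts[tuple(cap[i:i + n])] += 1
    return dict(counts)

# e.g. pickle.dump(count_ngrams(train_caps, 4), open('data/fourgram.pkl', 'wb'))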
Then:
$ python train_fourgram.py --caption_model fc
It will take almost 40,000 iterations to converge, and the experiment details are written to experiment.log in save_dir.
First, you should train a neural language model, or you can download our pretrained LSTM language model from link.
$ python train_rnnlm.py
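For orientation, train_rnnlm.py trains a word-level LSTM language model over the caption corpus. A minimal sketch of such a model (the architecture and sizes here are assumptions, not the repo's exact configuration):

import torch.nn as nn

class RNNLM(nn.Module):
    def __init__(self, vocab_size, emb_dim=512, hidden_dim=512):
        super(RNNLM, self).__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tokens, state=None):
        # tokens: (batch, seq_len) word indices; returns next-word logits.
        h, state = self.lstm(self.embed(tokens), state)
        return self.out(h), state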
Then train the RL setting with the neural language model constraint, starting from the same warm-start model:
$ python train_rnnlm_cider.py --caption_model fc
or
$ python train_rnnlm_cider.py --caption_model att
It will take almost 36,000 iterations to converge, and the experiment details are written to experiment.log in save_dir.
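Conceptually, the language-model constraint mixes the pretrained LM's likelihood of each sampled caption into the CIDEr reward before the REINFORCE update. A rough sketch, where the weighting and the exact additive form are assumptions, not the repo's values:

def combined_reward(cider_scores, lm_logprobs, alpha=0.5):
    # cider_scores: (batch,) CIDEr of each sampled caption
    # lm_logprobs:  (batch,) mean per-word log-prob under the pretrained LM
    # alpha and the additive combination are illustrative assumptions.
    return cider_scores + alpha * lm_logprobs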
To evaluate a trained model, run:
$ python Eval_model.py --caption_model fc --rl_type fourgram
We also tried another neural network structure and obtained similar results. Please see MoreNet.md for more details.
Thanks to the original self-critical implementation by ruotianluo.