PyTorch Implementation of Improving Reinforcement Learning Based Image Captioning with Natural Language Prior

Requirements

Python 2.7

PyTorch 0.4 (along with torchvision)

cider package (copy the files from here and place them in cider/)

pycoco package (copy the files from here and extract them to pycoco/)

You need to download the pretrained ResNet model for both training and evaluation. The models can be downloaded from here and should be placed in data/imagenet_weights.

Train your own network on COCO

Download COCO captions and preprocess them

Download the preprocessed COCO captions from link, which follow Karpathy's split. Copy dataset_coco.json, captions_train.json, captions_val.json, and captions_test.json into data/features.

Then do:

$ python scripts/prepro_labels.py --input_json data/dataset_coco.json --output_json data/cocotalk.json --output_h5 data/cocotalk

prepro_labels.py maps all words that occur <= 5 times to a special UNK token and creates a vocabulary from the remaining words. The image information and vocabulary are dumped into data/cocotalk.json, and the discretized caption data are dumped into data/cocotalk_label.h5.
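The thresholding step described above can be illustrated with a minimal sketch. This is not the code in prepro_labels.py; the function name `build_vocab` and its parameters are hypothetical, and the captions are assumed to be already tokenized into lists of words.

```python
from collections import Counter


def build_vocab(captions, count_threshold=5, unk_token="UNK"):
    """Hypothetical helper: captions is a list of token lists.

    Returns (vocab, mapped), where words occurring <= count_threshold
    times are replaced by unk_token in the mapped captions.
    """
    counts = Counter(w for caption in captions for w in caption)
    # Keep only words that occur more than count_threshold times.
    vocab = sorted(w for w, n in counts.items() if n > count_threshold)
    # Add the UNK token only if at least one rare word exists.
    if any(n <= count_threshold for n in counts.values()):
        vocab.append(unk_token)
    kept = set(vocab)
    mapped = [[w if w in kept else unk_token for w in caption]
              for caption in captions]
    return vocab, mapped


# Usage: "zebra" occurs once, so it is mapped to UNK.
captions = [["a", "dog"]] * 6 + [["a", "zebra"]]
vocab, mapped = build_vocab(captions)
```

Here `vocab` holds the frequent words plus the UNK token, and the last caption in `mapped` has its rare word replaced.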

Download COCO dataset and pre-extract the image features

Download the COCO images from link. We need the 2014 training images and the 2014 validation images. Put train2014/ and val2014/ in the same directory, denoted as $IMAGE_ROOT.

Then:

$ python scripts/prepro_feats.py --input_json data/dataset_coco.json --output_dir data/cocotalk --images_root $IMAGE_ROOT

prepro_feats.py extracts the ResNet-101 features (both the fc feature and the last conv feature) of each image. The features are saved in data/cocotalk_fc and data/cocotalk_att, and the resulting files are about 200GB.

(Check the prepro scripts for more options, like other resnet models or other attention sizes.)

Warm Start

To help the CIDEr-based REINFORCE algorithm converge faster and more stably, we need to warm start the captioning model by running the script below:

$ python train_warm.py --caption_model fc 

If you want to use the Attention model, run:

$ python train_warm.py --caption_model att 

You can also download our pretrained warm start models from this link. The best CIDEr scores on the validation set are 90.1 for FC and 94.2 for Attention.

Train using Self-critical

$ python train_sc_cider.py --caption_model att 

You will see a large boost in CIDEr score, but many of the generated captions will have bad endings.
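For orientation, the general idea of self-critical training is that the reward of a sampled caption is compared against the reward of the greedily decoded caption, which acts as the baseline. The sketch below shows only that arithmetic for a single caption. It is not taken from train_sc_cider.py; the function name and arguments are hypothetical, and the rewards are assumed to be precomputed CIDEr scores.

```python
def self_critical_loss(sample_logprobs, sample_reward, greedy_reward):
    """Hypothetical single-caption sketch of a self-critical loss.

    sample_logprobs: log-probabilities of the sampled words.
    sample_reward:   reward (e.g. CIDEr) of the sampled caption.
    greedy_reward:   reward of the greedy caption, used as the baseline.
    """
    # Positive advantage: the sample beat the greedy baseline.
    advantage = sample_reward - greedy_reward
    # REINFORCE-style loss: minimizing it raises the probability of
    # samples with positive advantage and lowers it for negative ones.
    return -advantage * sum(sample_logprobs)


# Usage: the sampled caption scores higher than the greedy one.
loss = self_critical_loss([-0.5, -1.0], sample_reward=1.2, greedy_reward=0.9)
```

Because the reward only measures CIDEr, nothing in this objective penalizes a caption that ends in an incomplete phrase.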

Train using Ngram constraint

First, preprocess the dataset to get the n-gram data:

$ python get_ngram.py

This generates fourgram.pkl and trigram.pkl in data/.
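As a rough picture of what n-gram data looks like, the sketch below counts trigrams and four-grams over tokenized captions and shows how such counts could be pickled. The actual contents and layout of fourgram.pkl and trigram.pkl are not documented here, so the `Counter` format and the helper names `count_ngrams` and `save_ngrams` are assumptions.

```python
import pickle
from collections import Counter


def count_ngrams(captions, n):
    """Count every contiguous n-gram in a list of token lists."""
    counts = Counter()
    for tokens in captions:
        # Captions shorter than n contribute nothing.
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def save_ngrams(counts, path):
    """Hypothetical helper: pickle the counts to the given path."""
    with open(path, "wb") as f:
        pickle.dump(counts, f)


# Usage on two toy captions.
captions = [
    ["a", "dog", "on", "a", "couch"],
    ["a", "dog", "on", "a", "bed"],
]
trigrams = count_ngrams(captions, 3)
fourgrams = count_ngrams(captions, 4)
```

An n-gram that was never seen in the training captions, such as one ending a sentence on "on a", would have a count of zero in such a table.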

Then run:

$ python train_fourgram.py  --caption_model fc 

It takes almost 40,000 iterations to converge. The experiment details are written to experiment.log in save_dir.

Train using Neural Language model

First, train a neural language model, or download our pretrained LSTM language model from link.

$ python train_rnnlm.py

Then train in the RL setting with the neural language model constraint, starting from the same warm start model.

$ python train_rnnlm_cider.py  --caption_model fc 

or

$ python train_rnnlm_cider.py  --caption_model att 

It takes almost 36,000 iterations to converge. The experiment details are written to experiment.log in save_dir.


Evaluating CIDEr, METEOR, ROUGE-L, and BLEU scores with bad ending removal

$ python Eval_model.py  --caption_model fc --rl_type fourgram
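To give a sense of what bad ending removal could mean, the sketch below strips trailing function words from a caption before scoring. The rule actually used by Eval_model.py is not described here, so both the word list `BAD_ENDING_WORDS` and the helper `strip_bad_ending` are hypothetical.

```python
# Hypothetical list of words that should not end a caption.
BAD_ENDING_WORDS = {"a", "an", "the", "with", "of", "on", "in", "and"}


def strip_bad_ending(caption, bad_words=BAD_ENDING_WORDS):
    """Drop trailing words that leave the caption unfinished."""
    tokens = caption.split()
    # Keep removing the last word while it is in the bad-ending set.
    while tokens and tokens[-1] in bad_words:
        tokens.pop()
    return " ".join(tokens)


# Usage: the dangling "on a" is removed, a complete caption is unchanged.
cleaned = strip_bad_ending("a man riding a wave on a")
unchanged = strip_bad_ending("a dog on a couch")
```

Only the end of the caption is touched, so function words in the middle of a sentence are preserved.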

Try another network structure

We also tried another neural network structure and obtained similar results. Please see MoreNet.md for more details.

Acknowledgements

Thanks to ruotianluo for the original self-critical implementation.