(Note: You can read an in-depth tutorial about the implementation in this blog post.)
This is an implementation of an image captioning model based on Vinyals et al., with a few differences:

- For the CNN we use Inception v3 instead of Inception v1.
- For the RNN we use a multi-layered LSTM instead of a single-layered one.
- We don't have a special start-of-sentence word, so we feed the first word at t = 1 instead of t = 2.
- We use different values for some hyperparameters (an illustrative sketch of the resulting architecture follows the table):
Hyperparameter | Value |
---|---|
Learning rate | 0.00051 |
Batch size | 32 |
Epochs | 33 |
Dropout rate | 0.22 |
Embedding size | 300 |
LSTM output size | 300 |
LSTM layers | 3 |
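For illustration, here is a minimal Keras-style sketch of the architecture described above. This is not the repository's actual code: `VOCAB_SIZE` and the exact wiring are assumptions; only the hyperparameter values come from the table.

```python
# A minimal sketch of the architecture above -- NOT the repository's code.
# VOCAB_SIZE is an assumption; the real value depends on the dataset.
from keras.applications.inception_v3 import InceptionV3
from keras.layers import (Dense, Dropout, Embedding, Input, LSTM,
                          RepeatVector, TimeDistributed, concatenate)
from keras.models import Model
from keras.optimizers import Adam

VOCAB_SIZE = 8000        # assumption
EMBEDDING_SIZE = 300     # from the table above
LSTM_SIZE = 300
LSTM_LAYERS = 3
DROPOUT_RATE = 0.22
LEARNING_RATE = 0.00051

# CNN encoder: Inception v3 (instead of Inception v1), pooled to one vector.
image_input = Input(shape=(299, 299, 3))
cnn = InceptionV3(include_top=False, pooling='avg', weights='imagenet')
image_embedding = Dense(EMBEDDING_SIZE)(cnn(image_input))

# Word embeddings for the caption tokens.
caption_input = Input(shape=(None,), dtype='int32')
word_embeddings = Embedding(VOCAB_SIZE, EMBEDDING_SIZE)(caption_input)

# The image acts as step t = 0 and the first word follows at t = 1,
# so no special start-of-sentence token is needed.
sequence = concatenate([RepeatVector(1)(image_embedding), word_embeddings],
                       axis=1)

# RNN decoder: a multi-layered LSTM instead of a single-layered one.
x = sequence
for _ in range(LSTM_LAYERS):
    x = LSTM(LSTM_SIZE, return_sequences=True)(x)
    x = Dropout(DROPOUT_RATE)(x)

# Per-timestep softmax over the vocabulary.
outputs = TimeDistributed(Dense(VOCAB_SIZE, activation='softmax'))(x)

model = Model(inputs=[image_input, caption_input], outputs=outputs)
model.compile(optimizer=Adam(lr=LEARNING_RATE),
              loss='categorical_crossentropy')
```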
Quantitatively, the proposed model's performance is on par with Vinyals' model on the Flickr8k dataset:
Metric | Proposed Model | Vinyals' Model |
---|---|---|
BLEU-1 | 61.8 | 63 |
BLEU-2 | 40.8 | 41 |
BLEU-3 | 27.8 | 27 |
BLEU-4 | 19.0 | N/A |
METEOR | 21.5 | N/A |
CIDEr | 41.5 | N/A |
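As a point of reference, BLEU-n scores like those in the table can be computed with NLTK (the repository itself downloads pycocoevalcap data for evaluation; see the setup steps below). This toy example, with made-up sentences, shows the 0-100 scaling used in the table:

```python
# Illustrative BLEU-1..4 computation with NLTK on toy data -- not the
# repository's evaluation pipeline (which uses pycocoevalcap).
from nltk.translate.bleu_score import corpus_bleu

# One image, two reference captions, one hypothesis (all made up).
references = [[['a', 'dog', 'runs', 'on', 'the', 'grass'],
               ['a', 'dog', 'is', 'running', 'outside']]]
hypotheses = [['a', 'dog', 'runs', 'on', 'grass']]

for n in range(1, 5):
    weights = tuple([1.0 / n] * n)  # uniform n-gram weights for BLEU-n
    score = corpus_bleu(references, hypotheses, weights=weights)
    print('BLEU-%d: %.1f' % (n, 100 * score))  # table uses a 0-100 scale
```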
- Download the dataset needed.

  ```sh
  ./scripts/download_dataset.sh
  ```

- Download pretrained word vectors.

  ```sh
  ./scripts/download_pretrained_word_vectors.sh
  ```

- Download pycocoevalcap data.

  ```sh
  ./scripts/download_pycocoevalcap_data.sh
  ```

- Install the dependencies. Note: It was only tested on Python 2.7; it may need minor code changes to work on Python 3.

  ```sh
  # Optional: Create and activate your virtualenv / Conda environment
  pip install -r requirements.txt
  ```
- Set up `PYTHONPATH`.

  ```sh
  source ./scripts/setup_pythonpath.sh
  ```
- Download a pretrained model from the releases page.

- Copy `model-weights.hdf5` to `keras-image-captioning/results/flickr8k/final-model`.

- Now you can run inference from that checkpoint by executing the command below from the `keras-image-captioning` directory:

  ```sh
  python -m keras_image_captioning.inference \
    --dataset-type test \
    --method beam_search \
    --beam-size 3 \
    --training-dir results/flickr8k/final-model
  ```
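Conceptually, `--method beam_search --beam-size 3` keeps the three most probable partial captions at each decoding step instead of greedily taking the single best word. Here is a generic sketch of the idea, not the repository's implementation; `next_word_log_probs`, `start_token`, and `end_token` are hypothetical stand-ins:

```python
# A generic beam search sketch -- not the repository's implementation.
# `next_word_log_probs` is a hypothetical callable that maps a partial
# caption to (token, log-probability) pairs for the next word.
import heapq

def beam_search(next_word_log_probs, start_token, end_token,
                beam_size=3, max_length=20):
    # Each beam entry is (cumulative log-probability, token sequence).
    beams = [(0.0, [start_token])]
    for _ in range(max_length):
        candidates = []
        for log_prob, seq in beams:
            if seq[-1] == end_token:       # finished captions pass through
                candidates.append((log_prob, seq))
                continue
            for token, token_lp in next_word_log_probs(seq):
                candidates.append((log_prob + token_lp, seq + [token]))
        # Keep only the `beam_size` highest-scoring (partial) captions.
        beams = heapq.nlargest(beam_size, candidates, key=lambda c: c[0])
        if all(seq[-1] == end_token for _, seq in beams):
            break                          # every beam has finished
    return max(beams, key=lambda b: b[0])[1]
```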
To reproduce the model, execute:

```sh
python -m keras_image_captioning.training \
  --training-label repro-final-model \
  --from-training-dir results/flickr8k/final-model
```

There are many other arguments available; see `training.py` for the full list.
Once training finishes, you can run inference on the reproduced model:

```sh
python -m keras_image_captioning.inference \
  --dataset-type test \
  --method beam_search \
  --beam-size 3 \
  --training-dir var/flickr8k/training-results/repro-final-model
```
Notes:

- `--dataset-type` can be either `validation` or `test`.
- You can view the generated captions at `var/flickr8k/training-results/repro-final-model/test-predictions-3-20.yaml` and compare them with my results at `results/flickr8k/final-model/test-predictions-3-20.yaml`.
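If you want to inspect the predictions programmatically, something like the following should work, assuming PyYAML is installed. The exact layout of the YAML file is repository-specific, so this snippet only loads it and peeks at a few entries:

```python
# Hypothetical snippet for peeking at the generated captions.
# The YAML layout is repository-specific; we only load and print a sample.
import yaml

path = ('var/flickr8k/training-results/repro-final-model/'
        'test-predictions-3-20.yaml')
with open(path) as f:
    predictions = yaml.safe_load(f)

for entry in list(predictions)[:3]:  # first few items (or keys, if a mapping)
    print(entry)
```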
MIT License. See LICENSE file for details.