
ConvNeXt for Image Captioning

About the project

This study explores the effectiveness of the ConvNeXt model, an advanced computer vision architecture, in the task of image captioning. We integrated ConvNeXt with a Long Short-Term Memory (LSTM) network that includes a visual attention module, focusing on assessing its performance across different scenarios. Experiments were conducted with various ConvNeXt versions for feature extraction, different learning rates during training, and the inclusion or exclusion of teacher forcing. The MS COCO 2014 dataset was employed, with top-5 accuracy and BLEU metrics used to evaluate performance.

The implementation of ConvNeXt in image captioning systems reveals notable performance enhancements. In terms of BLEU-4 scores, ConvNeXt outperformed existing benchmarks by 43.04% for models using soft attention and by 39.04% for those with hard attention mechanisms. Furthermore, ConvNeXt surpassed models based on vision transformers and data-efficient image transformers by 4.57% and 0.93%, respectively, in BLEU-4 scores. Compared with systems using encoders such as ResNet-101, ResNet-152, VGG-16, ResNeXt-101, and MobileNet V3, ConvNeXt achieved top-5 accuracy improvements of 6.44%, 6.46%, 6.47%, 6.39%, and 6.68%, and reduced loss by 18.46%, 18.44%, 18.46%, 18.24%, and 18.72%, respectively.
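As a rough illustration of the architecture described above, the sketch below pairs a torchvision ConvNeXt backbone with a soft-attention LSTM decoder in the style of Show, Attend and Tell. All class names, dimensions, and hyperparameters here are illustrative assumptions, not the repository's exact implementation (see models.py for that):

import torch
import torch.nn as nn
import torchvision

class ConvNeXtEncoder(nn.Module):
    """Extracts spatial feature maps with a pretrained ConvNeXt backbone."""
    def __init__(self):
        super().__init__()
        backbone = torchvision.models.convnext_base(weights="DEFAULT")
        self.features = backbone.features  # output: (B, 1024, H/32, W/32)

    def forward(self, images):
        fmap = self.features(images)            # (B, 1024, h, w)
        return fmap.flatten(2).transpose(1, 2)  # (B, h*w, 1024) locations

class SoftAttention(nn.Module):
    """Additive (Bahdanau-style) attention over encoder locations."""
    def __init__(self, enc_dim, dec_dim, attn_dim):
        super().__init__()
        self.enc_proj = nn.Linear(enc_dim, attn_dim)
        self.dec_proj = nn.Linear(dec_dim, attn_dim)
        self.score = nn.Linear(attn_dim, 1)

    def forward(self, enc_out, hidden):
        # enc_out: (B, L, enc_dim); hidden: (B, dec_dim)
        e = self.score(torch.tanh(
            self.enc_proj(enc_out) + self.dec_proj(hidden).unsqueeze(1)))
        alpha = torch.softmax(e, dim=1)          # attention weights (B, L, 1)
        context = (enc_out * alpha).sum(dim=1)   # attended image vector
        return context, alpha.squeeze(-1)

class AttnLSTMDecoder(nn.Module):
    """One LSTM step per token, conditioned on an attended image context."""
    def __init__(self, vocab_size, enc_dim=1024, emb_dim=512,
                 dec_dim=512, attn_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.attention = SoftAttention(enc_dim, dec_dim, attn_dim)
        self.lstm = nn.LSTMCell(emb_dim + enc_dim, dec_dim)
        self.fc = nn.Linear(dec_dim, vocab_size)

    def forward(self, enc_out, tokens, state):
        h, c = state
        context, _ = self.attention(enc_out, h)
        step_input = torch.cat([self.embed(tokens), context], dim=1)
        h, c = self.lstm(step_input, (h, c))
        return self.fc(h), (h, c)

With teacher forcing, the ground-truth token is fed as tokens at each training step; without it, the decoder's own previous prediction is fed back instead, which is the toggle the experiments above compare.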

Paper

https://ieeexplore.ieee.org/abstract/document/10311597


Installation

pip install git+https://github.com/Leo-Thomas/ConvNeXt-for-Image-Captioning.git

Description of the main files

  • train.py: Network training script.
  • caption.py: Runs inference to generate a caption for an input image (a sketch of such a decoding loop follows this list).
  • models.py: Contains the full architecture, including the encoder and decoder components.
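For intuition, here is a hedged sketch of what a greedy decoding loop like the one in caption.py might look like, reusing the hypothetical encoder and decoder modules sketched above; the special-token ids and zero-initialized LSTM state are assumptions:

import torch

@torch.no_grad()
def generate_caption(encoder, decoder, image, start_id, end_id, max_len=30):
    """Greedily decode one caption for a single preprocessed image tensor."""
    enc_out = encoder(image.unsqueeze(0))         # (1, L, enc_dim)
    h = torch.zeros(1, decoder.lstm.hidden_size)  # assumed zero-init state
    c = torch.zeros_like(h)
    token = torch.tensor([start_id])
    caption = []
    for _ in range(max_len):
        logits, (h, c) = decoder(enc_out, token, (h, c))
        token = logits.argmax(dim=1)              # greedy token choice
        if token.item() == end_id:
            break
        caption.append(token.item())
    return caption                                # list of token ids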

Dataset

We used the 2014 version of the MS COCO dataset. It is freely available from the official website and includes the training and validation subsets.
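If you work with torchvision, the captions can be loaded with its built-in CocoCaptions dataset (this requires pycocotools; the paths below are placeholders for wherever you downloaded the data):

import torchvision.transforms as T
from torchvision.datasets import CocoCaptions  # requires pycocotools

transform = T.Compose([
    T.Resize((224, 224)),  # ConvNeXt's standard input resolution
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406],  # ImageNet statistics
                std=[0.229, 0.224, 0.225]),
])

train_set = CocoCaptions(
    root="data/train2014",                               # image folder
    annFile="data/annotations/captions_train2014.json",  # annotations
    transform=transform,
)
image, captions = train_set[0]  # one image and its reference captions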

Contributing

Contributions are what make the open source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated.

If you have a suggestion that would make this better, please fork the repo and create a pull request. You can also simply open an issue with the tag "enhancement". Don't forget to give the project a star!

  1. Fork the Project
  2. Create your Feature Branch (git checkout -b feature/NewFeature)
  3. Commit your Changes (git commit -m 'Add some NewFeature')
  4. Push to the Branch (git push origin feature/NewFeature)
  5. Open a Pull Request


License

Distributed under the GNU General Public License v3.0. See LICENSE for more information.


Citation

@ARTICLE{10410861,
  author={Ramos, Leo and Casas, Edmundo and Romero, Cristian and Rivas-Echeverría, Francklin and Morocho-Cayamcela, Manuel Eugenio},
  journal={IEEE Access}, 
  title={A Study of ConvNeXt Architectures for Enhanced Image Captioning}, 
  year={2024},
  volume={12},
  number={},
  pages={13711-13728},
  doi={10.1109/ACCESS.2024.3356551}}

Contact

Leo Ramos - LinkedIn - [email protected]

Francklin Rivas - LinkedIn - [email protected]

Eugenio Morocho - LinkedIn - [email protected]

