- Project Overview
- Project Structure
- Datasets
- Evaluation Metrics
- Results
- Experimental Setup
- Conclusion
- How to Use
- Authors
- References
## Project Overview

This project evaluates the performance of the BLIP (Bootstrapping Language-Image Pre-training) model on the Visual Question Answering (VQA) task. We ran inference with a pretrained BLIP model on three datasets: VQA v2.0 training, VQA v2.0 validation, and DAQUAR. The model was pretrained on the VQA v2.0 training and validation sets but not on DAQUAR. We analyzed the model's performance using several evaluation metrics and documented our findings in this project.
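For reference, the snippet below shows one way to run VQA inference with a pretrained BLIP model via Hugging Face `transformers`. The `Salesforce/blip-vqa-base` checkpoint and the pre/post-processing shown here are illustrative assumptions; the exact setup used in `Evaluation.ipynb` may differ.

```python
# Minimal sketch of BLIP VQA inference (checkpoint choice is an assumption).
import torch
from PIL import Image
from transformers import BlipProcessor, BlipForQuestionAnswering

device = "cuda" if torch.cuda.is_available() else "cpu"
processor = BlipProcessor.from_pretrained("Salesforce/blip-vqa-base")
model = BlipForQuestionAnswering.from_pretrained("Salesforce/blip-vqa-base").to(device)

image = Image.open("example.jpg").convert("RGB")   # any RGB image
question = "How many dogs are in the picture?"

inputs = processor(image, question, return_tensors="pt").to(device)
with torch.no_grad():
    output_ids = model.generate(**inputs)          # autoregressive answer decoding
answer = processor.decode(output_ids[0], skip_special_tokens=True)
print(answer)
```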
## Project Structure

The project directory contains the following files and folders:

- `.git/`: Git repository metadata.
- `DAQUAR/`: Contains the DAQUAR dataset.
- `Evaluation.ipynb`: Jupyter notebook for evaluating the model.
- `LatexCode/`: Contains LaTeX code used for the report.
- `README.md`: This README file.
- `Report.pdf`: Detailed project report.
- `requirements.txt`: File listing the required Python packages.
- `Slides.pptx`: Presentation slides summarizing the project.
- `Visualization.ipynb`: Jupyter notebook for visualizing the datasets.
- `VQA_v2_Training/`: Contains the VQA training dataset.
- `VQA_v2_Val/`: Contains the VQA validation dataset.
## Datasets

We utilized the following datasets for evaluating the BLIP model:

| Dataset | # Images | # Questions | # Answers | Links |
|---|---|---|---|---|
| VQA v2.0 Training | 82,783 | 443,757 | 4,437,570 | VQA v2.0 Training |
| VQA v2.0 Validation | 40,504 | 214,354 | 2,143,540 | VQA v2.0 Validation |
| DAQUAR | 1,449 | 5,674 | 5,674 | DAQUAR |
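For orientation, the sketch below loads a VQA v2.0 question/annotation pair. The file names follow the standard VQA v2.0 distribution; the paths under `VQA_v2_Training/` are assumptions about this repository's layout.

```python
# Sketch of loading VQA v2.0 annotations (paths are assumed, adjust to the repo).
import json

with open("VQA_v2_Training/v2_OpenEnded_mscoco_train2014_questions.json") as f:
    questions = json.load(f)["questions"]      # question_id, image_id, question
with open("VQA_v2_Training/v2_mscoco_train2014_annotations.json") as f:
    annotations = json.load(f)["annotations"]  # 10 human answers per question

print(len(questions))                           # 443,757 for the training split
print(annotations[0]["answers"][0]["answer"])   # one of the 10 reference answers
```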
## Evaluation Metrics

We used a diverse set of evaluation metrics to assess the model's performance, each offering unique insight into its capabilities. Their full definitions can be found in the report, and a minimal computation sketch follows the list. Below are the metrics used:

- Accuracy
- BLEU Score
- BERT Score
- WUPS Score
- VQA Score
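As a rough illustration, the sketch below computes exact-match accuracy, BLEU-1 through BLEU-4, and BERT scores for a single prediction/reference pair. The library choices (`nltk`, `bert-score`) and the string normalization are assumptions; the report documents the exact variants we used.

```python
# Hedged sketch of per-answer metrics; not necessarily the notebooks' implementation.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from bert_score import score as bert_score

prediction, reference = "two dogs", "2 dogs"

# Exact-match accuracy on normalized strings.
accuracy = float(prediction.strip().lower() == reference.strip().lower())

# Cumulative BLEU-1..4 on token lists (smoothing avoids zeros on short answers).
smooth = SmoothingFunction().method1
bleu = [
    sentence_bleu([reference.split()], prediction.split(),
                  weights=tuple(1 / n for _ in range(n)), smoothing_function=smooth)
    for n in range(1, 5)
]

# BERTScore compares contextual embeddings rather than surface tokens.
P, R, F1 = bert_score([prediction], [reference], lang="en")
print(accuracy, bleu, P.item(), R.item(), F1.item())
```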
## Results

### Accuracy

| Dataset | Accuracy |
|---|---|
| VQA v2.0 Training | 0.769 |
| VQA v2.0 Validation | 0.766 |
| DAQUAR | 0.230 |
### BLEU Score

| Dataset | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 |
|---|---|---|---|---|
| VQA v2.0 Training | 0.763 | 0.552 | 0.438 | 0.349 |
| VQA v2.0 Validation | 0.760 | 0.551 | 0.438 | 0.354 |
| DAQUAR | 0.183 | 0.081 | 0.037 | 0.000 |
### BERT Score

| Dataset | Precision | Recall | F1 |
|---|---|---|---|
| VQA v2.0 Training | 0.985 | 0.986 | 0.985 |
| VQA v2.0 Validation | 0.985 | 0.985 | 0.985 |
| DAQUAR | 0.945 | 0.935 | 0.939 |
### WUPS Score

| Dataset | WUPS @ 0.0 | WUPS @ 0.9 |
|---|---|---|
| VQA v2.0 Training | 86.573 | 79.484 |
| VQA v2.0 Validation | 86.223 | 79.203 |
| DAQUAR | 58.122 | 30.680 |
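The WUPS numbers above follow the thresholded Wu-Palmer similarity over WordNet introduced by Malinowski and Fritz (2014). The single-word sketch below illustrates the core idea at a given threshold (0.0 or 0.9); the full metric aggregates over multi-word answer sets as defined in the report.

```python
# Illustrative thresholded Wu-Palmer similarity (requires nltk.download("wordnet")).
from nltk.corpus import wordnet as wn

def wup(a, b, threshold):
    """Max Wu-Palmer similarity over all synset pairs, down-weighted below threshold."""
    synsets_a, synsets_b = wn.synsets(a), wn.synsets(b)
    if not synsets_a or not synsets_b:
        return 0.0
    sim = max((sa.wup_similarity(sb) or 0.0) for sa in synsets_a for sb in synsets_b)
    return sim if sim >= threshold else 0.1 * sim   # the 0.1 penalty from the paper

print(wup("dog", "puppy", threshold=0.9))   # high WordNet similarity
print(wup("dog", "table", threshold=0.9))   # penalized below the threshold
```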
### VQA Score

| Dataset | VQA Score |
|---|---|
| VQA v2.0 Training | 84.89 |
| VQA v2.0 Validation | 84.73 |
| DAQUAR | - |
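The VQA Score follows the standard VQA accuracy, which counts a predicted answer as fully correct when at least 3 of the 10 human annotators gave it; DAQUAR provides a single reference answer per question (see the dataset table above), so this metric does not apply there. The sketch below is a simplification: the official evaluation also normalizes answers and averages over leave-one-out annotator subsets, and the table reports scores on a 0-100 scale.

```python
# Simplified VQA accuracy: full credit if >= 3 of 10 annotators agree.
def vqa_score(prediction, human_answers):
    matches = sum(a == prediction for a in human_answers)
    return min(matches / 3.0, 1.0)

print(vqa_score("2", ["2", "2", "two", "2", "3", "2", "2", "two", "2", "2"]))  # 1.0
```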
## Experimental Setup

To handle their large size, the VQA v2.0 datasets were split for parallel processing: the training set into 10 parts and the validation set into 2. The DAQUAR dataset was small enough to process as a whole.
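As an illustration of this setup, the hypothetical helper below splits a question list into contiguous shards; `make_shards` is not part of the repository, and the actual splitting may have been done differently.

```python
# Hypothetical sharding helper matching the shard counts described above.
def make_shards(items, n_shards):
    """Split items into n_shards roughly equal contiguous chunks."""
    size = -(-len(items) // n_shards)            # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]

demo = list(range(443_757))                      # stand-in for the training questions
shards = make_shards(demo, 10)                   # 10 parts for VQA v2.0 train
print([len(s) for s in shards])
```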
## Conclusion

The BLIP model performs strongly on the VQA v2.0 datasets, with high precision and recall. However, it struggles on the DAQUAR dataset, as evidenced by much lower accuracy and BLEU scores. Even so, high BERT scores indicate strong semantic similarity in the model's answers across all datasets. This project highlights the importance of using multiple evaluation metrics to gain a comprehensive understanding of model performance.
## How to Use

Ensure you have the following installed:

- Python 3.x
- Jupyter Notebook
- Required Python packages (see `requirements.txt`)

1. Clone the repository:

   ```bash
   git clone https://github.com/shreyas21563/VQA-using-BLIP
   ```

2. Navigate to the project directory:

   ```bash
   cd VQA-using-BLIP
   ```

3. Install the required packages:

   ```bash
   pip install -r requirements.txt
   ```

4. Open `Evaluation.ipynb` and `Visualization.ipynb` to run the evaluation and visualization code.
## Authors

- Shreyas Kabra
- Ritwik Harit
- Vasan Vohra
## References

- Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation. In *International Conference on Machine Learning*, pages 12888–12900. PMLR, 2022.
- Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv preprint arXiv:1810.04805, 2018.
- Mateusz Malinowski and Mario Fritz. A Multi-World Approach to Question Answering about Real-World Scenes Based on Uncertain Input. In *Advances in Neural Information Processing Systems*, 27, 2014.
- George A. Miller. WordNet: A Lexical Database for English. Communications of the ACM, 38(11):39–41, 1995.