
CLIP_VizWiz_Question_Answering

Implements a CLIP-based approach to Visual Question Answering (VQA) on the VizWiz dataset. The pipeline:

- Loads the data and splits it with stratified sampling on answer type and answerability.
- Selects the most common answer for each question, breaking ties with Levenshtein distance (see the sketch below).
- Encodes image-question pairs with a CLIP ViT-L/14@336px model, applying data augmentation.
- Trains a VQA model with an auxiliary answer-type loss, together with a separate answerability model (sketched after the encoding example).
- Evaluates with the VizWiz accuracy and answerability metrics.

The approach achieves an accuracy of 42.0% and an answerability of 82.8%, indicating that it is effective at answering open-ended questions about images.
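A minimal sketch of the answer-selection step, assuming each VizWiz question comes with its list of ten crowd answers; the tie-break by total Levenshtein distance uses the `python-Levenshtein` package and is an illustrative reconstruction, not the repository's exact code.

```python
from collections import Counter

import Levenshtein  # pip install python-Levenshtein


def select_answer(answers: list[str]) -> str:
    """Pick the most frequent answer; break frequency ties by choosing the
    candidate with the smallest total Levenshtein distance to all answers."""
    counts = Counter(answers)
    best = max(counts.values())
    candidates = [a for a, c in counts.items() if c == best]
    if len(candidates) == 1:
        return candidates[0]
    return min(
        candidates,
        key=lambda cand: sum(Levenshtein.distance(cand, other) for other in answers),
    )


# Example (hypothetical crowd answers for one VizWiz question):
answers = ["red bull", "redbull", "red bull", "energy drink", "red bull", "soda"]
print(select_answer(answers))  # -> "red bull"
```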
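The image-question encoding could look like the following sketch, which uses the openai/CLIP package to load the ViT-L/14@336px checkpoint. The file name and the concatenation of the two embeddings into a single pair feature are assumptions for illustration, and data augmentation is omitted.

```python
import clip  # https://github.com/openai/CLIP
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"

# ViT-L/14@336px: the highest-resolution published CLIP checkpoint (336x336 input).
model, preprocess = clip.load("ViT-L/14@336px", device=device)

image = preprocess(Image.open("vizwiz_example.jpg")).unsqueeze(0).to(device)  # hypothetical file
question = clip.tokenize(["What is written on this can?"], truncate=True).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)   # (1, 768)
    text_features = model.encode_text(question)  # (1, 768)

# One common choice is to concatenate the two embeddings as the input to the
# VQA and answerability heads (an assumption here, not necessarily this repo's fusion).
pair_features = torch.cat([image_features, text_features], dim=-1)  # (1, 1536)
```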
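A hypothetical sketch of the training setup: a classification head over the answer vocabulary plus an auxiliary answer-type head, with the answerability model kept as a separate binary classifier as described above. Layer sizes, the answer-vocabulary size, and the auxiliary-loss weight are placeholders, not values from the repository.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class VQAModel(nn.Module):
    """Answer classifier with an auxiliary answer-type head over the fused CLIP features."""

    def __init__(self, feat_dim: int = 1536, num_answers: int = 5000, num_answer_types: int = 4):
        super().__init__()
        self.answer_head = nn.Linear(feat_dim, num_answers)
        self.type_head = nn.Linear(feat_dim, num_answer_types)  # yes/no, number, other, unanswerable

    def forward(self, pair_features: torch.Tensor):
        return self.answer_head(pair_features), self.type_head(pair_features)


class AnswerabilityModel(nn.Module):
    """Separate binary classifier predicting whether a question is answerable."""

    def __init__(self, feat_dim: int = 1536):
        super().__init__()
        self.head = nn.Linear(feat_dim, 1)

    def forward(self, pair_features: torch.Tensor):
        return self.head(pair_features).squeeze(-1)


def vqa_loss(answer_logits, type_logits, answer_targets, type_targets, aux_weight: float = 0.5):
    """Cross-entropy on the answer plus a weighted auxiliary answer-type loss."""
    return (
        F.cross_entropy(answer_logits, answer_targets)
        + aux_weight * F.cross_entropy(type_logits, type_targets)
    )
```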
