Implemented a CLIP-based approach to Visual Question Answering (VQA) on the VizWiz dataset. Split the data with stratified sampling on answer type and answerability, selected each question's target as the most common crowdsourced answer (breaking ties by Levenshtein distance), and encoded image-question pairs with a CLIP ViT-L/14@336px model using data augmentation. Trained a VQA model with an auxiliary answer-type loss alongside a separate answerability model, and evaluated with accuracy and answerability metrics. Achieved 42.0% accuracy and 82.8% answerability, demonstrating effectiveness at answering open-ended questions about images.
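The stratified split on answer type and answerability can be sketched as below; `stratified_split`, `key_fn`, and `val_frac` are illustrative names, not taken from the repository:

```python
import random
from collections import defaultdict

def stratified_split(samples, key_fn, val_frac=0.2, seed=0):
    # Group sample indices by stratum, e.g. (answer_type, answerable).
    groups = defaultdict(list)
    for i, s in enumerate(samples):
        groups[key_fn(s)].append(i)
    rng = random.Random(seed)
    train, val = [], []
    # Split each stratum separately so both splits keep the same
    # proportions of answer types and answerability labels.
    for idxs in groups.values():
        rng.shuffle(idxs)
        n_val = round(len(idxs) * val_frac)
        val.extend(idxs[:n_val])
        train.extend(idxs[n_val:])
    return sorted(train), sorted(val)
```

In practice the same effect can be obtained with scikit-learn's `train_test_split` by passing the combined (answer type, answerability) label as `stratify`.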
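Selecting the target answer with a Levenshtein tie-break might look like the following minimal sketch; the exact tie-break rule is assumed here to be the candidate with the lowest total edit distance to all annotator answers:

```python
from collections import Counter

def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def most_common_answer(answers: list[str]) -> str:
    # Candidates are the answers tied at the highest frequency.
    counts = Counter(answers)
    top = max(counts.values())
    candidates = [a for a, c in counts.items() if c == top]
    if len(candidates) == 1:
        return candidates[0]
    # Assumed tie-break: smallest total Levenshtein distance
    # to the full set of annotator answers.
    return min(candidates, key=lambda c: sum(levenshtein(c, a) for a in answers))
```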
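The accuracy figure is presumably computed with the VQA evaluation metric used by the VizWiz challenge, which credits a prediction by how many annotators gave the same answer, capped at three; a common simplified form of that metric is:

```python
def vqa_accuracy(pred: str, gt_answers: list[str]) -> float:
    # Simplified VQA metric: a prediction is fully correct when at
    # least 3 annotators gave the same answer; partial credit below that.
    # (The official metric additionally averages over annotator subsets.)
    pred = pred.strip().lower()
    matches = sum(a.strip().lower() == pred for a in gt_answers)
    return min(matches / 3.0, 1.0)
```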
AhmedDusuki/CLIP_VizWiz_Question_Answering