From d44d0eab83d8c278bf9ed674823fed0c3b1ef9ab Mon Sep 17 00:00:00 2001
From: shubhamprshr-tamu
Date: Tue, 6 Feb 2024 20:56:54 -0600
Subject: [PATCH] push

---
 index.html | 50 +++++++++++++++++++++----------------------------
 1 file changed, 21 insertions(+), 29 deletions(-)

diff --git a/index.html b/index.html
index c3ce7d7..0401166 100644
--- a/index.html
+++ b/index.html
@@ -66,13 +66,14 @@

 The Neglected Tails of Vision Language Models
 Xiangjue Dong1,
+Yanan Li4,
-Deva Ramanan2
+Deva Ramanan2,
-James Caverlee1
+James Caverlee1,
 Shu Kong1,3
@@ -137,32 +138,23 @@

The Neglected Tails of Vision Language Models

Abstract

-Vision-language models (VLMs) such as CLIP excel in zero-shot recognition but
-exhibit drastically imbalanced performance across visual concepts in downstream
-tasks. Despite CLIP's high zero-shot accuracy on ImageNet (72.7%), it performs
-poorly (<10%) on certain concepts like night snake and gyromitra, likely because
-they are underrepresented in VLMs' pretraining datasets. Assessing this
-imbalance is difficult because concept frequencies in large-scale pretraining
-data are hard to calculate.

-  • Estimating Concept Frequency Using Large Language Models (LLMs):
-    We use an LLM to count the pretraining texts that contain synonyms of a
-    given concept and to resolve ambiguous cases, confirming that popular VLM
-    pretraining datasets such as LAION exhibit a long-tailed concept
-    distribution.
-  • Long-Tailed Behaviors of All Mainstream VLMs: VLMs (CLIP, OpenCLIP,
-    MetaCLIP), visual chatbots (GPT-4V, LLaVA), and text-to-image models
-    (DALL-E 3, SD-XL) struggle to recognize and generate the rare concepts
-    identified by our method.
-  • REtrieval-Augmented Learning (REAL) Achieves SOTA Zero-Shot Performance:
-    We propose two solutions that boost zero-shot performance on both tail and
-    head classes without leveraging any downstream data.
-      • REAL-Prompt: We prompt VLMs with the most frequent synonym of a
-        downstream concept (e.g., "ATM" instead of "cash machine"). This simple
-        change outperforms other ChatGPT-based prompting methods such as DCLIP
-        and CuPL.
-      • REAL-Linear: REAL-Linear retrieves a small, class-balanced set of
-        pretraining data from LAION to train a robust linear classifier,
-        surpassing the recent state-of-the-art REACT while using 400x less
-        storage and 10,000x less training time!
+Vision-language models (VLMs) excel in zero-shot recognition but their
+performance varies greatly across different visual concepts. For example,
+although CLIP achieves impressive accuracy on ImageNet (60-80%), its
+performance drops below 10% for more than ten concepts like night snake,
+presumably due to their limited presence in the pretraining data. However,
+measuring the frequency of concepts in VLMs' large-scale datasets is
+challenging. We address this by using large language models (LLMs) to count
+the number of pretraining texts that contain synonyms of these concepts. Our
+analysis confirms that popular datasets, such as LAION, exhibit a long-tailed
+concept distribution, yielding biased performance in VLMs. We also find that
+downstream applications of VLMs, including visual chatbots (e.g., GPT-4V) and
+text-to-image models (e.g., Stable Diffusion), often fail to recognize or
+generate images of rare concepts identified by our method. To mitigate the
+imbalanced performance of zero-shot VLMs, we propose REtrieval-Augmented
+Learning (REAL). First, instead of prompting VLMs using the original class
+names, REAL uses their most frequent synonyms found in pretraining texts. This
+simple change already outperforms costly human-engineered and LLM-enriched
+prompts over nine benchmark datasets. Second, REAL trains a linear classifier
+on a small yet balanced set of pretraining data retrieved using concept
+synonyms. REAL surpasses the previous zero-shot SOTA, using 400× less storage
+and 10,000× less training time!
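The abstract above estimates concept frequency by counting pretraining captions that mention any synonym of a concept, with an LLM resolving ambiguous cases. A minimal sketch of just the counting step is below; the caption file path, the synonym lists, and the `count_concept_frequency` helper are illustrative assumptions, and the LLM-based disambiguation is omitted.

```python
# Minimal sketch: count how many pretraining captions mention any synonym of a
# concept. Inputs are hypothetical: a text file with one caption per line and a
# hand-made synonym dictionary. The LLM-based disambiguation step is omitted.
import re

synonyms = {  # illustrative synonym lists (lowercase)
    "cash machine": ["cash machine", "atm", "cash dispenser"],
    "night snake": ["night snake", "hypsiglena torquata"],
}

def count_concept_frequency(caption_file, synonyms):
    """Return, per concept, the number of captions containing any synonym."""
    patterns = {
        concept: re.compile(r"\b(?:" + "|".join(map(re.escape, syns)) + r")\b")
        for concept, syns in synonyms.items()
    }
    counts = {concept: 0 for concept in synonyms}
    with open(caption_file, encoding="utf-8") as f:
        for line in f:
            caption = line.lower()
            for concept, pattern in patterns.items():
                if pattern.search(caption):
                    counts[concept] += 1
    return counts

print(count_concept_frequency("laion_captions.txt", synonyms))  # hypothetical file
```

In a long-tailed dataset, such counts would show a few concepts matched by millions of captions and many concepts matched by only a handful.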
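REAL-Prompt, as summarized above, keeps the usual zero-shot pipeline and only replaces each class name with its most frequent synonym found in pretraining text. A rough sketch using the open-source open_clip package follows; the model tag, the synonym mapping, and the "a photo of a ..." template are assumptions for illustration, not the paper's exact configuration.

```python
# Rough sketch of REAL-Prompt-style zero-shot classification with open_clip:
# build the text classifier from each class's most frequent synonym rather than
# the original class name. Model tag, synonyms, and template are illustrative.
import torch
import open_clip

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k")
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()

# original class name -> assumed most frequent synonym in pretraining captions
most_frequent_synonym = {"cash machine": "ATM", "night snake": "night snake"}
class_names = list(most_frequent_synonym)

prompts = [f"a photo of a {syn}" for syn in most_frequent_synonym.values()]
with torch.no_grad():
    text_features = model.encode_text(tokenizer(prompts))
    text_features /= text_features.norm(dim=-1, keepdim=True)

def classify(pil_image):
    """Return the predicted original class name for a PIL image."""
    with torch.no_grad():
        image_features = model.encode_image(preprocess(pil_image).unsqueeze(0))
        image_features /= image_features.norm(dim=-1, keepdim=True)
        probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)
    return class_names[probs.argmax().item()]
```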
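REAL-Linear, also summarized above, retrieves a small, class-balanced set of pretraining images whose captions match the concept synonyms and trains a linear classifier on their features. The sketch below assumes the retrieved images have already been downloaded into per-class folders (retrieved/<class_name>/...), and uses a plain scikit-learn logistic-regression probe on CLIP image features; this is an illustrative stand-in, not the paper's exact training recipe.

```python
# Illustrative sketch of the REAL-Linear training step: fit a linear classifier
# on CLIP features of a small, class-balanced set of retrieved pretraining
# images. Assumed layout: retrieved/<class_name>/*.jpg (hypothetical).
from pathlib import Path
import torch
import open_clip
from PIL import Image
from sklearn.linear_model import LogisticRegression

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k")
model.eval()

def encode_image(path):
    """Return an L2-normalized CLIP feature vector for one image file."""
    with torch.no_grad():
        feat = model.encode_image(
            preprocess(Image.open(path).convert("RGB")).unsqueeze(0))
        feat /= feat.norm(dim=-1, keepdim=True)
    return feat.squeeze(0).numpy()

features, labels, class_names = [], [], []
for class_dir in sorted(p for p in Path("retrieved").iterdir() if p.is_dir()):
    class_names.append(class_dir.name)
    for image_path in sorted(class_dir.glob("*.jpg"))[:100]:  # small, balanced subset
        features.append(encode_image(image_path))
        labels.append(len(class_names) - 1)

linear_head = LogisticRegression(max_iter=1000).fit(features, labels)
# Inference: predicted = class_names[linear_head.predict([encode_image("x.jpg")])[0]]
```

Because only a few hundred retrieved images per class are embedded once and passed to a linear solver, storage and training cost stay small, which is the point of the 400x storage and 10,000x training-time comparison quoted above.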
@misc{parashar2024neglected, 
         title={The Neglected Tails of Vision Language Models}, 
         author={Shubham Parashar and Zhiqiu Lin and Tian Liu and Xiangjue Dong and Yanan Li and Deva Ramanan and James Caverlee and Shu Kong},
         year={2024},