Commit
push
shubhamprshr27 committed Feb 7, 2024
1 parent 338ac54 commit d44d0ea
Showing 1 changed file with 21 additions and 29 deletions.
index.html: 50 changes (21 additions & 29 deletions)
@@ -66,13 +66,14 @@ <h1 class="title is-1 publication-title">The Neglected Tails of Vision Language
 <a target="_blank">Xiangjue Dong</a><sup>1</sup>,</span>
 <!-- <span class="author-block">
 <a target="_blank">Tiffany Ling</a><sup>2</sup>,</span> -->
+<br />
 <span class="author-block">
 <a target="_blank">Yanan Li</a><sup>4</sup>,</span>
 <span class="author-block">
-<a target="_blank">Deva Ramanan</a><sup>2</sup>
+<a target="_blank">Deva Ramanan</a><sup>2</sup>,</span>
 </span>
 <span class="author-block">
-<a target="_blank">James Caverlee</a><sup>1</sup>
+<a target="_blank">James Caverlee</a><sup>1</sup>,</span>
 </span>
 <span class="author-block">
 <a target="_blank">Shu Kong</a><sup>1</sup><sup>,</sup><sup>3</sup>
@@ -137,32 +138,23 @@ <h1 class="title is-1 publication-title">The Neglected Tails of Vision Language
 <h2 class="title is-3">Abstract</h2>
 <div class="content has-text-justified">
 <p style="text-align: justify;">
-Vision-language models (VLMs) such as CLIP excel in zero-shot recognition but
-exhibit drastically imbalanced performance across visual concepts in downstream
-tasks. Despite high zero-shot accuracy on ImageNet (72.7%), CLIP performs poorly (&lt;10%) on certain concepts like night snake and gyromitra,
-likely due to underrepresentation in VLM's pretraining datasets. Assessing this imbalance is complex due to the difficulty in calculating concept frequency in large-scale pretraining data.
-</p>
-<ul>
-<li>
-<strong>Estimating Concept Frequency Using Large Language Models (LLMs):</strong>
-We use an LLM to help count relevant texts that contain synonyms of the given concepts and resolve ambiguous cases,
-confirming that popular VLM datasets like LAION exhibit a long-tailed
-concept distribution.
-</li>
-<li>
-<strong>Long-tailed Behaviors Of All Mainstream VLMs:</strong> VLMs (CLIP, OpenCLIP, MetaCLIP), visual chatbots (<a href="https://openai.com/research/gpt-4v-system-card">GPT-4V</a>, <a href="https://llava.hliu.cc/">LLaVA</a>), and text-to-image models (<a href="https://openai.com/dall-e-3">DALL-E 3</a>, <a href="https://stablediffusionweb.com/">SD-XL</a>)
-struggle with recognizing and generating rare concepts identified by our method.
-</li>
-<li>
-<strong>REtrieval Augmented Learning (REAL) Achieves SOTA Zero-Shot Performance:</strong>
-We propose two solutions to boost zero-shot performance over both tail and head classes, without leveraging downstream data.
-<ul>
-<li><strong>REAL-Prompt:</strong> We prompt VLMs with the most frequent synonym of a downstream concept (e.g., "ATM" instead of "cash machine").
-This simple change outperforms other ChatGPT-based prompting methods such as DCLIP and CuPL.
-<li><strong>REAL-Linear:</strong> REAL-Linear retrieves a small, class-balanced set of pretraining data from LAION to train a robust linear classifier,
-surpassing recent state-of-the-art REACT, using <strong>400x</strong> less storage and <strong>10,000x</strong> less training time!</li>
-</ul>
-</li>
+Vision-language models (VLMs) excel in zero-shot
+recognition but their performance varies greatly across
+different visual concepts. For example, although CLIP
+achieves impressive accuracy on ImageNet (60-80%), its
+performance drops below 10% for more than ten concepts
+like night snake, presumably due to their limited presence in the pretraining data. However, measuring the frequency of concepts in VLMs’ large-scale datasets is challenging. We address this by using large language models
+(LLMs) to count the number of pretraining texts that contain synonyms of these concepts. Our analysis confirms
+that popular datasets, such as LAION, exhibit a long-tailed
+concept distribution, yielding biased performance in VLMs.
+We also find that downstream applications of VLMs, including visual chatbots (e.g., GPT-4V) and text-to-image models
+(e.g., Stable Diffusion), often fail to recognize or generate
+images of rare concepts identified by our method. To mitigate the imbalanced performance of zero-shot VLMs, we
+propose REtrieval-Augmented Learning (REAL). First, instead of prompting VLMs using the original class names,
+REAL uses their most frequent synonyms found in pretraining texts. This simple change already outperforms costly
+human-engineered and LLM-enriched prompts over nine
+benchmark datasets. Second, REAL trains a linear classifier on a small yet balanced set of pretraining data retrieved using concept synonyms. REAL surpasses the previous zero-shot SOTA, using 400× less storage and 10,000×
+less training time!
 </ul>
 <!-- In this work, we make the first attempt to measure
 the concept frequency in VLMs' pretraining data by analyzing pretraining
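
[Editor's note: a rough, unofficial sketch of the frequency-estimation step summarized in the new abstract above. It counts how many pretraining captions mention any synonym of a concept; a hypothetical resolve_ambiguity hook stands in for the LLM-based disambiguation described in the paper. The captions and synonym lists are toy placeholders, not data from LAION.]

# Illustrative sketch only, not the paper's released implementation.
# Counts pretraining captions that mention at least one synonym of each
# concept; LLM-based filtering of ambiguous matches is stubbed out.
import re
from collections import Counter
from typing import Dict, Iterable, List

def resolve_ambiguity(caption: str, concept: str) -> bool:
    """Hypothetical hook for LLM-based disambiguation; always accepts here."""
    return True

def concept_frequency(captions: Iterable[str],
                      synonyms: Dict[str, List[str]]) -> Counter:
    """Count captions containing at least one synonym of each concept."""
    patterns = {
        concept: re.compile(r"\b(" + "|".join(map(re.escape, syns)) + r")\b",
                            re.IGNORECASE)
        for concept, syns in synonyms.items()
    }
    counts = Counter()
    for caption in captions:
        for concept, pattern in patterns.items():
            if pattern.search(caption) and resolve_ambiguity(caption, concept):
                counts[concept] += 1
    return counts

if __name__ == "__main__":
    # Toy captions and synonym lists, purely for illustration.
    toy_captions = [
        "an ATM machine outside a bank",
        "a cash machine on the street corner",
        "a night snake coiled on a rock",
    ]
    toy_synonyms = {
        "cash machine": ["cash machine", "ATM", "automated teller machine"],
        "night snake": ["night snake"],
    }
    print(concept_frequency(toy_captions, toy_synonyms))
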
@@ -775,7 +767,7 @@ <h2 class="title is-4">Benchmarking REAL</h2>
 <section class="section hero is-light" id="BibTeX">
 <div class="container is-max-desktop content ">
 <h2 class="title">BibTeX</h2>
-<pre><code>@misc{parashar2023tailvlm, <!--Fill once available-->
+<pre><code>@misc{parashar2024neglected, <!--Fill once available-->
 title={The Neglected Tails of Vision Language Models.},
 author={Shubham Parashar and Zhiqiu Lin and Tian Liu and Xiangjue Dong and Yanan Li and Deva Ramanan and James Caverlee and Shu Kong},
 year={2023}, <!--Fill once available-->
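
[Editor's note: as a companion to the REAL-Prompt description in the updated abstract, here is a minimal zero-shot classification sketch written against OpenAI's clip package, in which each class is prompted with its most frequent synonym rather than its original name. The synonym table and image path are illustrative assumptions; this is not the authors' released implementation.]

# Minimal zero-shot sketch in the spirit of REAL-Prompt: build text prompts
# from each class's most frequent pretraining synonym (e.g., "ATM" instead
# of "cash machine"). Requires: pip install torch git+https://github.com/openai/CLIP.git
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Original class names mapped to illustrative "most frequent" synonyms.
most_frequent_synonym = {
    "cash machine": "ATM",
    "night snake": "night snake",
    "gyromitra": "false morel",
}

class_names = list(most_frequent_synonym.keys())
prompts = [f"a photo of a {most_frequent_synonym[c]}" for c in class_names]
text_tokens = clip.tokenize(prompts).to(device)

# Placeholder image path for the example.
image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text_tokens)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

for name, p in zip(class_names, probs[0].tolist()):
    print(f"{name}: {p:.3f}")
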
