Commit
push
shubhamprshr27 committed Feb 7, 2024
1 parent 338ac54 commit d44d0ea
Showing 1 changed file with 21 additions and 29 deletions.
index.html: 50 changes (21 additions & 29 deletions)
@@ -66,13 +66,14 @@ <h1 class="title is-1 publication-title">The Neglected Tails of Vision Language
 <a target="_blank">Xiangjue Dong</a><sup>1</sup>,</span>
 <!-- <span class="author-block">
 <a target="_blank">Tiffany Ling</a><sup>2</sup>,</span> -->
+<br />
 <span class="author-block">
 <a target="_blank">Yanan Li</a><sup>4</sup>,</span>
 <span class="author-block">
-<a target="_blank">Deva Ramanan</a><sup>2</sup>
+<a target="_blank">Deva Ramanan</a><sup>2</sup>,</span>
 </span>
 <span class="author-block">
-<a target="_blank">James Caverlee</a><sup>1</sup>
+<a target="_blank">James Caverlee</a><sup>1</sup>,</span>
 </span>
 <span class="author-block">
 <a target="_blank">Shu Kong</a><sup>1</sup><sup>,</sup><sup>3</sup>
@@ -137,32 +138,23 @@ <h1 class="title is-1 publication-title">The Neglected Tails of Vision Language
 <h2 class="title is-3">Abstract</h2>
 <div class="content has-text-justified">
 <p style="text-align: justify;">
-Vision-language models (VLMs) such as CLIP excel in zero-shot recognition but
-exhibit drastically imbalanced performance across visual concepts in downstream
-tasks. Despite high zero-shot accuracy on ImageNet (72.7%), CLIP performs poorly (&lt;10%) on certain concepts like night snake and gyromitra,
-likely due to underrepresentation in VLM's pretraining datasets. Assessing this imbalance is complex due to the difficulty in calculating concept frequency in large-scale pretraining data.
-</p>
-<ul>
-<li>
-<strong>Estimating Concept Frequency Using Large Language Models (LLMs):</strong>
-We use an LLM to help count relevant texts that contain synonyms of the given concepts and resolve ambiguous cases,
-confirming that popular VLM datasets like LAION exhibit a long-tailed
-concept distribution.
-</li>
-<li>
-<strong>Long-tailed Behaviors Of All Mainstream VLMs:</strong> VLMs (CLIP, OpenCLIP, MetaCLIP), visual chatbots (<a href="https://openai.com/research/gpt-4v-system-card">GPT-4V</a>, <a href="https://llava.hliu.cc/">LLaVA</a>), and text-to-image models (<a href="https://openai.com/dall-e-3">DALL-E 3</a>, <a href="https://stablediffusionweb.com/">SD-XL</a>)
-struggle with recognizing and generating rare concepts identified by our method.
-</li>
-<li>
-<strong>REtrieval Augmented Learning (REAL) Achieves SOTA Zero-Shot Performance:</strong>
-We propose two solutions to boost zero-shot performance over both tail and head classes, without leveraging downstream data.
-<ul>
-<li><strong>REAL-Prompt:</strong> We prompt VLMs with the most frequent synonym of a downstream concept (e.g., "ATM" instead of "cash machine").
-This simple change outperforms other ChatGPT-based prompting methods such as DCLIP and CuPL.
-<li><strong>REAL-Linear:</strong> REAL-Linear retrieves a small, class-balanced set of pretraining data from LAION to train a robust linear classifier,
-surpassing recent state-of-the-art REACT, using <strong>400x</strong> less storage and <strong>10,000x</strong> less training time!</li>
-</ul>
-</li>
+Vision-language models (VLMs) excel in zero-shot
+recognition but their performance varies greatly across
+different visual concepts. For example, although CLIP
+achieves impressive accuracy on ImageNet (60-80%), its
+performance drops below 10% for more than ten concepts
+like night snake, presumably due to their limited presence in the pretraining data. However, measuring the frequency of concepts in VLMs’ large-scale datasets is challenging. We address this by using large language models
+(LLMs) to count the number of pretraining texts that contain synonyms of these concepts. Our analysis confirms
+that popular datasets, such as LAION, exhibit a long-tailed
+concept distribution, yielding biased performance in VLMs.
+We also find that downstream applications of VLMs, including visual chatbots (e.g., GPT-4V) and text-to-image models
+(e.g., Stable Diffusion), often fail to recognize or generate
+images of rare concepts identified by our method. To mitigate the imbalanced performance of zero-shot VLMs, we
+propose REtrieval-Augmented Learning (REAL). First, instead of prompting VLMs using the original class names,
+REAL uses their most frequent synonyms found in pretraining texts. This simple change already outperforms costly
+human-engineered and LLM-enriched prompts over nine
+benchmark datasets. Second, REAL trains a linear classifier on a small yet balanced set of pretraining data retrieved using concept synonyms. REAL surpasses the previous zero-shot SOTA, using 400× less storage and 10,000×
+less training time!
 </ul>
 <!-- In this work, we make the first attempt to measure
 the concept frequency in VLMs' pretraining data by analyzing pretraining
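
[Editor's note: a rough, unofficial sketch of the frequency-estimation step summarized in the new abstract above. It counts how many pretraining captions mention any synonym of a concept; a hypothetical resolve_ambiguity hook stands in for the LLM-based disambiguation described in the paper. The captions and synonym lists are toy placeholders, not data from LAION.]

# Illustrative sketch only, not the paper's released implementation.
# Counts pretraining captions that mention at least one synonym of each
# concept; LLM-based filtering of ambiguous matches is stubbed out.
import re
from collections import Counter
from typing import Dict, Iterable, List

def resolve_ambiguity(caption: str, concept: str) -> bool:
    """Hypothetical hook for LLM-based disambiguation; always accepts here."""
    return True

def concept_frequency(captions: Iterable[str],
                      synonyms: Dict[str, List[str]]) -> Counter:
    """Count captions containing at least one synonym of each concept."""
    patterns = {
        concept: re.compile(r"\b(" + "|".join(map(re.escape, syns)) + r")\b",
                            re.IGNORECASE)
        for concept, syns in synonyms.items()
    }
    counts = Counter()
    for caption in captions:
        for concept, pattern in patterns.items():
            if pattern.search(caption) and resolve_ambiguity(caption, concept):
                counts[concept] += 1
    return counts

if __name__ == "__main__":
    # Toy captions and synonym lists, purely for illustration.
    toy_captions = [
        "an ATM machine outside a bank",
        "a cash machine on the street corner",
        "a night snake coiled on a rock",
    ]
    toy_synonyms = {
        "cash machine": ["cash machine", "ATM", "automated teller machine"],
        "night snake": ["night snake"],
    }
    print(concept_frequency(toy_captions, toy_synonyms))
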
@@ -775,7 +767,7 @@ <h2 class="title is-4">Benchmarking REAL</h2>
 <section class="section hero is-light" id="BibTeX">
 <div class="container is-max-desktop content ">
 <h2 class="title">BibTeX</h2>
-<pre><code>@misc{parashar2023tailvlm, <!--Fill once available-->
+<pre><code>@misc{parashar2024neglected, <!--Fill once available-->
 title={The Neglected Tails of Vision Language Models.},
 author={Shubham Parashar and Zhiqiu Lin and Tian Liu and Xiangjue Dong and Yanan Li and Deva Ramanan and James Caverlee and Shu Kong},
 year={2023}, <!--Fill once available-->
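
[Editor's note: as a companion to the REAL-Prompt description in the updated abstract, here is a minimal zero-shot classification sketch written against OpenAI's clip package, in which each class is prompted with its most frequent synonym rather than its original name. The synonym table and image path are illustrative assumptions; this is not the authors' released implementation.]

# Minimal zero-shot sketch in the spirit of REAL-Prompt: build text prompts
# from each class's most frequent pretraining synonym (e.g., "ATM" instead
# of "cash machine"). Requires: pip install torch git+https://github.com/openai/CLIP.git
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Original class names mapped to illustrative "most frequent" synonyms.
most_frequent_synonym = {
    "cash machine": "ATM",
    "night snake": "night snake",
    "gyromitra": "false morel",
}

class_names = list(most_frequent_synonym.keys())
prompts = [f"a photo of a {most_frequent_synonym[c]}" for c in class_names]
text_tokens = clip.tokenize(prompts).to(device)

# Placeholder image path for the example.
image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text_tokens)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

for name, p in zip(class_names, probs[0].tolist()):
    print(f"{name}: {p:.3f}")
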
