

fix wording and add additional clip fine tune reference
ethen8181 committed Jan 4, 2024
1 parent 7a5067e commit 004f52d
Showing 2 changed files with 42 additions and 20 deletions.
30 changes: 21 additions & 9 deletions deep_learning/contrastive/clip/clip.html
@@ -13638,7 +13638,7 @@ <h2 id="Dataset">Dataset<a class="anchor-link" href="#Dataset">&#182;</a></h2>
<span class="n">names</span><span class="o">=</span><span class="p">[</span><span class="s2">&quot;image&quot;</span><span class="p">,</span> <span class="s2">&quot;caption_number&quot;</span><span class="p">,</span> <span class="s2">&quot;caption&quot;</span><span class="p">]</span>
<span class="p">)</span>
<span class="c1"># indicate these are labeled pairs, useful for calculating</span>
<span class="c1"># offline evaluatoin metrics</span>
<span class="c1"># offline evaluation metrics</span>
<span class="n">df</span><span class="p">[</span><span class="s2">&quot;label&quot;</span><span class="p">]</span> <span class="o">=</span> <span class="mf">1.0</span>

<span class="c1"># remove extra white space up front</span>
@@ -14153,7 +14153,7 @@ <h2 id="CLIP-Model">CLIP Model<a class="anchor-link" href="#CLIP-Model">&#182;</
<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
</div><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>CLIP model comprises of three components <a href="https://openai.com/blog/clip/">[6]</a>: image encoder, text encoder and projection head (absorbed inside encoder block in the diagram).</p>
<p>CLIP model comprises of three components <a href="https://openai.com/blog/clip/">[6]</a> <a href="https://arxiv.org/abs/2103.00020">[8]</a>: image encoder, text encoder and projection head (absorbed inside encoder block in the diagram).</p>
<img src="imgs/clip_contrastive_pretraining.png" width="50%" height="50%">
<p>During training we'll need to feed our batches of text and images through their respective encoders. Given a batch of $n$ text, image pairs, $\text{text}_i, \text{image}_i$, CLIP is trained to predict which of the $n × n$ possible (image, text) pairings across the batch actually occurred. To do this, CLIP learns a multi-modal embedding space by jointly training an image encoder and text encoder to maximize the cosine similarity of the $n$ real image and text embedding pairs in a given batch while minimizing the cosine similarity of the $n^2 − n$ incorrect image and text embedding pairings. This is commonly referred to as the InfoNCE loss.</p>
\begin{align}
@@ -14166,16 +14166,23 @@ <h2 id="CLIP-Model">CLIP Model<a class="anchor-link" href="#CLIP-Model">&#182;</
</ul>
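To make the contrastive objective described above concrete, here is a minimal sketch of a symmetric InfoNCE loss over a batch of already-projected image and text embeddings. The function name, shapes and fixed temperature are illustrative assumptions, not this notebook's implementation; the actual CLIP training additionally shards the similarity computation across GPUs and learns the temperature.

```python
# A minimal, self-contained sketch of the symmetric contrastive (InfoNCE) loss.
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_embed: torch.Tensor, text_embed: torch.Tensor, temperature: float = 0.07):
    """image_embed, text_embed: [batch_size, embed_dim] outputs of the two projection heads."""
    # l2 normalize so the dot products below equal cosine similarities
    image_embed = F.normalize(image_embed, dim=-1)
    text_embed = F.normalize(text_embed, dim=-1)

    # [batch_size, batch_size] pairwise similarity matrix, scaled by temperature
    logits = image_embed @ text_embed.T / temperature

    # the matched pair for row i is column i; every other entry in the row is a negative
    labels = torch.arange(logits.shape[0], device=logits.device)
    loss_i2t = F.cross_entropy(logits, labels)     # image -> text direction
    loss_t2i = F.cross_entropy(logits.T, labels)   # text -> image direction
    return (loss_i2t + loss_t2i) / 2
```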
<p>Some of the other key learnings from this work include:</p>
<ul>
<li>Data: One of key ingredients of pre-training is large scale data. CLIP collected a new dataset comprised of 400 million image text pairs from public internet.</li>
<li>Objective: Choosing proxy task that is training efficient was to key to scaling learning image representations via natural language supervision. As illustrated in this document, CLIP chose a two tower contrastive approach of predicting which text as a whole is paired with which image, instead of predictive objective such as predicting exact words of that caption or generative models.</li>
<li>Encoder: We can always experiment with different encoder architectures, authors reported a 3x gain in compute efficiency by adopting vision transformer over a standard ResNet for the image encoder, and found that the model is less sensitive to the text encoder's capacity. They also reported using a higher 336 pixel resolution for images.</li>
<li>Training recipe:<ul>
<li>Data: One key ingredient of pre-training is large scale data. CLIP collected a new dataset of 400 million image-text pairs from the public internet.</li>
<li>Objective: Choosing a training-efficient proxy task was also key to scaling up learning image representations via natural language supervision. As illustrated in this document, CLIP chose a two-tower contrastive learning approach that aligns which text as a whole is paired with which image, instead of a predictive objective such as predicting the exact words of the caption, or a generative model.</li>
<li>Encoder: We can always experiment with different encoder architectures; the authors reported a 3x gain in compute efficiency by adopting a vision transformer over a standard ResNet for the image encoder, and found that the model is less sensitive to the text encoder's capacity. They also reported using a higher 336 pixel image resolution.</li>
<li>Training Recipe:<ul>
<li>One important thing to note is that their contrastive loss uses a very large minibatch size of 32,768, and the calculation of embedding similarities is sharded across individual GPUs.</li>
<li>Their largest Vision Transformer took 12 days on 256 V100 GPUs.</li>
<li>They train the CLIP model completely from scratch, without initializing the image or text encoder with pre-trained weights.</li>
</ul>
</li>
<li>Zero shot capabilties: Given CLIP leverages natural langauge supervision, this enables far stronger generalization and zero shot capabilities. e.g. Given a task of classifying photos of objects, we can check each image whether CLIP predicts which of the caption &quot;a photo of a dog&quot; or &quot;a photo of a car&quot;, etc. is more likely to be paired with it (depicted in the diagram below). We can imagine swapping out the dog and car part with any other class in our prompt making this applicable to potentially arbitrary classification tasks. Caveat: this may require trail and error &quot;prompt engineering&quot; to work well, and still has poor generalization to images not covered in its pre-training dataset.</li>
<li>Zero Shot Capabilities: Given that CLIP leverages natural language supervision, it enables far stronger generalization and zero shot capabilities. e.g. Given the task of classifying photos of objects, we can check for each image which of the captions &quot;a photo of a dog&quot;, &quot;a photo of a car&quot;, etc. CLIP predicts is more likely to be paired with it (depicted in the diagram below). We can imagine swapping out the dog and car part of the prompt with any other class, making this applicable to potentially arbitrary classification tasks. Caveat: this may require trial and error &quot;prompt engineering&quot; to work well, and it still generalizes poorly to images not covered in the pre-training dataset.</li>
<li>Transfer Learning: CLIP's vision encoder, which is trained on noisy image-text pairs from the web, also offers very solid fine-tuning performance on image classification tasks with the right choice of hyperparameters <a href="https://arxiv.org/abs/2212.06138">[11]</a>:<ul>
<li>Smaller learning rate.</li>
<li>Exponential moving average: keeping a moving average of all model parameters' weights.</li>
<li>Layer wise learning rate decay: setting a different learning rate for each backbone layer. Top layers get a higher learning rate to adapt to the new task, while bottom layers get a smaller learning rate so the strong features learned during pre-training are preserved (a minimal sketch follows this list).</li>
<li>Data Augmentation: Removing strong random augmentations such as mixup and cutmix.</li>
</ul>
</li>
</ul>
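To make the layer wise learning rate decay bullet above concrete, here is a hypothetical sketch. The <code>model.blocks</code> / <code>model.head</code> attribute names, the base learning rate and the decay factor are all assumptions for illustration, not this notebook's or the referenced paper's exact implementation.

```python
# A hypothetical sketch of layer wise learning rate decay: blocks closer to the input
# get exponentially smaller learning rates than blocks close to the task head.
def layer_wise_lr_groups(model, base_lr=1e-4, decay=0.65):
    num_layers = len(model.blocks)          # assumed: backbone blocks stored in model.blocks
    param_groups = []
    for i, block in enumerate(model.blocks):
        # top block (i = num_layers - 1) keeps base_lr, lower blocks get decayed lr
        scale = decay ** (num_layers - 1 - i)
        param_groups.append({"params": block.parameters(), "lr": base_lr * scale})
    # task specific classification head always uses the full base learning rate
    param_groups.append({"params": model.head.parameters(), "lr": base_lr})
    return param_groups

# usage (assumed optimizer settings):
# optimizer = torch.optim.AdamW(layer_wise_lr_groups(model), weight_decay=0.05)
```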
<img src="imgs/clip_zero_shot.png" width="60%" height="60%">
<p>Apart from CLIP, we'll also use this opportunity to introduce LiT, a potentially more efficient way of training text-image models with contrastive learning, as well as ViT, the image encoder that we'll be using.</p>
@@ -14217,7 +14224,7 @@ <h3 id="ViT">ViT<a class="anchor-link" href="#ViT">&#182;</a></h3>
<p>The Transformer/BERT style model was originally proposed in the natural language domain, and quickly became the de facto standard model architecture. Its reach into the computer vision field came much later, where vision transformers (ViT) <a href="https://arxiv.org/abs/2010.11929">[10]</a> showed that a pure transformer applied to a sequence of image patches is capable of achieving remarkable results for computer vision tasks. We'll elaborate upon its architecture and performance.</p>
<p>Architecture:</p>
<img src="imgs/vit.png" width="60%" height="60%">
<p>The main modification ViT made was show images are fed to a Transformer. Compared to natural language domain where we first tokenized input text before feeding these tokenized ids through our transformer module, for image, we would convert an image into square sized non-overlapping spatches, each of which gets turned into a vector/patch embedding. In the architecture diagram above, this is referred to as linear projection, and in practice these patch embedding are often times generated via convolutaionl 2D layer. e.g. If we have a 224x224 pixel images, we would end put with a suquence of 196 16x16 flattened image patches. This is why in public pre-trained models, e.g. <code>google/vit-base-patch16-224-in21k</code>, we'll see information such as <code>patch16-224</code> indicating the number of patches as well as image resolution in which it was pre-trained on. Another example is <code>ViT-B/16</code> indicating it's a based model trained on 16x16 input patch size. Reason behind this patching is directly applying transformer's self attention to image would require each pixel attending to every other pixel. Given self attention quadratic cost, this does not scale to realistic input sizes.
<p>The main modification ViT made is how images are fed to a Transformer. Compared to the natural language domain, where we first tokenize input text before feeding the token ids through our transformer module, for images we convert an image into square-sized, non-overlapping patches, each of which gets turned into a vector/patch embedding. In the architecture diagram above, this is referred to as linear projection, and in practice these patch embeddings are often generated via a 2D convolutional layer. e.g. If we have a 224x224 pixel image, we would end up with a sequence of 196 flattened 16x16 image patches. This is why in public pre-trained models, e.g. <code>google/vit-base-patch16-224-in21k</code>, we'll see information such as <code>patch16-224</code> indicating the patch size as well as the image resolution it was pre-trained on. Another example is <code>ViT-B/16</code>, indicating it's a base model trained with a 16x16 input patch size. The reason behind this patching is that directly applying the transformer's self attention to an image would require each pixel to attend to every other pixel; given self attention's quadratic cost, this does not scale to realistic input sizes.
After this preprocessing, a special <code>[CLS]</code> token is added to the beginning of the patch embedding sequence, which can be used as the embedding input for downstream tasks, along with a learnable position embedding. Both of these are similar to BERT.</p>
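To make the patching step concrete, here is a minimal sketch of turning a 224x224 image into a sequence of 196 patch embeddings plus a <code>[CLS]</code> token. Shapes follow the ViT-B/16 example above, but the code is an illustration rather than the encoder used later in this notebook.

```python
# A minimal sketch of ViT-style patch embedding.
import torch
import torch.nn as nn

batch_size, channels, height, width = 2, 3, 224, 224
patch_size, embed_dim = 16, 768

# a conv with kernel_size = stride = patch_size is equivalent to slicing the image into
# non-overlapping 16x16 patches and applying the same linear projection to each patch
patch_embed = nn.Conv2d(channels, embed_dim, kernel_size=patch_size, stride=patch_size)

images = torch.randn(batch_size, channels, height, width)
patches = patch_embed(images)                 # [2, 768, 14, 14]
patches = patches.flatten(2).transpose(1, 2)  # [2, 196, 768]: sequence of 196 patch embeddings

# in a real ViT these two are learnable nn.Parameter tensors
cls_token = torch.zeros(batch_size, 1, embed_dim)
position_embed = torch.zeros(1, 14 * 14 + 1, embed_dim)
tokens = torch.cat([cls_token, patches], dim=1) + position_embed  # [2, 197, 768]
```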
<p>Performance:</p>
<img src="imgs/vit_performance.png" width="40%" height="40%">
@@ -14234,6 +14241,11 @@ <h3 id="ViT">ViT<a class="anchor-link" href="#ViT">&#182;</a></h3>
<li>The experiment results reinforce our intuition that a convolutional inductive bias is useful for smaller datasets, but for larger ones, learning the relevant patterns directly from the global context is sufficient, even beneficial.</li>
<li>Note, different from BERT, which relied on self-supervised pre-training via masked language modeling (predicting masked tokens), the original ViT is still based on supervised pre-training.</li>
</ul>
<p>Other notable learnings at the time include:</p>
<ul>
<li>Unlike in the NLP domain, where self-supervised pre-training was employed, in the original ViT work the best results were still obtained via supervised pre-training.</li>
<li>Compared to pre-training, we can use a higher image resolution during fine-tuning. When doing so, 2D interpolation is needed to adjust the pre-trained positional embeddings (see the sketch after this list).</li>
</ul>
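As a rough illustration of that 2D interpolation step (an assumption-laden sketch, not the exact procedure from the ViT codebase), resizing the patch position embeddings might look like the following.

```python
# A sketch of adapting pre-trained positional embeddings to a higher fine-tuning resolution.
import torch
import torch.nn.functional as F

def resize_position_embedding(pos_embed: torch.Tensor, new_grid: int):
    """pos_embed: [1, 1 + old_grid**2, dim], where the first token is the [CLS] position."""
    cls_pos, patch_pos = pos_embed[:, :1], pos_embed[:, 1:]
    old_grid = int(patch_pos.shape[1] ** 0.5)
    dim = patch_pos.shape[-1]

    # reshape to a 2D grid, interpolate spatially, then flatten back into a sequence
    patch_pos = patch_pos.reshape(1, old_grid, old_grid, dim).permute(0, 3, 1, 2)
    patch_pos = F.interpolate(patch_pos, size=(new_grid, new_grid), mode="bicubic", align_corners=False)
    patch_pos = patch_pos.permute(0, 2, 3, 1).reshape(1, new_grid * new_grid, dim)
    return torch.cat([cls_pos, patch_pos], dim=1)

# e.g. going from 224px (14x14 patches) to 336px (21x21 patches) with a 16x16 patch size
new_pos_embed = resize_position_embedding(torch.randn(1, 1 + 14 * 14, 768), new_grid=21)
```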

</div>
</div>
@@ -15027,7 +15039,6 @@ <h2 id="Evaluation">Evaluation<a class="anchor-link" href="#Evaluation">&#182;</
<span class="sd"> ```</span>
<span class="sd"> &quot;&quot;&quot;</span>
<span class="n">recall_list</span> <span class="o">=</span> <span class="p">[]</span>
<span class="n">weighted_recall_list</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">k</span> <span class="ow">in</span> <span class="n">k_candidates</span><span class="p">:</span>
<span class="n">df_eval_input_at_k</span> <span class="o">=</span> <span class="p">(</span>
<span class="n">df_eval_input</span><span class="p">[</span><span class="n">df_eval_input</span><span class="p">[</span><span class="s2">&quot;rank&quot;</span><span class="p">]</span> <span class="o">&lt;=</span> <span class="n">k</span><span class="p">]</span>
@@ -15314,6 +15325,7 @@ <h1 id="Reference">Reference<a class="anchor-link" href="#Reference">&#182;</a><
<li><a href="https://arxiv.org/abs/2103.00020">[8]</a> Alec Radford, Jong Wook Kim, et. al - Learning Transferable Visual Models From Natural Language Supervision - 2021</li>
<li><a href="https://arxiv.org/abs/2111.07991">[9]</a> Xiaohua Zhai, Xiao Wang, Basil Mustafa, Andreas Steiner, Lucas Beyer, et al. - LiT: Zero-Shot Transfer with Locked-image text Tuning (2021)</li>
<li><a href="https://arxiv.org/abs/2010.11929">[10]</a> Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Neil Houlsby, et al. - An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (2020)</li>
<li><a href="https://arxiv.org/abs/2212.06138">[11]</a> Xiaoyi Dong, et al. - CLIP Itself is a Strong Fine-tuner: Achieving 85.7% and 88.0% Top-1 Accuracy with ViT-B and ViT-L on ImageNet (2022)</li>
</ul>

</div>

