code and description update

LancasterLi · Oct 31, 2024 · commit 41c763e
1 parent ac878fe
Showing 92 changed files with 8,924 additions and 35 deletions.
diff --git a/Figs/Ablation study of different module designs.png b/Figs/Ablation study of different module designs.png
diff --git a/Figs/Ablation study of different module designs2.png b/Figs/Ablation study of different module designs2.png
diff --git a/Figs/Ablation study of the key compoents of RefSAM.png b/Figs/Ablation study of the key compoents of RefSAM.png
diff --git a/Figs/AblationstudyofthekeycompoentsofRefSAM.png b/Figs/AblationstudyofthekeycompoentsofRefSAM.png
diff --git a/Figs/Inference speed of different models..png b/Figs/Inference speed of different models..png
diff --git a/Figs/Inferencespeedofdifferentmodels.png b/Figs/Inferencespeedofdifferentmodels.png
diff --git a/Figs/Influence of the model size of Visual Encoder.png b/Figs/Influence of the model size of Visual Encoder.png
diff --git a/Figs/Learning Rate of Cross Modal MLP.png b/Figs/Learning Rate of Cross Modal MLP.png
diff --git a/Figs/Learning Rate of Dense Attention Conv.png b/Figs/Learning Rate of Dense Attention Conv.png
diff --git a/Figs/Learning Rate of Mask Decoder.png b/Figs/Learning Rate of Mask Decoder.png
diff --git a/Figs/Linguistic Feature of Dense Attention.png b/Figs/Linguistic Feature of Dense Attention.png
diff --git a/Figs/Model Scale Analysis Experiment.png b/Figs/Model Scale Analysis Experiment.png
diff --git a/Figs/Number of learnable parameters of different Models.png b/Figs/Number of learnable parameters of different Models.png
diff --git a/Figs/Results on Ref-DAVIS17.png b/Figs/Results on Ref-DAVIS17.png
diff --git a/Figs/Results on Ref-Youtube-VOS.jpeg b/Figs/Results on Ref-Youtube-VOS.jpeg
diff --git a/Figs/Results on Ref-Youtube-VOS.png b/Figs/Results on Ref-Youtube-VOS.png
diff --git a/Figs/ResultsonRef-DAVIS17.png b/Figs/ResultsonRef-DAVIS17.png
diff --git a/Figs/The Number of Hidden Layer in Cross Modal MLP.png b/Figs/The Number of Hidden Layer in Cross Modal MLP.png
diff --git a/...e influence of different learning rates for the learnable modules of RefSAM.png b/...e influence of different learning rates for the learnable modules of RefSAM.png
diff --git a/Figs/overall_network.jpg b/Figs/overall_network.jpg
diff --git a/Figs/overall_network.png b/Figs/overall_network.png
diff --git a/README.md b/README.md
@@ -2,65 +2,43 @@
 <p align="center">
 <a href="https://arxiv.org/abs/2307.00997"><img src="https://img.shields.io/badge/arXiv-Paper-<color>"></a>
 </p>
-<h5 align="center"><em>Yonglin Li, Jing Zhang, Xiao Teng, Long Lan</em></h5>
+<h5 align="center"><em>Yonglin Li, Jing Zhang, Xiao Teng, Long Lan, Xinwang Liu</em></h5>
 <p align="center">
-  <a href="#news">News</a> |
   <a href="#introduction">Abstract</a> |
   <a href="#usage">Usage</a> |
   <a href="#results">Results</a> |
   <a href="#statement">Statement</a>
 </p>
 
-
-
-
-# News
-
-**2023.07.04**
-
-- The paper is post on arxiv!
-
-<!-- > other news -->
-
-
-
-
 # Introduction
 
 This is the official repository of the paper <a href=""> RefSAM: Efficiently Adapting Segmenting Anything Model for Referring Video Object Segmentation </a>
 
 <figure>
-<img src="Figs/overall_network.png">
-<figcaption align = "center"><b>Figure 1: The overall pipeline of RefSAM. It mainly consists of five key components: Visual Encoder of SAM , Text Encoder, Cross Modal MLP, Dense Attention, Mask Decoder of SAM.
+<img src="Figs/overall_network.jpg">
+<figcaption align = "center"><b>The overall pipeline of RefSAM. It mainly consists of five key components: 1) Backbone: Visual Encoder of SAM with Adapter and Text Encoder; 2) Cross-Modal MLP; 3) Hierarchical Dense Attention; 4) Mask Decoder of SAM; and 5) Implicit Tracking Module.
 </b></figcaption>
 </figure>
 
 <p>
 
 <p align="left"> In this study, we present the RefSAM model, which for the first time explores the potential of <a href="https://arxiv.org/abs/2304.02643"> SAM </a> for RVOS by incorporating multi-view information from diverse modalities and successive frames at different timestamps. Our proposed approach adapts the original SAM model to enhance cross-modality learning by employing a lightweight Cross-Modal MLP that projects the text embedding of the referring expression into sparse and dense embeddings, serving as user-interactive prompts. Subsequently, a parameter-efficient tuning strategy is employed to effectively align and fuse the language and vision features. Through comprehensive ablation studies, we demonstrate the practical and effective design choices of our strategy. Extensive experiments conducted on Ref-Youtube-VOS and Ref-DAVIS17 datasets validate the superiority and effectiveness of our RefSAM model over existing methods.
 
-
-
-
 # Usage
-The code will be released soon.
-
-
-
 
 # Results
 ## Results on RVOS datasets
 
 <figure style="text-align: center;">
-<img src="Figs/Results%20on%20Ref-DAVIS17.png">
+<img src="Figs/ResultsonRef-DAVIS17.png">
 <figcaption align = "center"><b>Figure 2: Results on Ref-DAVIS17. 
  </b></figcaption>
 </figure>
 
 <p>
 
 <figure style="text-align: center;">
-<img src="Figs/Results on Ref-Youtube-VOS.jpeg">
+<img src="Figs/Results on Ref-Youtube-VOS.png">
 <figcaption align = "center"><b>Figure 3: Results on Ref-Youtube-VOS. 
  </b></figcaption>
 </figure>
@@ -107,18 +85,15 @@ Furthermore, we present the results of different models. It is clear that our Ref
 ### The influence of different learning rates for the learnable modules
 
 <figure>
-<img src="Figs/The influence of different learning rates for the learnable modules of RefSAM.png">
+<img src="Figs/Ablation study of different module designs.png">
 <figcaption align = "center"><b>Figure 6: The influence of different learning rates for the learnable modules of RefSAM.</a>  
  </b></figcaption>
 </figure>
 
-
-
-
 ### Ablation study of different module designs.
 
 <figure>
-<img src="Figs/Ablation study of different module designs.png">
+<img src="Figs/Ablation study of different module designs2.png">
 <figcaption align = "center"><b>Figure 7: Ablation study of different module designs.
 </a>  
  </b></figcaption>
@@ -133,7 +108,7 @@ Furthermore, we present the results of different models. It is clear that our Ref
 
 
 <figure style="text-align: center;">
-<img src="Figs/Ablation%20study%20of%20the%20key%20compoents%20of%20RefSAM.png">
+<img src="Figs/AblationstudyofthekeycompoentsofRefSAM.png">
 <figcaption align = "center"><b>Figure 8: Ablation study of the key components of RefSAM</a>  
  </b></figcaption>
 </figure>
@@ -145,7 +120,7 @@ Furthermore, we present the results of different models. It is clear that our Ref
 
 <figure style="text-align: center;">
 <img src="Figs/Influence of the model size of Visual Encoder.png">
-<figcaption align = "center"><b>Figure 9: Influence of the model size of Visual Encoder. We do not use the data augmentation in this experiment. </a>  
+<figcaption align = "center"><b>Figure 9: Influence of the model size of Visual Encoder. </a>  
  </b></figcaption>
 </figure>
 
@@ -166,7 +141,7 @@ Furthermore, we present the results of different models. It is clear that our Ref
 ### Inference speed of different models.
 
 <figure style="text-align: center;">
-<img src="Figs/Inference%20speed%20of%20different%20models..png">
+<img src="Figs/Inferencespeedofdifferentmodels.png">
 <figcaption align = "center"><b>Figure 11: Inference speed of different models.</a>  
  </b></figcaption>
 </figure>
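
Note on the image paths in this README diff: several `src` values that were percent-encoded (`%20`) are replaced with space-free file names, while others still contain literal spaces. If the spaced file names are kept, one option is to percent-encode the path instead of renaming the file. The sketch below uses Python's standard `urllib.parse.quote`; the helper name `encode_figure_path` is hypothetical and not part of the repository.

```python
from urllib.parse import quote


def encode_figure_path(path: str) -> str:
    """Percent-encode a relative image path for use in an HTML src attribute.

    Hypothetical helper for illustration; not part of the RefSAM repository.
    """
    # quote() leaves "/" unescaped by default, so directory separators survive,
    # while spaces become "%20".
    return quote(path)


if __name__ == "__main__":
    print(encode_figure_path("Figs/Results on Ref-DAVIS17.png"))
```

A path without spaces or other reserved characters passes through unchanged, so the helper can be applied uniformly to every figure path.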

diff --git a/datasets/categories.py b/datasets/categories.py
@@ -0,0 +1,45 @@
+# -------------------------------------------------------------------------------------------------------------------
+# 1. Ref-Youtube-VOS
+ytvos_category_dict = {
+    'airplane': 0, 'ape': 1, 'bear': 2, 'bike': 3, 'bird': 4, 'boat': 5, 'bucket': 6, 'bus': 7, 'camel': 8, 'cat': 9, 
+    'cow': 10, 'crocodile': 11, 'deer': 12, 'dog': 13, 'dolphin': 14, 'duck': 15, 'eagle': 16, 'earless_seal': 17, 
+    'elephant': 18, 'fish': 19, 'fox': 20, 'frisbee': 21, 'frog': 22, 'giant_panda': 23, 'giraffe': 24, 'hand': 25, 
+    'hat': 26, 'hedgehog': 27, 'horse': 28, 'knife': 29, 'leopard': 30, 'lion': 31, 'lizard': 32, 'monkey': 33, 
+    'motorbike': 34, 'mouse': 35, 'others': 36, 'owl': 37, 'paddle': 38, 'parachute': 39, 'parrot': 40, 'penguin': 41, 
+    'person': 42, 'plant': 43, 'rabbit': 44, 'raccoon': 45, 'sedan': 46, 'shark': 47, 'sheep': 48, 'sign': 49, 
+    'skateboard': 50, 'snail': 51, 'snake': 52, 'snowboard': 53, 'squirrel': 54, 'surfboard': 55, 'tennis_racket': 56, 
+    'tiger': 57, 'toilet': 58, 'train': 59, 'truck': 60, 'turtle': 61, 'umbrella': 62, 'whale': 63, 'zebra': 64
+}
+
+ytvos_category_list = [
+    'airplane', 'ape', 'bear', 'bike', 'bird', 'boat', 'bucket', 'bus', 'camel', 'cat', 'cow', 'crocodile', 
+    'deer', 'dog', 'dolphin', 'duck', 'eagle', 'earless_seal', 'elephant', 'fish', 'fox', 'frisbee', 'frog', 
+    'giant_panda', 'giraffe', 'hand', 'hat', 'hedgehog', 'horse', 'knife', 'leopard', 'lion', 'lizard', 
+    'monkey', 'motorbike', 'mouse', 'others', 'owl', 'paddle', 'parachute', 'parrot', 'penguin', 'person', 
+    'plant', 'rabbit', 'raccoon', 'sedan', 'shark', 'sheep', 'sign', 'skateboard', 'snail', 'snake', 'snowboard', 
+    'squirrel', 'surfboard', 'tennis_racket', 'tiger', 'toilet', 'train', 'truck', 'turtle', 'umbrella', 'whale', 'zebra'
+]
+
+# -------------------------------------------------------------------------------------------------------------------
+# 2. Ref-DAVIS17
+davis_category_dict = {
+    'airplane': 0, 'backpack': 1, 'ball': 2, 'bear': 3, 'bicycle': 4, 'bird': 5, 'boat': 6, 'bottle': 7, 'box': 8, 'bus': 9, 
+    'camel': 10, 'car': 11, 'carriage': 12, 'cat': 13, 'cellphone': 14, 'chamaleon': 15, 'cow': 16, 'deer': 17, 'dog': 18, 
+    'dolphin': 19, 'drone': 20, 'elephant': 21, 'excavator': 22, 'fish': 23, 'goat': 24, 'golf cart': 25, 'golf club': 26, 
+    'grass': 27, 'guitar': 28, 'gun': 29, 'helicopter': 30, 'horse': 31, 'hoverboard': 32, 'kart': 33, 'key': 34, 'kite': 35, 
+    'koala': 36, 'leash': 37, 'lion': 38, 'lock': 39, 'mask': 40, 'microphone': 41, 'monkey': 42, 'motorcycle': 43, 'oar': 44, 
+    'paper': 45, 'paraglide': 46, 'person': 47, 'pig': 48, 'pole': 49, 'potted plant': 50, 'puck': 51, 'rack': 52, 'rhino': 53, 
+    'rope': 54, 'sail': 55, 'scale': 56, 'scooter': 57, 'selfie stick': 58, 'sheep': 59, 'skateboard': 60, 'ski': 61, 'ski poles': 62, 
+    'snake': 63, 'snowboard': 64, 'stick': 65, 'stroller': 66, 'surfboard': 67, 'swing': 68, 'tennis racket': 69, 'tractor': 70, 
+    'trailer': 71, 'train': 72, 'truck': 73, 'turtle': 74, 'varanus': 75, 'violin': 76, 'wheelchair': 77
+}
+
+davis_category_list = [
+    'airplane', 'backpack', 'ball', 'bear', 'bicycle', 'bird', 'boat', 'bottle', 'box', 'bus', 'camel', 'car', 'carriage', 
+    'cat', 'cellphone', 'chamaleon', 'cow', 'deer', 'dog', 'dolphin', 'drone', 'elephant', 'excavator', 'fish', 'goat', 
+    'golf cart', 'golf club', 'grass', 'guitar', 'gun', 'helicopter', 'horse', 'hoverboard', 'kart', 'key', 'kite', 'koala', 
+    'leash', 'lion', 'lock', 'mask', 'microphone', 'monkey', 'motorcycle', 'oar', 'paper', 'paraglide', 'person', 'pig', 
+    'pole', 'potted plant', 'puck', 'rack', 'rhino', 'rope', 'sail', 'scale', 'scooter', 'selfie stick', 'sheep', 'skateboard', 
+    'ski', 'ski poles', 'snake', 'snowboard', 'stick', 'stroller', 'surfboard', 'swing', 'tennis racket', 'tractor', 'trailer', 
+    'train', 'truck', 'turtle', 'varanus', 'violin', 'wheelchair'
+]
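
Note on `datasets/categories.py`: each dataset stores the same category names twice, once as a name-to-id dict and once as a list ordered by id, so the two can drift apart if only one is edited. A minimal sketch of deriving the list from the dict is shown below. It assumes the ids are meant to be contiguous from 0, as they are in this commit; the function name `category_list_from_dict` is hypothetical, and the dict here is only a five-entry excerpt of `ytvos_category_dict`.

```python
# Excerpt of ytvos_category_dict from datasets/categories.py (first five entries only).
ytvos_category_dict = {'airplane': 0, 'ape': 1, 'bear': 2, 'bike': 3, 'bird': 4}


def category_list_from_dict(category_dict):
    """Return category names ordered by their integer id.

    Hypothetical helper for illustration; not part of the RefSAM repository.
    Raises ValueError if the ids are not exactly 0..N-1.
    """
    names = sorted(category_dict, key=category_dict.get)
    ids = [category_dict[name] for name in names]
    if ids != list(range(len(names))):
        raise ValueError("category ids must be contiguous and start at 0")
    return names


# The list is derived from the dict instead of being maintained by hand.
ytvos_category_list = category_list_from_dict(ytvos_category_dict)

if __name__ == "__main__":
    print(ytvos_category_list)
```

With this approach, `ytvos_category_list[ytvos_category_dict[name]] == name` holds for every category by construction.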