Grounding DINO aims to merge concepts found in the DINO and GLIP papers. DINO, a transformer-based detection method, offers state-of-the-art object detection performance and end-to-end optimization, eliminating the need for handcrafted modules like NMS (Non-Maximum Suppression).
On the other hand, GLIP focuses on phrase grounding. This task involves associating phrases or words from a given text with corresponding visual elements in an image or video, effectively linking textual descriptions to their respective visual representations.
The Segment Anything Model (SAM) is an instance segmentation model developed by Meta Research and released in April 2023. Segment Anything was trained on 11 million images and 1.1 billion segmentation masks.
To start the annotation process, prepare the image you want to label. Then use Grounding DINO to generate bounding boxes around the objects in the image, guided by a text prompt that names the classes of interest. These boxes serve as the reference for the instance segmentation step that follows.
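As a rough sketch, the detection step might look like the following, using the Hugging Face transformers port of Grounding DINO. The checkpoint name, image path, and text prompt below are illustrative assumptions, not values taken from this post:

```python
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForZeroShotObjectDetection

# Illustrative checkpoint; larger Grounding DINO variants are also available.
model_id = "IDEA-Research/grounding-dino-tiny"
device = "cuda" if torch.cuda.is_available() else "cpu"

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForZeroShotObjectDetection.from_pretrained(model_id).to(device)

image = Image.open("image.jpg")     # the image you want to annotate (assumed path)
text = "a dog. a frisbee."          # lowercase class phrases, separated by periods

inputs = processor(images=image, text=text, return_tensors="pt").to(device)
with torch.no_grad():
    outputs = model(**inputs)

# Convert raw model outputs into boxes in pixel (xyxy) coordinates.
results = processor.post_process_grounded_object_detection(
    outputs,
    inputs.input_ids,
    target_sizes=[image.size[::-1]],
)
boxes = results[0]["boxes"]    # one [x_min, y_min, x_max, y_max] row per detection
labels = results[0]["labels"]  # the matched text phrases
```

The same boxes can also be produced with the original GroundingDINO repository's inference utilities; the transformers port is shown here only because it installs as a single pip dependency.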
Once the bounding boxes are in place, SAM can convert them into instance segmentation masks. SAM takes each bounding box as a prompt and produces a segmentation mask for the object it encloses.
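Continuing the sketch, the boxes from the previous step can be passed to SAM as prompts through the segment-anything package's SamPredictor. The ViT-H checkpoint path is an assumption; download the weights from the Segment Anything repository first:

```python
import cv2
import torch
from segment_anything import sam_model_registry, SamPredictor

# Assumes the ViT-H SAM checkpoint has been downloaded locally.
device = "cuda" if torch.cuda.is_available() else "cpu"
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth").to(device)
predictor = SamPredictor(sam)

# SamPredictor expects an RGB uint8 array.
image_bgr = cv2.imread("image.jpg")
predictor.set_image(cv2.cvtColor(image_bgr, cv2.COLOR_BGR2RGB))

# Prompt SAM with each Grounding DINO box (xyxy pixel coordinates).
masks = []
for box in boxes:
    mask, score, _ = predictor.predict(
        box=box.cpu().numpy(),   # length-4 array: [x_min, y_min, x_max, y_max]
        multimask_output=False,  # one mask per box
    )
    masks.append(mask[0])        # boolean (H, W) mask for this object
```

From here, the boxes, labels, and masks can be exported in whatever annotation format your downstream training pipeline expects.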