
Yonghao He1,*,🌟, Hu Su2,*,📧, Haiyong Yu1,*, Cong Yang3, Wei Sui1, Cong Wang1, Song Liu4,📧

* Equal contribution, 🌟 Project lead, 📧 Corresponding author

1 D-Robotics,
2 State Key Laboratory of Multimodal Artificial Intelligence Systems (MAIS), Institute of Automation, Chinese Academy of Sciences,
3 BeeLab, School of Future Science and Engineering, Soochow University,
4 School of Information Science and Technology, ShanghaiTech University


🔥 Updates

[2024-12-27]: Decoupled Open-Set Object Detector (DOSOD), with ultra real-time speed and superior accuracy, is released. We sincerely welcome contributions of all kinds, such as porting DOSOD to more edge-side platforms, as well as feedback and suggestions.

1. Introduction

1.1 A Brief Introduction to DOSOD

Since YOLO-World established a new SOTA in open-vocabulary object detection, open-vocabulary detectors have been applied in a wide range of scenarios, and real-time open-vocabulary detection has attracted significant attention. In our paper, Decoupled Open-Set Object Detection (DOSOD) is proposed as a practical and highly efficient solution for real-time open-set object detection (OSOD) tasks in robotic systems. Specifically, DOSOD builds on the YOLO-World pipeline by integrating a vision-language model (VLM) with a detector. A Multilayer Perceptron (MLP) adaptor is developed to convert text embeddings extracted by the VLM into a joint space, within which the detector learns the region representations of class-agnostic proposals. Cross-modality features are directly aligned in the joint space, avoiding complex feature interactions and thereby improving computational efficiency. During the testing phase, DOSOD functions like a traditional closed-set detector, effectively bridging the gap between closed-set and open-set detection.
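To make the decoupled design concrete, here is a minimal PyTorch sketch of the idea: a hypothetical MLP adaptor maps frozen VLM text embeddings into the joint space, where class-agnostic region features are scored by a single matrix product. All names, shapes, and layer counts below are illustrative assumptions, not the repo's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MLPAdaptor(nn.Module):
    """Maps frozen VLM text embeddings into the joint space (hypothetical 3-layer MLP)."""
    def __init__(self, text_dim: int = 512, joint_dim: int = 512):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(text_dim, joint_dim), nn.ReLU(),
            nn.Linear(joint_dim, joint_dim), nn.ReLU(),
            nn.Linear(joint_dim, joint_dim),
        )

    def forward(self, text_embeds: torch.Tensor) -> torch.Tensor:
        return self.mlp(text_embeds)

def classify_proposals(region_feats: torch.Tensor,
                       text_embeds: torch.Tensor,
                       adaptor: MLPAdaptor) -> torch.Tensor:
    """Score class-agnostic region features against adapted text embeddings.

    region_feats: (num_proposals, joint_dim), from the detector head.
    text_embeds:  (num_classes, text_dim), from the VLM text encoder.
    Returns:      (num_proposals, num_classes) similarity logits.
    """
    joint_text = F.normalize(adaptor(text_embeds), dim=-1)  # (C, D)
    region = F.normalize(region_feats, dim=-1)              # (N, D)
    # Direct alignment in the joint space: one matrix product,
    # with no cross-modality feature interaction inside the detector.
    return region @ joint_text.t()

# At deployment, the adapted text embeddings can be precomputed once and folded
# into a fixed linear classifier (re-parameterization, see Section 4), so the
# network runs exactly like a closed-set detector.
```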

1.2 Repo Structure

Our implementation is based on YOLO-World; the newly added code can be found in the following scripts:

2. Model Overview

Following YOLO-World, we also pre-trained DOSOD-S/M/L from scratch on public datasets and conducted zero-shot evaluation on LVIS minival and COCO val2017. All pre-trained models are released.

2.1 Zero-shot Evaluation on LVIS minival

| Model | Pre-train Data | Size | APmini | APr | APc | APf | Weights |
|---|---|---|---|---|---|---|---|
| YOLO-Worldv1-S (repo) | O365+GoldG | 640 | 24.3 | 16.6 | 22.1 | 27.7 | HF Checkpoints 🤗 |
| YOLO-Worldv1-M (repo) | O365+GoldG | 640 | 28.6 | 19.7 | 26.6 | 31.9 | HF Checkpoints 🤗 |
| YOLO-Worldv1-L (repo) | O365+GoldG | 640 | 32.5 | 22.3 | 30.6 | 36.1 | HF Checkpoints 🤗 |
| YOLO-Worldv1-S (paper) | O365+GoldG | 640 | 26.2 | 19.1 | 23.6 | 29.8 | HF Checkpoints 🤗 |
| YOLO-Worldv1-M (paper) | O365+GoldG | 640 | 31.0 | 23.8 | 29.2 | 33.9 | HF Checkpoints 🤗 |
| YOLO-Worldv1-L (paper) | O365+GoldG | 640 | 35.0 | 27.1 | 32.8 | 38.3 | HF Checkpoints 🤗 |
| YOLO-Worldv2-S | O365+GoldG | 640 | 22.7 | 16.3 | 20.8 | 25.5 | HF Checkpoints 🤗 |
| YOLO-Worldv2-M | O365+GoldG | 640 | 30.0 | 25.0 | 27.2 | 33.4 | HF Checkpoints 🤗 |
| YOLO-Worldv2-L | O365+GoldG | 640 | 33.0 | 22.6 | 32.0 | 35.8 | HF Checkpoints 🤗 |
| DOSOD-S | O365+GoldG | 640 | 26.7 | 19.9 | 25.1 | 29.3 | HF Checkpoints 🤗 |
| DOSOD-M | O365+GoldG | 640 | 31.3 | 25.7 | 29.6 | 33.7 | HF Checkpoints 🤗 |
| DOSOD-L | O365+GoldG | 640 | 34.4 | 29.1 | 32.6 | 36.6 | HF Checkpoints 🤗 |

NOTE: The YOLO-Worldv1 results reported in the repo differ from those in the paper, so both sets are listed above.

2.2 Zero-shot Inference on COCO dataset

| Model | Pre-train Data | Size | AP | AP50 | AP75 |
|---|---|---|---|---|---|
| YOLO-Worldv1-S | O365+GoldG | 640 | 37.6 | 52.3 | 40.7 |
| YOLO-Worldv1-M | O365+GoldG | 640 | 42.8 | 58.3 | 46.4 |
| YOLO-Worldv1-L | O365+GoldG | 640 | 44.4 | 59.8 | 48.3 |
| YOLO-Worldv2-S | O365+GoldG | 640 | 37.5 | 52.0 | 40.7 |
| YOLO-Worldv2-M | O365+GoldG | 640 | 42.8 | 58.2 | 46.7 |
| YOLO-Worldv2-L | O365+GoldG | 640 | 45.4 | 61.0 | 49.4 |
| DOSOD-S | O365+GoldG | 640 | 36.1 | 51.0 | 39.1 |
| DOSOD-M | O365+GoldG | 640 | 41.7 | 57.1 | 45.2 |
| DOSOD-L | O365+GoldG | 640 | 44.6 | 60.5 | 48.4 |

2.3 Latency on RTX 4090

We use the trtexec tool from TensorRT 8.6.1.6 to measure latency in FP16 mode. All models are re-parameterized with the 80 COCO categories. The log for each measurement can be viewed by clicking the FPS value.
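For reference, a typical trtexec invocation could look like the following; the ONNX file name is a placeholder for a model exported as described in Section 4:

```bash
# Build a FP16 TensorRT engine from the exported ONNX model and profile it.
# FPS is then derived from the "GPU Compute Time" mean reported in the log.
trtexec --onnx=dosod_mlp3x_s_rep.onnx --fp16 --avgRuns=100
```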

| Model | Params | FPS |
|---|---|---|
| YOLO-Worldv1-S | 13.32M | 1007 |
| YOLO-Worldv1-M | 28.93M | 702 |
| YOLO-Worldv1-L | 47.38M | 494 |
| YOLO-Worldv2-S | 12.66M | 1221 |
| YOLO-Worldv2-M | 28.20M | 771 |
| YOLO-Worldv2-L | 46.62M | 553 |
| DOSOD-S | 11.48M | 1582 |
| DOSOD-M | 26.31M | 922 |
| DOSOD-L | 44.19M | 632 |

NOTE: FPS = 1000 / mean GPU compute time (in milliseconds).
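For example, the 1582 FPS reported for DOSOD-S corresponds to a mean GPU compute time of about 0.63 ms (1000 / 0.632 ≈ 1582).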

2.4 Latency on RDK X5

We evaluate the real-time performance of the YOLO-World-v2 models and our DOSOD models on the D-Robotics RDK X5 development kit. The models are re-parameterized with the 1203 categories defined in LVIS and run with either 1 or 8 threads, in INT16 or INT8 quantization mode.

| Model | FPS (1 thread, INT16/INT8) | FPS (8 threads, INT16/INT8) |
|---|---|---|
| YOLO-Worldv2-S | 5.962 / 11.044 | 6.386 / 12.590 |
| YOLO-Worldv2-M | 4.136 / 7.290 | 4.340 / 7.930 |
| YOLO-Worldv2-L | 2.958 / 5.377 | 3.060 / 5.720 |
| DOSOD-S | 12.527 / 31.020 | 14.657 / 47.328 |
| DOSOD-M | 8.531 / 20.238 | 9.471 / 26.360 |
| DOSOD-L | 5.663 / 12.799 | 6.069 / 14.939 |
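As a point of comparison drawn from the table, at INT8 with 8 threads DOSOD-S runs roughly 3.8× faster than YOLO-Worldv2-S (47.328 vs 12.590 FPS).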

3. Getting Started

Most of the steps are consistent with those in the YOLO-World README.md. A few extra points need attention:

  • Clone the project: `git clone https://github.com/D-Robotics-AI-Lab/DOSOD.git`
  • Latency evaluation: we provide a script to evaluate latency on NVIDIA GPUs.
  • Note: we pre-train DOSOD on 8 NVIDIA RTX 4090 GPUs with a batch size of 128, while YOLO-World uses 32 NVIDIA V100 GPUs with a batch size of 512.

4. Reparameterization and Inference

4.1 On NVIDIA RTX 4090

  • Step 1: generate text embeddings

`python tools/generate_text_prompts_dosod.py path_to_config_file path_to_model_file --text path_to_texts_json_file --out-dir dir_to_save_embedding_npy_file`

path_to_config_file is the config used for training.
path_to_model_file is the .pth model file corresponding to path_to_config_file.
path_to_texts_json_file contains the vocabulary, for example data/texts/coco_class_texts.json.
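For reference, the vocabulary file follows the YOLO-World convention of a JSON list of per-class name lists, e.g. `[["person"], ["traffic light"]]`; check data/texts/coco_class_texts.json for the exact format, as this example is only an assumption about its shape.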

  • Step 2: re-parameterize the model weights

`python tools/reparameterize_dosod.py --model path_to_model_file --out-dir dir_to_save_rep_model_file --text-embed path_to_embedding_npy_file`

path_to_embedding_npy_file is the embedding file produced in Step 1.

  • Step 3: export ONNX using the rep-style config

`python deploy/export_onnx.py path_to_rep_config_file path_to_rep_model_file --without-nms --work-dir dir_to_save_rep_onnx_file`

path_to_rep_config_file is the modified config for re-parameterization, for example configs/dosod/rep_dosod_mlp3x_s_100e_1x8gpus_obj365v1_goldg_train_lvis_minival.py.
path_to_rep_model_file is the re-parameterized model produced in Step 2.

  • Step 4: run the ONNX demo

`python deploy/onnx_demo.py path_to_rep_onnx_file path_to_test_image path_to_texts_json_file --output-dir dir_to_save_result_image --onnx-nms`

path_to_rep_onnx_file is the ONNX file exported in Step 3.
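Putting the four steps together, a complete run might look like the sketch below. Every concrete path here is an assumption; substitute your own config, checkpoint, vocabulary, and output directories.

```bash
# Illustrative end-to-end pipeline; all concrete file names below are assumptions.
CFG=path_to_config_file                 # training config from your setup
CKPT=path_to_model_file                 # matching .pth checkpoint
TEXTS=data/texts/coco_class_texts.json  # vocabulary file from the repo
REP_CFG=configs/dosod/rep_dosod_mlp3x_s_100e_1x8gpus_obj365v1_goldg_train_lvis_minival.py

# Step 1: export text embeddings for the vocabulary
python tools/generate_text_prompts_dosod.py "$CFG" "$CKPT" --text "$TEXTS" --out-dir work_dirs/embeds

# Step 2: fold the embeddings into the model weights
python tools/reparameterize_dosod.py --model "$CKPT" --out-dir work_dirs/rep \
    --text-embed work_dirs/embeds/coco_class_texts.npy   # .npy name assumed

# Step 3: export ONNX with the rep-style config
python deploy/export_onnx.py "$REP_CFG" work_dirs/rep/dosod_mlp3x_s_rep.pth \
    --without-nms --work-dir work_dirs/onnx              # rep .pth name assumed

# Step 4: run the ONNX demo on a test image
python deploy/onnx_demo.py work_dirs/onnx/dosod_mlp3x_s_rep.onnx demo/sample.jpg "$TEXTS" \
    --output-dir work_dirs/results --onnx-nms            # .onnx name assumed
```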

4.2 On RDK X5

To make the model available for RDK X5, a different config file is needed in Step 3:
path_to_rep_config_file should be a file with the suffix _d-robotics.py, for example configs/dosod/rep_dosod_mlp3x_s_d-robotics.py. For more details, you can refer to the code file.

To run the model on RDK X5, you can use the pre-prepared models with the 80 COCO categories and the corresponding vocabulary JSON file listed below, following the Usage instructions.

| Model | Vocabulary JSON file | RDK X5 INT16 model | RDK X5 INT8 model |
|---|---|---|---|
| DOSOD-S | offline_vocabulary.json | dosod_mlp3x_s_rep-int16.bin | dosod_mlp3x_s_rep-int8.bin |
| DOSOD-M | offline_vocabulary.json | dosod_mlp3x_m_rep-int16.bin | dosod_mlp3x_m_rep-int8.bin |
| DOSOD-L | offline_vocabulary.json | dosod_mlp3x_l_rep-int16.bin | dosod_mlp3x_l_rep-int8.bin |

Furthermore, if you wish to use the model with custom categories, you can refer to the Instructions_EN or Instructions_CN for more help.

Acknowledgement

We sincerely thank YOLO-World, mmyolo, mmdetection, GLIP, and transformers for providing their wonderful code to the community!

Citations

If you find DOSOD useful in your research or applications, please consider giving us a star 🌟 and citing it.

@article{He2024DOSOD,
  title={A Light-Weight Framework for Open-Set Object Detection with Decoupled Feature Alignment in Joint Space},
  author={He, Yonghao and Su, Hu and Yu, Haiyong and Yang, Cong and Sui, Wei and Wang, Cong and Liu, Song},
  journal={arXiv preprint arXiv:2412.14680},
  year={2024}
}

License

DOSOD is released under the GPL-3.0 License, and commercial usage is supported. If you need a commercial license for DOSOD, please feel free to contact us.