OnlineRefer: A Simple Online Baseline for Referring Video Object Segmentation
Dongming Wu, Tiancai Wang, Yuang Zhang, Xiangyu Zhang, Jianbing Shen
Referring video object segmentation (RVOS) aims at segmenting an object in a video following human instruction. Current state-of-the-art methods fall into an offline pattern, in which each clip independently interacts with the text embedding for cross-modal understanding. They typically hold that the offline pattern is necessary for RVOS, yet it models only limited temporal association within each clip. In this work, we break with the previous offline belief and propose a simple yet effective online model using explicit query propagation, named OnlineRefer. Specifically, our approach leverages target cues that carry semantic information and a position prior, making the referring prediction for the current frame more accurate and easier. Furthermore, we generalize our online model into a semi-online framework to be compatible with video-based backbones. To show the effectiveness of our method, we evaluate it on four benchmarks, i.e., Refer-Youtube-VOS, Refer-DAVIS17, A2D-Sentences, and JHMDB-Sentences. Without bells and whistles, our OnlineRefer with a Swin-L backbone achieves 63.5 J&F and 64.8 J&F on Refer-Youtube-VOS and Refer-DAVIS17, outperforming all other offline methods.
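The sketch below illustrates the core idea of explicit query propagation: the queries decoded on one frame initialize the queries of the next frame, so semantic and positional target cues flow across time. This is a minimal PyTorch sketch under assumed shapes and module names (`OnlineReferSketch`, `frame_features`, and friends are hypothetical, not the repo's actual API):

```python
import torch
import torch.nn as nn

class OnlineReferSketch(nn.Module):
    """Toy illustration of explicit query propagation (names are hypothetical)."""

    def __init__(self, num_queries=5, dim=256):
        super().__init__()
        # Learnable queries used only to initialize the first frame.
        self.init_queries = nn.Embedding(num_queries, dim)
        layer = nn.TransformerDecoderLayer(dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=6)

    def forward(self, frame_features):
        # frame_features: list of per-frame fused visual-text features, each (1, hw, dim)
        queries = self.init_queries.weight.unsqueeze(0)  # (1, num_queries, dim)
        outputs = []
        for feat in frame_features:
            # Decode this frame; the queries carry target cues from previous frames.
            queries = self.decoder(queries, feat)
            outputs.append(queries)  # box/mask heads would read these embeddings
            # The decoded queries are propagated as-is to the next frame.
        return outputs

# Usage on dummy features for a 4-frame video:
model = OnlineReferSketch()
frames = [torch.randn(1, 100, 256) for _ in range(4)]
per_frame_queries = model(frames)
```

Because each frame attends only to the propagated queries, the online model never needs to hold a whole clip in memory at once.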
- (2023/07/18) OnlineRefer is accepted by ICCV 2023. The code for the online mode is released.
The main setup of our code follows ReferFormer.
Please refer to install.md for installation.
Please refer to data.md for data preparation.
To train and evaluate our online model on Ref-Youtube-VOS with a ResNet-50 backbone, run:
sh ./scripts/online_ytvos_r50.sh
To train and evaluate our online model on Ref-Youtube-VOS with a Swin-L backbone, run:
sh ./scripts/online_ytvos_swinl.sh
To run inference on your own video sequences, run the following command (a hedged sketch of the frame-by-frame loop is shown below):
python inference_long_videos.py
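For reference, here is a sketch of what such a frame-by-frame loop might look like; `model` and its `prev_queries` argument are illustrative placeholders, not the actual interface of inference_long_videos.py:

```python
from pathlib import Path

import torch
from PIL import Image

def run_on_frames(model, frame_dir, expression):
    # Stream frames one at a time: online inference keeps memory constant
    # no matter how long the video is.
    results, queries = [], None  # first frame falls back to the model's initial queries
    with torch.no_grad():
        for frame_path in sorted(Path(frame_dir).glob("*.jpg")):
            frame = Image.open(frame_path).convert("RGB")
            mask, queries = model(frame, expression, prev_queries=queries)
            results.append((frame_path.name, mask))
    return results
```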
Note: the models with ResNet-50 are trained on 8 NVIDIA 2080Ti GPUs, and the models with Swin-L are trained on 8 NVIDIA Tesla V100 GPUs.
Upload the resulting zip file of predictions to the competition server for evaluation.
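If you need to build the zip yourself, here is a minimal sketch, assuming predictions are saved as per-video, per-expression PNG masks under an `Annotations/` folder (the layout expected by the Ref-Youtube-VOS challenge server; the folder and file names here are assumptions):

```python
import os
import zipfile

def make_submission(results_dir="Annotations", out_path="submission.zip"):
    # Walk the prediction folder and keep the Annotations/... layout inside the zip.
    with zipfile.ZipFile(out_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for root, _, files in os.walk(results_dir):
            for name in files:
                zf.write(os.path.join(root, name))

make_submission()
```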
Results on Ref-Youtube-VOS:

Backbone | J&F | J | F | Pretrain | Model | Submission
---|---|---|---|---|---|---
ResNet-50 | 57.3 | 55.6 | 58.9 | weight | model | link |
Swin-L | 63.5 | 61.6 | 65.5 | weight | model | link |
Video Swin-B | 62.9 | 61.0 | 64.7 | - | - | link |
As described in the paper, we report the Refer-DAVIS17 results using the model trained on Ref-Youtube-VOS, without fine-tuning:
Backbone | J&F | J | F | Model
---|---|---|---|---
ResNet-50 | 59.3 | 55.7 | 62.9 | model |
Swin-L | 64.8 | 61.6 | 67.7 | model |
If you find OnlineRefer useful in your research, please consider citing:
@inproceedings{wu2023onlinerefer,
title={OnlineRefer: A Simple Online Baseline for Referring Video Object Segmentation},
author={Wu, Dongming and Wang, Tiancai and Zhang, Yuang and Zhang, Xiangyu and Shen, Jianbing},
booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
pages={2761--2770},
year={2023}
}