MUTR: A Unified Temporal Transformer for Multi-Modal Video Object Segmentation

Official implementation of 'Referred by Multi-Modality: A Unified Temporal Transformer for Video Object Segmentation'.

The paper has been accepted by AAAI 2024 🔥.

Introduction

We propose MUTR, a Multi-modal Unified Temporal transformer for Referring video object segmentation. Offering a unified framework for the first time, MUTR adopts a DETR-style transformer and can segment video objects designated by either a text or an audio reference. Specifically, we introduce two strategies to fully explore the temporal relations between videos and multi-modal signals: low-level temporal aggregation (MTA) and high-level temporal interaction (MTI). On Ref-YouTube-VOS and AVSBench, with text and audio references respectively, MUTR achieves +4.2% and +4.2% J&F improvements over state-of-the-art methods, demonstrating its significance for unified multi-modal VOS.

Update

TODO: Release the code and checkpoints on AV-VOS with audio reference 📌.
We release the code and checkpoints of MUTR on RVOS with language reference 🔥.

Requirements

We tested the code in the following environment; other versions may also be compatible:

CUDA 11.1
Python 3.7
PyTorch 1.8.1
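
As a quick sanity check, the tested versions listed above can be compared against the local environment. The following is a minimal sketch and not part of this repository; the helper names `installed_versions` and `report` are hypothetical, and a mismatch is only reported as information since other versions may also work.

```python
import sys

# Versions listed in the Requirements section above.
TESTED = {"python": "3.7", "torch": "1.8.1", "cuda": "11.1"}


def installed_versions():
    """Return locally found versions; an entry is None when unavailable."""
    found = {
        "python": "{}.{}".format(*sys.version_info[:2]),
        "torch": None,
        "cuda": None,
    }
    try:
        import torch  # if PyTorch is missing, the torch/cuda entries stay None
    except ImportError:
        return found
    # Drop local build tags such as "+cu111" from the version string.
    found["torch"] = str(torch.__version__).split("+")[0]
    # torch.version.cuda is None for CPU-only builds.
    found["cuda"] = torch.version.cuda
    return found


def report(found, tested=TESTED):
    """Build one human-readable line per requirement."""
    lines = []
    for name, want in tested.items():
        have = found.get(name)
        status = "matches" if have == want else "differs (may still work)"
        lines.append(f"{name}: tested {want}, found {have} -> {status}")
    return lines


if __name__ == "__main__":
    print("\n".join(report(installed_versions())))
```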

Installation

Please refer to install.md for installation.

Data Preparation

Please refer to data.md for data preparation.

After organizing the data, we expect the directory structure to be the following:

MUTR/
├── data/
│   ├── ref-youtube-vos/
│   ├── ref-davis/
├── davis2017/
├── datasets/
├── models/
├── scripts/
├── tools/
├── util/
├── train.py
├── engine.py
├── inference_ytvos.py
├── inference_davis.py
├── opts.py
...
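
Before training or inference, it can help to confirm that the layout above is in place. The sketch below is an illustration only, not a script shipped with this repository: `missing_entries` is a hypothetical helper, the entry lists are copied from the tree (which ends in `...`, so they are not exhaustive), and the scripts directory is assumed to be named `scripts/`.

```python
import sys
from pathlib import Path

# Entries copied from the directory tree above (not exhaustive).
EXPECTED_DIRS = [
    "data/ref-youtube-vos",
    "data/ref-davis",
    "davis2017",
    "datasets",
    "models",
    "scripts",  # assumed spelling of the scripts directory
    "tools",
    "util",
]
EXPECTED_FILES = [
    "train.py",
    "engine.py",
    "inference_ytvos.py",
    "inference_davis.py",
    "opts.py",
]


def missing_entries(root):
    """Return the expected directories and files that are absent under root."""
    root = Path(root)
    missing = [d + "/" for d in EXPECTED_DIRS if not (root / d).is_dir()]
    missing += [f for f in EXPECTED_FILES if not (root / f).is_file()]
    return missing


if __name__ == "__main__":
    # Usage: python check_layout.py /path/to/MUTR  (defaults to the current directory)
    repo_root = sys.argv[1] if len(sys.argv) > 1 else "."
    gaps = missing_entries(repo_root)
    if gaps:
        print("Missing entries:")
        for entry in gaps:
            print("  " + entry)
    else:
        print("Directory layout looks complete.")
```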

Get Started

Please see Ref-YouTube-VOS and Ref-DAVIS 2017 for details.

Model Zoo and Results

Note:

--backbone denotes the different backbones (see here).

--backbone_pretrained denotes the path of the backbone's pretrained weight (see here).

Ref-YouTube-VOS

To evaluate the results, please upload the zip file to the competition server.

Backbone	J&F	J	F	Model	Submission
ResNet-50	61.9	60.4	63.4	model	link
ResNet-101	63.6	61.8	65.4	model	link
Swin-L	68.4	66.4	70.4	model	link
Video-Swin-T	64.0	62.2	65.8	model	link
Video-Swin-S	65.1	63.0	67.1	model	link
Video-Swin-B	67.5	65.4	69.6	model	link
ConvNext-L	66.7	64.8	68.7	model	link
ConvMAE-B	66.9	64.7	69.1	model	link
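
For readers unfamiliar with the metric, J&F is conventionally the mean of region similarity (J) and contour accuracy (F). The sketch below, which is illustrative and not part of the evaluation code, recomputes that mean from the J and F columns above and checks it against the reported J&F column. The tolerance of 0.1 is an assumption that accounts for each reported value being rounded to one decimal place.

```python
# (backbone, J&F, J, F) rows copied from the Ref-YouTube-VOS table above.
ROWS = [
    ("ResNet-50", 61.9, 60.4, 63.4),
    ("ResNet-101", 63.6, 61.8, 65.4),
    ("Swin-L", 68.4, 66.4, 70.4),
    ("Video-Swin-T", 64.0, 62.2, 65.8),
    ("Video-Swin-S", 65.1, 63.0, 67.1),
    ("Video-Swin-B", 67.5, 65.4, 69.6),
    ("ConvNext-L", 66.7, 64.8, 68.7),
    ("ConvMAE-B", 66.9, 64.7, 69.1),
]


def j_and_f(j, f):
    """J&F as the arithmetic mean of J and F."""
    return (j + f) / 2.0


def inconsistent_rows(rows, tol=0.1 + 1e-9):
    """Return backbones whose reported J&F differs from mean(J, F) by more than tol."""
    return [
        name
        for name, reported, j, f in rows
        if abs(j_and_f(j, f) - reported) > tol
    ]


if __name__ == "__main__":
    for name, reported, j, f in ROWS:
        print(f"{name}: reported {reported:.1f}, mean(J, F) = {j_and_f(j, f):.2f}")
    print("Inconsistent rows:", inconsistent_rows(ROWS))
```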

Ref-DAVIS17

As described in the paper, we report results using the model trained on Ref-YouTube-VOS without fine-tuning.

Backbone	J&F	J	F	Model
ResNet-50	65.3	62.4	68.2	model
ResNet-101	65.3	61.9	68.6	model
Swin-L	68.0	64.8	71.3	model
Video-Swin-T	66.5	63.0	70.0	model
Video-Swin-S	66.1	62.6	69.8	model
Video-Swin-B	66.4	62.8	70.0	model
ConvNext-L	69.0	65.6	72.4	model
ConvMAE-B	69.2	65.6	72.8	model

Acknowledgement

This repo is based on ReferFormer. We also refer to the Deformable DETR and MTTR repositories. We thank the authors for their wonderful work.

Citation

@inproceedings{yan2024referred,
  title={Referred by multi-modality: A unified temporal transformer for video object segmentation},
  author={Yan, Shilin and Zhang, Renrui and Guo, Ziyu and Chen, Wenchao and Zhang, Wei and Li, Hongyang and Qiao, Yu and Dong, Hao and He, Zhongjiang and Gao, Peng},
  booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},
  volume={38},
  number={6},
  pages={6449--6457},
  year={2024}
}

Contact

If you have any questions about this project, please feel free to contact [email protected].
