
VSS-CFFM & CFFM++

Official PyTorch implementation of CFFM, proposed in the CVPR 2022 paper: Coarse-to-Fine Feature Mining for Video Semantic Segmentation (paper).

Official PyTorch implementation of the extended method CFFM++ (Learning Local and Global Temporal Contexts for Video Semantic Segmentation, TPAMI 2024), which builds upon CFFM by additionally exploiting global temporal contexts from the whole video (paper).

Introduction

CFFM

Contextual information plays a core role in semantic segmentation. For video semantic segmentation, the contexts include static contexts and motional contexts, corresponding to the static and moving content in a video clip, respectively. Static contexts are well exploited in image semantic segmentation by learning multi-scale and global/long-range features. Motional contexts have been studied in previous work on video semantic segmentation. However, there is no research on how to simultaneously learn static and motional contexts, which are highly correlated and complementary to each other. To address this problem, we propose a Coarse-to-Fine Feature Mining (CFFM) technique to learn a unified representation of static and motional contexts. The technique consists of two parts: coarse-to-fine feature assembling and cross-frame feature mining. The former prepares data for further processing, enabling the subsequent joint learning of static and motional contexts. The latter mines useful information/contexts from the sequential frames to enhance the video contexts of the target frame's features. The enhanced features can be directly used for the final prediction. Experimental results on popular benchmarks demonstrate that the proposed CFFM performs favorably against state-of-the-art methods for video semantic segmentation.
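
To make the two operations more concrete, here is a minimal PyTorch sketch of the idea, assuming a transformer-style token layout. It is not the code of this repository, and the class and variable names (e.g., CrossFrameMining) are illustrative: reference frames are pooled to progressively coarser grids (coarse-to-fine feature assembling), and the target frame's tokens then attend to the assembled tokens (cross-frame feature mining).

import torch
import torch.nn as nn

class CrossFrameMining(nn.Module):
    """Illustrative cross-attention: target-frame tokens (queries) mine
    contexts from the assembled reference-frame tokens (keys/values)."""
    def __init__(self, dim):
        super().__init__()
        self.q, self.k, self.v = nn.Linear(dim, dim), nn.Linear(dim, dim), nn.Linear(dim, dim)
        self.norm = nn.LayerNorm(dim)
        self.scale = dim ** -0.5

    def forward(self, target_tokens, context_tokens):
        # target_tokens:  (B, N_t, C) tokens of the target frame
        # context_tokens: (B, N_c, C) coarse-to-fine tokens assembled from reference frames
        q, k, v = self.q(target_tokens), self.k(context_tokens), self.v(context_tokens)
        attn = torch.softmax(q @ k.transpose(1, 2) * self.scale, dim=-1)  # (B, N_t, N_c)
        return self.norm(target_tokens + attn @ v)                        # residual enhancement

# Toy usage: three reference frames are pooled to coarser grids the further they are
# from the target frame, then concatenated and mined by the target frame.
B, C = 2, 64
frames = [torch.randn(B, C, 60, 60) for _ in range(4)]             # frames[-1] is the target
pools = [nn.AdaptiveAvgPool2d(s) for s in (8, 15, 30)]             # coarser for distant frames
context = torch.cat([p(f).flatten(2).transpose(1, 2) for p, f in zip(pools, frames[:-1])], dim=1)
target = frames[-1].flatten(2).transpose(1, 2)                      # (B, 3600, C)
enhanced = CrossFrameMining(C)(target, context)                      # (B, 3600, C)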

(Figure: overview of CFFM)

Authors: Guolei Sun, Yun Liu, Henghui Ding, Thomas Probst, Luc Van Gool.

CFFM++

Contextual information plays a core role in video semantic segmentation (VSS). This work summarizes the contexts for VSS into two types: local temporal contexts (LTC), defined by neighboring frames, and global temporal contexts (GTC), representing the contexts of the whole video. LTC includes static and motional contexts, corresponding to the static and moving content in neighboring frames, respectively. Both static and motional contexts have been studied previously, but there is no research on learning them simultaneously, even though they are highly complementary. Hence, we propose a Coarse-to-Fine Feature Mining (CFFM) technique to learn a unified representation of LTC. CFFM contains two parts: Coarse-to-Fine Feature Assembling (CFFA) and Cross-frame Feature Mining (CFM). CFFA abstracts static and motional contexts, and CFM mines useful information from nearby frames to enhance the target features. To exploit more temporal contexts, we further propose CFFM++, which additionally learns GTC from the whole video. Specifically, we uniformly sample certain frames from the video and extract global contextual prototypes by k-means. The information within those prototypes is mined by CFM to refine the target features.
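
As a rough illustration of how GTC could be obtained, below is a hedged sketch assuming a generic backbone that maps a frame to a feature map; it is not the code of this repository, and the function names and the sampling/prototype counts are placeholder values. Frames are uniformly sampled across the whole video, their feature tokens are clustered by k-means, and the resulting cluster centers serve as global contextual prototypes that CFM can attend to in the same way as local tokens.

import torch

def kmeans(tokens, num_prototypes=100, iters=10):
    """Plain k-means on (N, C) feature tokens; returns (num_prototypes, C) centers."""
    centers = tokens[torch.randperm(tokens.shape[0])[:num_prototypes]].clone()
    for _ in range(iters):
        assign = torch.cdist(tokens, centers).argmin(dim=1)   # nearest center per token
        for j in range(num_prototypes):
            members = tokens[assign == j]
            if members.numel() > 0:                           # keep old center if cluster is empty
                centers[j] = members.mean(dim=0)
    return centers

def global_prototypes(video_frames, backbone, num_samples=15, num_prototypes=100):
    """Uniformly sample frames from the whole video and cluster their features into
    global contextual prototypes (illustrative counts only)."""
    idx = torch.linspace(0, len(video_frames) - 1, num_samples).long()
    tokens = []
    with torch.no_grad():
        for i in idx:
            feat = backbone(video_frames[i].unsqueeze(0))     # (1, C, H, W) feature map
            tokens.append(feat.flatten(2).transpose(1, 2).reshape(-1, feat.shape[1]))
    return kmeans(torch.cat(tokens, dim=0), num_prototypes)   # (num_prototypes, C)

In the released pipeline, such prototypes are generated per video and stored under './cluster_centers/' (see the CFFM++ training instructions below); the sketch above only shows the clustering idea.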

(Figure: overview of CFFM++)

Authors: Guolei Sun, Yun Liu, Henghui Ding, Min Wu, Luc Van Gool.

Installation

Please follow the guidelines in MMSegmentation v0.13.0.

Other requirements: timm==0.3.0, CUDA 11.0, pytorch==1.7.1, torchvision==0.8.2, mmcv==1.3.0, opencv-python==4.5.2

Download this repository and install it with:

cd VSS-CFFM && pip install -e . --user

Usage: CFFM

Data preparation

Please follow VSPW to download the VSPW 480P dataset. After downloading it, the directory structure is as follows:

vspw-480
├── video1
│   ├── origin
│   │   └── *.jpg
│   └── mask
│       └── *.png

The dataset should be placed in /repo_path/data/vspw/, or you can use a symlink:

cd VSS-CFFM
mkdir -p data/vspw/
ln -s /dataset_path/VSPW_480p data/vspw/

Test

  1. Download the trained weights (CFFM) from here.
  2. Run the following commands:
# Multi-gpu testing
./tools/dist_test.sh local_configs/cffm/B1/cffm.b1.480x480.vspw2.160k.py /path/to/CFFM_checkpoint_file <GPU_NUM> \
--out /path/to/save_results/res.pkl

Training

  1. Download the ImageNet-1K pretrained weights ( google drive | onedrive ) provided by SegFormer, and put them in a folder pretrained/.

  2. Training requires 4 NVIDIA GPUs, each with more than 20 GB of GPU memory.

# Multi-gpu training
./tools/dist_train.sh local_configs/cffm/B1/cffm.b1.480x480.vspw2.160k.py 4 --work-dir model_path/vspw2/work_dirs_4g_b1

Usage: CFFM++

Data preparation

The same dataset as CFFM is used. Before trying CFFM++, please first try CFFM to get familiar with the procedures.

Test

  1. Download the trained weights and global contextual prototypes (cluster centers) from here. For each backbone, there are two files: a trained checkpoint and a prototype file containing the global contextual prototypes for all videos.
  2. Run the following commands:
# Multi-gpu testing, take MiT-B1 as an example 
unzip cluster_centers_b1_100.zip
./tools/dist_test.sh local_configs/cffm/B1/cffm.b1.480x480.vspw2_fine_w_proto.40k.py /path/to/CFFM++_checkpoint_file <GPU_NUM> \
--out /path/to/save_results/res.pkl

Training

To train CFFM++, you need a trained CFFM model ready, since CFFM++ is built upon CFFM.

  1. Generate the global contextual prototypes using the trained CFFM model (take MiT-B1 as an example):
./tools/dist_test.sh local_configs/cffm/B1/cffm.b1.480x480.vspw2_gene_prototype.py  /path/to/CFFM_checkpoint_file 4 --out /path/to/output/res.pkl --eval None

After running the above command, the per-video prototypes will be saved in './cluster_centers/'. You can also skip this step and directly use the prototypes generated by us, which can be downloaded from here.

  2. Fine-tune to obtain the CFFM++ model. This requires 4 NVIDIA GPUs, each with more than 20 GB of GPU memory. Take MiT-B1 as an example:
./tools/dist_train.sh local_configs/cffm/B1/cffm.b1.480x480.vspw2_fine_w_proto.40k.py 4 --load-from /path/to/CFFM_checkpoint_file  \
--work-dir model_path/vspw2/work_dirs_4g_b1_CFFM++

License

This project is for academic use only.

Acknowledgement

The code is heavily based on the following repositories:

Thanks for their amazing work.

Citation

If you use our materials, please consider citing:

@inproceedings{sun2022vss,
    title={Coarse-to-Fine Feature Mining for Video Semantic Segmentation},
    author={Sun, Guolei and Liu, Yun and Ding, Henghui and Probst, Thomas and Van Gool, Luc},
    booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    year={2022}
}

@article{sun2024learning,
    title={Learning Local and Global Temporal Contexts for Video Semantic Segmentation},
    author={Sun, Guolei and Liu, Yun and Ding, Henghui and Wu, Min and Van Gool, Luc},
    journal={IEEE Transactions on Pattern Analysis and Machine Intelligence},
    year={2024}
}

Contact