We introduce audio-visual class-incremental learning, a class-incremental learning scenario for audio-visual video recognition, and propose AV-CIL, a method for this setting. [paper]
Our experiments are conducted with Python 3.8.13 and PyTorch 1.13.0.
To set up the environment, simply run
pip install -r requirements.txt
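If you prefer an isolated environment, a minimal sketch of the setup could look like the following (the conda environment name av-cil is our own placeholder, not something the repository defines):
conda create -n av-cil python=3.8
conda activate av-cil
pip install -r requirements.txt  # installs the dependencies; our experiments use PyTorch 1.13.0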
The original AVE dataset can be downloaded through the provided link.
Please put the downloaded AVE videos in ./raw_data/AVE/videos/.
The original Kinetics dataset can be downloaded through the provided link. After downloading it, please apply our provided video id list (here) to extract the Kinetics-Sounds dataset used in our experiments.
Please put the downloaded videos in ./raw_data/kinetics-sounds/videos/.
The original VGGSound dataset can be downloaded through the provided link. After downloading it, please apply our provided video id list (here) to extract the VGGSound100 dataset used in our experiments.
Please put the downloaded videos in ./raw_data/VGGSound/videos/.
After downloading the datasets into the folders above, please run the following command to extract the audio and frames
sh extract_audios_frames.sh 'dataset'
where the 'dataset' should be in [AVE, ksounds, VGGSound_100].
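For example, with the AVE videos already placed in ./raw_data/AVE/videos/ as described above, the extraction command is
sh extract_audios_frames.sh AVE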
For the audio encoder, please download the pre-trained AudioMAE and put it in ./model/pretrained/.
To extract the pre-trained audio features, please run
sh extract_pretrained_features.sh 'dataset'
where the 'dataset' should be in [AVE, ksounds, VGGSound_100].
For the running environment of AudioMAE, we follow the official implementation and use timm==0.3.2, for which a fix is needed to work with PyTorch 1.8.1+.
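A commonly used fix (our own sketch, not a patch shipped with this repository) is to locate timm's layer helpers and replace the removed torch._six import with collections.abc:
python -c "import timm, os; print(os.path.join(os.path.dirname(timm.__file__), 'models/layers/helpers.py'))"
# in the printed helpers.py, change
#   from torch._six import container_abcs
# to
#   import collections.abc as container_abcs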
We have also released the pre-trained features, so you can use them directly instead of pre-processing and extracting them from the raw data: AVE, Kinetics-Sounds [part-1, part-2, part-3], VGGSound100 [part-1, part-2, part-3, part-4, part-5, part-6].
For Kinetics-Sounds and VGGSound100, please download all the parts and concatenate them before unzipping.
After obtaining the pre-trained audio and visual features, please put them in ./data/'dataset'/audio_pretrained_feature/ and ./data/'dataset'/visual_pretrained_feature/, respectively.
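As an example with hypothetical archive names (the actual file names depend on the released parts; here 'dataset' is ksounds), the Kinetics-Sounds parts could be joined, unzipped, and placed as follows:
cat ks_features.zip.part-1 ks_features.zip.part-2 ks_features.zip.part-3 > ks_features.zip
unzip ks_features.zip
mkdir -p ./data/ksounds/audio_pretrained_feature/ ./data/ksounds/visual_pretrained_feature/
# move the extracted audio and visual feature files into the two folders above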
For the vanilla fine-tuning strategy, please run
sh run_incremental_fine_tuning.sh 'dataset' 'modality'
where the 'dataset' should be in [AVE, ksounds, VGGSound_100], and the 'modality' should be in [audio, visual, audio-visual].
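For example, to run audio-visual fine-tuning on AVE:
sh run_incremental_fine_tuning.sh AVE audio-visual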
For the upper bound, please run
sh run_incremental_upper_bound.sh 'dataset' 'modality'
For LwF, please run
sh run_incremental_lwf.sh 'dataset' 'modality'
For iCaRL, please run
sh run_incremental_icarl.sh 'dataset' 'modality' 'classifier'
where the 'classifier' should be in [NME, FC].
For SS-IL, please run
sh run_incremental_ssil.sh 'dataset' 'modality'
For AFC, please run
sh run_incremental_afc.sh 'dataset' 'modality' 'classifier'
where the 'classifier' should be in [NME, LSC].
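For example, running AFC on Kinetics-Sounds with the audio-visual modality and the NME classifier would be:
sh run_incremental_afc.sh ksounds audio-visual NME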
For our AV-CIL, please run
sh run_incremental_ours.sh 'dataset'
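For example, to run AV-CIL on VGGSound100:
sh run_incremental_ours.sh VGGSound_100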
If you find this work useful, please consider citing our paper:
@inproceedings{pian2023audio,
title={Audio-Visual Class-Incremental Learning},
author={Pian, Weiguo and Mo, Shentong and Guo, Yunhui and Tian, Yapeng},
booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
pages={7799--7811},
year={2023}
}