This is the official page of "MedleyVox: An Evaluation Dataset for Multiple Singing Voices Separation" (ICASSP 2023).
The camera-ready version of the paper is available on arXiv. Please check the icons below.
MedleyVox is now available on Zenodo! Please check this website (https://zenodo.org/record/7984549).
Since we provide the metadata of MedleyVox in this code repository, you can easily build MedleyVox with our code and the existing MedleyDB v1 and v2 datasets. You have to manually check some directory parameters in the testset/testset_save code.
python -m testset.testset_save
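For reference, the script essentially slices the annotated segments out of the MedleyDB stems and saves them. The snippet below is an illustration only, with hypothetical paths and field names; the actual metadata parsing and mixing logic lives in testset/testset_save.

```python
# Illustrative only: slicing one segment from a MedleyDB stem given start/end times.
# The real paths and metadata fields are handled inside testset/testset_save.
import soundfile as sf

def cut_segment(stem_path, out_path, start_sec, end_sec):
    audio, sr = sf.read(stem_path)                           # load the full stem
    segment = audio[int(start_sec * sr):int(end_sec * sr)]   # keep the annotated region
    sf.write(out_path, segment, sr)                          # save it as a MedleyVox source
```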
Our code is heavily based on asteroid. You first have to install asteroid as a Python package:
pip install git+https://github.com/asteroid-team/asteroid
and then install the remaining packages in requirements.txt. The fairseq package is not needed for training, but you need it when you use the chunk-wise processing based on wav2vec representations, which is introduced in the last section of this page.
In the svs/preprocess folder, you can find a number of preprocessing scripts. For the preparation of training data, most of them are simple downsample-and-save steps. You can skip the preparation of validation data because we already provide JSON metadata files for validation.
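For orientation, the core of these training-data scripts boils down to something like the sketch below; the 24 kHz target sample rate here is only an assumption, so check the actual scripts in svs/preprocess for the rate and paths they use.

```python
# Minimal sketch of a downsample-and-save preprocessing step (assumed 24 kHz target).
import librosa
import soundfile as sf

def downsample_and_save(in_path, out_path, target_sr=24000):
    audio, _ = librosa.load(in_path, sr=target_sr, mono=True)  # resample while loading
    sf.write(out_path, audio, target_sr)                        # save the downsampled audio
```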
For the mixture construction strategy, we have a total of 5 arguments in svs/main.py covering 6 training input construction strategies (a minimal sampling sketch follows the list). Each of them is:
- sing_sing_ratio (float) : Case 1. Ratio of 'different singing + singing' in training data sampling process.
- sing_speech_ratio (float) : Case 2. Ratio of 'different singing + speech' in training data sampling process.
- same_song_ratio (float) : Case 3. Ratio of 'same song of different singers' in training data sampling process.
- same_singer_ratio (float) : Case 4. Ratio of 'different songs of same singer' in training data sampling process.
- same_speaker_ratio (float) : Case 5. Ratio of 'different speeches of same speaker' in training data sampling process.
- speech_speech_ratio (float) : Case 6. Ratio of 'different speech + speech' in training data sampling process. This one is not specified by an argument; it is automatically calculated as 1 - (sum of the other ratios).
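The ratios act as sampling probabilities. The sketch below is not the repository's sampler, and the default values are made up for illustration; it only shows how Case 6 takes whatever probability mass the five arguments leave over.

```python
# Hedged sketch of case sampling from the five ratio arguments (values are illustrative).
import random

def sample_case(sing_sing_ratio=0.4, sing_speech_ratio=0.2, same_song_ratio=0.1,
                same_singer_ratio=0.1, same_speaker_ratio=0.1):
    probs = {
        "different singing + singing": sing_sing_ratio,            # Case 1
        "different singing + speech": sing_speech_ratio,           # Case 2
        "same song of different singers": same_song_ratio,         # Case 3
        "different songs of same singer": same_singer_ratio,       # Case 4
        "different speeches of same speaker": same_speaker_ratio,  # Case 5
    }
    probs["different speech + speech"] = 1.0 - sum(probs.values())  # Case 6: the remainder
    cases, weights = zip(*probs.items())
    return random.choices(cases, weights=weights, k=1)[0]
```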
We first train the standard Conv-TasNet (for 200 epochs).
python -m svs.main --exp_name=your_exp_name --patience=50\
--use_wandb=True --mixture_consistency=mixture_consistency\
--train_loss_func pit_snr multi_spectral_l1
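The pit_snr term in --train_loss_func denotes a permutation-invariant SNR criterion. The snippet below is only a rough sketch of that idea for a generic n-source case, not the loss implementation used in this repository.

```python
# Rough sketch of a permutation-invariant negative-SNR loss (not the repository code).
import itertools
import torch

def neg_snr(est, ref, eps=1e-8):
    # Negative signal-to-noise ratio in dB for each item in the batch.
    noise = est - ref
    return -10 * torch.log10((ref.pow(2).sum(-1) + eps) / (noise.pow(2).sum(-1) + eps))

def pit_snr(est, ref):
    # est, ref: (batch, n_src, time). Evaluate every source permutation, keep the best one.
    n_src = est.shape[1]
    per_perm = []
    for perm in itertools.permutations(range(n_src)):
        losses = [neg_snr(est[:, i], ref[:, j]) for i, j in enumerate(perm)]
        per_perm.append(torch.stack(losses).mean(dim=0))    # average over sources
    return torch.stack(per_perm).min(dim=0).values.mean()   # best permutation, batch mean
```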
Then, we jointly train the pre-trained Conv-TasNet and the cascaded iSRNet (for 30 epochs, with the argument --reduced_training_data_ratio=0.1 for more frequent validation loss checking).
python -m svs.main --exp_name=your_exp_name_iSRNet\
--start_from_best=True --reduced_training_data_ratio=0.1\
--gradient_clip=5 --lr=3e-5 --batch_size=8 --above_freq=3000\
--epochs=230 --lr_decay_patience=6 --patience=15\
--use_wandb=True --mixture_consistency=sfsrnet --srnet=convnext\
--sr_input_res=False --train_loss_func pit_snr multi_spectral_l1 snr\
--continual_train=True --resume=/path/to/your_exp_name
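Both stages pass a --mixture_consistency option (plain mixture_consistency in the first command, sfsrnet here). Assuming the plain setting corresponds to the standard mixture-consistency projection, which redistributes the residual between the input mixture and the sum of the estimates, a minimal sketch looks like this:

```python
# Hedged sketch of a standard mixture-consistency projection (assumed behavior of the
# plain "mixture_consistency" option; the "sfsrnet" variant is not shown here).
import torch

def project_mixture_consistency(est, mix):
    # est: (batch, n_src, time) separated estimates, mix: (batch, time) input mixture.
    residual = mix - est.sum(dim=1)                    # part of the mixture left unexplained
    return est + residual.unsqueeze(1) / est.shape[1]  # spread it equally over the sources
```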
Similar to the duet and unison separation model, we first train the standard Conv-TasNet (for 200 epochs). You have to set a different --dataset argument.
python -m svs.main --exp_name=your_exp_name --patience=50\
--use_wandb=True --mixture_consistency=mixture_consistency\
--train_loss_func pit_snr multi_spectral_l1\
--dataset=multi_singing_librispeech
After that, also similar to the duet and unison separation model, we jointly train the pre-trained Conv-TasNet and the cascaded iSRNet (for 30 epochs, with the argument --reduced_training_data_ratio=0.1 for more frequent validation loss checking).
python -m svs.main --exp_name=your_exp_name_iSRNet\
--start_from_best=True --reduced_training_data_ratio=0.1\
--gradient_clip=5 --lr=3e-5 --batch_size=8 --above_freq=3000\
--epochs=230 --lr_decay_patience=6 --patience=15\
--use_wandb=True --mixture_consistency=sfsrnet --srnet=convnext\
--sr_input_res=False --train_loss_func pit_snr multi_spectral_l1 snr\
--continual_train=True --resume=/path/to/your_exp_name\
--dataset=multi_singing_librispeech
We use a total of 13 different singing datasets (about 400 hours) and 460 hours of LibriSpeech data for training.
Dataset | Labels (same song (segment) of different singers) | Labels (different songs of same singer) | Length [hours] | Notes |
---|---|---|---|---|
Children’s song dataset (CSD) | _ | ✓ | 4.9 | _ |
NUS | _ | ✓ | 1.9 | _ |
TONAS | _ | _ | 0.3 | _ |
VocalSet | _ | ✓ | 8.8 | _ |
Jsut-song | _ | ✓ | 0.4 | _ |
Jvs_music | _ | ✓ | 2.3 | _ |
Tohoku Kiritan | _ | ✓ | 1.1 | _ |
vocadito | _ | _ | 0.2 | _ |
Musdb-hq (train subset) | _ | ✓ | 2.0 | Single-singing regions were extracted using the annotations of the musdb-lyrics extension |
OpenSinger | _ | ✓ | 51.9 | _ |
MedleyDB v1 | _ | _ | 3.8 | For training, we only used the songs that are included in the musdb18 dataset.
K_multisinger | ✓ | ✓ | 169.6 | _ |
K_multitimbre | ✓ | ✓ | 150.8 | _ |
LibriSpeech_train-clean-360 | _ | ✓ | 360 | _ |
LibriSpeech_train-clean-100 | _ | ✓ | 100 | _ |
We use musdb-hq (test subset) and LibriSpeech_dev-clean as validation data.
Case | Description | Notes |
---|---|---|
1) | Different singing + singing | — |
2) | One singing + its unison | — |
3) | Different songs of same singer | — |
4) | Different speech + speech | — |
5) | One speech + its unison | — |
6) | Different speeches of same speaker | — |
7) | Different speech + singing | — |
Currently, we have no plan to upload the pre-trained weights of our models.
To evaluate a trained model, run, for example:
python -m svs.test --singing_task=duet --exp_name=your_exp_name
To separate every audio file (.mp3, .flac, .wav) in --inference_data_dir, run:
python -m svs.inference --exp_name=your_exp_name\
--model_dir=/path/where/your/checkpoint/is\
--inference_data_dir=/path/where/the/input/data/is\
--results_save_dir=/path/to/save/output
If the input is too long, inference may be impossible due to lack of VRAM, or performance may be degraded. In that case, use --use_overlapadd. Among the --use_overlapadd options, "ola", "ola_norm", and "w2v" all work similarly to LambdaOverlapAdd in asteroid (a rough sketch of the chunk-alignment idea is given after the list).
- ola: Same as LambdaOverlapAdd in asteroid.
- ola_norm: LambdaOverlapAdd with chunk-wise loudness normalization applied to the input (we used loudness normalization in the training stage). The effect was not good.
- w2v: Computes the singer assignment in the overlapped region between chunks using the wav2vec2.0-XLSR model. The LambdaOverlapAdd implemented in asteroid simply uses the L1 distance in the waveform domain; here it is replaced with the cosine similarity of wav2vec features. You first have to install fairseq and download the weights of the wav2vec2.0-XLSR model.
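To make the chunk-alignment idea concrete, here is a rough sketch (not asteroid's implementation) of choosing the source order of a new chunk so that its overlapped region best matches the previous chunk; "ola" compares waveforms with an L1 distance as below, while "w2v" would instead compare the cosine similarity of wav2vec 2.0 features.

```python
# Sketch of permutation alignment between neighboring chunks via their overlapped region.
import itertools
import numpy as np

def align_chunk(prev_overlap, curr_overlap):
    # prev_overlap, curr_overlap: (n_src, overlap_len) estimates in the shared region.
    n_src = curr_overlap.shape[0]
    best_perm, best_cost = None, np.inf
    for perm in itertools.permutations(range(n_src)):
        cost = sum(np.abs(prev_overlap[i] - curr_overlap[p]).mean() for i, p in enumerate(perm))
        if cost < best_cost:
            best_perm, best_cost = perm, cost
    return best_perm  # reorder the current chunk's sources with this permutation
```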
In our paper, we analyzed several failure cases that standard OLA methods cannot handle. To this end, we implemented some useful inference methods for chunk-wise processing based on voice activity detection (VAD).
- w2v_chunk: First divides the input into chunks using VAD, then processes each chunk separately. Unlike asteroid's LambdaOverlapAdd, neighboring chunks do not overlap, so the assignment cannot be computed as an L1 distance in the waveform domain; instead, the similarity is computed in the feature domain by continuously accumulating the w2v features of each chunk.
- sf_chunk: The principle is the same as w2v_chunk, but instead of w2v features, it uses spectral features such as MFCCs or the spectral centroid.
--vad_method can be set to either the spectrogram-energy-based method (spec) or the py-webrtcvad-based method (webrtc); a rough sketch of the energy-based chunking idea follows.
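As a rough illustration of the spectrogram-energy-based idea behind --vad_method=spec (this is not the repository's VAD code, and the frame sizes and threshold are arbitrary), low-energy frames can be treated as silence and the remaining active regions become the chunks to separate:

```python
# Hedged sketch of energy-based voice activity chunking (illustrative parameters only).
import numpy as np

def energy_vad_chunks(audio, frame_len=1024, hop=512, threshold_db=-40.0):
    frames = np.lib.stride_tricks.sliding_window_view(audio, frame_len)[::hop]
    energy_db = 10 * np.log10(np.mean(frames ** 2, axis=1) + 1e-10)
    active = energy_db > threshold_db
    chunks, start = [], None
    for i, is_active in enumerate(active):
        if is_active and start is None:
            start = i * hop                              # a voiced region begins
        elif not is_active and start is not None:
            chunks.append((start, i * hop + frame_len))  # a voiced region ends
            start = None
    if start is not None:
        chunks.append((start, len(audio)))
    return chunks  # list of (start_sample, end_sample) regions to separate chunk-wise
```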