The MusicYOLO framework uses the YOLOX object detection model to locate notes in the spectrogram. Its performance on the ISMIR2014, MIR-ST500, and SSVD datasets shows that MusicYOLO significantly improves onset/offset detection compared with previous approaches.
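To make the idea of "locating notes in the spectrogram" concrete, here is a minimal sketch of how a detected box could be turned into an onset/offset pair. It assumes the spectrogram image spans the whole clip and that time maps linearly onto the image width; the function name `box_to_note` and its parameters are hypothetical and are not taken from the MusicYOLO code.

```python
def box_to_note(x_min, x_max, image_width, clip_duration):
    """Map the horizontal extent of a detected box to (onset, offset) in seconds.

    Assumption: the image covers the full clip and time is linear in x.
    """
    if image_width <= 0:
        raise ValueError("image_width must be positive")
    sec_per_px = clip_duration / image_width
    return x_min * sec_per_px, x_max * sec_per_px


if __name__ == "__main__":
    # A box spanning pixels 256..512 of a 1024 px wide image of a 32 s clip.
    print(box_to_note(256, 512, 1024, 32.0))
```

The left edge of the box gives the onset and the right edge gives the offset, which is why a detector trained on note boxes yields both at once.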
Step1. Install PyTorch.
conda install pytorch==1.8.0 torchvision==0.9.0 torchaudio==0.8.0 cudatoolkit=10.2 -c pytorch
Step2. Install YOLOX.
git clone [email protected]:xk-wang/MusicYOLO.git
cd MusicYOLO
pip3 install -U pip && pip3 install -r requirements.txt
pip3 install -v -e . # or python3 develop
Step3. Install apex.
# Skip this step if you don't plan to train the model.
cd apex
pip3 install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" .
Step4. Install pycocotools.
pip3 install cython;
cd cocoapi/PythonAPI && pip3 install -v .
Download the pretrained musicyolo1 and musicyolo2 models described in our paper and put both under the models folder. The models are hosted on BaiduYun (code: 1234).
Step1. Download SSVD-v2.0 from
Step2. Onset/offset detection (use musicyolo2.pth)
python3 tools/ -f exps/example/custom/ -c models/musicyolo2.pth --audiodir $SSVD_TEST_SET_PATH --savedir $SAVE_PATH --ext .flac --device gpu
Step3. Evaluate
python3 tools/ --label $SSVD_TEST_SET_PATH --result $SAVE_PATH --offset
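For readers unfamiliar with onset evaluation, the sketch below shows one common way to score detected onsets against labels: an estimated onset counts as correct if it lies within a tolerance of an unmatched reference onset. This is only an illustration. The 50 ms tolerance, the greedy matching, and the name `onset_f1` are assumptions; the evaluation script in this repository may match notes differently.

```python
def onset_f1(ref_onsets, est_onsets, tol=0.05):
    """Return (precision, recall, f1) for onsets matched within tol seconds.

    Assumption: a simple greedy one-to-one matching over sorted onsets.
    """
    ref = sorted(ref_onsets)
    est = sorted(est_onsets)
    i = j = matched = 0
    while i < len(ref) and j < len(est):
        if abs(ref[i] - est[j]) <= tol:
            matched += 1  # each onset is used at most once
            i += 1
            j += 1
        elif est[j] < ref[i]:
            j += 1  # estimate too early to match this or any later reference
        else:
            i += 1  # reference missed by the detector
    precision = matched / len(est) if est else 0.0
    recall = matched / len(ref) if ref else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    print(onset_f1([0.5, 1.0, 2.0], [0.52, 1.2, 2.01]))
```

Offset scoring follows the same pattern with an additional check on the note end time.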
The process for the ISMIR2014 dataset is similar.
Because the MIR-ST500 recordings mix vocals with accompaniment, we first separate the vocals with spleeter. In addition, each MIR-ST500 recording is too long to process at once, so we cut it into short clips of about 35 s before onset/offset detection.
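As a rough illustration of the cutting step, the sketch below computes clip boundaries for a recording of a given length. It assumes plain fixed-length cutting with a shorter final clip; the split tool in this repository may choose cut points differently (for example at silences), and the name `segment_bounds` is hypothetical.

```python
def segment_bounds(total_sec, clip_sec=35.0):
    """Return (start, end) times in seconds that tile a recording into clips.

    Assumption: fixed-length clips of clip_sec, with a shorter last clip.
    """
    if clip_sec <= 0:
        raise ValueError("clip_sec must be positive")
    bounds = []
    start = 0.0
    while start < total_sec:
        end = min(start + clip_sec, total_sec)
        bounds.append((start, end))
        start = end
    return bounds


if __name__ == "__main__":
    for start, end in segment_bounds(100.0):
        print(f"{start:6.1f} s -> {end:6.1f} s")
```

Keeping the start time of each clip is what makes it possible to map per-clip detections back onto the original timeline later.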
Step1. Audio source separation
python3 tools/util/ $MIR_ST500_DIR
Step2. Split audio
python3 tools/util/ --mst_path $MST_TEST_VOCAL_PATH --dest_dir $SPLIT_PATH
Step3. Onset/offset detection (use musicyolo1.pth)
python3 tools/ -f exps/example/custom/ -c models/musicyolo1.pth --audiodir $SPLIT_PATH --savedir $SAVE_PATH --ext .wav --device gpu
Step4. Merge results
Because we split the MIR-ST500 test set audio earlier, the detection results are also split per clip. Here we merge the per-clip results.
python3 tools/util/ --audio_dir $SPLIT_PATH --origin_dir $SAVE_PATH --final_dir $MERGE_PATH
Step5. Evaluate
python3 tools/ --label $MIR_ST500_TEST_LABEL_PATH --result $MERGE_PATH --offset
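The merge in Step4 can be pictured as shifting each clip's notes by that clip's start time and concatenating them. The sketch below works on in-memory data under that assumption; the function name `merge_clip_notes` and the input layout are hypothetical, and the file formats used by the actual merge tool are not shown here.

```python
def merge_clip_notes(clips):
    """Merge per-clip notes into one list on the original timeline.

    clips: iterable of (clip_start_sec, [(onset_sec, offset_sec), ...]),
    where note times are relative to the start of their clip (assumption).
    """
    merged = []
    for clip_start, notes in sorted(clips, key=lambda clip: clip[0]):
        for onset, offset in notes:
            merged.append((clip_start + onset, clip_start + offset))
    return merged


if __name__ == "__main__":
    clips = [
        (35.0, [(1.0, 2.0)]),   # second clip, listed out of order on purpose
        (0.0, [(0.5, 1.5)]),    # first clip
    ]
    print(merge_clip_notes(clips))
```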
Download the yolox-s weights from . Put the weights file under the models folder.
Step1. Get the SSVD training set
Download SSVD-v2.0 from . Put the images folder under the datasets folder.
Step2. Train
python3 tools/ -f exps/example/custom/ -d 1 -b 16 --fp16 -o -c models/yolox_s.pth
Because the SSVD training set contains only a few audio clips, we annotate its note objects manually with Labelme. The MIR-ST500 training set is much larger, so we designed a set of automatic annotation tools for it.
Step1. Audio source separation
python3 tools/util/ $MIR_ST500_TRAIN_DIR
Step2. Split audio
python3 tools/util/ --mst_path $MIR_ST500_TRAIN_DIR --dest_dir $TRAIN_SPLIT_PATH
Step3. Automatic annotation
python3 tools/util/ --audiodir $TRAIN_SPLIT_PATH --imgdir $MST_NOTE_PATH
Step4. Divide the dataset
Split the data into a training set and a validation set yourself. We shuffle the images and split them 7:3 into training and validation sets. The images and annotations are placed under the $YOU_MIR_ST500_IMAGES folder.
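No script is provided for this step, so here is a minimal sketch of a shuffled 7:3 split over a list of image names. The function name `split_train_valid`, the fixed seed, and the example file names are assumptions for illustration; copying each image together with its annotation file into the train and valid folders is left to the reader.

```python
import random


def split_train_valid(names, train_ratio=0.7, seed=0):
    """Shuffle names reproducibly and split them into (train, valid) lists."""
    names = sorted(names)            # fixed starting order for reproducibility
    rng = random.Random(seed)        # local RNG, does not touch global state
    rng.shuffle(names)
    cut = round(len(names) * train_ratio)
    return names[:cut], names[cut:]


if __name__ == "__main__":
    images = [f"img_{i:03d}.png" for i in range(10)]  # hypothetical file names
    train, valid = split_train_valid(images)
    print(len(train), "training images,", len(valid), "validation images")
```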
Step5. COCO dataset format
The MIR-ST500 note object detection dataset is organized in a format similar to the images folder of the SSVD-v2.0 dataset.
python3 tools/util/ --annotationpath $YOU_MIR_ST500_IMAGES/train --jsonpath $IMAGE_DIR/train/_annotations.coco.json
python3 tools/util/ --annotationpath $YOU_MIR_ST500_IMAGES/valid --jsonpath $IMAGE_DIR/valid/_annotations.coco.json
Then put the MIR-ST500 note object detection dataset under the datasets folder, as with SSVD.
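For orientation, the sketch below builds the general shape of a COCO detection file such as `_annotations.coco.json`: an `images` list, an `annotations` list with `[x, y, width, height]` boxes, and a `categories` list. The input layout, the function name `to_coco`, and the single category name "note" are assumptions; the conversion tool above reads the repository's own annotation files, which are not reproduced here.

```python
import json


def to_coco(samples, category_name="note"):
    """Build a COCO-style dict from (file_name, width, height, boxes) tuples.

    boxes holds corner coordinates (x1, y1, x2, y2); COCO stores boxes as
    [x, y, width, height], so the corners are converted here.
    """
    images, annotations = [], []
    ann_id = 1
    for image_id, (file_name, width, height, boxes) in enumerate(samples, start=1):
        images.append(
            {"id": image_id, "file_name": file_name, "width": width, "height": height}
        )
        for x1, y1, x2, y2 in boxes:
            w, h = x2 - x1, y2 - y1
            annotations.append(
                {
                    "id": ann_id,
                    "image_id": image_id,
                    "category_id": 1,
                    "bbox": [x1, y1, w, h],
                    "area": w * h,
                    "iscrowd": 0,
                }
            )
            ann_id += 1
    return {
        "images": images,
        "annotations": annotations,
        "categories": [{"id": 1, "name": category_name}],
    }


if __name__ == "__main__":
    sample = [("clip_000.png", 640, 480, [(10, 20, 50, 80)])]  # hypothetical data
    print(json.dumps(to_coco(sample), indent=2))
```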
The training process is similar to training on the SSVD dataset.
title={YOLOX: Exceeding YOLO Series in 2021},
author={Ge, Zheng and Liu, Songtao and Wang, Feng and Li, Zeming and Sun, Jian},
journal={arXiv preprint arXiv:2107.08430},
author={X. Wang, W. Xu, W. Yang and W. Cheng},
booktitle={IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},