This is a recipe for data based on YFCC100m [1] used in our ICLR 2021 paper [2] on audio-visual on-screen sound separation.
Specifically, we provide CSVs that describe the exact videos and timestamps used for labeled and unlabeled train, validation, and test clips, as well as specification of pairs of clips used to create mixture of mixtures (MoM) validation and test sets. The same train/validation/test splits are used as in these lists. All clips referenced by these lists have been filtered by an unsupervised audio-visual coincidence model trained on AudioSet [3] using a threshold of 0.8 on the predicted coincidence probability.
Split | Label | Count | CSV name |
---|---|---|---|
Train | None | 324970 | filtered_train_clips.csv |
Train | On-screen only | 836 | filtered_train_onscreen_unanimous_clips.csv |
Train | Off-screen only | 3672* | filtered_train_offscreen_unanimous_clips.csv |
Validation | None | 6429 | filtered_validate_clips.csv |
Validation | On-screen only | 735 | filtered_validate_onscreen_unanimous_clips.csv |
Validation | Off-screen only | 836 | filtered_validate_offscreen_unanimous_clips.csv |
Test | None | 3293 | filtered_test_clips.csv |
Test | On-screen only | 295 | filtered_test_onscreen_unanimous_clips.csv |
Test | Off-screen only | 370 | filtered_test_offscreen_unanimous_clips.csv |
* The paper [2] gives the incorrect count of 3681.
Addionally, we provide lists of the pairs of clips used to create MoMs for validation and test, where the MoM video uses video frames from the first clip, and a soundtrack that is the sum of the audio from both clips.
Split | Label | Count | CSV name |
---|---|---|---|
Validation | On-screen + off-screen | 3675 | filtered_validate_onscreen_unanimous_plus_offscreen_unanimous_mom_clips.csv |
Validation | Off-screen + off-screen | 4180 | filtered_validate_offscreen_unanimous_plus_offscreen_unanimous_mom_clips.csv |
Test | On-screen + off-screen | 1475 | filtered_test_onscreen_unanimous_plus_offscreen_unanimous_mom_clips.csv |
Test | Off-screen + off-screen | 1850 | filtered_test_offscreen_unanimous_plus_offscreen_unanimous_mom_clips.csv |
The CSVs are hosted on Google Cloud. They can be downloaded using the following command:
gsutil -m cp -r gs://gresearch/sound_separation/audioscope_yfcc100m_clip_lists .
which will copy the CSVs to the current folder.
The train, validation, and test CSVs contain the following columns:
- Video path: string, path to MP4 video in YFCC100m. Here is an example of a video path:
data/videos/mp4/827/f6b/827f6b53467db2d5218ed8247418c4c.mp4
- Input start time: float, clip start time in seconds.
- Input end time: float, clip end time in seconds.
The validation and test MoM CSVs contain the following columns, with each row describing two clips that are used to construct the MoM:
- Video path 1
- Input start time 1
- Input end time 1
- Video path 2
- Input start time 2
- Input end time 2
These lists are released under a Creative Commons license (CC-BY 4.0).