A dataset for automatic sample identification. Created in 2011 for the research described in [1] and [2].
This dataset contains 105 sample relations (ids starting with S, in samples.csv) between 76 songs that make use of one or more samples and 68 songs that were sampled (ids starting with T, in tracks.csv).
The dataset contains only metadata: track titles and a few additional annotations. Contact me if you would like to use the audio or specific features.
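The metadata can be read with standard CSV tooling. A minimal loading sketch in Python (pandas); the file names are those distributed with the dataset, but no particular column layout is assumed here:

```python
# Minimal sketch: load the two metadata files and inspect what they contain.
import pandas as pd

samples = pd.read_csv("samples.csv")   # 105 sample relations, ids starting with S
tracks = pd.read_csv("tracks.csv")     # track metadata, ids starting with T

print(f"{len(samples)} sample relations, {len(tracks)} tracks")
print(tracks.columns.tolist())         # see which annotations are available
```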
The dataset is intended for evaluation following a standard retrieval paradigm with query and candidate files. The 76 tracks that contain samples are used as queries; the 68 sampled songs are used as candidates, optionally together with additional 'noise' files.
In [1] and [2], 320 'noise' files similar to the candidates in genre and length were added to challenge the system.
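As a sketch of how such a retrieval setup could be scored, the snippet below computes mean average precision over per-query ranked candidate lists. The function and variable names are illustrative only and not part of the dataset; the ground-truth mapping from query ids to relevant candidate ids has to be built from samples.csv first.

```python
# Mean average precision for a query -> ranked-candidates retrieval setup.
from typing import Dict, List, Set

def mean_average_precision(rankings: Dict[str, List[str]],
                           ground_truth: Dict[str, Set[str]]) -> float:
    """Average precision per query, averaged over queries with relevant candidates."""
    ap_values = []
    for query, ranked_candidates in rankings.items():
        relevant = ground_truth.get(query, set())
        if not relevant:
            continue
        hits = 0
        precision_sum = 0.0
        for rank, candidate in enumerate(ranked_candidates, start=1):
            if candidate in relevant:
                hits += 1
                precision_sum += hits / rank
        ap_values.append(precision_sum / len(relevant))
    return sum(ap_values) / len(ap_values) if ap_values else 0.0

# Illustrative example using the dataset's T-id scheme:
rankings = {"T178": ["T177", "T042", "T003"]}   # hypothetical system output
ground_truth = {"T178": {"T177"}}               # T178 samples T177
print(mean_average_precision(rankings, ground_truth))  # 1.0
```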
Only samples used in hip hop music were considered. Regarding sample origins, there were no genre restrictions.
For representativeness, the ground truth was chosen to include both short and long samples, tonal and percussive samples, and isolated samples (the only layer in the mix) as well as background samples. So-called ‘interpolations’, i.e. samples that have been re-recorded in the studio, were avoided, as were non-musical samples (e.g. film dialogue).
The dataset was compiled using valuable information from WhoSampled and Hip Hop is Read.
Example identifiers and file names:
- S102 (T177 sampled by T178)
- T027.wav
- T078.wav
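For completeness, a sketch of turning the sample relations into the query-to-candidates mapping used in the evaluation sketch above. The column names below are assumptions for illustration only; check the actual header of samples.csv before use.

```python
# Build ground_truth[query_track] = {sampled tracks} from samples.csv.
import csv
from collections import defaultdict

ground_truth = defaultdict(set)
with open("samples.csv", newline="") as f:
    for row in csv.DictReader(f):
        # Assumed columns: 'sampling_track' (the T-id of the song that uses the
        # sample) and 'original_track' (the T-id of the sampled song).
        ground_truth[row["sampling_track"]].add(row["original_track"])
```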
Please cite one of the following when using this dataset.
[1] Van Balen, J. (2011). Automatic Recognition of Samples in Musical Audio. Master's thesis, Universitat Pompeu Fabra, Barcelona, Spain.
[2] Van Balen, J., Serrà, J., & Haro, M. (2012). Automatic Identification of Samples in Hip Hop Music. In International Symposium on Computer Music Modeling and Retrieval (CMMR). London, United Kingdom.