Duplicates in RadImageNet dataset #17
Hi,
First of all, thanks for publishing the RadImageNet dataset!
While working with it, I discovered quite a few duplicate entries when checking the MD5 hashes of the files. For example:
1. CT/lung/interstitial_lung_disease/lung009382.png and CT/lung/Nodule/lung009382.png (note: same filename)
2. MR/af/Plantar_plate_tear/foot040499.png and MR/af/plantar_fascia_pathology/ankle027288.png (note: different filename)
3. MR/af/hematoma/foot079779.png and MR/af/hematoma/ankle053088.png
4. US/gb/usn309850.png and US/gb/usn309851.png
5. US/ovary/usn326815.png and US/kidney/usn348701.png
So far, I haven't checked whether any duplicates cross the dataset split you used, but since you write in your paper that you split patient-wise, this shouldn't be the case.
However, the following questions arise:
1. From my understanding of the paper, this dataset is intended as a single-label dataset, not a multi-label one, so I am confused to find samples like the first case, where the same image appears under two different labels. Can the dataset instead be considered a multi-label dataset in which each of the 165 pathologies is labeled in every image where it is present?
2. For cases 2–4, the duplicates merely create an imbalance without providing additional information. Are you planning to remove them?
In total, this results in:
Number of duplicate groups: 62751
Total duplicate files: 126074
I attached a duplicates.json with all the duplicates found. It's a dictionary where each key is an MD5 hash and its value is the list of image paths sharing that hash.
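For illustration, an entry in duplicates.json would look like this (the hash key here is only a placeholder; the paths are taken from case 3 above):

```json
{
  "<md5-hash>": [
    "MR/af/hematoma/foot079779.png",
    "MR/af/hematoma/ankle053088.png"
  ]
}
```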
For reproducibility, here is the script I wrote to detect the duplicates.
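(The script attachment isn't reproduced here; the following is only a minimal sketch of the approach described above: hash every PNG under the dataset root, group paths by MD5 digest, and keep groups with more than one file. The dataset root path and the output filename are assumptions.)

```python
import hashlib
import json
from collections import defaultdict
from pathlib import Path

# Assumption: the extracted RadImageNet dataset lives under this directory.
DATASET_ROOT = Path("radimagenet")


def md5_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the MD5 hex digest of a file, read in 1 MiB chunks."""
    h = hashlib.md5()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_duplicates(root: Path) -> dict:
    """Map each MD5 digest to the dataset-relative paths of all PNGs
    sharing it, keeping only digests that occur more than once."""
    groups = defaultdict(list)
    for png in sorted(root.rglob("*.png")):
        groups[md5_of_file(png)].append(str(png.relative_to(root)))
    return {h: paths for h, paths in groups.items() if len(paths) > 1}


if __name__ == "__main__":
    duplicates = find_duplicates(DATASET_ROOT)
    print(f"Number of duplicate groups: {len(duplicates)}")
    print(f"Total duplicate files: {sum(len(p) for p in duplicates.values())}")
    with open("duplicates.json", "w") as f:
        json.dump(duplicates, f, indent=2)
```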
Furthermore, there are quite a few samples which are just empty: [example image of an empty sample attached]