Duplicates in RadImageNet dataset #17

Open
StefanDenn3r opened this issue Dec 18, 2023 · 1 comment

StefanDenn3r commented Dec 18, 2023

Hi,
first of all, thanks for publishing the RadImageNet dataset!
While working with it, I found quite a few duplicate files by comparing the MD5 hashes of the images. The duplicates fall into the following cases:

  1. Different pathology (i.e. different folder). This would then essentially be a multi-label setting, e.g.
     • CT/lung/interstitial_lung_disease/lung009382.png and CT/lung/Nodule/lung009382.png (Note: same filename)
     • MR/af/Plantar_plate_tear/foot040499.png and MR/af/plantar_fascia_pathology/ankle027288.png (Note: different filename)
  2. Same pathology, e.g. MR/af/hematoma/foot079779.png and MR/af/hematoma/ankle053088.png
  3. Neighboring samples, e.g. US/gb/usn309850.png and US/gb/usn309851.png
  4. Others, e.g. US/ovary/usn326815.png and US/kidney/usn348701.png
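
The four cases can probably be told apart from the paths alone. The following is only a sketch: the function names and the rules used (same folder, consecutive file numbers, shared modality/anatomy prefix) are assumptions about how the cases could be defined, not something taken from the dataset documentation.

```python
import re
from pathlib import PurePosixPath
from typing import List, Optional, Tuple


def split_stem(path: str) -> Tuple[str, Optional[int]]:
    """Split e.g. 'US/gb/usn309850.png' into ('usn', 309850)."""
    stem = PurePosixPath(path).stem
    match = re.fullmatch(r"(.*?)(\d+)", stem)
    if match is None:
        return stem, None  # no trailing number in the filename
    return match.group(1), int(match.group(2))


def classify_group(paths: List[str]) -> str:
    """Assign one duplicate group (paths sharing an MD5 hash) to a case.

    Assumed rules, mirroring the list above:
      - same folder, same filename prefix, consecutive numbers -> "neighboring"
      - same folder otherwise                                   -> "same_pathology"
      - same modality/anatomy, different folder                 -> "different_pathology"
      - anything else                                           -> "other"
    """
    parents = {PurePosixPath(p).parent for p in paths}
    if len(parents) == 1:
        stems = [split_stem(p) for p in paths]
        prefixes = {prefix for prefix, _ in stems}
        numbers = sorted(n for _, n in stems if n is not None)
        consecutive = (
            len(prefixes) == 1
            and len(numbers) == len(paths)
            and numbers[-1] - numbers[0] == len(numbers) - 1
        )
        return "neighboring" if consecutive else "same_pathology"

    # First two path components, e.g. ("CT", "lung") or ("US", "ovary").
    anatomies = {PurePosixPath(p).parts[:2] for p in paths}
    if len(anatomies) == 1:
        return "different_pathology"
    return "other"
```

Applied to every value of the duplicates dictionary, this gives a rough count per case; groups that mix several cases end up in whichever rule matches first.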

So far, I haven't checked whether any duplicates cross the dataset splits you used, but since your paper states that the data was split patient-wise, this shouldn't be the case.
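
For anyone who wants to run that check, a minimal sketch follows. It assumes a `path_to_split` mapping from relative image path to split name; that mapping and the function name are hypothetical and would have to be built from the official split files, whose format is not described here.

```python
from typing import Dict, List


def groups_crossing_splits(
    duplicates: Dict[str, List[str]],
    path_to_split: Dict[str, str],
) -> Dict[str, Dict[str, List[str]]]:
    """Return the duplicate groups whose files fall into more than one split.

    duplicates:    MD5 hash -> list of relative image paths with that hash.
    path_to_split: relative image path -> split name (hypothetical input).
    """
    crossing: Dict[str, Dict[str, List[str]]] = {}
    for md5_hash, paths in duplicates.items():
        by_split: Dict[str, List[str]] = {}
        for path in paths:
            # Paths missing from the mapping are collected under "unassigned".
            split = path_to_split.get(path, "unassigned")
            by_split.setdefault(split, []).append(path)
        assigned_splits = [s for s in by_split if s != "unassigned"]
        if len(assigned_splits) > 1:
            crossing[md5_hash] = by_split
    return crossing
```

An empty result would confirm that no identical file appears in two different splits; a non-empty one lists the affected files per split.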

However, the following questions arise:

  1. From my understanding of the paper, this dataset is intended as a single-label dataset, not a multi-label one, so I am confused to find samples like those in the first case. Can the dataset be considered a multi-label dataset in which each of the 165 pathologies is labeled in every image where it is present?

  2. For cases 2-4, the duplicates only create an imbalance and don't provide additional information. Are you planning to remove them?

In total, the duplicate check yields:
Number of duplicate groups: 62751
Total duplicate files: 126074

I attached a duplicates.json with all the duplicates found.
It's a dictionary where each key is an MD5 hash and its value is a list of image paths with that hash.

For reproducibility, here is the script I wrote to detect the duplicates.
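
Given that structure, one possible way to act on the duplicates is to keep a single file per hash and record every folder the image appeared in. This is only a sketch: the function name, the choice of the lexicographically smallest path as the file to keep, and the use of the parent folder as the label are all assumptions, and whether merged labels are valid depends on the answer to question 1.

```python
from pathlib import PurePosixPath
from typing import Dict, List, Tuple


def collapse_duplicates(
    duplicates: Dict[str, List[str]],
) -> Dict[str, Tuple[str, List[str]]]:
    """Map each MD5 hash to (path to keep, sorted list of folder labels).

    The folder containing a file is treated as its label (an assumption),
    so a group spread over several folders ends up with several labels.
    """
    collapsed: Dict[str, Tuple[str, List[str]]] = {}
    for md5_hash, paths in duplicates.items():
        keep = min(paths)  # arbitrary but deterministic choice
        labels = sorted({str(PurePosixPath(p).parent) for p in paths})
        collapsed[md5_hash] = (keep, labels)
    return collapsed


def count_removable(duplicates: Dict[str, List[str]]) -> int:
    """Number of files dropped when one file per group is kept."""
    return sum(len(paths) - 1 for paths in duplicates.values())
```

With the totals reported above (62751 groups, 126074 files), keeping one file per group would remove 126074 - 62751 = 63323 files.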

import hashlib
from pathlib import Path
from typing import Dict, List, Tuple
from tqdm import tqdm
import json
import argparse


def process_image_md5(image_path: Path) -> Tuple[Path, str]:
    """
    Generate MD5 hash for the given image.

    Parameters:
    image_path (Path): The path to the image file.

    Returns:
    tuple: A tuple containing the image path and its MD5 hash (hex string).
    """
    hash_md5 = hashlib.md5()
    # Read in 4 KiB chunks so a file is never loaded into memory all at once.
    with open(image_path, "rb") as f:
        for chunk in iter(lambda: f.read(4096), b""):
            hash_md5.update(chunk)
    return image_path, hash_md5.hexdigest()


def find_duplicates(root_directory: Path) -> Dict[str, List[str]]:
    """
    Groups all PNG images below the given directory by their MD5 hash.

    Parameters:
    root_directory (Path): The root directory to search for images.

    Returns:
    dict: A dictionary where each key is an MD5 hash and its value is a list of
          image paths (relative to root_directory) with that hash. Hashes that
          occur only once are included; the caller filters them out.
    """
    image_paths: List[Path] = list(root_directory.rglob("*.png"))
    results: List[Tuple[Path, str]] = [
        process_image_md5(image_path) for image_path in tqdm(image_paths)
    ]

    hash_paths_dict: Dict[str, List[str]] = {}
    for image_path, md5_hash in results:
        # Store paths relative to the dataset root so the JSON is portable.
        relative_path: str = str(image_path.relative_to(root_directory))
        hash_paths_dict.setdefault(md5_hash, []).append(relative_path)

    return hash_paths_dict


def save_duplicates_to_json(duplicates: Dict[str, List], filename: Path):
    """
    Saves the duplicates dictionary to a JSON file.

    Parameters:
    duplicates (dict): The duplicates dictionary.
    filename (Path): The path to the JSON file where the results will be saved.
    """
    with open(filename, "w") as file:
        json.dump(duplicates, file, indent=4)


def main():
    """
    Main function to handle command line arguments and invoke duplicate finding and saving.
    """
    parser = argparse.ArgumentParser(
        description="Find and save duplicates in a dataset."
    )
    parser.add_argument(
        "root_directory", type=Path, help="Root directory of the images"
    )
    parser.add_argument(
        "json_filename", type=Path, help="Filename to save the duplicates JSON"
    )
    args = parser.parse_args()

    print(
        f"Searching for duplicates in {args.root_directory} and writing to {args.json_filename}"
    )

    hash_paths_dict: Dict[str, List[str]] = find_duplicates(args.root_directory)

    # Keep only hashes shared by more than one file, i.e. actual duplicates.
    duplicates: Dict[str, List[str]] = {
        md5_hash: paths
        for md5_hash, paths in hash_paths_dict.items()
        if len(paths) > 1
    }

    save_duplicates_to_json(duplicates, args.json_filename)

    # Number of duplicate groups
    num_duplicate_groups: int = len(duplicates)

    # Number of duplicate files
    num_duplicate_files: int = sum(len(paths) for paths in duplicates.values())

    print(f"Duplicates saved to {args.json_filename}")
    print(f"Number of duplicate groups: {num_duplicate_groups}")
    print(f"Total duplicate files: {num_duplicate_files}")


if __name__ == "__main__":
    main()

StefanDenn3r (Author) commented

Furthermore, quite a few samples are simply empty images. For example, all of the following files share one hash:

"ee4d421f59bd462d212ce24753493da4": [
        "US/thyroid/usn418235.png",
        "CT/abd/normal/abd-normal039513.png",
        "CT/abd/normal/abd-normal063889.png",
        "CT/abd/normal/abd-normal053270.png",
        "CT/abd/normal/abd-normal002528.png",
        "CT/abd/normal/abd-normal046244.png",
        "CT/abd/normal/abd-normal016491.png",
        "CT/abd/normal/abd-normal056338.png",
        "CT/abd/normal/abd-normal016490.png",
        "CT/abd/normal/abd-normal068850.png",
        "CT/abd/normal/abd-normal047066.png",
        "CT/abd/normal/abd-normal049461.png",
        "CT/abd/normal/abd-normal039070.png",
        "CT/abd/normal/abd-normal018254.png",
        "CT/abd/normal/abd-normal017359.png",
        "CT/abd/normal/abd-normal063883.png",
        "CT/abd/normal/abd-normal017361.png",
        "CT/abd/normal/abd-normal016112.png",
        "CT/abd/normal/abd-normal021904.png",
        "CT/abd/normal/abd-normal051763.png",
        "CT/abd/normal/abd-normal014918.png",
        "CT/abd/normal/abd-normal035602.png",
        "CT/abd/normal/abd-normal034951.png",
        "CT/abd/normal/abd-normal016113.png",
        "CT/abd/normal/abd-normal068847.png",
        "CT/abd/normal/abd-normal069026.png",
        "CT/abd/normal/abd-normal020266.png",
        "CT/abd/normal/abd-normal058409.png",
        "CT/abd/normal/abd-normal016492.png",
        "CT/abd/normal/abd-normal014914.png",
        "CT/abd/normal/abd-normal016117.png",
        "CT/abd/normal/abd-normal068845.png",
        "CT/abd/normal/abd-normal024872.png",
        "CT/abd/normal/abd-normal063887.png",
        "CT/abd/normal/abd-normal006929.png",
        "CT/abd/normal/abd-normal038923.png",
        "CT/abd/normal/abd-normal016489.png",
        "CT/abd/normal/abd-normal014920.png",
        "CT/abd/normal/abd-normal068849.png",
        "CT/abd/normal/abd-normal002527.png",
        "CT/abd/normal/abd-normal023009.png",
        "CT/abd/normal/abd-normal028309.png",
        "CT/abd/normal/abd-normal068848.png",
        "CT/abd/normal/abd-normal012402.png",
        "CT/abd/normal/abd-normal002529.png",
        "CT/abd/normal/abd-normal027692.png",
        "CT/abd/normal/abd-normal014917.png",
        "CT/abd/normal/abd-normal038921.png",
        "CT/abd/normal/abd-normal063886.png",
        "CT/abd/normal/abd-normal017360.png",
        "CT/abd/normal/abd-normal014922.png",
        "CT/abd/normal/abd-normal016493.png",
        "CT/abd/normal/abd-normal063884.png",
        "CT/abd/normal/abd-normal048245.png",
        "CT/abd/normal/abd-normal026267.png",
        "CT/abd/normal/abd-normal050842.png",
        "CT/abd/normal/abd-normal068196.png",
        "CT/abd/normal/abd-normal059727.png",
        "CT/abd/normal/abd-normal004505.png",
        "CT/abd/normal/abd-normal039068.png",
        "CT/abd/normal/abd-normal025695.png",
        "CT/abd/normal/abd-normal043546.png",
        "CT/abd/normal/abd-normal066519.png",
        "CT/abd/normal/abd-normal038082.png",
        "CT/abd/normal/abd-normal045106.png",
        "CT/abd/normal/abd-normal030263.png",
        "CT/abd/normal/abd-normal066054.png",
        "CT/abd/normal/abd-normal002526.png",
        "CT/abd/normal/abd-normal054393.png",
        "CT/abd/normal/abd-normal034294.png",
        "CT/abd/normal/abd-normal039072.png",
        "CT/abd/normal/abd-normal002524.png",
        "CT/abd/normal/abd-normal019419.png",
        "CT/abd/normal/abd-normal007039.png",
        "CT/abd/normal/abd-normal020776.png",
        "CT/abd/normal/abd-normal018253.png",
        "CT/abd/normal/abd-normal014921.png",
        "CT/abd/normal/abd-normal011770.png",
        "CT/abd/normal/abd-normal050088.png",
        "CT/abd/normal/abd-normal039073.png",
        "CT/abd/normal/abd-normal058178.png",
        "CT/abd/normal/abd-normal056852.png",
        "CT/abd/normal/abd-normal063888.png",
        "CT/abd/normal/abd-normal048689.png",
        "CT/abd/normal/abd-normal017358.png",
        "CT/abd/normal/abd-normal016115.png",
        "CT/abd/normal/abd-normal002651.png",
        "CT/abd/normal/abd-normal014915.png",
        "CT/abd/normal/abd-normal040704.png",
        "CT/abd/normal/abd-normal058408.png",
        "CT/abd/normal/abd-normal002530.png",
        "CT/abd/normal/abd-normal038924.png",
        "CT/abd/normal/abd-normal038770.png",
        "CT/abd/normal/abd-normal063005.png",
        "CT/abd/normal/abd-normal012403.png",
        "CT/abd/normal/abd-normal005817.png",
        "CT/abd/normal/abd-normal038922.png",
        "CT/abd/normal/abd-normal039071.png",
        "CT/abd/normal/abd-normal059091.png",
        "CT/abd/normal/abd-normal048244.png",
        "CT/abd/normal/abd-normal016116.png",
        "CT/abd/normal/abd-normal069025.png",
        "CT/abd/normal/abd-normal039069.png",
        "CT/abd/normal/abd-normal025111.png",
        "CT/abd/normal/abd-normal060343.png",
        "CT/abd/normal/abd-normal021311.png",
        "CT/abd/normal/abd-normal018255.png",
        "CT/abd/normal/abd-normal067181.png",
        "CT/abd/normal/abd-normal014919.png",
        "CT/abd/normal/abd-normal064379.png",
        "CT/abd/normal/abd-normal011865.png",
        "CT/abd/normal/abd-normal048784.png",
        "CT/abd/normal/abd-normal029406.png",
        "CT/abd/normal/abd-normal063885.png",
        "CT/abd/normal/abd-normal016114.png",
        "CT/abd/normal/abd-normal062317.png",
        "CT/abd/normal/abd-normal068846.png",
        "CT/abd/normal/abd-normal054395.png",
        "CT/abd/normal/abd-normal054303.png",
        "CT/abd/normal/abd-normal002400.png",
        "CT/abd/normal/abd-normal044746.png",
        "CT/abd/normal/abd-normal068195.png",
        "CT/abd/normal/abd-normal066049.png",
        "CT/abd/normal/abd-normal033146.png",
        "CT/abd/normal/abd-normal068705.png",
        "CT/abd/normal/abd-normal014916.png",
        "CT/abd/normal/abd-normal057421.png",
        "CT/abd/normal/abd-normal051215.png",
        "CT/abd/normal/abd-normal065509.png",
        "CT/abd/normal/abd-normal009734.png",
        "CT/abd/normal/abd-normal032744.png",
        "CT/abd/normal/abd-normal022428.png",
        "CT/abd/normal/abd-normal028845.png",
        "CT/abd/normal/abd-normal028748.png",
        "CT/abd/normal/abd-normal063890.png",
        "MR/mriabd/normal/mri-abd-normal047494.png",
        "MR/mriabd/normal/mri-abd-normal047532.png",
        "MR/mriabd/normal/mri-abd-normal047530.png",
        "MR/mriabd/normal/mri-abd-normal047531.png",
        "MR/mriabd/normal/mri-abd-normal047529.png",
        "MR/mriabd/normal/mri-abd-normal047492.png",
        "MR/mriabd/normal/mri-abd-normal047493.png",
        "MR/mriabd/normal/mri-abd-normal047491.png",
        "MR/af/chondral_abnormality/ankle026697.png"
    ],
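
Groups like this one stand out because they span several modalities (US, CT and MR), which real scans should not do. A rough sketch for listing such groups from the duplicates dictionary follows; the function name and the rule "more than one top-level folder" are assumptions, and a hit is only a candidate for an empty image until the pixel data has been inspected.

```python
from typing import Dict, List


def cross_modality_groups(duplicates: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Return MD5 hash -> sorted modalities for groups spanning several modalities.

    The modality is taken to be the first path component (e.g. "CT", "MR", "US").
    """
    flagged: Dict[str, List[str]] = {}
    for md5_hash, paths in duplicates.items():
        modalities = {path.split("/", 1)[0] for path in paths}
        if len(modalities) > 1:
            flagged[md5_hash] = sorted(modalities)
    return flagged
```

Empty images that occur within a single modality would not be caught this way; confirming emptiness requires opening the files with an image library and checking that all pixels have the same value.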
