[Feature] Add CustomDataset #738

Merged
merged 4 commits into from
Mar 30, 2022

mzr1996
Member

@mzr1996 mzr1996 commented Mar 21, 2022

Motivation

Our dataset classes are not very user-friendly. Users usually have to arrange their data in the ImageNet format, which is not intuitive.

Modification

This PR adds a CustomDataset class with a detailed docstring.

The CustomDataset is almost the same as torchvision.datasets.ImageFolder, but also supports ann_file.

Use cases

  1. Use annotation file:
    data_prefix/
    ├── folder_1   # Images can be arranged in any layout under data_prefix
    │   ├── xxx.png
    │   └── xxy.png
    ├── 123.png
    └── nsdf3.png
    
    The ann_file:
    folder_1/xxx.png 0
    folder_1/xxy.png 1
    123.png 5
    nsdf3.png 3
    
    Initialize from ann_file:
    >>> from mmcls.datasets import build_dataset
    >>> cfg = dict(type='CustomDataset', data_prefix="data_prefix", ann_file="ann_file.txt")
    >>> dataset = build_dataset(cfg)
    >>> print(len(dataset))
    4
    >>> print(dataset[0])
    {'img_prefix': 'data_prefix',
     'img_info': {'filename': 'folder_1/xxx.png'},
     'gt_label': array(0)}
  2. Arrange the samples in one folder per class (no ann_file needed):
    data_prefix/
    ├── cat
    │   ├── sub_folder
    │   │   └── 111.jpg
    │   ├── xxx.png
    │   └── xxy.png
    └── dog
        ├── 123.png
        └── nsdf3.png
    
    Initialize:
    >>> from mmcls.datasets import build_dataset
    >>> cfg = dict(type='CustomDataset', data_prefix="data_prefix")
    >>> dataset = build_dataset(cfg)
    >>> print(len(dataset))
    5
    >>> print(dataset.CLASSES)
    ['cat', 'dog']
    >>> print(dataset[0])
    {'img_prefix': 'data_prefix',
     'img_info': {'filename': 'cat/sub_folder/111.jpg'},
     'gt_label': array(0)}
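
The two use cases above can be sketched with the standard library alone. This is only an illustration of the idea, not the code added in this PR: the helper names `load_from_ann_file` and `load_from_folders` and the extension tuple are assumptions made for the example. It assumes each annotation line is `<relative path> <integer label>` and that every top-level folder under `data_prefix` is one class, with labels assigned in sorted folder order.

```python
import os
import tempfile

# Hypothetical extension list for this sketch only.
IMG_EXTENSIONS = ('.jpg', '.jpeg', '.png')


def load_from_ann_file(ann_file):
    """Parse lines of the form '<relative/path> <label>' into (path, label)."""
    samples = []
    with open(ann_file) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            # Split on the last space so paths containing spaces still work.
            filename, label = line.rsplit(' ', 1)
            samples.append((filename, int(label)))
    return samples


def load_from_folders(data_prefix):
    """Treat each top-level sub-folder as a class and scan it recursively."""
    classes = sorted(
        d for d in os.listdir(data_prefix)
        if os.path.isdir(os.path.join(data_prefix, d)))
    samples = []
    for label, cls in enumerate(classes):
        for root, _, files in os.walk(os.path.join(data_prefix, cls)):
            for name in files:
                if name.lower().endswith(IMG_EXTENSIONS):
                    rel = os.path.relpath(os.path.join(root, name), data_prefix)
                    samples.append((rel.replace(os.sep, '/'), label))
    samples.sort()  # deterministic order, sorted by relative path
    return classes, samples


# Rebuild the folder layout from use case 2 in a temporary directory.
with tempfile.TemporaryDirectory() as data_prefix:
    for rel in ['cat/sub_folder/111.jpg', 'cat/xxx.png', 'cat/xxy.png',
                'dog/123.png', 'dog/nsdf3.png']:
        path = os.path.join(data_prefix, *rel.split('/'))
        os.makedirs(os.path.dirname(path), exist_ok=True)
        open(path, 'w').close()
    classes, samples = load_from_folders(data_prefix)

print(classes)
print(len(samples))
print(samples[0])
```

In this sketch an annotation file lets labels be arbitrary integers (such as 5 and 3 in use case 1), while the folder layout derives both the class names and the labels from the directory names.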

Checklist

Before PR:

  • Pre-commit or other linting tools are used to fix the potential lint issues.
  • Bug fixes are fully covered by unit tests, the case that causes the bug should be added in the unit tests.
  • The modification is covered by complete unit tests. If not, please add more unit tests to ensure the correctness.
  • The documentation has been modified accordingly, like docstring or example tutorials.

After PR:

  • If the modification has potential influence on downstream or other related projects, this PR should be tested with those projects, like MMDet or MMSeg.
  • CLA has been signed and all committers have signed the CLA in this PR.

@mzr1996 mzr1996 requested a review from Ezra-Yu March 21, 2022 14:46
codecov bot commented Mar 21, 2022

Codecov Report

Merging #738 (bc6e164) into dev (7856141) will increase coverage by 1.55%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##              dev     #738      +/-   ##
==========================================
+ Coverage   85.03%   86.58%   +1.55%     
==========================================
  Files         123      125       +2     
  Lines        7604     7700      +96     
  Branches     1311     1327      +16     
==========================================
+ Hits         6466     6667     +201     
+ Misses        946      837     -109     
- Partials      192      196       +4     
Flag Coverage Δ
unittests 86.49% <100.00%> (+1.53%) ⬆️

Impacted Files Coverage Δ
mmcls/datasets/__init__.py 100.00% <100.00%> (ø)
mmcls/datasets/base_dataset.py 98.97% <100.00%> (+7.67%) ⬆️
mmcls/datasets/cifar.py 84.93% <100.00%> (+42.67%) ⬆️
mmcls/datasets/custom.py 100.00% <100.00%> (ø)
mmcls/datasets/imagenet.py 100.00% <100.00%> (+44.68%) ⬆️
mmcls/datasets/imagenet21k.py 100.00% <100.00%> (+17.56%) ⬆️
mmcls/datasets/voc.py 33.33% <0.00%> (-5.13%) ⬇️
mmcls/core/evaluation/eval_metrics.py 80.00% <0.00%> (-2.36%) ⬇️
mmcls/datasets/builder.py 91.04% <0.00%> (ø)
mmcls/models/backbones/__init__.py 100.00% <0.00%> (ø)
... and 11 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 7856141...bc6e164.

Collaborator

@Ezra-Yu Ezra-Yu left a comment

  1. Please update the dataset config of imagenet21k.
  2. Please update the Colab notebooks to use the CustomDataset.

@@ -131,3 +135,21 @@ class CIFAR100(CIFAR10):
'key': 'fine_label_names',
'md5': '7973b15100ade9c7d40fb424638fde48',
}
CLASSES = [
'apple', 'aquarium_fish', 'baby', 'bear', 'beaver', 'bed', 'bee',
Collaborator

The CLASSES meta info lives in our code and is very long, especially in 'imagenet.py'. It may be better to create a metafile.bin that stores all the CLASSES info. That way the code would be cleaner and easier for users to read.

Member Author

Right, I think so too, but moving the category info to another file may cause unexpected problems, especially for deployment.
The ImageNet code itself is short: it is only a CustomDataset with preset attributes. I think we can keep it.
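
To illustrate what "a CustomDataset with preset attributes" can look like, here is a minimal sketch. It is not the mmcls code: the class names `SketchCustomDataset` and `SketchAnimals`, the constructor signature, and the attribute values are all hypothetical. The point is only that a subclass can carry its class names and accepted extensions as class-level attributes and inherit everything else.

```python
class SketchCustomDataset:
    """Hypothetical base class; class-level attributes act as defaults."""

    CLASSES = None  # no preset class names in the generic dataset
    IMG_EXTENSIONS = ('.jpg', '.jpeg', '.png')

    def __init__(self, data_prefix, classes=None):
        self.data_prefix = data_prefix
        # An explicit argument overrides whatever the class presets.
        if classes is not None:
            self.CLASSES = list(classes)


class SketchAnimals(SketchCustomDataset):
    """A concrete dataset is just the base class plus preset attributes."""

    CLASSES = ['cat', 'dog']
    IMG_EXTENSIONS = ('.jpeg',)


generic = SketchCustomDataset('data_prefix')
preset = SketchAnimals('data_prefix')
overridden = SketchAnimals('data_prefix', classes=['kitten', 'puppy'])

print(generic.CLASSES)
print(preset.CLASSES)
print(overridden.CLASSES)
```

With this shape the long list of names is the only bulky part of the subclass, which is the trade-off discussed in this thread.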

mmcls/datasets/custom.py (resolved)
mmcls/datasets/custom.py (outdated, resolved)
@mzr1996 mzr1996 requested a review from Ezra-Yu March 30, 2022 02:15
Collaborator

@Ezra-Yu Ezra-Yu left a comment

LGTM.

@mzr1996 mzr1996 merged commit d0d6f73 into open-mmlab:dev Mar 30, 2022
mzr1996 added a commit to mzr1996/mmpretrain that referenced this pull request Nov 24, 2022
* Add custom dataset and refactor ImageNet dataset

* Add default CLASSES for CIFAR dataset

* Add unit tests

* Improve according to comments
@mzr1996 mzr1996 deleted the custom-dataset branch December 7, 2022 02:10
2 participants