[Feature] Add a tool to find broken files. #482

Merged: 4 commits, Oct 27, 2021

Ezra-Yu (Collaborator) commented Oct 9, 2021

Thanks for your contribution; we appreciate it a lot. The following instructions will make your pull request healthier and help it get feedback more easily. If you do not understand some items, don't worry: just make the pull request and seek help from the maintainers.

Motivation

After a dataset has been prepared, some of its files may be broken. This PR adds a tool that finds all the broken files.

Modification

Add a verify_dataset.py tool

BC-breaking (Optional)

No.

Use cases (Optional)

python tools/misc/verify_dataset.py ${CONFIG_PATH} --num-process ${CPU_TO_USE} --phase ${PHASE} --out-path ${OUT}
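To illustrate the general idea behind such a tool, here is a minimal standalone sketch. It is not the code from this PR: the function names (`is_broken`, `find_broken_files`, `write_report`) are hypothetical, it uses a thread pool from the standard library rather than the `mmcv.track_parallel_progress` helper mentioned in the commit log, and it assumes a simple JPEG start/end marker check as the definition of "broken" instead of loading each sample through the dataset config.

```python
from concurrent.futures import ThreadPoolExecutor

JPEG_SOI = b"\xff\xd8"  # start-of-image marker that opens a JPEG stream
JPEG_EOI = b"\xff\xd9"  # end-of-image marker that closes it


def is_broken(path):
    """Heuristic check: True if the file cannot be read or lacks JPEG markers."""
    try:
        with open(path, "rb") as f:
            data = f.read()
    except OSError:
        # Missing or unreadable files count as broken.
        return True
    return not (data.startswith(JPEG_SOI) and data.endswith(JPEG_EOI))


def find_broken_files(paths, num_workers=4):
    """Check all paths concurrently and return the broken ones in input order."""
    paths = list(paths)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        flags = list(pool.map(is_broken, paths))
    return [p for p, broken in zip(paths, flags) if broken]


def write_report(broken, out_path):
    """Write one broken path per line (assumed report format)."""
    with open(out_path, "w", encoding="utf-8") as f:
        for p in broken:
            f.write(p + "\n")
```

The marker check is only a heuristic: a file with extra bytes after the end-of-image marker would be flagged even though many decoders accept it, so fully decoding each image is the more reliable test when a decoder is available.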

Checklist

Before PR:

  • Pre-commit or other linting tools are used to fix the potential lint issues.
  • Bug fixes are fully covered by unit tests, the case that causes the bug should be added in the unit tests.
  • The modification is covered by complete unit tests. If not, please add more unit tests to ensure correctness.
  • The documentation has been modified accordingly, like docstring or example tutorials.

After PR:

  • If the modification has potential influence on downstream or other related projects, this PR should be tested with those projects, like MMDet or MMSeg.
  • CLA has been signed and all committers have signed the CLA in this PR.

codecov bot commented Oct 9, 2021

Codecov Report

Merging #482 (60dc641) into master (6fba107) will increase coverage by 0.76%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master     #482      +/-   ##
==========================================
+ Coverage   78.01%   78.77%   +0.76%     
==========================================
  Files         101      103       +2     
  Lines        5617     5702      +85     
  Branches      923      927       +4     
==========================================
+ Hits         4382     4492     +110     
+ Misses       1108     1088      -20     
+ Partials      127      122       -5

Flag        Coverage Δ
unittests   78.77% <ø> (+0.76%) ⬆️

Flags with carried forward coverage won't be shown.

Impacted Files                          Coverage Δ
mmcls/datasets/pipelines/formating.py   0.00% <0.00%> (-43.48%) ⬇️
mmcls/models/utils/attention.py         98.72% <0.00%> (-1.28%) ⬇️
mmcls/datasets/builder.py               42.55% <0.00%> (-1.20%) ⬇️
mmcls/apis/train.py                     22.72% <0.00%> (ø)
mmcls/apis/inference.py                 19.64% <0.00%> (ø)
mmcls/models/backbones/vgg.py           86.58% <0.00%> (ø)
mmcls/models/backbones/resnet.py        100.00% <0.00%> (ø)
mmcls/models/backbones/__init__.py      100.00% <0.00%> (ø)
mmcls/models/backbones/res2net.py       95.50% <0.00%> (ø)
mmcls/datasets/pipelines/formatting.py  43.47% <0.00%> (ø)
... and 3 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 6fba107...60dc641.

mzr1996 (Member) left a comment

I pushed a commit and made some changes; please check it. @Ezra-Yu

mzr1996 changed the title from "Add tool to find broken files" to "[Feature] Add a tool to find broken files." on Oct 27, 2021
mzr1996 merged commit 52e6256 into open-mmlab:master on Oct 27, 2021
Ezra-Yu deleted the broken-files branch on October 27, 2021 03:25

Wu-570 commented Jun 8, 2022

I used verify_dataset.py to check my dataset, and it reported 1885 broken files. My dataset is organized in the ImageNet layout and all the files are jpg images. I wonder why they are broken, and I would like to know how to handle these 1885 files so that they pass verify_dataset.py.
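One possible first step in diagnosing a report like this is to check whether the flagged files really contain JPEG data, since a `.jpg` extension does not guarantee the content. The sketch below is only an assumption about where to look, not a confirmed cause from this thread, and `sniff_format` and `summarize` are hypothetical helper names. It reads the leading bytes of each file and compares them against well-known format signatures.

```python
from collections import Counter


def sniff_format(path):
    """Guess the real file format from its leading bytes (magic numbers)."""
    try:
        with open(path, "rb") as f:
            head = f.read(12)
    except OSError:
        return "unreadable"
    if not head:
        return "empty"
    if head.startswith(b"\xff\xd8\xff"):
        return "jpeg"
    if head.startswith(b"\x89PNG\r\n\x1a\n"):
        return "png"
    if head[:6] in (b"GIF87a", b"GIF89a"):
        return "gif"
    if head[:4] == b"RIFF" and head[8:12] == b"WEBP":
        return "webp"
    if head.startswith(b"BM"):
        return "bmp"
    return "unknown"


def summarize(paths):
    """Count how many of the given files fall into each detected format."""
    return Counter(sniff_format(p) for p in paths)
```

Running `summarize` over the paths listed in the tool's output file gives a quick breakdown. Files detected as another format could be re-encoded as JPEG, while empty or unreadable ones would need to be downloaded or copied again.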

mzr1996 added a commit to mzr1996/mmpretrain that referenced this pull request Nov 24, 2022
* add verify dataset

* add phase

* rm attr of single_process

* Use `mmcv.track_parallel_progress` to track the validation.

Co-authored-by: mzr1996 <[email protected]>