AutoDC: Automated data-centric processing

This repository is the official Python implementation of the position paper "AutoDC: Automated data-centric processing". The implementation will continue to be updated in the coming months. Note that only image data is supported.

AutoDC is a framework that lets domain experts automatically and systematically improve datasets with little coding or manual work, an idea similar to AutoML (automated machine learning).

With an AutoML system such as Google Cloud AutoML, domain experts only need to bring in the input data; AutoML takes care of the manual ML processes and produces output predictions along with user-defined evaluation metrics. AutoDC applies the same idea to data: domain experts bring a labeled dataset, such as annotated images, to the system, and AutoDC takes care of the manual data improvement processes and produces an improved dataset by automatically correcting incorrect labels (with user feedback), detecting and selecting edge cases, and augmenting edge cases.
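To make the "detecting edge cases" step more concrete, here is a minimal sketch that flags unusual items from their embedding vectors. It is an illustration only: AutoDC itself relies on Img2Vec embeddings and scikit-learn, while the distance-to-centroid rule, the function names, and the threshold `k` below are assumptions chosen to keep the example dependency-free (it needs Python 3.8+ for `math.dist`).

```python
import math
import statistics


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def flag_edge_cases(embeddings, k=2.0):
    """Return indices of vectors that lie unusually far from the centroid.

    A vector is flagged when its Euclidean distance to the centroid exceeds
    mean(distances) + k * stdev(distances). This rule is a hypothetical
    stand-in, not the outlier detector used by AutoDC.
    """
    center = centroid(embeddings)
    distances = [math.dist(v, center) for v in embeddings]
    threshold = statistics.mean(distances) + k * statistics.pstdev(distances)
    return [i for i, d in enumerate(distances) if d > threshold]


# Toy 2-D "embeddings": nine points in the unit square and one far away.
toy_embeddings = [
    (0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5),
    (0, 0.5), (0.5, 0), (1, 0.5), (0.5, 1), (10, 10),
]
edge_case_indices = flag_edge_cases(toy_embeddings)
```

In a real pipeline the toy tuples would be replaced by image embeddings, and the flagged indices would identify the images to review, keep, or augment.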

Full Paper:
Zac Yung-Chun Liu, Shoumik Roychowdhury, Scott Tarlow, Akash Nair, Shweta Badhe, and Tejas Shah. AutoDC: Automated data-centric processing, NeurIPS 2021: DCAI workshop, arXiv: 2111.12548.

Dependencies

opencv-python >= 4.5.3.56
scikit-learn >= 0.24.2
numpy >= 1.19.5
python-magic-bin >= 0.4.14
augly >= 0.2.1

Img2Vec is utilized here. You also need Python >= 3.6.
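If you want to confirm that your environment meets these minimum versions, a small check like the following may help. It is a sketch, not part of AutoDC: the helper names are hypothetical, the version comparison only looks at leading numeric components, and `importlib.metadata` requires Python 3.8 or newer even though AutoDC itself states Python >= 3.6.

```python
from importlib import metadata  # standard library from Python 3.8 onward

# Minimum versions copied from the dependency list above.
MINIMUMS = {
    "opencv-python": "4.5.3.56",
    "scikit-learn": "0.24.2",
    "numpy": "1.19.5",
    "python-magic-bin": "0.4.14",
    "augly": "0.2.1",
}


def as_tuple(version):
    """Turn '4.5.3.56' into (4, 5, 3, 56); stop at the first non-numeric part."""
    parts = []
    for piece in version.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def check(minimums, lookup=metadata.version):
    """Return {package: problem} for every missing or too-old package."""
    problems = {}
    for name, minimum in minimums.items():
        try:
            installed = lookup(name)
        except metadata.PackageNotFoundError:
            problems[name] = "not installed"
            continue
        if as_tuple(installed) < as_tuple(minimum):
            problems[name] = f"{installed} is older than {minimum}"
    return problems
```

Calling `check(MINIMUMS)` returns an empty dictionary when every listed package is installed at a sufficient version; otherwise it maps each problematic package to a short description.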

Install

Install the dependencies with pip using requirements.txt:

pip install -r requirements.txt

Getting Started

Using the starter script starter_image_data.py, you only need to provide the path to the directory containing the input images and the path to the directory where the output images (the improved dataset) will be stored.

python starter_image_data.py --input Users/sample_data/ --output Users/sample_data/

Optionally, you can also specify:

--o_ratio: percentage of outlier data to include, default: 100 (include all outlier data in the final improved dataset)
--n_ratio: percentage of non-outlier data to include, default: 100 (include all non-outlier data in the final improved dataset)
--a_ratio: percentage of outlier data to augment, default: 20 (augment 20% of the outlier data and add it to the final improved dataset)
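The following sketch shows how the flags above could be parsed and what their defaults would mean in terms of image counts. It is not the code of starter_image_data.py: the function names are hypothetical, and both the rounding rule (round down) and the reading of --a_ratio as a share of the outlier count are assumptions.

```python
import argparse
import math


def build_parser():
    """Hypothetical parser mirroring the documented flags and defaults."""
    parser = argparse.ArgumentParser(description="Sketch of the AutoDC starter CLI")
    parser.add_argument("--input", required=True, help="directory with input images")
    parser.add_argument("--output", required=True, help="directory for the improved dataset")
    parser.add_argument("--o_ratio", type=float, default=100, help="percent of outlier data to keep")
    parser.add_argument("--n_ratio", type=float, default=100, help="percent of non-outlier data to keep")
    parser.add_argument("--a_ratio", type=float, default=20, help="percent of outlier data to augment")
    return parser


def planned_counts(n_outliers, n_non_outliers, o_ratio=100, n_ratio=100, a_ratio=20):
    """Return how many images each ratio would select (assumed: rounded down)."""
    for name, value in (("o_ratio", o_ratio), ("n_ratio", n_ratio), ("a_ratio", a_ratio)):
        if not 0 <= value <= 100:
            raise ValueError(f"{name} must be between 0 and 100, got {value}")
    return {
        "outliers": math.floor(n_outliers * o_ratio / 100),
        "non_outliers": math.floor(n_non_outliers * n_ratio / 100),
        "augmented": math.floor(n_outliers * a_ratio / 100),
    }
```

Under these assumptions, a dataset with 50 outlier images and 200 non-outlier images would keep all 250 originals with the default settings and add 10 augmented images.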

If you prefer a notebook, open the starter notebook starter_image_data.ipynb and follow the steps.

Contributing

If you are interested in contributing to the AutoDC project, please check out the Contributor Guide and the Code of Conduct.

License

AutoDC is released under the Apache 2.0 License.

Sponsorship

The AutoDC project has been sponsored by Hypergiant LLC since Nov 2021.

Citation

See our paper describing the framework:

Zac Yung-Chun Liu, Shoumik Roychowdhury, Scott Tarlow, Akash Nair, Shweta Badhe, and Tejas Shah (2021), "AutoDC: Automated data-centric processing", arXiv:2111.12548

@misc{liu2021autodc,
      title={AutoDC: Automated data-centric processing},
      author={Zac Yung-Chun Liu and Shoumik Roychowdhury and Scott Tarlow and Akash Nair and Shweta Badhe and Tejas Shah},
      year={2021},
      eprint={2111.12548},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
