This project uses Tesseract and Python to solve OCR problems.
Starting from an image (or a folder of images), the goal is to convert visual information (e.g. words and numbers inside the image) into text (e.g. a CSV file with the relevant information).
All the scripts are based on the work of Adrian Rosebrock and his website, pyimagesearch. So, thanks Adrian!
-
Install Tesseract on your machine. You can find out how to do this here
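To confirm that the install worked, a quick check along these lines may help. It is only a sketch: the helper name `tesseract_version` is made up for illustration and is not part of this project, and it assumes the executable is called `tesseract` and is on your `PATH`.

```python
import shutil
import subprocess


def tesseract_version(binary="tesseract"):
    """Return the first line of `<binary> --version`, or None if it is not on PATH.

    `tesseract_version` is a hypothetical helper, not part of this project.
    """
    path = shutil.which(binary)
    if path is None:
        return None
    result = subprocess.run([path, "--version"], capture_output=True, text=True)
    # Some Tesseract builds print the version banner to stderr instead of stdout.
    output = result.stdout or result.stderr
    lines = output.splitlines()
    return lines[0] if lines else ""


if __name__ == "__main__":
    version = tesseract_version()
    if version is None:
        print("Tesseract was not found on PATH")
    else:
        print("Found:", version)
```

If this reports that Tesseract was not found, check the install step again before running any of the scripts.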
-
Configure your development environment. Using pyenv is highly recommended, since it makes managing your Python projects easier. More information here
-
Once your environment is ready, install the requirements. To do this, run the following from the root of this project:
pip install -r requirements.txt
-
That's all! You can now run any of the Python scripts. For example:
python scripts/ocr_folder_process.py --folder folder_path_where_you_have_your_images_to_process --blacklist "|/\[](){}"
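For readers curious about what a command like the one above might do internally, here is a minimal sketch. It is not the actual code of `scripts/ocr_folder_process.py`, which may work differently. It assumes that `pytesseract` and `Pillow` are installed, that the blacklist is passed to Tesseract through its `tessedit_char_blacklist` variable, and that results go to a CSV file. The function names, the list of image extensions and the output file name are all made up for illustration.

```python
import csv
import shlex
from pathlib import Path

# Assumed set of image extensions; adjust to whatever your folder contains.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".tif", ".tiff", ".bmp"}


def build_config(blacklist=""):
    """Build a Tesseract config string that blacklists the given characters."""
    if not blacklist:
        return ""
    # Quote the value so characters such as backslashes and brackets survive
    # the shell-style splitting that pytesseract applies to `config`.
    return "-c " + shlex.quote("tessedit_char_blacklist=" + blacklist)


def list_images(folder):
    """Return the image files directly inside `folder`, sorted by name."""
    return sorted(
        p for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES
    )


def ocr_folder(folder, blacklist="", out_csv="ocr_results.csv"):
    """OCR every image in `folder` and write one CSV row per image."""
    # Imported here so the helpers above work without the OCR stack installed.
    import pytesseract
    from PIL import Image

    config = build_config(blacklist)
    with open(out_csv, "w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        writer.writerow(["filename", "text"])
        for path in list_images(folder):
            with Image.open(path) as image:
                text = pytesseract.image_to_string(image, config=config)
            writer.writerow([path.name, text.strip()])
    return out_csv
```

Blacklisting characters such as `|`, `/` and brackets is a common way to suppress symbols that Tesseract tends to hallucinate from lines and borders in scanned documents.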