Form Segmentation

Let's explore how we can extract text from forms and scanned pages.

Objectives

The goal is to find an algorithm that extracts as much information as possible from a given page (JPG format) so that it can be fed to another system (business logic, a neural network, a classifier, etc.). The overall process may not be perfect, but it should find enough information to identify the type of document and the identities involved.

Parse any form or scanned page and extract all text data (printed and handwritten), with no prior knowledge of the document's layout or structure.
Fully automatic extraction (no human interaction, so it can scale out).
Reasonably fast (or able to speed up the task with more machines or CPUs).

Challenges

There are many challenges to overcome, but the main problem is identifying which parts of the form contain text.
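One simple way to approach the "where is the text" question is a horizontal projection profile: count the ink pixels in each row and group consecutive non-empty rows into bands. The sketch below is only an illustration, not the method used in this repository. It assumes the page has already been binarized and deskewed, with 1 meaning ink and 0 meaning background; the function name `find_text_rows` and the `min_ink` parameter are hypothetical.

```python
from typing import List, Tuple


def find_text_rows(binary: List[List[int]], min_ink: int = 1) -> List[Tuple[int, int]]:
    """Return (start, end) row ranges, end exclusive, that contain ink.

    `binary` is a page as rows of 0/1 values (assumption: 1 = ink).
    A row counts as text when it holds at least `min_ink` ink pixels.
    """
    bands = []
    start = None
    for y, row in enumerate(binary):
        has_ink = sum(row) >= min_ink
        if has_ink and start is None:
            start = y                      # a new band begins
        elif not has_ink and start is not None:
            bands.append((start, y))       # the current band ends
            start = None
    if start is not None:                  # band runs to the bottom edge
        bands.append((start, len(binary)))
    return bands


# Tiny hand-made "page": two bands of ink separated by an empty row.
page = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 1, 0],
]
bands = find_text_rows(page)
```

On real forms, ruled lines and box borders also produce ink, so a profile like this would likely need to run after line and border removal, and raising `min_ink` can help ignore isolated specks.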

Some other challenges:

Black border removal
ICR (Intelligent Character Recognition): recognize hand-drawn characters and convert them into text
Scanned pages (detect edges and apply a perspective transform to obtain a top-down view of the document)
Noise removal (blur, Otsu thresholding, adaptive thresholding with OpenCV)
Shape detection and extraction
OCR (not a real issue, since Tesseract 4 works well for printed text)
Handwriting recognition
Error minimization
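To make the Otsu step in the noise-removal item concrete, here is a minimal pure-Python sketch of Otsu's method: it picks the gray level that maximizes the between-class variance of the two resulting pixel groups. In practice OpenCV provides this through its thresholding functions; the version below exists only to show the idea. It assumes 8-bit grayscale values given as a flat list, and the names `otsu_threshold` and `to_ink_mask` are hypothetical.

```python
from typing import List


def otsu_threshold(pixels: List[int]) -> int:
    """Return the threshold t (0-255) that maximizes between-class variance.

    Pixels <= t form one class, pixels > t the other.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0      # number of pixels in the lower class
    sum0 = 0    # sum of gray levels in the lower class
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue                # lower class still empty
        w1 = total - w0
        if w1 == 0:
            break                   # upper class empty from here on
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        var_between = w0 * w1 * (mean0 - mean1) ** 2
        if var_between > best_var:  # strict: keep the first maximum
            best_var = var_between
            best_t = t
    return best_t


def to_ink_mask(pixels: List[int], t: int) -> List[int]:
    """Mark dark pixels (<= t) as ink (1), assuming dark text on a light page."""
    return [1 if p <= t else 0 for p in pixels]


# Synthetic example: half dark "ink" pixels, half bright "paper" pixels.
sample = [10] * 50 + [200] * 50
t = otsu_threshold(sample)
mask = to_ink_mask(sample, t)
```

A single global threshold like this tends to struggle with uneven lighting or shadows on a scanned page, which is presumably why adaptive thresholding is listed alongside it.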

