Introduction
Computer vision combines many fields from computer science to extract (high-level) information from (low-level) image data.
Many computer vision systems use the pipeline pattern. This pipeline can usually be broken down into the following steps: preprocessing, segmentation, feature extraction and classification.
When preprocessing the image, you usually remove noise, hot pixels, and similar artifacts. Think of it as preparing the image with GIMP and G'MIC: it's all about making life easier for the algorithms that follow.
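To make this concrete, here is a minimal sketch of one such preprocessing step: a 3×3 median filter that wipes out single-pixel noise. `medianFilter3x3` and the flat grayscale-array layout are assumptions for illustration, not part of the dv API:

```js
// Illustrative sketch: a 3x3 median filter over a grayscale image stored
// as a flat array (row-major). Replaces each inner pixel with the median
// of its 3x3 neighbourhood, which removes isolated hot pixels.
function medianFilter3x3(pixels, width, height) {
  var out = pixels.slice();
  for (var y = 1; y < height - 1; y++) {
    for (var x = 1; x < width - 1; x++) {
      var window = [];
      for (var dy = -1; dy <= 1; dy++) {
        for (var dx = -1; dx <= 1; dx++) {
          window.push(pixels[(y + dy) * width + (x + dx)]);
        }
      }
      window.sort(function (a, b) { return a - b; });
      out[y * width + x] = window[4]; // median of 9 values
    }
  }
  return out;
}
```

A single 255-valued "hot" pixel in an otherwise flat region simply disappears, while larger structures survive.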
Formally speaking, segmentation is a process where you assign each pixel to a group. When working with text, this usually means "text is black" and "paper is white" - this special case is also called binarization.
When working with complex images, more than two segments usually get extracted - typically using algorithms such as region growing or statistical region merging.
In the end, it's all about making life easier for the algorithms that follow by dropping useless information.
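Region growing itself fits in a few lines of plain JavaScript. The sketch below is illustrative only; `growRegion`, the 4-connectivity, and the tolerance parameter are choices made for this example, not part of the dv API:

```js
// Illustrative sketch of region growing: starting from a seed pixel, grow
// a segment over all 4-connected neighbours whose value is within
// `tolerance` of the seed value. Returns the pixel indices of the region.
function growRegion(pixels, width, height, seedX, seedY, tolerance) {
  var seedValue = pixels[seedY * width + seedX];
  var visited = new Array(width * height).fill(false);
  var region = [];
  var stack = [[seedX, seedY]];
  while (stack.length > 0) {
    var p = stack.pop();
    var x = p[0], y = p[1];
    if (x < 0 || y < 0 || x >= width || y >= height) continue;
    var i = y * width + x;
    if (visited[i]) continue;
    visited[i] = true;
    if (Math.abs(pixels[i] - seedValue) > tolerance) continue;
    region.push(i);
    // Visit the 4-connected neighbours.
    stack.push([x + 1, y], [x - 1, y], [x, y + 1], [x, y - 1]);
  }
  return region;
}
```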
During feature extraction, features such as lines, shapes and motion are extracted from a single segmented image or a series of segmented images (note: here be dragons!). To name two particularly popular algorithms: the Canny edge detector and the Hough transform can be used to extract edges and lines.
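A bare-bones Hough transform for line detection might look like the sketch below. Each foreground pixel votes for every line that could pass through it; peaks in the accumulator correspond to actual lines. `houghLines` and its brute-force accumulator are simplifications for illustration, not part of the dv API:

```js
// Illustrative sketch of a Hough transform for lines. A line is written
// as rho = x*cos(theta) + y*sin(theta); every foreground pixel votes for
// all (theta, rho) pairs it lies on, and the bin with the most votes wins.
function houghLines(pixels, width, height, thetaSteps) {
  var maxRho = Math.ceil(Math.sqrt(width * width + height * height));
  var acc = {};
  for (var y = 0; y < height; y++) {
    for (var x = 0; x < width; x++) {
      if (pixels[y * width + x] === 0) continue; // skip background
      for (var t = 0; t < thetaSteps; t++) {
        var theta = (t * Math.PI) / thetaSteps;
        var rho = Math.round(x * Math.cos(theta) + y * Math.sin(theta));
        var key = t + ',' + (rho + maxRho); // offset rho so keys stay positive
        acc[key] = (acc[key] || 0) + 1;
      }
    }
  }
  // Return the (thetaIndex, rhoOffset) key with the most votes.
  var best = null;
  for (var key in acc) {
    if (!best || acc[key] > acc[best]) best = key;
  }
  return best;
}
```

A real implementation would return all bins above a vote threshold rather than just the single strongest line.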
During classification, the extracted features are turned into useful information, such as "this is an 'X' with a confidence of 90%, or an 'x' with a confidence of 60%". This step usually involves some kind of decision tree, lots of statistics, and sometimes machine learning.
Since we're talking about real-world images and the real world is tricky, it's good to have some kind of confidence involved. Think of it as insurance, or a trust factor, that prevents your application from fatally misinterpreting its input.
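A toy version of such a confidence-based decision could look like this. `classify`, the prototype matching, and the `1 / (1 + distance)` confidence score are made up purely for illustration, not part of the dv API:

```js
// Illustrative sketch of confidence-based classification: compare a
// feature vector against labelled prototypes, report the best label with
// a confidence score, and reject anything below a cutoff - it is better
// to return "unknown" than to silently misread a character.
function classify(features, prototypes, minConfidence) {
  var best = null;
  for (var label in prototypes) {
    var proto = prototypes[label];
    var dist = 0;
    for (var i = 0; i < features.length; i++) {
      dist += Math.abs(features[i] - proto[i]);
    }
    var confidence = 1 / (1 + dist); // 1.0 means a perfect match
    if (!best || confidence > best.confidence) {
      best = { label: label, confidence: confidence };
    }
  }
  return best && best.confidence >= minConfidence ? best : null;
}
```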
Since DocumentVision isn't about soccer robots or augmented reality games, let's get down to business. Documents are usually printed on paper and scanned under controlled conditions. The information found on paper ranges from figures and tables to barcodes, form elements and text.
Luminance won't be a big issue, since it's pretty much controlled by the scanning device; sometimes you have to do background normalization (bad scanner, bad copy). Segmentation depends on what you want to do. If you're interested in simple text, Otsu's algorithm applied to the whole page will usually do the job. When the printing is bad on shiny form paper, it might be a good idea to use an adaptive version of Otsu, or to remove the form paper itself from the image (e.g. the form paper is red, the printed text is black). After that you can simply throw Tesseract, ZXing or TickReader at it to get your information.
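Otsu's algorithm itself is short enough to sketch. The version below is an illustrative plain-JavaScript take that works on a 256-bin grayscale histogram (dv ships its own thresholding; `otsuThreshold` is not part of its API):

```js
// Illustrative sketch of Otsu's method: pick the threshold that maximises
// the between-class variance of the grayscale histogram, i.e. the split
// that best separates "dark" (text) from "bright" (paper) pixels.
function otsuThreshold(histogram) {
  var total = 0, sum = 0;
  for (var i = 0; i < 256; i++) {
    total += histogram[i];
    sum += i * histogram[i];
  }
  var sumB = 0, wB = 0, best = 0, bestVariance = -1;
  for (var t = 0; t < 256; t++) {
    wB += histogram[t];          // weight of the background class
    if (wB === 0) continue;
    var wF = total - wB;         // weight of the foreground class
    if (wF === 0) break;
    sumB += t * histogram[t];
    var mB = sumB / wB;          // background mean
    var mF = (sum - sumB) / wF;  // foreground mean
    var variance = wB * wF * (mB - mF) * (mB - mF);
    if (variance > bestVariance) {
      bestVariance = variance;
      best = t;
    }
  }
  return best;
}
```

For a cleanly printed page the histogram is strongly bimodal (one peak for ink, one for paper), which is exactly the case where a single global threshold works well.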
Let's say you want to extract all of those things from our image. Let's start by [thinning](http://en.wikipedia.org/wiki/Thinning_%28morphology%29) the thresholded image; basically, this makes thin white lines even thinner. We can then dilate the remaining gaps away to get a set of region blobs:
As you can see, the QR-Code and its framing rectangle have almost touched. But no worries, we can separate them again: by doing a distance transform we get a kind of height map, where each pixel contains the shortest distance to a white pixel (first image). Thresholding this image will then effectively remove all elements thinner than the specified width - voilà, the border is gone. Of course, this will also cut off the borders of larger elements, but we can mostly restore them by using erode (second image). Finally we're ready to get rectangles for cropping, by computing the connected components (and if we overcompensate in the erode step, we even get a neat margin around our detected elements, as shown in the last image).
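The distance transform at the heart of this step can be approximated with the classic two-pass city-block scan. `distanceTransform` below is an illustrative sketch that treats value 0 as the white background - it is not the dv implementation:

```js
// Illustrative sketch of a distance transform via two city-block passes:
// after initialisation, a forward scan propagates distances from the
// top-left and a backward scan from the bottom-right, so every pixel ends
// up holding its 4-neighbour distance to the nearest background pixel.
function distanceTransform(pixels, width, height) {
  var INF = width + height; // larger than any possible distance
  var dist = new Array(width * height);
  for (var i = 0; i < width * height; i++) {
    dist[i] = pixels[i] === 0 ? 0 : INF; // background is at distance 0
  }
  // Forward pass: look at the left and top neighbours.
  for (var y = 0; y < height; y++) {
    for (var x = 0; x < width; x++) {
      var j = y * width + x;
      if (x > 0) dist[j] = Math.min(dist[j], dist[j - 1] + 1);
      if (y > 0) dist[j] = Math.min(dist[j], dist[j - width] + 1);
    }
  }
  // Backward pass: look at the right and bottom neighbours.
  for (var y = height - 1; y >= 0; y--) {
    for (var x = width - 1; x >= 0; x--) {
      var j = y * width + x;
      if (x < width - 1) dist[j] = Math.min(dist[j], dist[j + 1] + 1);
      if (y < height - 1) dist[j] = Math.min(dist[j], dist[j + width] + 1);
    }
  }
  return dist;
}
```

Thresholding the resulting map keeps only pixels deep inside thick elements, which is why thin borders vanish while the QR-Code survives.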
If you want to try for yourself, here is the code:
```js
var dv = require('dv');
var fs = require('fs');

var barcodes = new dv.Image('png', fs.readFileSync(__dirname + '/fixtures/barcodes.png'));
var open = barcodes.thin('bg', 8, 5).dilate(3, 3);
var openMap = open.distanceFunction(8);
var openMask = openMap.threshold(10).erode(22, 22);
var boxes = openMask.invert().connectedComponents(8);
for (var i in boxes) {
    var boxImage = barcodes.crop(
        boxes[i].x, boxes[i].y,
        boxes[i].width, boxes[i].height);
    // Do something useful with our image.
}
```