WolfgangFellger edited this page Mar 15, 2013 · 9 revisions

Introduction to Computer Vision

Computer vision combines many fields from computer science to extract (high-level) information from (low-level) image data.

Pipeline

Many computer vision systems use the pipeline pattern. This pipeline can usually be broken down into the following steps:

Computer vision pipeline

Preprocessing

When preprocessing the image you usually remove noise, hot pixels and similar artifacts. Think of it as preparing the image with GIMP and G'MIC. It's all about making life easier for the algorithms that follow.
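As a rough sketch of what "removing hot pixels" means, here is a 3×3 median filter working on a plain 2D array of gray values (a stand-in for a real dv image; the function name is made up for illustration):

```javascript
// A minimal sketch of hot-pixel removal with a 3x3 median filter.
// `image` is a plain 2D array of grayscale values (0-255); edges are
// handled by clamping coordinates to the image border.
function medianFilter3x3(image) {
  var height = image.length, width = image[0].length;
  var out = [];
  for (var y = 0; y < height; y++) {
    out.push([]);
    for (var x = 0; x < width; x++) {
      var neighborhood = [];
      for (var dy = -1; dy <= 1; dy++) {
        for (var dx = -1; dx <= 1; dx++) {
          var ny = Math.min(Math.max(y + dy, 0), height - 1);
          var nx = Math.min(Math.max(x + dx, 0), width - 1);
          neighborhood.push(image[ny][nx]);
        }
      }
      neighborhood.sort(function (a, b) { return a - b; });
      out[y].push(neighborhood[4]); // median of the 9 collected values
    }
  }
  return out;
}
```

A single bright outlier in a dark neighbourhood simply vanishes, while larger structures survive.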

Segmentation

Formally speaking, segmentation is a process where you assign each pixel to a group. When working with text this usually boils down to "text is black" and "paper is white", which is also called binarization.

Segmentation of text
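In its simplest form, binarization is just a global threshold. A minimal sketch (again on a plain 2D array rather than a dv image, with a made-up function name):

```javascript
// A minimal sketch of binarization: every pixel brighter than
// `threshold` becomes paper (1), everything else becomes text (0).
function binarize(image, threshold) {
  return image.map(function (row) {
    return row.map(function (value) {
      return value > threshold ? 1 : 0;
    });
  });
}
```

Picking the threshold well is the hard part; see Otsu's algorithm further down.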

When working with complex images, more than two segments usually get extracted; this is typically done using algorithms such as region growing and statistical region merging.

Segmentation of Lena
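Region growing, for example, can be sketched in a few lines: start from a seed pixel and keep absorbing 4-connected neighbours whose intensity is close enough to the seed (function name and tolerance criterion are illustrative, not dv API):

```javascript
// A minimal sketch of region growing on a 2D grayscale array.
// Returns a boolean mask marking every pixel that joined the region.
function growRegion(image, seedY, seedX, tolerance) {
  var height = image.length, width = image[0].length;
  var mask = image.map(function (row) {
    return row.map(function () { return false; });
  });
  var seedValue = image[seedY][seedX];
  var queue = [[seedY, seedX]];
  mask[seedY][seedX] = true;
  while (queue.length > 0) {
    var p = queue.shift();
    var neighbours = [
      [p[0] - 1, p[1]], [p[0] + 1, p[1]],
      [p[0], p[1] - 1], [p[0], p[1] + 1]
    ];
    neighbours.forEach(function (n) {
      var y = n[0], x = n[1];
      if (y >= 0 && y < height && x >= 0 && x < width && !mask[y][x] &&
          Math.abs(image[y][x] - seedValue) <= tolerance) {
        mask[y][x] = true;
        queue.push([y, x]);
      }
    });
  }
  return mask;
}
```

Running this once per seed yields one segment per seed; statistical region merging then decides which adjacent segments actually belong together.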

In the end it's all about making life easier for algorithms that follow by dropping useless information.

Feature extraction

During feature extraction, features such as lines, shapes and motion get extracted from a single segmented image or a series of segmented images (note: here be dragons!). To name two particularly popular algorithms: the Canny edge detector and the Hough transform can be used to extract edges and lines.

Feature extraction using an edge detector and Hough transform
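To make the Hough transform less magical, here is a very coarse sketch: every foreground point (given as `[x, y]`) votes for all lines `rho = x·cos(theta) + y·sin(theta)` it could lie on, with angles quantised to whole degrees and distances to whole pixels; the accumulator peak is the dominant line. This is illustrative code, not the dv implementation:

```javascript
// A minimal sketch of a Hough transform for straight lines.
// `points` is an array of [x, y] foreground coordinates.
function houghPeak(points) {
  var votes = {};
  var best = { count: 0, thetaDeg: 0, rho: 0 };
  points.forEach(function (p) {
    for (var thetaDeg = 0; thetaDeg < 180; thetaDeg++) {
      var theta = thetaDeg * Math.PI / 180;
      var rho = Math.round(p[0] * Math.cos(theta) + p[1] * Math.sin(theta));
      var key = thetaDeg + ',' + rho;
      votes[key] = (votes[key] || 0) + 1;
      if (votes[key] > best.count) {
        best = { count: votes[key], thetaDeg: thetaDeg, rho: rho };
      }
    }
  });
  return best;
}
```

Real implementations use a dense accumulator array and non-maximum suppression, but the voting idea is the same.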

Classification

During classification, all extracted features get classified into useful information, such as "this is an 'X' with a confidence of 90%, or an 'x' with a confidence of 60%". This step usually involves some kind of decision tree, lots of statistics and sometimes machine learning.

Since we're talking about real-world images and the real world is tricky, it's good to have some kind of confidence involved. Think of it as an insurance or trust factor that prevents your application from fatal misinterpretations.
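The confidence idea can be sketched in a few lines: pick the hypothesis with the highest confidence, but refuse to decide when even the best one is below a rejection threshold (structure and names are illustrative):

```javascript
// A minimal sketch of confidence-based classification.
// `hypotheses` is an array of {label, confidence} objects.
function classify(hypotheses, minConfidence) {
  var best = null;
  hypotheses.forEach(function (h) {
    if (best === null || h.confidence > best.confidence) best = h;
  });
  if (best === null || best.confidence < minConfidence) {
    // Better to admit "don't know" than to guess fatally wrong.
    return { label: null, rejected: true };
  }
  return { label: best.label, confidence: best.confidence, rejected: false };
}
```

With the 'X'/'x' example from above, a threshold of 0.5 picks 'X', while a threshold of 0.95 rejects the input entirely.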

Application to documents

Since DocumentVision isn't about soccer robots or augmented reality games, let's get down to business. Documents are usually printed on paper and scanned under controlled conditions. The information found on paper ranges from figures and tables to barcodes, form elements and text.

Luminance won't be a big issue, since it's pretty much controlled by the scanning device; sometimes you have to do background normalization (bad scanner, bad copy). Segmentation depends on what you want to do. If you're interested in simple text, Otsu's algorithm applied to the whole page will usually do the job. When the printing is bad on shiny form paper, it might be a good idea to use an adaptive version of Otsu or to remove the form paper (e.g. the form paper is red, the printed text is black) from the image. After that you can simply throw Tesseract, ZXing or TickReader at it to get your information.
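For reference, Otsu's algorithm itself is short: given a grayscale histogram, it tries every threshold and keeps the one that maximises the between-class variance of the two resulting pixel groups. A sketch on a plain histogram array (not the dv API):

```javascript
// A minimal sketch of Otsu's algorithm. `histogram[i]` is the number
// of pixels with gray value i; the returned threshold t puts values
// 0..t in one class and t+1.. in the other.
function otsuThreshold(histogram) {
  var total = 0, sumAll = 0;
  for (var i = 0; i < histogram.length; i++) {
    total += histogram[i];
    sumAll += i * histogram[i];
  }
  var sumBackground = 0, weightBackground = 0;
  var bestThreshold = 0, bestVariance = 0;
  for (var t = 0; t < histogram.length; t++) {
    weightBackground += histogram[t];
    if (weightBackground === 0) continue;
    var weightForeground = total - weightBackground;
    if (weightForeground === 0) break;
    sumBackground += t * histogram[t];
    var meanBackground = sumBackground / weightBackground;
    var meanForeground = (sumAll - sumBackground) / weightForeground;
    var diff = meanBackground - meanForeground;
    // Between-class variance (up to a constant factor of total^2).
    var variance = weightBackground * weightForeground * diff * diff;
    if (variance > bestVariance) {
      bestVariance = variance;
      bestThreshold = t;
    }
  }
  return bestThreshold;
}
```

On a nicely bimodal histogram (dark text, bright paper) this lands the threshold in the valley between the two modes.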

Example

Let's say you want to extract all of those things from our image. Let's start by [thinning](http://en.wikipedia.org/wiki/Thinning_(morphology%29) the thresholded image; basically, this makes thin white lines even thinner. We can then dilate the remaining gaps away to get a set of region blobs:

Page segmentation example

As you can see, the QR code and its framing rectangle have almost touched. But no worries, we can separate them again: by doing a distance transform we get a kind of height map, where each pixel contains the shortest distance to a white pixel (first image). Thresholding this image will then effectively remove all elements thinner than the specified width – voilà, the border is gone. Of course, this also cuts off the borders of larger elements, but we can mostly restore them using erode (second image). Finally we're ready to get rectangles for cropping by computing the connected components (and if we overcompensate in the erode step, we even get a neat margin around our detected elements, as shown in the last image).

Page segmentation example
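To see why thresholding the distance map removes thin elements, here is a sketch of a two-pass city-block distance transform on a plain boolean array (`true` = white background). This is illustrative code, not dv's `distanceFunction`:

```javascript
// A minimal sketch of a two-pass city-block distance transform.
// Each output pixel holds its distance to the nearest white pixel,
// so thresholding the result at w removes anything thinner than ~2w.
function distanceTransform(isWhite) {
  var height = isWhite.length, width = isWhite[0].length;
  var INF = width + height; // larger than any possible distance
  var dist = isWhite.map(function (row) {
    return row.map(function (white) { return white ? 0 : INF; });
  });
  // Forward pass: propagate distances from the top-left.
  for (var y = 0; y < height; y++) {
    for (var x = 0; x < width; x++) {
      if (y > 0) dist[y][x] = Math.min(dist[y][x], dist[y - 1][x] + 1);
      if (x > 0) dist[y][x] = Math.min(dist[y][x], dist[y][x - 1] + 1);
    }
  }
  // Backward pass: propagate distances from the bottom-right.
  for (var y2 = height - 1; y2 >= 0; y2--) {
    for (var x2 = width - 1; x2 >= 0; x2--) {
      if (y2 < height - 1) dist[y2][x2] = Math.min(dist[y2][x2], dist[y2 + 1][x2] + 1);
      if (x2 < width - 1) dist[y2][x2] = Math.min(dist[y2][x2], dist[y2][x2 + 1] + 1);
    }
  }
  return dist;
}
```

A thin black stroke never gets far from white on either side, so its peak distance stays small and the threshold wipes it out.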

Code

If you want to try for yourself, here is the code:

var dv = require('dv');
var fs = require('fs');

// Load the (already thresholded) example page.
var barcodes = new dv.Image('png', fs.readFileSync(__dirname + '/fixtures/barcodes.png'));
// Thin white lines, then dilate the remaining gaps away.
var open = barcodes.thin('bg', 8, 5).dilate(3, 3);
// Distance transform: each pixel holds its distance to a white pixel.
var openMap = open.distanceFunction(8);
// Thresholding drops thin elements; erode restores the larger ones.
var openMask = openMap.threshold(10).erode(22, 22);
// Bounding boxes of the connected components are our crop rectangles.
var boxes = openMask.invert().connectedComponents(8);
for (var i = 0; i < boxes.length; i++) {
	var boxImage = barcodes.crop(
		boxes[i].x, boxes[i].y,
		boxes[i].width, boxes[i].height);
	// Do something useful with our image.
}