
Introduction to Computer Vision

Computer vision combines many fields from computer science to extract (high-level) information from (low-level) image data.

Pipeline

Many computer vision systems use the pipeline pattern. This pipeline can usually be broken down into the following steps:

Computer vision pipeline

Preprocessing

When preprocessing the image you usually remove noise, hot pixels, etc. Think of it as preparing the image with GIMP and G'MIC. It's all about making life easier for the algorithms that follow.
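As an illustration, here is a minimal sketch of one common cleanup step, a 3x3 median filter, written in plain JavaScript over a flat grayscale array (the function and its parameters are just for illustration, not part of dv):

function medianFilter3x3(gray, width, height) {
	// Replace every pixel with the median of its 3x3 neighbourhood;
	// hot pixels and salt-and-pepper noise disappear, edges mostly survive.
	var out = new Uint8Array(gray);
	var window = new Array(9);
	for (var y = 1; y < height - 1; y++) {
		for (var x = 1; x < width - 1; x++) {
			var n = 0;
			for (var dy = -1; dy <= 1; dy++) {
				for (var dx = -1; dx <= 1; dx++) {
					window[n++] = gray[(y + dy) * width + (x + dx)];
				}
			}
			window.sort(function(a, b) { return a - b; });
			out[y * width + x] = window[4];
		}
	}
	return out;
}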

Segmentation

Formally speaking, segmentation is a process where you assign each pixel to a group. When working with text this usually means "text is black" and "paper is white" - this is also called binarization.

Segmentation of text
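For black text on white paper a single global threshold often does the job. Here is a minimal sketch using dv - toGray(), the value 128 and the input file name are assumptions, so adjust them to your scans and your version of the library:

var dv = require('dv');
var fs = require('fs');

// Convert the scan to grayscale and binarize it with a global threshold.
// toGray() and the value 128 are assumptions - tune them for your input.
var page = new dv.Image('png', fs.readFileSync(__dirname + '/fixtures/text.png'));
var binary = page.toGray().threshold(128);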

When working with complex images, more than two segments usually get extracted - typically using algorithms such as region growing and statistical region merging.

Segmentation of Lena
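To give an idea of how region growing works, here is a minimal plain-JavaScript sketch over a flat grayscale array (illustration only, not part of dv): starting from a seed pixel, neighbours join the segment as long as they stay within a tolerance of the seed value.

function growRegion(gray, width, height, seedX, seedY, tolerance) {
	var visited = new Uint8Array(width * height);
	var segment = [];
	var stack = [seedY * width + seedX];
	var seedValue = gray[seedY * width + seedX];
	while (stack.length > 0) {
		var index = stack.pop();
		if (visited[index]) continue;
		visited[index] = 1;
		// Only keep pixels that are similar enough to the seed.
		if (Math.abs(gray[index] - seedValue) > tolerance) continue;
		segment.push(index);
		var x = index % width;
		var y = (index / width) | 0;
		if (x > 0) stack.push(index - 1);
		if (x < width - 1) stack.push(index + 1);
		if (y > 0) stack.push(index - width);
		if (y < height - 1) stack.push(index + width);
	}
	// Flat pixel indices belonging to the grown segment.
	return segment;
}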

In the end it's all about making life easier for algorithms that follow by dropping useless information.

Feature extraction

During feature extraction, features such as lines, shapes, motion, etc. get extracted from a single segmented image or a series of segmented images (note: here be dragons!). To name some particularly popular algorithms: the Canny edge detector and the Hough transform can be used to extract edges and lines.

Feature extraction using an edge detector and Hough transform
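To make the Hough transform less magical, here is a minimal plain-JavaScript sketch (illustration only, not part of dv): every edge pixel votes for all (rho, theta) line parameters it could lie on, and accumulator cells with many votes correspond to lines.

function houghLines(edges, width, height, thetaSteps, minVotes) {
	var maxRho = Math.ceil(Math.sqrt(width * width + height * height));
	var rhoBins = 2 * maxRho + 1;
	var votes = new Int32Array(thetaSteps * rhoBins);
	for (var y = 0; y < height; y++) {
		for (var x = 0; x < width; x++) {
			if (!edges[y * width + x]) continue;
			// Vote for every line x*cos(theta) + y*sin(theta) = rho through this pixel.
			for (var t = 0; t < thetaSteps; t++) {
				var theta = Math.PI * t / thetaSteps;
				var rho = Math.round(x * Math.cos(theta) + y * Math.sin(theta));
				votes[t * rhoBins + rho + maxRho]++;
			}
		}
	}
	// Cells with enough votes are our detected lines.
	var lines = [];
	for (var t = 0; t < thetaSteps; t++) {
		for (var r = 0; r < rhoBins; r++) {
			if (votes[t * rhoBins + r] >= minVotes) {
				lines.push({ rho: r - maxRho, theta: Math.PI * t / thetaSteps });
			}
		}
	}
	return lines;
}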

Classification

During classification, all extracted features get turned into useful information, such as "this is an 'X' with a confidence of 90% or an 'x' with a confidence of 60%". This step usually involves some kind of decision tree, lots of statistics and sometimes machine learning.

Since we're talking about real-world images and the real world is tricky, it's good to have some kind of confidence involved. Think of it as an insurance or trust factor that prevents your application from drawing fatally wrong conclusions.
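In its simplest form that boils down to picking the best-scoring candidate only when its confidence clears a cut-off - a plain-JavaScript sketch (illustration only):

function decide(candidates, minConfidence) {
	var best = null;
	for (var i = 0; i < candidates.length; i++) {
		if (!best || candidates[i].confidence > best.confidence) {
			best = candidates[i];
		}
	}
	// Not confident enough? Better to flag the result for review
	// than to silently guess wrong.
	if (best && best.confidence >= minConfidence) {
		return best.label;
	}
	return null;
}

// decide([{label: 'X', confidence: 0.9}, {label: 'x', confidence: 0.6}], 0.8) === 'X'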

Application to documents

DocumentVision isn't about soccer robots or augmented reality games, so let's get down to business. Documents are usually printed on paper and get scanned under controlled conditions. The information found on paper ranges from figures and tables to barcodes, form elements and text.

Luminance won't be a big issue, since it's pretty much controlled by the scanner. Sometimes you have to do background normalization (bad scanner, bad copy). Segmentation depends on what you want to do. If you're interested in simple text, Otsu's algorithm applied to the whole page will usually do the job. When the printing is bad on shiny form paper, it might be a good idea to use an adaptive version of Otsu or to remove the form paper (e.g. the form paper is red, the printed text is black) from the image. After that you can simply throw Tesseract, ZXing or TickReader at it to get your information.
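A minimal sketch of that last step with dv - the Tesseract and ZXing wrappers and their methods (findText, findCode) follow the library's README, so treat them as assumptions and check them against the version you have installed; the input file name is a placeholder:

var dv = require('dv');
var fs = require('fs');

var page = new dv.Image('png', fs.readFileSync(__dirname + '/fixtures/page.png'));

// OCR the page with Tesseract (API as in dv's README; verify for your version).
var tesseract = new dv.Tesseract('eng', page);
console.log(tesseract.findText('plain'));

// Look for barcodes with ZXing (API as in dv's README; verify for your version).
var zxing = new dv.ZXing(page);
console.log(zxing.findCode());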

Example

Let's say you want to extract all of those things from our image. Let's start by thinning the background a little and dilating small white lines away:

Page segmentation example

As you can see, the QR code almost touches the border of its framing rectangle. If it did, we could separate them using a distance transform and thresholding the distance again. Voilà, the border is gone. To get rectangles we can use for cropping, we compute the connected components:

Page segmentation example

Code

If you want to try it for yourself, here is the code:

var dv = require('dv');
var fs = require('fs');

// Load the scanned page.
var barcodes = new dv.Image('png', fs.readFileSync(__dirname + '/fixtures/barcodes.png'));
// Thin the background a little and dilate small white lines away.
var open = barcodes.thin('bg', 8, 5).dilate(3, 3);
// Distance transform of the opened image, thresholded to pull apart regions
// that almost touch, followed by a 22x22 erosion.
var openMap = open.distanceFunction(8);
var openMask = openMap.threshold(10).erode(22, 22);
// Connected components give us one bounding box per region.
var boxes = openMask.invert().connectedComponents(8);
for (var i = 0; i < boxes.length; i++) {
	var boxImage = barcodes.crop(
		boxes[i].x, boxes[i].y,
		boxes[i].width, boxes[i].height);
	// Do something useful with our image.
}
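What "something useful" means depends on the region - a barcode crop, for example, can go straight to a barcode reader. A short sketch of that last step, again assuming dv's ZXing wrapper works as in its README:

var zxing = new dv.ZXing(boxImage);
// `findCode` is an assumption from dv's README - verify against your version.
console.log(zxing.findCode());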