Skip to content

OCR using tesseract, ImageMagick, EmguCV, an advanced query language and a fluent query interface for C#

License

Notifications You must be signed in to change notification settings

albertoarmida/DocumentLab

 
 

Repository files navigation

Optical character recognition, image processing and intelligent data extraction

NuGet version (DocumentLab-x64) License) Platform

Comming soon: Realtime script editor and visualization tool


This is a solution for data extraction from images of documents. You send in a bitmap, a set of queries and get extracted data in structured json.

  • DocumentLab takes care of image analysis and processing, optical character recognition, text classification
  • It provides a query language created specifically for document data extraction
    • Defining documents by queries, patterns and tables allows DocumentLab to understand documents intelligently
    • Intelligently meaning that the solution has the capacity to extract data from documents with layouts it has never seen before
    • A C# interface to the query language is also supported
  • Extensive configuration for all parts of the process are available to facilitate your requirements


Getting started

  • Important: DocumentLab has been optimized to expect a certain size range when it comes to analysing text from images. Which means that if the picture you pass in has text that is too big or too small (pixel wise) then you will not get optimal results. You can use the fakeinvoice.png from the example above for a scale reference if you need to resize your input data.

  • Important: A drawback of using the prebuilt binary application is that on each execution DocumentLab needs to load configuration files and initialize a tesseract engine for each thread. This initialization overhead might take 3x the amount of time DocumentLab would otherwise need to scan a single page. It's in the plans to provide a better prebuilt binary application for integration.

You can download the prebuilt binary and get started immediately. Parameter usage: (Picture file path) (Query file path) ((Optional) Output file path)

The (Query file path) parameter needs to be a text file containing a set of queries that DocumentLab can perform on the picture provided alongside as an argument. This still requires investigating the DocumentLab query language. The script used in the example above was the same as was used for the topmost picture with the fake invoice.

... Or if you want to integrate with code,

Quickly

// To use the scripting interface
var documentLab = new DocumentInterpreter();
var interpretedJsonResult = documentLab.InterpretToJson(script, (Bitmap)Image.FromFile(imagePath));
// To use the FluentQuery API
using (var document = new Document((Bitmap)Image.FromFile(imagePath))) 
{
 ...

You can write scripts in the query language or use the C# API. The C# fluent interface is easier to get started quickly. The raw text scripting interface allows more versatility and configurability in a production context.

Definitions

  • Pattern: A description of how information is presented in a document as well as which data to capture
    • e.g: Text(Total amount) Right [Amount]
  • Table: A description of which table column labels to match in a document and which text types are represented in each column
    • e.g: Table 'ItemNo': [Number(Item number)] 'Description': [Text(Description)] 'Price': [Amount(...
  • Query: A named set of patterns prioritized first to last
    • e.g: IncoiceNumber: *pattern 1*; *pattern 2*; ... *pattern n*;
  • Script: A collection of queries to execute in one go. Output properties will have the query name

From here...

Quick examples

Below are a few select examples with comments on how DocumentLab can be used.

C# Fluent Query Example

using (var dl = new Document((Bitmap)Image.FromFile("pathToSomeImage.png")))
{
  // Here we ask DocumentLab to specifically find a date value for the specified possible labels
  string dueDate = dl.Query().FindValueForLabel(TextType.Date, "Due date", "Payment date");

  // Here we ask DocumentLab to specifically find a date value for the specified label in a specific direction 
  string customerNumber = dl.Query().GetValueForLabel(Direction.Right, "Customer number");

  // We can build patterns using predicates, directions and capture operations that return the value matched in the document
  // Patterns allow us to recognize and capture data by contextual information, i.e., how we'd read for example receiver information from an invoice
  string receiverName = dl
    .Query()
    .Match("PostCode") // Text classification using contextual data files can be referenced by string
    .Up()
    .Match("Town")
    .Up()
    .Match("City")
    .Up()
    .Capture(TextType.Text); // All text type operations can also use the statically defined text type enum
} 

Script example

  • Your document has a label "Customer number:" and a value to the right of it
    • Query: CustomerNumber: Text(Customer number) Right [Text];
    • Match text labels with implicit starts with and Levensthein distance 2 comparison
  • Your document has a label "Invoice date" and a date below it
    • Query: InvoiceDate: Text(Invoice date) Down [Date];
    • You can capture a variety of text types. Even if the document contains additional text at the capture you'll only get back a standardized ISO date.
  • Want to capture invoice receiver info in one query?
    • Query: Receiver: 'Name': [Text] Down 'Address': [StreetAddress] Down 'City': [Town] Down 'PostalCode': [PostalCode];
    • Json output will name properties according to the query predicate naming parameters
  • You want to capture all amounts in a document?
    • Query: AllAmounts: Any [Amount];
    • When we use any, results are returned in a json array

Documentation

The case with any OCR process is that the quality of the output depends entirely on the quality of source image you pass into it. If the original image quality is very low then you can expect very low quality OCR results.

Depending on your image source, you may want to upsample low dpi images to a range between 220 - 300. This often also becomes a question of quality vs. time in terms of execution time and should be adjusted to your requirements.

About

OCR using tesseract, ImageMagick, EmguCV, an advanced query language and a fluent query interface for C#

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • C# 90.4%
  • JavaScript 6.4%
  • HTML 1.1%
  • CSS 1.0%
  • PowerShell 0.6%
  • ANTLR 0.5%