hocr-detect-columns

Detects columns and connects indented lines in hOCR files. This Node.js module is used in the NYPL's NYC Space/Time Directory project to extract data from digitized New York City directories.

Most OCR tools can produce hOCR files; we use Tesseract.

But how does it work?

First, hocr-detect-columns uses Cheerio to read all pages in the hOCR file (an hOCR file is just an HTML file with special properties).
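In hOCR, each element stores its geometry in the title attribute, for example `bbox 36 92 618 184`, which gives x0, y0, x1 and y1 in pixels. The sketch below shows how such a value could be parsed once the attribute has been read from the HTML. `parseBbox` is a hypothetical helper name used for illustration, not a function exported by this module.

```javascript
// Hypothetical helper: extract [x0, y0, x1, y1] from an hOCR title attribute.
// hOCR separates properties with semicolons, e.g. "bbox 36 92 618 184; baseline 0 -5".
function parseBbox (title) {
  const match = /(?:^|;)\s*bbox\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)/.exec(title)
  if (!match) {
    return null // no bounding box on this element
  }
  // Convert the four captured strings to numbers
  return match.slice(1, 5).map(Number)
}

console.log(parseBbox('bbox 36 92 618 184; baseline 0 -5'))
```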

For each page, the X positions of the bounding boxes of all OCR lines are clustered using Simple Statistics. A page with n columns should have many lines with bounding boxes on or around n different X values. If clustering finds n clusters that together contain most of the OCR lines, the page most likely has n columns.
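To make the idea concrete, here is a minimal, dependency-free sketch. It does not use Simple Statistics; it swaps in a simple gap-based grouping of sorted X values as a stand-in for the real clustering step. The function names and the `tolerance` parameter are assumptions made for this illustration, and the way `columnCount` and `minLinesPerColumn` are applied here is a guess at how such settings could be used, not a description of the module's internals.

```javascript
// Group sorted X positions: a value joins the current cluster when it lies
// within `tolerance` pixels of that cluster's last (largest) member.
function clusterXPositions (xs, tolerance) {
  const sorted = [...xs].sort((a, b) => a - b)
  const clusters = []
  for (const x of sorted) {
    const last = clusters[clusters.length - 1]
    if (last && x - last[last.length - 1] <= tolerance) {
      last.push(x)
    } else {
      clusters.push([x])
    }
  }
  return clusters
}

// Keep the `columnCount` largest clusters that contain enough lines,
// and report each column by its mean X position, ordered left to right.
function detectColumnXs (xs, { columnCount, tolerance, minLinesPerColumn }) {
  return clusterXPositions(xs, tolerance)
    .filter((cluster) => cluster.length >= minLinesPerColumn)
    .sort((a, b) => b.length - a.length)
    .slice(0, columnCount)
    .map((cluster) => cluster.reduce((sum, x) => sum + x, 0) / cluster.length)
    .sort((a, b) => a - b)
}

// Made-up X positions: two columns near x=100 and x=900, plus two indented lines
const xs = [100, 102, 98, 101, 150, 900, 903, 899, 901, 950]
console.log(detectColumnXs(xs, { columnCount: 2, tolerance: 10, minLinesPerColumn: 3 }))
```

The two indented lines (at 150 and 950) end up in clusters of their own that are too small to count as columns, which is what marks them as candidates for the connection step.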

To connect indented lines with the previous line they belong to, hocr-detect-columns uses a spatial index and tries to find, for each line which doesn't belong to a column, the closest line in the upper-left direction. The algorithms we need are implemented by RBush (spatial index) and rbush-knn (nearest neighbor search). You can read more about spatial search algorithms for JavaScript on Mapbox's blog.
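The search for the closest line in the upper-left direction can be sketched without a spatial index. The version below is a brute-force scan over all column lines, used here only because it is short; it is not the RBush and rbush-knn approach the module relies on, and it would be slower on large pages. The line shape (`{ text, bbox }`), the function name and the sample data are all assumptions for illustration.

```javascript
// Each line is { text, bbox: [x0, y0, x1, y1] }; Y grows downward, as in hOCR.
function findParentLine (indented, columnLines) {
  const [x, y] = indented.bbox
  let best = null
  let bestDistance = Infinity
  for (const candidate of columnLines) {
    const [cx, cy] = candidate.bbox
    // Only consider lines that start above and to the left of the indented line
    if (cx >= x || cy >= y) continue
    const distance = Math.hypot(x - cx, y - cy)
    if (distance < bestDistance) {
      best = candidate
      bestDistance = distance
    }
  }
  return best
}

// Made-up sample data: three column lines and one indented continuation line
const columnLines = [
  { text: 'Smith John, grocer,', bbox: [100, 200, 600, 230] },
  { text: 'Smith Mary, teacher,', bbox: [100, 240, 600, 270] },
  { text: 'Jones Ann, milliner,', bbox: [900, 240, 1400, 270] }
]
const indented = { text: '12 Broadway', bbox: [150, 280, 500, 310] }

const parent = findParentLine(indented, columnLines)
console.log(`${parent.text} ${indented.text}`)
```

A spatial index answers the same question without comparing every pair of lines, which matters when a file contains many pages with hundreds of lines each.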

Installation & Usage

Standalone

npm install -g nypl-spacetime/hocr-detect-columns

hocr-detect-columns can produce the following output formats:

Log to stdout (default):

hocr-detect-columns /path/to/file.hocr

Output JSON:

hocr-detect-columns --mode json /path/to/file.hocr

Output NDJSON:

hocr-detect-columns --mode ndjson /path/to/file.hocr

Output HTML visualization:

hocr-detect-columns --mode html /path/to/file.hocr

As a Node.js module

npm install --save nypl-spacetime/hocr-detect-columns

const fs = require('fs')
const detectColumns = require('hocr-detect-columns')

const hocr = fs.readFileSync('/path/to/file.hocr', 'utf8')

const config = {}

const pages = detectColumns(hocr, config)

Configuration

You can configure hocr-detect-columns by supplying a JSON configuration object or file:

{
  "columnCount": 2, // Number of expected columns
  "characterWidth": 25, // Width of character, in pixels
  "minLinesPerColumn": 50 // Minimum expected lines, per column
}

Example

In the example directory, you can find an hOCR file of page 418 of the 1850 city directory, as well as a JSON file and an HTML visualization generated by hocr-detect-columns.

To generate these files yourself, run:

hocr-detect-columns --mode json example/example.hocr

Or:

hocr-detect-columns --mode html example/example.hocr

Data

The format of the resulting JSON pages object is as follows:

{
  "config": {
    …
  },
  "pages": [
    {
      "number": 0,
      "properties": {
        …
      },
      "lines": [
        {
          "properties": {
            "bbox": [
              …
            ],
            …
          },
          "text": "contents of line",
          "columnIndex": 0,
          "completeText": "contents of line, appended with text from indented next lines"
        },
        …
      ]
    },
    {
      "number": 1,
      "properties": {
        …
      },
      "lines": [
        …
      ]
    },
    …
  ]
}
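As a rough illustration of how this structure might be consumed, the sketch below groups the `completeText` of each line by `columnIndex` for a single page. The `textByColumn` helper and the sample object are hypothetical; the sample is hand-written to match the shape above and is not real output. It also assumes that lines without a numeric `columnIndex` can be skipped, which may not hold for actual results.

```javascript
// Collect `completeText` per column for one page.
// Assumption: lines that lack a numeric columnIndex are skipped.
function textByColumn (page) {
  const columns = {}
  for (const line of page.lines) {
    if (typeof line.columnIndex !== 'number') continue
    if (!columns[line.columnIndex]) columns[line.columnIndex] = []
    columns[line.columnIndex].push(line.completeText)
  }
  return columns
}

// Hand-written sample in the shape shown above (not real output)
const result = {
  config: {},
  pages: [
    {
      number: 0,
      properties: {},
      lines: [
        { properties: {}, text: 'first', columnIndex: 0, completeText: 'first line' },
        { properties: {}, text: 'line' },
        { properties: {}, text: 'second', columnIndex: 1, completeText: 'second' }
      ]
    }
  ]
}

for (const page of result.pages) {
  console.log(page.number, textByColumn(page))
}
```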
