Detects columns and connects indented lines in hOCR files. This Node.js module is used in the NYPL's NYC Space/Time Directory project to extract data from digitized New York City directories.
Most OCR tools can produce hOCR files — we are using Tesseract.
First, hocr-detect-columns uses Cheerio to read all pages in the hOCR file (an hOCR file is just an HTML file with special properties).
Per page, the X positions of the bounding boxes of all OCR lines are clustered, using Simple Statistics. A page with n
columns should have many lines with bounding boxes on or around n
different X values. If clustering finds n
clusters containing most of the OCR lines, we can expect the page has n
columns.
To connect indented lines with the previous line they belong to, hocr-detect-columns uses a spatial index and tries to find, for each line which doesn't belong to a column, the closest line in the upper-left direction. The algorithms we need are implemented by RBush (spatial index) and rbush-knn (nearest neighbor search). You can read more about spatial search algorithms for JavaScript on Mapbox's blog.
npm install -g nypl-spacetime/hocr-detect-columns
hocr-detect-columns can produce the following output formats:
- Log to stdout (default):
hocr-detect-columns /path/to/file.hocr
- Output JSON:
hocr-detect-columns --mode json /path/to/file.hocr
- Output NDJSON:
hocr-detect-columns --mode ndjson /path/to/file.hocr
- Output HTML visualization:
hocr-detect-columns --mode html /path/to/file.hocr
npm install --save nypl-spacetime/hocr-detect-columns
const fs = require('fs')
const detectColumns = require('hocr-detect-columns')
const hocr = fs.readFileSync('/path/to/file.hocr', 'utf8')
const config = {}
const pages = detectColumns(hocr, config)
You can configure hocr-detect-columns by supplying a JSON configuration object or file:
{
"columnCount": 2, // Amount of expected columns
"characterWidth": 25, // Width of character, in pixels
"minLinesPerColumn": 50 // Minimum expected lines, per column
}
In the directory example
, you can find an hOCR file of page 418 of the 1850 city directory, as well as a JSON file and HTML visualization generated by hocr-detect-columns.
To generate these files yourself, run:
hocr-detect-columns --mode json example/example.hocr
Or:
hocr-detect-columns --mode html example/example.hocr
The format of the resulting JSON pages
object is as follows:
{
"config": {
…
},
"pages": [
{
"number": 0,
"properties": {
…
},
"lines": [
{
"properties": {
"bbox": [
…
],
…
}
"text": "contents of line"
"columnIndex": 0,
"completeText": "contents of line, appended with text from indented next lines"
},
…
]
},
{
"number": 1,
"properties": {
…
},
"lines": [
…
]
},
…
]
}