A practical guide to the OCR-D framework
The "OCR-D guide" helps developers writing software and using tools within the OCR-D ecosystem.
The OCR-D guide is a collection of concise recipes that provide pragmatic advice on how to
- bootstrap a development environment,
- work with the `ocrd` command line tool,
- manipulate METS and PAGE documents,
- create spec-compliant software.
Lines in code examples
- starting with `#` are comments;
- starting with `$` are typed shell input (everything after the `$` is);
- are output otherwise.
Words in ALL CAPS with a prepended `$` are variable names:
- `$METS_URL`: URL or file path to a `mets.xml` file, e.g. `https://github.com/OCR-D/assets/raw/master/data/kant_aufklaerung_1784/mets.xml`
- `$WORKSPACE_DIR`: File path of the workspace created, e.g. `/data/ocrd-workspaces/kant-aufklaerung-2018-07-11`
When referring to a "`something` command", it is actually `ocrd something` on the command line.
- Specification: Formal specifications
- Glossary: A glossary of terms in the OCR domain as used throughout our documentation
OCR-D development is targeted towards Ubuntu Linux >= 18.04 since it is free, widely used and well-documented.
Most of the setup will be the same for other Debian-based Linuxes and older Ubuntu versions. You might run into problems with outdated system packages though.
In particular, it can be tricky at times to install `tesseract` at the right
version. Try alex-p's PPA or build tesseract from source.
sudo apt install \
git \
build-essential \
python python-pip \
python3 python3-pip
- `git`: Version control, OCR-D uses git extensively
- `build-essential`: Installs `make` and C/C++ compiler
- `python`: Python 2.7 for legacy applications like `ocropy`
- `python3`: Current version of Python on which the OCR-D software core stack is built
- `pip`/`pip3`: Python package management
The OCR-D toolkit is based on a Python API that you can reuse if you are developing software in Python.
This API is exposed via a command line tool, `ocrd`. This CLI offers much of the
same functionality as the API without the need to write Python code and can be readily
integrated into shell scripts and external command callouts in your code.
So, if you do not intend to code in Python or want to wrap
existing/legacy tools, a major part of the functionality of the API is
available through the `ocrd` command line tool.
We strongly recommend using `virtualenv` (or similar tools if they are more
familiar to you) over system-wide installation of Python packages. It reduces
the amount of pain supporting multiple Python versions and allows you to test
your software in various configurations while you develop it, spinning up and
tearing down environments as necessary.
sudo apt install \
python3-virtualenv \
python-virtualenv # If you require Python2 compat
Create a `virtualenv` in an easy to remember or easy-to-search-shell-history-for location:
$ virtualenv -p python3.6 $HOME/ocrd-venv3
$ virtualenv -p python2.7 $HOME/ocrd-venv2 # If you require Python2 compat
You need to activate this virtual environment whenever you open a new terminal:
$ source $HOME/ocrd-venv3/bin/activate
If you tend to forget sourcing the script before working on your code, add
`source $HOME/ocrd-venv3/bin/activate` to the end of your `.bashrc`/`.zshrc` file and log
out and back in.
Make sure the `virtualenv` is activated and install `ocrd` with pip:
$ pip install ocrd
In this variant, you still need to install the `ocrd` Python package. But since
it's only used for its CLI (and as a dependency for Python-based OCR-D
software), you can install it system-wide:
$ sudo pip install ocrd
If you want to stay up-to-date on unreleased changes or to contribute code,
you can clone the repository and build the `ocrd` package from source:
$ git clone https://github.com/OCR-D/core
$ cd core
If you are using the python setup:
$ pip install -r requirements.txt
$ pip install -e .
If you are using the generic setup:
$ sudo pip install -r requirements.txt
$ sudo pip install .
After setting up, check that these commands do not throw errors and have the minimum version:
$ git --version
# Version 1.7 or higher?
$ make --version
# Version 9.0.1 or higher?
$ ocrd --version
# ocrd, version 0.4.0
Module projects (MP) are [git repositories](TODO spec) with at least a description of the MP and
its provided tools (`ocrd-tool.json`) and a `Makefile` for installing the MP into a suitable OS.
The `ocrd-tool.json` is a JSON file that describes the software of a particular MP. It serves mainly three purposes:
- providing a machine-actionable description of MP and the bundled tools and their parameters
- concise human-targeted descriptions as the foundation for the application documentation
- ensuring compatible definitions and interfaces, which is essential for sustainable, scalable workflows
This document focuses mainly on the first point.
The structure and syntax of the `ocrd-tool.json` is defined by a JSON
Schema and expects JSON Schema for the parameter definitions. In addition to the schema, the `ocrd`
command line tool can help you validate the `ocrd-tool.json`.
🔥 TODO 🔥
[kba] We need a better name, I cannot keep typing `ocrd-tool.json` all the time. Maybe simply `manifest.json` or `package.json` or `tool-desc` or something.
🔥 TODO 🔥
The `ocrd-tool.json` has two conceptual levels:
- Information about the MP as a whole and the people and processes involved
- Technical metadata on the level of the individual tools
Beyond the `ocrd-tool.json` file, it is part of the requirements that the tools
can provide the section of the `ocrd-tool.json` about 'themselves' at runtime
with the `-J`/`--dump-json` flags.
The reason for this redundancy is to make the tools inspectable at runtime and to prevent "feature drift" where the software evolves to the point where the description/documentation is out-of-date with the actual implementation.
From a developer's perspective, the easiest way to handle this is by bundling
the `ocrd-tool.json` into your software, e.g. by the following pattern:
- Store the `ocrd-tool.json` at a location where it is easy to deploy and access after installation
- Symlink it to the root of the repository: `ln -sr src/ocrd-tool.json .`
- Handle `--dump-json` by parsing the `ocrd-tool.json` and sending out the relevant section
- Validate input and provide defaults based on the JSON schema mechanics
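The `--dump-json` step of this pattern could look roughly like the following Python sketch. It is only an illustration under stated assumptions: the `OCRD_TOOL` dictionary stands in for a bundled `ocrd-tool.json`, and the function names `dump_tool_json` and `main` are hypothetical, not part of the OCR-D API.

```python
import json

# Hypothetical in-memory stand-in for a bundled ocrd-tool.json
OCRD_TOOL = {
    "version": "0.0.1",
    "tools": {
        "ocrd-wip-xyzzy": {
            "executable": "ocrd-wip-xyzzy",
            "description": "Nothing happens",
            "categories": ["Text recognition and optimization"],
            "steps": ["recognition/text-recognition"],
        }
    },
}


def dump_tool_json(ocrd_tool, executable):
    """Return the section of the ocrd-tool.json describing `executable` as JSON text."""
    try:
        section = ocrd_tool["tools"][executable]
    except KeyError:
        raise ValueError(f"{executable!r} is not described in ocrd-tool.json") from None
    return json.dumps(section, indent=2)


def main(argv):
    """Handle only the -J/--dump-json flags; everything else is out of scope here."""
    if "-J" in argv or "--dump-json" in argv:
        print(dump_tool_json(OCRD_TOOL, "ocrd-wip-xyzzy"))
        return 0
    return 1
```

Because the tool reads its own description from the same data that ships with the package, the runtime output cannot drift away from the bundled file.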
Required properties are bold.
- `version`: Version of the tool, adhering to Semantic Versioning
- `git_url`: URL of the GitHub repository
- `tools`: See next section
- `dockerhub`: The project's DockerHub URL
- `creators`: 🚨 TODO 🚨
- `institution`: 🚨 TODO 🚨
- `synopsis`: 🚨 TODO 🚨
Example:
{
  "version": "0.0.1",
  "name": "ocrd-blockissifier",
  "synopsis": "Tools for reasoning about how these blocks fit on this here page",
  "git_url": "https://github.com/johndoe/ocrd_blocksifier",
  "dockerhub": "https://hub.docker.com/r/johndoe/ocrd_blocksifier",
  "authors": [{
    "name": "John Doe",
    "email": "[email protected]",
    "url": "johndoe.github.io"
  }],
  "bugs": {
    "url": "https://github.com/sindresorhus/temp-dir/issues"
  },
  "tools": {
    /* see next section */
  }
}
The `tools` section is an object with the key being the name of the executable described and the value being an object with the following properties (bold means required):
- `executable`: Name of the executable. Must match the key and start with `ocrd-`
- `parameters`: Description of the parameters this tool accepts
- `description`: Concise description of what the tool does
- `categories`: Tools belong to these categories, representing modules within the OCR-D project structure; the list is part of the specs
- `steps`: This tool can be used at these steps in the OCR-D functional model; the list of values is in the specs
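The naming rule for `executable` can be checked mechanically. The following is a minimal sketch, assuming the `tools` object has already been loaded into a Python dictionary; `check_tools_section` is a hypothetical helper and the message format is merely modelled on the validation reports shown elsewhere in this guide.

```python
def check_tools_section(tools):
    """Return a list of problems found in the `tools` object of an ocrd-tool.json."""
    problems = []
    for key, tool in tools.items():
        executable = tool.get("executable")
        # The key of the entry and its 'executable' property must be identical
        if executable != key:
            problems.append(
                f"[tools.{key}] 'executable' must match the key, got {executable!r}"
            )
        # Every executable name must carry the ocrd- prefix
        if not key.startswith("ocrd-"):
            problems.append(f"[tools.{key}] executable name must start with 'ocrd-'")
    return problems
```

An empty list means both rules hold for every tool; this sketch does not check the other properties.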
Required properties are bold.
- `type`: What kind of parameter this is, either a `string`, a `number` or a `boolean`
- `format`: Subtype defining the syntax of the value, such as `float`/`integer` for numbers or `uri` for `string`
- `required`: If true, this parameter must be provided by the user
- `default`: Default value if not required
- `enum`: List of possible values if a fixed list
`required: true` and setting `default` are mutually exclusive.
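As an illustration of these rules, a parameter definition could be sanity-checked as sketched below. This assumes the definition is available as a Python dictionary; `check_parameter` and `ALLOWED_TYPES` are hypothetical names, and the real check is done by the JSON Schema, not by this code.

```python
ALLOWED_TYPES = {"string", "number", "boolean"}


def check_parameter(name, definition):
    """Return a list of problems found in a single parameter definition."""
    problems = []
    # 'type' must be one of the three kinds listed above
    if definition.get("type") not in ALLOWED_TYPES:
        problems.append(f"[{name}] 'type' must be one of {sorted(ALLOWED_TYPES)}")
    # A required parameter cannot also carry a default value
    if definition.get("required") is True and "default" in definition:
        problems.append(f"[{name}] 'required: true' and 'default' are mutually exclusive")
    return problems
```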
All MP should provide a Makefile with at least two targets: `deps` and `install`.
`make deps` should install any dependencies, such as required Python modules.
`make install` should install the executable(s) into `$(PREFIX)/bin`.
`make test` should start the unit/regression test suite if provided.
For MP written in Python, `make deps` should install dependencies with `pip`.
`make install` should call `python setup.py install`.
See the makefile of the `ocrd_kraken` project for an example.
For other MP, `make deps` should install dependencies either by compiling from source or using `apt-get`.
`make install` should
- Copy the executables to `$(PREFIX)/bin`, creating `$(PREFIX)/bin` if necessary.
- Copy any required files to `$(PREFIX)/share/<name-of-the-package>`, creating the latter if necessary.
See the makefile of the `ocrd_olena` project for an example.
METS is the container format of choice for OCR-D because it is widely used in digitization workflows in cultural heritage institutions.
A METS file references files in file groups and can contain a variety of metadata. The details can be found in the specs.
Within the OCR-D toolkit, we use the term "workspace", a folder containing a
file `mets.xml` and any number of the files referenced by the METS.
One can think of the `mets.xml` as the MANIFEST of a JAR or the `.git` folder
of a git repository.
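To make the relation between file groups and referenced files concrete, here is a minimal sketch that reads a METS document with Python's standard library. The `SAMPLE_METS` document and its `example.org` URL are made up for illustration, and `files_by_group` is a hypothetical helper that does not use the OCR-D API.

```python
import xml.etree.ElementTree as ET

NS = {
    "mets": "http://www.loc.gov/METS/",
    "xlink": "http://www.w3.org/1999/xlink",
}

# Made-up, heavily reduced METS document for illustration only
SAMPLE_METS = """<mets:mets xmlns:mets="http://www.loc.gov/METS/"
    xmlns:xlink="http://www.w3.org/1999/xlink">
  <mets:fileSec>
    <mets:fileGrp USE="OCR-D-IMG">
      <mets:file ID="IMG-0020" MIMETYPE="image/tiff">
        <mets:FLocat LOCTYPE="URL"
            xlink:href="https://example.org/kant_aufklaerung_1784_0020.tif"/>
      </mets:file>
    </mets:fileGrp>
  </mets:fileSec>
</mets:mets>"""


def files_by_group(mets_xml):
    """Map each file group's USE attribute to the list of referenced URLs."""
    root = ET.fromstring(mets_xml)
    result = {}
    for grp in root.iterfind(".//mets:fileGrp", NS):
        hrefs = [
            flocat.get("{%s}href" % NS["xlink"])
            for flocat in grp.iterfind("mets:file/mets:FLocat", NS)
        ]
        result[grp.get("USE")] = hrefs
    return result
```

In day-to-day work the `ocrd workspace` command covers this, so hand-written XML parsing like the above should rarely be necessary.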
The `workspace` command of the `ocrd` tool allows various manipulations of
workspaces and therefore METS files.
The `workspace` command's syntax and mechanics are strongly inspired by
`git`, so if you know `git`, this should be familiar.
| `git` | `ocrd workspace` |
|---|---|
| `init` | `init` |
| `clone` | `clone` |
| `add` | `add` |
| `ls-files` | `find` |
| `fetch` | `find --download` |
| `archive` | `pack` |
For most commands, `workspace` assumes the workspace is the current working
directory. If you want to use a different directory, use the `-d` / `--directory` option:
# Listing files in the workspace at $PWD
$ ocrd workspace find
# Listing files in the workspace at $WORKSPACE_DIR
$ ocrd workspace -d $WORKSPACE_DIR find
According to convention, the METS of a workspace is named `mets.xml`.
To select a different basename for that file, use the `-M` / `--mets-basename` option:
# Assume this workspace structure
$ find $WORKSPACE_DIR
$WORKSPACE_DIR
$WORKSPACE_DIR/mets3000.xml
# This will fail in a loud and unpleasant manner
$ ocrd workspace -d $WORKSPACE_DIR find
# This will not
$ ocrd workspace -d $WORKSPACE_DIR -M mets3000.xml find
To create an empty workspace to which you can add files, use the `workspace init` command:
$ ocrd workspace init ws1
/home/ocr/ws1
To create a workspace and save a METS file, use the `workspace clone` command:
$ ocrd workspace clone $METS_URL new-workspace
/home/ocr/new-workspace
$ find new-workspace
new-workspace
new-workspace/mets.xml
To not only clone the METS but also download the contained files, use
`workspace clone` with the `--download` flag:
$ ocrd workspace clone --download $METS_URL $WORKSPACE_DIR
$ find $WORKSPACE_DIR
$WORKSPACE_DIR
$WORKSPACE_DIR/mets.xml
$WORKSPACE_DIR/OCR-D-GT-ALTO
$WORKSPACE_DIR/OCR-D-GT-ALTO/kant_aufklaerung_1784_0020.xml
$WORKSPACE_DIR/OCR-D-GT-PAGE
$WORKSPACE_DIR/OCR-D-GT-PAGE/kant_aufklaerung_1784_0020.xml
$WORKSPACE_DIR/OCR-D-IMG
$WORKSPACE_DIR/OCR-D-IMG/kant_aufklaerung_1784_0020.tif
NOTE: This will download all files, which can mean hundreds of
high-resolution images. If you want more fine-grained control,
clone the bare workspace and then use the `workspace find` command
with the `--download` flag.
You can search the files in a METS file with the `workspace find` command.
- All files: `ocrd workspace find`
- All TIFF files: `ocrd workspace find --mimetype image/tiff`
- All TIFF files in the OCR-D-IMG-BIN group: `ocrd workspace find --mimetype image/tiff --file-grp OCR-D-IMG-BIN`
See `ocrd workspace find --help` for the full range of selection options.
To download remote or copy local files referenced in the `mets.xml` to the
workspace, append the `--download` flag to the `workspace find` command:
# Clone bare workspace:
$ ocrd workspace clone $METS_URL $WORKSPACE_DIR
$ find $WORKSPACE_DIR
$WORKSPACE_DIR
$WORKSPACE_DIR/mets.xml
# Download all files in the `OCR-D-IMG` file group
$ ocrd workspace -d $WORKSPACE_DIR find --file-grp OCR-D-IMG --download
[...]
$ find $WORKSPACE_DIR
$WORKSPACE_DIR
$WORKSPACE_DIR/mets.xml
$WORKSPACE_DIR/OCR-D-IMG
$WORKSPACE_DIR/OCR-D-IMG/kant_aufklaerung_1784_0020.tif
The convention is that files will be downloaded to `$WORKSPACE_DIR/$FILE_GROUP/$BASENAME`, where
- `$FILE_GROUP` is the `@USE` attribute of the `mets:fileGrp`
- `$BASENAME` is the last URL segment of the `@xlink:href` attribute of the `mets:FLocat`
NOTE: Downloading a file not only copies the file to the `$WORKSPACE_DIR`
but also changes the URL of the file from its original to the absolute file
path of the downloaded file.
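The naming convention for downloaded files can be expressed in a few lines. This is a sketch of the convention as described here, not the code `ocrd` itself runs; `download_target` is a hypothetical helper and the URL in the usage line is made up.

```python
import posixpath
from urllib.parse import urlparse


def download_target(workspace_dir, file_group, href):
    """Build $WORKSPACE_DIR/$FILE_GROUP/$BASENAME for a referenced file."""
    # $BASENAME is the last segment of the URL path (query strings are ignored)
    basename = posixpath.basename(urlparse(href).path)
    return posixpath.join(workspace_dir, file_group, basename)


target = download_target(
    "/data/ocrd-workspaces/kant-aufklaerung-2018-07-11",
    "OCR-D-IMG",
    "https://example.org/data/kant_aufklaerung_1784_0020.tif",
)
```

With these inputs, `target` ends in `OCR-D-IMG/kant_aufklaerung_1784_0020.tif` below the workspace directory.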
When running a module project, new files are created (PAGE XML, images ...). To
register these new files, they need to be added to the `mets.xml` as a
`mets:file` with a `mets:FLocat` within a `mets:fileGrp`, each with the right
attributes. The `workspace add` command makes this possible:
$ ocrd workspace -d $WORKSPACE_DIR find -k local_filename
$WORKSPACE_DIR/OCR-D-IMG/page0013.tif
$ ocrd workspace -d $WORKSPACE_DIR add \
--file-grp OCR-D-IMG-BIN \
--file-id PAGE-0013-BIN \
--mimetype image/png \
--group-id PAGE-0013 \
page0013binarized.png
$ ocrd workspace -d $WORKSPACE_DIR find -k local_filename
$WORKSPACE_DIR/OCR-D-IMG/page0013.tif
$WORKSPACE_DIR/OCR-D-IMG-BIN/page0013binarized.png
To ensure a METS file and the workspace it describes adhere to the OCR-D
specs, use the `workspace validate` command:
# Create a bare workspace
$ ocrd workspace init $WORKSPACE_DIR
# Validate
$ ocrd workspace -d $WORKSPACE_DIR validate
<report valid="false">
  <error>METS has no unique identifier</error>
  <error>No files</error>
</report>
# Oops, let's set the identifier ...
$ ocrd workspace -d $WORKSPACE_DIR set-id 'scheme://my/identifier/syntax/kant_aufklaerung_1784'
# ... and add a file
$ ocrd workspace -d $WORKSPACE_DIR add -G OCR-D-IMG-BIN -i PAGE-0013-BIN -m image/png -g PAGE-0013 page0013binarized.png
# Validate again
$ ocrd workspace -d $WORKSPACE_DIR validate
<report valid="true">
</report>
This command helps you to explore and validate the information in any `ocrd-tool.json`.
The syntax is `ocrd ocrd-tool /path/to/ocrd-tool.json SUBCOMMAND`.
Validate that an `ocrd-tool.json` is syntactically valid and adheres to the schema.
This is useful while developing to make sure there are no typos and all required properties are set.
$ ocrd ocrd-tool /path/to/ocrd_wip/ocrd-tool.json validate
<report valid="false">
<error>[tools.ocrd-wip-xyzzy] 'steps' is a required property</error>
<error>[tools.ocrd-wip-xyzzy] 'categories' is a required property</error>
<error>[] 'version' is a required property</error>
</report>
This example shows that the `ocrd-wip-xyzzy` executable is missing the required `steps` and
`categories` properties and the root level object is missing the `version` property.
Adding them should result in:
$ ocrd ocrd-tool /path/to/ocrd_wip/ocrd-tool.json validate
<report valid="true">
</report>
These commands are used for enumerating the executables contained in an
`ocrd-tool.json` and getting root level metadata, such as the version.
$ ocrd ocrd-tool /path/to/ocrd_wip/ocrd-tool.json version
0.0.1
# List all the tools (executables), one per line
$ ocrd ocrd-tool /path/to/ocrd_wip/ocrd-tool.json list-tools
ocrd-wip-xyzzy
ocrd-wip-frobozz
This set of commands allows introspection of the metadata on individual
tools within an `ocrd-tool.json`.
The syntax is `ocrd ocrd-tool /path/to/ocrd-tool.json tool EXECUTABLE SUBCOMMAND`.
$ ocrd ocrd-tool /path/to/ocrd_wip/ocrd-tool.json tool ocrd-wip-xyzzy dump
{
  "description": "Nothing happens",
  "categories": ["Text recognition and optimization", "Arcane Magic"],
  "steps": ["recognition/text-recognition"],
  "executable": "ocrd-wip-xyzzy"
}
# Description
$ ocrd ocrd-tool /path/to/ocrd_wip/ocrd-tool.json tool ocrd-wip-xyzzy description
Nothing happens
# List categories one per line
$ ocrd ocrd-tool /path/to/ocrd_wip/ocrd-tool.json tool ocrd-wip-xyzzy categories
Text recognition and optimization
Arcane Magic
# List steps one per line
$ ocrd ocrd-tool /path/to/ocrd_wip/ocrd-tool.json tool ocrd-wip-xyzzy steps
recognition/text-recognition
The details of how a tool is configured at run-time are determined by parameters. When a parameter file is passed to a tool, it should:
- ensure it is valid JSON
- validate according to the parameter schema
- add default values when no explicit values were provided
The `ocrd ocrd-tool tool parse-params` command does just that and can output
the resulting default-enriched parameters as either JSON or as shell script
assignments to evaluate:
# Get JSON
$ ocrd ocrd-tool /path/to/ocrd_wip/ocrd-tool.json tool ocrd-wip-xyzzy parse-params --json -p <(echo '{"val1": 42, "val2": false}')
{
  "val1": 42,
  "val2": false,
  "val-with-default": 23
}
# Get back shell assignments to an associative array "params"
$ ocrd ocrd-tool /path/to/ocrd_wip/ocrd-tool.json tool ocrd-wip-xyzzy parse-params -p <(echo '{"val1": 42, "val2": false}')
params["val1"]="42"
params["val2"]="false"
params["val-with-default"]="23"
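The three steps (valid JSON, validation, defaults) can be sketched in plain Python as follows. This is not how `parse-params` is implemented: `parse_params`, `to_shell_assignments` and the example `SCHEMA` are assumptions made for illustration, and the sketch does no shell escaping of string values.

```python
import json

# Hypothetical parameter definitions, in the shape described in this guide
SCHEMA = {
    "val1": {"type": "number", "required": True},
    "val2": {"type": "boolean", "required": True},
    "val-with-default": {"type": "number", "default": 23},
}


def matches_type(value, type_name):
    """Check a JSON value against 'string', 'number' or 'boolean'."""
    if type_name == "boolean":
        return isinstance(value, bool)
    if type_name == "number":
        # bool is a subclass of int in Python, so exclude it explicitly
        return isinstance(value, (int, float)) and not isinstance(value, bool)
    return isinstance(value, str)


def parse_params(schema, params_json):
    """Parse JSON text, validate it against the schema and add defaults."""
    params = json.loads(params_json)  # raises ValueError on invalid JSON
    for name, value in params.items():
        if name not in schema:
            raise ValueError(f"unknown parameter {name!r}")
        if not matches_type(value, schema[name]["type"]):
            raise ValueError(f"parameter {name!r} is not a {schema[name]['type']}")
    for name, definition in schema.items():
        if name in params:
            continue
        if definition.get("required"):
            raise ValueError(f"missing required parameter {name!r}")
        if "default" in definition:
            params[name] = definition["default"]
    return params


def to_shell_assignments(params):
    """Render parameters as assignments to an associative array 'params'."""
    lines = []
    for name, value in params.items():
        rendered = value if isinstance(value, str) else json.dumps(value)
        lines.append(f'params["{name}"]="{rendered}"')
    return "\n".join(lines)
```

Rendering non-string values with `json.dumps` keeps booleans as `true`/`false`, which is one reasonable choice for values consumed by shell scripts.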
OCR requires multiple steps, such as binarization, layout recognition, text recognition etc. These steps are implemented with command line tools that adhere to the same command line interface which makes it straightforward to chain these calls.
For example, to run kraken binarization and tesseract block segmentation, one could execute:
ocrd-kraken-binarize -l DEBUG -I OCR-D-IMG -O OCR-D-IMG-BIN
ocrd-tesserocrd-segment-block -l DEBUG -I OCR-D-IMG-BIN -O OCR-D-SEG-BLOCK -p tesseract-params.json
The disadvantage of individual calls is that it requires the user to check whether runs were
actually successful. To remedy this, users can use the `ocrd process` CLI which
- simplifies the CLI syntax for multiple calls
- checks for required and expected-to-be-produced file groups
- checks for return value
- sets logging levels uniformly across tools
The same calls mentioned before can be passed to `ocrd process` as follows:
ocrd process -l DEBUG \
"kraken-binarize -l DEBUG -I OCR-D-IMG -O OCR-D-IMG-BIN" \
"tesserocrd-segment-block -l DEBUG -I OCR-D-IMG-BIN -O OCR-D-SEG-BLOCK -p tesseract-params.json"
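The file group check can be pictured as follows. This sketch only illustrates the idea of verifying that every input group exists before a task runs and that no output group is produced twice; it is an assumption about the kind of check involved, not the logic `ocrd process` actually uses, and `check_task_chain` is a hypothetical name.

```python
import shlex


def check_task_chain(tasks, existing_groups):
    """Return problems with the -I/-O file groups of a sequence of task strings."""
    available = set(existing_groups)
    problems = []
    for task in tasks:
        args = shlex.split(task)
        executable = args[0]
        # Pair every argument with its successor to pick up flag values
        for flag, value in zip(args[1:], args[2:]):
            if flag == "-I" and value not in available:
                problems.append(f"{executable}: input file group {value} does not exist")
            elif flag == "-O":
                if value in available:
                    problems.append(f"{executable}: output file group {value} already exists")
                available.add(value)
    return problems


TASKS = [
    "kraken-binarize -l DEBUG -I OCR-D-IMG -O OCR-D-IMG-BIN",
    "tesserocrd-segment-block -l DEBUG -I OCR-D-IMG-BIN -O OCR-D-SEG-BLOCK -p tesseract-params.json",
]
problems = check_task_chain(TASKS, ["OCR-D-IMG"])
```

For the two tasks above and a workspace that already contains `OCR-D-IMG`, the list of problems is empty; reversing the task order reports the missing `OCR-D-IMG-BIN` input.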
This section describes how you can make an existing tool OCR-D compliant, i.e. provide a CLI which implements all the specs and calls out to another executable.
For this purpose, the `ocrd` command line tool offers a `bash` library that handles:
- command line option parsing
- on-line help
- parsing and providing defaults for parameters
The shell library is bundled with the `ocrd` command line tool and can be accessed with the
`ocrd bashlib` command.
To get the filename of the shell lib, use `ocrd bashlib filename`, which you
can employ to source the shell code in a wrapper script. After sourcing this script
you will have access to a number of shell functions that begin with `ocrd__`.
The only function you definitely need is `ocrd__wrap`, which parses an
`ocrd-tool.json` and scaffolds a spec-compliant CLI, parses command line
arguments and parameters and lets the developer then react to the inputs.
In combination with the `ocrd workspace` command this allows
you to write CLI applications without touching any METS or PAGE/XML files by hand.
`ocrd__wrap` has this signature:
ocrd__wrap OCRD_TOOL_JSON EXECUTABLE_NAME ...ARGS
where
- `OCRD_TOOL_JSON` is the path to the `ocrd-tool.json`
- `EXECUTABLE_NAME` is the name of an executable within `OCRD_TOOL_JSON`
- `...ARGS` are 0..n command line arguments passed on from the user
Example:
ocrd__wrap /usr/share/ocrd-wip/ocrd-tool.json ocrd-wip-xyzzy "$@"