
Training Fraktur and historic handwriting from primers (German: "Fibeln")

This is an intermediate report on training done at Mannheim University Library (@UB-Mannheim). It is still a work in progress, so new results will be added in the future.

What are primers?

Primers ("Fibeln") are books used for teaching reading and writing. Typically they start with easy and frequent characters, character combinations and words and add more difficult and special ones over time. Therefore we expected that those books might be a good training material for neural networks, too.

Those primers usually contain Fraktur text, sometimes historic Antiqua, and often handwritten text as well.

In Germany, the Georg-Eckert-Institut (GEI) collects schoolbooks. Its online collection contains many primers.

Requirements

Use a recent Linux distribution. Newer distributions such as Debian Buster already package Tesseract, so it is not necessary to build Leptonica or Tesseract yourself.
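
A minimal install sketch, assuming Debian Buster package names (the packaged tesseract-ocr is assumed to include the training binaries such as lstmtraining, as the statement above implies):

```bash
# Install Tesseract plus the tools needed to run tesstrain
# (package names as on Debian Buster; assumed to ship the training binaries).
sudo apt install tesseract-ocr libtesseract-dev git make
tesseract --version   # verify that a recent 4.x release is installed
```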

Training requires a lot of disk space, so use a working directory with at least 24 GiB of free space.

Training is also CPU intensive: at least 4 CPU cores must be available, and a fast CPU with AVX support is highly recommended.
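
A few quick shell checks against these requirements (thresholds taken from this page):

```bash
df -h .                         # at least 24 GiB free in the working directory
nproc                           # at least 4 CPU cores
grep -m1 -o avx /proc/cpuinfo   # prints "avx" if the CPU supports AVX
```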

Preparing the data for training

Training data is available from https://github.com/UB-Mannheim/Fibeln. It can be used directly for training with tesstrain.
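
A minimal sketch of how the data could be wired up for tesstrain, which by default expects line images with matching .gt.txt transcriptions in data/<MODEL_NAME>-ground-truth; the source path inside the Fibeln repository is an assumption:

```bash
git clone https://github.com/UB-Mannheim/Fibeln.git
git clone https://github.com/tesseract-ocr/tesstrain.git
cd tesstrain
mkdir -p data/Fibeln-ground-truth
# Copy the line images and .gt.txt transcriptions into the place
# tesstrain expects them; the gt/ subdirectory is a placeholder.
cp ../Fibeln/gt/* data/Fibeln-ground-truth/
```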

Training for fine tuning

Fine tuning was based on a GT4HistOCR model. That model works quite well for most of the Fraktur and Antiqua texts, but performed surprisingly badly on some parts of the Fraktur texts.
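
A fine-tuning sketch using tesstrain's Makefile variables (MODEL_NAME, START_MODEL, TESSDATA, MAX_ITERATIONS); the tessdata path and iteration count are placeholders, and GT4HistOCR.traineddata is assumed to have been downloaded into that directory:

```bash
# Continue training from the GT4HistOCR start model.
make training \
    MODEL_NAME=Fibeln \
    START_MODEL=GT4HistOCR \
    TESSDATA=/path/to/tessdata \
    MAX_ITERATIONS=10000
```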

The first trained models are available from https://ub-backup.bib.uni-mannheim.de/~stweil/ocrd-train/Fibeln/. They are still untested.
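
A quick way to try one of the downloaded models on a sample page (file names are placeholders):

```bash
# Run OCR with the downloaded model; Fibeln.traineddata is assumed
# to be in the current directory.
tesseract page.png output --tessdata-dir . -l Fibeln
cat output.txt
```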

License

GEI metadata is licensed under CC0, and their digital images are in the public domain (see http://gei-digital.gei.de/viewer/pages/disclaimer/).