Fibeln
This is an intermediate report on OCR training done at Mannheim University Library (@UB-Mannheim). It is still a work in progress; new results will be added over time.
Primers ("Fibeln") are books used for teaching reading and writing. Typically they start with easy and frequent characters, character combinations and words and add more difficult and special ones over time. Therefore we expected that those books might be a good training material for neural networks, too.
These primers usually contain Fraktur text, sometimes historic Antiqua, and occasionally also handwritten text.
In Germany, the Georg-Eckert-Institut (GEI) collects schoolbooks. Its online collection contains many primers.
Use a recent Linux distribution. Newer distributions such as Debian Buster already provide Tesseract packages, so it is not necessary to build Leptonica or Tesseract yourself.
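On Debian Buster, for example, installation could look like the sketch below; the package name is the standard Debian one, and whether the training tools ship in the same package may depend on the distribution's packaging:

```sh
# Install Tesseract from the distribution packages (Debian Buster and newer).
sudo apt update
sudo apt install tesseract-ocr

# Depending on the packaging, the LSTM training tools may be included:
command -v lstmtraining   # prints a path if the training tools are installed
```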
Training requires a lot of disk space, so use a working directory with at least 24 GiB of free space. It is also CPU intensive: at least 4 CPU cores must be available, and a fast CPU with AVX support is highly recommended.
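A quick sanity check of these prerequisites might look like this (the working directory path is a placeholder):

```sh
# Free disk space in the intended working directory (needs at least 24 GiB).
df -h /path/to/workdir

# Number of available CPU cores (needs at least 4).
nproc

# Whether the CPU supports AVX (prints "avx" if it does).
grep -m1 -o avx /proc/cpuinfo
```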
Training data is available from https://github.com/UB-Mannheim/Fibeln. It can be used directly for training with tesstrain.
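A minimal sketch of how the ground truth could be wired into tesstrain; the ground-truth directory convention follows the tesstrain README, while the path inside the Fibeln repository is an assumption:

```sh
# Fetch tesstrain and the Fibeln ground truth.
git clone https://github.com/tesseract-ocr/tesstrain
git clone https://github.com/UB-Mannheim/Fibeln

# tesstrain looks for line images and transcriptions under
# data/<MODEL_NAME>-ground-truth; the source layout inside the
# Fibeln repository is an assumption here.
mkdir -p tesstrain/data/Fibeln-ground-truth
cp Fibeln/gt/* tesstrain/data/Fibeln-ground-truth/
```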
Fine-tuning was based on a GT4HistOCR model. That model works quite well for most of the Fraktur and Antiqua texts, but performed surprisingly badly on some parts of the Fraktur texts.
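With tesstrain, such a fine-tuning run boils down to a make call along the following lines (variable names as in the tesstrain Makefile; the tessdata path and iteration count are placeholders):

```sh
# Continue training from the GT4HistOCR start model.
# TESSDATA must point to a directory containing GT4HistOCR.traineddata.
cd tesstrain
make training MODEL_NAME=Fibeln START_MODEL=GT4HistOCR \
     TESSDATA=/path/to/tessdata MAX_ITERATIONS=100000
```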
The first trained models are available from https://ub-backup.bib.uni-mannheim.de/~stweil/ocrd-train/Fibeln/. They are still untested.
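Trying out one of these models on a page image could look like this; the exact .traineddata file name on the server and the sample image are placeholders:

```sh
# Download a model and recognize a sample page with it
# (the file name Fibeln.traineddata is an assumption).
wget https://ub-backup.bib.uni-mannheim.de/~stweil/ocrd-train/Fibeln/Fibeln.traineddata
tesseract page.png page --tessdata-dir . -l Fibeln
# The recognized text is written to page.txt.
```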
GEI metadata is licensed under CC0, and the digital images are in the public domain (see http://gei-digital.gei.de/viewer/pages/disclaimer/).