Replies: 1 comment
-
Dear Janusz, this is an interesting problem, and I think that it can be addressed with Gamera. We already have page segmentation algorithms, so you can segment a page into its text blocks (e.g. paragraphs). If I understand you correctly, each paragraph can use a different font, and you would like to classify the font automatically. A simple approach would be to create training data with the font id as class names instead of the character names. Then you could count the fraction of glyphs assigned to each font per text block and classify the block accordingly. A more reliable way would presumably be to automatically measure the distance between fonts and identify characteristic glyphs for each font. The "confidence" values returned by the classifier should be helpful for this purpose. Concerning the unknown characters: are these remnants of medieval abbreviations, e.g. "et" and "con"? All in all, this is an interesting use case that would make a nice project for a bachelor thesis. Maybe we can discuss this directly via email (see the Gamera homepage for my contact information). Christoph
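The counting step suggested above could look roughly like the following. This is only a minimal sketch in plain Python, not Gamera code: it assumes that each glyph in a text block has already been assigned a font id by a classifier trained with font ids as class names. The function name `classify_block`, the `min_fraction` threshold and the sample font ids are hypothetical and chosen for illustration.

```python
from collections import Counter


def classify_block(glyph_fonts, min_fraction=0.5):
    """Pick the dominant font of one text block.

    glyph_fonts  -- iterable of font ids, one per classified glyph
                    (assumed to come from a classifier run beforehand).
    min_fraction -- hypothetical threshold: the share of glyphs a font
                    must reach for the block to be assigned to it.

    Returns (font_id, fractions); font_id is None when the block is
    empty or no font reaches min_fraction.
    """
    counts = Counter(glyph_fonts)
    total = sum(counts.values())
    if total == 0:
        return None, {}
    # Fraction of glyphs attributed to each font in this block.
    fractions = {font: float(n) / total for font, n in counts.items()}
    best_font, best_fraction = max(fractions.items(), key=lambda kv: kv[1])
    if best_fraction < min_fraction:
        return None, fractions
    return best_font, fractions


# Made-up labels for a block of four glyphs.
font, fractions = classify_block(["font1", "font1", "font3", "font1"])
print(font, fractions)
```

A block that comes back as `None` is mixed or ambiguous and is probably worth inspecting by eye. Weighting each vote by the classifier's confidence value would be a natural refinement.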
-
The impressive progress in handwritten text recognition (Transkribus is claimed to recognize 19th-century German handwriting better than a typical untrained person) has not been accompanied by similar progress in OCR proper. By OCR proper I mean systems which identify printed characters individually and, moreover, can identify the font of the characters.
In recent years my interests have focused on old Polish prints, in particular those documented in the series "Polonia Typographica"; unfortunately only two volumes are available online (https://polona.pl/item/54430363, https://polona.pl/item/54430364).
I hope to use Gamera to solve two specific problems (I don't have any deadline, so I can make haste slowly):
There is a text (Zaborowski's treatise on Polish spelling) typeset in the 16th century, reportedly with 4 fonts; specimens of those fonts are provided in "Polonia Typographica". The first question is: which part of the text is typeset with which font (I don't trust my eyes)? The second question is: are the font specimens complete and non-redundant (they contain some characters I am unable to identify)?
The font listed as number 1 in volume III of "Polonia Typographica" was used in only three publications. All of them have been digitized and are freely available on the Internet. This made it possible to compare the inventory of the font presented in "Polonia Typographica" with the actual texts. There seems to be a surprisingly large number of discrepancies. I would like to confirm my impression by OCRing the texts in question.
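Once the texts are OCRed, comparing the published inventory with what actually occurs in the prints comes down to comparing two sets of glyph classes. The sketch below is only an illustration in plain Python under that assumption; the function name `compare_inventory` and the sample glyph names are hypothetical and do not come from "Polonia Typographica" or from Gamera.

```python
from collections import Counter


def compare_inventory(specimen, observed):
    """Compare a font specimen with the glyph classes found in the texts.

    specimen -- glyph class names listed in the published inventory
    observed -- glyph class names recognized in the OCRed texts,
                one entry per glyph occurrence

    Returns (unused, missing):
      unused  -- sorted specimen entries never seen in the texts
      missing -- dict of classes seen in the texts but absent from the
                 specimen, mapped to their number of occurrences
    """
    specimen = set(specimen)
    counts = Counter(observed)
    unused = sorted(specimen - set(counts))
    missing = {g: n for g, n in counts.items() if g not in specimen}
    return unused, missing


# Made-up example data.
unused, missing = compare_inventory(
    ["a", "b", "long_s", "et"],
    ["a", "a", "b", "con"],
)
print(unused, missing)
```

The occurrence counts help to tell a genuine gap in the inventory from a one-off misrecognition.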
More information on these problems can be found at https://www.researchgate.net/profile/Janusz-Bien/research. The papers are in Polish, but they contain many illustrations which can give some idea of their content.
I will appreciate both general comments and practical hints (unfortunately I have never been fluent with Gamera). My quick and dirty attempts are illustrated by the files attached to #14.