Issues with Tesseract / ocrd-train and GT4HistOCR #73

Open
stweil opened this issue Aug 14, 2019 · 31 comments
Labels
pinned: Eternal issues which are safe from becoming stale

Comments

stweil commented Aug 14, 2019

I am starting this issue to collect my experiences with training Tesseract on GT4HistOCR using ocrd-train. The problems reported here may therefore be caused by Tesseract, by ocrd-train, or by GT4HistOCR.

stweil commented Aug 14, 2019

GT4HistOCR contains more than 300,000 pairs of line images and ground truth text, so `make training` needs a lot of processing time. It starts much faster without make's built-in implicit rules, i.e. by running `make -r training`.

In addition, pull request #72 avoids unnecessary child processes, which also improves performance.
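
For reference, this is roughly what the two invocations look like; a sketch only, assuming a checkout of this repository with the ground truth already in place (the MODEL_NAME variable follows the current Makefile and may differ in older ocrd-train versions):

```bash
# Default invocation: make also evaluates its built-in implicit rules
# for every one of the ~300,000 ground-truth files.
make training MODEL_NAME=GT4HistOCR

# -r disables the built-in implicit rules, so the dependency scan
# starts much faster; the training itself is unchanged.
make -r training MODEL_NAME=GT4HistOCR
```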

stweil commented Aug 14, 2019

From #72:

@jbaiter, dta19/1882-keller_sinngedicht/04970.nrm.png from GT4HistOCR is broken. `convert` cannot read it, and it also shows up as broken in the browser.
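
Broken images like this one can be found up front by letting ImageMagick try to decode every line image; just a sketch, assuming the corpus is unpacked below data/GT4HistOCR (adjust the path and pattern as needed):

```bash
# Print every PNG that ImageMagick cannot decode;
# identify exits with a non-zero status for broken files.
find data/GT4HistOCR -name '*.nrm.png' | while read -r img; do
    identify "$img" > /dev/null 2>&1 || echo "broken: $img"
done
```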

stweil commented Aug 14, 2019

Tesseract has an encoding problem with the ground truth text for the image dta19/1879-vischer_auch02/03739.nrm.png.

The text “Ἰάϰχε, Ἰάϰχε! Wie blitzen ihre großen Augen! Noch” is not an exact transcription of the image (the accent on the second word is wrong). Is there a Git repository for GT4HistOCR, or how else can we get wrong ground truth fixed?
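
To narrow down whether the problem lies in the file or in Tesseract, the raw bytes of the transcription can be inspected directly; a sketch, assuming the ground-truth text sits next to the image as 03739.gt.txt:

```bash
# Verify that the ground-truth text is valid UTF-8 ...
iconv -f UTF-8 -t UTF-8 dta19/1879-vischer_auch02/03739.gt.txt > /dev/null \
    && echo "valid UTF-8"

# ... and look at the first bytes, e.g. the encoding of the initial Ἰ.
head -c 16 dta19/1879-vischer_auch02/03739.gt.txt | xxd
```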

wrznr commented Aug 15, 2019

I think there is no repository available for GT4HistOCR yet. Maybe @tboenig and @kba could integrate it into the OCR-D GT repository? (Not sure whether that is possible from the legal side of things.)

wrznr commented Aug 15, 2019

Is the encoding problem caused by the incorrect transcription?

wrznr commented Aug 15, 2019

@stweil Btw. it is great that you share your insights here. Are you starting from scratch or from frk?

stweil commented Aug 15, 2019

Is the encoding problem caused by the incorrect transcription?

No, Tesseract already complains about the first character.

stweil commented Aug 15, 2019

Are you starting from scratch [...]?

Yes, I started from scratch. The results are online. They used 10,000, 100,000 and 300,000 iterations. Currently I am trying 900,000 iterations, although I have the impression that more than 100,000 iterations do not improve the model.

All of those trainings were still run with random lists of training data, so they cannot be reproduced.
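
For context, such a run is started roughly as follows; this is only a sketch with the model name as a placeholder (MAX_ITERATIONS is a variable of the tesstrain Makefile; to make a run reproducible, the generated list.train/list.eval files would have to be kept alongside the model):

```bash
# Train from scratch for 100,000 iterations, skipping make's implicit rules.
make -r training MODEL_NAME=GT4HistOCR MAX_ITERATIONS=100000
```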

I'd like to document the process, but not here under issues. Would the Wiki be a better place? It is currently empty.

stweil commented Aug 15, 2019

I have also run a first fine-tuning with a very small amount of data from GT4HistOCR. It was based on script/Fraktur, and the result looks promising.
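
A minimal sketch of such a fine-tuning run, assuming the script/Fraktur traineddata has been placed where the Makefile looks for start models (model name and iteration count are placeholders):

```bash
# Continue from the existing script/Fraktur model instead of training
# from scratch; START_MODEL selects the model to fine-tune.
make -r training MODEL_NAME=Fraktur_GT4HistOCR START_MODEL=Fraktur MAX_ITERATIONS=10000
```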

wrznr commented Aug 15, 2019

I sent you an invitation for collaboration. We would be delighted if you could share your experiences in the Wiki.

stweil commented Aug 16, 2019

See GT4HistOCR in the Wiki. It is still work in progress.

wrznr added the "pinned" label (Eternal issues which are safe from becoming stale) on Oct 1, 2019

wollmers commented Oct 5, 2019

@stweil You should note in the very first lines of the Wiki which version of Tesseract (or whatever else) you used.

Are performance results (precision, recall, f-score) available?

stweil commented Oct 5, 2019

Unless otherwise noted I always use the latest Tesseract (git master).

And no, sorry, I have not measured performance so far. It is not difficult, but it still costs time, and a non-quantitative evaluation is often sufficient for me.

See tesseract-ocr/tessdata#102 (comment) for a small recent test result.

We'll do qualitative evaluation of Tesseract in the OCR-D context.
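
For a rough character error rate, Tesseract's lstmeval can be run on a checkpoint and an evaluation list; a sketch with placeholder paths:

```bash
# Report character and word error rates of a trained checkpoint
# on the lines listed in list.eval.
lstmeval --model data/GT4HistOCR_checkpoint \
         --traineddata data/GT4HistOCR.traineddata \
         --eval_listfile data/list.eval
```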

wollmers commented Oct 6, 2019

@stweil It doesn't help a reader months or years from now that YOU know which version was used.

stweil commented Oct 6, 2019

Sure. That is why I added that information after your previous e-mail, though not in the very first lines, because not all tests used the same version. But I did write in the very first lines that this is still ongoing work.

wollmers commented Oct 6, 2019

@stweil The results in tesseract-ocr/tessdata#102 (comment) are mostly problems of the ground truth (GT) files. It is always important which level of GT is used or created. E.g. the DTA (Deutsches Textarchiv) tries to transcribe as accurately as possible to the original, including spelling errors and faults of manual typesetting. Of course some problems, like a reversed (upside-down) u or n, are difficult to transcribe or document. There is one book at the DTA where this happens very often; I assume that the cast letters of that time did not have a nick (German: Signatur, see https://en.wikipedia.org/wiki/Sort_(typesetting)).

A second useful level of normalisation for GT would be the level of current Unicode, using combining characters for e.g. a small e above. Some ligatures and combining forms are not available in Unicode; the same holds for some symbols, e.g. botanical ones. That is why MUFI, which uses the PUA, can still make sense for some academic researchers. But I would not recommend it for broad usage, because all tools in the workflow must be able to deal with it.

It would be nice if the experts could agree on some standard for GT files covering most of the corpora from the period 1750-1945. It should be Unicode-compliant and pragmatically close to the original. In my experience, before ~1750 orthography, typography, layout, line spacing, type designs and even Roman numerals are more chaotic and cannot easily be covered by a 'one for all' solution.

IMHO it is easier to convert a near-original GT file into one close to current German than the other way around.

Such a standard could be very practical as a central reference point for training tools/data, word lists, quality measures for training/Tesseract, post-processing, utilities and validating fonts (which is a story of its own).

wollmers commented Oct 6, 2019

@stweil In the case of

Ἰάϰχε, Ἰάϰχε! Wie blitzen ihre großen Augen! Noch

the problem seems to be caused by the quality of the image processing. https://ub-backup.bib.uni-mannheim.de/~stweil/ocrd-train/data/ground-truth/GT4HistOCR/dta19/1879-vischer_auch02/03739.nrm.png is wrong.

If you go to the DTA (registration is free), you will see on the page http://www.deutschestextarchiv.de/dtaq/book/view/vischer_auch02_1879/?p=161&hl=Wie that the high-quality original scan shows both Greek words with the exact spelling of the transcription:

ohnedieß taumelnden Kahn, trunken von Luſt ſchnalzt
ſie mit den Fingern, als ſchlüge ſie Caſtagnetten, und
jauchzt in den brauſenden Wind hinaus: Evoë! Evoë!
Ἰάϰχε, Ἰάϰχε! Wie blitzen ihre großen Augen! Noch
muthwilliger als vorhin, halbwild trifft mich ihr
Strahl! — Angſt wegen des Sturms kann ſie mir
nicht anſehen. Darum kann ſie mich nicht auslachen.

stweil commented Oct 6, 2019

Thank you for that information. This might be a systematic problem of the DTA images. CC'ing @jbaiter, @uvius and @chreul.

stweil commented Oct 6, 2019

It would be nice if the experts could agree to some standard for GT files covering most of the corpora in the period 1750-1945.

Maybe http://www.ocr-d.de/gt_guidelines?

wollmers commented Oct 6, 2019

It would be nice if the experts could agree to some standard for GT files covering most of the corpora in the period 1750-1945.

Maybe http://www.ocr-d.de/gt_guidelines?

Thanks for the link. The guidelines have some problems which make them unusable as a standard in a broader context, but they are better than most.

stweil commented Oct 6, 2019

I'm sure that the authors would be interested in suggestions for improving those guidelines. Maybe you can write them?

wollmers commented Oct 7, 2019

My idea is rather to collect existing guidelines of serious quality (the four mentioned in https://arxiv.org/ftp/arxiv/papers/1809/1809.05501.pdf: DTA, OCR-D, etc.), plus MUFI and some large non-German projects, review and compare them, and start an open standard on GitHub (like hOCR).

Roughly, there should be three levels:

  1. Use the character set of current writing for the language (per word), supported by standard keyboards. For German this would mean no long s and no round r, but keeping the spelling as faithfully as possible (seyn/sein, roth/rot, Thaal/Tal, Cactus/Kaktus, Photographie/Fotografie). Purpose: make texts better available for search engines, information retrieval, searchable PDFs and semantic processing.

  2. Use only Unicode-assigned characters. Most of MUFI can be built by combining a base character with combining characters. It may render in an ugly way, depending on the quality of the font. Experts using MUFI usually design their own fonts, i.e. they need a glyph for MUFI-specific characters, pre-composed in the font. You just need to connect the sequence of code points to a precomposed glyph, e.g. a + combining small e, or m + COMBINING TILDE (your example in the wiki https://github.com/tesseract-ocr/tesstrain/wiki/GT4HistOCR#inherited-characters; see the sketch after this list), or even map the sequence Ue to a glyph representing a non-existing ligature Ue; the same for ck, ch, sch etc. You can turn these OpenType (OT) features on or off if the software supports them. Modern browsers support them via CSS, and even the user can influence them via some JavaScript/CSS if the developer provides it. For ligatures there are the standard OT features 'liga' and 'hlig' (historical), and tailored ones are also possible. It is even possible to implement spelling rules for round versus long s. For the writer (human or program) this has a small disadvantage, because a Unicode control character (U+200C ZERO WIDTH NON-JOINER) has to be inserted to avoid automatic ligatures, e.g. if double long s is used systematically in the scanned document instead of sharp s. Unfortunately most of the freely available historic fonts are digitised by amateurs (LIGA-Fraktur), mapping ASCII characters like $ to long s, or implementing long-s spelling rules in such a way that it is not possible to write a single long s (e.g. in documents); the long s from the fallback font is rendered in this case (a sans serif in my case). Only a few have been brought to good quality, e.g. by Google Fonts: https://fonts.google.com/specimen/UnifrakturCook.

Using only Unicode (without the PUA) has a lot of advantages. Developers can use standard functionality tested against Unicode: no moving target, no reinventing the wheel. E.g. in Perl 5, which has the best Unicode support to my knowledge, you can use Unicode properties in regex rules, normalize to NFC, tokenize into graphemes with \X, and use Unicode property arithmetic. OK, Perl 6 is even better, using graphemes by default (I wrote the specs and spec tests for that part). Personally I do not understand why languages like Python, C++ and Java are so popular in an academic context. OO notation like string.match() is not really convenient, and using ICU has its errors; that is why Perl 5 and Perl 6 compile directly from the Unicode tables. In C++, mostly wchar (a 16-bit unsigned integer) is used, but Unicode needs 21 bits. Java I do not know exactly; it seems reliable (full UTF-16 support). But in my experience academic code does not build or install easily.

Using the PUA has a big disadvantage: you cannot use Unicode properties, because a private code point has no properties, unless you provide and maintain them with your own complicated code.

  3. Expert researchers may need more than Unicode. That is a level of tailored guidelines where they cannot expect standard support and have to develop their own tools. It is not a level that a standard OCR system or standard training should support.
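
To make the combining-character and ZWNJ points above concrete, here is a small illustration (assuming a bash >= 4.2 printf with \u escapes; the code points are the ones discussed in item 2):

```bash
# u followed by U+0364 COMBINING LATIN SMALL LETTER E: the historic u with a small e above
printf 'u\u0364\n'

# m followed by U+0303 COMBINING TILDE: the nasal abbreviation mark
printf 'm\u0303\n'

# Two long s (U+017F) separated by U+200C ZERO WIDTH NON-JOINER, which tells a font
# with active 'liga'/'hlig' features not to form a ligature between them.
printf '\u017F\u200C\u017F\n'
```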

@wollmers

@stweil Should we open an extra issue for ONB? Or an extra repository? I don't want to edit the wiki.

I just downloaded it, looked into the first page, ONB_aze_18950706_1, saw a darker spot and, bingo,

XML line 779 has a wrong transcription:

XML:  [...] Postspackessen-Check-Konto
CORR: [...] Postsparkassen-Check-Konto

Line 488

XML:  [...] Tabak= Trafiken u. Verschleitz¬
CORR: [...] Tabak=Trafiken u. Verschlei߬

The image quality is not so bad; the JPGs have 300 dpi or better. They are heavily compressed, which explains the small file sizes and the visible artefacts. The TIFFs were obviously digitised by Google, as Google digitises for the ONB. They seem to be binarised with a global method, which destroys information.

My plan is to filter or correct some of the errors automatically. My method is promising, but I do not have enough unpaid time to make this happen in the next few weeks. I am still working on the normalisation of transcription standards and the automatic normalisation of fonts, which is related.

stweil commented Jan 26, 2020

Ideally the ÖNB would offer a repository and accept pull requests to improve the ground truth texts. Do you want to ask them?

@wollmers

The authors of the GT are from Uni Innsbruck. I don't know whether the ONB has OCR experts at a scientific level. Sooner or later I will inform them of my work. The GT files are open source with a friendly license, so I can put them on GitHub and add the diffs to apply plus scripts for the corrections.

stweil commented Jan 27, 2020

I just wrote to the colleagues in Innsbruck and expect their answer next week. If they provide a public repository, I'll send our fixes as a pull request. Otherwise I could upload our internal repository, which not only fixes transcriptions but also adds the long s, to GitHub.

@wollmers

Sounds good. If it is based on the original version from Zenodo, then it's perfect.

stweil commented Jan 28, 2020

https://github.com/UB-Mannheim/AustrianNewspapers is now online.

stweil commented Jan 28, 2020

If it is based on the original version from Zenodo, then it's perfect.

See release 1.0.0.

@wollmers

Perfect.

wrznr commented Sep 8, 2021

@wollmers @stweil Can we close this issue?
