Issues with Tesseract / ocrd-train and GT4HistOCR #73

Open
stweil opened this issue Aug 14, 2019 · 31 comments
Labels
pinned: Eternal issues which are safe from becoming stale

Comments

stweil commented Aug 14, 2019

I am starting this issue to collect my experiences with training Tesseract on GT4HistOCR using ocrd-train. The problems reported here may therefore be caused by Tesseract, by ocrd-train, or by GT4HistOCR.

stweil commented Aug 14, 2019

GT4HistOCR contains more than 300,000 pairs of line images and ground truth text, so `make training` needs a lot of processing time. It starts much faster without make's built-in implicit rules, i.e. by running `make -r training`.

In addition, pull request #72 avoids unnecessary child processes, which also improves performance.
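
For reference, this is roughly what the two invocations look like; a sketch only, assuming a checkout of this repository with the ground truth already in place (the MODEL_NAME variable follows the current Makefile and may differ in older ocrd-train versions):

```bash
# Default invocation: make also evaluates its built-in implicit rules
# for every one of the ~300,000 ground-truth files.
make training MODEL_NAME=GT4HistOCR

# -r disables the built-in implicit rules, so the dependency scan
# starts much faster; the training itself is unchanged.
make -r training MODEL_NAME=GT4HistOCR
```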

stweil commented Aug 14, 2019

From #72:

@jbaiter, dta19/1882-keller_sinngedicht/04970.nrm.png from GT4HistOCR is broken. `convert` cannot read it, and it also shows up as broken in the browser.
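
Broken images like this one can be found up front by letting ImageMagick try to decode every line image; just a sketch, assuming the corpus is unpacked below data/GT4HistOCR (adjust the path and pattern as needed):

```bash
# Print every PNG that ImageMagick cannot decode;
# identify exits with a non-zero status for broken files.
find data/GT4HistOCR -name '*.nrm.png' | while read -r img; do
    identify "$img" > /dev/null 2>&1 || echo "broken: $img"
done
```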

stweil commented Aug 14, 2019

Tesseract has an encoding problem with the ground truth text for the image dta19/1879-vischer_auch02/03739.nrm.png.

The text “Ἰάϰχε, Ἰάϰχε! Wie blitzen ihre großen Augen! Noch” is not an exact transcription of the image (the accent on the second word is wrong). Is there a Git repository for GT4HistOCR, or how else can we get wrong ground truth fixed?
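
To narrow down whether the problem lies in the file or in Tesseract, the raw bytes of the transcription can be inspected directly; a sketch, assuming the ground-truth text sits next to the image as 03739.gt.txt:

```bash
# Verify that the ground-truth text is valid UTF-8 ...
iconv -f UTF-8 -t UTF-8 dta19/1879-vischer_auch02/03739.gt.txt > /dev/null \
    && echo "valid UTF-8"

# ... and look at the first bytes, e.g. the encoding of the initial Ἰ.
head -c 16 dta19/1879-vischer_auch02/03739.gt.txt | xxd
```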

wrznr commented Aug 15, 2019

I think there is no repository available for GT4HistOCR yet. Maybe @tboenig and @kba could integrate it into the OCR-D GT repository? (Not sure whether that is possible from the legal side of things.)

wrznr commented Aug 15, 2019

Is the encoding problem caused by the incorrect transcription?

wrznr commented Aug 15, 2019

@stweil Btw. it is great that you share your insights here. Are you starting from scratch or from frk?

stweil commented Aug 15, 2019

Is the encoding problem caused by the incorrect transcription?

No, Tesseract already complains about the first character.

stweil commented Aug 15, 2019

Are you starting from scratch [...]?

Yes, I started from scratch. The results are online. They used 10,000, 100,000 and 300,000 iterations. Currently I am trying 900,000 iterations, although I have the impression that more than 100,000 iterations do not improve the model.

All of those trainings were still run with random lists of training data, so they cannot be reproduced.
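
For context, such a run is started roughly as follows; this is only a sketch with the model name as a placeholder (MAX_ITERATIONS is a variable of the tesstrain Makefile; to make a run reproducible, the generated list.train/list.eval files would have to be kept alongside the model):

```bash
# Train from scratch for 100,000 iterations, skipping make's implicit rules.
make -r training MODEL_NAME=GT4HistOCR MAX_ITERATIONS=100000
```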

I'd like to document the process, but not here under issues. Would the Wiki be a better place? It is currently empty.

stweil commented Aug 15, 2019

I have also run a first fine-tuning with a very small amount of data from GT4HistOCR. It was based on script/Fraktur, and the result looks promising.
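
A minimal sketch of such a fine-tuning run, assuming the script/Fraktur traineddata has been placed where the Makefile looks for start models (model name and iteration count are placeholders):

```bash
# Continue from the existing script/Fraktur model instead of training
# from scratch; START_MODEL selects the model to fine-tune.
make -r training MODEL_NAME=Fraktur_GT4HistOCR START_MODEL=Fraktur MAX_ITERATIONS=10000
```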

wrznr commented Aug 15, 2019

I sent you an invitation for collaboration. We would be delighted if you could share your experiences in the Wiki.

stweil commented Aug 16, 2019

See GT4HistOCR in the Wiki. It is still work in progress.

wrznr added the "pinned" label (Eternal issues which are safe from becoming stale) on Oct 1, 2019

wollmers commented Oct 5, 2019

@stweil You should note in the very first lines of the Wiki which version of Tesseract (or whatever else) you used.

Are performance results (precision, recall, f-score) available?

stweil commented Oct 5, 2019

Unless otherwise noted I always use the latest Tesseract (git master).

And no, sorry, I have not measured performance so far. It is not difficult, but it still costs time, and a non-quantitative evaluation is often sufficient for me.

See tesseract-ocr/tessdata#102 (comment) for a small recent test result.

We'll do qualitative evaluation of Tesseract in the OCR-D context.
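
For a rough character error rate, Tesseract's lstmeval can be run on a checkpoint and an evaluation list; a sketch with placeholder paths:

```bash
# Report character and word error rates of a trained checkpoint
# on the lines listed in list.eval.
lstmeval --model data/GT4HistOCR_checkpoint \
         --traineddata data/GT4HistOCR.traineddata \
         --eval_listfile data/list.eval
```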

wollmers commented Oct 6, 2019

@stweil It doesn't help a reader months or years from now that YOU know which version was used.

stweil commented Oct 6, 2019

Sure. That is why I added that information after your previous e-mail, though not in the very first lines, because not all tests used the same version. But I did write in the very first lines that this is still ongoing work.

wollmers commented Oct 6, 2019

@stweil The results in tesseract-ocr/tessdata#102 (comment) are mostly problems of the ground truth (GT) files. It is always important which level of GT is used or created. E.g. the DTA (Deutsches Textarchiv) tries to transcribe as accurately as possible to the original, including spelling errors and faults of manual typesetting. Of course some problems, like a reversed (upside-down) u or n, are difficult to transcribe or document. There is one book at the DTA where this happens very often; I assume that the cast letters of that time did not have a nick (German: Signatur, see https://en.wikipedia.org/wiki/Sort_(typesetting)).

A second useful level of normalisation for GT would be the level of current Unicode, using combining characters for e.g. a small e above. Some ligatures and combining forms are not available in Unicode; the same holds for some symbols, e.g. botanical ones. That is why MUFI, which uses the PUA, can still make sense for some academic researchers. But I would not recommend it for broad usage, because all tools in the workflow must be able to deal with it.

It would be nice if the experts could agree on some standard for GT files covering most of the corpora from the period 1750-1945. It should be Unicode-compliant and pragmatically close to the original. In my experience, before ~1750 orthography, typography, layout, line spacing, type designs and even Roman numerals are more chaotic and cannot easily be covered by a 'one for all' solution.

IMHO it is easier to convert a near-original GT file into one close to current German than the other way around.

Such a standard could be very practical as a central reference point for training tools/data, word lists, quality measures for training/Tesseract, post-processing, utilities and validating fonts (which is a story of its own).

wollmers commented Oct 6, 2019

@stweil In the case of

Ἰάϰχε, Ἰάϰχε! Wie blitzen ihre großen Augen! Noch

the problem seems to be caused by the quality of the image processing. https://ub-backup.bib.uni-mannheim.de/~stweil/ocrd-train/data/ground-truth/GT4HistOCR/dta19/1879-vischer_auch02/03739.nrm.png is wrong.

If you go to the DTA (registration is free), you will see on the page http://www.deutschestextarchiv.de/dtaq/book/view/vischer_auch02_1879/?p=161&hl=Wie that the high-quality original scan shows both Greek words with the exact spelling of the transcription:

ohnedieß taumelnden Kahn, trunken von Luſt ſchnalzt
ſie mit den Fingern, als ſchlüge ſie Caſtagnetten, und
jauchzt in den brauſenden Wind hinaus: Evoë! Evoë!
Ἰάϰχε, Ἰάϰχε! Wie blitzen ihre großen Augen! Noch
muthwilliger als vorhin, halbwild trifft mich ihr
Strahl! — Angſt wegen des Sturms kann ſie mir
nicht anſehen. Darum kann ſie mich nicht auslachen.

stweil commented Oct 6, 2019

Thank you for that information. This might be a systematic problem of the DTA images. CC'ing @jbaiter, @uvius and @chreul.

stweil commented Oct 6, 2019

It would be nice if the experts could agree to some standard for GT files covering most of the corpora in the period 1750-1945.

Maybe http://www.ocr-d.de/gt_guidelines?

wollmers commented Oct 6, 2019

It would be nice if the experts could agree to some standard for GT files covering most of the corpora in the period 1750-1945.

Maybe http://www.ocr-d.de/gt_guidelines?

Thanks for the link. The guidelines have some problems which make them unusable as a standard in a broader context, but they are better than most.

stweil commented Oct 6, 2019

I'm sure that the authors would be interested in suggestions for improving those guidelines. Maybe you can write them?

wollmers commented Oct 7, 2019

My idea is rather to collect existing guidelines of serious quality (the four mentioned in https://arxiv.org/ftp/arxiv/papers/1809/1809.05501.pdf: DTA, OCR-D, etc.), plus MUFI and some large non-German projects, review and compare them, and start an open standard on GitHub (like hOCR).

Roughly, there should be three levels:

  1. Use the character set of current writing for the language (per word), supported by standard keyboards. For German this would mean no long s and no round r, but keeping the spelling as faithfully as possible (seyn/sein, roth/rot, Thaal/Tal, Cactus/Kaktus, Photographie/Fotografie). Purpose: make texts better available for search engines, information retrieval, searchable PDFs and semantic processing.

  2. Use only Unicode-assigned characters. Most of MUFI can be built by combining a base character with combining characters. It may render in an ugly way, depending on the quality of the font. Experts using MUFI usually design their own fonts, i.e. they need a glyph for MUFI-specific characters, pre-composed in the font. You just need to connect the sequence of code points to a precomposed glyph, e.g. a + combining small e, or m + COMBINING TILDE (your example in the wiki https://github.com/tesseract-ocr/tesstrain/wiki/GT4HistOCR#inherited-characters; see the sketch after this list), or even map the sequence Ue to a glyph representing a non-existing ligature Ue; the same for ck, ch, sch etc. You can turn these OpenType (OT) features on or off if the software supports them. Modern browsers support them via CSS, and even the user can influence them via some JavaScript/CSS if the developer provides it. For ligatures there are the standard OT features 'liga' and 'hlig' (historical), and tailored ones are also possible. It is even possible to implement spelling rules for round versus long s. For the writer (human or program) this has a small disadvantage, because a Unicode control character (U+200C ZERO WIDTH NON-JOINER) has to be inserted to avoid automatic ligatures, e.g. if double long s is used systematically in the scanned document instead of sharp s. Unfortunately most of the freely available historic fonts are digitised by amateurs (LIGA-Fraktur), mapping ASCII characters like $ to long s, or implementing long-s spelling rules in such a way that it is not possible to write a single long s (e.g. in documents); the long s from the fallback font is rendered in this case (a sans serif in my case). Only a few have been brought to good quality, e.g. by Google Fonts: https://fonts.google.com/specimen/UnifrakturCook.

Using only Unicode (without the PUA) has a lot of advantages. Developers can use standard functionality tested against Unicode: no moving target, no reinventing the wheel. E.g. in Perl 5, which has the best Unicode support to my knowledge, you can use Unicode properties in regex rules, normalize to NFC, tokenize into graphemes with \X, and use Unicode property arithmetic. OK, Perl 6 is even better, using graphemes by default (I wrote the specs and spec tests for that part). Personally I do not understand why languages like Python, C++ and Java are so popular in an academic context. OO notation like string.match() is not really convenient, and using ICU has its errors; that is why Perl 5 and Perl 6 compile directly from the Unicode tables. In C++, mostly wchar (a 16-bit unsigned integer) is used, but Unicode needs 21 bits. Java I do not know exactly; it seems reliable (full UTF-16 support). But in my experience academic code does not build or install easily.

Using the PUA has a big disadvantage: you cannot use Unicode properties, because a private code point has no properties, unless you provide and maintain them with your own complicated code.

  3. Expert researchers may need more than Unicode. That is a level of tailored guidelines where they cannot expect standard support and have to develop their own tools. It is not a level that a standard OCR system or standard training should support.
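
To make the combining-character and ZWNJ points above concrete, here is a small illustration (assuming a bash >= 4.2 printf with \u escapes; the code points are the ones discussed in item 2):

```bash
# u followed by U+0364 COMBINING LATIN SMALL LETTER E: the historic u with a small e above
printf 'u\u0364\n'

# m followed by U+0303 COMBINING TILDE: the nasal abbreviation mark
printf 'm\u0303\n'

# Two long s (U+017F) separated by U+200C ZERO WIDTH NON-JOINER, which tells a font
# with active 'liga'/'hlig' features not to form a ligature between them.
printf '\u017F\u200C\u017F\n'
```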

@wollmers

@stweil Should we open an extra issue for ONB? Or an extra repository? I don't want to edit the wiki.

I just downloaded it, looked into the first page, ONB_aze_18950706_1, saw a darker spot and, bingo,

XML line 779 has a wrong transcription:

XML:  [...] Postspackessen-Check-Konto
CORR: [...] Postsparkassen-Check-Konto

Line 488

XML:  [...] Tabak= Trafiken u. Verschleitz¬
CORR: [...] Tabak=Trafiken u. Verschlei߬

The image quality is not so bad; the JPGs have 300 dpi or better. They are heavily compressed, which explains the small file sizes and the visible artefacts. The TIFFs were obviously digitised by Google, as Google digitises for the ONB. They seem to be binarised with a global method, which destroys information.

My plan is to filter or correct some of the errors automatically. My method is promising, but I do not have enough unpaid time to make this happen in the next few weeks. I am still working on the normalisation of transcription standards and the automatic normalisation of fonts, which is related.

stweil commented Jan 26, 2020

Ideally the ÖNB would offer a repository and accept pull requests to improve the ground truth texts. Do you want to ask them?

@wollmers

The authors of the GT are from Uni Innsbruck. I don't know whether the ONB has OCR experts at a scientific level. Sooner or later I will inform them of my work. The GT files are open source with a friendly license, so I can put them on GitHub and add the diffs to apply plus scripts for the corrections.

stweil commented Jan 27, 2020

I just wrote to the colleagues in Innsbruck and expect their answer next week. If they provide a public repository, I'll send our fixes as a pull request. Otherwise I could upload our internal repository, which not only fixes transcriptions but also adds the long s, to GitHub.

@wollmers

Sounds good. If it is based on the original version from Zenodo, then it's perfect.

stweil commented Jan 28, 2020

https://github.com/UB-Mannheim/AustrianNewspapers is now online.

stweil commented Jan 28, 2020

If it is based on the original version from Zenodo, then it's perfect.

See release 1.0.0.

@wollmers

Perfect.

wrznr commented Sep 8, 2021

@wollmers @stweil Can we close this issue?
