Issues with Tesseract / ocrd-train and GT4HistOCR #73

I am starting this issue to collect my experiences when trying to train Tesseract from GT4HistOCR using ocrd-train, so problems reported here can be caused by Tesseract, by ocrd-train or by GT4HistOCR.
GT4HistOCR contains more than 300000 pairs of line images and ground truth text. This requires much processing time for training. In addition, pull request #72 avoids unnecessary child processes, which also improves the performance.
Tesseract has an encoding problem with the ground truth text for the image dta19/1879-vischer_auch02/03739.nrm.png.
Is the encoding problem caused by the incorrect transcription?
@stweil Btw. it is great that you share your insights here. Are you starting from scratch or from an existing model?
No, Tesseract already complains about the first character.
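One way to narrow such a report down (not from the thread) is to dump the codepoints of the offending ground-truth line; the file name below is made up.

```python
# Sketch: list the codepoints of a ground-truth line to find characters
# that trip up the training tools. The file name is an assumption.
import unicodedata

with open("03739.gt.txt", encoding="utf-8") as f:
    line = f.read().rstrip("\n")

for ch in line:
    # Print codepoint, official Unicode name (if any) and the character itself.
    print(f"U+{ord(ch):04X} {unicodedata.name(ch, 'UNKNOWN')!r} {ch!r}")
```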
Yes, I started from scratch. The results are online. They used 10000, 100000 and 300000 iterations. Currently I am trying 900000 iterations, although I have the impression that more than 100000 iterations do not improve the model. All those trainings were still made with random lists of training data, so they cannot be reproduced. I'd like to document the process, but not here under issues. Would the Wiki be a better place? It is currently empty.
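As an aside (not part of the original comment): one way to make such runs repeatable is to fix the order of the training lines up front. A minimal Python sketch, assuming a GT4HistOCR-style directory of line images; the directory, the list file name and the seed are illustrative only, and this is not how ocrd-train itself builds its lists.

```python
# Sketch: build a deterministic, shuffled list of ground-truth line images
# so that a training run can be repeated with exactly the same data order.
import random
from pathlib import Path

GT_DIR = Path("data/ground-truth/GT4HistOCR")  # hypothetical location of the line images
SEED = 42                                      # fixed seed makes the shuffle repeatable

lines = sorted(GT_DIR.rglob("*.nrm.png"))      # sort first for a stable baseline order
random.Random(SEED).shuffle(lines)             # then shuffle deterministically

with open("training-lines.txt", "w", encoding="utf-8") as f:
    for path in lines:
        f.write(f"{path}\n")
```

With the list fixed on disk, two training runs over the same file see the lines in the same order.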
I have also run a first fine tuning with a very small amount of data from GT4HistOCR. It was based on an existing model.
I sent you an invitation for collaboration. We would be delighted if you could share your experiences in the Wiki.
See GT4HistOCR in the Wiki. It is still work in progress.
@stweil You should note in the very first lines of the Wiki which version of Tesseract (or whatever else) you used. Are performance results (precision, recall, f-score) available?
Unless otherwise noted I always use the latest Tesseract (git master). And no, sorry, I have not measured performance up to now. It is not difficult, but it costs time nevertheless, and often a non-quantitative evaluation is sufficient for me. See tesseract-ocr/tessdata#102 (comment) for a small recent test result. We'll do qualitative evaluation of Tesseract in the OCR-D context.
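As an illustration of how little is needed for a quick quantitative check (not from the thread): a character error rate can be computed with a plain Levenshtein distance. The example strings below are made up.

```python
# Sketch: character error rate (CER) between OCR output and ground truth,
# using a plain dynamic-programming Levenshtein distance; no dependencies.
def levenshtein(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def cer(ocr_text: str, gt_text: str) -> float:
    return levenshtein(ocr_text, gt_text) / max(len(gt_text), 1)

# Hypothetical example strings:
print(cer("Wie die Alten", "Wie die Alten"))    # 0.0
print(cer("VVie die Alten", "Wie die Alten"))   # > 0.0
```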
@stweil It doesn't help a reader after some months or years that YOU know which version was used.
Sure. Therefore I have added that information after your previous e-mail, not in the very first lines, because not all tests used the same version. But I wrote in the very first lines that this is still ongoing work.
@stweil The results in tesseract-ocr/tessdata#102 (comment) are mostly problems of the ground truth (GT) files. It is always important which level of GT is used or created. E.g. the DTA (Deutsches Textarchiv) tries to transcribe as accurately as possible to the original, including spelling errors and faults of manual typesetting. Of course some problems, like a reversed (upside-down) u or n, are difficult to transcribe or document. There is one book at DTA where this happens very often; I assume that the cast letters of that time did not have a nick (DE: Signatur), see https://en.wikipedia.org/wiki/Sort_(typesetting).

A second useful level of normalisation for GT would be at the level of current Unicode, using combining characters for, e.g., a small e above. Some ligatures and combining characters are not available in Unicode; the same holds for some symbols, e.g. botanical ones. That is why MUFI, which uses the PUA, can still make sense for some academic projects. But I wouldn't recommend it for broad usage, because all tools in the workflow must be able to deal with it.

It would be nice if the experts could agree on a standard for GT files covering most of the corpora in the period 1750-1945. It should be Unicode compliant and pragmatically close to the original. In my experience, before ~1750 orthography, typography, layout, line spacing, type designs and even roman numerals are more chaotic and cannot easily be covered by a 'one for all' solution. IMHO it is easier to convert a near-original GT file to one close to current German than the other way round. Such a standard could be very practical as a central point for training tools/data, word lists, quality measures of training/Tesseract, post-processing, utilities and validating fonts (which is a story of its own).
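To make the "small e above" example concrete, here is a short sketch (not from the original comment) showing a combining-character transcription next to a normalisation to the modern umlaut; the sample word is only an illustration.

```python
# Sketch: "u" followed by COMBINING LATIN SMALL LETTER E (U+0364), as found in
# many historical German prints, vs. a normalisation to the modern umlaut.
import unicodedata

historical = "u\u0364ber"                    # uͤber
modern = historical.replace("u\u0364", "ü")  # über

print([unicodedata.name(c) for c in historical[:2]])
# ['LATIN SMALL LETTER U', 'COMBINING LATIN SMALL LETTER E']

# NFC keeps the combining sequence, because Unicode has no precomposed
# "u with small e above"; the historical spelling survives normalisation:
print(unicodedata.normalize("NFC", historical) == historical)  # True
print(modern)                                                  # über
```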
@stweil In the case of this image, the problem seems to be caused by the quality of the image processing: https://ub-backup.bib.uni-mannheim.de/~stweil/ocrd-train/data/ground-truth/GT4HistOCR/dta19/1879-vischer_auch02/03739.nrm.png is wrong. If you go into DTA (registration is free), you will see on the page http://www.deutschestextarchiv.de/dtaq/book/view/vischer_auch02_1879/?p=161&hl=Wie that the original scan, which is of high quality, shows both Greek words with the exact spelling of the transcription.
Thanks for the link. The guidelines have some problems which make them unusable as a standard in a broader context, but they are better than most.
I'm sure that the authors would be interested in suggestions for improving those guidelines. Maybe you can write them?
My idea is rather to collect the existing guidelines of serious quality (the four mentioned in https://arxiv.org/ftp/arxiv/papers/1809/1809.05501.pdf - DTA, OCR-D, etc.), plus MUFI and some large non-German projects, review and compare them, and start an open standard on GitHub (like hOCR). Roughly there should be three levels:
Using only Unicode (without the PUA) has a lot of advantages. Developers can use standard functionality tested against Unicode: no moving target, no reinventing the wheel. E.g. in Perl 5, which has the best Unicode support to my knowledge, you can use Unicode properties in regex rules, normalise to NFC, tokenise into graphemes with \X and use Unicode property arithmetic. Perl 6 is even better because it uses graphemes by default (I wrote the specs and spec tests for this part). Personally I do not understand why languages like Python, C++ and Java are so popular in an academic context. OO notation like string.match() is not really convenient, and using ICU has errors; that is why Perl 5 and Perl 6 compile directly from the Unicode tables. In C++ mostly wchar (a 16-bit unsigned integer) is used, but Unicode needs 21 bits. Java I don't know exactly; it seems reliable (full UTF-16 support), but from experience academic code does not build or install easily.

Using the PUA has a great disadvantage: you cannot use Unicode properties, because a private codepoint has no properties, unless you provide and support them with your own complicated code.
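A small illustration of the property argument (not from the original comment): standard codepoints carry queryable Unicode properties, whereas a PUA codepoint only reports "private use". Python with the third-party regex module is used here as a stand-in for the Perl features mentioned above.

```python
# Requires the third-party "regex" module (pip install regex) for \p{...} and \X.
import unicodedata
import regex

word = "Pra\u0364gung"   # historical spelling with combining small e above (U+0364)
pua = "\uE8BF"           # an arbitrary codepoint in the Private Use Area (the range MUFI uses)

print(unicodedata.category("ä"))   # 'Ll' – lowercase letter, a real Unicode property
print(unicodedata.category(pua))   # 'Co' – "private use": no further properties available

print(regex.findall(r"\p{L}", "äöü"))  # match by Unicode property (letters)
print(regex.findall(r"\X", word))      # grapheme clusters: 'a' + U+0364 stays together
print(unicodedata.normalize("NFC", word))
```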
@stweil Should we open an extra issue for ONB, or an extra repository? I don't want to edit the wiki. I just downloaded it, looked into the first page, ONB_aze_18950706_1, saw a darker spot and bingo: XML line 779 has a wrong transcription:
Line 488
The image quality is not so bad. The JPGs have 300 dpi or better; they are heavily compressed, which explains the small file size and the visible artefacts. The TIFs were obviously digitised by Google, as Google digitises for the ONB. They seem to be binarised with a global method, which destroys information. My plan is to filter or correct some of the errors automatically. My method is promising, but I do not have enough unpaid time to make this happen in the next weeks. I am still working on the normalisation of transcription standards and the automatic normalisation of fonts, which is related.
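For readers unfamiliar with the binarisation point (this sketch is not from the thread): a global method applies one threshold to the whole page, while a locally adaptive method such as Sauvola computes a threshold per neighbourhood and tends to preserve faint print. The file name and the use of scikit-image are assumptions.

```python
# Sketch: global (Otsu) vs. locally adaptive (Sauvola) binarisation.
# A single global threshold can wipe out faint print that a local method keeps.
from skimage import filters, io
import numpy as np

page = io.imread("ONB_aze_18950706_1.tif", as_gray=True)  # hypothetical file name

global_bin = page > filters.threshold_otsu(page)                   # one threshold per page
local_bin = page > filters.threshold_sauvola(page, window_size=25) # one threshold per neighbourhood

io.imsave("binarised-global.png", (global_bin * 255).astype(np.uint8))
io.imsave("binarised-sauvola.png", (local_bin * 255).astype(np.uint8))
```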
Ideally the ÖNB would offer a repository and accept pull requests to improve the ground truth texts. Do you want to ask them?
The authors of the GT are from the University of Innsbruck. I don't know if the ONB has OCR experts at a scientific level. Sooner or later I will inform them of my work. The GT files are open source with a friendly license, so I can put them onto GitHub and add the diffs to apply and scripts for corrections.
I just wrote to the colleagues at Innsbruck and expect their answer next week. If they provide a public repository, I'd send our fixes as a pull request. Otherwise I could upload to GitHub our internal repository, which not only fixes transcriptions but also adds the long s.
Sounds good. If it is based on the original version from Zenodo, then it's perfect.
https://github.com/UB-Mannheim/AustrianNewspapers is now online.
See release 1.0.0.
Perfect.