
Q&A: Training Wiki Updates and Request for Info #659

Open
Shreeshrii opened this issue Jan 13, 2017 · 49 comments

Comments

@Shreeshrii
Collaborator

@theraysmith

Ray, thanks for updating the wiki page for LSTM training. A few more changes to the following passages may be required in light of the updates:

In theory it isn't necessary to have a base Tesseract of the same language as the neural net Tesseract, but currently it won't load without something there.

Finally, combine your new model with the language model files into a traineddata file:

Please also provide the command for building a traineddata file with just the .lstm file, or with just the .lstm file and the lstm dawgs (so as to minimize the traineddata file size if only the LSTM engine is going to be used).
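As a sketch of what that might look like with the existing tools (filenames are placeholders): combine_tessdata -d lists the components of a traineddata file, and -o overwrites named components in place, so a minimal LSTM-only update could be done roughly like this:

  # Show which components the traineddata file currently contains.
  combine_tessdata -d san.traineddata

  # Replace only the lstm component inside the existing traineddata file.
  combine_tessdata -o san.traineddata san.lstm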

@Shreeshrii
Collaborator Author

Shreeshrii commented Jan 13, 2017

Also helpful will be info on:

  1. How big the training text should be (number of lines) for:
  • fine tuning, and
  • adding a layer.
  2. What kind of text is recommended/can be used?
  • paragraphs, sentences
  • word lists
  • orthographic syllables

e.g. for Sanskrit, I want to train by adding a layer, using a list of the most frequent orthographic syllables, so that the unicharset is expanded to include all possible aksharas. Will this work?

  3. Should training be done using different --ptsize values? If so, is it possible to modify tesstrain.sh to take a list of --ptsize options (similar to the array for exposures, --exp)? See the sketch below.
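Until tesstrain.sh supports a list, a minimal workaround sketch (untested; it assumes tesstrain.sh takes a single --ptsize value, as implied above, and the paths are placeholders) is to loop over point sizes in the shell:

  # Run tesstrain.sh once per point size, giving each run its own output
  # directory so the generated training data does not collide.
  for size in 10 12 14; do
      ./tesstrain.sh --lang san --ptsize "$size" \
          --output_dir ~/tesstutorial/san_pt"$size"
  done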

@amitdo
Collaborator

amitdo commented Jan 13, 2017

My own question - the answer can also be added to the wiki.

Is it OK to mix b/w images produced by text2image with gray and/or color images from book scans?

@Shreeshrii
Collaborator Author

Shreeshrii commented Jan 13, 2017 via email

@amitdo
Collaborator

amitdo commented Jan 13, 2017

Also, is there a way for tesseract to create line boxes for a scanned image? It would make it easier to add the ground-truth text if the box dimensions are pre-made.

This feature is not implemented. I will try to implement it sometime in the next few days and send a PR.

@Shreeshrii Shreeshrii changed the title Training Wiki Corrections Training Wiki Corrections/Request for Info Jan 14, 2017
@Shreeshrii
Collaborator Author

Another question:

What effect does the "add a layer" type of training have on the unicharset in the new traineddata?

For "add a layer", a unicharset is required, e.g. lstmtraining -U ~/tesstutorial/bih/bih.unicharset
Does this

  • add to the unicharset from the existing lstm file
  • replace the unicharset from the existing lstm file
  • replace parts of the unicharset in the existing lstm file

Meaning, if we just want to add a few characters to the unicharset, is it enough to have a good sampling of those, or do characters from the lstm unicharset (which are unknown at this point) need to be there too?
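For context, the wiki's add-a-layer recipe looks roughly like the following (a sketch with placeholder paths, per the wiki of that era; the output size in --net_spec, O1c111 here, has to match the number of classes in the new unicharset, which is part of why this question matters):

  # Continue from an existing model, cut it at layer index 5, and append a
  # fresh LSTM layer plus an output layer sized for the new unicharset.
  lstmtraining -U ~/tesstutorial/bih/bih.unicharset \
      --continue_from ~/tesstutorial/bih/eng.lstm \
      --append_index 5 --net_spec '[Lfx256 O1c111]' \
      --model_output ~/tesstutorial/bih/bih \
      --train_listfile ~/tesstutorial/bih/bih.training_files.txt \
      --max_iterations 3600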

@Shreeshrii Shreeshrii changed the title Training Wiki Corrections/Request for Info Training Wiki Updates and Request for Info Jan 14, 2017
@Shreeshrii
Collaborator Author

Traineddata files in tessdata for 4.0 were trained with --perfect_sample_delay 19. The default value for this variable is 4.

The training command examples do not specify this. What is the recommended value for fine tuning and for adding a layer?

@Shreeshrii
Collaborator Author

Shreeshrii commented Jan 15, 2017

@theraysmith

Please see
https://groups.google.com/forum/#!topic/tesseract-ocr/-N5uPdSvJGA
#642
#561

The 'core dumped' error in these cases seems to be related to using --eval_listfile as part of the lstmtraining command, e.g.
--eval_listfile ~/tesstutorial/saneval/san.training_files.txt

Please update the wiki, if you can confirm this, so that people are able to run the tutorial.

Thanks.

@Wikinaut
Contributor

Wikinaut commented Jan 15, 2017

@amitdo Question to you, let me explain as briefly as I can:

  • I have successfully OCRed 700 pages of a book with the LSTM engine, using tesseract in.ppm out -l deu --oem 2 txt.
  • I manually corrected the output file out.txt to out.corrected.txt.

I found certain groups of OCR failures in my scan; two examples which were always wrongly detected:

  • "Citroén" instead of the original word "Citroën"
  • "fiir" instead of "für"

Question

Is there an easy way (I guess it should be possible, and it would be very user-friendly):

  • to easily retrain a Tesseract language (or a copy of it) by re-feeding a corrected txt version?
  • What would the command line be?

@amitdo
Collaborator

amitdo commented Jan 15, 2017

Hi @Wikinaut!

Believe it or not, I haven't yet started playing with training the LSTM engine, so I don't know enough to answer your question. Hopefully, this serious 'bug' will be fixed sometime in the next month :-)

Some observations:
Both 'für' and 'fiir' are in the wordlist.
https://raw.githubusercontent.com/tesseract-ocr/langdata/master/deu/deu.wordlist

'ë' does not appear in the training text; 'é' appears 4 times.
https://github.com/tesseract-ocr/langdata/blob/master/deu/deu.training_text

Café So für
René für
Cafés
André

'für' appears 10 times in the training text.
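These counts can be reproduced with a quick check (note that grep -c counts matching lines, so grep -o ... | wc -l is used here to count individual occurrences):

  # Count occurrences of each string in the training text.
  grep -o 'ë' deu.training_text | wc -l
  grep -o 'é' deu.training_text | wc -l
  grep -o 'für' deu.training_text | wc -l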

OCR Engine modes:
0 Original Tesseract only.
1 Neural nets LSTM only.
2 Tesseract + LSTM.
3 Default, based on what is available.

Did you try --oem 1?

@Wikinaut
Contributor

@amitdo my original text uses a very "bad" font, where the characters overlap very often and often look like, but are not, "ligatures". This explains the "fiir" in many cases (in my case).

I also tried --oem 1, but found that --oem 2 gave the best results. However, I did not find an explanation of what this "mixed operation mode" really does. Please can we add a short text to "2 Tesseract + LSTM"? I can supply a PR, but do not know what a correct and short description would be.

@Wikinaut
Contributor

Wikinaut commented Jan 15, 2017

@amitdo and regarding my question above: can I "quickly" retrain my "deu" traineddata (or a copy of it) with a corrected text? That would be really great.

Promise: some mBitcoins for this today!

@Wikinaut
Contributor

Whoever coded the LSTM: Big APPLAUSE for him or her!

@amitdo
Collaborator

amitdo commented Jan 15, 2017

LSTM - the new OCR engine, based on neural networks.
Tesseract - the old OCR engine (started in the mid-80s) - does character segmentation and shape matching.

@Wikinaut
Contributor

@amitdo yes, but what happens if one selects --oem 2? Are the results of both engines compared, or otherwise evaluated together?

@amitdo
Collaborator

amitdo commented Jan 15, 2017

The two engines run, and the results are combined in some way.

@Wikinaut
Contributor

👍

@amitdo
Collaborator

amitdo commented Jan 15, 2017

As said, I have zero experience training the LSTM engine.

What you want is described here:
https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#fine-tuning-for-impact
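The recipe on that page boils down to continuing training from the existing model on your new line images and ground truth, roughly like this (a sketch with placeholder paths, per the wiki of that era; the training data has to be generated from the corrected text first):

  # Fine tuning: continue from the shipped model on the new training data.
  lstmtraining --continue_from ~/tesstutorial/deu/deu.lstm \
      --model_output ~/tesstutorial/deu_ft/deu \
      --train_listfile ~/tesstutorial/deu/deu.training_files.txt \
      --max_iterations 400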

@Shreeshrii
Collaborator Author

@Wikinaut

my original text uses a very "bad" font, where the characters overlap very often and often look like, but are not, "ligatures". This explains the "fiir" in many cases (in my case).

Please provide a sample image for testing.

@Wikinaut
Contributor

@Shreeshrii

"für" vs.Tesseract: "fiir"

case 1
20170116-07 50 12_auswahl

case 2
20170116-07 52 18_auswahl

case 3
20170116-07 53 20_auswahl

"Citroën" vs. Tesseract: "Citroén"

Case 1
20170116-07 54 14_auswahl

Case 2
20170116-07 55 12_auswahl

@Shreeshrii
Collaborator Author

Shreeshrii commented Jan 16, 2017

ë is not in the training text. It needs to be added; I hope @theraysmith will include it in the next training.

für is being recognized - see the attached output files.
e1ec5bba-dbc0-11e6-9e9a-8e65f50a9d60-oem1-png.txt
beb10966-dbc0-11e6-8b86-f89117a7918c-oem1-png.txt
90011a70-dbc0-11e6-9889-d104dad6822a-oem1-png.txt

though ö was not recognized in one image.

@amitdo
Collaborator

amitdo commented Jan 16, 2017

ë is not in the training text. It needs to be added; I hope @theraysmith will include it in the next training.

https://en.wikipedia.org/wiki/German_language#Orthography
It's not in the German alphabet; it's from French. Still, maybe it should be included in the deu traineddata.

@amitdo
Collaborator

amitdo commented Jan 16, 2017

It does look like 'ii' (two 'i's), doesn't it?

Maybe the training text needs some examples of 'ii' so it can learn to distinguish it from 'ü'.
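A quick way to act on that idea (a sketch; the appended words are just examples of German text containing 'ii'):

  # Check whether 'ii' occurs in the training text at all...
  grep -o 'ii' deu.training_text | wc -l
  # ...and, if not, add a few genuine words that contain it.
  printf '%s\n' 'Gummiisolation' 'Skiinstruktor' >> deu.training_text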

@Wikinaut
Contributor

@Shreeshrii in my conversion, these words "für" were recognized as "fiir". It may be due to the use of "unpaper" as a preprocessor, and/or my use of "-l deu+eng --oem 2" for the conversion.

There were many more occurrences of the false detection "fiir" in my roughly 700 pages of text. This was the most frequent conversion error, and it prompted me to ask you how I could retrain tessdata using my corrected text file. A simple command line would be very helpful for such cases.

@amitdo regarding "ii": in my text, tesseract correctly OCRed "ii" in the words "Gummiisolation" and "Daiichi" (a name).

@Wikinaut
Contributor

Wikinaut commented Jan 16, 2017

@theraysmith You appear to be the expert to answer my question of whether such a procedure for re-training (tesseract + LSTM) is easily possible or not:

(I already described it above:)

Can I "quickly" retrain my "deu" (or "deu+eng") training data (or a copy of it) with a corrected text?

  • in.pdf -> tesseract -> out.txt
  • out.txt -> manually corrected -> corrected.txt
  • retraining tesseract (to get tesseract') with these inputs: in.pdf + corrected.txt

re-running with the re-trained tesseract' should, in the best case, result in

  • in.pdf -> tesseract' -> corrected.txt

I found, but do not (yet) understand, the present training explanations in the wiki; perhaps my idea is not yet covered.

@theraysmith
Contributor

theraysmith commented Jan 17, 2017 via email

@Wikinaut
Contributor

Wikinaut commented Jan 17, 2017

@theraysmith Thank you for your swift answer.

In my case, many "für" were detected as "fiir", both with and without unpaper (I cannot remember exactly, because I tried many different runs).

I will retry - and report here - with only -l deu, in order to present a correct case for reproduction.

@Wikinaut
Contributor

@theraysmith to be more precise:

I tried tesseract with --oem 0, 1 and 2, and found that "2" gave the best results (for a 700-page scan). I reran with and without unpaper and found some differences. And I only used -l deu+eng because my German text contains some English terms. Now that I have a manually corrected reference output text, I can present (later) a kind of matrix with the results.

@amitdo
Collaborator

amitdo commented Jan 20, 2017

@stweil
Member

stweil commented Mar 9, 2017

Some of the problems with German texts were addressed in tesseract-ocr/langdata#54, tesseract-ocr/langdata#56 and tesseract-ocr/langdata#57. I don't know whether those fixes are sufficient to improve future trainings.

@Wikinaut
Contributor

@stweil @amitdo Stefan, please can you also make sure that common words with a https://en.wikipedia.org/wiki/Diaeresis_(diacritic) (German: Trema), like Citroën, are correctly trained?

@stweil
Member

stweil commented Mar 14, 2017

I addressed the more general question of whether all European languages should support all typical diacritical characters in the tesseract-dev forum, and I need information from @theraysmith to proceed.

@stweil
Member

stweil commented Mar 14, 2017

I got "Citroën" correctly recognized by using fra+deu as the language.

I expect that using additional languages has more side effects than just recognizing additional characters, because they also add word lists, unigram frequencies, word bigrams and so on for those languages, which might have a negative effect on OCR results for texts that are mainly written in a single language and make only sparing use of additional languages. Examples of such texts are German texts with foreign person or trademark names, but also English scientific texts with additional Greek characters (a combination often used in mathematics and physics).
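For reference, a minimal sketch of such a combined-language run (the image name is a placeholder; the trade-off above applies):

  # Mainly-German scan; French is added so characters like 'ë' are available.
  tesseract scan.png out -l fra+deu --oem 1 txt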

@Wikinaut
Contributor

@stweil Thanks for your swift answers. Let me know, if I can help.

@amitdo
Collaborator

amitdo commented Aug 14, 2017

Wikinaut, you can try the new best/Latin.traineddata

@Shreeshrii Shreeshrii changed the title Training Wiki Updates and Request for Info Q&A: Training Wiki Updates and Request for Info Sep 11, 2017
@Shreeshrii
Collaborator Author

@Wikinaut

I found certain groups of ocr failures in my scan case, two examples which were always wrongly detected
"Citroén" instead of the original word "Citroën"
"fiir" instead of "für"

Does it work now with best traineddata?

Can I close this issue?

@Wikinaut
Contributor

I have not tried the latest version. Please leave this open - I will close it if it's solved.

@amitdo
Collaborator

amitdo commented Sep 11, 2017

@Wikinaut,

The best/eng.traineddata doesn't have the marks you want.

Try the new best/Latin.traineddata.

@stweil
Member

stweil commented Sep 11, 2017

The problem with "fiir" instead of "für" is a typical example of the ii / ü confusion which still exists in the current best traineddata. The wordlist for best/Latin.traineddata includes "dafiir" (correct: "dafür") and "fiir" (correct: "für"), for example.

@Wikinaut
Contributor

Wikinaut commented Sep 21, 2017

@stweil I now use the new https://github.com/tesseract-ocr/tessdata_best data and found that a problem with lowercase vs. uppercase "s" exists in a 1000-page text.

Typical incorrectly detected word patterns are:

  • "Sich" instead of "sich"
  • "Sie" instead of "sie"
  • "Sagte" instead of "sagte"
  • "Sagen" instead of "sagen"
  • "Sah" instead of "sah"
  • "ICh" instead of "Ich"
  • "80" instead of "so"

@amitdo
Collaborator

amitdo commented Sep 21, 2017

The problem with "fiir" instead of "für" is a typical example of the ii / ü confusion which still exists in the current best traineddata. The wordlist for best/Latin.traineddata includes "dafiir" (correct: "dafür") and "fiir" (correct: "für"), for example.

Try to correct the mistakes in the wordlist and see if it helps to recognize these words.
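A sketch of how that could be done with the stock tools (component names assumed to follow the usual suffix scheme; dawg2wordlist and wordlist2dawg convert between a dawg and a plain wordlist using the matching unicharset):

  # Unpack the LSTM wordlist from the traineddata into a plain text file.
  combine_tessdata -e Latin.traineddata Latin.lstm-unicharset Latin.lstm-word-dawg
  dawg2wordlist Latin.lstm-unicharset Latin.lstm-word-dawg Latin.wordlist

  # After editing Latin.wordlist (e.g. removing 'fiir' and 'dafiir'),
  # rebuild the dawg and write it back into the traineddata file.
  wordlist2dawg Latin.wordlist Latin.lstm-word-dawg Latin.lstm-unicharset
  combine_tessdata -o Latin.traineddata Latin.lstm-word-dawg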

@stweil
Member

stweil commented Sep 21, 2017

... or run Tesseract without a wordlist. I recently removed the wordlists from the best traineddata files to see and compare the real quality of the trained LSTM data; this is impossible when Tesseract uses a wordlist. With wordlists, Tesseract also invents words which don't occur in the original text ("computer" and "Google" in historical documents).

PS: Is there a parameter which disables the post-OCR steps (like wordlist evaluation) in Tesseract, without the need to remove the wordlists from the traineddata files?

@amitdo
Collaborator

amitdo commented Sep 21, 2017

Yes, there is a parameter which disables the wordlist evaluation.

I don't remember its name right now...

@Shreeshrii
Collaborator Author

Please see #960

I guess you can set the following two config variables to false to avoid loading the wordlist dawg files:

load_system_dawg F
load_freq_dawg F
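For example, both variables can be set on the command line without editing any config file (a sketch; image and language are placeholders):

  # Disable loading of the dictionary dawgs at run time via -c.
  tesseract scan.png out -l deu --oem 1 \
      -c load_system_dawg=F -c load_freq_dawg=F txt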

@amitdo
Collaborator

amitdo commented Sep 21, 2017

The parameter is lstm_use_matrix.

@amitdo
Collaborator

amitdo commented Sep 21, 2017

I guess you can set the following two config variables to false to avoid loading the wordlist dawg files:
load_system_dawg F
load_freq_dawg F

load_system_dawg should work.

load_freq_dawg seems to have no impact on the LSTM recognizer.

@Shreeshrii
Collaborator Author

Those config variables relate to the legacy engine. New traineddata files have a different lstm-word-dawg and no freq-dawg files, so I am not sure whether they will work. I haven't tried it yet.

@amitdo
Collaborator

amitdo commented Sep 21, 2017

void Dict::LoadLSTM(const STRING &lang, TessdataManager *data_file) {

@stweil
Member

stweil commented Sep 21, 2017

I wonder why LSTM needs its own word list. I'd expect that a word list is different for different languages, and it is also reasonable to use different word lists for different kinds of text (topic, date) of the same language, but it should not depend on the OCR algorithm.

@Shreeshrii
Collaborator Author

Shreeshrii commented Sep 21, 2017

It is not that the wordlist is different; it is that the legacy engine and the LSTM models might be using different unicharsets.

The creation and unpacking of dawgs requires a unicharset; that is why there are two sets of dawg files - even for numbers and punctuation - in addition to the wordlist.
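This pairing is easy to see by unpacking a traineddata file (a sketch; component names assumed to follow the usual suffix scheme):

  # Unpack all components; each engine's dawgs sit next to its own unicharset.
  combine_tessdata -u deu.traineddata deu.
  ls deu.unicharset deu.word-dawg deu.lstm-unicharset deu.lstm-word-dawg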
