
Q&A: Training Wiki Updates and Request for Info #659

Open
Shreeshrii opened this issue Jan 13, 2017 · 49 comments

Comments

@Shreeshrii
Collaborator

@theraysmith

Ray, thanks for updating the wiki page for LSTM training. A few more changes to the following passages may be required in light of the updates:

In theory it isn't necessary to have a base Tesseract of the same language as the neural net Tesseract, but currently it won't load without something there.

Finally, combine your new model with the language model files into a traineddata file:

Please also provide the command for building a traineddata file with just the .lstm file, or with just the .lstm file and the lstm dawgs (so as to minimize the traineddata file size if only the LSTM engine is going to be used).
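As a sketch of what that might look like with the existing tools (filenames are placeholders): combine_tessdata -d lists the components of a traineddata file, and -o overwrites named components in place, so a minimal LSTM-only update could be done roughly like this:

  # Show which components the traineddata file currently contains.
  combine_tessdata -d san.traineddata

  # Replace only the lstm component inside the existing traineddata file.
  combine_tessdata -o san.traineddata san.lstm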

@Shreeshrii
Collaborator Author

Shreeshrii commented Jan 13, 2017

Also helpful will be info on:

  1. How big the training text should be (number of lines) for:
  • fine tuning, and
  • adding a layer.
  2. What kind of text is recommended/can be used?
  • paragraphs, sentences
  • word lists
  • orthographic syllables

e.g. for Sanskrit, I want to train by adding a layer, using a list of the most frequent orthographic syllables, so that the unicharset is expanded to include all possible aksharas. Will this work?

  3. Should training be done using different --ptsize values? If so, is it possible to modify tesstrain.sh to take a list of --ptsize options (similar to the array for exposures, --exp)? See the sketch below.
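Until tesstrain.sh supports a list, a minimal workaround sketch (untested; it assumes tesstrain.sh takes a single --ptsize value, as implied above, and the paths are placeholders) is to loop over point sizes in the shell:

  # Run tesstrain.sh once per point size, giving each run its own output
  # directory so the generated training data does not collide.
  for size in 10 12 14; do
      ./tesstrain.sh --lang san --ptsize "$size" \
          --output_dir ~/tesstutorial/san_pt"$size"
  done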

@amitdo
Collaborator

amitdo commented Jan 13, 2017

My own question - the answer can also be added to the wiki.

Is it OK to mix b/w images produced by text2image with gray and/or color images from book scans?

@Shreeshrii
Collaborator Author

Shreeshrii commented Jan 13, 2017 via email

@amitdo
Collaborator

amitdo commented Jan 13, 2017

Also, is there a way for tesseract to create line boxes for a scanned image? It would make it easier to add the ground-truth text if the box dimensions are pre-made.

This feature is not implemented. I will try to implement it sometime in the next few days and send a PR.

@Shreeshrii Shreeshrii changed the title Training Wiki Corrections Training Wiki Corrections/Request for Info Jan 14, 2017
@Shreeshrii
Collaborator Author

Another question:

What effect does the "add a layer" type of training have on the unicharset in the new traineddata?

For "add a layer", a unicharset is required, e.g. lstmtraining -U ~/tesstutorial/bih/bih.unicharset
Does this

  • add to the unicharset from the existing lstm file
  • replace the unicharset from the existing lstm file
  • replace parts of the unicharset in the existing lstm file

Meaning, if we just want to add a few characters to the unicharset, is it enough to have a good sampling of those, or do characters from the lstm unicharset (which are unknown at this point) need to be there too?
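For context, the wiki's add-a-layer recipe looks roughly like the following (a sketch with placeholder paths, per the wiki of that era; the output size in --net_spec, O1c111 here, has to match the number of classes in the new unicharset, which is part of why this question matters):

  # Continue from an existing model, cut it at layer index 5, and append a
  # fresh LSTM layer plus an output layer sized for the new unicharset.
  lstmtraining -U ~/tesstutorial/bih/bih.unicharset \
      --continue_from ~/tesstutorial/bih/eng.lstm \
      --append_index 5 --net_spec '[Lfx256 O1c111]' \
      --model_output ~/tesstutorial/bih/bih \
      --train_listfile ~/tesstutorial/bih/bih.training_files.txt \
      --max_iterations 3600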

@Shreeshrii Shreeshrii changed the title Training Wiki Corrections/Request for Info Training Wiki Updates and Request for Info Jan 14, 2017
@Shreeshrii
Collaborator Author

Traineddata files in tessdata for 4.0 were trained with --perfect_sample_delay 19. The default value for this variable is 4.

The training command examples do not specify this. What is the recommended value for fine tuning and for adding a layer?

@Shreeshrii
Collaborator Author

Shreeshrii commented Jan 15, 2017

@theraysmith

Please see
https://groups.google.com/forum/#!topic/tesseract-ocr/-N5uPdSvJGA
#642
#561

The 'core dumped' error in these cases seems to be related to using --eval_listfile as part of the lstmtraining command, e.g.
--eval_listfile ~/tesstutorial/saneval/san.training_files.txt

Please update the wiki, if you can confirm this, so that people are able to run the tutorial.

Thanks.

@Wikinaut
Contributor

Wikinaut commented Jan 15, 2017

@amitdo Question to you, let me explain as briefly as I can:

  • I have successfully OCRed 700 pages of a book with the LSTM engine, using tesseract in.ppm out -l deu --oem 2 txt.
  • I manually corrected the output file out.txt to out.corrected.txt.

I found certain groups of OCR failures in my scan; two examples which were always wrongly detected:

  • "Citroén" instead of the original word "Citroën"
  • "fiir" instead of "für"

Question

Is there an easy way (I guess it should be possible, and it would be very user-friendly):

  • to easily retrain a Tesseract language (or a copy of it) by re-feeding a corrected txt version?
  • What would the command line be?

@amitdo
Collaborator

amitdo commented Jan 15, 2017

Hi @Wikinaut!

Believe it or not, I haven't yet started playing with training the LSTM engine, so I don't know enough to answer your question. Hopefully, this serious 'bug' will be fixed sometime in the next month :-)

Some observations:
Both 'für' and 'fiir' are in the wordlist.
https://raw.githubusercontent.com/tesseract-ocr/langdata/master/deu/deu.wordlist

'ë' does not appear in the training text; 'é' appears 4 times.
https://github.com/tesseract-ocr/langdata/blob/master/deu/deu.training_text

Café So für
René für
Cafés
André

'für' appears 10 times in the training text.
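These counts can be reproduced with a quick check (note that grep -c counts matching lines, so grep -o ... | wc -l is used here to count individual occurrences):

  # Count occurrences of each string in the training text.
  grep -o 'ë' deu.training_text | wc -l
  grep -o 'é' deu.training_text | wc -l
  grep -o 'für' deu.training_text | wc -l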

OCR Engine modes:
0 Original Tesseract only.
1 Neural nets LSTM only.
2 Tesseract + LSTM.
3 Default, based on what is available.

Did you try --oem 1?

@Wikinaut
Contributor

@amitdo my original text uses a very "bad" font, where the characters overlap very often and often look like, but are not, "ligatures". This explains the "fiir" in many cases (in my case).

I also tried --oem 1, but found that --oem 2 gave the best results. However, I did not find an explanation of what this "mixed operation mode" really does. Please can we add a short text to "2 Tesseract + LSTM"? I can supply a PR, but do not know what a correct and short description would be.

@Wikinaut
Contributor

Wikinaut commented Jan 15, 2017

@amitdo and regarding my question above: can I "quickly" retrain my "deu" traineddata (or a copy of it) with a corrected text? That would be really great.

Promise: some mBitcoins for this today!

@Wikinaut
Contributor

Whoever coded the LSTM: Big APPLAUSE for him or her!

@amitdo
Collaborator

amitdo commented Jan 15, 2017

LSTM - the new OCR engine, based on neural networks.
Tesseract - the old OCR engine (started in the mid-80s) - does character segmentation and shape matching.

@Wikinaut
Contributor

@amitdo yes, but what happens if one selects --oem 2? Are the results of both engines compared, or otherwise evaluated together?

@amitdo
Collaborator

amitdo commented Jan 15, 2017

The two engines run, and the results are combined in some way.

@Wikinaut
Contributor

👍

@amitdo
Collaborator

amitdo commented Jan 15, 2017

As said, I have zero experience training the LSTM engine.

What you want is described here:
https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#fine-tuning-for-impact
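The recipe on that page boils down to continuing training from the existing model on your new line images and ground truth, roughly like this (a sketch with placeholder paths, per the wiki of that era; the training data has to be generated from the corrected text first):

  # Fine tuning: continue from the shipped model on the new training data.
  lstmtraining --continue_from ~/tesstutorial/deu/deu.lstm \
      --model_output ~/tesstutorial/deu_ft/deu \
      --train_listfile ~/tesstutorial/deu/deu.training_files.txt \
      --max_iterations 400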

@Shreeshrii
Collaborator Author

@Wikinaut

my original text uses a very "bad" font, where the characters overlap very often and often look like, but are not, "ligatures". This explains the "fiir" in many cases (in my case).

Please provide a sample image for testing.

@Wikinaut
Contributor

@Shreeshrii

"für" vs.Tesseract: "fiir"

case 1
20170116-07 50 12_auswahl

case 2
20170116-07 52 18_auswahl

case 3
20170116-07 53 20_auswahl

"Citroën" vs. Tesseract: "Citroén"

Case 1
20170116-07 54 14_auswahl

Case 2
20170116-07 55 12_auswahl

@Shreeshrii
Collaborator Author

Shreeshrii commented Jan 16, 2017

ë is not in the training text. It needs to be added; I hope @theraysmith will include it in the next training.

für is being recognized - see the attached output files.
e1ec5bba-dbc0-11e6-9e9a-8e65f50a9d60-oem1-png.txt
beb10966-dbc0-11e6-8b86-f89117a7918c-oem1-png.txt
90011a70-dbc0-11e6-9889-d104dad6822a-oem1-png.txt

though ö was not recognized in one image.

@amitdo
Collaborator

amitdo commented Jan 16, 2017

ë is not in the training text. It needs to be added; I hope @theraysmith will include it in the next training.

https://en.wikipedia.org/wiki/German_language#Orthography
It's not in the German alphabet; it's from French. Still, maybe it should be included in the deu traineddata.

@amitdo
Collaborator

amitdo commented Jan 16, 2017

It does look like 'ii' (two 'i's), doesn't it?

Maybe the training text needs some examples of 'ii' so it can learn to distinguish it from 'ü'.
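A quick way to act on that idea (a sketch; the appended words are just examples of German text containing 'ii'):

  # Check whether 'ii' occurs in the training text at all...
  grep -o 'ii' deu.training_text | wc -l
  # ...and, if not, add a few genuine words that contain it.
  printf '%s\n' 'Gummiisolation' 'Skiinstruktor' >> deu.training_text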

@Wikinaut
Contributor

@Shreeshrii in my conversion, these words "für" were recognized as "fiir". It may be due to the use of "unpaper" as a preprocessor, and/or my use of "-l deu+eng --oem 2" for the conversion.

There were many more occurrences of the false detection "fiir" in my roughly 700 pages of text. This was the most frequent conversion error, and it prompted me to ask you how I could retrain tessdata using my corrected text file. A simple command line would be very helpful for such cases.

@amitdo regarding "ii": in my text, tesseract correctly OCRed "ii" in the words "Gummiisolation" and "Daiichi" (a name).

@Wikinaut
Contributor

Wikinaut commented Jan 16, 2017

@theraysmith You appear to be the expert to answer my question of whether such a procedure for re-training (tesseract + LSTM) is easily possible or not:

(I already described it above:)

Can I "quickly" retrain my "deu" (or "deu+eng") training data (or a copy of it) with a corrected text?

  • in.pdf -> tesseract -> out.txt
  • out.txt -> manually corrected -> corrected.txt
  • retraining tesseract (to get tesseract') with these inputs: in.pdf + corrected.txt

re-running with the re-trained tesseract' should, in the best case, result in

  • in.pdf -> tesseract' -> corrected.txt

I found, but do not (yet) understand, the present training explanations in the wiki; perhaps my idea is not yet covered.

@theraysmith
Contributor

theraysmith commented Jan 17, 2017 via email

@Wikinaut
Contributor

Wikinaut commented Jan 17, 2017

@theraysmith Thank you for your swift answer.

In my case, many "für" were detected as "fiir", both with and without unpaper (I cannot remember exactly, because I tried many different runs).

I will retry - and report here - with only -l deu, in order to present a correct case for reproduction.

@Wikinaut
Contributor

@theraysmith to be more precise:

I tried tesseract with --oem 0, 1 and 2, and found that "2" gave the best results (for a 700-page scan). I reran with and without unpaper and found some differences. And I only used -l deu+eng because my German text contains some English terms. Now that I have a manually corrected reference output text, I can present (later) a kind of matrix with the results.

@amitdo
Collaborator

amitdo commented Jan 20, 2017

@stweil
Member

stweil commented Mar 9, 2017

Some of the problems with German texts were addressed in tesseract-ocr/langdata#54, tesseract-ocr/langdata#56 and tesseract-ocr/langdata#57. I don't know whether those fixes are sufficient to improve future trainings.

@Wikinaut
Contributor

@stweil @amitdo Stefan, please can you also make sure that common words with a https://en.wikipedia.org/wiki/Diaeresis_(diacritic) (German: Trema), like Citroën, are correctly trained?

@stweil
Member

stweil commented Mar 14, 2017

I addressed the more general question of whether all European languages should support all typical diacritical characters in the tesseract-dev forum, and I need information from @theraysmith to proceed.

@stweil
Member

stweil commented Mar 14, 2017

I got "Citroën" correctly recognized by using fra+deu as the language.

I expect that using additional languages has more side effects than just recognizing additional characters, because they also add word lists, unigram frequencies, word bigrams and so on for those languages, which might have a negative effect on OCR results for texts that are mainly written in a single language and make only sparing use of additional languages. Examples of such texts are German texts with foreign person or trademark names, but also English scientific texts with additional Greek characters (a combination often used in mathematics and physics).
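For reference, a minimal sketch of such a combined-language run (the image name is a placeholder; the trade-off above applies):

  # Mainly-German scan; French is added so characters like 'ë' are available.
  tesseract scan.png out -l fra+deu --oem 1 txt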

@Wikinaut
Contributor

@stweil Thanks for your swift answers. Let me know, if I can help.

@amitdo
Collaborator

amitdo commented Aug 14, 2017

Wikinaut, you can try the new best/Latin.traineddata

@Shreeshrii Shreeshrii changed the title Training Wiki Updates and Request for Info Q&A: Training Wiki Updates and Request for Info Sep 11, 2017
@Shreeshrii
Collaborator Author

@Wikinaut

I found certain groups of ocr failures in my scan case, two examples which were always wrongly detected
"Citroén" instead of the original word "Citroën"
"fiir" instead of "für"

Does it work now with best traineddata?

Can I close this issue?

@Wikinaut
Contributor

I have not tried the latest version. Please leave this open - I will close it if it's solved.

@amitdo
Collaborator

amitdo commented Sep 11, 2017

@Wikinaut,

The best/eng.traineddata doesn't have the marks you want.

Try the new best/Latin.traineddata.

@stweil
Member

stweil commented Sep 11, 2017

The problem with "fiir" instead of "für" is a typical example of the ii / ü confusion which still exists in the current best traineddata. The wordlist for best/Latin.traineddata includes "dafiir" (correct: "dafür") and "fiir" (correct: "für"), for example.

@Wikinaut
Contributor

Wikinaut commented Sep 21, 2017

@stweil I now use the new https://github.com/tesseract-ocr/tessdata_best data and found that a problem with lowercase vs. uppercase "s" exists in a 1000-page text.

Typical incorrectly detected word patterns are:

  • "Sich" instead of "sich"
  • "Sie" instead of "sie"
  • "Sagte" instead of "sagte"
  • "Sagen" instead of "sagen"
  • "Sah" instead of "sah"
  • "ICh" instead of "Ich"
  • "80" instead of "so"

@amitdo
Collaborator

amitdo commented Sep 21, 2017

The problem with "fiir" instead of "für" is a typical example of the ii / ü confusion which still exists in the current best traineddata. The wordlist for best/Latin.traineddata includes "dafiir" (correct: "dafür") and "fiir" (correct: "für"), for example.

Try to correct the mistakes in the wordlist and see if it helps to recognize these words.
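A sketch of how that could be done with the stock tools (component names assumed to follow the usual suffix scheme; dawg2wordlist and wordlist2dawg convert between a dawg and a plain wordlist using the matching unicharset):

  # Unpack the LSTM wordlist from the traineddata into a plain text file.
  combine_tessdata -e Latin.traineddata Latin.lstm-unicharset Latin.lstm-word-dawg
  dawg2wordlist Latin.lstm-unicharset Latin.lstm-word-dawg Latin.wordlist

  # After editing Latin.wordlist (e.g. removing 'fiir' and 'dafiir'),
  # rebuild the dawg and write it back into the traineddata file.
  wordlist2dawg Latin.wordlist Latin.lstm-word-dawg Latin.lstm-unicharset
  combine_tessdata -o Latin.traineddata Latin.lstm-word-dawg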

@stweil
Member

stweil commented Sep 21, 2017

... or run Tesseract without a wordlist. I recently removed the wordlists from the best traineddata files to see and compare the real quality of the trained LSTM data; this is impossible when Tesseract uses a wordlist. With wordlists, Tesseract also invents words which don't occur in the original text ("computer" and "Google" in historical documents).

PS: Is there a parameter which disables the post-OCR steps (like wordlist evaluation) in Tesseract, without the need to remove the wordlists from the traineddata files?

@amitdo
Collaborator

amitdo commented Sep 21, 2017

Yes, there is a parameter which disables the wordlist evaluation.

I don't remember its name right now...

@Shreeshrii
Collaborator Author

Please see #960

I guess you can set the following two config variables to false to avoid loading the wordlist dawg files:

load_system_dawg F
load_freq_dawg F
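For example, both variables can be set on the command line without editing any config file (a sketch; image and language are placeholders):

  # Disable loading of the dictionary dawgs at run time via -c.
  tesseract scan.png out -l deu --oem 1 \
      -c load_system_dawg=F -c load_freq_dawg=F txt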

@amitdo
Collaborator

amitdo commented Sep 21, 2017

The parameter is lstm_use_matrix.

@amitdo
Collaborator

amitdo commented Sep 21, 2017

I guess you can set the following two config variables to false to avoid loading the wordlist dawg files:
load_system_dawg F
load_freq_dawg F

load_system_dawg should work.

load_freq_dawg seems to have no impact on the LSTM recognizer.

@Shreeshrii
Collaborator Author

Those config variables relate to the legacy engine. New traineddata files have a different lstm-word-dawg and no freq-dawg files, so I am not sure whether they will work. I haven't tried it yet.

@amitdo
Collaborator

amitdo commented Sep 21, 2017

void Dict::LoadLSTM(const STRING &lang, TessdataManager *data_file) {

@stweil
Member

stweil commented Sep 21, 2017

I wonder why LSTM needs its own word list. I'd expect that a word list is different for different languages, and it is also reasonable to use different word lists for different kinds of text (topic, date) of the same language, but it should not depend on the OCR algorithm.

@Shreeshrii
Collaborator Author

Shreeshrii commented Sep 21, 2017

It is not that the wordlist is different; it is that the legacy engine and the LSTM models might be using different unicharsets.

The creation and unpacking of dawgs requires a unicharset; that is why there are two sets of dawg files - even for numbers and punctuation - in addition to the wordlist.
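This pairing is easy to see by unpacking a traineddata file (a sketch; component names assumed to follow the usual suffix scheme):

  # Unpack all components; each engine's dawgs sit next to its own unicharset.
  combine_tessdata -u deu.traineddata deu.
  ls deu.unicharset deu.word-dawg deu.lstm-unicharset deu.lstm-word-dawg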
