Creating training data using tesstrain.sh #39
Comments
For the LSTM model, use --langdata_dir langdata_lstm. You can limit the number of pages if you are only fine-tuning.
So if I want to train an LSTM model from scratch that reaches the accuracy of the official Tesseract LSTM model, what training data do I need to create, and how?
Thanks @Shreeshrii. I went over this documentation and something is still not clear to me. When I follow the instructions, the script creates a TIFF file with ~50 lines per page and ~3700 pages, which comes to about 185,000 lines of text for a single font. The instructions say to use ~4000 fonts for English, so the total number of lines created would be 4000 * 185,000 = 740,000,000, whereas according to this post (tesseract-ocr/tesseract#654 (comment)) the training set comprises only 400,000-800,000 text lines. What am I missing?
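To make the mismatch concrete, here is the same arithmetic as a quick sketch. The inputs are the approximate figures quoted above (~50 lines per page, ~3700 pages per font, ~4000 fonts, and the reported 400,000-800,000 text lines); the variable names are only illustrative and nothing here comes from the training scripts themselves.

```python
# Approximate figures quoted in the comment above (not measured values).
lines_per_page = 50
pages_per_font = 3700
fonts = 4000

# Reported training set size range from the linked post.
reported_low, reported_high = 400_000, 800_000

lines_per_font = lines_per_page * pages_per_font
total_lines = lines_per_font * fonts

print(f"lines per font:      {lines_per_font:,}")
print(f"lines for all fonts: {total_lines:,}")

# How far the naive total overshoots the reported upper bound.
overshoot = total_lines / reported_high
print(f"overshoot vs. reported upper bound: {overshoot:.0f}x")

# How many fonts the reported range would cover at full per-font volume.
fonts_low = reported_low / lines_per_font
fonts_high = reported_high / lines_per_font
print(f"fonts covered at full volume: {fonts_low:.1f} to {fonts_high:.1f}")
```

Under these assumptions the reported range corresponds to only a handful of fonts at full per-font volume, so the full text was presumably not rendered in every font, although the thread does not confirm how the lines were actually distributed.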
Our knowledge about the training method is based on Ray Smith's posts and comments. It is possible that he experimented with different settings and that posts from different times reflect that. https://github.com/tesseract-ocr/tessdoc/blob/master/Data-Files-in-tessdata_fast.md shows the following version string for the English traineddata: 4.00.00alpha:eng:synth20170629, while for tessdata_best it is eng. Compare the number of iterations to see the difference. I haven't seen any post where someone has been able to replicate his results.
Hello. d57b942
It is not clear to me, when creating training data with tesstrain.sh for the LSTM model, whether I should use --langdata_dir langdata_lstm or --langdata_dir langdata.
It affects which eng.training_text file will be used to generate the training data.
Which should I use?