only 104 pages used for lstmf instead of the original amount #1508

Closed
ghost opened this issue Apr 22, 2018 · 2 comments

ghost commented Apr 22, 2018

I am using a .txt file containing ~9 million lines, with more than 42 million unique words.
In the "Generating lstmf files" step, the log shows: Loaded 104/104 pages (1-104)
My original .txt file is ~570 MB and contains more than 100,000 pages, but the generated lstmf is only 445 KB and the initial generated traineddata is 1.3 MB.
Does Tesseract stop loading pages after hitting "No block overlapping textline", or is there a size/threshold limit?

tesstrain log:


=== Starting training for language 'ara'
[Sun Apr 22 08:54:59 PDT 2018] /usr/bin/text2image --fonts_dir=../fonts --font=Arial, --outputbase=/tmp/font_tmp.TfAHElS9Km/sample_text.txt --text=/tmp/font_tmp.TfAHElS9Km/sample_text.txt --fontconfig_tmpdir=/tmp/font_tmp.TfAHElS9Km
Rendered page 0 to file /tmp/font_tmp.TfAHElS9Km/sample_text.txt.tif

=== Phase I: Generating training images ===
Rendering using Arial,
[Sun Apr 22 08:55:01 PDT 2018] /usr/bin/text2image --fontconfig_tmpdir=/tmp/font_tmp.TfAHElS9Km --fonts_dir=../fonts --strip_unrenderable_words --leading=32 --char_spacing=0.0 --exposure=0 --outputbase=/tmp/tmp.FTa4hC8XeN/ara/ara.Arial.exp0 --max_pages=3 --font=Arial, --text=../text/arabic.txt
Rendered page 0 to file /tmp/tmp.FTa4hC8XeN/ara/ara.Arial.exp0.tif
Rendered page 1 to file /tmp/tmp.FTa4hC8XeN/ara/ara.Arial.exp0.tif
Rendered page 2 to file /tmp/tmp.FTa4hC8XeN/ara/ara.Arial.exp0.tif

=== Phase UP: Generating unicharset and unichar properties files ===
[Sun Apr 22 08:56:51 PDT 2018] /usr/bin/unicharset_extractor --output_unicharset /tmp/tmp.FTa4hC8XeN/ara/ara.unicharset --norm_mode 2 /tmp/tmp.FTa4hC8XeN/ara/ara.Arial.exp0.box
Extracting unicharset from box file /tmp/tmp.FTa4hC8XeN/ara/ara.Arial.exp0.box
Wrote unicharset file /tmp/tmp.FTa4hC8XeN/ara/ara.unicharset
[Sun Apr 22 08:56:53 PDT 2018] /usr/bin/set_unicharset_properties -U /tmp/tmp.FTa4hC8XeN/ara/ara.unicharset -O /tmp/tmp.FTa4hC8XeN/ara/ara.unicharset -X /tmp/tmp.FTa4hC8XeN/ara/ara.xheights --script_dir=../langdata
Loaded unicharset of size 38 from file /tmp/tmp.FTa4hC8XeN/ara/ara.unicharset
Setting unichar properties
Setting script properties
Writing unicharset to file /tmp/tmp.FTa4hC8XeN/ara/ara.unicharset

=== Phase E: Generating lstmf files ===
Using TESSDATA_PREFIX=./tessdata
[Sun Apr 22 08:56:54 PDT 2018] /usr/bin/tesseract /tmp/tmp.FTa4hC8XeN/ara/ara.Arial.exp0.tif /tmp/tmp.FTa4hC8XeN/ara/ara.Arial.exp0 lstm.train ../langdata/ara/ara.config
Tesseract Open Source OCR Engine v4.0.0-beta.1 with Leptonica
Page 1
Page 2
Loaded 52/52 pages (1-52) of document /tmp/tmp.FTa4hC8XeN/ara/ara.Arial.exp0.lstmf
Page 3
Loaded 104/104 pages (1-104) of document /tmp/tmp.FTa4hC8XeN/ara/ara.Arial.exp0.lstmf
No block overlapping textline: ضجأ صجأ شجأ سجأ زجأ رجأ ذجأ دجأ خجأ حجأ ججأ ثجأ

=== Constructing LSTM training data ===
[Sun Apr 22 08:57:00 PDT 2018] /usr/bin/combine_lang_model --input_unicharset /tmp/tmp.FTa4hC8XeN/ara/ara.unicharset --script_dir ../langdata --words ../langdata/ara/ara.wordlist --numbers ../langdata/ara/ara.numbers --puncs ../langdata/ara/ara.punc --output_dir ../out --lang ara --pass_through_recoder --lang_is_rtl
Loaded unicharset of size 38 from file /tmp/tmp.FTa4hC8XeN/ara/ara.unicharset
Setting unichar properties
Setting script properties
Config file is optional, continuing...
Reducing Trie to SquishedDawg
Reducing Trie to SquishedDawg
Reducing Trie to SquishedDawg
Moving /tmp/tmp.FTa4hC8XeN/ara/ara.Arial.exp0.lstmf to ../out

Completed training for language 'ara'

Shreeshrii (Collaborator) commented
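The log shows text2image being called with --max_pages=3, so only the first three pages of the training text are rendered.

Change 3 to 0 in training/tesstrain_utils.sh (0 removes the page limit).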
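
One way to make that edit from the shell is sketched below. It assumes the script passes the flag in the literal form `--max_pages=N`, as the text2image command in the log suggests; the helper name `set_max_pages_unlimited` is hypothetical and not part of Tesseract.

```shell
# Hypothetical helper: set the text2image page cap in a tesstrain script to 0.
# Assumption: the flag appears in the file as "--max_pages=<number>".
set_max_pages_unlimited() {
    script="$1"
    # -i.bak edits in place and keeps a backup copy next to the file;
    # this spelling is accepted by both GNU and BSD sed.
    sed -i.bak 's/--max_pages=[0-9][0-9]*/--max_pages=0/' "$script"
    # Print the flag as it now appears so the change can be checked by eye.
    grep -o -- '--max_pages=[0-9]*' "$script"
}
```

Usage would be something like `set_max_pages_unlimited training/tesstrain_utils.sh`, run from the root of the Tesseract source tree, followed by re-running the training script.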


ghost commented Apr 22, 2018

Confirmed: setting max_pages=0 solves it. Thanks for the pull request, @Shreeshrii.
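
For anyone wanting to check the same thing on their own run, a rough sketch: count the "Rendered page N to file ..." lines that text2image prints for the training image. It assumes the training output was saved to a file (the name `tesstrain.log` in the usage line is hypothetical), and the helper name `count_rendered_pages` is made up for illustration.

```shell
# Hypothetical helper: count pages text2image reported rendering for the
# training image, using the "Rendered page N to file ..." lines in a saved log.
count_rendered_pages() {
    log="$1"
    # Match only the .exp0.tif training image, so the single sample_text
    # render at the start of the run is not counted.
    grep -c 'Rendered page [0-9]* to file .*\.exp0\.tif' "$log"
}
```

Running `count_rendered_pages tesstrain.log` on the log posted above would report 3; after the change the count should be far higher for a large input text.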

ghost closed this as completed Apr 22, 2018
noahmetzger pushed a commit to noahmetzger/tesseract that referenced this issue Jul 31, 2018