only 104 pages used for lstmf instead of the original amount #1508

ghost · 2018-04-22T16:40:33Z

I am using a .txt file containing ~9 million lines, with over 42 million unique, non-duplicated words.
In the "Generating lstmf files" step, the log shows Loaded 104/104 pages (1-104).
My original text file is ~570 MB and contains more than 100,000 pages, but the generated lstmf is only 445 KB and the initial generated traineddata is 1.3 MB.
Does Tesseract stop loading pages after hitting "No block overlapping textline"? Or is there a size/threshold limit?

tesstrain log:


=== Starting training for language 'ara'
[Sun Apr 22 08:54:59 PDT 2018] /usr/bin/text2image --fonts_dir=../fonts --font=Arial, --outputbase=/tmp/font_tmp.TfAHElS9Km/sample_text.txt --text=/tmp/font_tmp.TfAHElS9Km/sample_text.txt --fontconfig_tmpdir=/tmp/font_tmp.TfAHElS9Km
Rendered page 0 to file /tmp/font_tmp.TfAHElS9Km/sample_text.txt.tif

=== Phase I: Generating training images ===
Rendering using Arial,
[Sun Apr 22 08:55:01 PDT 2018] /usr/bin/text2image --fontconfig_tmpdir=/tmp/font_tmp.TfAHElS9Km --fonts_dir=../fonts --strip_unrenderable_words --leading=32 --char_spacing=0.0 --exposure=0 --outputbase=/tmp/tmp.FTa4hC8XeN/ara/ara.Arial.exp0 --max_pages=3 --font=Arial, --text=../text/arabic.txt
Rendered page 0 to file /tmp/tmp.FTa4hC8XeN/ara/ara.Arial.exp0.tif
Rendered page 1 to file /tmp/tmp.FTa4hC8XeN/ara/ara.Arial.exp0.tif
Rendered page 2 to file /tmp/tmp.FTa4hC8XeN/ara/ara.Arial.exp0.tif

=== Phase UP: Generating unicharset and unichar properties files ===
[Sun Apr 22 08:56:51 PDT 2018] /usr/bin/unicharset_extractor --output_unicharset /tmp/tmp.FTa4hC8XeN/ara/ara.unicharset --norm_mode 2 /tmp/tmp.FTa4hC8XeN/ara/ara.Arial.exp0.box
Extracting unicharset from box file /tmp/tmp.FTa4hC8XeN/ara/ara.Arial.exp0.box
Wrote unicharset file /tmp/tmp.FTa4hC8XeN/ara/ara.unicharset
[Sun Apr 22 08:56:53 PDT 2018] /usr/bin/set_unicharset_properties -U /tmp/tmp.FTa4hC8XeN/ara/ara.unicharset -O /tmp/tmp.FTa4hC8XeN/ara/ara.unicharset -X /tmp/tmp.FTa4hC8XeN/ara/ara.xheights --script_dir=../langdata
Loaded unicharset of size 38 from file /tmp/tmp.FTa4hC8XeN/ara/ara.unicharset
Setting unichar properties
Setting script properties
Writing unicharset to file /tmp/tmp.FTa4hC8XeN/ara/ara.unicharset

=== Phase E: Generating lstmf files ===
Using TESSDATA_PREFIX=./tessdata
[Sun Apr 22 08:56:54 PDT 2018] /usr/bin/tesseract /tmp/tmp.FTa4hC8XeN/ara/ara.Arial.exp0.tif /tmp/tmp.FTa4hC8XeN/ara/ara.Arial.exp0 lstm.train ../langdata/ara/ara.config
Tesseract Open Source OCR Engine v4.0.0-beta.1 with Leptonica
Page 1
Page 2
Loaded 52/52 pages (1-52) of document /tmp/tmp.FTa4hC8XeN/ara/ara.Arial.exp0.lstmf
Page 3
Loaded 104/104 pages (1-104) of document /tmp/tmp.FTa4hC8XeN/ara/ara.Arial.exp0.lstmf
No block overlapping textline: ضجأ صجأ شجأ سجأ زجأ رجأ ذجأ دجأ خجأ حجأ ججأ ثجأ

=== Constructing LSTM training data ===
[Sun Apr 22 08:57:00 PDT 2018] /usr/bin/combine_lang_model --input_unicharset /tmp/tmp.FTa4hC8XeN/ara/ara.unicharset --script_dir ../langdata --words ../langdata/ara/ara.wordlist --numbers ../langdata/ara/ara.numbers --puncs ../langdata/ara/ara.punc --output_dir ../out --lang ara --pass_through_recoder --lang_is_rtl
Loaded unicharset of size 38 from file /tmp/tmp.FTa4hC8XeN/ara/ara.unicharset
Setting unichar properties
Setting script properties
Config file is optional, continuing...
Reducing Trie to SquishedDawg
Reducing Trie to SquishedDawg
Reducing Trie to SquishedDawg
Moving /tmp/tmp.FTa4hC8XeN/ara/ara.Arial.exp0.lstmf to ../out

Completed training for language 'ara'


Shreeshrii · 2018-04-22T16:45:01Z

The text2image command in your log is run with --max_pages=3.

Change the 3 to 0 in training/tesstrain_utils.sh.
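The `--max_pages=3` flag is visible on the Phase I `text2image` command line in the log, which is consistent with only three pages being rendered regardless of the size of the training text. Below is a minimal Python sketch of the suggested edit. It assumes the flag appears literally as `--max_pages=<number>` inside the script; the helper name `set_max_pages` is hypothetical and not part of Tesseract.

```python
import re


def set_max_pages(script_text: str, value: int = 0) -> str:
    """Return script_text with every --max_pages=<n> flag set to `value`."""
    # Assumption: the flag is written literally as --max_pages=<digits>.
    return re.sub(r"--max_pages=\d+", f"--max_pages={value}", script_text)


# Example arguments modelled on the text2image invocation in the log.
line = "--leading=32 --char_spacing=0.0 --exposure=0 --max_pages=3"
patched = set_max_pages(line)
print(patched)
```

To apply this to the real file, one could read `training/tesstrain_utils.sh` as text, pass it through the helper, and write the result back, ideally after making a backup copy of the script.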

ghost · 2018-04-22T17:21:24Z

Confirmed: max_pages=0 solves it. Thanks for the pull request, @Shreeshrii.
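One way to check a tesstrain log for this problem is to compare the `--max_pages` value passed to `text2image` with the number of "Rendered page" lines that follow. The sketch below assumes the log format shown in this issue; the function name `summarize_rendering` is hypothetical, and the sample text is an abridged excerpt of the Phase I output rather than a full log.

```python
import re


def summarize_rendering(log_text: str):
    """Return (list of --max_pages values, number of 'Rendered page' lines)."""
    # Every --max_pages=<n> found on a command line in the log.
    limits = [int(n) for n in re.findall(r"--max_pages=(\d+)", log_text)]
    # Each rendered page is reported on its own line.
    rendered = len(
        re.findall(r"^Rendered page \d+ to file", log_text, flags=re.MULTILINE)
    )
    return limits, rendered


# Abridged excerpt of the Phase I section of the log in this issue.
sample = """\
/usr/bin/text2image --leading=32 --max_pages=3 --font=Arial, --text=../text/arabic.txt
Rendered page 0 to file ara.Arial.exp0.tif
Rendered page 1 to file ara.Arial.exp0.tif
Rendered page 2 to file ara.Arial.exp0.tif
"""

limits, rendered = summarize_rendering(sample)
print(limits, rendered)
```

If the rendered count equals a non-zero limit, the page cap is the likely reason the lstmf file is small.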

Fixes tesseract-ocr#1149 and tesseract-ocr#1508

Shreeshrii mentioned this issue Apr 22, 2018

Change max_pages to zero #1509

Merged

ghost closed this as completed Apr 22, 2018

noahmetzger pushed a commit to noahmetzger/tesseract that referenced this issue Jul 31, 2018

Change max_pages to zero

8e47438

Fixes tesseract-ocr#1149 and tesseract-ocr#1508

This issue was closed.
