I am using a .txt file containing ~9 million lines, with more than 42 million unique (de-duplicated) words.
In the "Generating lstmf files" step, the log shows "Loaded 104/104 pages (1-104)".
My original txt file is ~570 MB and contains more than 100,000 pages, but the generated lstmf is only 445 KB and the initial generated traineddata is 1.3 MB. Does Tesseract stop loading pages after it hits "No block overlapping textline", or is there a size/threshold limit?
tesstrain log:
=== Starting training for language 'ara'
[Sun Apr 22 08:54:59 PDT 2018] /usr/bin/text2image --fonts_dir=../fonts --font=Arial, --outputbase=/tmp/font_tmp.TfAHElS9Km/sample_text.txt --text=/tmp/font_tmp.TfAHElS9Km/sample_text.txt --fontconfig_tmpdir=/tmp/font_tmp.TfAHElS9Km
Rendered page 0 to file /tmp/font_tmp.TfAHElS9Km/sample_text.txt.tif
=== Phase I: Generating training images ===
Rendering using Arial,
[Sun Apr 22 08:55:01 PDT 2018] /usr/bin/text2image --fontconfig_tmpdir=/tmp/font_tmp.TfAHElS9Km --fonts_dir=../fonts --strip_unrenderable_words --leading=32 --char_spacing=0.0 --exposure=0 --outputbase=/tmp/tmp.FTa4hC8XeN/ara/ara.Arial.exp0 --max_pages=3 --font=Arial, --text=../text/arabic.txt
Rendered page 0 to file /tmp/tmp.FTa4hC8XeN/ara/ara.Arial.exp0.tif
Rendered page 1 to file /tmp/tmp.FTa4hC8XeN/ara/ara.Arial.exp0.tif
Rendered page 2 to file /tmp/tmp.FTa4hC8XeN/ara/ara.Arial.exp0.tif
=== Phase UP: Generating unicharset and unichar properties files ===
[Sun Apr 22 08:56:51 PDT 2018] /usr/bin/unicharset_extractor --output_unicharset /tmp/tmp.FTa4hC8XeN/ara/ara.unicharset --norm_mode 2 /tmp/tmp.FTa4hC8XeN/ara/ara.Arial.exp0.box
Extracting unicharset from box file /tmp/tmp.FTa4hC8XeN/ara/ara.Arial.exp0.box
Wrote unicharset file /tmp/tmp.FTa4hC8XeN/ara/ara.unicharset
[Sun Apr 22 08:56:53 PDT 2018] /usr/bin/set_unicharset_properties -U /tmp/tmp.FTa4hC8XeN/ara/ara.unicharset -O /tmp/tmp.FTa4hC8XeN/ara/ara.unicharset -X /tmp/tmp.FTa4hC8XeN/ara/ara.xheights --script_dir=../langdata
Loaded unicharset of size 38 from file /tmp/tmp.FTa4hC8XeN/ara/ara.unicharset
Setting unichar properties
Setting script properties
Writing unicharset to file /tmp/tmp.FTa4hC8XeN/ara/ara.unicharset
=== Phase E: Generating lstmf files ===
Using TESSDATA_PREFIX=./tessdata
[Sun Apr 22 08:56:54 PDT 2018] /usr/bin/tesseract /tmp/tmp.FTa4hC8XeN/ara/ara.Arial.exp0.tif /tmp/tmp.FTa4hC8XeN/ara/ara.Arial.exp0 lstm.train ../langdata/ara/ara.config
Tesseract Open Source OCR Engine v4.0.0-beta.1 with Leptonica
Page 1
Page 2
Loaded 52/52 pages (1-52) of document /tmp/tmp.FTa4hC8XeN/ara/ara.Arial.exp0.lstmf
Page 3
Loaded 104/104 pages (1-104) of document /tmp/tmp.FTa4hC8XeN/ara/ara.Arial.exp0.lstmf
No block overlapping textline: ضجأ صجأ شجأ سجأ زجأ رجأ ذجأ دجأ خجأ حجأ ججأ ثجأ
=== Constructing LSTM training data ===
[Sun Apr 22 08:57:00 PDT 2018] /usr/bin/combine_lang_model --input_unicharset /tmp/tmp.FTa4hC8XeN/ara/ara.unicharset --script_dir ../langdata --words ../langdata/ara/ara.wordlist --numbers ../langdata/ara/ara.numbers --puncs ../langdata/ara/ara.punc --output_dir ../out --lang ara --pass_through_recoder --lang_is_rtl
Loaded unicharset of size 38 from file /tmp/tmp.FTa4hC8XeN/ara/ara.unicharset
Setting unichar properties
Setting script properties
Config file is optional, continuing...
Reducing Trie to SquishedDawg
Reducing Trie to SquishedDawg
Reducing Trie to SquishedDawg
Moving /tmp/tmp.FTa4hC8XeN/ara/ara.Arial.exp0.lstmf to ../out
Completed training for language 'ara'
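One detail worth checking in the log above: the Phase I text2image command is invoked with --max_pages=3, and exactly three "Rendered page" lines follow for ara.Arial.exp0.tif. The log alone does not confirm that this cap is what limits the lstmf size. The following is a minimal Python sketch for pulling those two numbers out of a saved tesstrain log so they can be compared. The function name summarize_rendering and the abbreviated sample_log string are hypothetical, introduced only for illustration; the two line formats it matches are the ones that appear in the log above.

```python
import re
from collections import Counter


def summarize_rendering(log_text):
    """Return (list of --max_pages values, rendered-page count per output file)."""
    # Every --max_pages=N flag found on any command line in the log.
    max_pages = [int(n) for n in re.findall(r"--max_pages=(\d+)", log_text)]
    # Count "Rendered page K to file PATH" lines, grouped by PATH.
    rendered = Counter(
        re.findall(r"^Rendered page \d+ to file (\S+)", log_text, flags=re.MULTILINE)
    )
    return max_pages, dict(rendered)


# Abbreviated excerpt of the log above (most text2image flags omitted for brevity).
sample_log = """\
/usr/bin/text2image --leading=32 --max_pages=3 --font=Arial, --text=../text/arabic.txt
Rendered page 0 to file /tmp/tmp.FTa4hC8XeN/ara/ara.Arial.exp0.tif
Rendered page 1 to file /tmp/tmp.FTa4hC8XeN/ara/ara.Arial.exp0.tif
Rendered page 2 to file /tmp/tmp.FTa4hC8XeN/ara/ara.Arial.exp0.tif
"""

max_pages, rendered = summarize_rendering(sample_log)
print("max_pages flags:", max_pages)
print("rendered pages per file:", rendered)
```

If the rendered-page count always equals the --max_pages value regardless of the input text size, the page cap rather than the "No block overlapping textline" message would be the first thing to investigate.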