LSTM: Training - Arabic - Add Top layer - Aborted (core dumped) #642

Shreeshrii · 2017-01-07T04:59:09Z

While "add top layer" LSTM training worked for languages with a Latin-based unicharset (eng, nor), it is failing for Arabic.

Below I am copying the log for creating the lstmf files, followed by the log for the training.

Shreeshrii · 2017-01-07T05:01:16Z

$ training/tesstrain.sh --fonts_dir /home/shree/.fonts --lang ara --linedata_only --noextract_font_properties \
>   --langdata_dir ../langdata --tessdata_dir ./tessdata --output_dir ~/tesstutorial/aralayer

=== Starting training for language 'ara'
[Sat Jan 7 10:09:33 DST 2017] /usr/local/bin/text2image --fonts_dir=/home/shree/.fonts --font=Arial Unicode MS --outputbase=/tmp/font_tmp.0Tqbe3jIFz/sample_text.txt --text
=/tmp/font_tmp.0Tqbe3jIFz/sample_text.txt --fontconfig_tmpdir=/tmp/font_tmp.0Tqbe3jIFz
Rendered page 0 to file /tmp/font_tmp.0Tqbe3jIFz/sample_text.txt.tif

=== Phase I: Generating training images ===
Rendering using Arial Unicode MS
Rendering using Amiri
Rendering using Arial
Rendering using Scheherazade
Rendering using Calibri
Rendering using Tahoma
Rendering using FreeSerif
Rendering using Microsoft Sans Serif
[Sat Jan 7 10:09:43 DST 2017] /usr/local/bin/text2image --fontconfig_tmpdir=/tmp/font_tmp.0Tqbe3jIFz --fonts_dir=/home/shree/.fonts --strip_unrenderable_words --leading=32
 --char_spacing=0.0 --exposure=0 --outputbase=/tmp/tmp.Ey23alPX8e/ara/ara.Arial_Unicode_MS.exp0 --font=Arial Unicode MS --text=../langdata/ara/ara.training_text
[Sat Jan 7 10:09:43 DST 2017] /usr/local/bin/text2image --fontconfig_tmpdir=/tmp/font_tmp.0Tqbe3jIFz --fonts_dir=/home/shree/.fonts --strip_unrenderable_words --leading=32
 --char_spacing=0.0 --exposure=0 --outputbase=/tmp/tmp.Ey23alPX8e/ara/ara.Amiri.exp0 --font=Amiri --text=../langdata/ara/ara.training_text
[Sat Jan 7 10:09:43 DST 2017] /usr/local/bin/text2image --fontconfig_tmpdir=/tmp/font_tmp.0Tqbe3jIFz --fonts_dir=/home/shree/.fonts --strip_unrenderable_words --leading=32
 --char_spacing=0.0 --exposure=0 --outputbase=/tmp/tmp.Ey23alPX8e/ara/ara.Arial.exp0 --font=Arial --text=../langdata/ara/ara.training_text
[Sat Jan 7 10:09:43 DST 2017] /usr/local/bin/text2image --fontconfig_tmpdir=/tmp/font_tmp.0Tqbe3jIFz --fonts_dir=/home/shree/.fonts --strip_unrenderable_words --leading=32
 --char_spacing=0.0 --exposure=0 --outputbase=/tmp/tmp.Ey23alPX8e/ara/ara.Scheherazade.exp0 --font=Scheherazade --text=../langdata/ara/ara.training_text
[Sat Jan 7 10:09:43 DST 2017] /usr/local/bin/text2image --fontconfig_tmpdir=/tmp/font_tmp.0Tqbe3jIFz --fonts_dir=/home/shree/.fonts --strip_unrenderable_words --leading=32
 --char_spacing=0.0 --exposure=0 --outputbase=/tmp/tmp.Ey23alPX8e/ara/ara.Calibri.exp0 --font=Calibri --text=../langdata/ara/ara.training_text
[Sat Jan 7 10:09:43 DST 2017] /usr/local/bin/text2image --fontconfig_tmpdir=/tmp/font_tmp.0Tqbe3jIFz --fonts_dir=/home/shree/.fonts --strip_unrenderable_words --leading=32
 --char_spacing=0.0 --exposure=0 --outputbase=/tmp/tmp.Ey23alPX8e/ara/ara.Tahoma.exp0 --font=Tahoma --text=../langdata/ara/ara.training_text
[Sat Jan 7 10:09:43 DST 2017] /usr/local/bin/text2image --fontconfig_tmpdir=/tmp/font_tmp.0Tqbe3jIFz --fonts_dir=/home/shree/.fonts --strip_unrenderable_words --leading=32
 --char_spacing=0.0 --exposure=0 --outputbase=/tmp/tmp.Ey23alPX8e/ara/ara.FreeSerif.exp0 --font=FreeSerif --text=../langdata/ara/ara.training_text
Stripped 15 unrenderable words
[Sat Jan 7 10:09:43 DST 2017] /usr/local/bin/text2image --fontconfig_tmpdir=/tmp/font_tmp.0Tqbe3jIFz --fonts_dir=/home/shree/.fonts --strip_unrenderable_words --leading=32
 --char_spacing=0.0 --exposure=0 --outputbase=/tmp/tmp.Ey23alPX8e/ara/ara.Microsoft_Sans_Serif.exp0 --font=Microsoft Sans Serif --text=../langdata/ara/ara.training_text
Stripped 15 unrenderable words
Stripped 15 unrenderable words
Stripped 2 unrenderable words
Stripped 15 unrenderable words
Stripped 15 unrenderable words
Stripped 13 unrenderable words
Stripped 15 unrenderable words
Rendered page 0 to file /tmp/tmp.Ey23alPX8e/ara/ara.Arial_Unicode_MS.exp0.tif
Rendered page 0 to file /tmp/tmp.Ey23alPX8e/ara/ara.Scheherazade.exp0.tif
Rendered page 0 to file /tmp/tmp.Ey23alPX8e/ara/ara.Amiri.exp0.tif
Rendered page 0 to file /tmp/tmp.Ey23alPX8e/ara/ara.Calibri.exp0.tif
Rendered page 0 to file /tmp/tmp.Ey23alPX8e/ara/ara.Tahoma.exp0.tif
Rendered page 0 to file /tmp/tmp.Ey23alPX8e/ara/ara.Arial.exp0.tif
Rendered page 0 to file /tmp/tmp.Ey23alPX8e/ara/ara.FreeSerif.exp0.tif
Rendered page 0 to file /tmp/tmp.Ey23alPX8e/ara/ara.Microsoft_Sans_Serif.exp0.tif
Rendered page 1 to file /tmp/tmp.Ey23alPX8e/ara/ara.Arial_Unicode_MS.exp0.tif
Rendered page 1 to file /tmp/tmp.Ey23alPX8e/ara/ara.Scheherazade.exp0.tif
Rendered page 1 to file /tmp/tmp.Ey23alPX8e/ara/ara.Amiri.exp0.tif
Rendered page 1 to file /tmp/tmp.Ey23alPX8e/ara/ara.FreeSerif.exp0.tif
Rendered page 1 to file /tmp/tmp.Ey23alPX8e/ara/ara.Tahoma.exp0.tif
Rendered page 1 to file /tmp/tmp.Ey23alPX8e/ara/ara.Microsoft_Sans_Serif.exp0.tif
Rendered page 1 to file /tmp/tmp.Ey23alPX8e/ara/ara.Arial.exp0.tif
Rendered page 1 to file /tmp/tmp.Ey23alPX8e/ara/ara.Calibri.exp0.tif
Rendered page 2 to file /tmp/tmp.Ey23alPX8e/ara/ara.Arial_Unicode_MS.exp0.tif
Rendered page 2 to file /tmp/tmp.Ey23alPX8e/ara/ara.Scheherazade.exp0.tif
Rendered page 2 to file /tmp/tmp.Ey23alPX8e/ara/ara.Amiri.exp0.tif
Rendered page 2 to file /tmp/tmp.Ey23alPX8e/ara/ara.Tahoma.exp0.tif
Rendered page 2 to file /tmp/tmp.Ey23alPX8e/ara/ara.FreeSerif.exp0.tif
Rendering using Times New Roman,
Rendering using Courier New
Rendering using Traditional Arabic
[Sat Jan 7 10:10:02 DST 2017] /usr/local/bin/text2image --fontconfig_tmpdir=/tmp/font_tmp.0Tqbe3jIFz --fonts_dir=/home/shree/.fonts --strip_unrenderable_words --leading=32
 --char_spacing=0.0 --exposure=0 --outputbase=/tmp/tmp.Ey23alPX8e/ara/ara.Times_New_Roman.exp0 --font=Times New Roman, --text=../langdata/ara/ara.training_text
[Sat Jan 7 10:10:03 DST 2017] /usr/local/bin/text2image --fontconfig_tmpdir=/tmp/font_tmp.0Tqbe3jIFz --fonts_dir=/home/shree/.fonts --strip_unrenderable_words --leading=32
 --char_spacing=0.0 --exposure=0 --outputbase=/tmp/tmp.Ey23alPX8e/ara/ara.Courier_New.exp0 --font=Courier New --text=../langdata/ara/ara.training_text
[Sat Jan 7 10:10:03 DST 2017] /usr/local/bin/text2image --fontconfig_tmpdir=/tmp/font_tmp.0Tqbe3jIFz --fonts_dir=/home/shree/.fonts --strip_unrenderable_words --leading=32
 --char_spacing=0.0 --exposure=0 --outputbase=/tmp/tmp.Ey23alPX8e/ara/ara.Traditional_Arabic.exp0 --font=Traditional Arabic --text=../langdata/ara/ara.training_text
Stripped 15 unrenderable words
Stripped 15 unrenderable words
Stripped 15 unrenderable words
Rendered page 0 to file /tmp/tmp.Ey23alPX8e/ara/ara.Times_New_Roman.exp0.tif
Rendered page 0 to file /tmp/tmp.Ey23alPX8e/ara/ara.Courier_New.exp0.tif
Rendered page 0 to file /tmp/tmp.Ey23alPX8e/ara/ara.Traditional_Arabic.exp0.tif
Rendered page 1 to file /tmp/tmp.Ey23alPX8e/ara/ara.Times_New_Roman.exp0.tif
Rendered page 1 to file /tmp/tmp.Ey23alPX8e/ara/ara.Traditional_Arabic.exp0.tif
Rendered page 1 to file /tmp/tmp.Ey23alPX8e/ara/ara.Courier_New.exp0.tif
Rendered page 2 to file /tmp/tmp.Ey23alPX8e/ara/ara.Traditional_Arabic.exp0.tif

=== Phase UP: Generating unicharset and unichar properties files ===
[Sat Jan 7 10:10:13 DST 2017] /usr/local/bin/unicharset_extractor -D /tmp/tmp.Ey23alPX8e/ara/ /tmp/tmp.Ey23alPX8e/ara/ara.Amiri.exp0.box /tmp/tmp.Ey23alPX8e/ara/ara.Arial.
exp0.box /tmp/tmp.Ey23alPX8e/ara/ara.Arial_Unicode_MS.exp0.box /tmp/tmp.Ey23alPX8e/ara/ara.Calibri.exp0.box /tmp/tmp.Ey23alPX8e/ara/ara.Courier_New.exp0.box /tmp/tmp.Ey23a
lPX8e/ara/ara.FreeSerif.exp0.box /tmp/tmp.Ey23alPX8e/ara/ara.Microsoft_Sans_Serif.exp0.box /tmp/tmp.Ey23alPX8e/ara/ara.Scheherazade.exp0.box /tmp/tmp.Ey23alPX8e/ara/ara.Ta
homa.exp0.box /tmp/tmp.Ey23alPX8e/ara/ara.Times_New_Roman.exp0.box /tmp/tmp.Ey23alPX8e/ara/ara.Traditional_Arabic.exp0.box
Extracting unicharset from /tmp/tmp.Ey23alPX8e/ara/ara.Amiri.exp0.box
Extracting unicharset from /tmp/tmp.Ey23alPX8e/ara/ara.Arial.exp0.box
Extracting unicharset from /tmp/tmp.Ey23alPX8e/ara/ara.Arial_Unicode_MS.exp0.box
Extracting unicharset from /tmp/tmp.Ey23alPX8e/ara/ara.Calibri.exp0.box
Extracting unicharset from /tmp/tmp.Ey23alPX8e/ara/ara.Courier_New.exp0.box
Extracting unicharset from /tmp/tmp.Ey23alPX8e/ara/ara.FreeSerif.exp0.box
Extracting unicharset from /tmp/tmp.Ey23alPX8e/ara/ara.Microsoft_Sans_Serif.exp0.box
Extracting unicharset from /tmp/tmp.Ey23alPX8e/ara/ara.Scheherazade.exp0.box
Extracting unicharset from /tmp/tmp.Ey23alPX8e/ara/ara.Tahoma.exp0.box
Extracting unicharset from /tmp/tmp.Ey23alPX8e/ara/ara.Times_New_Roman.exp0.box
Extracting unicharset from /tmp/tmp.Ey23alPX8e/ara/ara.Traditional_Arabic.exp0.box
Wrote unicharset file /tmp/tmp.Ey23alPX8e/ara//unicharset.
[Sat Jan 7 10:10:14 DST 2017] /usr/local/bin/set_unicharset_properties -U /tmp/tmp.Ey23alPX8e/ara/ara.unicharset -O /tmp/tmp.Ey23alPX8e/ara/ara.unicharset -X /tmp/tmp.Ey23
alPX8e/ara/ara.xheights --script_dir=../langdata
Loaded unicharset of size 381 from file /tmp/tmp.Ey23alPX8e/ara/ara.unicharset
Setting unichar properties
Mirror { of } is not in unicharset
Writing unicharset to file /tmp/tmp.Ey23alPX8e/ara/ara.unicharset

=== Phase D: Generating Dawg files ===
Generating word Dawg
[Sat Jan 7 10:10:14 DST 2017] /usr/local/bin/wordlist2dawg -r 1 ../langdata/ara/ara.wordlist /tmp/tmp.Ey23alPX8e/ara/ara.word-dawg /tmp/tmp.Ey23alPX8e/ara/ara.unicharset
Set reverse_policy to RRP_REVERSE_IF_HAS_RTL
Loading unicharset from '/tmp/tmp.Ey23alPX8e/ara/ara.unicharset'
Reading word list from '../langdata/ara/ara.wordlist'
Reducing Trie to SquishedDawg
Writing squished DAWG to '/tmp/tmp.Ey23alPX8e/ara/ara.word-dawg'
Generating frequent-word Dawg
[Sat Jan 7 10:10:20 DST 2017] /usr/local/bin/wordlist2dawg -r 1 /tmp/tmp.Ey23alPX8e/ara/ara.wordlist.clean.freq /tmp/tmp.Ey23alPX8e/ara/ara.freq-dawg /tmp/tmp.Ey23alPX8e/a
ra/ara.unicharset
Set reverse_policy to RRP_REVERSE_IF_HAS_RTL
Loading unicharset from '/tmp/tmp.Ey23alPX8e/ara/ara.unicharset'
Reading word list from '/tmp/tmp.Ey23alPX8e/ara/ara.wordlist.clean.freq'
Reducing Trie to SquishedDawg
Writing squished DAWG to '/tmp/tmp.Ey23alPX8e/ara/ara.freq-dawg'
[Sat Jan 7 10:10:20 DST 2017] /usr/local/bin/wordlist2dawg -r 2 ../langdata/ara/ara.punc /tmp/tmp.Ey23alPX8e/ara/ara.punc-dawg /tmp/tmp.Ey23alPX8e/ara/ara.unicharset
Set reverse_policy to RRP_FORCE_REVERSE
Loading unicharset from '/tmp/tmp.Ey23alPX8e/ara/ara.unicharset'
Reading word list from '../langdata/ara/ara.punc'
Reducing Trie to SquishedDawg
Writing squished DAWG to '/tmp/tmp.Ey23alPX8e/ara/ara.punc-dawg'
[Sat Jan 7 10:10:21 DST 2017] /usr/local/bin/wordlist2dawg -r 0 ../langdata/ara/ara.numbers /tmp/tmp.Ey23alPX8e/ara/ara.number-dawg /tmp/tmp.Ey23alPX8e/ara/ara.unicharset
Set reverse_policy to RRP_DO_NO_REVERSE
Loading unicharset from '/tmp/tmp.Ey23alPX8e/ara/ara.unicharset'
Reading word list from '../langdata/ara/ara.numbers'
Reducing Trie to SquishedDawg
Writing squished DAWG to '/tmp/tmp.Ey23alPX8e/ara/ara.number-dawg'
[Sat Jan 7 10:10:21 DST 2017] /usr/local/bin/wordlist2dawg -r 1 ../langdata/ara/ara.word.bigrams /tmp/tmp.Ey23alPX8e/ara/ara.bigram-dawg /tmp/tmp.Ey23alPX8e/ara/ara.unicha
rset
Set reverse_policy to RRP_REVERSE_IF_HAS_RTL
Loading unicharset from '/tmp/tmp.Ey23alPX8e/ara/ara.unicharset'
Reading word list from '../langdata/ara/ara.word.bigrams'
Reducing Trie to SquishedDawg
Writing squished DAWG to '/tmp/tmp.Ey23alPX8e/ara/ara.bigram-dawg'

=== Phase E: Generating lstmf files ===
Using TESSDATA_PREFIX=./tessdata
[Sat Jan 7 10:10:31 DST 2017] /usr/local/bin/tesseract /tmp/tmp.Ey23alPX8e/ara/ara.Amiri.exp0.tif /tmp/tmp.Ey23alPX8e/ara/ara.Amiri.exp0 lstm.train ../langdata/ara/ara.con
fig
[Sat Jan 7 10:10:32 DST 2017] /usr/local/bin/tesseract /tmp/tmp.Ey23alPX8e/ara/ara.Arial_Unicode_MS.exp0.tif /tmp/tmp.Ey23alPX8e/ara/ara.Arial_Unicode_MS.exp0 lstm.train .
./langdata/ara/ara.config
[Sat Jan 7 10:10:32 DST 2017] /usr/local/bin/tesseract /tmp/tmp.Ey23alPX8e/ara/ara.Arial.exp0.tif /tmp/tmp.Ey23alPX8e/ara/ara.Arial.exp0 lstm.train ../langdata/ara/ara.con
fig
[Sat Jan 7 10:10:32 DST 2017] /usr/local/bin/tesseract /tmp/tmp.Ey23alPX8e/ara/ara.Calibri.exp0.tif /tmp/tmp.Ey23alPX8e/ara/ara.Calibri.exp0 lstm.train ../langdata/ara/ara
.config
[Sat Jan 7 10:10:32 DST 2017] /usr/local/bin/tesseract /tmp/tmp.Ey23alPX8e/ara/ara.Courier_New.exp0.tif /tmp/tmp.Ey23alPX8e/ara/ara.Courier_New.exp0 lstm.train ../langdata
/ara/ara.config
[Sat Jan 7 10:10:32 DST 2017] /usr/local/bin/tesseract /tmp/tmp.Ey23alPX8e/ara/ara.FreeSerif.exp0.tif /tmp/tmp.Ey23alPX8e/ara/ara.FreeSerif.exp0 lstm.train ../langdata/ara
/ara.config
[Sat Jan 7 10:10:32 DST 2017] /usr/local/bin/tesseract /tmp/tmp.Ey23alPX8e/ara/ara.Microsoft_Sans_Serif.exp0.tif /tmp/tmp.Ey23alPX8e/ara/ara.Microsoft_Sans_Serif.exp0 lstm
.train ../langdata/ara/ara.config
[Sat Jan 7 10:10:32 DST 2017] /usr/local/bin/tesseract /tmp/tmp.Ey23alPX8e/ara/ara.Scheherazade.exp0.tif /tmp/tmp.Ey23alPX8e/ara/ara.Scheherazade.exp0 lstm.train ../langda
ta/ara/ara.config
Tesseract Open Source OCR Engine v4.00.00alpha-239-g3817aa3 with Leptonica
Page 1
Tesseract Open Source OCR Engine v4.00.00alpha-239-g3817aa3 with Leptonica
Page 1
Tesseract Open Source OCR Engine v4.00.00alpha-239-g3817aa3 with Leptonica
Page 1
Tesseract Open Source OCR Engine v4.00.00alpha-239-g3817aa3 with Leptonica
Page 1
Tesseract Open Source OCR Engine v4.00.00alpha-239-g3817aa3 with Leptonica
Page 1
Tesseract Open Source OCR Engine v4.00.00alpha-239-g3817aa3 with Leptonica
Page 1
Tesseract Open Source OCR Engine v4.00.00alpha-239-g3817aa3 with Leptonica
Page 1
Tesseract Open Source OCR Engine v4.00.00alpha-239-g3817aa3 with Leptonica
Page 1
Detected 1300 diacritics
Detected 675 diacritics
Detected 923 diacritics
Page 2
Page 2
No block overlapping textline: اونُمَآ نَيذِلَّا اوقُلَ اذَإِوَ نَومُلَعْيَ الَ نْكِلَوَ ءُاهَفَسُّلا مُهُ مْهُنَّإِ الَأَ ءُاهَفَسُّلا نَمَآ اكَمَ
No block overlapping textline: امَّلَفَ ارًانَ دَقَوْتَسْا يذِلَّا لِثَمَكَ مْهُلُثَمَ نَيدِتَهْمُ اونُاكَ امَوَ مْهُتُرَاجَتِ تْحَبِرَ امَفَ ىدَهُلْابِ
Page 2
Page 2
Page 2
Page 2
Page 2
Page 2
Loaded 39/39 pages (1-39) of document /tmp/tmp.Ey23alPX8e/ara/ara.Scheherazade.exp0.lstmf
Page 3
Loaded 55/55 pages (1-55) of document /tmp/tmp.Ey23alPX8e/ara/ara.Arial.exp0.lstmf
Loaded 53/53 pages (1-53) of document /tmp/tmp.Ey23alPX8e/ara/ara.FreeSerif.exp0.lstmf
Loaded 55/55 pages (1-55) of document /tmp/tmp.Ey23alPX8e/ara/ara.Microsoft_Sans_Serif.exp0.lstmf
Loaded 50/50 pages (1-50) of document /tmp/tmp.Ey23alPX8e/ara/ara.Arial_Unicode_MS.exp0.lstmf
Loaded 36/36 pages (1-36) of document /tmp/tmp.Ey23alPX8e/ara/ara.Amiri.exp0.lstmf
Loaded 59/59 pages (1-59) of document /tmp/tmp.Ey23alPX8e/ara/ara.Calibri.exp0.lstmf
Page 3
Page 3
Page 3
Loaded 83/83 pages (1-83) of document /tmp/tmp.Ey23alPX8e/ara/ara.Scheherazade.exp0.lstmf
Loaded 55/55 pages (1-55) of document /tmp/tmp.Ey23alPX8e/ara/ara.Courier_New.exp0.lstmf
Loaded 109/109 pages (1-109) of document /tmp/tmp.Ey23alPX8e/ara/ara.FreeSerif.exp0.lstmf
Loaded 100/100 pages (1-100) of document /tmp/tmp.Ey23alPX8e/ara/ara.Arial_Unicode_MS.exp0.lstmf
Loaded 79/79 pages (1-79) of document /tmp/tmp.Ey23alPX8e/ara/ara.Amiri.exp0.lstmf
[Sat Jan 7 10:10:59 DST 2017] /usr/local/bin/tesseract /tmp/tmp.Ey23alPX8e/ara/ara.Tahoma.exp0.tif /tmp/tmp.Ey23alPX8e/ara/ara.Tahoma.exp0 lstm.train ../langdata/ara/ara.c
onfig
[Sat Jan 7 10:10:59 DST 2017] /usr/local/bin/tesseract /tmp/tmp.Ey23alPX8e/ara/ara.Times_New_Roman.exp0.tif /tmp/tmp.Ey23alPX8e/ara/ara.Times_New_Roman.exp0 lstm.train ../
langdata/ara/ara.config
[Sat Jan 7 10:10:59 DST 2017] /usr/local/bin/tesseract /tmp/tmp.Ey23alPX8e/ara/ara.Traditional_Arabic.exp0.tif /tmp/tmp.Ey23alPX8e/ara/ara.Traditional_Arabic.exp0 lstm.tra
in ../langdata/ara/ara.config
Tesseract Open Source OCR Engine v4.00.00alpha-239-g3817aa3 with Leptonica
Page 1
Tesseract Open Source OCR Engine v4.00.00alpha-239-g3817aa3 with Leptonica
Page 1
Tesseract Open Source OCR Engine v4.00.00alpha-239-g3817aa3 with Leptonica
Page 1
Page 2
Page 2
Page 2
Loaded 43/43 pages (1-43) of document /tmp/tmp.Ey23alPX8e/ara/ara.Traditional_Arabic.exp0.lstmf
Page 3
Loaded 56/56 pages (1-56) of document /tmp/tmp.Ey23alPX8e/ara/ara.Times_New_Roman.exp0.lstmf
Loaded 53/53 pages (1-53) of document /tmp/tmp.Ey23alPX8e/ara/ara.Tahoma.exp0.lstmf
Page 3
Loaded 90/90 pages (1-90) of document /tmp/tmp.Ey23alPX8e/ara/ara.Traditional_Arabic.exp0.lstmf
Loaded 109/109 pages (1-109) of document /tmp/tmp.Ey23alPX8e/ara/ara.Tahoma.exp0.lstmf

=== Constructing LSTM training data ===
Creating new directory /home/shree/tesstutorial/aralayer
Copying ../langdata/ara/ara.config to /home/shree/tesstutorial/aralayer
Moving /tmp/tmp.Ey23alPX8e/ara/ara.unicharset to /home/shree/tesstutorial/aralayer
Moving /tmp/tmp.Ey23alPX8e/ara/ara.number-dawg to /home/shree/tesstutorial/aralayer/ara.lstm-number-dawg
Moving /tmp/tmp.Ey23alPX8e/ara/ara.punc-dawg to /home/shree/tesstutorial/aralayer/ara.lstm-punc-dawg
Moving /tmp/tmp.Ey23alPX8e/ara/ara.word-dawg to /home/shree/tesstutorial/aralayer/ara.lstm-word-dawg
Moving /tmp/tmp.Ey23alPX8e/ara/ara.Amiri.exp0.box to /home/shree/tesstutorial/aralayer
Moving /tmp/tmp.Ey23alPX8e/ara/ara.Arial.exp0.box to /home/shree/tesstutorial/aralayer
Moving /tmp/tmp.Ey23alPX8e/ara/ara.Arial_Unicode_MS.exp0.box to /home/shree/tesstutorial/aralayer
Moving /tmp/tmp.Ey23alPX8e/ara/ara.Calibri.exp0.box to /home/shree/tesstutorial/aralayer
Moving /tmp/tmp.Ey23alPX8e/ara/ara.Courier_New.exp0.box to /home/shree/tesstutorial/aralayer
Moving /tmp/tmp.Ey23alPX8e/ara/ara.FreeSerif.exp0.box to /home/shree/tesstutorial/aralayer
Moving /tmp/tmp.Ey23alPX8e/ara/ara.Microsoft_Sans_Serif.exp0.box to /home/shree/tesstutorial/aralayer
Moving /tmp/tmp.Ey23alPX8e/ara/ara.Scheherazade.exp0.box to /home/shree/tesstutorial/aralayer
Moving /tmp/tmp.Ey23alPX8e/ara/ara.Tahoma.exp0.box to /home/shree/tesstutorial/aralayer
Moving /tmp/tmp.Ey23alPX8e/ara/ara.Times_New_Roman.exp0.box to /home/shree/tesstutorial/aralayer
Moving /tmp/tmp.Ey23alPX8e/ara/ara.Traditional_Arabic.exp0.box to /home/shree/tesstutorial/aralayer
Moving /tmp/tmp.Ey23alPX8e/ara/ara.Amiri.exp0.tif to /home/shree/tesstutorial/aralayer
Moving /tmp/tmp.Ey23alPX8e/ara/ara.Arial.exp0.tif to /home/shree/tesstutorial/aralayer
Moving /tmp/tmp.Ey23alPX8e/ara/ara.Arial_Unicode_MS.exp0.tif to /home/shree/tesstutorial/aralayer
Moving /tmp/tmp.Ey23alPX8e/ara/ara.Calibri.exp0.tif to /home/shree/tesstutorial/aralayer
Moving /tmp/tmp.Ey23alPX8e/ara/ara.Courier_New.exp0.tif to /home/shree/tesstutorial/aralayer
Moving /tmp/tmp.Ey23alPX8e/ara/ara.FreeSerif.exp0.tif to /home/shree/tesstutorial/aralayer
Moving /tmp/tmp.Ey23alPX8e/ara/ara.Microsoft_Sans_Serif.exp0.tif to /home/shree/tesstutorial/aralayer
Moving /tmp/tmp.Ey23alPX8e/ara/ara.Scheherazade.exp0.tif to /home/shree/tesstutorial/aralayer
Moving /tmp/tmp.Ey23alPX8e/ara/ara.Tahoma.exp0.tif to /home/shree/tesstutorial/aralayer
Moving /tmp/tmp.Ey23alPX8e/ara/ara.Times_New_Roman.exp0.tif to /home/shree/tesstutorial/aralayer
Moving /tmp/tmp.Ey23alPX8e/ara/ara.Traditional_Arabic.exp0.tif to /home/shree/tesstutorial/aralayer
Moving /tmp/tmp.Ey23alPX8e/ara/ara.Amiri.exp0.lstmf to /home/shree/tesstutorial/aralayer
Moving /tmp/tmp.Ey23alPX8e/ara/ara.Arial.exp0.lstmf to /home/shree/tesstutorial/aralayer
Moving /tmp/tmp.Ey23alPX8e/ara/ara.Arial_Unicode_MS.exp0.lstmf to /home/shree/tesstutorial/aralayer
Moving /tmp/tmp.Ey23alPX8e/ara/ara.Calibri.exp0.lstmf to /home/shree/tesstutorial/aralayer
Moving /tmp/tmp.Ey23alPX8e/ara/ara.Courier_New.exp0.lstmf to /home/shree/tesstutorial/aralayer
Moving /tmp/tmp.Ey23alPX8e/ara/ara.FreeSerif.exp0.lstmf to /home/shree/tesstutorial/aralayer
Moving /tmp/tmp.Ey23alPX8e/ara/ara.Microsoft_Sans_Serif.exp0.lstmf to /home/shree/tesstutorial/aralayer
Moving /tmp/tmp.Ey23alPX8e/ara/ara.Scheherazade.exp0.lstmf to /home/shree/tesstutorial/aralayer
Moving /tmp/tmp.Ey23alPX8e/ara/ara.Tahoma.exp0.lstmf to /home/shree/tesstutorial/aralayer
Moving /tmp/tmp.Ey23alPX8e/ara/ara.Times_New_Roman.exp0.lstmf to /home/shree/tesstutorial/aralayer
Moving /tmp/tmp.Ey23alPX8e/ara/ara.Traditional_Arabic.exp0.lstmf to /home/shree/tesstutorial/aralayer

Completed training for language 'ara'

Shreeshrii · 2017-01-07T05:02:06Z

$ mkdir -p ~/tesstutorial/aralayer_from_ara
$ combine_tessdata -e ../tessdata/ara.traineddata \
>   ~/tesstutorial/aralayer_from_ara/ara.lstm
Extracting tessdata components from ../tessdata/ara.traineddata
Wrote /home/shree/tesstutorial/aralayer_from_ara/ara.lstm
$
$  lstmtraining -U ~/tesstutorial/aralayer/ara.unicharset \
>   --script_dir ../langdata  --debug_interval 0 \
>   --continue_from ~/tesstutorial/aralayer_from_ara/ara.lstm \
>   --append_index 5 --net_spec '[Lfx256 O1c105]' \
>   --learning_rate 10e-5 \
>   --net_mode 192 \
>   --perfect_sample_delay 19 \
>   --model_output ~/tesstutorial/aralayer_from_ara/aralayer \
>   --train_listfile ~/tesstutorial/aralayer/ara.training_files.txt \
>   --eval_listfile ~/tesstutorial/ara/ara.training_files.txt \
>   --max_iterations 50000
Loaded file /home/shree/tesstutorial/aralayer_from_ara/ara.lstm, unpacking...
Warning: LSTMTrainer deserialized an LSTMRecognizer!
Continuing from /home/shree/tesstutorial/aralayer_from_ara/ara.lstm
Mirror { of } is not in unicharset
Appending a new network to an old one!!Setting unichar properties
Setting properties for script Common
Setting properties for script Latin
Setting properties for script Arabic
Warning: given outputs 105 not equal to unicharset of 106.
Num outputs,weights in serial:
  Lfx256:256, 394240
  Fc106:106, 27242
Total weights = 421482
Built network:[1,0,0,1[C5,5Ft16]Mp3,3Lfys64Lfx128Lrx128Lfx256Fc106] from request [Lfx256 O1c105]
Training parameters:
  Debug interval = 0, weights = 0.1, learning rate = 0.0001, momentum=0.9
Loaded 111/111 pages (1-111) of document /home/shree/tesstutorial/aralayer/ara.Amiri.exp0.lstmf
Loaded 113/113 pages (1-113) of document /home/shree/tesstutorial/aralayer/ara.Arial.exp0.lstmf
Loaded 113/113 pages (1-113) of document /home/shree/tesstutorial/aralayer/ara.Microsoft_Sans_Serif.exp0.lstmf
Loaded 113/113 pages (1-113) of document /home/shree/tesstutorial/aralayer/ara.Scheherazade.exp0.lstmf
Loaded 113/113 pages (1-113) of document /home/shree/tesstutorial/aralayer/ara.Tahoma.exp0.lstmf
Loaded 113/113 pages (1-113) of document /home/shree/tesstutorial/aralayer/ara.Courier_New.exp0.lstmf
Loaded 113/113 pages (1-113) of document /home/shree/tesstutorial/aralayer/ara.Arial_Unicode_MS.exp0.lstmf
Loaded 229/229 pages (1-229) of document /home/shree/tesstutorial/ara/ara.Amiri.exp0.lstmf
Loaded 113/113 pages (1-113) of document /home/shree/tesstutorial/aralayer/ara.Calibri.exp0.lstmf
Loaded 113/113 pages (1-113) of document /home/shree/tesstutorial/aralayer/ara.FreeSerif.exp0.lstmf
Loaded 232/232 pages (1-232) of document /home/shree/tesstutorial/ara/ara.Arial.exp0.lstmf
Loaded 113/113 pages (1-113) of document /home/shree/tesstutorial/aralayer/ara.Times_New_Roman.exp0.lstmf
Loaded 113/113 pages (1-113) of document /home/shree/tesstutorial/aralayer/ara.Traditional_Arabic.exp0.lstmf
At iteration 100/100/100, Mean rms=6.949%, delta=69.759%, char train=127.235%, word train=100%, skip ratio=0%,  New worst char error = 127.235 wrote checkpoint.

At iteration 200/200/200, Mean rms=6.558%, delta=62.072%, char train=116.738%, word train=100%, skip ratio=0%,  New worst char error = 116.738 wrote checkpoint.

Encoding of string failed! Failure bytes: ffffffd9 ffffff92 20 ffffffd9 ffffff8d 20 ffffffd9 ffffff90 20 ffffffd9 ffffff8f ffffffd9 ffffff8c 20 ffffffd9 ffffff8b 20 ffffff
d9 ffffff8e 20 20 ffffffd9 ffffff92 20 ffffffd9 ffffff8d 20 ffffffd9 ffffff90 20 ffffffd9 ffffff8f ffffffd9 ffffff8c 20 ffffffd9 ffffff8b 20 ffffffd9 ffffff8e 20 ffffffd9
ffffff91 20 ffffffd8 ffffffa8 ffffffd9 ffffff91 ffffffd9 ffffff90 ffffffd8 ffffffb1 ffffffd9 ffffff8e 20 ffffffd9 ffffff87 ffffffd9 ffffff90 ffffffd9 ffffff84 ffffffd9 fff
fff91 ffffffd9 ffffff8e ffffffd9 ffffff84 ffffffd9 ffffff90 20 ffffffd8 ffffffaf ffffffd9 ffffff8f ffffffd9 ffffff85 ffffffd9 ffffff92 ffffffd8 ffffffad ffffffd9 ffffff8e
ffffffd9 ffffff84 ffffffd9 ffffff92 ffffffd8 ffffffa7 20 ffffffd9 ffffff85 ffffffd9 ffffff90 ffffffd9 ffffff8a ffffffd8 ffffffad ffffffd9 ffffff90 ffffffd8 ffffffb1 ffffff
d9 ffffff91 ffffffd9 ffffff8e ffffffd9 ffffff84 ffffffd8 ffffffa7 20 ffffffd9 ffffff86 ffffffd9 ffffff90 ffffffd9 ffffff85 ffffffd9 ffffff8e ffffffd8 ffffffad ffffffd9 fff
fff92 ffffffd8 ffffffb1 ffffffd9 ffffff91 ffffffd9 ffffff8e ffffffd9 ffffff84 ffffffd8 ffffffa7 20 ffffffd9 ffffff87 ffffffd9 ffffff90 ffffffd9 ffffff84 ffffffd9 ffffff91
ffffffd9 ffffff8e ffffffd9 ffffff84 ffffffd8 ffffffa7 20 ffffffd9 ffffff85 ffffffd9 ffffff90 ffffffd8 ffffffb3 ffffffd9 ffffff92 ffffffd8 ffffffa8 ffffffd9 ffffff90
Can't encode transcription: / بَـَتـكَ ةحتف تاكرحلا  ْ ٍ ِ ٌُ ً َ  ْ ٍ ِ ٌُ ً َ ّ بِّرَ هِلَّلِ دُمْحَلْا مِيحِرَّلا نِمَحْرَّلا هِلَّلا مِسْبِ
At iteration 300/300/301, Mean rms=6.463%, delta=59.695%, char train=111.691%, word train=100%, skip ratio=0.333%,  New worst char error = 111.691 wrote checkpoint.

At iteration 400/400/401, Mean rms=6.363%, delta=57.356%, char train=106.695%, word train=100%, skip ratio=0.25%,  New worst char error = 106.695 wrote checkpoint.

lstmtraining: ../ccutil/genericvector.h:696: T& GenericVector<T>::operator[](int) const [with T = char]: Assertion `index >= 0 && index < size_used_' failed.
Aborted (core dumped)
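
The "Encoding of string failed!" message above is consistent with the transcription containing something the unicharset cannot represent, though that is an assumption and not something the log confirms. A minimal Python sketch for a rough check follows. The function names are hypothetical, and it assumes the plain-text unicharset layout in which the first line holds the entry count and each following line starts with the unichar, with `NULL` standing for space. It compares single code points only, so multi-code-point unichars are not handled.

```python
from typing import Iterable, List, Set


def parse_unicharset(lines: Iterable[str]) -> Set[str]:
    """Collect the first field of every entry line (assumed file layout)."""
    it = iter(lines)
    count = int(next(it).split()[0])  # first line: number of entries
    unichars: Set[str] = set()
    for _, line in zip(range(count), it):
        fields = line.split()
        if fields:
            # Assumption: the entry named NULL stands for the space character.
            unichars.add(" " if fields[0] == "NULL" else fields[0])
    return unichars


def uncovered_chars(text: str, unichars: Set[str]) -> List[str]:
    """Return the distinct non-space code points of text absent from unichars."""
    return sorted({ch for ch in text if not ch.isspace() and ch not in unichars})


# Toy data for illustration only, not a real unicharset.
toy_unicharset = ["3", "NULL 0", "a 3", "b 3"]
print(uncovered_chars("abc a", parse_unicharset(toy_unicharset)))
```

With a real file, the lines could come from `open(path, encoding="utf-8")`, and the text from the "Can't encode transcription" line.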

Shreeshrii · 2017-01-08T09:12:40Z

This seems to happen when an --eval_listfile is given; training seems to work when it is omitted. See below:

shree@ALL-IN-1-TOUCH:/mnt/c/Users/User/shree/tesseract-ocr$  lstmtraining -U ~/tesstutorial/aralayer/ara.unicharset \
>   --script_dir ../langdata  --debug_interval 0 \
>   --continue_from ~/tesstutorial/aralayer_from_ara/ara.lstm \
>   --append_index 5 --net_spec '[Lfx256 O1c105]' \
>   --learning_rate 10e-5 \
>   --net_mode 192 \
>   --perfect_sample_delay 19 \
>   --model_output ~/tesstutorial/aralayer_from_ara/aralayer \
>    --eval_listfile ~/tesstutorial/ara/ara.training_files.txt  \
>   --train_listfile ~/tesstutorial/aralayer/ara.training_files.txt \
>   --max_iterations 50000
Loaded file /home/shree/tesstutorial/aralayer_from_ara/aralayer_checkpoint, unpacking...
Successfully restored trainer from /home/shree/tesstutorial/aralayer_from_ara/aralayer_checkpoint
Loaded 111/111 pages (1-111) of document /home/shree/tesstutorial/aralayer/ara.Amiri.exp0.lstmf
Loaded 113/113 pages (1-113) of document /home/shree/tesstutorial/aralayer/ara.Arial.exp0.lstmf
Loaded 113/113 pages (1-113) of document /home/shree/tesstutorial/aralayer/ara.Calibri.exp0.lstmf
Loaded 113/113 pages (1-113) of document /home/shree/tesstutorial/aralayer/ara.Courier_New.exp0.lstmf
Loaded 113/113 pages (1-113) of document /home/shree/tesstutorial/aralayer/ara.Arial_Unicode_MS.exp0.lstmf
Loaded 113/113 pages (1-113) of document /home/shree/tesstutorial/aralayer/ara.FreeSerif.exp0.lstmf
Loaded 113/113 pages (1-113) of document /home/shree/tesstutorial/aralayer/ara.Microsoft_Sans_Serif.exp0.lstmf
Loaded 113/113 pages (1-113) of document /home/shree/tesstutorial/aralayer/ara.Tahoma.exp0.lstmf
Loaded 113/113 pages (1-113) of document /home/shree/tesstutorial/aralayer/ara.Scheherazade.exp0.lstmf
Loaded 229/229 pages (1-229) of document /home/shree/tesstutorial/ara/ara.Amiri.exp0.lstmf
Loaded 31/113 pages (83-113) of document /home/shree/tesstutorial/aralayer/ara.Traditional_Arabic.exp0.lstmf
Loaded 232/232 pages (1-232) of document /home/shree/tesstutorial/ara/ara.Arial.exp0.lstmf
Loaded 31/113 pages (83-113) of document /home/shree/tesstutorial/aralayer/ara.Times_New_Roman.exp0.lstmf
At iteration 16533/33300/33327, Mean rms=0.79%, delta=0.326%, char train=2.38%, word train=11.082%, skip ratio=0.1%,  New worst char error = 2.38 wrote checkpoint.

lstmtraining: ../ccutil/genericvector.h:696: T& GenericVector<T>::operator[](int) const [with T = char]: Assertion `index >= 0 && index < size_used_' failed.
2 Percent improvement time=7141, best error was 4.338 @ 9418
Aborted (core dumped)

Without --eval_listfile, the process continues:

 shree@ALL-IN-1-TOUCH:/mnt/c/Users/User/shree/tesseract-ocr$  lstmtraining -U ~/tesstutorial/aralayer/ara.unicharset \
>   --script_dir ../langdata  --debug_interval 0 \
>   --continue_from ~/tesstutorial/aralayer_from_ara/ara.lstm \
>   --append_index 5 --net_spec '[Lfx256 O1c105]' \
>   --learning_rate 10e-5 \
>   --net_mode 192 \
>   --perfect_sample_delay 19 \
>   --model_output ~/tesstutorial/aralayer_from_ara/aralayer \
>    --train_listfile ~/tesstutorial/aralayer/ara.training_files.txt \
>   --max_iterations 50000
Loaded file /home/shree/tesstutorial/aralayer_from_ara/aralayer_checkpoint, unpacking...
Successfully restored trainer from /home/shree/tesstutorial/aralayer_from_ara/aralayer_checkpoint
Loaded 111/111 pages (1-111) of document /home/shree/tesstutorial/aralayer/ara.Amiri.exp0.lstmf
Loaded 113/113 pages (1-113) of document /home/shree/tesstutorial/aralayer/ara.Arial.exp0.lstmf
Loaded 113/113 pages (1-113) of document /home/shree/tesstutorial/aralayer/ara.Calibri.exp0.lstmf
Loaded 113/113 pages (1-113) of document /home/shree/tesstutorial/aralayer/ara.Microsoft_Sans_Serif.exp0.lstmf
Loaded 113/113 pages (1-113) of document /home/shree/tesstutorial/aralayer/ara.Arial_Unicode_MS.exp0.lstmf
Loaded 113/113 pages (1-113) of document /home/shree/tesstutorial/aralayer/ara.Scheherazade.exp0.lstmf
Loaded 113/113 pages (1-113) of document /home/shree/tesstutorial/aralayer/ara.FreeSerif.exp0.lstmf
Loaded 113/113 pages (1-113) of document /home/shree/tesstutorial/aralayer/ara.Tahoma.exp0.lstmf
Loaded 113/113 pages (1-113) of document /home/shree/tesstutorial/aralayer/ara.Courier_New.exp0.lstmf
Loaded 22/113 pages (92-113) of document /home/shree/tesstutorial/aralayer/ara.Traditional_Arabic.exp0.lstmf
Loaded 22/113 pages (92-113) of document /home/shree/tesstutorial/aralayer/ara.Times_New_Roman.exp0.lstmf
2 Percent improvement time=7141, best error was 4.338 @ 9418
At iteration 16559/33400/33427, Mean rms=0.776%, delta=0.33%, char train=2.313%, word train=10.483%, skip ratio=0.1%,  New best char error = 2.313 wrote best model:/home/s
hree/tesstutorial/aralayer_from_ara/aralayer2.313_16559.lstm wrote checkpoint.

2 Percent improvement time=7177, best error was 4.338 @ 9418
At iteration 16595/33500/33527, Mean rms=0.778%, delta=0.334%, char train=2.312%, word train=10.634%, skip ratio=0.1%,  New best char error = 2.312 wrote checkpoint.

Loaded 113/113 pages (1-113) of document /home/shree/tesstutorial/aralayer/ara.Times_New_Roman.exp0.lstmf
Loaded 113/113 pages (1-113) of document /home/shree/tesstutorial/aralayer/ara.Traditional_Arabic.exp0.lstmf
At iteration 16627/33600/33627, Mean rms=0.788%, delta=0.344%, char train=2.473%, word train=11.073%, skip ratio=0%,  New worst char error = 2.473 wrote checkpoint.

At iteration 16664/33700/33727, Mean rms=0.79%, delta=0.356%, char train=2.519%, word train=11.23%, skip ratio=0%,  New worst char error = 2.519 wrote checkpoint.

Encoding of string failed! Failure bytes: ffffffd9 ffffff92 20 ffffffd9 ffffff8d 20 ffffffd9 ffffff90 20 ffffffd9 ffffff8f ffffffd9 ffffff8c 20 ffffffd9 ffffff8b 20 ffffff
d9 ffffff8e 20 20 ffffffd9 ffffff92 20 ffffffd9 ffffff8d 20 ffffffd9 ffffff90 20 ffffffd9 ffffff8f ffffffd9 ffffff8c 20 ffffffd9 ffffff8b 20 ffffffd9 ffffff8e 20 ffffffd9
ffffff91 20 ffffffd8 ffffffa8 ffffffd9 ffffff91 ffffffd9 ffffff90 ffffffd8 ffffffb1 ffffffd9 ffffff8e 20 ffffffd9 ffffff87 ffffffd9 ffffff90 ffffffd9 ffffff84 ffffffd9 fff
fff91 ffffffd9 ffffff8e ffffffd9 ffffff84 ffffffd9 ffffff90 20 ffffffd8 ffffffaf ffffffd9 ffffff8f ffffffd9 ffffff85 ffffffd9 ffffff92 ffffffd8 ffffffad ffffffd9 ffffff8e
ffffffd9 ffffff84 ffffffd9 ffffff92 ffffffd8 ffffffa7 20 ffffffd9 ffffff85 ffffffd9 ffffff90 ffffffd9 ffffff8a ffffffd8 ffffffad ffffffd9 ffffff90 ffffffd8 ffffffb1 ffffff
d9 ffffff91 ffffffd9 ffffff8e ffffffd9 ffffff84 ffffffd8 ffffffa7 20 ffffffd9 ffffff86 ffffffd9 ffffff90 ffffffd9 ffffff85 ffffffd9 ffffff8e ffffffd8 ffffffad ffffffd9 fff
fff92 ffffffd8 ffffffb1 ffffffd9 ffffff91 ffffffd9 ffffff8e ffffffd9 ffffff84 ffffffd8 ffffffa7 20 ffffffd9 ffffff87 ffffffd9 ffffff90 ffffffd9 ffffff84 ffffffd9 ffffff91
ffffffd9 ffffff8e ffffffd9 ffffff84 ffffffd8 ffffffa7 20 ffffffd9 ffffff85 ffffffd9 ffffff90 ffffffd8 ffffffb3 ffffffd9 ffffff92 ffffffd8 ffffffa8 ffffffd9 ffffff90
Can't encode transcription: / بَـَتـكَ ةحتف تاكرحلا  ْ ٍ ِ ٌُ ً َ  ْ ٍ ِ ٌُ ً َ ّ بِّرَ هِلَّلِ دُمْحَلْا مِيحِرَّلا نِمَحْرَّلا هِلَّلا مِسْبِ
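
For anyone reading the dump above: the failure bytes look like UTF-8 bytes that were printed as sign-extended values, which would explain the `ffffff` prefix on every byte at or above 0x80. The following minimal Python sketch, under that assumption, masks each token back to one byte and decodes the result. The helper name `decode_failure_bytes` is hypothetical, and the sample is only the first few tokens of the log. The console wrapped the dump in the middle of some tokens, so those must be rejoined before decoding the full string.

```python
import unicodedata

def decode_failure_bytes(dump: str) -> str:
    """Decode a dump of sign-extended hex tokens (e.g. 'ffffffd9') as UTF-8."""
    # Keep only the low 8 bits of each token, dropping the sign extension.
    raw = bytes(int(token, 16) & 0xFF for token in dump.split())
    return raw.decode("utf-8")

# First tokens of the "Failure bytes" line in the log above.
sample = "ffffffd9 ffffff92 20 ffffffd9 ffffff8d 20 ffffffd9 ffffff90"
text = decode_failure_bytes(sample)

# List the code points so that diacritics are easy to tell apart.
for ch in text:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch, 'UNKNOWN')}")
```

On this sample the code points are standalone Arabic diacritics separated by spaces, which matches the diacritics sample that was prepended to the training text.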

ghost · 2017-01-09T19:46:24Z

@Shreeshrii I noticed that the Arabic text in your log is reversed.
Your log shows: مِيحِرَّلا نِمَحْرَّلا هِلَّلا مِسْبِ
It should be: بِسْمِ اللَّهِ الرَّحْمَٰنِ الرَّحِيمِ

An equivalent mistake in English would look like this:
Correct: Peace Be Upon You
Wrong: uoY nopU eB ecaeP

Arabic is read and written from right to left (RTL).
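
To make the difference between correctly ordered text and reversed text concrete, here is a minimal Python sketch. It assumes the reversal happened code point by code point, which this thread does not confirm, and the helper name `words_starting_with_mark` is hypothetical. In correctly ordered Arabic text a diacritic follows its base letter, so a word that begins with a combining mark hints that the string was reversed. Standalone diacritics, such as the ones deliberately included in this training text, are flagged as well, so this is a hint and not a proof.

```python
import unicodedata

def words_starting_with_mark(text):
    """Return the words whose first code point is a combining mark."""
    return [word for word in text.split()
            if unicodedata.category(word[0]).startswith("M")]

# beh, kasra, seen, sukun, meem, kasra in stored (logical) order.
logical = "\u0628\u0650\u0633\u0652\u0645\u0650"
# Naive code point reversal, like the English example above.
flipped = logical[::-1]

print(words_starting_with_mark(logical))       # no word starts with a mark
print(len(words_starting_with_mark(flipped)))  # the reversed word is flagged
```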

Shreeshrii · 2017-01-10T03:29:16Z

Thanks for pointing it out. I neither know Arabic nor am familiar with bidi. Is it just one line that is reversed, or all of them? I am using the training text from langdata, prefixed with a sample with diacritics provided by @bmwmy along with a few words copied from Wikipedia. I copied the error message from the console. I could try saving the log to a file to see whether it is correct there, since it is possible that my locale under bash on Windows 10 does not support Arabic.


bmwmy · 2017-01-10T09:25:11Z

@Shreeshrii could you post some of the generated image files (tif) so we can check whether the Arabic text is rendered correctly?

Shreeshrii · 2017-01-10T09:41:58Z

Please see the attached zip file; it has the training text, a box/tiff pair and the unicharset. ShreeDevi


ghost · 2017-01-10T12:00:05Z

@Shreeshrii

All the Arabic lines are reversed.
I have checked the samples from the Arabic language feature request #552.
The "Original_Text.txt" was encoded as UTF-8 with BOM and everything seems okay, except that the words are not in their correct order.
Also, just to be sure, set the system locale to Arabic (Control Panel / Clock, Language, and Region / Region / Administrative / Change system locale / Arabic "Saudi Arabia").
Please attach the tif/box files that you are using.
I am not seeing any zip files here.

Shreeshrii · 2017-01-10T12:13:23Z

I had attached the file via email. Maybe GitHub does not allow that. I will upload it on the forum.


Shreeshrii · 2017-01-10T12:48:54Z

ara.TRAINING.zip

Uploaded zip file with training data for a group of fonts which have coverage for Arabic on Windows.

It is possible that the tesstrain.sh process is dropping diacritics as noise. I am trying to change config variables to see if I can get some improvement.

Shreeshrii · 2017-01-10T13:18:14Z

Attached is a log file which shows verbose output for every iteration of training, taken from the middle of the current training session.

traininglog-mid.txt

ghost · 2017-01-10T21:18:18Z

@Shreeshrii
What font size are you using for the "Traditional Arabic"?

Initial Observation:

Letter extenders
Never set the letter extender (Shift+j or Shift+ت) as a character on its own. It is not a letter; it is used to stretch words. Based on your log and my experience, treating it as a letter degrades the recognition rate and causes a large number of errors.
If the letter extender ( ـ ) is treated as a character, the recognition engine will recognize many stretches and extensions within words as a ( ـ ) letter extender.

When I used them in my training process, I merged the letter extender with the Arabic letter into one single box and set that Arabic letter as the character of the box. Basically, I was trying to train the engine to recognize that Arabic letter in its multiple positions; as you know, Arabic letters have multiple forms depending on their position in the word (beginning, middle, end, isolated).
Example:
( كـ ) is not ( ك + ـ ) in the box file, it should be ( ك )
also ( ـكـ ) or ( ـك ) they are a single character ( ك ) in different positions, this is important in the box file.

Which also means that ( كَـ ) is not ( ك + ـَ ), it is ( كَ )
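
One way to follow this advice before rendering is to remove the letter extender from the training text, so that it never appears as a character of its own. The sketch below is a minimal Python illustration and not part of tesstrain.sh; the helper names are hypothetical. The letter extender is the Unicode character U+0640 (ARABIC TATWEEL).

```python
TATWEEL = "\u0640"  # ARABIC TATWEEL, the letter extender discussed above

def count_tatweel(text):
    """Count how many letter extenders the text contains."""
    return text.count(TATWEEL)

def strip_tatweel(text):
    """Remove every letter extender, leaving letters and diacritics untouched."""
    return text.replace(TATWEEL, "")

# kaf + tatweel, and kaf + fatha + tatweel, as in the examples above.
samples = ["\u0643\u0640", "\u0643\u064e\u0640"]
for sample in samples:
    cleaned = strip_tatweel(sample)
    print(count_tatweel(sample), [f"U+{ord(c):04X}" for c in cleaned])
```

Whether stripping is appropriate depends on the target documents: text that uses the extender heavily would then no longer match its transcription exactly.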

ghost · 2017-01-10T22:01:24Z

@theraysmith @amitdo @Shreeshrii

Box file disorder
I also observe that the Traditional Arabic box file is ordered LTR (left to right), which is reversed; Arabic is written RTL (right to left). That means the first box should start from the right side.
(Have a look at the attached Arabic example tif/box from Tesseract 3.05.)
Arabic example 1.zip
Example 1, correct box order:

Tesseract 4.0 LSTM puts the spaces between the words into boxes, as you know.
A problem therefore arises from the box file disorder: because the boxes are mistakenly ordered LTR (left to right) for Arabic, the sequence jumps from the end of the first line to the end of the last letter of the line after it.
See the attached image.
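
A quick way to check the order of a box file is to look at how the left coordinates change from one box to the next. The sketch below is a minimal Python illustration with a hypothetical helper name. It assumes the usual box line layout `<symbol> <left> <bottom> <right> <top> <page>` and that all lines passed in belong to a single text line.

```python
def box_direction(box_lines):
    """Guess whether the boxes of one text line are listed LTR or RTL."""
    lefts = []
    for line in box_lines:
        # Split from the right so that a space symbol still parses.
        parts = line.rstrip("\n").rsplit(" ", 5)
        if len(parts) != 6:
            continue  # skip lines that do not look like box entries
        lefts.append(int(parts[1]))
    steps = [b - a for a, b in zip(lefts, lefts[1:])]
    rising = sum(1 for s in steps if s > 0)
    falling = sum(1 for s in steps if s < 0)
    if rising > falling:
        return "LTR"
    if falling > rising:
        return "RTL"
    return "unknown"

# Made-up coordinates, only to show the two cases.
ltr_boxes = ["a 10 5 20 30 0", "b 22 5 30 30 0", "c 32 5 40 30 0"]
print(box_direction(ltr_boxes))
print(box_direction(list(reversed(ltr_boxes))))
```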

ghost · 2017-01-10T22:18:17Z

Wrong encoding & Arabic language support by the text editor
Arabic text files should be encoded in UTF-8 or another encoding that supports Arabic.
Most text editors, including Notepad++, don't get it right the first time; you must change the system locale to Arabic so that Windows Notepad handles it sensibly.

(Control Panel / Clock, Language, and Region / Region / Administrative / Change system locale / Arabic "Saudi Arabia")

Also, when using a txt file, the words are not in their correct order. In Google Chrome the words are correct, but after copying and pasting them into a text file the order changes, which is a weird issue.
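
Instead of relying on what an editor displays, the encoding of a text file can be checked directly from its bytes. The sketch below is a minimal Python illustration with a hypothetical helper name. It reports whether the data starts with a UTF-8 byte order mark (BOM) and whether it decodes as valid UTF-8.

```python
import codecs

def inspect_encoding(data):
    """Report whether raw bytes start with a UTF-8 BOM and decode as UTF-8."""
    has_bom = data.startswith(codecs.BOM_UTF8)
    try:
        data.decode("utf-8")
        valid = True
    except UnicodeDecodeError:
        valid = False
    return {"has_bom": has_bom, "valid_utf8": valid}

# Beh with kasra, once with a BOM and once in a legacy Arabic code page.
with_bom = codecs.BOM_UTF8 + "\u0628\u0650".encode("utf-8")
legacy = "\u0628".encode("cp1256")

print(inspect_encoding(with_bom))
print(inspect_encoding(legacy))
```

For a real file, the bytes can be read with `open(path, "rb").read()` and passed to the same function.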

ghost · 2017-01-10T22:28:23Z

@theraysmith @amitdo @Shreeshrii

The Reversed Text Issue!
This is the first, last and most important problem, and it persists in all Tesseract versions, including but not limited to Tesseract 4.0.

Shreeshrii · 2017-01-11T04:10:19Z

@Christophered

I had experimented with 32 ptsize for Traditional Arabic in one run. I am using the default, which is 12 pt, I think.
Don't/ Never set at all the letter extenders (Shift+j or Shift+ت) as a sole letter,

It is possible that I copied some text from wikipedia which is incorrect. Please look at the training_text file and let me know which lines should be deleted.

i was merging the letter extender with the Arabic letter into one single box, and putting that Arabic letters as the character of the box, basically, i was trying to train the engine to recognize that Arabic letter in it's multiple positions, as you know the Arabic letters have multiple forms based which is based on it's position in the word ( beginning, middle, ending, isolated )

Please share your training text and I can give it a try.

Shreeshrii · 2017-01-11T04:54:36Z

Original problem, core dumped -
This seems to be happening when an --eval_listfile is given.
Related issues:
#644 (eval not run)
#561 (core dumped)

Arabic related issues:
See new issue filed by @Christophered
#648 (arabic reversal)

Closing this issue.

amitdo · 2017-01-11T11:31:22Z

Wrong encoding & Arabic language support by the text editor
The Arabic language txt should be encoded in UTF-8 or any other that support it.

The langdata text files for all languages are saved using UTF-8 encoding.

imohammadhossein · 2019-06-24T07:11:46Z

I am trying to train or fine-tune Tesseract on my own dataset for the Farsi language. Can anyone please help me with this?

Shreeshrii closed this as completed Jan 11, 2017

This was referenced Jan 11, 2017

Box File disorder, Arabic Language #648

Open

Q&A: Training Wiki Updates and Request for Info #659

Open

amitdo added the RTL label Mar 18, 2021

amitdo added the encoding failed label Sep 30, 2022
