LSTM: Training - Arabic - Add Top layer - Aborted (core dumped) #642

Closed
Shreeshrii opened this issue Jan 7, 2017 · 19 comments

@Shreeshrii
Collaborator

The "Add Top layer" LSTM training worked for languages with a Latin-based unicharset (eng, nor), but it fails for Arabic.

Below are the logs, first for creating the lstmf files and then for the training run.

@Shreeshrii
Collaborator Author

$ training/tesstrain.sh --fonts_dir /home/shree/.fonts --lang ara    --linedata_only --noextract_font_properties
   --langdata_dir ../langdata --tessdata_dir ./tessdata   --output_dir ~/tesstutorial/aralayer

=== Starting training for language 'ara'
[Sat Jan 7 10:09:33 DST 2017] /usr/local/bin/text2image --fonts_dir=/home/shree/.fonts --font=Arial Unicode MS --outputbase=/tmp/font_tmp.0Tqbe3jIFz/sample_text.txt --text
=/tmp/font_tmp.0Tqbe3jIFz/sample_text.txt --fontconfig_tmpdir=/tmp/font_tmp.0Tqbe3jIFz
Rendered page 0 to file /tmp/font_tmp.0Tqbe3jIFz/sample_text.txt.tif

=== Phase I: Generating training images ===
Rendering using Arial Unicode MS
Rendering using Amiri
Rendering using Arial
Rendering using Scheherazade
Rendering using Calibri
Rendering using Tahoma
Rendering using FreeSerif
Rendering using Microsoft Sans Serif
[Sat Jan 7 10:09:43 DST 2017] /usr/local/bin/text2image --fontconfig_tmpdir=/tmp/font_tmp.0Tqbe3jIFz --fonts_dir=/home/shree/.fonts --strip_unrenderable_words --leading=32
 --char_spacing=0.0 --exposure=0 --outputbase=/tmp/tmp.Ey23alPX8e/ara/ara.Arial_Unicode_MS.exp0 --font=Arial Unicode MS --text=../langdata/ara/ara.training_text
[Sat Jan 7 10:09:43 DST 2017] /usr/local/bin/text2image --fontconfig_tmpdir=/tmp/font_tmp.0Tqbe3jIFz --fonts_dir=/home/shree/.fonts --strip_unrenderable_words --leading=32
 --char_spacing=0.0 --exposure=0 --outputbase=/tmp/tmp.Ey23alPX8e/ara/ara.Amiri.exp0 --font=Amiri --text=../langdata/ara/ara.training_text
[Sat Jan 7 10:09:43 DST 2017] /usr/local/bin/text2image --fontconfig_tmpdir=/tmp/font_tmp.0Tqbe3jIFz --fonts_dir=/home/shree/.fonts --strip_unrenderable_words --leading=32
 --char_spacing=0.0 --exposure=0 --outputbase=/tmp/tmp.Ey23alPX8e/ara/ara.Arial.exp0 --font=Arial --text=../langdata/ara/ara.training_text
[Sat Jan 7 10:09:43 DST 2017] /usr/local/bin/text2image --fontconfig_tmpdir=/tmp/font_tmp.0Tqbe3jIFz --fonts_dir=/home/shree/.fonts --strip_unrenderable_words --leading=32
 --char_spacing=0.0 --exposure=0 --outputbase=/tmp/tmp.Ey23alPX8e/ara/ara.Scheherazade.exp0 --font=Scheherazade --text=../langdata/ara/ara.training_text
[Sat Jan 7 10:09:43 DST 2017] /usr/local/bin/text2image --fontconfig_tmpdir=/tmp/font_tmp.0Tqbe3jIFz --fonts_dir=/home/shree/.fonts --strip_unrenderable_words --leading=32
 --char_spacing=0.0 --exposure=0 --outputbase=/tmp/tmp.Ey23alPX8e/ara/ara.Calibri.exp0 --font=Calibri --text=../langdata/ara/ara.training_text
[Sat Jan 7 10:09:43 DST 2017] /usr/local/bin/text2image --fontconfig_tmpdir=/tmp/font_tmp.0Tqbe3jIFz --fonts_dir=/home/shree/.fonts --strip_unrenderable_words --leading=32
 --char_spacing=0.0 --exposure=0 --outputbase=/tmp/tmp.Ey23alPX8e/ara/ara.Tahoma.exp0 --font=Tahoma --text=../langdata/ara/ara.training_text
[Sat Jan 7 10:09:43 DST 2017] /usr/local/bin/text2image --fontconfig_tmpdir=/tmp/font_tmp.0Tqbe3jIFz --fonts_dir=/home/shree/.fonts --strip_unrenderable_words --leading=32
 --char_spacing=0.0 --exposure=0 --outputbase=/tmp/tmp.Ey23alPX8e/ara/ara.FreeSerif.exp0 --font=FreeSerif --text=../langdata/ara/ara.training_text
Stripped 15 unrenderable words
[Sat Jan 7 10:09:43 DST 2017] /usr/local/bin/text2image --fontconfig_tmpdir=/tmp/font_tmp.0Tqbe3jIFz --fonts_dir=/home/shree/.fonts --strip_unrenderable_words --leading=32
 --char_spacing=0.0 --exposure=0 --outputbase=/tmp/tmp.Ey23alPX8e/ara/ara.Microsoft_Sans_Serif.exp0 --font=Microsoft Sans Serif --text=../langdata/ara/ara.training_text
Stripped 15 unrenderable words
Stripped 15 unrenderable words
Stripped 2 unrenderable words
Stripped 15 unrenderable words
Stripped 15 unrenderable words
Stripped 13 unrenderable words
Stripped 15 unrenderable words
Rendered page 0 to file /tmp/tmp.Ey23alPX8e/ara/ara.Arial_Unicode_MS.exp0.tif
Rendered page 0 to file /tmp/tmp.Ey23alPX8e/ara/ara.Scheherazade.exp0.tif
Rendered page 0 to file /tmp/tmp.Ey23alPX8e/ara/ara.Amiri.exp0.tif
Rendered page 0 to file /tmp/tmp.Ey23alPX8e/ara/ara.Calibri.exp0.tif
Rendered page 0 to file /tmp/tmp.Ey23alPX8e/ara/ara.Tahoma.exp0.tif
Rendered page 0 to file /tmp/tmp.Ey23alPX8e/ara/ara.Arial.exp0.tif
Rendered page 0 to file /tmp/tmp.Ey23alPX8e/ara/ara.FreeSerif.exp0.tif
Rendered page 0 to file /tmp/tmp.Ey23alPX8e/ara/ara.Microsoft_Sans_Serif.exp0.tif
Rendered page 1 to file /tmp/tmp.Ey23alPX8e/ara/ara.Arial_Unicode_MS.exp0.tif
Rendered page 1 to file /tmp/tmp.Ey23alPX8e/ara/ara.Scheherazade.exp0.tif
Rendered page 1 to file /tmp/tmp.Ey23alPX8e/ara/ara.Amiri.exp0.tif
Rendered page 1 to file /tmp/tmp.Ey23alPX8e/ara/ara.FreeSerif.exp0.tif
Rendered page 1 to file /tmp/tmp.Ey23alPX8e/ara/ara.Tahoma.exp0.tif
Rendered page 1 to file /tmp/tmp.Ey23alPX8e/ara/ara.Microsoft_Sans_Serif.exp0.tif
Rendered page 1 to file /tmp/tmp.Ey23alPX8e/ara/ara.Arial.exp0.tif
Rendered page 1 to file /tmp/tmp.Ey23alPX8e/ara/ara.Calibri.exp0.tif
Rendered page 2 to file /tmp/tmp.Ey23alPX8e/ara/ara.Arial_Unicode_MS.exp0.tif
Rendered page 2 to file /tmp/tmp.Ey23alPX8e/ara/ara.Scheherazade.exp0.tif
Rendered page 2 to file /tmp/tmp.Ey23alPX8e/ara/ara.Amiri.exp0.tif
Rendered page 2 to file /tmp/tmp.Ey23alPX8e/ara/ara.Tahoma.exp0.tif
Rendered page 2 to file /tmp/tmp.Ey23alPX8e/ara/ara.FreeSerif.exp0.tif
Rendering using Times New Roman,
Rendering using Courier New
Rendering using Traditional Arabic
[Sat Jan 7 10:10:02 DST 2017] /usr/local/bin/text2image --fontconfig_tmpdir=/tmp/font_tmp.0Tqbe3jIFz --fonts_dir=/home/shree/.fonts --strip_unrenderable_words --leading=32
 --char_spacing=0.0 --exposure=0 --outputbase=/tmp/tmp.Ey23alPX8e/ara/ara.Times_New_Roman.exp0 --font=Times New Roman, --text=../langdata/ara/ara.training_text
[Sat Jan 7 10:10:03 DST 2017] /usr/local/bin/text2image --fontconfig_tmpdir=/tmp/font_tmp.0Tqbe3jIFz --fonts_dir=/home/shree/.fonts --strip_unrenderable_words --leading=32
 --char_spacing=0.0 --exposure=0 --outputbase=/tmp/tmp.Ey23alPX8e/ara/ara.Courier_New.exp0 --font=Courier New --text=../langdata/ara/ara.training_text
[Sat Jan 7 10:10:03 DST 2017] /usr/local/bin/text2image --fontconfig_tmpdir=/tmp/font_tmp.0Tqbe3jIFz --fonts_dir=/home/shree/.fonts --strip_unrenderable_words --leading=32
 --char_spacing=0.0 --exposure=0 --outputbase=/tmp/tmp.Ey23alPX8e/ara/ara.Traditional_Arabic.exp0 --font=Traditional Arabic --text=../langdata/ara/ara.training_text
Stripped 15 unrenderable words
Stripped 15 unrenderable words
Stripped 15 unrenderable words
Rendered page 0 to file /tmp/tmp.Ey23alPX8e/ara/ara.Times_New_Roman.exp0.tif
Rendered page 0 to file /tmp/tmp.Ey23alPX8e/ara/ara.Courier_New.exp0.tif
Rendered page 0 to file /tmp/tmp.Ey23alPX8e/ara/ara.Traditional_Arabic.exp0.tif
Rendered page 1 to file /tmp/tmp.Ey23alPX8e/ara/ara.Times_New_Roman.exp0.tif
Rendered page 1 to file /tmp/tmp.Ey23alPX8e/ara/ara.Traditional_Arabic.exp0.tif
Rendered page 1 to file /tmp/tmp.Ey23alPX8e/ara/ara.Courier_New.exp0.tif
Rendered page 2 to file /tmp/tmp.Ey23alPX8e/ara/ara.Traditional_Arabic.exp0.tif

=== Phase UP: Generating unicharset and unichar properties files ===
[Sat Jan 7 10:10:13 DST 2017] /usr/local/bin/unicharset_extractor -D /tmp/tmp.Ey23alPX8e/ara/ /tmp/tmp.Ey23alPX8e/ara/ara.Amiri.exp0.box /tmp/tmp.Ey23alPX8e/ara/ara.Arial.
exp0.box /tmp/tmp.Ey23alPX8e/ara/ara.Arial_Unicode_MS.exp0.box /tmp/tmp.Ey23alPX8e/ara/ara.Calibri.exp0.box /tmp/tmp.Ey23alPX8e/ara/ara.Courier_New.exp0.box /tmp/tmp.Ey23a
lPX8e/ara/ara.FreeSerif.exp0.box /tmp/tmp.Ey23alPX8e/ara/ara.Microsoft_Sans_Serif.exp0.box /tmp/tmp.Ey23alPX8e/ara/ara.Scheherazade.exp0.box /tmp/tmp.Ey23alPX8e/ara/ara.Ta
homa.exp0.box /tmp/tmp.Ey23alPX8e/ara/ara.Times_New_Roman.exp0.box /tmp/tmp.Ey23alPX8e/ara/ara.Traditional_Arabic.exp0.box
Extracting unicharset from /tmp/tmp.Ey23alPX8e/ara/ara.Amiri.exp0.box
Extracting unicharset from /tmp/tmp.Ey23alPX8e/ara/ara.Arial.exp0.box
Extracting unicharset from /tmp/tmp.Ey23alPX8e/ara/ara.Arial_Unicode_MS.exp0.box
Extracting unicharset from /tmp/tmp.Ey23alPX8e/ara/ara.Calibri.exp0.box
Extracting unicharset from /tmp/tmp.Ey23alPX8e/ara/ara.Courier_New.exp0.box
Extracting unicharset from /tmp/tmp.Ey23alPX8e/ara/ara.FreeSerif.exp0.box
Extracting unicharset from /tmp/tmp.Ey23alPX8e/ara/ara.Microsoft_Sans_Serif.exp0.box
Extracting unicharset from /tmp/tmp.Ey23alPX8e/ara/ara.Scheherazade.exp0.box
Extracting unicharset from /tmp/tmp.Ey23alPX8e/ara/ara.Tahoma.exp0.box
Extracting unicharset from /tmp/tmp.Ey23alPX8e/ara/ara.Times_New_Roman.exp0.box
Extracting unicharset from /tmp/tmp.Ey23alPX8e/ara/ara.Traditional_Arabic.exp0.box
Wrote unicharset file /tmp/tmp.Ey23alPX8e/ara//unicharset.
[Sat Jan 7 10:10:14 DST 2017] /usr/local/bin/set_unicharset_properties -U /tmp/tmp.Ey23alPX8e/ara/ara.unicharset -O /tmp/tmp.Ey23alPX8e/ara/ara.unicharset -X /tmp/tmp.Ey23
alPX8e/ara/ara.xheights --script_dir=../langdata
Loaded unicharset of size 381 from file /tmp/tmp.Ey23alPX8e/ara/ara.unicharset
Setting unichar properties
Mirror { of } is not in unicharset
Writing unicharset to file /tmp/tmp.Ey23alPX8e/ara/ara.unicharset

=== Phase D: Generating Dawg files ===
Generating word Dawg
[Sat Jan 7 10:10:14 DST 2017] /usr/local/bin/wordlist2dawg -r 1 ../langdata/ara/ara.wordlist /tmp/tmp.Ey23alPX8e/ara/ara.word-dawg /tmp/tmp.Ey23alPX8e/ara/ara.unicharset
Set reverse_policy to RRP_REVERSE_IF_HAS_RTL
Loading unicharset from '/tmp/tmp.Ey23alPX8e/ara/ara.unicharset'
Reading word list from '../langdata/ara/ara.wordlist'
Reducing Trie to SquishedDawg
Writing squished DAWG to '/tmp/tmp.Ey23alPX8e/ara/ara.word-dawg'
Generating frequent-word Dawg
[Sat Jan 7 10:10:20 DST 2017] /usr/local/bin/wordlist2dawg -r 1 /tmp/tmp.Ey23alPX8e/ara/ara.wordlist.clean.freq /tmp/tmp.Ey23alPX8e/ara/ara.freq-dawg /tmp/tmp.Ey23alPX8e/a
ra/ara.unicharset
Set reverse_policy to RRP_REVERSE_IF_HAS_RTL
Loading unicharset from '/tmp/tmp.Ey23alPX8e/ara/ara.unicharset'
Reading word list from '/tmp/tmp.Ey23alPX8e/ara/ara.wordlist.clean.freq'
Reducing Trie to SquishedDawg
Writing squished DAWG to '/tmp/tmp.Ey23alPX8e/ara/ara.freq-dawg'
[Sat Jan 7 10:10:20 DST 2017] /usr/local/bin/wordlist2dawg -r 2 ../langdata/ara/ara.punc /tmp/tmp.Ey23alPX8e/ara/ara.punc-dawg /tmp/tmp.Ey23alPX8e/ara/ara.unicharset
Set reverse_policy to RRP_FORCE_REVERSE
Loading unicharset from '/tmp/tmp.Ey23alPX8e/ara/ara.unicharset'
Reading word list from '../langdata/ara/ara.punc'
Reducing Trie to SquishedDawg
Writing squished DAWG to '/tmp/tmp.Ey23alPX8e/ara/ara.punc-dawg'
[Sat Jan 7 10:10:21 DST 2017] /usr/local/bin/wordlist2dawg -r 0 ../langdata/ara/ara.numbers /tmp/tmp.Ey23alPX8e/ara/ara.number-dawg /tmp/tmp.Ey23alPX8e/ara/ara.unicharset
Set reverse_policy to RRP_DO_NO_REVERSE
Loading unicharset from '/tmp/tmp.Ey23alPX8e/ara/ara.unicharset'
Reading word list from '../langdata/ara/ara.numbers'
Reducing Trie to SquishedDawg
Writing squished DAWG to '/tmp/tmp.Ey23alPX8e/ara/ara.number-dawg'
[Sat Jan 7 10:10:21 DST 2017] /usr/local/bin/wordlist2dawg -r 1 ../langdata/ara/ara.word.bigrams /tmp/tmp.Ey23alPX8e/ara/ara.bigram-dawg /tmp/tmp.Ey23alPX8e/ara/ara.unicha
rset
Set reverse_policy to RRP_REVERSE_IF_HAS_RTL
Loading unicharset from '/tmp/tmp.Ey23alPX8e/ara/ara.unicharset'
Reading word list from '../langdata/ara/ara.word.bigrams'
Reducing Trie to SquishedDawg
Writing squished DAWG to '/tmp/tmp.Ey23alPX8e/ara/ara.bigram-dawg'

=== Phase E: Generating lstmf files ===
Using TESSDATA_PREFIX=./tessdata
[Sat Jan 7 10:10:31 DST 2017] /usr/local/bin/tesseract /tmp/tmp.Ey23alPX8e/ara/ara.Amiri.exp0.tif /tmp/tmp.Ey23alPX8e/ara/ara.Amiri.exp0 lstm.train ../langdata/ara/ara.con
fig
[Sat Jan 7 10:10:32 DST 2017] /usr/local/bin/tesseract /tmp/tmp.Ey23alPX8e/ara/ara.Arial_Unicode_MS.exp0.tif /tmp/tmp.Ey23alPX8e/ara/ara.Arial_Unicode_MS.exp0 lstm.train .
./langdata/ara/ara.config
[Sat Jan 7 10:10:32 DST 2017] /usr/local/bin/tesseract /tmp/tmp.Ey23alPX8e/ara/ara.Arial.exp0.tif /tmp/tmp.Ey23alPX8e/ara/ara.Arial.exp0 lstm.train ../langdata/ara/ara.con
fig
[Sat Jan 7 10:10:32 DST 2017] /usr/local/bin/tesseract /tmp/tmp.Ey23alPX8e/ara/ara.Calibri.exp0.tif /tmp/tmp.Ey23alPX8e/ara/ara.Calibri.exp0 lstm.train ../langdata/ara/ara
.config
[Sat Jan 7 10:10:32 DST 2017] /usr/local/bin/tesseract /tmp/tmp.Ey23alPX8e/ara/ara.Courier_New.exp0.tif /tmp/tmp.Ey23alPX8e/ara/ara.Courier_New.exp0 lstm.train ../langdata
/ara/ara.config
[Sat Jan 7 10:10:32 DST 2017] /usr/local/bin/tesseract /tmp/tmp.Ey23alPX8e/ara/ara.FreeSerif.exp0.tif /tmp/tmp.Ey23alPX8e/ara/ara.FreeSerif.exp0 lstm.train ../langdata/ara
/ara.config
[Sat Jan 7 10:10:32 DST 2017] /usr/local/bin/tesseract /tmp/tmp.Ey23alPX8e/ara/ara.Microsoft_Sans_Serif.exp0.tif /tmp/tmp.Ey23alPX8e/ara/ara.Microsoft_Sans_Serif.exp0 lstm
.train ../langdata/ara/ara.config
[Sat Jan 7 10:10:32 DST 2017] /usr/local/bin/tesseract /tmp/tmp.Ey23alPX8e/ara/ara.Scheherazade.exp0.tif /tmp/tmp.Ey23alPX8e/ara/ara.Scheherazade.exp0 lstm.train ../langda
ta/ara/ara.config
Tesseract Open Source OCR Engine v4.00.00alpha-239-g3817aa3 with Leptonica
Page 1
Tesseract Open Source OCR Engine v4.00.00alpha-239-g3817aa3 with Leptonica
Page 1
Tesseract Open Source OCR Engine v4.00.00alpha-239-g3817aa3 with Leptonica
Page 1
Tesseract Open Source OCR Engine v4.00.00alpha-239-g3817aa3 with Leptonica
Page 1
Tesseract Open Source OCR Engine v4.00.00alpha-239-g3817aa3 with Leptonica
Page 1
Tesseract Open Source OCR Engine v4.00.00alpha-239-g3817aa3 with Leptonica
Page 1
Tesseract Open Source OCR Engine v4.00.00alpha-239-g3817aa3 with Leptonica
Page 1
Tesseract Open Source OCR Engine v4.00.00alpha-239-g3817aa3 with Leptonica
Page 1
Detected 1300 diacritics
Detected 675 diacritics
Detected 923 diacritics
Page 2
Page 2
No block overlapping textline: اونُمَآ نَيذِلَّا اوقُلَ اذَإِوَ نَومُلَعْيَ الَ نْكِلَوَ ءُاهَفَسُّلا مُهُ مْهُنَّإِ الَأَ ءُاهَفَسُّلا نَمَآ اكَمَ
No block overlapping textline: امَّلَفَ ارًانَ دَقَوْتَسْا يذِلَّا لِثَمَكَ مْهُلُثَمَ نَيدِتَهْمُ اونُاكَ امَوَ مْهُتُرَاجَتِ تْحَبِرَ امَفَ ىدَهُلْابِ
Page 2
Page 2
Page 2
Page 2
Page 2
Page 2
Loaded 39/39 pages (1-39) of document /tmp/tmp.Ey23alPX8e/ara/ara.Scheherazade.exp0.lstmf
Page 3
Loaded 55/55 pages (1-55) of document /tmp/tmp.Ey23alPX8e/ara/ara.Arial.exp0.lstmf
Loaded 53/53 pages (1-53) of document /tmp/tmp.Ey23alPX8e/ara/ara.FreeSerif.exp0.lstmf
Loaded 55/55 pages (1-55) of document /tmp/tmp.Ey23alPX8e/ara/ara.Microsoft_Sans_Serif.exp0.lstmf
Loaded 50/50 pages (1-50) of document /tmp/tmp.Ey23alPX8e/ara/ara.Arial_Unicode_MS.exp0.lstmf
Loaded 36/36 pages (1-36) of document /tmp/tmp.Ey23alPX8e/ara/ara.Amiri.exp0.lstmf
Loaded 59/59 pages (1-59) of document /tmp/tmp.Ey23alPX8e/ara/ara.Calibri.exp0.lstmf
Page 3
Page 3
Page 3
Loaded 83/83 pages (1-83) of document /tmp/tmp.Ey23alPX8e/ara/ara.Scheherazade.exp0.lstmf
Loaded 55/55 pages (1-55) of document /tmp/tmp.Ey23alPX8e/ara/ara.Courier_New.exp0.lstmf
Loaded 109/109 pages (1-109) of document /tmp/tmp.Ey23alPX8e/ara/ara.FreeSerif.exp0.lstmf
Loaded 100/100 pages (1-100) of document /tmp/tmp.Ey23alPX8e/ara/ara.Arial_Unicode_MS.exp0.lstmf
Loaded 79/79 pages (1-79) of document /tmp/tmp.Ey23alPX8e/ara/ara.Amiri.exp0.lstmf
[Sat Jan 7 10:10:59 DST 2017] /usr/local/bin/tesseract /tmp/tmp.Ey23alPX8e/ara/ara.Tahoma.exp0.tif /tmp/tmp.Ey23alPX8e/ara/ara.Tahoma.exp0 lstm.train ../langdata/ara/ara.c
onfig
[Sat Jan 7 10:10:59 DST 2017] /usr/local/bin/tesseract /tmp/tmp.Ey23alPX8e/ara/ara.Times_New_Roman.exp0.tif /tmp/tmp.Ey23alPX8e/ara/ara.Times_New_Roman.exp0 lstm.train ../
langdata/ara/ara.config
[Sat Jan 7 10:10:59 DST 2017] /usr/local/bin/tesseract /tmp/tmp.Ey23alPX8e/ara/ara.Traditional_Arabic.exp0.tif /tmp/tmp.Ey23alPX8e/ara/ara.Traditional_Arabic.exp0 lstm.tra
in ../langdata/ara/ara.config
Tesseract Open Source OCR Engine v4.00.00alpha-239-g3817aa3 with Leptonica
Page 1
Tesseract Open Source OCR Engine v4.00.00alpha-239-g3817aa3 with Leptonica
Page 1
Tesseract Open Source OCR Engine v4.00.00alpha-239-g3817aa3 with Leptonica
Page 1
Page 2
Page 2
Page 2
Loaded 43/43 pages (1-43) of document /tmp/tmp.Ey23alPX8e/ara/ara.Traditional_Arabic.exp0.lstmf
Page 3
Loaded 56/56 pages (1-56) of document /tmp/tmp.Ey23alPX8e/ara/ara.Times_New_Roman.exp0.lstmf
Loaded 53/53 pages (1-53) of document /tmp/tmp.Ey23alPX8e/ara/ara.Tahoma.exp0.lstmf
Page 3
Loaded 90/90 pages (1-90) of document /tmp/tmp.Ey23alPX8e/ara/ara.Traditional_Arabic.exp0.lstmf
Loaded 109/109 pages (1-109) of document /tmp/tmp.Ey23alPX8e/ara/ara.Tahoma.exp0.lstmf

=== Constructing LSTM training data ===
Creating new directory /home/shree/tesstutorial/aralayer
Copying ../langdata/ara/ara.config to /home/shree/tesstutorial/aralayer
Moving /tmp/tmp.Ey23alPX8e/ara/ara.unicharset to /home/shree/tesstutorial/aralayer
Moving /tmp/tmp.Ey23alPX8e/ara/ara.number-dawg to /home/shree/tesstutorial/aralayer/ara.lstm-number-dawg
Moving /tmp/tmp.Ey23alPX8e/ara/ara.punc-dawg to /home/shree/tesstutorial/aralayer/ara.lstm-punc-dawg
Moving /tmp/tmp.Ey23alPX8e/ara/ara.word-dawg to /home/shree/tesstutorial/aralayer/ara.lstm-word-dawg
Moving /tmp/tmp.Ey23alPX8e/ara/ara.Amiri.exp0.box to /home/shree/tesstutorial/aralayer
Moving /tmp/tmp.Ey23alPX8e/ara/ara.Arial.exp0.box to /home/shree/tesstutorial/aralayer
Moving /tmp/tmp.Ey23alPX8e/ara/ara.Arial_Unicode_MS.exp0.box to /home/shree/tesstutorial/aralayer
Moving /tmp/tmp.Ey23alPX8e/ara/ara.Calibri.exp0.box to /home/shree/tesstutorial/aralayer
Moving /tmp/tmp.Ey23alPX8e/ara/ara.Courier_New.exp0.box to /home/shree/tesstutorial/aralayer
Moving /tmp/tmp.Ey23alPX8e/ara/ara.FreeSerif.exp0.box to /home/shree/tesstutorial/aralayer
Moving /tmp/tmp.Ey23alPX8e/ara/ara.Microsoft_Sans_Serif.exp0.box to /home/shree/tesstutorial/aralayer
Moving /tmp/tmp.Ey23alPX8e/ara/ara.Scheherazade.exp0.box to /home/shree/tesstutorial/aralayer
Moving /tmp/tmp.Ey23alPX8e/ara/ara.Tahoma.exp0.box to /home/shree/tesstutorial/aralayer
Moving /tmp/tmp.Ey23alPX8e/ara/ara.Times_New_Roman.exp0.box to /home/shree/tesstutorial/aralayer
Moving /tmp/tmp.Ey23alPX8e/ara/ara.Traditional_Arabic.exp0.box to /home/shree/tesstutorial/aralayer
Moving /tmp/tmp.Ey23alPX8e/ara/ara.Amiri.exp0.tif to /home/shree/tesstutorial/aralayer
Moving /tmp/tmp.Ey23alPX8e/ara/ara.Arial.exp0.tif to /home/shree/tesstutorial/aralayer
Moving /tmp/tmp.Ey23alPX8e/ara/ara.Arial_Unicode_MS.exp0.tif to /home/shree/tesstutorial/aralayer
Moving /tmp/tmp.Ey23alPX8e/ara/ara.Calibri.exp0.tif to /home/shree/tesstutorial/aralayer
Moving /tmp/tmp.Ey23alPX8e/ara/ara.Courier_New.exp0.tif to /home/shree/tesstutorial/aralayer
Moving /tmp/tmp.Ey23alPX8e/ara/ara.FreeSerif.exp0.tif to /home/shree/tesstutorial/aralayer
Moving /tmp/tmp.Ey23alPX8e/ara/ara.Microsoft_Sans_Serif.exp0.tif to /home/shree/tesstutorial/aralayer
Moving /tmp/tmp.Ey23alPX8e/ara/ara.Scheherazade.exp0.tif to /home/shree/tesstutorial/aralayer
Moving /tmp/tmp.Ey23alPX8e/ara/ara.Tahoma.exp0.tif to /home/shree/tesstutorial/aralayer
Moving /tmp/tmp.Ey23alPX8e/ara/ara.Times_New_Roman.exp0.tif to /home/shree/tesstutorial/aralayer
Moving /tmp/tmp.Ey23alPX8e/ara/ara.Traditional_Arabic.exp0.tif to /home/shree/tesstutorial/aralayer
Moving /tmp/tmp.Ey23alPX8e/ara/ara.Amiri.exp0.lstmf to /home/shree/tesstutorial/aralayer
Moving /tmp/tmp.Ey23alPX8e/ara/ara.Arial.exp0.lstmf to /home/shree/tesstutorial/aralayer
Moving /tmp/tmp.Ey23alPX8e/ara/ara.Arial_Unicode_MS.exp0.lstmf to /home/shree/tesstutorial/aralayer
Moving /tmp/tmp.Ey23alPX8e/ara/ara.Calibri.exp0.lstmf to /home/shree/tesstutorial/aralayer
Moving /tmp/tmp.Ey23alPX8e/ara/ara.Courier_New.exp0.lstmf to /home/shree/tesstutorial/aralayer
Moving /tmp/tmp.Ey23alPX8e/ara/ara.FreeSerif.exp0.lstmf to /home/shree/tesstutorial/aralayer
Moving /tmp/tmp.Ey23alPX8e/ara/ara.Microsoft_Sans_Serif.exp0.lstmf to /home/shree/tesstutorial/aralayer
Moving /tmp/tmp.Ey23alPX8e/ara/ara.Scheherazade.exp0.lstmf to /home/shree/tesstutorial/aralayer
Moving /tmp/tmp.Ey23alPX8e/ara/ara.Tahoma.exp0.lstmf to /home/shree/tesstutorial/aralayer
Moving /tmp/tmp.Ey23alPX8e/ara/ara.Times_New_Roman.exp0.lstmf to /home/shree/tesstutorial/aralayer
Moving /tmp/tmp.Ey23alPX8e/ara/ara.Traditional_Arabic.exp0.lstmf to /home/shree/tesstutorial/aralayer

Completed training for language 'ara'


@Shreeshrii
Collaborator Author

$ mkdir -p ~/tesstutorial/aralayer_from_ara
$ combine_tessdata -e ../tessdata/ara.traineddata \
>   ~/tesstutorial/aralayer_from_ara/ara.lstm
Extracting tessdata components from ../tessdata/ara.traineddata
Wrote /home/shree/tesstutorial/aralayer_from_ara/ara.lstm
$
$  lstmtraining -U ~/tesstutorial/aralayer/ara.unicharset \
>   --script_dir ../langdata  --debug_interval 0 \
>   --continue_from ~/tesstutorial/aralayer_from_ara/ara.lstm \
>   --append_index 5 --net_spec '[Lfx256 O1c105]' \
>   --learning_rate 10e-5 \
>   --net_mode 192 \
>   --perfect_sample_delay 19 \
>   --model_output ~/tesstutorial/aralayer_from_ara/aralayer \
>   --train_listfile ~/tesstutorial/aralayer/ara.training_files.txt \
>   --eval_listfile ~/tesstutorial/ara/ara.training_files.txt \
>   --max_iterations 50000
Loaded file /home/shree/tesstutorial/aralayer_from_ara/ara.lstm, unpacking...
Warning: LSTMTrainer deserialized an LSTMRecognizer!
Continuing from /home/shree/tesstutorial/aralayer_from_ara/ara.lstm
Mirror { of } is not in unicharset
Appending a new network to an old one!!Setting unichar properties
Setting properties for script Common
Setting properties for script Latin
Setting properties for script Arabic
Warning: given outputs 105 not equal to unicharset of 106.
Num outputs,weights in serial:
  Lfx256:256, 394240
  Fc106:106, 27242
Total weights = 421482
Built network:[1,0,0,1[C5,5Ft16]Mp3,3Lfys64Lfx128Lrx128Lfx256Fc106] from request [Lfx256 O1c105]
Training parameters:
  Debug interval = 0, weights = 0.1, learning rate = 0.0001, momentum=0.9
Loaded 111/111 pages (1-111) of document /home/shree/tesstutorial/aralayer/ara.Amiri.exp0.lstmf
Loaded 113/113 pages (1-113) of document /home/shree/tesstutorial/aralayer/ara.Arial.exp0.lstmf
Loaded 113/113 pages (1-113) of document /home/shree/tesstutorial/aralayer/ara.Microsoft_Sans_Serif.exp0.lstmf
Loaded 113/113 pages (1-113) of document /home/shree/tesstutorial/aralayer/ara.Scheherazade.exp0.lstmf
Loaded 113/113 pages (1-113) of document /home/shree/tesstutorial/aralayer/ara.Tahoma.exp0.lstmf
Loaded 113/113 pages (1-113) of document /home/shree/tesstutorial/aralayer/ara.Courier_New.exp0.lstmf
Loaded 113/113 pages (1-113) of document /home/shree/tesstutorial/aralayer/ara.Arial_Unicode_MS.exp0.lstmf
Loaded 229/229 pages (1-229) of document /home/shree/tesstutorial/ara/ara.Amiri.exp0.lstmf
Loaded 113/113 pages (1-113) of document /home/shree/tesstutorial/aralayer/ara.Calibri.exp0.lstmf
Loaded 113/113 pages (1-113) of document /home/shree/tesstutorial/aralayer/ara.FreeSerif.exp0.lstmf
Loaded 232/232 pages (1-232) of document /home/shree/tesstutorial/ara/ara.Arial.exp0.lstmf
Loaded 113/113 pages (1-113) of document /home/shree/tesstutorial/aralayer/ara.Times_New_Roman.exp0.lstmf
Loaded 113/113 pages (1-113) of document /home/shree/tesstutorial/aralayer/ara.Traditional_Arabic.exp0.lstmf
At iteration 100/100/100, Mean rms=6.949%, delta=69.759%, char train=127.235%, word train=100%, skip ratio=0%,  New worst char error = 127.235 wrote checkpoint.

At iteration 200/200/200, Mean rms=6.558%, delta=62.072%, char train=116.738%, word train=100%, skip ratio=0%,  New worst char error = 116.738 wrote checkpoint.

Encoding of string failed! Failure bytes: ffffffd9 ffffff92 20 ffffffd9 ffffff8d 20 ffffffd9 ffffff90 20 ffffffd9 ffffff8f ffffffd9 ffffff8c 20 ffffffd9 ffffff8b 20 ffffff
d9 ffffff8e 20 20 ffffffd9 ffffff92 20 ffffffd9 ffffff8d 20 ffffffd9 ffffff90 20 ffffffd9 ffffff8f ffffffd9 ffffff8c 20 ffffffd9 ffffff8b 20 ffffffd9 ffffff8e 20 ffffffd9
ffffff91 20 ffffffd8 ffffffa8 ffffffd9 ffffff91 ffffffd9 ffffff90 ffffffd8 ffffffb1 ffffffd9 ffffff8e 20 ffffffd9 ffffff87 ffffffd9 ffffff90 ffffffd9 ffffff84 ffffffd9 fff
fff91 ffffffd9 ffffff8e ffffffd9 ffffff84 ffffffd9 ffffff90 20 ffffffd8 ffffffaf ffffffd9 ffffff8f ffffffd9 ffffff85 ffffffd9 ffffff92 ffffffd8 ffffffad ffffffd9 ffffff8e
ffffffd9 ffffff84 ffffffd9 ffffff92 ffffffd8 ffffffa7 20 ffffffd9 ffffff85 ffffffd9 ffffff90 ffffffd9 ffffff8a ffffffd8 ffffffad ffffffd9 ffffff90 ffffffd8 ffffffb1 ffffff
d9 ffffff91 ffffffd9 ffffff8e ffffffd9 ffffff84 ffffffd8 ffffffa7 20 ffffffd9 ffffff86 ffffffd9 ffffff90 ffffffd9 ffffff85 ffffffd9 ffffff8e ffffffd8 ffffffad ffffffd9 fff
fff92 ffffffd8 ffffffb1 ffffffd9 ffffff91 ffffffd9 ffffff8e ffffffd9 ffffff84 ffffffd8 ffffffa7 20 ffffffd9 ffffff87 ffffffd9 ffffff90 ffffffd9 ffffff84 ffffffd9 ffffff91
ffffffd9 ffffff8e ffffffd9 ffffff84 ffffffd8 ffffffa7 20 ffffffd9 ffffff85 ffffffd9 ffffff90 ffffffd8 ffffffb3 ffffffd9 ffffff92 ffffffd8 ffffffa8 ffffffd9 ffffff90
Can't encode transcription: / بَـَتـكَ ةحتف تاكرحلا  ْ ٍ ِ ٌُ ً َ  ْ ٍ ِ ٌُ ً َ ّ بِّرَ هِلَّلِ دُمْحَلْا مِيحِرَّلا نِمَحْرَّلا هِلَّلا مِسْبِ
At iteration 300/300/301, Mean rms=6.463%, delta=59.695%, char train=111.691%, word train=100%, skip ratio=0.333%,  New worst char error = 111.691 wrote checkpoint.

At iteration 400/400/401, Mean rms=6.363%, delta=57.356%, char train=106.695%, word train=100%, skip ratio=0.25%,  New worst char error = 106.695 wrote checkpoint.

lstmtraining: ../ccutil/genericvector.h:696: T& GenericVector<T>::operator[](int) const [with T = char]: Assertion `index >= 0 && index < size_used_' failed.
Aborted (core dumped)
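
The "Failure bytes" dump above is hard to read because each byte above 0x7f is printed with an `ffffff` prefix, which looks like a signed char being sign-extended before printing (an assumption; the thread does not confirm it). A minimal Python sketch for turning such a dump back into text follows. The helper names `decode_failure_bytes` and `describe` are made up for illustration and are not part of Tesseract. Note that the pasted log is wrapped in the middle of some tokens, so wrapped lines have to be rejoined before feeding them in.

```python
import unicodedata


def decode_failure_bytes(dump):
    """Turn tokens such as 'ffffffd9 ffffff92 20' back into a string."""
    # Keep only the low byte of each token: 'ffffffd9' -> 0xd9, '20' -> 0x20.
    raw = bytes(int(token, 16) & 0xFF for token in dump.split())
    # errors="replace" keeps going if the dump was cut in the middle of a character.
    return raw.decode("utf-8", errors="replace")


def describe(text):
    """List each code point with its Unicode name for easier reading."""
    return ["U+%04X %s" % (ord(ch), unicodedata.name(ch, "<unnamed>")) for ch in text]


# First few tokens of the dump, copied from the log above.
sample = "ffffffd9 ffffff92 20 ffffffd9 ffffff8d 20 ffffffd9 ffffff90"
for entry in describe(decode_failure_bytes(sample)):
    print(entry)
```

For this sample the decoded code points are U+0652 (ARABIC SUKUN), U+064D (ARABIC KASRATAN) and U+0650 (ARABIC KASRA), separated by spaces. That matches the isolated diacritics visible in the "Can't encode transcription" line, so the failing line appears to begin with standalone combining marks.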

@Shreeshrii
Collaborator Author

Shreeshrii commented Jan 8, 2017

The crash seems to happen only when --eval_listfile is given; training continues if that option is omitted. See below:

shree@ALL-IN-1-TOUCH:/mnt/c/Users/User/shree/tesseract-ocr$  lstmtraining -U ~/tesstutorial/aralayer/ara.unicharset \
>   --script_dir ../langdata  --debug_interval 0 \
>   --continue_from ~/tesstutorial/aralayer_from_ara/ara.lstm \
>   --append_index 5 --net_spec '[Lfx256 O1c105]' \
>   --learning_rate 10e-5 \
>   --net_mode 192 \
>   --perfect_sample_delay 19 \
>   --model_output ~/tesstutorial/aralayer_from_ara/aralayer \
>    --eval_listfile ~/tesstutorial/ara/ara.training_files.txt  \
>   --train_listfile ~/tesstutorial/aralayer/ara.training_files.txt \
>   --max_iterations 50000
Loaded file /home/shree/tesstutorial/aralayer_from_ara/aralayer_checkpoint, unpacking...
Successfully restored trainer from /home/shree/tesstutorial/aralayer_from_ara/aralayer_checkpoint
Loaded 111/111 pages (1-111) of document /home/shree/tesstutorial/aralayer/ara.Amiri.exp0.lstmf
Loaded 113/113 pages (1-113) of document /home/shree/tesstutorial/aralayer/ara.Arial.exp0.lstmf
Loaded 113/113 pages (1-113) of document /home/shree/tesstutorial/aralayer/ara.Calibri.exp0.lstmf
Loaded 113/113 pages (1-113) of document /home/shree/tesstutorial/aralayer/ara.Courier_New.exp0.lstmf
Loaded 113/113 pages (1-113) of document /home/shree/tesstutorial/aralayer/ara.Arial_Unicode_MS.exp0.lstmf
Loaded 113/113 pages (1-113) of document /home/shree/tesstutorial/aralayer/ara.FreeSerif.exp0.lstmf
Loaded 113/113 pages (1-113) of document /home/shree/tesstutorial/aralayer/ara.Microsoft_Sans_Serif.exp0.lstmf
Loaded 113/113 pages (1-113) of document /home/shree/tesstutorial/aralayer/ara.Tahoma.exp0.lstmf
Loaded 113/113 pages (1-113) of document /home/shree/tesstutorial/aralayer/ara.Scheherazade.exp0.lstmf
Loaded 229/229 pages (1-229) of document /home/shree/tesstutorial/ara/ara.Amiri.exp0.lstmf
Loaded 31/113 pages (83-113) of document /home/shree/tesstutorial/aralayer/ara.Traditional_Arabic.exp0.lstmf
Loaded 232/232 pages (1-232) of document /home/shree/tesstutorial/ara/ara.Arial.exp0.lstmf
Loaded 31/113 pages (83-113) of document /home/shree/tesstutorial/aralayer/ara.Times_New_Roman.exp0.lstmf
At iteration 16533/33300/33327, Mean rms=0.79%, delta=0.326%, char train=2.38%, word train=11.082%, skip ratio=0.1%,  New worst char error = 2.38 wrote checkpoint.

lstmtraining: ../ccutil/genericvector.h:696: T& GenericVector<T>::operator[](int) const [with T = char]: Assertion `index >= 0 && index < size_used_' failed.
2 Percent improvement time=7141, best error was 4.338 @ 9418
Aborted (core dumped)

Without --eval_listfile, the process continues:

 shree@ALL-IN-1-TOUCH:/mnt/c/Users/User/shree/tesseract-ocr$  lstmtraining -U ~/tesstutorial/aralayer/ara.unicharset \
>   --script_dir ../langdata  --debug_interval 0 \
>   --continue_from ~/tesstutorial/aralayer_from_ara/ara.lstm \
>   --append_index 5 --net_spec '[Lfx256 O1c105]' \
>   --learning_rate 10e-5 \
>   --net_mode 192 \
>   --perfect_sample_delay 19 \
>   --model_output ~/tesstutorial/aralayer_from_ara/aralayer \
>    --train_listfile ~/tesstutorial/aralayer/ara.training_files.txt \
>   --max_iterations 50000
Loaded file /home/shree/tesstutorial/aralayer_from_ara/aralayer_checkpoint, unpacking...
Successfully restored trainer from /home/shree/tesstutorial/aralayer_from_ara/aralayer_checkpoint
Loaded 111/111 pages (1-111) of document /home/shree/tesstutorial/aralayer/ara.Amiri.exp0.lstmf
Loaded 113/113 pages (1-113) of document /home/shree/tesstutorial/aralayer/ara.Arial.exp0.lstmf
Loaded 113/113 pages (1-113) of document /home/shree/tesstutorial/aralayer/ara.Calibri.exp0.lstmf
Loaded 113/113 pages (1-113) of document /home/shree/tesstutorial/aralayer/ara.Microsoft_Sans_Serif.exp0.lstmf
Loaded 113/113 pages (1-113) of document /home/shree/tesstutorial/aralayer/ara.Arial_Unicode_MS.exp0.lstmf
Loaded 113/113 pages (1-113) of document /home/shree/tesstutorial/aralayer/ara.Scheherazade.exp0.lstmf
Loaded 113/113 pages (1-113) of document /home/shree/tesstutorial/aralayer/ara.FreeSerif.exp0.lstmf
Loaded 113/113 pages (1-113) of document /home/shree/tesstutorial/aralayer/ara.Tahoma.exp0.lstmf
Loaded 113/113 pages (1-113) of document /home/shree/tesstutorial/aralayer/ara.Courier_New.exp0.lstmf
Loaded 22/113 pages (92-113) of document /home/shree/tesstutorial/aralayer/ara.Traditional_Arabic.exp0.lstmf
Loaded 22/113 pages (92-113) of document /home/shree/tesstutorial/aralayer/ara.Times_New_Roman.exp0.lstmf
2 Percent improvement time=7141, best error was 4.338 @ 9418
At iteration 16559/33400/33427, Mean rms=0.776%, delta=0.33%, char train=2.313%, word train=10.483%, skip ratio=0.1%,  New best char error = 2.313 wrote best model:/home/s
hree/tesstutorial/aralayer_from_ara/aralayer2.313_16559.lstm wrote checkpoint.

2 Percent improvement time=7177, best error was 4.338 @ 9418
At iteration 16595/33500/33527, Mean rms=0.778%, delta=0.334%, char train=2.312%, word train=10.634%, skip ratio=0.1%,  New best char error = 2.312 wrote checkpoint.

Loaded 113/113 pages (1-113) of document /home/shree/tesstutorial/aralayer/ara.Times_New_Roman.exp0.lstmf
Loaded 113/113 pages (1-113) of document /home/shree/tesstutorial/aralayer/ara.Traditional_Arabic.exp0.lstmf
At iteration 16627/33600/33627, Mean rms=0.788%, delta=0.344%, char train=2.473%, word train=11.073%, skip ratio=0%,  New worst char error = 2.473 wrote checkpoint.

At iteration 16664/33700/33727, Mean rms=0.79%, delta=0.356%, char train=2.519%, word train=11.23%, skip ratio=0%,  New worst char error = 2.519 wrote checkpoint.

Encoding of string failed! Failure bytes: ffffffd9 ffffff92 20 ffffffd9 ffffff8d 20 ffffffd9 ffffff90 20 ffffffd9 ffffff8f ffffffd9 ffffff8c 20 ffffffd9 ffffff8b 20 ffffff
d9 ffffff8e 20 20 ffffffd9 ffffff92 20 ffffffd9 ffffff8d 20 ffffffd9 ffffff90 20 ffffffd9 ffffff8f ffffffd9 ffffff8c 20 ffffffd9 ffffff8b 20 ffffffd9 ffffff8e 20 ffffffd9
ffffff91 20 ffffffd8 ffffffa8 ffffffd9 ffffff91 ffffffd9 ffffff90 ffffffd8 ffffffb1 ffffffd9 ffffff8e 20 ffffffd9 ffffff87 ffffffd9 ffffff90 ffffffd9 ffffff84 ffffffd9 fff
fff91 ffffffd9 ffffff8e ffffffd9 ffffff84 ffffffd9 ffffff90 20 ffffffd8 ffffffaf ffffffd9 ffffff8f ffffffd9 ffffff85 ffffffd9 ffffff92 ffffffd8 ffffffad ffffffd9 ffffff8e
ffffffd9 ffffff84 ffffffd9 ffffff92 ffffffd8 ffffffa7 20 ffffffd9 ffffff85 ffffffd9 ffffff90 ffffffd9 ffffff8a ffffffd8 ffffffad ffffffd9 ffffff90 ffffffd8 ffffffb1 ffffff
d9 ffffff91 ffffffd9 ffffff8e ffffffd9 ffffff84 ffffffd8 ffffffa7 20 ffffffd9 ffffff86 ffffffd9 ffffff90 ffffffd9 ffffff85 ffffffd9 ffffff8e ffffffd8 ffffffad ffffffd9 fff
fff92 ffffffd8 ffffffb1 ffffffd9 ffffff91 ffffffd9 ffffff8e ffffffd9 ffffff84 ffffffd8 ffffffa7 20 ffffffd9 ffffff87 ffffffd9 ffffff90 ffffffd9 ffffff84 ffffffd9 ffffff91
ffffffd9 ffffff8e ffffffd9 ffffff84 ffffffd8 ffffffa7 20 ffffffd9 ffffff85 ffffffd9 ffffff90 ffffffd8 ffffffb3 ffffffd9 ffffff92 ffffffd8 ffffffa8 ffffffd9 ffffff90
Can't encode transcription: / بَـَتـكَ ةحتف تاكرحلا  ْ ٍ ِ ٌُ ً َ  ْ ٍ ِ ٌُ ً َ ّ بِّرَ هِلَّلِ دُمْحَلْا مِيحِرَّلا نِمَحْرَّلا هِلَّلا مِسْبِ
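
For anyone reading the "Failure bytes" dump above: each byte is printed as a sign-extended 32-bit value, so `ffffffd9` stands for the byte `0xd9`. Below is a minimal Python sketch for turning such a dump back into text by keeping only the low byte of each token and decoding the result as UTF-8. The helper name `decode_failure_bytes` is hypothetical and is not part of Tesseract.

```python
def decode_failure_bytes(dump: str) -> str:
    """Decode a whitespace-separated hex dump into a string.

    Tokens such as 'ffffffd9' are treated as sign-extended bytes,
    so only the low 8 bits of each token are kept.
    """
    raw = bytes(int(token, 16) & 0xFF for token in dump.split())
    # Invalid sequences are replaced rather than raising, since a log
    # excerpt may be cut in the middle of a multi-byte character.
    return raw.decode("utf-8", errors="replace")


# The first few tokens of the dump in the log above.
sample = "ffffffd9 ffffff92 20 ffffffd9 ffffff8d 20 ffffffd9 ffffff90"
decoded = decode_failure_bytes(sample)

# Print code points instead of glyphs so the result does not depend on
# how the terminal renders Arabic diacritics.
print([f"U+{ord(ch):04X}" for ch in decoded])
```

The log is hard-wrapped, so any token that was split across two lines has to be rejoined before the dump is passed to the function.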


@ghost

ghost commented Jan 9, 2017

@Shreeshrii I have noticed that the Arabic text in your log is reversed.
Your log shows: مِيحِرَّلا نِمَحْرَّلا هِلَّلا مِسْبِ
It should be: بِسْمِ اللَّهِ الرَّحْمَٰنِ الرَّحِيمِ

An equivalent example in English:
Correct: Peace Be Upon You
Wrong: uoY nopU eB ecaeP

Arabic is read and written from right to left (RTL).
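
The reversal described above can be reproduced with a short Python sketch. It assumes that Unicode text is stored in logical (reading) order and that the reversed form keeps each diacritic attached to its base letter, which is what the log excerpt looks like. It only illustrates the effect and does not reconstruct whatever step produced the log. The helper name `reverse_keep_marks` is hypothetical.

```python
import unicodedata


def reverse_keep_marks(text: str) -> str:
    """Reverse text, keeping combining marks attached to their base letter."""
    clusters = []
    for ch in text:
        # A non-zero canonical combining class means the character is a
        # mark (for example an Arabic vowel sign) that belongs to the
        # character before it.
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return "".join(reversed(clusters))


# The English example from the comment above.
english_reversed = reverse_keep_marks("Peace Be Upon You")
print(english_reversed)

# First word of the quoted Arabic line, in logical order:
# beh + kasra, seen + sukun, meem + kasra.
logical = "\u0628\u0650\u0633\u0652\u0645\u0650"
visual = reverse_keep_marks(logical)
print([f"U+{ord(ch):04X}" for ch in visual])
```

Applying the function twice returns the original string, so it can also be used to undo this kind of reversal on a single line.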

@Shreeshrii
Collaborator Author

Shreeshrii commented Jan 10, 2017 via email

@bmwmy

bmwmy commented Jan 10, 2017

@Shreeshrii could you post some of the generated image files (tif) so we can check whether the Arabic text is rendered correctly?

@Shreeshrii
Collaborator Author

Shreeshrii commented Jan 10, 2017 via email

@ghost

ghost commented Jan 10, 2017

@Shreeshrii

  • All the Arabic language lines are reversed.

  • I have checked the samples from Arabic lang. feature request #552
    The "Original_Text.txt" was encoded in (UTF-8-BOM) and everything seems okay, except that the words are not in their correct order.

  • Also, just to be sure, change the system locale to Arabic (Control Panel / Clock, Language, and Region / Region / Administrative / Change system locale / Arabic "Saudi Arabia").

  • Attach the tif/box files that you are using.
    I am not seeing any zip files here.

@Shreeshrii
Collaborator Author

Shreeshrii commented Jan 10, 2017 via email

@Shreeshrii
Collaborator Author

Shreeshrii commented Jan 10, 2017

ara.TRAINING.zip

Uploaded zip file with training data for a group of fonts which have coverage for Arabic on Windows.

It is possible that the tesstrain.sh process is dropping diacritics as noise. I am trying to change config variables to see if I can get some improvement.

@Shreeshrii
Collaborator Author

Attached is a log file which shows verbose output for every iteration of training, taken from the middle of the current training session.

traininglog-mid.txt

@ghost

ghost commented Jan 10, 2017

@Shreeshrii
What font size are you using for the "Traditional Arabic"?

Initial Observation:

  • Letter extenders
    Never set the letter extender (Shift+j or Shift+ت) as a standalone letter. It is not a letter; it is used to stretch words. Based on your log and my experience, treating it as a letter degrades the recognition rate and causes a large number of errors.
    If the letter extender ( ـ ) is treated as a character, the recognition engine will recognize many stretches and extensions within words as a ( ـ ) letter extender.

When I used them in my training process, I merged the letter extender with the Arabic letter into one single box and put that Arabic letter as the character of the box. Basically, I was trying to train the engine to recognize that Arabic letter in its multiple positions; as you know, Arabic letters have multiple forms depending on their position in the word (beginning, middle, end, isolated).
Example:
( كـ ) is not ( ك + ـ ) in the box file, it should be ( ك )
Also, ( ـكـ ) and ( ـك ) are a single character ( ك ) in different positions; this is important in the box file.

Which also means that ( كَـ ) is not ( ك + ـَ ), it is ( كَ )
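
The letter extender discussed above is U+0640 ARABIC TATWEEL. Below is a minimal Python sketch of the normalization described in the examples, assuming the goal is simply to drop the extender from the transcription so that the extended forms reduce to the bare letter. The helper name `strip_tatweel` is hypothetical and is not something Tesseract provides.

```python
TATWEEL = "\u0640"  # ARABIC TATWEEL, the letter extender


def strip_tatweel(text: str) -> str:
    """Remove every letter extender, leaving letters and diacritics intact."""
    return text.replace(TATWEEL, "")


KAF = "\u0643"    # ARABIC LETTER KAF
FATHA = "\u064e"  # ARABIC FATHA

# kaf + extender -> kaf
initial_form = strip_tatweel(KAF + TATWEEL)
# extender + kaf + extender -> kaf
medial_form = strip_tatweel(TATWEEL + KAF + TATWEEL)
# kaf + fatha + extender -> kaf + fatha
vowelled_form = strip_tatweel(KAF + FATHA + TATWEEL)

print([f"U+{ord(ch):04X}" for ch in vowelled_form])
```

If an extender carries its own diacritic, removing it leaves that diacritic attached to the preceding letter, so lines like that are better reviewed by hand than cleaned automatically.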

@ghost

ghost commented Jan 10, 2017

@theraysmith @amitdo @Shreeshrii

  • Box file disorder
    I also observe that the Traditional Arabic box file is in LTR (left to right) order, which is reversed; Arabic runs RTL (right to left). That means the first box should start from the right side.
    (Have a look at the attached Arabic example tif/box from Tesseract 3.05.)
    Arabic example 1.zip
    Example 1, correct box order:
    right order

Tesseract 4.0 LSTM puts the spaces between the words into boxes, as you know.
The box file disorder therefore causes a problem: because the boxes are mistakenly ordered LTR (left to right) for Arabic, the sequence jumps from the end of the first line to the end of the last letter of the line after it.
See the image attached
box disorder
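
One rough way to check the box order described above is to look at how the left edge moves from one box to the next. The Python sketch below assumes the usual box file layout of one box per line, `<symbol> <left> <bottom> <right> <top> <page>`. The function name `box_step_directions` and the sample coordinates are made up for illustration. It is only a diagnostic and says nothing about which order the trainer actually expects.

```python
def box_step_directions(box_text: str):
    """Return (leftward_steps, rightward_steps) between consecutive boxes."""
    lefts = []
    for line in box_text.split("\n"):
        # Split from the right so that a symbol which is itself a space
        # still ends up in the first field.
        parts = line.rsplit(" ", 5)
        if len(parts) != 6:
            continue  # skip blank or malformed lines
        lefts.append(int(parts[1]))
    pairs = list(zip(lefts, lefts[1:]))
    leftward = sum(1 for prev, cur in pairs if cur < prev)
    rightward = sum(1 for prev, cur in pairs if cur > prev)
    return leftward, rightward


# Made-up coordinates: three boxes listed from the right edge inwards.
rtl_sample = (
    "\u0628 300 10 340 50 0\n"
    "\u0633 250 10 290 50 0\n"
    "\u0645 200 10 240 50 0\n"
)
# The same boxes listed starting from the left edge.
ltr_sample = (
    "\u0645 200 10 240 50 0\n"
    "\u0633 250 10 290 50 0\n"
    "\u0628 300 10 340 50 0\n"
)

print(box_step_directions(rtl_sample))
print(box_step_directions(ltr_sample))
```

A file with several text lines will always show a few steps in the opposite direction where one line ends and the next begins, so only the overall balance between the two counts is meaningful.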

@ghost

ghost commented Jan 10, 2017

  • Wrong encoding & Arabic language support by the text editor
    The Arabic language txt should be encoded in UTF-8 or any other encoding that supports it.
    Most text editors, including Notepad++, don't get it right the first time; you must change the system locale to Arabic so that Windows Notepad handles the text sensibly.

(Control Panel / Clock, Language, and Region / Region / Administrative / Change system locale / Arabic "Saudi Arabia")

Also, when using a txt file, the words are not in their correct order. In Google Chrome the words are correct, but once they are copied and pasted into a text file, the order changes. What a weird issue.
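
The file encoding can be checked independently of any editor or system locale. Below is a minimal Python sketch, assuming the text is available as raw bytes, that reports whether the data starts with a UTF-8 byte order mark and decodes it strictly as UTF-8. The helper names are hypothetical.

```python
import codecs


def has_utf8_bom(data: bytes) -> bool:
    """True if the data starts with the UTF-8 byte order mark (EF BB BF)."""
    return data.startswith(codecs.BOM_UTF8)


def decode_training_text(data: bytes) -> str:
    """Decode as UTF-8, dropping a leading BOM if there is one.

    Raises UnicodeDecodeError if the bytes are not valid UTF-8.
    """
    return data.decode("utf-8-sig")


# Sample bytes standing in for the contents of a text file saved as
# "UTF-8-BOM": the BOM followed by three Arabic letters.
sample = codecs.BOM_UTF8 + "\u0628\u0633\u0645".encode("utf-8")

print(has_utf8_bom(sample))
text = decode_training_text(sample)
print([f"U+{ord(ch):04X}" for ch in text])
```

For a real file, the bytes can be read with `open(path, "rb").read()` before calling these helpers.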

@ghost

ghost commented Jan 10, 2017

@theraysmith @amitdo @Shreeshrii

  • The Reversed Text Issue!
    This is the first, last, and most important problem, and it persists in all Tesseract versions, including but not limited to Tesseract 4.0.

@Shreeshrii
Collaborator Author

@Christophered

  1. I had experimented with 32 ptsize for Traditional Arabic in one run. I am using the default, which is 12 pt, I think.

  2. Never set the letter extender (Shift+j or Shift+ت) as a standalone letter.

It is possible that I copied some text from Wikipedia which is incorrect. Please look at the training_text file and let me know which lines should be deleted.

  3. I merged the letter extender with the Arabic letter into one single box and put that Arabic letter as the character of the box. Basically, I was trying to train the engine to recognize that Arabic letter in its multiple positions; as you know, Arabic letters have multiple forms depending on their position in the word (beginning, middle, end, isolated).

Please share your training text and I can give it a try.

@Shreeshrii
Collaborator Author

Original problem, core dumped -
This seems to be happening when an --eval_listfile is given.
Related issues:
#644 (eval not run)
#561 (core dumped)

Arabic related issues:
See new issue filed by @Christophered
#648 (arabic reversal)

Closing this issue.

@amitdo
Collaborator

amitdo commented Jan 11, 2017

Wrong encoding & Arabic language support by the text editor
The Arabic language txt should be encoded in UTF-8 or any other encoding that supports it.

The langdata text files for all languages are saved using UTF-8 encoding.

@imohammadhossein

I am trying to train or fine-tune Tesseract on my own dataset for the Farsi language. Can anyone please help me with this?
