Arabic lang. feature request #552

bmwmy · 2016-12-09T14:35:01Z

Hi

I tried out your OCR engine and it outputs text with fairly good accuracy, but it lacks support for Arabic diacritics ( ّ َ ً ُ ِ ٍ ْ ), which are a feature of the Arabic language and are often used in some kinds of textbooks and in other texts as well. I tried to train it myself but did not succeed. I used engine v3.05beta with the jTessBoxEditor GUI.

An example of such text: Link

Thanks

Shreeshrii · 2016-12-28T17:44:34Z

Please provide a small sample text that could be tested for training.

bmwmy · 2016-12-30T19:05:13Z

https://drive.google.com/file/d/0B1JdJ8IXNweRX3NEMkZfX3gtdlk/view?usp=sharing

Use the download button at the corner.

Shreeshrii · 2016-12-31T04:42:35Z

@theraysmith

Ray, please include this training sample during the next retrain for Arabic.

It would also be helpful if you publish accuracy stats for the various languages from the next training.

Thanks and best wishes for a Happy New Year!

Shreeshrii · 2016-12-31T06:38:41Z

@bmwmy

Thanks for the training text and font. I am going to try 'finetune' with the 4.0alpha arabic traineddata.

I have also asked Ray to include this in the retraining that he has planned for January.

Meanwhile, please check that these diacritics are included in the Arabic unicharset at https://github.com/tesseract-ocr/langdata/blob/master/Arabic.unicharset
and, if not, please open an issue in the langdata repo as well.
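
A rough way to do that check, sketched in Python. This assumes the usual unicharset layout (first line is the entry count, every later line starts with the unichar followed by a space and its properties) and that the diacritics in question are the eight tashkeel marks U+064B to U+0652. The function name `missing_diacritics` is made up for illustration.

```python
# Arabic tashkeel (vowel diacritics): fathatan .. sukun, U+064B..U+0652.
TASHKEEL = [chr(cp) for cp in range(0x064B, 0x0653)]


def missing_diacritics(unicharset_text):
    """Return the tashkeel code points that appear in no unicharset entry."""
    lines = unicharset_text.splitlines()
    # Assumption: line 1 is the entry count; each later line begins with
    # the unichar itself, then a space, then its properties.
    entries = [line.split(" ", 1)[0] for line in lines[1:] if line.strip()]
    # A diacritic counts as covered if it occurs alone or attached to a letter.
    seen = set("".join(entries))
    return [f"U+{ord(ch):04X}" for ch in TASHKEEL if ch not in seen]


if __name__ == "__main__":
    # Tiny made-up sample, not the real Arabic.unicharset.
    sample = "4\nNULL 0 NULL 0\n\u0628 1 0,255\n\u0628\u064e 1 0,255\n\u0651 0 0,255\n"
    print(missing_diacritics(sample))
```

For the real file, read it with `open(path, encoding="utf-8").read()` and pass the text in. An empty list means every mark occurs somewhere in the character set.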

Shreeshrii · 2016-12-31T09:25:51Z

@bmwmy Is this training text for 'Classical Arabic'? https://en.wikipedia.org/wiki/Classical_Arabic

Is it different enough to be kept as a separate language, similar to Ancient Greek?

bmwmy · 2017-01-02T07:59:06Z

Actually, this font is used in many Arabic printed books, while some use the standard "Arial". What makes this font popular is that it is inspired by the classical Arabic style but rendered in a modern way. As I remember, this font was created by Microsoft and bundled with MS Word 6.0 to mimic a similar Macintosh Arabic font.

Meanwhile, Tashkeel, i.e. the Arabic vowel diacritics ( ّ َ ً ٌُ ِ ٍ ْ ), is an Arabic feature that was added, along with the dot diacritics, to ease reading and eliminate the ambiguity of Classical Arabic. Tashkeel is available in all fonts (it is not tied to this font only), but it is auxiliary and uncommon except in books, especially those printed between 1975 and 2005. It is rarely used in newspapers and is still used in some kinds of books.

It would be good to have an option to drop these Arabic vowel diacritics ( ّ َ ً ٌُ ِ ٍ ْ ) from the output, or to ignore them completely, but the recognition process should still be aware of them, because they affect accuracy and get mixed up with letters. Also, I still have to open an issue to add these characters in the langdata repo.
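
Until such an option exists, dropping the marks can be done as a post-processing step on the OCR text. A minimal Python sketch, assuming the marks to drop are exactly the tashkeel range U+064B to U+0652; `strip_tashkeel` is a hypothetical helper, not part of Tesseract.

```python
import re

# Tashkeel range: fathatan (U+064B) through sukun (U+0652).
TASHKEEL_RE = re.compile("[\u064B-\u0652]")


def strip_tashkeel(text):
    """Remove Arabic vowel diacritics, leaving base letters untouched."""
    return TASHKEEL_RE.sub("", text)


if __name__ == "__main__":
    vocalized = "\u0643\u064E\u062A\u064E\u0628\u064E"  # kaf, ta, ba, each with fatha
    print(strip_tashkeel(vocalized))
```

This only cleans the output; recognition itself still has to handle the marks, which is the point of the request above.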

I think there is no need to treat this training set as a separate language.

Thanks for the help, and Happy New Year

Shreeshrii · 2017-01-02T08:11:25Z

@theraysmith has mentioned on the wiki that `For Latin-based languages, the existing model data provided has been trained on about 400000 textlines spanning about 4500 fonts. For other scripts, not so many fonts are available, but they have still been trained on a similar number of textlines.`

So, it would be helpful if you could make additional training text with Arabic vowel diacritics available. I have tried fine-tuning and adding a layer using the text you provided, but it does not improve the results, most probably because these characters are not part of the character set of the traineddata.

ShreeDevi

bmwmy · 2017-01-02T11:39:40Z

I will prepare additional training text with Arabic vowel diacritics, but I am wondering which font size to use in the training images. Book publishers usually use A4 paper or a bit smaller, with this font at 16pt or 14pt for printed books. Let me know if the font size makes no difference for LSTM; if it does, kindly advise me on the preferred font size. Also, I think 300 DPI will be OK.

Thanks

Shreeshrii · 2017-01-02T12:24:07Z

For now, do not make images, as the text2image program does that through the tesstrain.sh script. Just the training text will be enough. You can wait for additional feedback from Ray.
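
To preview how the text renders at a given size, text2image can also be run by hand. The Python sketch below only builds the command line. The paths and the font name are placeholders (the thread does not name the font), and 14pt at 300 DPI are the values from the question, not a recommendation.

```python
def text2image_command(text_path, output_base, font, fonts_dir,
                       ptsize=14, resolution=300):
    """Build a text2image invocation that renders a training text to an image."""
    return [
        "text2image",
        f"--text={text_path}",
        f"--outputbase={output_base}",
        f"--font={font}",            # placeholder: use the exact installed font name
        f"--fonts_dir={fonts_dir}",
        f"--ptsize={ptsize}",        # point size of the rendered text
        f"--resolution={resolution}",  # DPI of the output image
    ]


if __name__ == "__main__":
    cmd = text2image_command("ara.training_text", "ara.sample",
                             "Some Arabic Font", "/usr/share/fonts")
    print(" ".join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` on a machine where the Tesseract training tools are installed.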

bmwmy · 2017-08-26T12:20:08Z

OK, the recent Arabic.traineddata is doing fine.

zafarale · 2018-06-06T05:24:26Z

Hi chaps, could someone point me to a resource that would help me use
https://drive.google.com/file/d/0B1JdJ8IXNweRX3NEMkZfX3gtdlk/view?usp=sharing
I need to use Tesseract to scan classical theological writing with diacritics.

I started with Tesseract thinking it was going to be simple, and I am struggling.

Shreeshrii · 2018-06-06T17:43:56Z

Try the traineddata file from tessdata_fast/script/Arabic.

ShreeDevi
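
A minimal sketch of running that model from Python. It assumes Tesseract 4 is installed and that the downloaded file sits at `script/Arabic.traineddata` under the directory passed as `tessdata_dir`; the image and directory names are placeholders.

```python
def tesseract_command(image_path, output_base, tessdata_dir,
                      lang="script/Arabic"):
    """Build a tesseract call that uses the script-level Arabic model."""
    return [
        "tesseract",
        image_path,
        output_base,                     # tesseract appends .txt to this name
        "--tessdata-dir", tessdata_dir,  # directory that contains script/
        "-l", lang,
    ]


if __name__ == "__main__":
    cmd = tesseract_command("page.png", "page", "./tessdata_fast")
    print(" ".join(cmd))
```

Pass the list to `subprocess.run(cmd, check=True)` to run it; the recognized text should then be in `page.txt`.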

bmwmy mentioned this issue Jan 2, 2017

add vowel diacritics characters in Arabic charset tesseract-ocr/langdata#35

Closed

ghost mentioned this issue Jan 10, 2017

LSTM: Training - Arabic - Add Top layer - Aborted (core dumped) #642

Closed

Shreeshrii mentioned this issue Jan 12, 2017

Box File disorder, Arabic Language #648

Open

Shreeshrii mentioned this issue Feb 5, 2017

test of arabic_lines.c and misctest1.c DanBloomberg/leptonica#236

Closed

bmwmy closed this as completed Aug 26, 2017
