Arabic lang. feature request #552

Closed
bmwmy opened this issue Dec 9, 2016 · 12 comments

Comments

@bmwmy

bmwmy commented Dec 9, 2016

Hi

I tried out your OCR engine, and it outputs text with fairly good accuracy, but it lacks support for Arabic diacritics ( ّ َ ً ُ ِ ٍ ْ ). These are a feature of the Arabic language and appear often in some kinds of textbooks and in other texts as well. I tried to train the engine myself but did not succeed. I used engine v3.05beta with the jTessBoxEditor GUI.

An example of such text: Link

Thanks

@Shreeshrii
Collaborator

Please provide a small sample text that could be tested for training.

@Shreeshrii
Collaborator

@theraysmith

Ray, please include this training sample during the next retrain for Arabic.

It would also be helpful if you publish accuracy stats for the various languages from the next training.

Thanks and best wishes for a Happy New Year!

@Shreeshrii
Collaborator

@bmwmy

Thanks for the training text and font. I am going to try 'finetune' with the 4.0alpha Arabic traineddata.

I have also requested Ray to include this in the retraining that he has planned to do in January.

Meanwhile, please check that these diacritics are included in the Arabic unicharset at https://github.com/tesseract-ocr/langdata/blob/master/Arabic.unicharset
and, if not, please open an issue in the langdata repo as well.
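
The unicharset check suggested above can be sketched in Python. This is only an illustration: it assumes the file text has already been read into a string, that the first line is an entry count, and that each later line starts with the character (or character sequence) followed by whitespace-separated properties. The names `missing_diacritics` and `TASHKEEL` are made up for this example; the code points are the marks listed in this thread.

```python
# Tashkeel marks mentioned in this thread, written as escapes so they
# stay visible in an editor. The table itself is an assumption of this sketch.
TASHKEEL = {
    "\u064B": "FATHATAN",
    "\u064C": "DAMMATAN",
    "\u064D": "KASRATAN",
    "\u064E": "FATHA",
    "\u064F": "DAMMA",
    "\u0650": "KASRA",
    "\u0651": "SHADDA",
    "\u0652": "SUKUN",
}


def missing_diacritics(unicharset_text, marks=TASHKEEL):
    """Return labels for the marks that never occur in any unicharset entry."""
    seen = set()
    lines = unicharset_text.splitlines()
    for line in lines[1:]:  # assumed: the first line holds the entry count
        fields = line.split()
        if fields:
            # The first field is taken to be the character sequence itself;
            # collect every code point in it.
            seen.update(fields[0])
    return [
        f"U+{ord(mark):04X} {marks[mark]}"
        for mark in sorted(marks)
        if mark not in seen
    ]
```

To run it against a downloaded copy, read the file with `open(path, encoding="utf-8").read()` and pass the text in. An empty list means every listed mark appears somewhere in the file.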

@Shreeshrii
Collaborator

@bmwmy Is this training text for 'Classical Arabic'? https://en.wikipedia.org/wiki/Classical_Arabic

Is it different enough to be kept as a separate language, similar to Ancient Greek etc.?

@bmwmy
Author

bmwmy commented Jan 2, 2017

Actually, this font is used in many printed Arabic books, while some use the standard "Arial". What makes this font popular is that it draws on the classical Arabic style but renders it in a modern way. As I remember, the font was created by Microsoft and bundled with MS Word 6.0 to mimic a similar Macintosh Arabic font.

Meanwhile, Tashkeel, the Arabic vowel diacritics ( ّ َ ً ٌُ ِ ٍ ْ ), is a feature of Arabic that was added, along with the dot diacritics, to ease reading and eliminate the ambiguity of Classical Arabic. Tashkeel is available in all fonts (it is not tied to this font), but it is auxiliary and uncommon except in books, especially those printed between 1975 and 2005. It is rarely used in newspapers and is still used in some kinds of books.

It would be good to have an option to ignore these Arabic vowel diacritics ( ّ َ ً ٌُ ِ ٍ ْ ) in the output, or to ignore them completely, but the recognition process should still be aware of them, because they affect accuracy and get mixed up with letters. Also, I have to open an issue in the langdata repo to add these characters.

I don't think this training set needs to be treated as a separate language.

Thanks for help and Happy New Year
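
One way to get the "ignore in the output" behaviour described above without changing the engine is to drop the marks from the recognized text afterwards. A minimal sketch, assuming the OCR output is available as a Python string and that only the eight marks U+064B to U+0652 (fathatan through sukun) should go; `strip_tashkeel` is a hypothetical helper, not part of Tesseract.

```python
import re

# U+064B..U+0652 covers fathatan, dammatan, kasratan, fatha, damma,
# kasra, shadda and sukun. Other Arabic marks are left untouched.
TASHKEEL_RE = re.compile("[\u064B-\u0652]")


def strip_tashkeel(text):
    """Remove Arabic vowel diacritics while keeping the base letters."""
    return TASHKEEL_RE.sub("", text)


# A fully vocalized three-letter word is reduced to its three base letters.
vocalized = "\u0643\u064E\u062A\u064E\u0628\u064E"
print(len(vocalized), len(strip_tashkeel(vocalized)))
```

Recognition still sees the diacritics in this arrangement; they are removed only from the text that is handed on.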

@Shreeshrii
Collaborator

Shreeshrii commented Jan 2, 2017 via email

@bmwmy
Author

bmwmy commented Jan 2, 2017

I will prepare additional training text with Arabic vowel diacritics, but I am wondering which font size to use in the training images. Book publishers usually print on A4 paper or slightly smaller, with this font at 16pt or 14pt. Let me know whether font size makes a difference for LSTM; if it does, kindly advise me on the preferred font size. Also, I think 300 DPI will be OK.

Thanks
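
For reference, the nominal pixel size of 14pt and 16pt type at 300 DPI follows from the definition of a point as 1/72 inch. The arithmetic below is a sketch; `points_to_pixels` is a name chosen for this example, and the result is the font's em size, not the height of any particular glyph. It says nothing about whether the LSTM engine is sensitive to size.

```python
def points_to_pixels(points, dpi=300):
    """Convert a font size in points to pixels at the given resolution."""
    return points * dpi / 72  # 1 pt = 1/72 inch


for size in (14, 16):
    print(f"{size}pt at 300 DPI is about {points_to_pixels(size):.1f} px")
```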

@Shreeshrii
Collaborator

Shreeshrii commented Jan 2, 2017 via email

@bmwmy
Author

bmwmy commented Aug 26, 2017

OK, the recent Arabic.traineddata is doing fine.

bmwmy closed this as completed Aug 26, 2017

@zafarale

zafarale commented Jun 6, 2018

Hi chaps, could someone point me to a resource that would help me use
https://drive.google.com/file/d/0B1JdJ8IXNweRX3NEMkZfX3gtdlk/view?usp=sharing
I need to use Tesseract to scan classical theological writing with diacritics.

I started with Tesseract thinking it was going to be simple, and I am struggling.

@Shreeshrii
Collaborator

Shreeshrii commented Jun 6, 2018 via email
