Arabic lang. feature request #552
Comments
Please provide a small sample text that could be tested for training. |
Ray, please include this training sample during the next retrain for Arabic. It would also be helpful if you publish accuracy stats for the various languages from the next training. Thanks and best wishes for a Happy New Year! |
Thanks for the training text and font. I am going to try 'finetune' with the 4.0alpha arabic traineddata. I have also requested Ray to include this in the retraining that he has planned to do in January. Meanwhile, please check that these diacritics are included in the Arabic unicharset at https://github.com/tesseract-ocr/langdata/blob/master/Arabic.unicharset |
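A rough way to run that check locally is sketched below. It assumes the unicharset file has been downloaded and read as UTF-8 text, that its first line is the entry count, and that every following line starts with the character (or character sequence) followed by space-separated properties. The function name `missing_from_unicharset` is hypothetical and not part of Tesseract.

```python
from typing import List


def missing_from_unicharset(unicharset_text: str, wanted: str) -> List[str]:
    """Return the characters of `wanted` that appear in no unicharset entry.

    Assumed layout: line 1 is the entry count; each later line begins with
    the unichar, followed by a space and its properties.
    """
    lines = unicharset_text.splitlines()
    # First space-separated field of each entry line is the unichar itself.
    entries = [line.split(" ", 1)[0] for line in lines[1:] if line.strip()]
    # An entry may hold several code points, so collect every code point seen.
    covered = set("".join(entries))
    return [ch for ch in wanted if ch not in covered]


# Tashkeel marks from fathatan (U+064B) to sukun (U+0652).
TASHKEEL = "".join(chr(cp) for cp in range(0x064B, 0x0653))
```

For example, after reading the file with `open(path, encoding="utf-8").read()`, calling `missing_from_unicharset(text, TASHKEEL)` lists any diacritics the character set does not cover. An empty list would suggest the marks are at least present in the set.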
@bmwmy Is this training text for 'Classical Arabic' (https://en.wikipedia.org/wiki/Classical_Arabic)? Is it different enough to be kept as a separate language, similar to Ancient Greek? |
Actually, this font is used in many printed Arabic books, while some use the standard "Arial". What makes this font popular is that it draws on the classical Arabic style but in a modern way. As I remember, it was created by Microsoft and bundled with MS Word 6.0 to mimic a similar Macintosh Arabic font.
Meanwhile, Tashkeel, the Arabic vowel diacritics ( ّ َ ً ٌُ ِ ٍ ْ ), is a feature of Arabic that was added, along with the dot diacritics, to ease reading and remove the ambiguity of Classical Arabic. Tashkeel is available in all fonts (it is not tied to this font), but it is auxiliary and uncommon outside books, especially those printed between 1975 and 2005. It is rarely used in newspapers and is still used in some kinds of books.
It would be good to have an option to omit these Arabic vowel diacritics ( ّ َ ً ٌُ ِ ٍ ْ ) from the output, or to ignore them completely, but the recognition process should still be aware of them, because they affect accuracy and get mixed up with letters. Also, I have to open an issue to add these characters to the langdata repo.
I don't think this training set needs to be treated as a separate language.
Thanks for the help, and Happy New Year. |
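Until such an option exists, the "omit from the output" half of the request can be approximated as a post-processing step on the recognized text. The sketch below is not a Tesseract feature. It assumes the marks of interest are the eight Tashkeel code points U+064B through U+0652 (fathatan to sukun), and `strip_tashkeel` is a hypothetical helper name.

```python
import re

# Tashkeel marks: fathatan (U+064B) through sukun (U+0652).
# This range is an assumption; extend it if other marks should be dropped.
TASHKEEL_RE = re.compile("[\u064B-\u0652]")


def strip_tashkeel(text: str) -> str:
    """Remove Arabic vowel diacritics, leaving base letters untouched."""
    return TASHKEEL_RE.sub("", text)
```

Calling `strip_tashkeel` on the OCR output leaves the letters and dots in place and drops only the vowel marks, so the engine can still be trained with the diacritics while the final text omits them.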
@theraysmith has mentioned on the wiki that
`For Latin-based languages, the existing model data provided has been
trained on about 400000 textlines spanning about 4500 fonts. For other
scripts, not so many fonts are available, but they have still been trained
on a similar number of textlines. `
So it would be helpful if you could make additional training text with
Arabic vowel diacritics available.
I have tried fine-tuning and adding a layer using the text you provided, but
the results did not improve, most probably because these characters are not
part of the character set of the traineddata.
ShreeDevi
|
I will prepare additional training text with Arabic vowel diacritics, but I am wondering which font size to use in the training images. Book publishers usually print on A4 paper or slightly smaller, with this font at 14pt or 16pt. Please let me know whether font size makes a difference for LSTM and, if it does, which size is preferable. Also, I think 300 DPI will be OK. Thanks |
For now, do not make images; the tesstrain.sh script does that by calling
the text2image program. The training text alone will be enough.
You can wait for additional feedback from Ray.
|
ok the recent Arabic.traineddata is doing fine |
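Hi Chaps, is it possible for someone to point me to a resource to help me use https://drive.google.com/file/d/0B1JdJ8IXNweRX3NEMkZfX3gtdlk/view?usp=sharing I need to use tesseract to scan classical theological writing with diacritics. I started with tesseract thinking it was going to be simple, and I am struggling. |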
try the traineddata file from tessdata_fast/script/Arabic
ShreeDevi
On Wed, Jun 6, 2018 at 10:54 AM, zafarale ***@***.***> wrote:
Hi Chaps, is it possible for someone to point me to a resource to help me
use
https://drive.google.com/file/d/0B1JdJ8IXNweRX3NEMkZfX3gtdlk/view?usp=sharing
I need to use tesseract to scan classical theological writing with
diacritics.
I started with tesseract thinking it was going to be simple, and I am struggling.
|
Hi
I tried out your OCR engine and it outputs text with pretty good accuracy, but it lacks support for the Arabic diacritics ( ّ َ ً ُ ِ ٍ ْ ), which are a feature of the Arabic language and are often used in some kinds of textbooks and other texts. I tried to train it but did not succeed. I used engine v3.05beta with the jTessBoxEditor GUI.
An example of such text: Link
Thanks