Box File disorder, Arabic Language #648
Comments
While tesstrain.sh takes RTL languages into account when creating the DAWG files, the text2image process does not seem to have specific RTL processing. |
Will there be modifications for text2image? |
What you show here is 'by design'. This should not cause any problem in training process and characters recognition for RTL languages. |
There *is* RTL-specific processing in text2image.
Pango renders RTL text RTL, and the post-processing is intended to re-order
everything strictly LTR.
*If it isn't doing that then there is a bug.*
Here is why:
Old Tesseract (3.05) is only learning individual character shapes, so the
order of training data is somewhat irrelevant.
New Tesseract (LSTM/4.00) is learning to identify the text characters in
the (LTR) order they appear in the image, so the truth transcription should
be LTR, and the words in the dawg should be reversed.
There is Bidi processing inside the post-recognition processing of
Tesseract that reprocesses/re-orders the text for output, so it appears in
the correct order.
I wonder if the bidi integration is working correctly for LSTM, as the
accuracy with Arabic is unsatisfactory.
In light of this design, please take another look, and let me know if there
is anything systematically wrong in the output that might provide some
hints as to where to look.
Thanks.
--
Ray.
|
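A minimal sketch of the ordering Ray describes above, under strong simplifying assumptions: the line is pure Arabic without diacritics, digits or embedded Latin (a real bidi implementation has to handle all of those), and the function names are hypothetical, not part of Tesseract. For such a line, the left-to-right image order is simply the logical string reversed, and a wordlist entry is reversed the same way.

```python
def logical_to_visual_ltr(line: str) -> str:
    """Reverse a pure-RTL line so its characters are in left-to-right image order.

    Assumption: no combining marks, digits or LTR runs; those need real bidi handling.
    """
    return line[::-1]


def reverse_wordlist(words):
    """Reverse every word, as a stand-in for the 'reversed dawg' idea."""
    return [word[::-1] for word in words]


# Two Arabic words in reading (logical) order, written with escapes for clarity:
# beh, seen, meem, space, alef, lam, lam, heh
logical = "\u0628\u0633\u0645 \u0627\u0644\u0644\u0647"
visual = logical_to_visual_ltr(logical)

# The leftmost glyph in the image is the last letter that a reader reads.
print([f"U+{ord(c):04X}" for c in visual])
```

Applying the same reversal to the recognized LTR string restores reading order for this restricted case, which is roughly the role the bidi post-processing plays on output.
|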
Ray,
But the OCRed text does not seem to include any. ara.Arial_Unicode_MS.exp0.txt @Christophered and @bmwmy can provide Arabic specific details. |
https://github.com/tesseract-ocr/tesseract/files/696122/ara.TRAINING.zip has the box tif pairs for the above training. https://github.com/tesseract-ocr/tesseract/files/696184/traininglog-mid.txt shows the debug messages from during training. |
Ray, according to your tests, how does Hebrew (another RTL language) perform? Do you have an accuracy report for various languages that you can share with us, other than the one in the DAS2016 slides? |
@theraysmith Hope you have seen comments by Chris on the other thread also - #642
|
Thanks for the work on this.
I've learned 2 things so far:
- U+640 (tatweel) is a very special case that I need to think how to
handle - it needs to be preserved in the training text, so it gets
rendered, but needs to disappear everywhere else, as if it doesn't exist.
- The diacritics are currently excluded from the unicharset, probably
because they are only rarely used, but need to be included. There may not
be enough text with them included in the text corpora.
Questions:
Is that all of U+64b->U+652 inclusive?
Applied uniformly to all Arabic languages?
(ara+div+fas+kur_ara+pus+snd+syr+uig+urd)
On Wed, Jan 11, 2017 at 7:30 AM, Shreeshrii wrote:
@theraysmith Hope you have seen comments by Chris on the other thread also - #642
I was merging the letter extender with the Arabic letter into one single
box and labelling the box with that Arabic letter. Basically, I was trying
to train the engine to recognize that Arabic letter in its multiple
positions; as you know, Arabic letters have multiple forms depending on
their position in the word (beginning, middle, ending, isolated).
Example:
( كـ ) is not ( ك + ـ ) in the box file, it should be ( ك )
also ( ـكـ ) or ( ـك ) they are a single character ( ك ) in different
positions, this is important in the box file.
Which also means that ( كَـ ) is not ( ك + ـَ ), it is ( كَ )
--
Ray.
|
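One way to read the tatweel requirement above: keep U+0640 in the text handed to the renderer, and strip it from every other copy of the text. The sketch below uses hypothetical helper names and is not taken from the Tesseract sources; the diacritic range is the U+064B to U+0652 span asked about above, and whether that span is complete is exactly the open question.

```python
import re

TATWEEL = "\u0640"
# U+064B..U+0652 inclusive: the diacritic range asked about in the comment above.
DIACRITICS = re.compile("[\u064B-\u0652]")


def strip_tatweel(text: str) -> str:
    """Return the text as it should look everywhere except the rendering input."""
    return text.replace(TATWEEL, "")


def count_diacritics(text: str) -> int:
    """Count diacritics, e.g. to check whether a corpus has enough of them."""
    return len(DIACRITICS.findall(text))


# kaf, tatweel, fatha, teh, beh: this is what the renderer would receive.
rendered = "\u0643\u0640\u064E\u062A\u0628"
# The same text with the tatweel removed, for the truth transcription.
truth = strip_tatweel(rendered)

print(len(rendered), len(truth), count_diacritics(truth))
```
|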
@amitdo Hebrew seems to be OK. It is certainly ahead of 3.05. |
Ray, Arabic Diacritics are included in the Arabic.unicharset. @bmwmy had offered to provide additional training text with diacritics - see #552 (comment) I was able to get the diacritics recognized during training by adding the following line to ara.config, however for some fonts it seems to be treating diacritics as a separate line. Do not know whether it is related to x-height for fonts.
Related question, for lstm training, I am using
while creating the box tiff pairs and lstmf files since @amitdo mentioned that font_properties are not needed for LSTM training, please confirm. |
Please also see #318 (comment) and other comments regarding unicharset. Are the glyph metrics updated based on the fonts used for training? Answer: No. Most of the unicharset fields are irrelevant to LSTM training and recognition. |
I examined @Shreeshrii's training set and I appreciate his effort, but the text in the generated tiff image files seems much smaller than it should be. It is hard to read even for humans. The vowel diacritics look like noise, as do some letter glyphs such as ( فـ / ـمـ ). I suggest using a 16-22pt font. Comparison between the noisy text example that was used in training and a good one: I think using a bigger text image as input will give a very large improvement. I am satisfied with this result considering the noisy input, but the RTL issue should be solved. @Christophered ara.Traditional_Arabic.exp0.zip was a good input image file |
About |
I believe the answer is 'No'. @theraysmith, can you confirm that? |
@amitdo I have been training using However, I am wondering whether some type of fontinfo / xheights is still required for LSTM training. eg. in order to avoid the diacritics being discarded as noise, I had to add So, I am trying the training again with only one font, Traditional Arabic font at 32 point, as suggested by @bmwmy. Ray might have a different solution - will wait till the changes in wiki for training are updated. |
If the layout analysis step does not 'cut' the lines properly, the next step - the lines' text recognition, will suffer. |
Tesseract release notes July 11 2015 - V3.04.00
From DAS2016 slide 5 - 'Page Layout Analysis':
|
Ray, Can the bidi post-processing be applied as an experiment to these debug messages? Then we can very easily see whether it is working.
|
Shree, you might want to use this text2image option with Arabic: As a minimum it should be equal to ptsize. For Arabic, you can try increasing it (20-50 percent bigger than ptsize). |
IMO, 32 ptsize is too big. Try 14/16. |
Arabic also has presentation forms, i.e. spacing forms of Arabic diacritics, and contextual letter forms. Please see Though it is not recommended to use these for content under newer versions of Unicode, I am wondering whether it would make OCR easier if they were used... A post-processing step could convert them to regular Unicode code points later. |
No. Confirmed by Ray here: tesseract-ocr/langdata#31 (comment)
|
tesstrain.sh process uses default --ptsize which is 12. language_specific.sh sets --leading to 32 by default and to 48 for Thai fonts. |
@Shreeshrii here is a sample text for you, please test and post the findings @theraysmith Most if not all languages related to Arabic (example: Farsi, Urdu, etc..) use such diacritics. |
@Shreeshrii the sample box/tif that you provided had some errors that I noticed:
I tracked down the problem; the cause was the txt that you were using, which contained the first U+640 (tatweel) mistake that I mentioned earlier |
The box tiff pairs are as generated by text2image program.
Any changes to that will have to be done by Ray.
I will test with the files you have provided.
- excuse the brevity, sent from mobile
On 28-Jan-2017 9:30 PM, "chris" wrote:
@Shreeshrii the sample box/tif that you provided had some errors that
I noticed:
1. The U+640 (tatweel) issue
example: بِسمِ
wrong: ب ـِ س ـِ م
correct: ب س م only 3 letters بسم
@theraysmith got it right: Tesseract should be able to render it in the
.tif, but not treat it as a separate character in the .box file. The
correct approach is to combine the box of U+640 (tatweel) with the box of
the adjacent letter and label the combined box with that single letter,
never mentioning U+640 (tatweel) in the .box file at all.
U+640 (tatweel) is a special case; people don't usually use it when
writing text, so either don't use it, or merge the boxes and remove it.
2. As @bmwmy mentioned earlier, in your .tif the characters of the text
are separated, which is wrong.
wrong: بـ ـسـ ـم
correct: بسم
|
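The box merging suggested above can be sketched as follows. This is an illustration only, with several assumptions: box lines use the Tesseract 3.x layout `glyph left bottom right top page`, a tatweel box is merged into the box that precedes it in the file (the comment above only says "another letter"), and `Box`, `parse_box` and `merge_tatweel` are hypothetical names, not Tesseract APIs.

```python
from dataclasses import dataclass, replace

TATWEEL = "\u0640"


@dataclass
class Box:
    text: str
    left: int
    bottom: int
    right: int
    top: int
    page: int


def parse_box(line: str) -> Box:
    """Parse one 'glyph left bottom right top page' line (assumed 3.x layout)."""
    text, *nums = line.rstrip("\n").rsplit(" ", 5)
    return Box(text, *map(int, nums))


def merge_tatweel(boxes):
    """Fold every tatweel-only box into the preceding box on the same page.

    The merged box keeps the letter's label and covers both rectangles.
    A tatweel with no preceding box is left untouched.
    """
    merged = []
    for box in boxes:
        if box.text == TATWEEL and merged and merged[-1].page == box.page:
            prev = merged[-1]
            prev.left = min(prev.left, box.left)
            prev.bottom = min(prev.bottom, box.bottom)
            prev.right = max(prev.right, box.right)
            prev.top = max(prev.top, box.top)
        else:
            merged.append(replace(box))  # copy, so the input list is not mutated
    return merged


lines = [
    "\u0643 30 10 50 40 0",  # kaf
    "\u0640 20 12 30 18 0",  # tatweel, drawn to the left of the kaf
    "\u0628 5 8 20 30 0",    # beh
]
result = merge_tatweel([parse_box(line) for line in lines])
for box in result:
    print(f"U+{ord(box.text):04X}", box.left, box.bottom, box.right, box.top, box.page)
```

Boxes whose label combines tatweel with a diacritic, as in the "wrong" example above, are deliberately not handled here.
|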
@theraysmith I suggest adding the ability to convert Tesseract 3.0x box files to Tesseract 4.0x, since many of us have tif/box files based on the older 3.0x version. @theraysmith Also, Ubuntu has released the Snaps project, which makes it possible to distribute software as a universal Linux package. Would it be possible to release a Snap of Tesseract 4.0x? This would save us a lot of time and effort by skipping the build process and its issues. |
Question: the difference between NFD and NFC is that, after the text is reordered to LTR for training, NFD pushes the marks/diacritics to the right side, while in NFC the marks/diacritics remain on the characters themselves. |
text2image should have an option to randomly add tatweel every n lines. https://en.wikipedia.org/wiki/Kashida
|
@amitdo No, No.......! That is very bad |
It seems it uses NFKC. |
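The normalization forms mentioned above can be compared directly with Python's standard `unicodedata` module. This is a small sketch and says nothing about what text2image does internally. Note that a letter followed by a harakat such as fatha has no precomposed code point, so NFC and NFD return the same sequence for it; the two forms differ for letters such as alef with madda, and NFKC additionally folds presentation forms back to the base letter.

```python
import unicodedata


def codepoints(text: str) -> str:
    """Format a string as a list of code points for easy comparison."""
    return " ".join(f"U+{ord(c):04X}" for c in text)


alef_madda = "\u0622"   # precomposed ALEF WITH MADDA ABOVE
kaf_fatha = "\u0643\u064E"  # kaf followed by a combining fatha
kaf_initial = "\uFEDB"  # presentation form: KAF INITIAL FORM

nfd = unicodedata.normalize("NFD", alef_madda)
nfc = unicodedata.normalize("NFC", nfd)
nfkc = unicodedata.normalize("NFKC", kaf_initial)

print("NFD :", codepoints(nfd))
print("NFC :", codepoints(nfc))
print("NFKC:", codepoints(nfkc))
print("kaf+fatha NFC:", codepoints(unicodedata.normalize("NFC", kaf_fatha)))
print("kaf+fatha NFD:", codepoints(unicodedata.normalize("NFD", kaf_fatha)))
```
|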
Hi @Christophered, I don't think you understood my meaning. I Re-read what Ray said in this issue.
So it seems he already implemented (but didn't push yet) what I just now suggested. |
text2image is the program that renders images from ground truth.
'add' here means 'render'. |
Please test with latest source from github. The commit by @theraysmith fixes the issue. |
Thanks for these updates. Does the complete fix for RTL languages also require new traineddata created with these fixes? Will you be uploading a new version of traineddata? |
The new version of traineddata was uploaded a long time ago.
|
@hanikh You are right, Ray uploaded the best traineddata files on August 1. However, I think the wordlists in the traineddata for RTL languages still had some errors in the order of characters in ligatures. I am hoping that @theraysmith will upload fixed versions of traineddata to the new repos for LSTM traineddata - https://github.com/tesseract-ocr/tessdata_best |
if you are looking for Arabic diacritization corpus, here is a good one |
Hi guys, I'm using Tesseract 4.00 and I wrote my program to recognize English. It worked very well. Now I'm trying to add Arabic to my program, but it gives very weird characters even though I used ara.traineddata instead of eng.traineddata |
Try
script/Arabic.traineddata
From tessdata_fast
or
tessdata_best
|
Why do you ask the same question 5 times in different issues? https://github.com/tesseract-ocr/tesseract/issues?q=is%3Aissue+involves%3AAbdelsalamHaa+is%3Aopen |
@amitdo @Shreeshrii The image I'm trying to read is this one. However, in the prompt window the result is different again, even from the one before printing. Is it because of the language I use on my laptop, or does that have nothing to do with this? |
Seems like a locale issue. Output to a file and then open it in a unicode text editor. |
@Shreeshrii The result is still the same. I downloaded ara.traineddata from here https://github.com/tesseract-ocr/tessdata_fast and I used this website to check the Unicode as you mentioned. I think I found the problem, but I'm not sure how to solve it. |
To state my problem more clearly: Tesseract reads the image correctly. It returns the correct UTF-8 code, but when this code is treated as hex code the characters come out wrong. |
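The symptom described above is what happens when correct UTF-8 bytes are interpreted with a different encoding. The sketch below reproduces it with Latin-1 as a stand-in for whatever code page the console uses (that choice is an assumption), and writes the text to a file as UTF-8, as suggested earlier; `save_utf8` is a hypothetical helper name.

```python
# beh, seen, meem: stands in for text returned by the OCR engine.
text = "\u0628\u0633\u0645"

utf8_bytes = text.encode("utf-8")         # what a UTF-8 API hands back
garbled = utf8_bytes.decode("latin-1")    # same bytes, wrong interpretation
restored = utf8_bytes.decode("utf-8")     # same bytes, right interpretation


def save_utf8(path: str, content: str) -> None:
    """Write text to a file with an explicit UTF-8 encoding."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(content)


print(utf8_bytes.hex(" "))
print(len(text), len(garbled), restored == text)
```

Each Arabic letter here takes two bytes in UTF-8, so the misread string has twice as many characters as the original, none of them Arabic.
|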
hello |
@theraysmith @amitdo @Shreeshrii
The Arabic box file generated using Tesseract 4.x is in LTR (left-to-right) order, which is reversed: Arabic is written RTL (right to left). That means the first box should start from the right side.
(Have a look at the wrong, disordered Tesseract 4.0x Arabic box file.)
ara.Traditional_Arabic.exp0.zip
Tesseract 4.0 LSTM puts the spaces between the words into boxes, as you know.
A problem therefore arises from the box file disorder: because the boxes are mistakenly set in LTR (left-to-right) order for Arabic, there are jumps from the end of the first line to the end of the last letter of the line after it.
See the image attached
( Now have a look at the attached correct order of Arabic example tif/box of version Tesseract 3.05).
Arabic example 1.zip
Example 1, correct box order: