Box File disorder, Arabic Language #648

ghost · 2017-01-10T22:36:37Z

Box file disorder
The Arabic box file generated by Tesseract 4.x is ordered LTR (left to right), which is reversed: Arabic is written RTL (right to left), so the first box should start from the right side.
(Have a look at the incorrectly ordered Tesseract 4.0x Arabic box file.)
ara.Traditional_Arabic.exp0.zip

As you know, Tesseract 4.0 LSTM puts the spaces between words into boxes.
Because the boxes are mistakenly ordered LTR (left to right) for Arabic, the box sequence jumps from the end of one line to the end of the last letter of the line after it.
See the attached image.

(Now compare the attached Arabic example tif/box pair from Tesseract 3.05, which has the correct order.)
Arabic example 1.zip
Example 1, correct box order:
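
A quick way to check which direction a box file runs is to compare the left coordinates of consecutive boxes on one text line. The sketch below assumes the usual `glyph left bottom right top page` box line layout; the sample coordinates and the helper names `parse_box_line` and `line_direction` are invented for illustration and are not part of Tesseract.

```python
def parse_box_line(line):
    """Split one box-file line into (glyph, left, bottom, right, top, page)."""
    # Split from the right so a glyph that is itself a space survives.
    glyph, *numbers = line.rstrip("\n").rsplit(" ", 5)
    left, bottom, right, top, page = (int(n) for n in numbers)
    return glyph, left, bottom, right, top, page


def line_direction(box_lines):
    """Return 'LTR' or 'RTL' for the boxes of a single text line."""
    lefts = [parse_box_line(line)[1] for line in box_lines]
    steps = [b - a for a, b in zip(lefts, lefts[1:])]
    rising = sum(1 for s in steps if s > 0)
    falling = sum(1 for s in steps if s < 0)
    return "LTR" if rising >= falling else "RTL"


# Hypothetical boxes for one three-letter word, listed in file order.
sample_ltr = [
    "\u0645 10 20 30 60 0",   # meem, leftmost box
    "\u0633 32 20 55 60 0",   # seen
    "\u0628 57 20 80 60 0",   # beh, rightmost box
]
sample_rtl = list(reversed(sample_ltr))

print(line_direction(sample_ltr))
print(line_direction(sample_rtl))
```

Running this on the boxes of a single line from each attached box file would show whether the 3.05 and 4.0x files really differ in direction.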

Shreeshrii · 2017-01-11T05:01:15Z

While tesstrain.sh takes into account RTL languages while creating the DAWG files, text2image process does not seem to have specific RTL processing.

ghost · 2017-01-11T08:57:55Z

Will there be modifications to text2image?

amitdo · 2017-01-11T13:08:36Z

What you show here is 'by design'. It should not cause any problems in the training process or in character recognition for RTL languages.

theraysmith · 2017-01-11T13:40:19Z

There *is* RTL-specific processing in text2image. Pango renders RTL text RTL, and the post-processing is intended to re-order everything strictly LTR. *If it isn't doing that then there is a bug.*

Here is why: Old Tesseract (3.05) is only learning individual character shapes, so the order of training data is somewhat irrelevant. New Tesseract (LSTM/4.00) is learning to identify the text characters in the (LTR) order they appear in the image, so the truth transcription should be LTR, and the words in the dawg should be reversed. There is Bidi processing inside the post-recognition processing of Tesseract that reprocesses/re-orders the text for output, so it appears in the correct order.

I wonder if the bidi integration is working correctly for LSTM, as the accuracy with Arabic is unsatisfactory. In light of this design, please take another look, and let me know if there is anything systematically wrong in the output that might provide some hints as to where to look. Thanks.
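
The strict LTR reordering described here can be pictured with a simplified sketch. It reverses a purely RTL word while keeping each combining mark attached to its base letter, which appears consistent with the ALIGNED TRUTH strings posted in this thread. This is an illustration only: real text with mixed directions needs proper bidi handling, and `reverse_clusters` is a hypothetical helper, not a Tesseract function.

```python
import unicodedata


def reverse_clusters(text):
    """Reverse text by base letter, keeping combining marks after their base."""
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch      # attach the mark to the preceding base letter
        else:
            clusters.append(ch)     # start a new cluster
    return "".join(reversed(clusters))


# ain + fatha, beh + sukun, dal (logical order, as typed)
logical = "\u0639\u064e\u0628\u0652\u062f"
visual_ltr = reverse_clusters(logical)

print([f"U+{ord(c):04X}" for c in logical])
print([f"U+{ord(c):04X}" for c in visual_ltr])
```

Applying the same function twice returns the original string, which is the property a post-recognition step would rely on to restore logical order for pure RTL text.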


Shreeshrii · 2017-01-11T14:30:23Z

Ray,
There seems to be a bug. I have trained a couple of times to a 2-3% character error rate, but the OCRed text seems to be way off. During training, the diacritics seem to be recognized well, e.g.:

Iteration 1702: ALIGNED TRUTH : انَدبْعَ ىلَعَ انَلْنَ امَّمِ بٍيْرَ يفِ مْتُنْكُ نْإِوَ نَومُلَعْتَ مْتُنْأوَ ادًادَنْأ هِللِ اولُعَجْتَ الَفَ مْكُل
Iteration 1702: BEST OCR TEXT : انَيبْعَ ىلَعَ انَلْنَ اهَمِ بِيْرَ يفِ مْتُنَ نْإوَ نَوهُلَمْتَ مْتُأوَ اذَادَنأ هِلَلِ اولُعَجْتَ الَفَ مْكُلَ

But the OCRed text does not seem to include any diacritics.

ara.Arial_Unicode_MS.exp0.txt
ara.Calibri.exp0.txt
ara.Arial.exp0.txt

@Christophered and @bmwmy can provide Arabic specific details.

Shreeshrii · 2017-01-11T14:31:54Z

https://github.com/tesseract-ocr/tesseract/files/696122/ara.TRAINING.zip has the box tif pairs for the above training.

https://github.com/tesseract-ocr/tesseract/files/696184/traininglog-mid.txt shows the debug messages from during training.

amitdo · 2017-01-11T15:24:24Z

> I wonder if the bidi integration is working correctly for LSTM, as the accuracy with Arabic is unsatisfactory.

Ray,

According to your tests, how does Hebrew (another RTL language) perform?

Do you have accuracy report for various languages that you can share with us, other than the one in the DAS2016 slides?

Shreeshrii · 2017-01-11T15:30:33Z

@theraysmith I hope you have also seen the comments by Chris on the other thread, #642:

> I was merging the letter extender with the Arabic letter into one single box, and using that Arabic letter as the character of the box. Basically, I was trying to train the engine to recognize each Arabic letter in its multiple positions; as you know, Arabic letters have multiple forms depending on their position in the word (beginning, middle, ending, isolated).
> Example:
> ( كـ ) is not ( ك + ـ ) in the box file, it should be ( ك )
> Also, ( ـكـ ) or ( ـك ) are a single character ( ك ) in different positions; this is important in the box file.
>
> Which also means that ( كَـ ) is not ( ك + ـَ ), it is ( كَ )

theraysmith · 2017-01-11T20:29:51Z

Thanks for the work on this. I've learned 2 things so far:
- U+640 (tatweel) is a very special case that I need to think how to handle: it needs to be preserved in the training text, so it gets rendered, but needs to disappear everywhere else, as if it doesn't exist.
- The diacritics are currently excluded from the unicharset, probably because they are only rarely used, but need to be included. There may not be enough text with them included in the text corpora.

Questions: Is that all of U+64b->U+652 inclusive? Applied uniformly to all Arabic languages? (ara+div+fas+kur_ara+pus+snd+syr+uig+urd)
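
For reference, the code points in question can be listed with Python's standard `unicodedata` module. The sketch also shows the simplest possible way to make tatweel "disappear" from a transcription while leaving the rendered text untouched; `strip_tatweel` is a hypothetical helper and does not reflect how Tesseract itself handles it.

```python
import unicodedata

TATWEEL = "\u0640"
# The range asked about in the thread: U+064B through U+0652 inclusive.
DIACRITICS = [chr(cp) for cp in range(0x064B, 0x0652 + 1)]


def strip_tatweel(text):
    """Drop every tatweel from the transcription; all other characters stay."""
    return text.replace(TATWEEL, "")


for mark in DIACRITICS:
    print(f"U+{ord(mark):04X} {unicodedata.name(mark)}")

# kaf followed by tatweel collapses to plain kaf
print(strip_tatweel("\u0643\u0640") == "\u0643")
```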


theraysmith · 2017-01-11T20:40:40Z

@amitdo Hebrew seems to be OK. It is certainly ahead of 3.05.
I have some detailed results, but they aren't very meaningful without being able to look at the actual errors to see how many are actually due to the ground truth, or some strange disagreement on whitespace.
The gist is that 4.00 is less good than it could be on:
Arabic (all langs)
Indic (all langs)
Chinese, Japanese.
The problems with Arabic may be explained by this thread.
Chinese, Japanese are troubled by the use of radical-stroke encoding. I need to switch to a better scheme.
Indic may be troubled by the length of the compressed codes used. I need to switch to a better scheme.

Shreeshrii · 2017-01-12T03:45:01Z

@theraysmith

> The diacritics are currently excluded from the unicharset, probably because they are only rarely used, but need to be included. There may not be enough text with them included in the text corpora.

Ray,
Please see #552 and tesseract-ocr/langdata#35

Arabic Diacritics are included in the Arabic.unicharset.

@bmwmy had offered to provide additional training text with diacritics - see #552 (comment)

I was able to get the diacritics recognized during training by adding the following line to ara.config; however, for some fonts it seems to treat diacritics as a separate line. I do not know whether this is related to the x-height of the fonts.

#Diacritics
textord_min_linesize 2.5

Related question, for lstm training, I am using

--noextract_font_properties

while creating the box tiff pairs and lstmf files since @amitdo mentioned that font_properties are not needed for LSTM training, please confirm.

Shreeshrii · 2017-01-12T04:26:46Z

Please also see #318 (comment) and other comments regarding unicharset.

Are the glyph metrics updated based on the fonts used for training?
Are glyph metrics used for LSTM training?

Answer:

No. Most of the unicharset fields are irrelevant to LSTM training and recognition.
_The mirror and normalized string fields ARE important though._

bmwmy · 2017-01-12T08:46:43Z

I examined @Shreeshrii's training set and I appreciate his effort, but the text in the generated TIFF image files seems much smaller than it should be. It is hard to read even for humans. The vowel diacritics look like noise, as do some letter glyphs such as ( فـ / ـمـ ). I suggest using a 16-22pt font.
Also this:
Iteration 1702: ALIGNED TRUTH : انَدبْعَ ىلَعَ انَلْنَ امَّمِ بٍيْرَ يفِ مْتُنْكُ نْإِوَ نَومُلَعْتَ مْتُنْأوَ ادًادَنْأ هِللِ اولُعَجْتَ الَفَ مْكُل
Iteration 1702: BEST OCR TEXT : انَيبْعَ ىلَعَ انَلْنَ اهَمِ بِيْرَ يفِ مْتُنَ نْإوَ نَوهُلَمْتَ مْتُأوَ اذَادَنأ هِلَلِ اولُعَجْتَ الَفَ مْكُلَ
are in reversed order (RTL issue).

Comparison between the noisy text example that was used in training and a good one:

I think using larger text in the input images will result in a very large improvement. I am satisfied with this result, taking into consideration the noisy input, but the RTL issue should be solved.

@Christophered's ara.Traditional_Arabic.exp0.zip was a good input image file,
but @Shreeshrii's https://github.com/tesseract-ocr/tesseract/files/696122/ara.TRAINING.zip is a noisy input image.

amitdo · 2017-01-12T10:45:50Z

About --noextract_font_properties . Ray confirmed it here:
#634 (comment)

amitdo · 2017-01-12T10:57:53Z

> Are glyph metrics used for LSTM training?

I believe the answer is 'No'.

@theraysmith, can you confirm that?

Shreeshrii · 2017-01-12T11:00:04Z

@amitdo I have been training using --noextract_font_properties since you brought it to my notice.

However, I am wondering whether some type of fontinfo / xheights is still required for LSTM training.

E.g., in order to avoid the diacritics being discarded as noise, I had to add textord_min_linesize 2.6 in ara.config. But different fonts have different sizes, even at the same point size. I tried different values, but couldn't find one that works with multiple fonts.

So, I am trying the training again with only one font, Traditional Arabic font at 32 point, as suggested by @bmwmy.

Ray might have a different solution - will wait till the changes in wiki for training are updated.

amitdo · 2017-01-12T11:16:29Z

textord_min_linesize is a hint for the layout analysis step in Tesseract.

If the layout analysis step does not 'cut' the lines properly, the next step - the lines' text recognition, will suffer.

amitdo · 2017-01-12T11:40:39Z

Tesseract release notes July 11 2015 - V3.04.00

> Major change to improve layout analysis for heavily diacritic languages: Thai, Vietnamese, Kannada, Telugu etc.

From DAS2016 slide 5 - 'Page Layout Analysis':

> Tesseract's existing text-line finding is also weak wrt diacritics, especially for Arabic and Thai.

Shreeshrii · 2017-01-12T12:29:06Z

> There is Bidi processing inside the post-recognition processing of Tesseract that reprocesses/re-orders the text for output, so it appears in the correct order.

Iteration 1702: ALIGNED TRUTH : انَدبْعَ ىلَعَ انَلْنَ امَّمِ بٍيْرَ يفِ مْتُنْكُ نْإِوَ نَومُلَعْتَ مْتُنْأوَ ادًادَنْأ هِللِ اولُعَجْتَ الَفَ مْكُل
Iteration 1702: BEST OCR TEXT : انَيبْعَ ىلَعَ انَلْنَ اهَمِ بِيْرَ يفِ مْتُنَ نْإوَ نَوهُلَمْتَ مْتُأوَ اذَادَنأ هِلَلِ اولُعَجْتَ الَفَ مْكُلَ
are reversed order (RTL issue)

@theraysmith

Ray, Can the bidi post-processing be applied as an experiment to these debug messages? Then we can very easily see whether it is working.

Answer: No. That would be very difficult. They are intended to be displayed completely without any RTL smarts.

amitdo · 2017-01-12T13:04:31Z

Shree, you might want to use this text2image option with Arabic:
--leading Inter-line space (in pixels) (type:int default:12)

At a minimum it should equal ptsize. For Arabic, you can try increasing it (20-50 percent bigger than ptsize).
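
As a rough sketch of that rule of thumb, the helper below derives a `--leading` value from the point size and assembles a text2image argument list. The file names, font name, and function names are placeholders chosen for illustration; only the `--text`, `--outputbase`, `--font`, `--ptsize` and `--leading` options are text2image's own. The list is built but not executed.

```python
def suggested_leading(ptsize, extra=0.5):
    """Leading at least equal to ptsize, enlarged by `extra` (0.2 to 0.5 for Arabic)."""
    return round(ptsize * (1 + extra))


def text2image_args(text_path, outputbase, font, ptsize, extra=0.5):
    """Build (but do not run) a text2image command line."""
    return [
        "text2image",
        f"--text={text_path}",
        f"--outputbase={outputbase}",
        f"--font={font}",
        f"--ptsize={ptsize}",
        f"--leading={suggested_leading(ptsize, extra)}",
    ]


# Placeholder inputs: a 16 pt font with leading 50 percent larger than ptsize.
args = text2image_args("ara.training_text", "ara.Arial.exp0", "Arial", 16)
print(" ".join(args))
```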

amitdo · 2017-01-12T13:07:46Z

IMO, 32 ptsize is too big. Try 14/16.

Shreeshrii · 2017-01-12T16:57:23Z

Arabic also has presentation forms i.e. spacing forms of Arabic diacritics, and contextual letter forms.

Please see
http://www.alanwood.net/unicode/arabic_presentation_forms_a.html
http://www.alanwood.net/unicode/arabic_presentation_forms_b.html
and
https://github.com/w3c/alreq/wiki/Should-I-use-the-Arabic-Presentation-Forms-provided-in-Unicode%3F

Though it is not recommended to use these for content under newer versions of Unicode, I am wondering whether using them would make OCR easier...

There can be a post-processing step to convert them to regular unicode points later.
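
One candidate for such a post-processing step is Unicode compatibility normalization (NFKC), which maps the contextual presentation forms back to the regular letter. A minimal sketch using Python's standard library follows; `fold_presentation_forms` is a hypothetical name, and NFKC also rewrites other compatibility characters, so it may change more than intended.

```python
import unicodedata


def fold_presentation_forms(text):
    """Map Arabic presentation forms to regular code points via NFKC."""
    return unicodedata.normalize("NFKC", text)


# Isolated, final, initial and medial presentation forms of kaf (U+FED9..U+FEDC)
kaf_forms = "\ufed9\ufeda\ufedb\ufedc"
folded = fold_presentation_forms(kaf_forms)

print([f"U+{ord(c):04X}" for c in folded])
```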

amitdo · 2017-01-13T11:25:35Z

> Are glyph metrics used for LSTM training?

No. Confirmed by Ray here: tesseract-ocr/langdata#31 (comment)

> ... the glyph metrics aren't used.

Shreeshrii · 2017-01-17T13:29:26Z

> Shree, you might want to use this text2image option with Arabic:
> --leading Inter-line space (in pixels) (type:int default:12)
>
> At a minimum it should equal ptsize. For Arabic, you can try increasing it (20-50 percent bigger than ptsize).

The tesstrain.sh process uses the default --ptsize, which is 12.

language_specific.sh sets --leading to 32 by default and to 48 for Thai fonts.

ghost · 2017-01-28T15:33:23Z

@Shreeshrii here is a sample text for you, please test and post the findings
Arabic sample variation.zip

@theraysmith Most if not all languages related to Arabic (for example Farsi, Urdu, etc.) use such diacritics.
Arabic diacritics are often, but not always, used in Arabic text: sometimes throughout the text, and sometimes on one letter in each word. Believe me, the diacritics are frequently used.
Have a look at the Quran.

ghost · 2017-01-28T15:59:59Z

@Shreeshrii the sample box/tif that you provided had some errors that I noticed:

1. The U+640 (tatweel) issue
example: بِسمِ
wrong: ب ـِ س ـِ م
correct: ب س م (only 3 letters, بسم)
@theraysmith got it right: Tesseract should be able to render it in the .tif, but should not treat it as a separate character in the .box file. The box of U+640 (tatweel) should be combined with the box of another letter, the merged box should be labelled with that single letter, and U+640 (tatweel) should never appear in the .box file.
U+640 (tatweel) is a special case. People don't usually use it when writing text, so either don't use it, or merge the boxes and remove it.
2. As @bmwmy mentioned earlier, in your .tif the characters of the text are separated, which is wrong.
wrong: بـ ـسـ ـم
correct: بسم

I tracked the problem down: the cause was the .txt that you were using, which contained the U+640 (tatweel) mistake from item 1.

Shreeshrii · 2017-01-28T16:05:11Z

The box/tiff pairs are as generated by the text2image program. Any changes to that will have to be done by Ray. I will test with the files you have provided.


ghost · 2017-01-28T23:27:29Z

@theraysmith I suggest adding the ability to convert Tesseract 3.0x box files to Tesseract 4.0x, since many of us have tif/box files based on the older 3.0x version.

@theraysmith Also, Ubuntu has released the Snaps project, which makes it possible to distribute software as a universal Linux package. Would it be possible to release a Snap of Tesseract 4.0x? This would save us a lot of time and effort by skipping the build process and its issues.
https://www.ubuntu.com/desktop/snappy
http://snapcraft.io/

ghost · 2017-07-30T19:11:46Z

Question
@theraysmith I understand that for training, Tesseract 4.x reorders the Arabic text into Tesseract's reading order, meaning it converts RTL to LTR and then normalizes it.
For normalization, which form does it use, NFD or NFC?
Example:

GDT: آمنا بالله إن شئتم الآخرة هم بمؤمنين يا أيها
NFD: اهئا اي نينمٔومب مه ةرخٓالا متٔيش نٕا هللاب انمٓا
NFC: اهيأ اي نينمؤمب مه ةرخآلا متئش نإ هللاب انمآ

The difference between NFD and NFC is that, after the text is reordered to LTR for training, NFD pushes the marks/diacritics to the right side, while in NFC the marks/diacritics remain on the characters themselves.
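
The effect can be reproduced on a single letter with Python's standard `unicodedata` module. The sketch below only illustrates the two normalization forms and a naive code-point reversal; it makes no claim about which form or which reordering Tesseract actually applies.

```python
import unicodedata

composed = "\u0622"                                # ALEF WITH MADDA ABOVE
nfd = unicodedata.normalize("NFD", composed)       # base alef + combining madda
nfc = unicodedata.normalize("NFC", nfd)            # recomposed into one code point


def code_points(text):
    return [f"U+{ord(c):04X}" for c in text]


print("NFD:", code_points(nfd))
print("NFC:", code_points(nfc))
# A naive code-point reversal of the NFD form puts the mark before its base,
# so it no longer sits on the letter it belongs to.
print("NFD reversed:", code_points(nfd[::-1]))
```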

amitdo · 2017-07-30T19:16:39Z

text2image should have an option to randomly add tatweel every n lines.

https://en.wikipedia.org/wiki/Kashida

Kashida is generally only used in one word per line and applied to one letter per word.

Furthermore, kashida is recommended only between certain combinations of letters (typically those which cannot form a ligature).

ghost · 2017-07-30T19:32:05Z

@amitdo No, no! That is a very bad idea.
Theoretically speaking, tatweel reduces the recognition rate and makes the model unstable and confused.
Ray stated earlier that he has fixed the tatweel problem. He even gave the option to render it if you want to, but it will be removed as a standalone character from the training data.
I think he solved this issue by merging the tatweel box with the preceding glyph's box and removing the tatweel as a character, so what remains is one box labelled with the preceding glyph.
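
The merging guessed at here could look roughly like the following sketch. It is only one reading of the idea, not Ray's actual fix: boxes are plain `(glyph, left, bottom, right, top)` tuples, `merge_tatweel` is a hypothetical helper, and "preceding" is taken to mean the previous entry in file order.

```python
TATWEEL = "\u0640"


def merge_tatweel(boxes):
    """Fold each tatweel box into the box before it and drop the tatweel glyph."""
    merged = []
    for glyph, left, bottom, right, top in boxes:
        if glyph == TATWEEL and merged:
            g, l, b, r, t = merged[-1]
            # Grow the previous box to the union of both bounding boxes.
            merged[-1] = (g, min(l, left), min(b, bottom), max(r, right), max(t, top))
        else:
            # A tatweel with no previous box is left untouched in this sketch.
            merged.append((glyph, left, bottom, right, top))
    return merged


# Hypothetical coordinates: a kaf followed by a tatweel to its left.
boxes = [("\u0643", 40, 10, 60, 50), (TATWEEL, 20, 12, 40, 20)]
print(merge_tatweel(boxes))
```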

amitdo · 2017-07-30T19:45:04Z

> For normalization, which form does it use, NFD or NFC?

It seems it uses NFKC.

amitdo · 2017-07-30T20:18:00Z

Hi @Christophered, I don't think you understood my meaning.

I re-read what Ray said in this issue.

> it needs to be preserved in the training text, so it gets rendered

I understand that tatweel is a rendering artifact that should be rendered for training, but should not occur in the output text (or in the language model).

> The tatweel and ligature problem are fixed and will be corrected in the new traineddatas coming soon

So it seems he already implemented (but didn't push yet) what I just now suggested.

amitdo · 2017-07-30T20:33:42Z

text2image is the program that renders images from ground truth.

> text2image should have an option to randomly add tatweel every n lines.

'add' here means 'render'.

Shreeshrii · 2017-09-09T03:57:24Z

3e63918

Please test with latest source from github. The commit by @theraysmith fixes the issue.

Shreeshrii · 2017-09-09T05:36:50Z

@theraysmith

Thanks for these updates.

Does the complete fix for RTL languages also require new traineddata created with these fixes?

Will you be uploading a new version of traineddata?

hanikh · 2017-09-12T06:44:33Z

The new version of the traineddata was uploaded a long time ago.


Shreeshrii · 2017-09-12T08:00:53Z

@hanikh You are right, Ray had uploaded the best traineddata files on August 1.

However, I think that the wordlists in that traineddata for RTL languages still had some errors in the order of characters in ligatures.

I am hoping that @theraysmith will upload fixed versions of traineddata to the new repos for LSTM traineddata -

https://github.com/tesseract-ocr/tessdata_best
and
https://github.com/tesseract-ocr/tessdata_fast

Fahad-Alsaidi · 2017-12-07T08:03:44Z

if you are looking for Arabic diacritization corpus, here is a good one

AbdelsalamHaa · 2018-04-30T06:28:32Z

Hi guys, I'm using Tesseract 4.00, and my program recognizes English very well. Now I'm trying to add Arabic to my program, but it gives very weird characters even though I used ara.traineddata instead of eng.traineddata.

Shreeshrii · 2018-04-30T08:02:10Z

Try script/Arabic.traineddata from tessdata_fast or tessdata_best.


amitdo · 2018-04-30T09:06:39Z

@AbdelsalamHaa

Why do you ask the same question 5 times in different issues?
Please use the forum to ask questions.

https://github.com/tesseract-ocr/tesseract/issues?q=is%3Aissue+involves%3AAbdelsalamHaa+is%3Aopen
https://github.com/tesseract-ocr/tessdata/issues?q=is%3Aissue+involves%3AAbdelsalamHaa+is%3Aopen

AbdelsalamHaa · 2018-05-02T02:10:05Z

@amitdo
Sorry, I thought each one was different. My apologies.

@Shreeshrii
This is the result when I use either the fast ara.traineddata or the best ara.traineddata; both give the same output.

The image I'm trying to read is this one.

However, the result shown in the prompt window is also different from the one before printing.

Is it because of the language I use on my laptop, or does that have nothing to do with this?

Shreeshrii · 2018-05-02T02:39:11Z

Seems like a locale issue. Output to a file and then open it in a unicode text editor.

Shreeshrii · 2018-05-02T03:26:30Z

You may need to preprocess the image for better result.

عبدالسلام حمدي عبدالعزيز

tesseract 648-arabic.png - -l ara --tessdata-dir ./tessdata_best
عبدالسلام حمدي عبدالعزيز

AbdelsalamHaa · 2018-05-02T03:51:42Z

@Shreeshrii
I found out why they are different: my computer uses Chinese to encode any non-Unicode text, i.e. the output from Tesseract and the prompt window.

But the result is still the same.

I downloaded ara.traineddata from here:

https://github.com/tesseract-ocr/tessdata_fast
and also tried from here, with the same results:
https://github.com/tesseract-ocr/tessdata_best

I used this website to check the Unicode, as you mentioned:
https://r12a.github.io/app-conversion/

I think I found the problem, but I'm not sure how to solve it.
The first line is the UTF-8 code for "عبدالسلام حمدي عبدالعزيز".
When I convert it by pressing "Hex code points", I get the same string that I got from Tesseract.

But I'm still not sure how to solve the problem.

AbdelsalamHaa · 2018-05-02T03:54:19Z

This is the part of my code for Tesseract.

I also used the image you just sent to test, but I still got the same weird characters.

AbdelsalamHaa · 2018-05-02T03:56:52Z

Here is where I initialized all the .traineddata files.

AbdelsalamHaa · 2018-05-02T06:04:01Z

To state my problem more clearly: Tesseract reads the image correctly. It returns the correct UTF-8 code, but when this code is treated as hex code points the characters are wrong.
I hope someone can help me with this.

AbdelsalamHaa · 2018-05-03T08:45:29Z

OK, I finally found the problem. I fixed a few things, so I'm not sure exactly which one was the mistake, but I think the main one was the Advanced Save Options setting, which I changed to look like this.

I first tested with this simple code. Even though the problem is not Tesseract's fault, it might be useful for others.

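```cpp
#include <cstdio>     // getchar
#include <fstream>    // std::ofstream
#include <iostream>   // std::cout
#include <string>
#include <windows.h>  // only needed if SetConsoleOutputCP is re-enabled

using namespace std;

int main() {
    ofstream writer("file3.txt");

    // Set console code page to UTF-8 so the console knows how to interpret string data
    //SetConsoleOutputCP(CP_UTF8);

    // Enable buffering to prevent VS from chopping up UTF-8 byte sequences
    //setvbuf(stdout, nullptr, _IOFBF, 1000);

    // The source file must be saved as UTF-8 for this literal to hold UTF-8 bytes.
    string test1 = "عبدالسلام حمدي عبدالعزيز\n";

    cout << test1 << endl;                  // console output may still look garbled
    writer << "\t na  " << test1 << endl;   // the file receives the same UTF-8 bytes

    getchar();  // keep the console window open
    return 0;
}
```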
Please note that the result is only correct when you print to a file, not in the prompt window, and not even in the Visual Studio watch window.
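
The same behaviour can be reproduced outside the console. The sketch below shows how UTF-8 bytes turn into unrelated characters when decoded with a legacy code page, and how writing them to a UTF-8 file keeps them intact. GBK is used purely as an assumed example of a Chinese code page, and the output file name is a placeholder.

```python
import os
import tempfile

text = "عبدالسلام حمدي عبدالعزيز"
utf8_bytes = text.encode("utf-8")

# Decoding the same bytes with a legacy code page yields unrelated characters.
garbled = utf8_bytes.decode("gbk", errors="replace")

# Writing and reading the file as UTF-8 round-trips the text unchanged.
path = os.path.join(tempfile.gettempdir(), "ocr_output.txt")
with open(path, "w", encoding="utf-8") as fh:
    fh.write(text)
with open(path, encoding="utf-8") as fh:
    restored = fh.read()

print("garbled differs from original:", garbled != text)
print("file round-trip intact:", restored == text)
```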

mhdsRahnama · 2018-06-29T07:46:32Z

Hello,
I have the tatweel (kashida) problem in Persian too. Can I use text2image to fix it?

Shreeshrii mentioned this issue Jan 11, 2017

LSTM: Training - Arabic - Add Top layer - Aborted (core dumped) #642

Closed

Shreeshrii mentioned this issue Jan 12, 2017

Q&A: Indic - length of the compressed codes #654

Open

ghost closed this as completed Jul 30, 2017

ghost reopened this Jul 30, 2017

theraysmith pushed a commit that referenced this issue Sep 8, 2017

Fixed order of characters in ligatures of RTL languages issue #648

3e63918

Shreeshrii mentioned this issue Jan 25, 2018

wrong coordinates in .box file with LSTM #1276

Closed

Shreeshrii mentioned this issue Jan 1, 2019

Fine Tuning Leads to Segmentation Issue #2132

Open

amitdo added the RTL label Mar 18, 2021

amitdo added the text2image label Sep 25, 2022
