Report on RTL training with OCR_GS_Data for Arabic #128

Shreeshrii · 2019-12-01T05:33:19Z

Similar to @stweil's training for Fraktur, I am collecting information here about fine-tuning RTL training with OCR_GS_Data for Arabic. Some of it has already been reported in other threads.

Shreeshrii · 2019-12-01T05:50:01Z

OCR_GS_Data contains double-checked gold standard data for training and testing OCR engines for RTL languages. It includes the Arabic data used for the report "Important New Developments in Arabographic Optical Character Recognition (OCR)".

So far I have used only a subset of the datasets referred to in the above report (works #0, #1, #2, #5 and #6), since they are said to have a similar typeface. These comprise approximately 10000 single-line images with their transcriptions. (As per the report, it should be approx. 5000 text lines; the images are provided at both high and low (200 dpi) resolution, which doubles the number.)

Shreeshrii · 2019-12-01T06:58:07Z

An initial test comparison, with testing only on work #0 (Buldan), is shown in a table in this post.

A second training run using all five datasets referenced above, with certain modifications, shows better results when evaluated on the same training data.

Type	Rate
CER	0.70
WER	1.71
WER (order independent)	1.65
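
For readers unfamiliar with these metrics, here is a minimal sketch of how CER, WER and an order-independent WER are commonly computed. It is not the code of the evaluation tool used for the table above. Two assumptions are made: rates are expressed as percentages (the post does not state the unit), and "order independent" means comparing bags of words. All function names are hypothetical.

```python
from collections import Counter


def levenshtein(ref, hyp):
    """Edit distance between two sequences (insert, delete, substitute cost 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """Character error rate in percent (assumed unit)."""
    if not ref:
        raise ValueError("reference text must not be empty")
    return 100.0 * levenshtein(ref, hyp) / len(ref)


def wer(ref, hyp):
    """Word error rate in percent, words split on whitespace."""
    ref_words = ref.split()
    if not ref_words:
        raise ValueError("reference text must contain at least one word")
    return 100.0 * levenshtein(ref_words, hyp.split()) / len(ref_words)


def wer_order_independent(ref, hyp):
    """Bag-of-words error rate in percent (assumed definition)."""
    ref_counts, hyp_counts = Counter(ref.split()), Counter(hyp.split())
    if not ref_counts:
        raise ValueError("reference text must contain at least one word")
    missing = sum((ref_counts - hyp_counts).values())
    extra = sum((hyp_counts - ref_counts).values())
    return 100.0 * max(missing, extra) / sum(ref_counts.values())
```

Swapped words count as errors for `wer` but not for `wer_order_independent`, which is one reason the order-independent figure can come out lower.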

Shreeshrii · 2019-12-01T07:01:44Z

Issues with groundtruth and images:

space at beginning/end of some transcriptions (could lead to hallucination effect)
digits in Arabic script have been transcribed as 0-9.
some images are not tightly cropped, so the bbox of the whole image does not match the actual bbox of the text.
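
The first two points could be handled with a small clean-up step before training. The following is a minimal sketch, not part of tesstrain. `clean_transcription` is a hypothetical helper, and the digit conversion assumes the images show Arabic-Indic digits (U+0660 to U+0669), which should be checked against the images before enabling it.

```python
# Map ASCII digits 0-9 to Arabic-Indic digits U+0660..U+0669.
# Assumption: these are the digits printed in the images.
ASCII_TO_ARABIC_INDIC = str.maketrans(
    "0123456789",
    "".join(chr(0x0660 + d) for d in range(10)),
)


def clean_transcription(text, convert_digits=False):
    """Strip leading/trailing whitespace; optionally convert ASCII digits."""
    cleaned = text.strip()
    if convert_digits:
        cleaned = cleaned.translate(ASCII_TO_ARABIC_INDIC)
    return cleaned
```

Digit conversion is off by default because it changes the ground truth, whereas stripping whitespace at the line edges does not alter the visible text.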

Shreeshrii · 2019-12-01T07:05:52Z

Issues with tesstrain Makefile:

RTL text not handled correctly either by generate_line_box.py or by generate_wordstr_box.py. See comment.

PR #127 proposes a new script to handle these.
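
As an illustration only (this is not the approach taken in PR #127), a box-generation script could first decide whether a line is predominantly RTL before applying any reordering. The sketch below uses only the standard library; `is_rtl_line` is a hypothetical name.

```python
import unicodedata


def is_rtl_line(text):
    """Return True if strong RTL characters outnumber strong LTR ones."""
    rtl = ltr = 0
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class in ("R", "AL"):   # Hebrew-type and Arabic letters
            rtl += 1
        elif bidi_class == "L":         # Latin and other LTR letters
            ltr += 1
        # Digits, punctuation and spaces are weak/neutral and are ignored.
    return rtl > ltr
```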

Shreeshrii · 2019-12-01T07:12:39Z

Issues with tesseract and text2image:

Need to use --psm 13 for correct recognition.
Need to use -c page_separator=''
text2image does not create correct charboxes for certain images (Bad box coordinates in boxfile string! ح). The resulting wordstrbox files for these images have more than 2 lines. Discard these images; otherwise training does not converge at all. This can be done as follows:

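# List the box files that have more than one line
find /home/ubuntu/OCR_GS_Data/ara/ground-truth -type f -name '*Buldan*.box' -exec bash -c '[[ $(wc -l < "$1") -gt 1 ]] && echo "$1"' _ '{}' \;  > err.txt
# Turn each listed path into an rm command
sed -i -e 's/^/rm /' err.txt
# Replace the trailing "box" with a glob so that all files sharing the base name are removed
sed -i -e 's/box$/*/' err.txt
# Review err.txt before running it
bash err.txt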
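
The same clean-up can be sketched in Python with a dry-run mode, which makes it easier to review what would be deleted. This is an illustrative alternative, not part of tesstrain; the function names, the default pattern and the `max_lines` threshold are assumptions chosen to mirror the shell commands above.

```python
from pathlib import Path


def find_bad_box_files(root, pattern="*Buldan*.box", max_lines=1):
    """Return box files under root with more than max_lines newlines (like wc -l)."""
    bad = []
    for box in sorted(Path(root).rglob(pattern)):
        if box.is_file() and box.read_bytes().count(b"\n") > max_lines:
            bad.append(box)
    return bad


def remove_bad_lines(root, dry_run=True):
    """Collect every file sharing the base name of a bad box file.

    Files are deleted only when dry_run is False; the list of affected
    paths is returned in both cases.
    """
    targets = []
    for box in find_bad_box_files(root):
        # Mirrors the shell glob name.* (box file, image, transcription, ...)
        targets.extend(
            p for p in sorted(box.parent.glob(box.stem + ".*")) if p.is_file()
        )
    if not dry_run:
        for path in targets:
            path.unlink()
    return targets
```

Calling `remove_bad_lines("/home/ubuntu/OCR_GS_Data/ara/ground-truth")` only lists the affected files; pass `dry_run=False` to delete them.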

Shreeshrii · 2019-12-01T07:15:22Z

Issues with ocr evaluation tools:

For large texts, the accuracy reports are not generated. The error is "accuracy: text stream is too long". See issue.

An alternative is to use https://github.com/impactcentre/ocrevalUAtion

Sample report : eval-Buldan-araKraken.html.txt

Shreeshrii · 2019-12-19T18:12:13Z

See https://github.com/Shreeshrii/tesstrain-arabic-GS for current training data and reports

M3ssman · 2020-12-16T08:10:57Z

@Shreeshrii I wonder if/how the ordering of punctuation chars affects training.

Given a line image like https://github.com/OpenITI/OCR_GS_Data/blob/master/ara/book_IbnFaqihHamadhani.Buldan/7_final_b/a_000716.png, compared with its transcription (https://github.com/OpenITI/OCR_GS_Data/blob/master/ara/book_IbnFaqihHamadhani.Buldan/7_final/a_000004.gt.txt), it seems to me the double colon is not in the proper position: the transcription places it rightmost, but in the image it is at the left end.

I've seen this reversal in many text lines, also in https://github.com/OpenITI/TrainingData/tree/master/JSTORArabic, which is used in #213.

Shreeshrii · 2020-12-16T08:24:07Z

Punctuation marks are an open issue. Someone with knowledge of Arabic and bidi will have to look at it and suggest a solution.

generate_wordstr_box.py uses bidi but leaves punctuation as is.
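
To get an idea of how many ground-truth lines are affected, one could flag transcriptions whose first or last character (in logical order) is punctuation and then compare those lines with their images by hand. A minimal sketch, with `edge_punctuation` as a hypothetical helper; it does not decide which position is correct.

```python
import unicodedata


def edge_punctuation(line):
    """Return (starts_with_punct, ends_with_punct) for a transcription line.

    Positions refer to logical (stored) order, not to the displayed order.
    """
    text = line.strip()
    if not text:
        return (False, False)

    def is_punct(ch):
        # Unicode general categories for punctuation all start with "P"
        return unicodedata.category(ch).startswith("P")

    return (is_punct(text[0]), is_punct(text[-1]))
```

Invisible direction marks at the line edges would hide punctuation from this check, so it works best on text from which such marks have already been removed.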

M3ssman · 2020-12-16T10:00:28Z

Maybe for now it would be convenient to eliminate punctuation from the training data? Our focus is on letters.

PR #205 tries to sanitize this by stripping any RTL Unicode direction marks, which otherwise make it tricky even to step through a line char-by-char with the arrow keys, especially with punctuation and other non-Arabic characters mixed in. I guess punctuation is, unlike regular Arabic, treated as characters with "normal" LTR reading order, like any non-Arabic digits (Latin, Indic or whatever), and is therefore placed at the right end.
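
Stripping direction marks can be sketched as follows. This is not the code of PR #205; the set of code points is an assumption covering the common Unicode bidi control characters, and `strip_bidi_controls` is a hypothetical name.

```python
# Unicode bidi control characters (assumed set):
# ARABIC LETTER MARK, LRM, RLM, the embedding/override controls
# U+202A..U+202E and the isolate controls U+2066..U+2069.
BIDI_CONTROLS = {
    0x061C, 0x200E, 0x200F,
    *range(0x202A, 0x202F),
    *range(0x2066, 0x206A),
}

# Mapping a code point to None makes str.translate delete it.
_DELETE_TABLE = dict.fromkeys(BIDI_CONTROLS)


def strip_bidi_controls(text):
    """Remove invisible bidi control characters, leaving all other text as is."""
    return text.translate(_DELETE_TABLE)
```

After stripping, each remaining character corresponds to something visible, which makes stepping through a line with the arrow keys predictable.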

MihoMahi · 2022-03-07T11:10:48Z

@Shreeshrii, what is the final status of your efforts regarding fine-tuning with the complete GS_Data set? The link above, https://github.com/Shreeshrii/tesstrain-arabic-GS, is not available any more.

Shreeshrii changed the title ~~Issues with RTL training with OCR_GS_Data for Arabic~~ Report on RTL training with OCR_GS_Data for Arabic Dec 1, 2019

wrznr added the pinned Eternal issues which are save from becoming stale label Dec 4, 2019

Shreeshrii mentioned this issue Nov 12, 2020

Numbers in Arabic script are getting reversed tesseract-ocr/tesseract#2263

Closed

Shreeshrii mentioned this issue Dec 16, 2020

Decrease of Recognition: Training from existing Tesseract Model ara #213

Closed
