Report on RTL training with OCR_GS_Data for Arabic #128
OCR_GS_Data has Double-checked Gold Standard Data for Training and Testing OCR Engines for RTL languages. It includes the Arabic data used for Important New Developments in Arabographic Optical Character Recognition (OCR). So far I have used only a subset of the datasets referred to in that report, works #0, #1, #2, #5 and #6, since they are said to have a similar typeface. These have approximately 10000 single-line images and their transcriptions. (As per the report it should be approx. 5000 text lines; the images are provided at both high and low (200 dpi) resolution, hence doubling the number.)
An initial test comparison, with testing only for work #0 (Buldan), is shown in a table in this post. A second run of training using all five datasets referenced above, with certain modifications, shows better results when evaluated on the same training data.
Issues with groundtruth and images:
Issues with tesseract and text2image:
Issues with OCR evaluation tools: For large texts, the accuracy reports are not generated. An alternative is to use https://github.com/impactcentre/ocrevalUAtion. Sample report: eval-Buldan-araKraken.html.txt
See https://github.com/Shreeshrii/tesstrain-arabic-GS for current training data and reports.
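When the evaluation tools fail on large texts, a character error rate can also be computed directly as a rough cross-check. The following is only a minimal sketch, assuming the ground truth and the OCR output are available as plain Python strings; the function names `levenshtein` and `cer` are made up for illustration and are not part of either evaluation tool.

```python
def levenshtein(ref, hyp):
    """Edit distance between two strings, using two rows of the DP table."""
    previous = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        current = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def cer(ref, hyp):
    """Character error rate: edit distance divided by reference length."""
    if not ref:
        raise ValueError("reference text must not be empty")
    return levenshtein(ref, hyp) / len(ref)
```

Memory use is linear in the length of the OCR output, but the running time is still quadratic, so for whole books it is probably more practical to score line by line and sum the distances.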
@Shreeshrii I wonder if/how the ordering of punctuation chars affects training. Given a line image like https://github.com/OpenITI/OCR_GS_Data/blob/master/ara/book_IbnFaqihHamadhani.Buldan/7_final_b/a_000716.png, compared with its transcription (https://github.com/OpenITI/OCR_GS_Data/blob/master/ara/book_IbnFaqihHamadhani.Buldan/7_final/a_000004.gt.txt), it seems to me the double colon is not in the proper position: the transcription places it rightmost, but in the image it is at the left end. I've seen this in many text lines, also in https://github.com/OpenITI/TrainingData/tree/master/JSTORArabic used at #213.
Punctuation marks are an open issue. Someone with knowledge of Arabic and bidi will have to look at it and suggest a solution. generate_wordstr_box.py uses bidi but leaves punctuation as is.
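To see why punctuation behaves differently from letters, it can help to print the Unicode bidi class of each character in a transcription. This is a small sketch using only Python's standard `unicodedata` module; the helper names `bidi_classes` and `non_strong_chars` are hypothetical and unrelated to generate_wordstr_box.py.

```python
import unicodedata

# Bidi classes with a strong direction: L (left-to-right),
# R and AL (right-to-left, AL being Arabic letters).
STRONG = {"L", "R", "AL"}


def bidi_classes(text):
    """Return (character, bidi class) pairs in logical (stored) order."""
    return [(ch, unicodedata.bidirectional(ch)) for ch in text]


def non_strong_chars(text):
    """Return the characters whose direction depends on their context."""
    return [ch for ch, cls in bidi_classes(text)
            if cls not in STRONG and not ch.isspace()]
```

For example, `bidi_classes("ب:")` reports the Arabic letter as `AL` and the colon as `CS` (a weak class), which is consistent with the colon's visual position being decided by the bidi algorithm and the surrounding text rather than by the character itself.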
Maybe for now it would be convenient to eliminate punctuation from the training data? Our focus is on letters. PR #205 tries to sanitize this by wiping off any RTL Unicode direction marks, which otherwise make it tricky even to follow the text char-by-char with the arrow keys, especially with punctuation and other non-Arabic mixed-ins. I guess punctuation, unlike the usual Arabic letters, is treated as chars with "normal" LTR reading order, like any non-Arabic digits (Latin, Indic or whatever), and is therefore moved to the right end.
@Shreeshrii, what is the final status of your efforts regarding fine-tuning with the complete GS_Data set? The link above
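For reference, stripping direction marks from a ground truth line could look roughly like the sketch below. This is not the code from the PR; the set of code points is my own assumption of what counts as a direction mark (LRM, RLM, ALM, plus the embedding, override and isolate controls), and `strip_bidi_controls` is a made-up name.

```python
# Assumed set of invisible Unicode direction controls to remove.
BIDI_CONTROLS = {
    "\u200e",  # LEFT-TO-RIGHT MARK
    "\u200f",  # RIGHT-TO-LEFT MARK
    "\u061c",  # ARABIC LETTER MARK
    "\u202a", "\u202b", "\u202c", "\u202d", "\u202e",  # embeddings, overrides, pop
    "\u2066", "\u2067", "\u2068", "\u2069",            # isolates, pop isolate
}


def strip_bidi_controls(text):
    """Remove direction controls, leaving all other characters in place."""
    return "".join(ch for ch in text if ch not in BIDI_CONTROLS)
```

Note that this only removes the invisible controls; the logical order of letters and punctuation is left untouched, so it does not by itself settle where a colon should sit.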
Similar to @stweil's training for Fraktur, I am collecting info here regarding fine-tuning RTL training with OCR_GS_Data for Arabic. Some of this has already been reported earlier in other threads.