Report on RTL training with OCR_GS_Data for Arabic #128

Open
Shreeshrii opened this issue Dec 1, 2019 · 11 comments
Labels
pinned Eternal issues which are save from becoming stale

Comments

@Shreeshrii
Collaborator

Shreeshrii commented Dec 1, 2019

Similar to @stweil's training for Fraktur, I am collecting here information on fine-tuning for RTL text with OCR_GS_Data for Arabic. Some of this has already been reported in other threads.

@Shreeshrii
Collaborator Author

Shreeshrii commented Dec 1, 2019

OCR_GS_Data has Double-checked Gold Standard Data for Training and Testing OCR Engines for RTL languages. It includes Arabic data used for Important New Developments in Arabographic Optical Character Recognition (OCR).

So far I have used only a subset of the datasets referred to in the above report: works #0, #1, #2, #5 and #6, since they are said to have a similar typeface. These comprise approximately 10000 single-line images and their transcriptions. (As per the report there should be approx. 5000 text lines; the images are provided at both high and low (200 dpi) resolution, which doubles the number.)

@Shreeshrii
Collaborator Author

Shreeshrii commented Dec 1, 2019

An initial test comparison, testing only work #0 (Buldan), is shown in a table in this post.

A second training run using all five datasets referenced above, with certain modifications, shows better results when evaluated on the same training data.

| Type | Rate |
| --- | --- |
| CER | 0.70 |
| WER | 1.71 |
| WER (order independent) | 1.65 |
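
For readers unfamiliar with these metrics, here is a minimal sketch of how character and word error rates are commonly defined via edit distance. It is not the evaluation tool that produced the figures above. The function names are illustrative, the choice to return percentages is an assumption, and the order-independent variant is a simple bag-of-words approximation.

```python
from collections import Counter


def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or word lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """Character error rate in percent; assumes a non-empty reference."""
    return 100.0 * edit_distance(ref, hyp) / len(ref)


def wer(ref, hyp):
    """Word error rate in percent, with words split on whitespace."""
    ref_words = ref.split()
    return 100.0 * edit_distance(ref_words, hyp.split()) / len(ref_words)


def wer_order_independent(ref, hyp):
    """Bag-of-words error rate in percent: word order is ignored."""
    ref_bag, hyp_bag = Counter(ref.split()), Counter(hyp.split())
    missing = sum((ref_bag - hyp_bag).values())  # reference words not recognized
    extra = sum((hyp_bag - ref_bag).values())    # recognized words not in reference
    return 100.0 * max(missing, extra) / sum(ref_bag.values())
```

Because the order-independent variant ignores word positions, it can never exceed the ordinary WER for the same pair of texts. The same relationship holds between the last two rows of the table.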

@Shreeshrii
Collaborator Author

Shreeshrii commented Dec 1, 2019

Issues with groundtruth and images:

@Shreeshrii
Collaborator Author

Issues with tesstrain Makefile:

  • RTL text is not handled correctly by either generate_line_box.py or generate_wordstr_box.py. See comment.

PR #127 proposes a new script to handle these.

@Shreeshrii
Collaborator Author

Shreeshrii commented Dec 1, 2019

Issues with tesseract and text2image:

  • Need to use --psm 13 for correct recognition.
  • Need to use -c page_separator=''
  • text2image does not create correct character boxes for certain images (error: Bad box coordinates in boxfile string! ح). The resulting WordStr box files for these images have more than 2 lines. Discard these images; otherwise training does not converge at all. This can be done as follows:
find /home/ubuntu/OCR_GS_Data/ara/ground-truth -type f -name '*Buldan*.box' -exec bash -c '[[ $(wc -l < "$1") -gt 1 ]] && echo "$1"' _ '{}' \;  > err.txt
sed -i -e 's/^/rm /' err.txt
sed -i -e 's/box$/*/' err.txt
bash err.txt
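
The same clean-up can be sketched in Python with a dry-run step, so the list of files can be reviewed before anything is deleted. This is only an illustration under stated assumptions: the function names, the default file pattern, and the `max_lines=1` threshold (mirroring the `-gt 1` test in the shell commands) are not part of tesstrain.

```python
from pathlib import Path


def find_bad_box_files(root, pattern="*Buldan*.box", max_lines=1):
    """Return box files under root that contain more than max_lines lines."""
    bad = []
    for box in sorted(Path(root).rglob(pattern)):
        # Read as bytes so odd encodings cannot break the count.
        # Unlike `wc -l`, this also counts a final line without a newline.
        with box.open("rb") as fh:
            n_lines = sum(1 for _ in fh)
        if n_lines > max_lines:
            bad.append(box)
    return bad


def sibling_files(box):
    """All files sharing the box file's base name, e.g. x.png, x.gt.txt, x.box."""
    stem = box.name[: -len("box")]  # keeps the trailing dot, e.g. "x."
    return sorted(box.parent.glob(stem + "*"))


def discard(bad_boxes, dry_run=True):
    """List the files that belong to bad box files; delete them unless dry_run."""
    affected = []
    for box in bad_boxes:
        for path in sibling_files(box):
            affected.append(path)
            if not dry_run:
                path.unlink()
    return affected
```

Calling `discard(find_bad_box_files(root))` only reports what would be removed. Passing `dry_run=False` performs the deletion.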

@Shreeshrii
Collaborator Author

Shreeshrii commented Dec 1, 2019

Issues with ocr evaluation tools:

For large texts, the accuracy reports are not generated; the error is "accuracy: text stream is too long". See issue.

An alternative is to use https://github.com/impactcentre/ocrevalUAtion

Sample report : eval-Buldan-araKraken.html.txt

@Shreeshrii Shreeshrii changed the title Issues with RTL training with OCR_GS_Data for Arabic Report on RTL training with OCR_GS_Data for Arabic Dec 1, 2019
@wrznr wrznr added the pinned Eternal issues which are save from becoming stale label Dec 4, 2019
@Shreeshrii
Collaborator Author

See https://github.com/Shreeshrii/tesstrain-arabic-GS for the current training data and reports.

@M3ssman
Contributor

M3ssman commented Dec 16, 2020

@Shreeshrii I wonder if/how the ordering of punctuation chars affects training.

Given a line image like https://github.com/OpenITI/OCR_GS_Data/blob/master/ara/book_IbnFaqihHamadhani.Buldan/7_final_b/a_000716.png, compared with its transcription (https://github.com/OpenITI/OCR_GS_Data/blob/master/ara/book_IbnFaqihHamadhani.Buldan/7_final/a_000004.gt.txt), it seems to me the double colon is not in the proper position: the transcription places it rightmost, but in the image it is at the left end.

I've seen this swap in many text lines, also in https://github.com/OpenITI/TrainingData/tree/master/JSTORArabic, which is used in #213.

@Shreeshrii
Collaborator Author

Punctuation marks are an open issue. Someone with knowledge of Arabic and bidi will have to look at it and suggest a solution.

generate_wordstr_box.py uses bidi but leaves punctuation as is.
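
One way to see why punctuation behaves differently from letters is to look at the Unicode bidirectional class of each character. The following is a minimal sketch using only Python's standard `unicodedata` module; it is not what generate_wordstr_box.py does, and the helper names are illustrative.

```python
import unicodedata

# Bidi classes with a strong direction: L (left-to-right),
# R (right-to-left) and AL (Arabic letter).
STRONG_CLASSES = {"L", "R", "AL"}


def bidi_classes(text):
    """Return (character, bidi class) pairs for every character in text."""
    return [(ch, unicodedata.bidirectional(ch)) for ch in text]


def weak_or_neutral(text):
    """Return the characters whose direction depends on their context."""
    return [ch for ch, cls in bidi_classes(text) if cls not in STRONG_CLASSES]
```

Characters returned by `weak_or_neutral` (punctuation, spaces, digits) have no strong direction of their own, so their visual position is decided by the surrounding text and the paragraph direction. That makes them the likely place for transcription and image to disagree.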

@M3ssman
Contributor

M3ssman commented Dec 16, 2020

Maybe for now it would be convenient to remove punctuation from the training data? Our focus is on letters.

PR #205 tries to sanitize this by stripping any RTL Unicode direction marks, which otherwise make it tricky to step through the text char-by-char with the arrow keys, especially with punctuation and other non-Arabic characters mixed in. I guess punctuation, unlike regular Arabic letters, is treated as a character with "normal" LTR reading order, like any non-Arabic digits (Latin, Indic or whatever), and therefore ends up at the right end.
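
Stripping direction marks from a transcription line could look roughly like the sketch below. This is an illustration only, not the code from the PR: the function name and the exact set of characters removed are assumptions, based on the explicit directional formatting characters defined by Unicode.

```python
# Explicit Unicode directional formatting characters (assumed set).
DIRECTION_MARKS = {
    "\u200e",  # LEFT-TO-RIGHT MARK
    "\u200f",  # RIGHT-TO-LEFT MARK
    "\u061c",  # ARABIC LETTER MARK
    "\u202a",  # LEFT-TO-RIGHT EMBEDDING
    "\u202b",  # RIGHT-TO-LEFT EMBEDDING
    "\u202c",  # POP DIRECTIONAL FORMATTING
    "\u202d",  # LEFT-TO-RIGHT OVERRIDE
    "\u202e",  # RIGHT-TO-LEFT OVERRIDE
    "\u2066",  # LEFT-TO-RIGHT ISOLATE
    "\u2067",  # RIGHT-TO-LEFT ISOLATE
    "\u2068",  # FIRST STRONG ISOLATE
    "\u2069",  # POP DIRECTIONAL ISOLATE
}


def strip_direction_marks(text):
    """Remove invisible direction marks while keeping all visible characters."""
    return "".join(ch for ch in text if ch not in DIRECTION_MARKS)
```

The visible characters and their logical order are left untouched, so this only removes the invisible marks. It does not by itself decide where punctuation should sit in the transcription.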

@MihoMahi

MihoMahi commented Mar 7, 2022

@Shreeshrii, what is the final status of your efforts regarding fine-tuning with the complete GS_Data set? The link above,
https://github.com/Shreeshrii/tesstrain-arabic-GS,
is not available any more?
