Best Traineddata Feedback - Gujarati - ન - ત Confusion #60

Open
Shreeshrii opened this issue Jun 23, 2017 · 2 comments

Comments

@Shreeshrii
Contributor

When using tesseract 4.0 with --oem 1 (LSTM) with Gujarati traineddata, ન is being recognized as ત in the attached image.

The same image, when recognized with --oem 0 (the legacy engine), reads ન correctly but has other accuracy problems.

So it looks like the LSTM model for Gujarati has not been trained with this font.
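For anyone trying to reproduce the comparison, here is a minimal sketch of the two command lines involved. It only builds the argument lists; the image name `sample.png`, the output base names, and the helper `tesseract_cmd` are hypothetical placeholders, not the exact invocation used above. Note that `--oem 0` only works if the traineddata in use contains a legacy model.

```python
import shlex


def tesseract_cmd(image, outbase, lang="guj", oem=1, psm=None, tessdata_dir=None):
    """Build a tesseract command line as a list of arguments (hypothetical helper)."""
    cmd = ["tesseract", image, outbase, "-l", lang, "--oem", str(oem)]
    if psm is not None:
        # Page segmentation mode, e.g. 6 for a single uniform block of text.
        cmd += ["--psm", str(psm)]
    if tessdata_dir is not None:
        # Directory holding the .traineddata files to use.
        cmd += ["--tessdata-dir", tessdata_dir]
    return cmd


if __name__ == "__main__":
    # Print one command per engine: 0 = legacy, 1 = LSTM.
    for oem in (0, 1):
        print(shlex.join(tesseract_cmd("sample.png", f"out-oem{oem}", oem=oem)))
```

Each list can be passed to `subprocess.run` as-is if tesseract is on the PATH; the two resulting text files can then be compared side by side.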

Image and ground truth file are attached.
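With a ground truth file available, the confusion can be quantified rather than spotted by eye. The following is a rough sketch, assuming the OCR output and ground truth are plain Unicode strings; `char_confusions` is a hypothetical helper, and the alignment is per code point, so results around combining marks and conjuncts are only approximate.

```python
from collections import Counter
from difflib import SequenceMatcher


def char_confusions(truth, ocr):
    """Count (expected, recognized) code point pairs where OCR differs from ground truth."""
    counts = Counter()
    matcher = SequenceMatcher(None, truth, ocr, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # Only equal-length replacements map one-to-one onto characters;
        # insertions, deletions and uneven replacements are skipped.
        if tag == "replace" and (i2 - i1) == (j2 - j1):
            for expected, got in zip(truth[i1:i2], ocr[j1:j2]):
                if expected != got:
                    counts[(expected, got)] += 1
    return counts


if __name__ == "__main__":
    # Toy strings for illustration, not taken from the attached files.
    for pair, n in char_confusions("નમન", "તમત").most_common():
        print(pair, n)
```

Running this over the real ground truth and the `--oem 1` output should surface ન → ત as a frequent pair if the confusion is systematic.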

It would be helpful to have the ability to finetune using real images in addition to synthetic data.

guj.ag.exp0-GT.txt
guj ag exp0

@Shreeshrii
Contributor Author

Tested just now with both the best/guj and best/Gujarati traineddata, using --psm 6.

While the ન - ત confusion is still there, the Gujarati traineddata seems better than guj: it drops fewer words in the OCR output.
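"Dropping fewer words" can be checked against the ground truth with a word-level alignment. This is a minimal sketch under the assumption that words are whitespace-separated; `count_dropped_words` is a hypothetical helper, and it only estimates drops from the alignment, so a badly misrecognized word counts as a substitution, not a drop.

```python
from difflib import SequenceMatcher


def count_dropped_words(truth, ocr):
    """Estimate how many ground-truth words have no counterpart in the OCR output."""
    truth_words, ocr_words = truth.split(), ocr.split()
    matcher = SequenceMatcher(None, truth_words, ocr_words, autojunk=False)
    dropped = 0
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "delete":
            # Words present in the ground truth with nothing in the OCR output.
            dropped += i2 - i1
        elif tag == "replace":
            # If fewer OCR words replace more truth words, count the surplus as dropped.
            dropped += max(0, (i2 - i1) - (j2 - j1))
    return dropped
```

Calling it once with the guj output and once with the Gujarati output, against the same ground truth, gives two numbers that can be compared directly.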

@Shreeshrii changed the title from "Gujarati - ન - ત Confusion" to "Best Traineddata Feedback - Gujarati - ન - ત Confusion" on Aug 4, 2017
@amitdo

amitdo commented Oct 15, 2018

> While the ન - ત Confusion is still there, Gujarati traineddata seems better than guj - it is dropping fewer words in OCR output.

tesseract-ocr/tesseract#1264
