Vedic Sanskrit Traineddata for 4.0 #61

Open
Shreeshrii opened this issue Jul 11, 2017 · 9 comments

Comments

@Shreeshrii
Contributor

Shreeshrii commented Jul 11, 2017

san.traineddata in this repo (4.0 alpha version) gives the following accuracy for the above sample of images:

| Character/Word | Error Rate % |
| --- | --- |
| CER | 11.55 |
| WER | 8.85 |
| WER (order independent) | 7.44 |

Accuracy improves after further training; the resulting traineddata is at
https://github.com/Shreeshrii/tess4training/blob/a09bbe913b25b0e623f1bc267a60b192d8e0ccc6/san.traineddata

| Character/Word | Error Rate % |
| --- | --- |
| CER | 3.73 |
| WER | 4.71 |
| WER (order independent) | 3.82 |
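
For readers unfamiliar with the metrics in these tables: CER and WER are usually computed as edit distance divided by the length of the reference text, at character and word level respectively. The sketch below shows that common definition only. It is not the code behind the eval report, it treats each Unicode code point as one character (an assumption that matters for Devanagari, where a visible syllable can span several code points), and it does not cover the order-independent WER variant. The sample strings are made up for illustration.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (each edit costs 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion from ref
                            curr[j - 1] + 1,      # insertion into ref
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(ref, hyp):
    """Edit distance as a percentage of the reference length."""
    if not ref:
        raise ValueError("reference must not be empty")
    return 100.0 * edit_distance(ref, hyp) / len(ref)


def cer(ref_text, hyp_text):
    """Character error rate, counting Unicode code points (an assumption)."""
    return error_rate(list(ref_text), list(hyp_text))


def wer(ref_text, hyp_text):
    """Word error rate, splitting on whitespace."""
    return error_rate(ref_text.split(), hyp_text.split())


if __name__ == "__main__":
    reference = "agnim ile purohitam"   # made-up ground truth
    hypothesis = "agnim ile purohitan"  # made-up OCR output
    print(f"CER {cer(reference, hypothesis):.2f}")
    print(f"WER {wer(reference, hypothesis):.2f}")
```

A single wrong character costs far more in WER than in CER, which is why the two numbers in the tables can move differently.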

I hope the accuracy of the new traineddata will improve further as training converges.

Eval Report at https://shreeshrii.github.io/tess4eval-san/

Update:

Further training actually led to lower accuracy on this sample set.

| Character/Word | Error Rate % |
| --- | --- |
| CER | 6.65 |
| WER | 9.43 |
| WER (order independent) | 7.94 |

Stopped training with
https://github.com/Shreeshrii/tess4training/blob/8dc4f8488e74d4a934168844932c8a526d70c1d9/bihtune.traineddata
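
Since further training made results worse on this sample set, one way to guard against that is to evaluate each saved model on a held-out set and keep the one with the lowest CER. The sketch below is only an illustration of that idea, not what was actually done here. The model labels are hypothetical names for the three evaluations reported above, and `patience` is an assumed parameter.

```python
def best_checkpoint(evals):
    """Return the (label, cer) pair with the lowest CER."""
    return min(evals, key=lambda item: item[1])


def should_stop(cer_history, patience=1):
    """True when none of the last `patience` evaluations beat the best earlier CER."""
    if len(cer_history) <= patience:
        return False
    best_before = min(cer_history[:-patience])
    return all(c >= best_before for c in cer_history[-patience:])


if __name__ == "__main__":
    # CER values from this thread; the labels are hypothetical.
    evals = [
        ("repo_san", 11.55),
        ("trained", 3.73),
        ("further_trained", 6.65),
    ]
    print(best_checkpoint(evals))
    print(should_stop([cer for _, cer in evals]))
```

With a larger `patience`, training continues through a short-lived regression before giving up, at the cost of more compute.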

@Shreeshrii
Contributor Author

Shreeshrii commented Jul 20, 2017

@Shreeshrii
Contributor Author

A user who tested the above on a sample of 30 pages reported an accuracy of 90%.

@theraysmith I hope that the new version of Sanskrit traineddata will be able to OCR both Classical Sanskrit and Vedic Sanskrit. When should we expect the new (4.0.0beta) version of traineddata files?

@Shreeshrii
Contributor Author

Shreeshrii commented Jul 26, 2017

Samples of Samaveda Sanskrit text, which uses a different set of Vedic accents from the Rigveda sample above. I have not tried training for this.

http://sanskrit.safire.com/image/SamaVeda.gif

http://sanskritweb.net/samaveda/sample.gif

http://vedicreserve.mum.edu/sama_veda/sama_veda.pdf

Update:

Unicode Samaveda text is available from http://www.parankusa.org/SamaBrowse.aspx

@Shreeshrii changed the title from "Improved Sanskrit Traineddata for 4.0" to "Vedic Sanskrit Traineddata for 4.0" on Aug 4, 2017

@ksdmahesh

How do I train for Vedic Sanskrit?
Please help me with a step-by-step process.

@Shreeshrii
Contributor Author

@ksdmahesh What kind of text do you want to train for? You will need a large text corpus in UTF-8 format and Unicode fonts that render it correctly.
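
Before picking fonts, it can help to list which Devanagari and Vedic code points the corpus actually contains, since those are the characters the fonts have to render. The sketch below is a minimal example of that inventory using only the Python standard library. It assumes the corpus has already been read as UTF-8 text, and it does not check font coverage. The block ranges are from the Unicode standard; the function names are made up for illustration.

```python
from collections import Counter
import unicodedata

# Unicode blocks relevant to accented Vedic text.
BLOCKS = {
    "Devanagari": (0x0900, 0x097F),
    "Vedic Extensions": (0x1CD0, 0x1CFF),
    "Devanagari Extended": (0xA8E0, 0xA8FF),
}


def block_of(ch):
    """Return the name of the block containing ch, or None if it is in none of them."""
    cp = ord(ch)
    for name, (lo, hi) in BLOCKS.items():
        if lo <= cp <= hi:
            return name
    return None


def char_inventory(text):
    """Count every character that falls in one of the blocks above."""
    return Counter(ch for ch in text if block_of(ch) is not None)


def describe(inventory):
    """Format the inventory as one line per code point, sorted by code point."""
    lines = []
    for ch, count in sorted(inventory.items()):
        name = unicodedata.name(ch, "UNNAMED")
        lines.append(f"U+{ord(ch):04X} {name} ({block_of(ch)}) x{count}")
    return lines


if __name__ == "__main__":
    # In practice: text = open("corpus.txt", encoding="utf-8").read()
    # ("corpus.txt" is a hypothetical file name).
    sample = "\u0905\u0917\u094D\u0928\u093F\u0951 \u1CD0"
    for line in describe(char_inventory(sample)):
        print(line)
```

Any accent mark that shows up in this list but is missing from a candidate font will not appear in the rendered training images.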

@ksdmahesh

OK, thank you.
After collecting the data, how do I start?

@Shreeshrii
Contributor Author

Sorry for the delay in replying.

See https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00
for the details about training.
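
As a rough outline of the fine-tuning flow described on that wiki page, the sketch below only assembles the command lines and prints them without running anything. The tool names and flags (`combine_tessdata -e`, and `lstmtraining` with `--continue_from`, `--traineddata`, `--train_listfile`, `--model_output`, `--max_iterations`, `--stop_training`) are the ones I understand the wiki to use, but please check them against the page for your Tesseract version. All paths and the iteration count are hypothetical placeholders, and generating the training line data is not covered.

```python
import shlex


def extract_lstm_cmd(base_traineddata, lstm_out):
    """Extract the LSTM model from an existing traineddata file."""
    return ["combine_tessdata", "-e", base_traineddata, lstm_out]


def finetune_cmd(lstm_model, traineddata, train_listfile, model_output, max_iterations):
    """Continue training from the extracted LSTM model on new line data."""
    return [
        "lstmtraining",
        "--continue_from", lstm_model,
        "--traineddata", traineddata,
        "--train_listfile", train_listfile,
        "--model_output", model_output,
        "--max_iterations", str(max_iterations),
    ]


def finalize_cmd(checkpoint, traineddata, final_out):
    """Convert a training checkpoint into a usable traineddata file."""
    return [
        "lstmtraining", "--stop_training",
        "--continue_from", checkpoint,
        "--traineddata", traineddata,
        "--model_output", final_out,
    ]


if __name__ == "__main__":
    # Every path below is a hypothetical placeholder.
    steps = [
        extract_lstm_cmd("tessdata/san.traineddata", "work/san.lstm"),
        finetune_cmd("work/san.lstm", "work/san/san.traineddata",
                     "work/san.training_files.txt", "work/vedic", 400),
        finalize_cmd("work/vedic_checkpoint", "work/san/san.traineddata",
                     "work/san_vedic.traineddata"),
    ]
    for step in steps:
        print(" ".join(shlex.quote(arg) for arg in step))
```

Building the commands as lists keeps quoting safe if they are later passed to `subprocess.run`.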

I am doing a test training run using portions of online texts of the Rigveda and Yajurveda. (I have deleted the GitHub repos referenced in earlier messages in this thread.) I will share the traineddata file when done.

@bohrbrar

@Shreeshrii,
Good work. How close are you to finishing the Rigveda and Yajurveda models? Can you share the traineddata files?
Regards,
Bohar

@sd-dwivedi

@Shreeshrii Please share the traineddata file for Vedic Sanskrit pandulipi (manuscripts).
