Vedic Sanskrit Traineddata for 4.0 #61
I separately trained for Vedic Sanskrit using text from the Rigveda. The source files are at https://github.com/Shreeshrii/tess4training-vedic and the resulting traineddata files are as follows:
and accuracy on sample page -
A special thank you to the Travis team for providing the resources for training. |
A user who tested the above traineddata on a sample of 30 pages reported an accuracy of 90%. @theraysmith I hope that the new version of the Sanskrit traineddata will be able to OCR both Classical Sanskrit and Vedic Sanskrit. When should we expect the new (4.0.0beta) version of the traineddata files? |
Sample of Samaveda Sanskrit text - it uses a different set of Vedic accents from the Rigveda sample above. I have not tried training for this. http://sanskrit.safire.com/image/SamaVeda.gif http://sanskritweb.net/samaveda/sample.gif http://vedicreserve.mum.edu/sama_veda/sama_veda.pdf Update: Unicode Samaveda text is available from http://www.parankusa.org/SamaBrowse.aspx |
how to train |
@ksdmahesh What kind of text do you want to train for? You will need a large text corpus in UTF-8 format and Unicode fonts that render it correctly. |
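Before training, it can help to confirm that the corpus really is valid UTF-8 and to see which Vedic accent marks it contains, since the Rigveda and Samaveda samples in this thread use different accent sets and the chosen fonts have to render all of them. The following is a minimal sketch, not part of the Tesseract tooling: the function names are hypothetical, and the choice of Unicode ranges (the stress signs U+0951 to U+0954, the Vedic Extensions block, and the Devanagari Extended block) is an assumption about what counts as a Vedic accent here.

```python
import unicodedata
from collections import Counter

# Unicode blocks that hold Vedic accents and cantillation marks.
VEDIC_EXTENSIONS = (0x1CD0, 0x1CFF)
DEVANAGARI_EXTENDED = (0xA8E0, 0xA8FF)
# Stress and accent signs inside the main Devanagari block (udatta, anudatta, ...).
DEVANAGARI_STRESS_SIGNS = (0x0951, 0x0954)


def in_range(ch, bounds):
    """Return True if the code point of `ch` lies within `bounds` (inclusive)."""
    return bounds[0] <= ord(ch) <= bounds[1]


def accent_inventory(text):
    """Count the characters in `text` that fall in the Vedic accent ranges above."""
    counts = Counter()
    for ch in text:
        if (in_range(ch, DEVANAGARI_STRESS_SIGNS)
                or in_range(ch, VEDIC_EXTENSIONS)
                or in_range(ch, DEVANAGARI_EXTENDED)):
            counts[ch] += 1
    return counts


def load_corpus(path):
    """Read a corpus file; raises UnicodeDecodeError if it is not valid UTF-8."""
    with open(path, encoding="utf-8") as handle:
        return handle.read()


def describe(counts):
    """Turn the counts into (code point, Unicode name, count) rows for review."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"), n)
        for ch, n in counts.most_common()
    ]
```

Calling `describe(accent_inventory(load_corpus(path)))` on a corpus file lists the accent marks present, which can then be checked against what each training font actually renders.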
ok thank you. |
Sorry for the delay in replying. See https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00 I am doing a test training using portions of online texts of the Rigveda and Yajurveda. (I had deleted the GitHub repos referenced in earlier messages in this thread.) I will share the traineddata file when done. |
@Shreeshrii , |
@Shreeshrii please share the traineddata file for Vedic pandulipi Sanskrit |
san.traineddata in this repo (4.0 alpha version) gives the following accuracy for the above sample of images:
Improved accuracy is obtained after training using
https://github.com/Shreeshrii/tess4training/blob/a09bbe913b25b0e623f1bc267a60b192d8e0ccc6/san.traineddata
I hope the accuracy of the newly trained traineddata will improve further as the training converges.
Eval Report at https://shreeshrii.github.io/tess4eval-san/
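The thread does not say how the accuracy figures were computed, so the following is only a sketch of one common definition: character accuracy as one minus the Levenshtein edit distance between the ground truth and the OCR output, divided by the length of the ground truth. The function names are hypothetical. Because the comparison works on Unicode code points, a dropped or wrong Vedic accent mark counts as one error, which is relevant when comparing traineddata files on accented text.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance between two strings, computed row by row."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_ch in enumerate(reference, start=1):
        current = [i]
        for j, hyp_ch in enumerate(hypothesis, start=1):
            cost = 0 if ref_ch == hyp_ch else 1
            current.append(min(
                previous[j] + 1,         # deletion from the reference
                current[j - 1] + 1,      # insertion into the reference
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def character_accuracy(reference, hypothesis):
    """Return 1 - (edit distance / reference length), clamped at 0."""
    if not reference:
        return 1.0 if not hypothesis else 0.0
    errors = edit_distance(reference, hypothesis)
    return max(0.0, 1.0 - errors / len(reference))
```

Running `character_accuracy(ground_truth, ocr_text)` for each page with two different traineddata files gives numbers that can be compared directly, provided both texts use the same Unicode normalization.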
Update:
Further training actually led to lower accuracy on this sample set.
Stopped training with
https://github.com/Shreeshrii/tess4training/blob/8dc4f8488e74d4a934168844932c8a526d70c1d9/bihtune.traineddata