Vedic Sanskrit Traineddata for 4.0 #61

Shreeshrii · 2017-07-11T08:25:28Z

See https://github.com/Shreeshrii/imagessan/tree/master/groundtruthimages for the images used for testing.

san.traineddata in this repo (4.0 alpha version) gives the following accuracy for the above sample of images:

Character/Word Error Rate	%
CER	11.55
WER	8.85
WER (order independent)	7.44

Improved accuracy is obtained after training, using
https://github.com/Shreeshrii/tess4training/blob/a09bbe913b25b0e623f1bc267a60b192d8e0ccc6/san.traineddata

Character/Word Error Rate	%
CER	3.73
WER	4.71
WER (order independent)	3.82

I hope the accuracy of the newly trained traineddata will improve further as the training converges.

Eval Report at https://shreeshrii.github.io/tess4eval-san/
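For readers unfamiliar with the metrics in the tables above, here is a minimal sketch of how CER and WER are commonly computed from a ground-truth text and an OCR output. It is not the code of the evaluation tool used for the report. The bag-of-words definition of the order-independent WER is an assumption, and characters are counted as Unicode code points, so Devanagari results may differ from tools that count grapheme clusters.

```python
from collections import Counter


def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (characters or words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """Character error rate in percent, relative to the reference length."""
    return 100.0 * edit_distance(ref, hyp) / len(ref)


def wer(ref, hyp):
    """Word error rate in percent, on whitespace-separated tokens."""
    ref_words = ref.split()
    return 100.0 * edit_distance(ref_words, hyp.split()) / len(ref_words)


def wer_order_independent(ref, hyp):
    """Bag-of-words error rate in percent (assumed definition)."""
    ref_bag, hyp_bag = Counter(ref.split()), Counter(hyp.split())
    missing = sum((ref_bag - hyp_bag).values())   # reference words not found
    spurious = sum((hyp_bag - ref_bag).values())  # extra words in the output
    return 100.0 * max(missing, spurious) / sum(ref_bag.values())
```

Swapping two words in the output raises the ordinary WER but leaves the order-independent WER unchanged, which is why the third row of each table is lower than the second. The functions assume a non-empty reference text.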

Update:

Further training actually led to lower accuracy on this sample set.

Character/Word Error Rate	%
CER	6.65
WER	9.43
WER (order independent)	7.94

Stopped training with
https://github.com/Shreeshrii/tess4training/blob/8dc4f8488e74d4a934168844932c8a526d70c1d9/bihtune.traineddata
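Because accuracy on the sample set got worse with further training, it helps to evaluate each saved model on the same sample and keep the one with the lowest error. A minimal sketch follows. The CER values are the ones reported in this thread, while the labels are hypothetical names added for illustration.

```python
def best_checkpoint(results):
    """Return the (label, error) pair with the lowest error on the sample set."""
    return min(results, key=lambda item: item[1])


# CER values are taken from the tables in this thread; the labels are
# hypothetical names, not actual file names.
results = [
    ("san.traineddata (4.0 alpha)", 11.55),
    ("after training", 3.73),
    ("after further training", 6.65),
]

label, error = best_checkpoint(results)
print(f"lowest CER: {label} ({error}%)")
```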

Shreeshrii · 2017-07-20T15:42:30Z

I separately trained for Vedic Sanskrit using text from Rigveda.

The source files are at https://github.com/Shreeshrii/tess4training-vedic

Resulting traineddata files are as follows:

and the accuracy on the sample page
(https://github.com/Shreeshrii/tess4training/blob/master/scanned.tulasi.exp0.tif) when I stopped training was

A special thank you to the Travis team for providing the resources for training.

Shreeshrii · 2017-07-20T15:50:09Z

A user who tested the above traineddata on a sample of 30 pages reported an accuracy of 90%.

@theraysmith I hope that the new version of Sanskrit traineddata will be able to OCR both Classical Sanskrit and Vedic Sanskrit. When should we expect the new (4.0.0beta) version of traineddata files?

Shreeshrii · 2017-07-26T06:26:15Z

Samples of Samaveda Sanskrit text, which uses a different set of Vedic accents from the Rigveda sample above. I have not tried training for this.

http://sanskrit.safire.com/image/SamaVeda.gif

http://sanskritweb.net/samaveda/sample.gif

http://vedicreserve.mum.edu/sama_veda/sama_veda.pdf

Update:

Unicode Samaveda text is available from http://www.parankusa.org/SamaBrowse.aspx

ksdmahesh · 2019-03-03T09:58:58Z

How do I train for Vedic Sanskrit?
Please help me with a step-by-step process.

Shreeshrii · 2019-03-05T04:11:08Z

@ksdmahesh What kind of text do you want to train for? You will need a large text corpus in UTF-8 format and Unicode fonts that render it correctly.
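As a first check of such a corpus, a minimal sketch like the one below reads the file strictly as UTF-8 and lists every character it contains, flagging Vedic accent marks. The function names and the two code point ranges are choices made for this illustration and are not exhaustive. Checking that a font actually renders these characters would need a font library and is not shown.

```python
import unicodedata
from collections import Counter

# Code point ranges holding Vedic accent marks (not exhaustive):
# udatta/anudatta in the Devanagari block and the Vedic Extensions block.
VEDIC_RANGES = [(0x0951, 0x0952), (0x1CD0, 0x1CFF)]


def is_vedic_mark(ch):
    """True if the character falls in one of the ranges above."""
    return any(lo <= ord(ch) <= hi for lo, hi in VEDIC_RANGES)


def char_inventory(text):
    """Count every non-whitespace character in the corpus."""
    return Counter(ch for ch in text if not ch.isspace())


def report(text):
    """Print each character with its code point, Unicode name and count."""
    for ch, count in sorted(char_inventory(text).items()):
        name = unicodedata.name(ch, "UNNAMED")
        flag = " (Vedic)" if is_vedic_mark(ch) else ""
        print(f"U+{ord(ch):04X} {name}{flag}: {count}")


def load_corpus(path):
    """Read the corpus as UTF-8; invalid bytes raise UnicodeDecodeError."""
    with open(path, encoding="utf-8", errors="strict") as handle:
        return handle.read()


# First word of the Rigveda with its accent marks, written as escapes.
sample = "\u0905\u0952\u0917\u094d\u0928\u093f\u092e\u0940\u0951\u0933\u0947"
report(sample)
```

Characters that appear only a handful of times in the inventory are the ones most likely to be under-represented in training.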

ksdmahesh · 2019-03-05T04:24:44Z

OK, thank you.
After collecting the data, how do I start?

Shreeshrii · 2019-03-29T08:13:28Z

Sorry for the delay in replying.

See https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00
for the details about training.
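As a rough outline of the fine-tuning steps described on that wiki page, the sketch below only builds the command lines and does not run anything by itself. The directory names in the usage note are placeholders, and the Python wrapper functions are hypothetical helpers for illustration. It assumes the training text and fonts are already in place and that every character in the corpus is already covered by the existing model. If new characters such as Vedic accents are added, the wiki describes a variant using `--old_traineddata` that is not shown here.

```python
import subprocess


def finetune_commands(lang, langdata_dir, tessdata_dir, fonts_dir,
                      train_dir, model_prefix, max_iterations=400):
    """Build the four command lines for a fine-tuning run (nothing is executed)."""
    base = f"{tessdata_dir}/{lang}.traineddata"
    lstm = f"{train_dir}/{lang}.lstm"
    return [
        # 1. Render the training text with the fonts into .lstmf line data.
        ["tesstrain.sh", "--lang", lang, "--linedata_only",
         "--noextract_font_properties",
         "--langdata_dir", langdata_dir, "--tessdata_dir", tessdata_dir,
         "--fonts_dir", fonts_dir, "--output_dir", train_dir],
        # 2. Extract the LSTM model from the existing traineddata.
        ["combine_tessdata", "-e", base, lstm],
        # 3. Continue training from that model on the new line data.
        ["lstmtraining", "--model_output", model_prefix,
         "--continue_from", lstm, "--traineddata", base,
         "--train_listfile", f"{train_dir}/{lang}.training_files.txt",
         "--max_iterations", str(max_iterations)],
        # 4. Convert the last checkpoint into a usable traineddata file.
        ["lstmtraining", "--stop_training",
         "--continue_from", f"{model_prefix}_checkpoint",
         "--traineddata", base,
         "--model_output", f"{model_prefix}.traineddata"],
    ]


def run_all(commands):
    """Run each command in order, stopping at the first failure."""
    for cmd in commands:
        subprocess.run(cmd, check=True)
```

For example, `run_all(finetune_commands("san", "langdata", "tessdata", "/usr/share/fonts", "train", "out/vedic"))` would run the four steps in sequence, provided the Tesseract training tools are installed and on the PATH.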

I am doing a test training using portions of online texts of the Rigveda and Yajurveda. (I had deleted the GitHub repos referenced in earlier messages in this thread.) I will share the traineddata file when done.

bohrbrar · 2020-06-22T14:21:18Z

@Shreeshrii,
Good work. How close are you to finishing the training of the Rigveda and Yajurveda models? Can you share the traineddata files?
Regards
Bohar

sd-dwivedi · 2020-08-27T12:51:09Z

@Shreeshrii Please share the traineddata file for Vedic pandulipi Sanskrit.

Shreeshrii mentioned this issue Jul 29, 2017

Hebrew issues tesseract-ocr/langdata#82

Open

Shreeshrii changed the title ~~Improved Sanskrit Traineddata for 4.0~~ Vedic Sanskrit Traineddata for 4.0 Aug 4, 2017
