
lstmeval on trained model appears to be making Unicode substitution #270

Closed
johnbeard opened this issue Jul 27, 2021 · 5 comments
Labels
question Further information is requested

Comments

@johnbeard

I am trying to create a model for old-style English printing that uses the long-s (ſ) character. Generally, this occurs for any non-final lowercase 's' in a word.

I have successfully trained a model which appears to have reasonable accuracy (at least better in some respects than eng, which usually mistakes long-s for 'f'). I am now trying to evaluate the model's accuracy so I can make adjustments in the right direction.

I have generated a set of ground-truth images (which are from real scans, though the model was trained from generated text).
However, the result of the lstmeval shows the long-s substituted by s.

For example:

lstmeval --model data/eng_oldcaslon_longs.traineddata --eval_listfile data/eval_eng_old/all-lstmf --verbosity 2

....

Truth:incapable of diſcharging the ſocial duties of life, or enjoying the felicities of it.
OCR  :incapable of discharging the social duties of life, or enjoying the felicities of it.
Truth:I mean not to exhibit horror for the purpoſe of provoking revenge, but to
OCR  :I mean not to exhibit horror for the purpose of provoking revenge, but tol :,
Line Char error rate=0.068493, Word error rate=0.071429

This appears to actually be recognising long-s successfully, because 1) no error is reported on the first line, and 2) if it weren't, a ground-truth long-s would be recognised as 'f' (or maybe 'l'), not 's'.

However, in the OCR: lines it is printed as 's'. This makes it awkward to compare failure modes while tweaking the model.
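A plausible cause (my assumption; the thread doesn't confirm which normalization lstmeval applies) is Unicode compatibility normalization: ſ (U+017F) carries a compatibility decomposition to plain 's', so NFKC/NFKD folding silently replaces it while the underlying recognition remains correct. A minimal sketch:

```python
import unicodedata

# U+017F LATIN SMALL LETTER LONG S has a <compat> decomposition to 's',
# so compatibility normalization (NFKC/NFKD) folds it to a plain 's'.
truth = "diſcharging the ſocial duties"
folded = unicodedata.normalize("NFKC", truth)

print(folded)           # prints "discharging the social duties"
print(truth == folded)  # prints False: the long-s forms are gone
```

Canonical normalization (NFC/NFD) would leave ſ intact; only the compatibility forms fold it away.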

@stweil
Collaborator

stweil commented Jul 27, 2021

I can confirm this behaviour.

Hint: there are already models which might work for you, for example the standard model script/Fraktur or our frak2021 models.

@johnbeard
Author

johnbeard commented Jul 27, 2021

@stweil thanks!

I'm surprised the Fraktur models worked so well (frak2021: CER=4.05, WER=11.4), since this text isn't Fraktur. Image from the evaluation corpus:

[Image: common-sense-p20-039]

Image generated from the tessedit_write_images=1 output.

What is frak2021 trained on, out of interest? It's very impressive.

I can't use eng for comparison without more work, since ſ isn't in that model's character set at all and so the ground truth won't encode. For the others I get (CER/WER): 9.5/25 with ita_old, 10/25 with frk, 6.2/18 with GT4HistOCR, 8.6/24.5 with script/Fraktur, and my current best new model is 6.3/17.3.
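For reference, the CER figures being compared here are edit-distance ratios. A minimal sketch of the usual Levenshtein-based definition (this mirrors the standard computation, not necessarily lstmeval's exact implementation):

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert/delete/substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

# Example lines adapted from the lstmeval output above:
truth = "incapable of diſcharging the ſocial duties of life,"
ocr   = "incapable of discharging the social duties of life,"

# Two ſ→s substitutions, normalized by the ground-truth length.
cer = levenshtein(truth, ocr) / len(truth)
print(cer)
```

Note that with this definition the two ſ→s substitutions count as errors; if the evaluation normalizes both strings first, they vanish, which is why the folding matters for tuning.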

@johnbeard johnbeard changed the title lstmeval on trained model appear to be making Unicode substitution lstmeval on trained model appears to be making Unicode substitution Jul 27, 2021
@stweil
Collaborator

stweil commented Jul 27, 2021

What is frak2021 trained on, out of interest?

See https://github.com/tesseract-ocr/tesstrain/wiki/GT4HistOCR#frak2021.

@johnbeard
Author

Also, by the way, do you have a tool for producing the ground-truth text lines? I find doing them in a text editor extremely annoying.

@wrznr wrznr added the question Further information is requested label Aug 27, 2021
@wrznr
Collaborator

wrznr commented Aug 27, 2021

Have a look at https://github.com/OCR4all/LAREX. Also, OCRopy has a simplistic browser-based transcription utility.
