lstmeval on trained model appears to be making Unicode substitution #270
Comments
I can confirm this behaviour. Hint: there already exist models which might work for you, for example the standard model script/Fraktur or our frak2021 models.
@stweil thanks! I'm surprised the Fraktur models worked so well (frak2021: CER=4.05, WER=11.4), since this isn't Fraktur.
Image from the evaluation corpus:
Image generated from the
What is frak2021 trained on, out of interest? It's very impressive.
I can't use
See https://github.com/tesseract-ocr/tesstrain/wiki/GT4HistOCR#frak2021.
Also, by the way, do you have a tool for transcribing the ground-truth text lines? I find doing them with a text editor extremely annoying.
Have a look at https://github.com/OCR4all/LAREX. Also OCRpy has a simplistic browser-based transcription utility.
I am trying to create a model for old-style English printing using the long-s (ſ) character. Generally, this occurs for any non-final lowercase 's' in a word.
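As a minimal sketch of the rule stated above (every lowercase 's' that is not word-final becomes 'ſ'), the following converts modern text for use as generated training text. The function name `to_long_s` is hypothetical and not part of any Tesseract tooling, and the rule is deliberately simplified: historical printing practice had further exceptions that this ignores.

```python
import re

LONG_S = "\u017f"  # ſ (LATIN SMALL LETTER LONG S)


def to_long_s(text: str) -> str:
    """Replace each lowercase 's' that is followed by a letter with long s.

    Simplified assumption: "non-final" means "immediately followed by
    another letter". Uppercase 'S' and word-final 's' are left unchanged.
    """
    # (?=[^\W\d_]) is a lookahead that matches only if the next
    # character is a Unicode letter (word char, but not digit or '_').
    return re.sub(r"s(?=[^\W\d_])", LONG_S, text)
```

For instance, `to_long_s("success is sweet")` leaves the final 's' of "success" and "is" round while converting the others.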
I have successfully trained a model which appears to have reasonable accuracy (at least better in some ways than `eng`, which obviously usually mistakes them for 'f'). I am now trying to evaluate the accuracy of the model so I can make adjustments in the right directions. I have generated a set of ground-truth images (which are from real scans, though the model was trained from generated text).
However, the output of `lstmeval` shows the long-s substituted by s. For example:
This appears to actually be successfully recognising long-s, because 1) there's no error in the first line and 2) if it wasn't, a ground-truth long-s would be seen as 'f' (or maybe 'l'), not 's'.
However, in the `OCR:` lines, it's being printed as an 's'. This is making it a little awkward for me to compare failure modes while tweaking the model.
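One possible way to compare failure modes without relying on how the evaluation lines are printed is to tally character-level differences between a ground-truth string and the recognised string directly. The sketch below uses only Python's standard `difflib`. The function name `char_confusions` is hypothetical, and it assumes you already have both strings in hand (for example from the ground-truth text file and from the model's own text output); it does not parse `lstmeval` output.

```python
from collections import Counter
from difflib import SequenceMatcher


def char_confusions(truth: str, ocr: str) -> Counter:
    """Count (truth_segment, ocr_segment) pairs where the strings differ.

    An empty truth_segment means the OCR inserted text; an empty
    ocr_segment means the OCR dropped text.
    """
    confusions: Counter = Counter()
    # autojunk=False disables difflib's heuristic that ignores very
    # frequent characters in long sequences.
    matcher = SequenceMatcher(None, truth, ocr, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            confusions[(truth[i1:i2], ocr[j1:j2])] += 1
    return confusions
```

Summing the counters over all evaluation lines and listing the most common pairs would show whether 'ſ' is mostly confused with 'f', 'l', or 's'.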