Training parts lists #131
Comments
I am not sure if you are going to find this answer useful, but I will try to reply to what I know so far.
Have you tried using another page segmentation level?
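For instance (image name and mode are placeholders; `tesseract --help-psm` lists all modes):

```sh
# PSM 7 treats the image as a single text line, which may fit
# one-line part-number snippets better than the default:
tesseract img.png stdout --psm 7
```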
If I understood correctly, as a last resort you could set a character whitelist in order to include all the characters you may find in your images.
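For example, restricted to the characters that occur in the sample lines shown later in this thread (the exact set is an assumption; also note that, if I remember correctly, the LSTM engine only honors the whitelist from Tesseract 4.1 on):

```sh
# Allow only digits, capital letters, hyphen and space:
tesseract img.png stdout -c tessedit_char_whitelist='0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ- '
```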
The models you get in output from tesstrain are by default without a dictionary; in order to add one, you might want to check this useful comment from Shreeshii. |
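Since that comment is not quoted here, a rough sketch of the usual recipe with the standard training tools (file names are placeholders, and this may differ from what Shreeshii actually suggests):

```sh
# Extract the unicharset from the trained model:
combine_tessdata -e microfiche.traineddata microfiche.lstm-unicharset
# Compile a plain word list into a dawg using that unicharset:
wordlist2dawg words.txt microfiche.lstm-word-dawg microfiche.lstm-unicharset
# Write the dawg back into the model (punctuation and number dawgs
# can be added the same way):
combine_tessdata -o microfiche.traineddata microfiche.lstm-word-dawg
```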
Apart from what Jertlok wrote, do the metadata of your images contain information on the resolution (i.e. 600 dpi)? Check it e.g. with … Try setting … |
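In case it helps, here is one way to check and override the resolution (assuming ImageMagick's `identify` is available; the 600 matches the scan resolution mentioned in this thread):

```sh
# Print the resolution embedded in the image metadata:
identify -format '%x x %y\n' img.png
# If it is missing or wrong, pass the real value to tesseract explicitly:
tesseract img.png stdout --dpi 600
```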
I just tried to scan your image with a model I am currently training (yeah, that font is pretty similar to what I have in my various ground-truth images) and I get a perfect match:

```
tesseract .\img.png stdout -l micraPlus_5.837_4429_16100
Failed to load any lstm-specific dictionaries for lang micraPlus_5.837_4429_16100!!
790364120 2-J07 048
```

The model has been derived from ita (https://github.com/tesseract-ocr/tessdata_best/blob/master/ita.traineddata). What I can suggest is trying to improve your ground-truth images, as the letters you've got over there are pretty ambiguous and not really good for training (IMHO). Here's my model (integer and float), just in case you might find it useful for your training: micraPlus_model.zip (https://github.com/tesseract-ocr/tesstrain/files/3939824/micraPlus_model.zip). Also, please note that this test has been done with the latest version of tesseract (master). |
You should consider uploading your models to tessdata_contrib. |
Your training file already did a much better recognition on the letters, although it makes more mistakes in recognizing the numbers. Maybe it's a statistical thing, as I only have one letter between the numbers for training? Is it maybe even possible to give a pattern to tesseract, as I know in advance where I get numbers and where not? |
The optimal distribution of a training set in relation to the material to be recognized is still an open question. A systematic evaluation on data like yours would be very, very helpful!
Maybe not directly. But if you have a model which performs better on certain parts of your input, and you know in advance where those parts are located, you may just apply the more appropriate model there, right? |
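Just to illustrate that idea (geometries, file names and model names are made up; `convert` is ImageMagick):

```sh
# Cut out the columns (crop offsets are placeholders for your layout)
# and add a white border, since tesseract likes some margin:
convert img.png -crop 500x2000+0+0   +repage -bordercolor white -border 20 numbers.png
convert img.png -crop 300x2000+500+0 +repage -bordercolor white -border 20 letters.png
# Run whichever model performs best on each region:
tesseract numbers.png stdout -l microfiche
tesseract letters.png stdout -l micraPlus_5.837_4429_16100
```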
You can try with user_patterns to see if it helps in your case. See tesseract-ocr/tesseract#2328. |
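For lines like `790364120 2-J07 048`, such a pattern file could look as follows (the `\d`/`\a` escapes follow the pattern syntax in Tesseract's trie.cpp; treat the details as an assumption to verify against your version):

```sh
# One pattern per word (tokens are split at spaces);
# \d matches a digit, \a a letter:
cat > Microfiche.pattern <<'EOF'
\d\d\d\d\d\d\d\d\d
\d-\a\d\d
\d\d\d
EOF
# Pass the file at init time:
tesseract img.png stdout -l microfiche --user-patterns Microfiche.pattern
```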
So I'd cut the picture into further parts (columns), especially to recognize the middle part, or is there a more intelligent way?
Currently I have a small Perl script to analyze the data afterwards with regular expressions and point out whenever a line doesn't match, so that I can correct it manually. It would be a great feature to tell tesseract about this in advance and let it "look a second time" if the pattern doesn't match. |
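As a stand-in for that Perl script (which is not shown in this thread), the same check in shell; the regular expression is only a guess based on the sample line `790364120 2-J07 048`:

```sh
# Print every output line (with its line number) that does NOT match
# the expected "9 digits, digit-letter-2 digits, 3 digits" layout:
grep -nvE '^[0-9]{9} [0-9]-[A-Z][0-9]{2} [0-9]{3}$' output.txt
```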
Not very likely to happen. Sorry.
Use your Perl script? I.e. let tesseract “look a second time” with the other model if the pattern doesn't match and extract the text only for those parts. |
Exactly. If you know what pattern of numbers and letters to expect for a certain segment of your document, and you use the Tesseract API anyway (or split up the page into segment images and use the CLI), then you can tell Tesseract what to look for with the user patterns mentioned above. |
Yes, letting tesseract look into the data with another model is a good idea. I will try that. Also, I was trying to use the user patterns. My command is … Microfiche.pattern looks like this: … Setting the params makes no difference to the output. I did some research and also tried it with a config file and … |
What is …? Also, you don't need … Additional parameters you could try are …
This looks good for … So if this does not make any difference at all, then I'm afraid there's not much more you can do currently at runtime. (You must understand that user patterns – like any dictionary/dawg in Tesseract – are not applied exclusively, but as a hint only. I know how to make them exclusive, but in the current state of affairs all this would get us is rejections – missing characters. I have tried combining this with deep beam alternatives, but not succeeded so far.) |
Hello,
I hope this is the right place for my question.
I've got huge lists of part assignments that I plan to import into a database. These lists live on microfiches, so I had to scan them with a microfiche scanner. Unfortunately even the best quality still isn't near perfect, so I plan to train tesseract to reduce the error rate.
Using tesstrain, the results already look quite good, but I often get letters recognized as numbers, maybe because of the ratio between numbers and letters. To make sure I did it the right way, I wanted to list what I've done and ask if I might have made mistakes:

1. "Clean" the part lists and convert them into monochrome.
2. Cut the lists into one column each, and add a border around each.
3. Produce training data with this script: Page level images #7 (comment); a typical line reads `790364130 2-J07 049`.
4. Correct the text files and create pairs: `file.tif` -> `file.gt.txt`.
5. Start tesstrain with `make training START_MODEL=eng MODEL_NAME=microfiche` (a fuller invocation is sketched below).
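For reference, a slightly fuller invocation; the variable names come from tesstrain's Makefile, but the values here are placeholders (check your checkout for the defaults and the expected ground-truth directory):

```sh
# Fine-tune from the eng base model on the .tif/.gt.txt pairs:
make training \
  MODEL_NAME=microfiche \
  START_MODEL=eng \
  PSM=7 \
  MAX_ITERATIONS=10000
```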
The output training file I get already improves the recognition a lot, although tesseract barely recognizes the letters in the middle column, like the "J" in `2-J07` mentioned above. I read about a valid-letters list, although I couldn't find it so far. And I get a warning about a missing dictionary; I'm not sure if this really affects the recognition.
Is there any tuning possibility to get the letters recognized better, or do I need more data?
I've got around 300 lines for training so far.
Thank you!
The samples are not in the highest resolution; I scanned the images at 600 dpi.