
Training parts lists #131

Closed
L1800Turbo opened this issue Dec 9, 2019 · 14 comments
Labels
question Further information is requested stale Issues which require input by the reporter which is not provided

Comments

@L1800Turbo

Hello,
I hope this is the right place for my question.

I've got huge lists of part assignments that I plan to import into a database. These lists are stored on microfiches, so I had to digitize them with a microfiche scanner. Unfortunately, even the best quality is still far from perfect, so I plan to train Tesseract to reduce the error rate.

Using tesstrain, the results already look quite good, but letters often get recognized as numbers, maybe because of the ratio of numbers to letters in the training data.
To make sure I did it the right way, I want to list what I've done, in case I made mistakes.

  1. "Clean" the part lists and convert them to monochrome
    000407

  2. Cut the lists into one column each, add a border around each
    cut17

  3. Produced training data with this script:
    Page level images #7 (comment)
    790364130 2-J07 049 -> cut2-002 exp0

  4. Correct the text files and create file.tif -> file.gt.txt pairs

  5. Start tesstrain with make training START_MODEL=eng MODEL_NAME=microfiche
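Steps 4 and 5 can be scripted. Below is a minimal sketch of the pairing step, assuming the line images and corrected transcriptions live in a `cuts/` directory and that tesstrain uses its default ground-truth layout `data/<MODEL_NAME>-ground-truth/` (both are assumptions about the local setup; the sample files here are stand-ins):

```shell
# Sketch: pair each line image with its corrected transcription the way
# tesstrain expects: <name>.tif next to <name>.gt.txt.
mkdir -p cuts
printf '790364130 2-J07 049\n' > cuts/cut2-002.txt   # corrected text (example)
: > cuts/cut2-002.tif                                # stands in for the scanned line

GT_DIR="data/microfiche-ground-truth"
mkdir -p "$GT_DIR"
for tif in cuts/*.tif; do
  base=$(basename "$tif" .tif)
  cp "$tif" "$GT_DIR/$base.tif"
  cp "cuts/$base.txt" "$GT_DIR/$base.gt.txt"
done
ls "$GT_DIR"
# afterwards, from the tesstrain checkout:
#   make training START_MODEL=eng MODEL_NAME=microfiche
```

tesstrain then picks up every tif/gt.txt pair in that directory when `make training` runs.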

The training file I get already improves recognition a lot, although Tesseract barely recognizes the letters in the middle column, like the "J" in 2-J07 mentioned above.

I read about a valid-letters list, but I couldn't find it so far. I also get a warning about a missing dictionary; I'm not sure if this really affects recognition.

Is there any tuning possibility to get the letters recognized better, or do I need more data?
I've got around 300 lines for training so far.

Thank you!

The samples are not in the highest resolution; I scanned the images at 600 dpi.

@Jertlok
Contributor

Jertlok commented Dec 9, 2019

I am not sure if you are going to find this answer useful, but I will try to reply with what I know so far.

tesseract barely recognizes the letters in the middle column

Have you tried using another page segmentation mode (--psm)?

I read about a valid letters list

If I understood correctly, as a last resort you could set a character whitelist to include all the characters you may find in your images.
The configuration variable is tessedit_char_whitelist; you could set it to include only upper-case letters plus a few special characters.
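A minimal sketch of such a whitelist, written as a config file. The character set here is an assumption based on the sample part numbers, and note that whitelist support with the LSTM engine depends on the Tesseract version:

```shell
# Sketch: restrict recognition to the characters that can occur in a
# part number (digits, upper-case letters, '-'); adjust to your data.
echo 'tessedit_char_whitelist 0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ-' > partlist.config
cat partlist.config
# usage (needs tesseract installed; the config file is a positional argument):
#   tesseract cut2-001.tif stdout -l microfiche --psm 6 partlist.config
```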

And I get a warning about no dictionary.

The models you get from tesstrain have no dictionary by default; to add one, you might want to check this useful comment from Shreeshrii.

@wrznr
Collaborator

wrznr commented Dec 9, 2019

Apart from what Jertlok wrote, do the metadata of your images contain resolution information (i.e. 600 dpi; check it e.g. with exiftool)? If not, you may want to set it manually using --dpi (undocumented Tesseract option).

Try setting PSM to 13.

@L1800Turbo
Author

Thank you for the answers!
I usually used --psm 6, as the default mode only recognizes the first column.

I will check the char_whitelist this evening, although I don't get any characters apart from the ones I would whitelist anyway.

Did a check on the dpi with exiftool and got 600 dpi.

PSM 13 seems to make it a little worse. As an example:
cut2-001 exp0
Created with:
PSM 6: 790364120 2-007 048
PSM 13: 790364120 2-307 0348

The numbers are mostly recognized well. Only the letters seem to cause problems.

@Jertlok
Contributor

Jertlok commented Dec 9, 2019

I just tried to scan your image with a model I am currently training (yeah, that font is pretty similar to what I have in my various ground-truth images) and I get a perfect match.

tesseract .\img.png stdout -l micraPlus_5.837_4429_16100
Failed to load any lstm-specific dictionaries for lang micraPlus_5.837_4429_16100!!

790364120 2-J07 048

The model has been derived from ita.

What I can suggest is trying to improve your ground-truth images, as the letters you've got there are pretty ambiguous and not really good for training (IMHO).

Here's my model (integer and float), just in case you might find it useful for your training:
micraPlus_model.zip

Also, please note that this test has been done with the latest version of tesseract (master).

@Shreeshrii
Collaborator

Shreeshrii commented Dec 10, 2019 via email

@L1800Turbo
Author

Here's my model (integer and float), just in case you might find it useful for your training:
micraPlus_model.zip

Your training file already recognizes the letters much better, although it makes more mistakes on the numbers. Maybe it's a statistical thing, as I only have one letter between the numbers for training?
Is it maybe even possible to give Tesseract a pattern, as I know in advance where the numbers are?

@wrznr
Collaborator

wrznr commented Dec 10, 2019

Maybe it's a statistical thing ...

The optimal distribution of a training set in relation to the materials to be recognized is still an open question. A systematic evaluation on data like yours would be very, very helpful!

possible to give a pattern to tesseract

Maybe not directly. But if you have a model which performs better on certain parts of your input and you know in advance where those parts are located you may just apply the more appropriate model there, right?

@Shreeshrii
Collaborator

Shreeshrii commented Dec 10, 2019 via email

@L1800Turbo
Author

Maybe not directly. But if you have a model which performs better on certain parts of your input and you know in advance where those parts are located you may just apply the more appropriate model there, right?

So I'd cut the picture into further parts (columns) to better recognize the middle part, or is there a more intelligent way?

The optimal distribution of a training set in relation to the materials to be recognized is still an open question. A systematical evaluation on data like you have would be very, very helpful!

Currently I have a small Perl script that analyzes the data afterwards with regular expressions and points out whenever a line doesn't match, so that I can correct it manually. It would be a great feature to tell Tesseract about this in advance and let it "look a second time" if the pattern doesn't match.
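Such an after-the-fact check doesn't need Perl; here's a minimal sketch in shell, with the regular expression inferred from the sample line 790364130 2-J07 049 (an assumption about the real column widths):

```shell
# Sketch: flag OCR lines that do not match the expected part-number
# pattern (9 digits / digit-dash-letter-2 digits / 3 digits; adjust).
cat > ocr_output.txt <<'EOF'
790364130 2-J07 049
790364120 2-007 048
EOF
# Print the lines that fail the pattern, i.e. the candidates for
# manual correction (or a second OCR pass with another model):
grep -vE '^[0-9]{9} [0-9]-[A-Z][0-9]{2} [0-9]{3}$' ocr_output.txt
```

Here the second line is flagged, because "007" has a digit where the letter is expected.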

@wrznr
Collaborator

wrznr commented Dec 10, 2019

This would be a great feature

Not very likely to happen. Sorry.

is there a more intelligent way?

Use your Perl script? I.e. let tesseract “look a second time” with the other model if the pattern doesn't match, and extract the text only for those parts.
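That two-pass idea can be sketched as follows. run_ocr is a hypothetical stub standing in for the actual tesseract call (e.g. `tesseract "$img" stdout -l "$model" --psm 6`), so only the control flow is illustrated here:

```shell
# Sketch: run the number-strong model first; when the line fails the
# expected pattern, retry with the letter-strong model.
run_ocr() {  # run_ocr <image> <model>  -- stub for illustration only
  if [ "$2" = "microfiche" ]; then echo '790364120 2-007 048'
  else echo '790364120 2-J07 048'; fi
}

PATTERN='^[0-9]{9} [0-9]-[A-Z][0-9]{2} [0-9]{3}$'
img=cut2-001.tif
line=$(run_ocr "$img" microfiche)
if ! echo "$line" | grep -qE "$PATTERN"; then
  line=$(run_ocr "$img" micraPlus)   # second look with the other model
fi
echo "$line"
```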

@bertsky
Collaborator

bertsky commented Dec 10, 2019

Maybe not directly. But if you have a model which performs better on certain parts of your input and you know in advance where those parts are located you may just apply the more appropriate model there, right?

So I'd cut the picture into further parts (columns), to especially recognize the middle part, or is there a more intelligent way?

Exactly. If you know what pattern of numbers and letters to expect for a certain segment of your document, and you use the Tesseract API anyway (or split up the page into segment images and use the CLI), then you can tell Tesseract what to look for with the user_patterns feature mentioned above. It's just a hint in the current implementation though, not exclusive. (It acts like a dictionary.)

@L1800Turbo
Author

L1800Turbo commented Dec 10, 2019

Yes, letting tesseract look into the data with another model is a good idea. I will try that.

Also, I was trying to use user_patterns as described in https://github.com/tesseract-ocr/tesseract/wiki/APIExample-user_patterns
Unfortunately I couldn't get this to work with v4.1.1-rc2-17-g6343.

My command is tesseract schnitt4.tif schnitt4 --user-patterns ../../Microfiche.pattern -c lstm_use_matrix=1 -l Microfiche --psm 6 bazaar

Microfiche.pattern looks like this:
\d\d\d\d\d\d\d\d\d \d-\A\d\d\d \d\d\d
\d-\A\d\d\d \d\d\d

Setting the params makes no difference to the output. I did some research and also tried it with a config file and tesseract schnitt4.tif schnitt4 -l Microfiche --psm 6 bazaar, but no difference. Typing a wrong pattern-file path on purpose gave me an error message, so the parameter seems to be parsed at some point.

@bertsky
Collaborator

bertsky commented Dec 10, 2019

My command is tesseract schnitt4.tif schnitt4 --user-patterns ../../Microfiche.pattern -c lstm_use_matrix=1 -l Microfiche --psm 6 bazaar

What is bazaar here? (If you copied it from the recipe in the man page, it's meant as the (file) name of a config file, but you don't need a config file on the command line, since you can use --user-patterns. In fact, that config file could easily override the setting by referencing other pattern files, though I'm not certain of this.)

Also, you don't need lstm_use_matrix=1, since it's the default. (I just updated the wiki to reflect this.)

Additional parameters you could try are -c load_system_dawg=F -c load_freq_dawg=F – this disables the built-in dictionaries (if your model even contains them).

Microfiche.pattern looks like this:
\d\d\d\d\d\d\d\d\d \d-\A\d\d\d \d\d\d
\d-\A\d\d\d \d\d\d

This looks good for --psm 6.

So if this does not make any difference at all, then I'm afraid there's not much more you can do at runtime currently. (You must understand that user patterns, like any dictionary/dawg in Tesseract, are not applied exclusively, but only as a hint. I know how to make them exclusive, but in the current state of affairs all this would get us is rejections, i.e. missing characters. I have tried combining this with deep beam alternatives, but have not succeeded so far.)
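For reference, a minimal sketch of a pattern file plus invocation. The column widths are adjusted to match the sample line 790364130 2-J07 049 and are assumptions; tune the number of \d per field to your actual data:

```shell
# Sketch: user patterns for the part-number layout.
# In Tesseract's user-pattern syntax, \d = digit, \A = letter.
cat > Microfiche.pattern <<'EOF'
\d\d\d\d\d\d\d\d\d \d-\A\d\d \d\d\d
\d-\A\d\d \d\d\d
EOF
cat Microfiche.pattern
# usage (not run here; remember: patterns act as a hint, not a filter):
#   tesseract schnitt4.tif schnitt4 --user-patterns Microfiche.pattern \
#     -c load_system_dawg=F -c load_freq_dawg=F -l Microfiche --psm 6
```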

@stale

stale bot commented Jan 9, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale Issues which require input by the reporter which is not provided label Jan 9, 2020
@stale stale bot closed this as completed Jan 17, 2020
@stweil stweil added the question Further information is requested label Feb 10, 2020