
Training parts lists #131

Closed
L1800Turbo opened this issue Dec 9, 2019 · 14 comments
Labels
question Further information is requested stale Issues which require input by the reporter which is not provided

Comments

@L1800Turbo

Hello,
I hope this is the right place for my question.

I've got huge lists of part assignments that I plan to import into a database. These lists are stored on microfiches, so I had to digitize them with a microfiche scanner. Unfortunately, even the best quality is still far from perfect, so I plan to train Tesseract to reduce the error rate.

Using tesstrain, the results already look quite good, but letters often get recognized as numbers, maybe because of the ratio of numbers to letters in the training data.
To make sure I did it the right way, I want to list what I've done, in case I made mistakes.

  1. "Clean" the part lists and convert them to monochrome
    000407

  2. Cut the lists into one column each, add a border around each
    cut17

  3. Produced training data with this script:
    Page level images #7 (comment)
    790364130 2-J07 049 -> cut2-002 exp0

  4. Correct the text files and create file.tif -> file.gt.txt pairs

  5. Start tesstrain with make training START_MODEL=eng MODEL_NAME=microfiche
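Steps 4 and 5 can be scripted. Below is a minimal sketch of the pairing step, assuming the line images and corrected transcriptions live in a `cuts/` directory and that tesstrain uses its default ground-truth layout `data/<MODEL_NAME>-ground-truth/` (both are assumptions about the local setup; the sample files here are stand-ins):

```shell
# Sketch: pair each line image with its corrected transcription the way
# tesstrain expects: <name>.tif next to <name>.gt.txt.
mkdir -p cuts
printf '790364130 2-J07 049\n' > cuts/cut2-002.txt   # corrected text (example)
: > cuts/cut2-002.tif                                # stands in for the scanned line

GT_DIR="data/microfiche-ground-truth"
mkdir -p "$GT_DIR"
for tif in cuts/*.tif; do
  base=$(basename "$tif" .tif)
  cp "$tif" "$GT_DIR/$base.tif"
  cp "cuts/$base.txt" "$GT_DIR/$base.gt.txt"
done
ls "$GT_DIR"
# afterwards, from the tesstrain checkout:
#   make training START_MODEL=eng MODEL_NAME=microfiche
```

tesstrain then picks up every tif/gt.txt pair in that directory when `make training` runs.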

The training file I get already improves recognition a lot, although Tesseract barely recognizes the letters in the middle column, like the "J" in 2-J07 mentioned above.

I read about a valid-letters list, but I couldn't find it so far. I also get a warning about a missing dictionary; I'm not sure if this really affects recognition.

Is there any tuning possibility to get the letters recognized better, or do I need more data?
I've got around 300 lines for training so far.

Thank you!

The samples are not in the highest resolution; I scanned the images at 600 dpi.

@Jertlok
Contributor

Jertlok commented Dec 9, 2019

I am not sure if you are going to find this answer useful, but I will try to reply with what I know so far.

tesseract barely recognizes the letters in the middle column

Have you tried using another page segmentation mode (--psm)?

I read about a valid letters list

If I understood correctly, as a last resort you could set a character whitelist to include all the characters you may find in your images.
The configuration variable is tessedit_char_whitelist; you could set it to include only upper-case letters plus a few special characters.
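A minimal sketch of such a whitelist, written as a config file. The character set here is an assumption based on the sample part numbers, and note that whitelist support with the LSTM engine depends on the Tesseract version:

```shell
# Sketch: restrict recognition to the characters that can occur in a
# part number (digits, upper-case letters, '-'); adjust to your data.
echo 'tessedit_char_whitelist 0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ-' > partlist.config
cat partlist.config
# usage (needs tesseract installed; the config file is a positional argument):
#   tesseract cut2-001.tif stdout -l microfiche --psm 6 partlist.config
```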

And I get a warning about no dictionary.

The models you get from tesstrain have no dictionary by default; to add one, you might want to check this useful comment from Shreeshrii.

@wrznr
Collaborator

wrznr commented Dec 9, 2019

Apart from what Jertlok wrote, do the metadata of your images contain resolution information (i.e. 600 dpi; check it e.g. with exiftool)? If not, you may want to set it manually using --dpi (undocumented Tesseract option).

Try setting PSM to 13.

@L1800Turbo
Author

Thank you for the answers!
I usually used --psm 6, as the default mode only recognizes the first column.

I will check the char_whitelist this evening, although I don't get any characters apart from the ones I would whitelist anyway.

Did a check on the dpi with exiftool and got 600 dpi.

PSM 13 seems to make it a little worse. As an example:
cut2-001 exp0
Created with:
PSM 6: 790364120 2-007 048
PSM 13: 790364120 2-307 0348

The numbers are mostly recognized well. Only the letters seem to cause problems.

@Jertlok
Contributor

Jertlok commented Dec 9, 2019

I just tried to scan your image with a model I am currently training (yeah, that font is pretty similar to what I have in my various ground-truth images) and I get a perfect match.

tesseract .\img.png stdout -l micraPlus_5.837_4429_16100
Failed to load any lstm-specific dictionaries for lang micraPlus_5.837_4429_16100!!

790364120 2-J07 048

The model has been derived from ita.

What I can suggest is trying to improve your ground-truth images, as the letters you've got there are pretty ambiguous and not really good for training (IMHO).

Here's my model (integer and float), just in case you might find it useful for your training:
micraPlus_model.zip

Also, please note that this test has been done with the latest version of tesseract (master).

@Shreeshrii
Collaborator

Shreeshrii commented Dec 10, 2019 via email

@L1800Turbo
Author

Here's my model (integer and float), just in case you might find it useful for your training:
micraPlus_model.zip

Your training file already recognizes the letters much better, although it makes more mistakes on the numbers. Maybe it's a statistical thing, as I only have one letter between the numbers for training?
Is it maybe even possible to give Tesseract a pattern, as I know in advance where the numbers are?

@wrznr
Collaborator

wrznr commented Dec 10, 2019

Maybe it's a statistical thing ...

The optimal distribution of a training set in relation to the materials to be recognized is still an open question. A systematic evaluation on data like yours would be very, very helpful!

possible to give a pattern to tesseract

Maybe not directly. But if you have a model which performs better on certain parts of your input and you know in advance where those parts are located you may just apply the more appropriate model there, right?

@Shreeshrii
Collaborator

Shreeshrii commented Dec 10, 2019 via email

@L1800Turbo
Author

Maybe not directly. But if you have a model which performs better on certain parts of your input and you know in advance where those parts are located you may just apply the more appropriate model there, right?

So I'd cut the picture into further parts (columns) to better recognize the middle part, or is there a more intelligent way?

The optimal distribution of a training set in relation to the materials to be recognized is still an open question. A systematical evaluation on data like you have would be very, very helpful!

Currently I have a small Perl script that analyzes the data afterwards with regular expressions and points out whenever a line doesn't match, so that I can correct it manually. It would be a great feature to tell Tesseract about this in advance and let it "look a second time" if the pattern doesn't match.
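Such an after-the-fact check doesn't need Perl; here's a minimal sketch in shell, with the regular expression inferred from the sample line 790364130 2-J07 049 (an assumption about the real column widths):

```shell
# Sketch: flag OCR lines that do not match the expected part-number
# pattern (9 digits / digit-dash-letter-2 digits / 3 digits; adjust).
cat > ocr_output.txt <<'EOF'
790364130 2-J07 049
790364120 2-007 048
EOF
# Print the lines that fail the pattern, i.e. the candidates for
# manual correction (or a second OCR pass with another model):
grep -vE '^[0-9]{9} [0-9]-[A-Z][0-9]{2} [0-9]{3}$' ocr_output.txt
```

Here the second line is flagged, because "007" has a digit where the letter is expected.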

@wrznr
Collaborator

wrznr commented Dec 10, 2019

This would be a great feature

Not very likely to happen. Sorry.

is there a more intelligent way?

Use your Perl script? I.e. let tesseract “look a second time” with the other model if the pattern doesn't match, and extract the text only for those parts.
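That two-pass idea can be sketched as follows. run_ocr is a hypothetical stub standing in for the actual tesseract call (e.g. `tesseract "$img" stdout -l "$model" --psm 6`), so only the control flow is illustrated here:

```shell
# Sketch: run the number-strong model first; when the line fails the
# expected pattern, retry with the letter-strong model.
run_ocr() {  # run_ocr <image> <model>  -- stub for illustration only
  if [ "$2" = "microfiche" ]; then echo '790364120 2-007 048'
  else echo '790364120 2-J07 048'; fi
}

PATTERN='^[0-9]{9} [0-9]-[A-Z][0-9]{2} [0-9]{3}$'
img=cut2-001.tif
line=$(run_ocr "$img" microfiche)
if ! echo "$line" | grep -qE "$PATTERN"; then
  line=$(run_ocr "$img" micraPlus)   # second look with the other model
fi
echo "$line"
```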

@bertsky
Collaborator

bertsky commented Dec 10, 2019

Maybe not directly. But if you have a model which performs better on certain parts of your input and you know in advance where those parts are located you may just apply the more appropriate model there, right?

So I'd cut the picture into further parts (columns), to especially recognize the middle part, or is there a more intelligent way?

Exactly. If you know what pattern of numbers and letters to expect for a certain segment of your document, and you use the Tesseract API anyway (or split up the page into segment images and use the CLI), then you can tell Tesseract what to look for with the user_patterns feature mentioned above. It's just a hint in the current implementation though, not exclusive. (It acts like a dictionary.)

@L1800Turbo
Author

L1800Turbo commented Dec 10, 2019

Yes, letting tesseract look into the data with another model is a good idea. I will try that.

Also, I was trying to use user_patterns as described in https://github.com/tesseract-ocr/tesseract/wiki/APIExample-user_patterns
Unfortunately I couldn't get this to work with v4.1.1-rc2-17-g6343.

My command is tesseract schnitt4.tif schnitt4 --user-patterns ../../Microfiche.pattern -c lstm_use_matrix=1 -l Microfiche --psm 6 bazaar

Microfiche.pattern looks like this:
\d\d\d\d\d\d\d\d\d \d-\A\d\d\d \d\d\d
\d-\A\d\d\d \d\d\d

Setting the params makes no difference to the output. I did some research and also tried it with a config file and tesseract schnitt4.tif schnitt4 -l Microfiche --psm 6 bazaar, but no difference. Typing a wrong pattern-file path on purpose gave me an error message, so the parameter seems to be parsed at some point.

@bertsky
Collaborator

bertsky commented Dec 10, 2019

My command is tesseract schnitt4.tif schnitt4 --user-patterns ../../Microfiche.pattern -c lstm_use_matrix=1 -l Microfiche --psm 6 bazaar

What is bazaar here? (If you copied it from the recipe in the man page, it's meant as the (file) name of a config file, but you don't need a config file on the command line, since you can use --user-patterns. In fact, that config file could easily override the setting by referencing other pattern files, though I'm not certain of this.)

Also, you don't need lstm_use_matrix=1, since it's the default. (I just updated the wiki to reflect this.)

Additional parameters you could try are -c load_system_dawg=F -c load_freq_dawg=F – this disables the built-in dictionaries (if your model even contains them).

Microfiche.pattern looks like this:
\d\d\d\d\d\d\d\d\d \d-\A\d\d\d \d\d\d
\d-\A\d\d\d \d\d\d

This looks good for --psm 6.

So if this does not make any difference at all, then I'm afraid there's not much more you can do at runtime currently. (You must understand that user patterns, like any dictionary/dawg in Tesseract, are not applied exclusively, but only as a hint. I know how to make them exclusive, but in the current state of affairs all this would get us is rejections, i.e. missing characters. I have tried combining this with deep beam alternatives, but have not succeeded so far.)
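For reference, a minimal sketch of a pattern file plus invocation. The column widths are adjusted to match the sample line 790364130 2-J07 049 and are assumptions; tune the number of \d per field to your actual data:

```shell
# Sketch: user patterns for the part-number layout.
# In Tesseract's user-pattern syntax, \d = digit, \A = letter.
cat > Microfiche.pattern <<'EOF'
\d\d\d\d\d\d\d\d\d \d-\A\d\d \d\d\d
\d-\A\d\d \d\d\d
EOF
cat Microfiche.pattern
# usage (not run here; remember: patterns act as a hint, not a filter):
#   tesseract schnitt4.tif schnitt4 --user-patterns Microfiche.pattern \
#     -c load_system_dawg=F -c load_freq_dawg=F -l Microfiche --psm 6
```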

@stale

stale bot commented Jan 9, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale Issues which require input by the reporter which is not provided label Jan 9, 2020
@stale stale bot closed this as completed Jan 17, 2020
@stweil stweil added the question Further information is requested label Feb 10, 2020