wrong coordinates in .box file with LSTM #1276

Closed
smarq8 opened this issue Jan 15, 2018 · 47 comments

Comments

@smarq8

smarq8 commented Jan 15, 2018

When I run tesseract with LSTM (oem=2), the coordinates in the box file look wrong. The same command with oem=0 gives good coordinates, but the OCR result is less accurate, even though I fully cleaned the images at high resolution before processing (see images below).

my example code:
"C:\Program Files (x86)\Tesseract-OCR\tesseract.exe" --tessdata-dir "C:\Program Files (x86)\Tesseract-OCR\tessdata" -l pol --oem 2 --psm 6 -c tessedit_create_boxfile=1 -c tessedit_create_hocr=1 -c tessedit_create_tsv=1 -c tessedit_create_txt=1 "D:\x\ClearedText\tesseract\oem0_psm6_20180114221528\fl.txt" "D:\x\ClearedText\tesseract\oem0_psm6_20180114221528\tess"

platform:
W7U x64
tesseract v4.00.00a

111

@Shreeshrii
Collaborator

Try with traineddata from tessdata_best and tessdata_fast with --oem 1

@Shreeshrii
Collaborator

Also, LSTM mode is a line recognizer. I don't think it is meant to be accurate for character level boxes.

@smarq8
Author

smarq8 commented Jan 15, 2018

When I try to use best or fast, I get this error:

lstm_recognizer_->DeSerialize(&fp):Error:Assert failed:in file ../../../../ccmain/tessedit.cpp, line 193

@amitdo
Collaborator

amitdo commented Jan 15, 2018

What matters most is the recognition of text from images.

IMHO, accurate location of individual glyphs is not a very important feature.

LSTM mode is a line recognizer. I don't think it is meant to be accurate for character level boxes.

I believe Shree is right here.

Unlike the lstm engine, the legacy engine works on a glyph level.

So AFAIK this issue is not a bug.

@amitdo
Collaborator

amitdo commented Jan 15, 2018

When I try to use best or fast, I get this error:

lstm_recognizer_->DeSerialize(&fp):Error:Assert failed:in file ../../../../ccmain/tessedit.cpp, line 193

Use the latest code in the master.

@amitm02

amitm02 commented Jan 19, 2018

Is that to say that when I fine-tune tesseract 4 (LSTM) on scanned images, I should ignore the locations in the box file and only fix the recognised characters?
If LSTM works on a line level, how does it use the "character based" box files?

@amitdo
Collaborator

amitdo commented Jan 19, 2018

If LSTM works on a line level, how does it use the "character based" box files?

Basically, what the lstm engine really needs as input is line bounding boxes and separated graphemes (or grapheme clusters).

Still, currently only the box format is supported :(

@amitm02

amitm02 commented Jan 19, 2018

Thanks @amitdo. Obviously Tesseract LSTM has been trained successfully, and a box file made of individual characters is one of the main sub-steps. So what currently happens with the box file? Does Tesseract treat every character as its own “line”, or does it somehow combine all the characters between two EOLs to generate a line bounding box for them?

@amitdo
Collaborator

amitdo commented Jan 19, 2018

... or does it somehow combine all the characters between two EOLs to generate a line bounding box for them?

It combines the char boxes between tabs (a tab marks EOL) into one line box. The chars themselves are kept separated.
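
For readers who want to see the grouping concretely, here is a minimal Python sketch of the behaviour described above: char boxes are collected until a tab entry is reached, and their union becomes the line box. It is an illustration only, not Tesseract's actual code. The names `parse_box_line` and `line_boxes` are made up for this example, and it assumes the usual box-file layout `symbol left bottom right top page`.

```python
from typing import Iterable, List, Tuple

# symbol, left, bottom, right, top, page
Box = Tuple[str, int, int, int, int, int]


def parse_box_line(line: str) -> Box:
    """Split one box-file line into its symbol and five integer fields."""
    # rsplit from the right keeps a space or tab symbol intact at the start.
    symbol, left, bottom, right, top, page = line.rstrip("\r\n").rsplit(" ", 5)
    return symbol, int(left), int(bottom), int(right), int(top), int(page)


def _merge(chars: List[Box]) -> Tuple[str, Tuple[int, int, int, int]]:
    """Join the symbols and take the union of their boxes."""
    text = "".join(c[0] for c in chars)
    bbox = (min(c[1] for c in chars), min(c[2] for c in chars),
            max(c[3] for c in chars), max(c[4] for c in chars))
    return text, bbox


def line_boxes(lines: Iterable[str]) -> List[Tuple[str, Tuple[int, int, int, int]]]:
    """Group char boxes into text lines, ending a line at each tab entry."""
    result = []
    chars: List[Box] = []
    for raw in lines:
        if not raw.strip("\r\n"):
            continue  # skip empty lines
        box = parse_box_line(raw)
        if box[0] == "\t":
            if chars:
                result.append(_merge(chars))
            chars = []
        else:
            chars.append(box)
    if chars:  # tolerate a missing final tab
        result.append(_merge(chars))
    return result
```

The space entry stays part of the line text, so the word separator survives; only the tab entry is consumed as the end-of-line marker.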

@amitm02

amitm02 commented Jan 19, 2018

I’m not sure I understand. If the LSTM trains on the combined line box, what do you mean by “the chars themselves are kept separated”?
Does that mean I can ignore the exact character coordinates as long as they form a reasonable line box when combined (e.g. if a char's coordinates do not fully enclose the char)?

@amitdo
Collaborator

amitdo commented Jan 19, 2018

Does that mean I can ignore the exact character coordinates as long as they form a reasonable line box when combined (e.g. if a char's coordinates do not fully enclose the char)?

I believe the answer is 'yes', but I didn't try it yet.

Make the first and last box accurate. Also change one char box so its top & bottom coordinates will be used for the whole line.

Please report if this trick works.

@amitm02

amitm02 commented Jan 19, 2018 via email

@amitdo
Collaborator

amitdo commented Jan 19, 2018

You must keep the tab as the line separator.
Also, don't drop the word separators (one space char).

@amitm02

amitm02 commented Jan 23, 2018

Reporting back.
As far as I can tell from the code (void Tesseract::TrainFromBoxes), the behaviour is indeed as @amitdo described.
I wrote an ugly Python script to generate a box file from the tesseract TSV result so I won't need to insert the spaces and tabs by hand. All seems to work fine.
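
The script itself was not posted in this thread, so the following is only a hypothetical sketch of what such a TSV-to-box conversion could look like. It assumes the standard Tesseract TSV columns (`level`, `page_num`, `block_num`, `par_num`, `line_num`, `word_num`, `left`, `top`, `width`, `height`, `conf`, `text`), that the caller supplies the image height in pixels, and that every symbol of a line (including the space and the closing tab) simply gets the line's box. The function name `tsv_to_lstm_box` is made up. Two known simplifications: the text is split by code point rather than by grapheme cluster, and no reordering is done for RTL or mixed-direction text.

```python
import csv
import io
from collections import OrderedDict


def tsv_to_lstm_box(tsv_text: str, image_height: int) -> str:
    """Convert Tesseract TSV word rows into line-level box-file text."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t",
                            quoting=csv.QUOTE_NONE)
    lines = OrderedDict()  # (page, block, par, line) -> list of word rows
    for row in reader:
        text = row.get("text") or ""
        if row["level"] != "5" or not text.strip():
            continue  # keep only non-empty word rows
        key = (row["page_num"], row["block_num"],
               row["par_num"], row["line_num"])
        lines.setdefault(key, []).append(row)

    out = []
    for (page, _, _, _), words in lines.items():
        left = min(int(w["left"]) for w in words)
        right = max(int(w["left"]) + int(w["width"]) for w in words)
        top_px = min(int(w["top"]) for w in words)
        bottom_px = max(int(w["top"]) + int(w["height"]) for w in words)
        # TSV measures y from the top edge; box files measure it from the
        # bottom edge. TSV pages start at 1; box-file pages start at 0.
        coords = "{} {} {} {} {}".format(
            left, image_height - bottom_px, right,
            image_height - top_px, int(page) - 1)
        # One space between words, one tab to close the line.
        symbols = list(" ".join(w["text"] for w in words)) + ["\t"]
        out.extend("{} {}".format(s, coords) for s in symbols)
    return "\n".join(out) + "\n"
```

Because every entry carries the same line box, only the union of the word boxes matters, which matches the point made above that exact per-character coordinates are not needed for LSTM training.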

The one thing I'm a bit worried about is cases where a word (or a line) has mixed-language chars in it,
e.g.

בתאריך-10.10.2000

It seems that the char order in the box file should follow the order in which the chars appear on the page from left to right, i.e. the first char is "1" and the last is "ב".
However, the code appends all the chars in the line to a single string. In most programming languages this results in a string whose first char is "ב" and whose last char is "1".
I was unable to figure out from the code whether there is a mismatch here that would cause tesseract to train on a badly ordered string.

@Shreeshrii
Collaborator

Shreeshrii commented Jan 23, 2018

Python script to generate a box file from the tesseract TSV result so I won't need to insert the spaces and tabs by hand. All seems to work fine.

@amitm02 You may want to share it, as a number of people would like to use training from images option.

@amitdo
Collaborator

amitdo commented Jan 23, 2018

Another trick that can help you is to use text2image with just one font. Take the box file it produces and 'fix' the boxes with your script.

@Shreeshrii
Collaborator

@amitm02 Please see the thread at #648 (comment) for how Arabic and other RTL languages are handled.

@amitm02

amitm02 commented Jan 25, 2018

@Shreeshrii, thanks.
I think they made a good call in going strictly LTR in training. Things can get amazingly complex when it comes to mixed-language text: link

@theceday

theceday commented Jan 25, 2018

I am confused about something here. How is the char segmentation layer trained?
Does it use the overall accuracy of the network?
Isn't that bad for both networks?
Especially when using synthetic data, shouldn't the default option be to use the box coords?

@amitdo
Collaborator

amitdo commented Jan 25, 2018

It uses a technique called CTC.

@amitdo
Collaborator

amitdo commented Jan 25, 2018

Here is the first paper to describe CTC used for text recognition (OCR):
A Novel Connectionist System for Unconstrained Handwriting Recognition (2009).
http://www.cs.toronto.edu/%7Egraves/tpami_2009.pdf
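
To illustrate why CTC removes the need for a separate char segmentation step, here is a minimal sketch of generic CTC best-path (greedy) decoding in Python. It is not Tesseract's implementation; the function name `ctc_greedy_decode`, the choice of index 0 for the blank label, and the toy probabilities are assumptions made for this example. The network emits one label distribution per time step across the whole line, and the transcription is recovered by merging repeated labels and dropping blanks, so no per-character box is ever required as a training target.

```python
from typing import List, Sequence

BLANK = 0  # index reserved for the CTC blank label (assumption of this sketch)


def ctc_greedy_decode(frame_probs: Sequence[Sequence[float]], alphabet: str) -> str:
    """Best-path decoding: argmax per frame, merge repeats, drop blanks."""
    # Most likely label index for every time step.
    best = [max(range(len(p)), key=lambda i: p[i]) for p in frame_probs]
    out: List[str] = []
    prev = BLANK
    for label in best:
        # Emit a char only when a new non-blank label starts.
        if label != BLANK and label != prev:
            out.append(alphabet[label - 1])
        prev = label
    return "".join(out)


# Toy example with alphabet "ab": columns are (blank, 'a', 'b').
frames = [
    [0.1, 0.8, 0.1],  # a
    [0.1, 0.7, 0.2],  # a (repeat, merged)
    [0.9, 0.05, 0.05],  # blank
    [0.2, 0.1, 0.7],  # b
    [0.1, 0.2, 0.7],  # b (repeat, merged)
    [0.8, 0.1, 0.1],  # blank separates the two b's
    [0.1, 0.1, 0.8],  # b
]
decoded = ctc_greedy_decode(frames, "ab")
```

The frames at which a label fires give only a rough horizontal position for each character, which is consistent with the estimated character boxes discussed in this thread.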

@theceday

I see, it is actually a nice one to use.
Here is another:
ftp://ftp.idsia.ch/pub/juergen/icml2006.pdf

it is hard to come up with a good nn for segmentation only anyway :)

@amitdo
Collaborator

amitdo commented Jan 25, 2018

ftp://ftp.idsia.ch/pub/juergen/icml2006.pdf

Same authors, from 2006, CTC for speech recognition.

@amitdo
Collaborator

amitdo commented May 9, 2018

@sreenathbh

Does the above discussion imply that there is no way to get correct coordinates for every word when using LSTM mode?

@Shreeshrii
Collaborator

correct coordinates for every word

May be possible.

It is not possible to get accurate coordinates for every character.

Try HOCR output.
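
As a rough sketch of how word-level boxes could be read from hOCR output, the Python snippet below collects the text and `bbox` of every `ocrx_word` span using only the standard library. It assumes the hOCR layout in which each word is a `span` with class `ocrx_word` and a `title` attribute containing `bbox x0 y0 x1 y1` (top-left origin). The class name `HocrWords` is made up for this example.

```python
import re
from html.parser import HTMLParser

BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class HocrWords(HTMLParser):
    """Collect (text, (x0, y0, x1, y1)) for every ocrx_word span."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._bbox = None  # bbox of the word span currently open, if any
        self._text = []
        self._depth = 0  # spans nested inside the current word span

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self._bbox is not None:
            if tag == "span":
                self._depth += 1
            return
        classes = (attrs.get("class") or "").split()
        if tag == "span" and "ocrx_word" in classes:
            match = BBOX_RE.search(attrs.get("title") or "")
            if match:
                self._bbox = tuple(int(v) for v in match.groups())
                self._text = []
                self._depth = 0

    def handle_data(self, data):
        if self._bbox is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if self._bbox is None or tag != "span":
            return
        if self._depth:
            self._depth -= 1
            return
        # Closing tag of the word span itself: store the word.
        self.words.append(("".join(self._text).strip(), self._bbox))
        self._bbox = None
```

Usage: create `parser = HocrWords()`, call `parser.feed(hocr_text)`, then read `parser.words`. Note that these coordinates use a top-left origin, unlike box files.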

@xwodas

xwodas commented Oct 2, 2018

Why is this not a bug? Accurate box files are a must for training. And the ability to train tesseract is one of its major strengths.

@amitdo
Collaborator

amitdo commented Oct 2, 2018

Accurate box files are a must for training.

Not for 4.0.0's lstm training.
#1276 (comment)

@stweil
Member

stweil commented Oct 2, 2018

Tesseract should warn users who want box files when they try to get them with LSTM. It currently does not, which has already caused several issue reports, so the missing warning needs to be fixed. Patches are welcome, but I don't think that's a reason to postpone 4.0.0.

@xwodas

xwodas commented Oct 2, 2018

Yes, please! And also, please add a hint about --oem 0 and the corresponding language files. I have used tesseract in sophisticated ways for many years. I still missed all this when I got 4.0 via a system upgrade. I just figured out what I had to change in my workflow so that it does not simply crash. But I totally missed that this is a completely re-designed algorithm that behaves differently in many ways.

@stweil
Member

stweil commented Oct 2, 2018

But I totally missed that this is a completely re-designed algorithm that behaves differently in many ways.

... and that the old ways are still available, but require additional work (like --oem 0 or getting the correct traineddata files).

@Shreeshrii
Collaborator

Shreeshrii commented Oct 2, 2018 via email

@stweil
Member

stweil commented Oct 2, 2018

I would not overload that help text, but I suggest enhancing the manual page. Is there a better term for the legacy Tesseract engine? If we avoid the exact revision number, we don't have to change it each time.

What about this text: Tesseract 4 adds a new neural net (LSTM) based OCR engine which is focused on line recognition, but also still supports the legacy Tesseract OCR engine of Tesseract 3 which works by recognizing character patterns. Compatibility with Tesseract 3 is enabled by --oem 0. It also needs traineddata files which support the legacy engine, for example those from the tessdata repository.

By the way: the man page currently lacks information on the new --dpi n. @zdenop, do we need that option at all, or isn't the config variable sufficient?

@amitdo
Collaborator

amitdo commented Oct 2, 2018

I believe that a very small share of our userbase reads man pages.

@xwodas

xwodas commented Oct 3, 2018

I read man pages, but only if I know that I am looking for something. So I would need some trigger first.

I liked the idea to throw out a warning if someone runs in NN mode and still requests box files. Or just stop and request to drop the box file request or use some additional override option.

I do not know how typical this behavior is, but I most often run tesseract from scripts that I have used for ten years, which is why I missed this altogether. That is why it took me months until it became annoying enough to look for the root cause. It continued to work "kind-of" and there was nothing that directly caught my eye.

@zdenop
Contributor

zdenop commented Oct 3, 2018

@stweil: the dpi warning message is IMO too common (based on testing images from several issue trackers), so the setting needs to be easily accessible to the user. For this reason I decided to implement it as an option of the tesseract app.

BTW: it would be great if a native English speaker could check and improve all docs, including the wiki...

@Shreeshrii
Collaborator

#1448

@sagimann commented 10 minutes ago

The problem is that with oem 0, OCR does not work well on non-solid backgrounds. The point is: if bboxes are not used by the line recognizer, what other kind of data is available to correctly locate a symbol on the image?

@Shreeshrii
Collaborator

@amitdo is there any way, using tesseract, to find the correct coordinates of characters while using the LSTM engine?

@amitdo
Collaborator

amitdo commented Oct 4, 2018

The bboxes are estimated. I don't think there is a way to make it more accurate with lstm.

There's also a known bug that sometimes causes the bbox to be way off from the real coordinates.

The pdf renderer might suffer from both the 'bug' and 'not a bug'.

@stweil
Member

stweil commented Jul 16, 2019

@smarq8, this should be fixed by pull request #2576. Please test and report your results.

@Shreeshrii
Collaborator

makebox output shows no overlap. Issue can be closed.

tesseract 1276.png  - -l eng  --tessdata-dir ~/tessdata_fast  --oem 1 --psm 6 makebox

P 122 475 155 525 0
r 158 475 182 513 0
z 183 475 211 512 0
e 213 474 245 513 0
p 248 458 283 513 0
r 287 475 311 513 0
a 313 474 343 513 0
s 346 474 372 513 0
z 375 475 402 512 0
a 404 474 435 513 0
m 439 475 491 513 0
y 470 458 504 526 0
! 494 459 542 526 0
P 332 368 365 419 0
r 368 368 393 406 0
o 394 366 430 407 0
s 433 367 459 406 0
z 462 368 489 406 0
e 492 350 524 406 0
. 528 368 542 382 0
N 302 238 357 311 0
- 364 261 387 273 0
N 394 238 449 311 0
i 426 237 463 316 0
e 459 238 478 316 0
. 482 237 553 294 0
N 54 145 96 201 0
i 104 145 118 205 0
e 121 144 158 187 0
m 178 145 237 187 0
a 241 144 275 187 0
s 295 144 324 187 0
p 328 125 367 187 0
r 370 145 399 187 0
a 401 144 433 187 0
w 437 145 496 186 0
y 475 125 512 187 0
. 498 126 551 186 0
W 268 28 351 94 0
o 315 26 367 99 0
l 352 26 398 79 0
n 405 28 419 99 0
a 426 28 469 78 0
! 474 27 536 95 0

@ravi289-97

Hi @amitdo , @Shreeshrii
I find that the box files we get via tesseract lstmbox have different coordinates for top and bottom. Please see the example below for details.
Let us say we have xyz.box for xyz.jpeg. Below are the contents of the box file, with box coordinates and page number:
A 17 58 162 70 0
B 17 58 162 70 0
[space] 17 58 162 70 0
C 17 58 162 70 0
D 17 58 162 70 0
[tab] 17 58 162 70 0

When I check the same thing in jTessBoxEditor by loading the same image, I get the following under the "Box-Coordinates" tab:
A 17 12 162 12 0
B 17 12 162 12 0
[space] 17 12 162 12 0
C 17 12 162 12 0
D 17 12 162 12 0
[tab] 17 12 162 12 0

However, when I navigate to the "Box Data" tab, the values are different again and look similar to the lstmbox output. I am wondering why the change affects only the "top" and "bottom" coordinates. From what I have read, lstmbox produces line-based information. But I can't use the returned coordinates to slice the image, as top and bottom differ for every box file. Please help. Thanks.

@amitdo
Collaborator

amitdo commented Aug 17, 2021

@nguyenq

Where can people ask questions (like the one above) for jTessBoxEditor? Here?

@ravi289-97

@amitdo: I think this is also relevant from the tesseract side, because lstmbox writes line-level bounding boxes to the box file, and their top and bottom values are not usable as they are. I just figured out that these values are measured relative to the input image height.
For the same example as above, let us say the image height is 82px. To crop the first line using the coordinates, I have to use the image height to convert them:
x = 82 - 58 = 24
diff = 70 - 58 = 12
correct bounding box values for the line = 17 (x - diff) 162 x
17 12 162 24

I am not sure why this is done when generating box files. Thank you.
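
The conversion worked out above can be written as a small Python helper. Tesseract box files measure y from the bottom edge of the image (fields are `left bottom right top`), while most image libraries crop with a top-left origin, so the vertical values have to be flipped against the image height. The function name `box_to_crop` is made up for this sketch.

```python
def box_to_crop(left, bottom, right, top, image_height):
    """Convert a box-file box (bottom-left origin) to a crop rectangle
    (left, upper, right, lower) with a top-left origin."""
    return left, image_height - top, right, image_height - bottom


# The example from this comment: box "17 58 162 70" on an 82px-high image.
crop = box_to_crop(17, 58, 162, 70, 82)
```

Here `crop` is `(17, 12, 162, 24)`, matching the values computed by hand above.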

@amitdo
Collaborator

amitdo commented Aug 17, 2021

@nguyenq
Contributor

nguyenq commented Aug 17, 2021

@amitm02 @ravi289-97 Yes, if you have any question regarding jTessBoxEditor, you can post at the project's Issues page.

@ravi289-97

@nguyenq : Sure, Will do. Thanks
