wrong coordinates in .box file with LSTM #1276

smarq8 · 2018-01-15T17:48:27Z

When I run Tesseract with the LSTM engine (oem=2), the coordinates in the box file look wrong. The same command with oem=0 gives correct coordinates, but the OCR result is less accurate, even though I fully cleaned the images at high resolution before processing (see images below).

my example code:
"C:\Program Files (x86)\Tesseract-OCR\tesseract.exe" --tessdata-dir "C:\Program Files (x86)\Tesseract-OCR\tessdata" -l pol --oem 2 --psm 6 -c tessedit_create_boxfile=1 -c tessedit_create_hocr=1 -c tessedit_create_tsv=1 -c tessedit_create_txt=1 "D:\x\ClearedText\tesseract\oem0_psm6_20180114221528\fl.txt" "D:\x\ClearedText\tesseract\oem0_psm6_20180114221528\tess"

platform:
W7U x64
tesseract v4.00.00a


Shreeshrii · 2018-01-15T17:52:38Z

Try with traineddata from tessdata_best and tessdata_fast with --oem 1.

Shreeshrii · 2018-01-15T17:54:27Z

Also, LSTM mode is a line recognizer. I don't think it is meant to be accurate for character-level boxes.

smarq8 · 2018-01-15T18:19:47Z

When I try to use best or fast, I get this error:

lstm_recognizer_->DeSerialize(&fp):Error:Assert failed:in file ../../../../ccmain/tessedit.cpp, line 193

amitdo · 2018-01-15T21:13:12Z

What matters most is the recognition of text from images.

IMHO, accurate location of individual glyphs is not a very important feature.

LSTM mode is a line recognizer. I don't think it is meant to be accurate for character-level boxes.

I believe Shree is right here.

Unlike the lstm engine, the legacy engine works on a glyph level.

So AFAIK this issue is not a bug.

amitdo · 2018-01-15T21:21:13Z

When I try to use best or fast, I get this error:

lstm_recognizer_->DeSerialize(&fp):Error:Assert failed:in file ../../../../ccmain/tessedit.cpp, line 193

Use the latest code in the master.

amitm02 · 2018-01-19T07:56:23Z

Does that mean that when I fine-tune Tesseract 4 (LSTM) on scanned images, I should ignore the locations in the box file and only fix the recognised characters?
If LSTM works on a line level, how does it use the "character based" box files?

amitdo · 2018-01-19T11:16:45Z

If LSTM works on a line level, how does it use the "character based" box files?

Basically, what the LSTM engine really needs as input is line bounding boxes and separated graphemes (or grapheme clusters).

Still, currently only the box format is supported :(

amitm02 · 2018-01-19T12:12:04Z

Thanks @amitdo. Obviously the Tesseract LSTM engine has been trained successfully, and a box file made of individual characters is one of the main sub-steps. So what currently happens with the box file? Does Tesseract treat every character as its own “line”, or does it somehow combine all the characters between two EOLs to generate a line bounding box for them?

amitdo · 2018-01-19T13:01:22Z

... or does it somehow combine all the characters between two EOLs to generate a line bounding box for them?

It combines the char boxes between tab (EOL) entries into one line box. The chars themselves are kept separate.
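
To make the merging step concrete, here is a minimal Python sketch of the idea: symbols are collected until a tab entry is reached, and the union of their boxes becomes the line box. It only illustrates the behaviour described above and is not Tesseract's actual code; the names `parse_box_line` and `line_boxes` are made up for this example, and it assumes the usual box-file layout `symbol left bottom right top page`.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # left, bottom, right, top (origin at bottom-left)


def parse_box_line(line: str) -> Tuple[str, Box, int]:
    """Split one box-file line into (symbol, box, page).

    The symbol can itself be a space or a tab, so split from the right.
    """
    symbol, left, bottom, right, top, page = line.rstrip("\r\n").rsplit(" ", 5)
    return symbol, (int(left), int(bottom), int(right), int(top)), int(page)


def line_boxes(lines: List[str]) -> List[Tuple[str, Box]]:
    """Group symbols between tab entries and return (text, union box) per line."""
    result: List[Tuple[str, Box]] = []
    text = ""
    boxes: List[Box] = []

    def flush() -> None:
        # Union of all symbol boxes collected for the current text line.
        if boxes:
            union = (
                min(b[0] for b in boxes),
                min(b[1] for b in boxes),
                max(b[2] for b in boxes),
                max(b[3] for b in boxes),
            )
            result.append((text, union))

    for raw in lines:
        if not raw.strip("\r\n"):
            continue  # skip empty lines
        symbol, box, _page = parse_box_line(raw)
        if symbol == "\t":  # tab entry marks the end of a text line
            flush()
            text, boxes = "", []
        else:
            text += symbol
            boxes.append(box)
    flush()  # last line if the file does not end with a tab entry
    return result
```

In this sketch the text of each line is kept symbol by symbol, while only the merged rectangle is used as geometry, which is one way to read "the chars themselves are kept separate".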

amitm02 · 2018-01-19T13:30:53Z

I’m not sure I understand. If the LSTM trains on the “combined line box”, what do you mean by “the chars themselves are kept separated”?
Does that mean I can ignore the exact character coordinates as long as they form a reasonable line box when combined (e.g. if a char coordinate does not fully enclose the char)?

amitdo · 2018-01-19T13:48:56Z

Does that mean I can ignore the exact character coordinates as long as they form a reasonable line box when combined (e.g. if a char coordinate does not fully enclose the char)?

I believe the answer is 'yes', but I didn't try it yet.

Make the first and last boxes accurate. Also change one char box so that its top and bottom coordinates are used for the whole line.

Please report if this trick works.

amitm02 · 2018-01-19T13:53:01Z

I will try and report


amitdo · 2018-01-19T13:55:13Z

You must keep the tab as the line separator.
Also, don't drop the word separators (one space char).

amitm02 · 2018-01-23T07:06:39Z

Reporting back.
As far as I can tell from the code (void Tesseract::TrainFromBoxes), the behaviour is indeed as @amitdo described.
I wrote an ugly Python script to generate a box file from the Tesseract TSV result so I won't need to insert the spaces and tabs. All seems to work fine.
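
The script itself was not posted in this thread. As a rough idea of what such a conversion could look like, here is a hypothetical Python sketch. It assumes the standard Tesseract TSV header (`level`, `page_num`, ..., `left`, `top`, `width`, `height`, `conf`, `text`), where level 1 rows describe the page, level 4 rows a text line and level 5 rows a word. It gives every symbol the box of its text line, flips the y axis because box files count from the bottom of the image, and inserts the space and tab entries automatically. The function name `tsv_to_line_boxes` is invented for this example.

```python
import csv
import io
from typing import List


def tsv_to_line_boxes(tsv_text: str) -> List[str]:
    """Turn Tesseract TSV text into box-file lines that reuse each line's box."""
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    out: List[str] = []
    page_height = 0
    coords = ""        # "left bottom right top page" of the current text line
    words_in_line = 0  # number of words already written for the current line

    for row in reader:
        level = int(row["level"])
        if level == 1:
            # The page row carries the image size.
            page_height = int(row["height"])
        elif level == 4:
            # A new text line starts: close the previous one with a tab entry.
            if words_in_line:
                out.append(f"\t {coords}")
            left, top = int(row["left"]), int(row["top"])
            right = left + int(row["width"])
            lower = top + int(row["height"])
            page = int(row["page_num"]) - 1  # box files count pages from 0
            # TSV uses a top-left origin, box files a bottom-left origin.
            coords = f"{left} {page_height - lower} {right} {page_height - top} {page}"
            words_in_line = 0
        elif level == 5:
            word = (row["text"] or "").strip()
            if not word:
                continue
            if words_in_line:
                out.append(f"  {coords}")  # one space entry between words
            out.extend(f"{ch} {coords}" for ch in word)
            words_in_line += 1

    if words_in_line:
        out.append(f"\t {coords}")
    return out
```

The recognised text still has to be corrected by hand afterwards; the sketch only takes care of the geometry and the separators.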

The one thing I'm a bit worried about is cases where a word (or a line) has mixed-language chars in it,
e.g.

בתאריך-10.10.2000

It seems that the char order in the box file should be as the chars appear on the page from left to right, i.e. the first char is "1" and the last is "ב".
However, the code appends all the chars in the line to a single string. In most programming languages this results in a string whose first char is "ב" and whose last is "1".
I was unable to figure out from the code whether there is a mismatch here that will cause Tesseract to train on a badly ordered string.

Shreeshrii · 2018-01-23T08:40:48Z

Python script to generate a box file from the Tesseract TSV result so I won't need to insert the spaces and tabs. All seems to work fine.

@amitm02 You may want to share it, as a number of people would like to use the training-from-images option.

amitdo · 2018-01-23T11:17:34Z

Another trick that can help you is to use text2image with just one font. Take the box file it produces and 'fix' the boxes with your script.

Shreeshrii · 2018-01-25T04:57:15Z

@amitm02 Please see the thread at #648 (comment) for how Arabic and other RTL languages are handled.

amitm02 · 2018-01-25T10:20:47Z

@Shreeshrii, thanks.
I think they made a good call by going strictly LTR in the training. Things can get amazingly complex when it comes to mixed-language text: link

theceday · 2018-01-25T15:10:32Z

I am confused about something here. How is the char segmentation layer trained?
Does it use the overall accuracy of the network?
Isn't that bad for both networks?
Especially when using synthetic data, shouldn't the default option be to use the box coords?

amitdo · 2018-01-25T15:30:26Z

It uses a technique called CTC.

amitdo · 2018-01-25T15:57:33Z

Here is the first paper to describe CTC used for text recognition (OCR):
A Novel Connectionist System for Unconstrained Handwriting Recognition (2009).
http://www.cs.toronto.edu/%7Egraves/tpami_2009.pdf
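
For readers who have not met CTC before: the network emits one label (or a special "blank") per time step along the line, and the final text is obtained by collapsing repeated labels and dropping blanks. Training compares only this collapsed text with the target line text, so no per-character position is required. The following Python sketch shows the simplest decoding rule (best path, or greedy decoding). It is a generic illustration of CTC, not Tesseract's decoder; `ctc_greedy_decode` and the convention that index 0 is the blank are assumptions of this example.

```python
from typing import Sequence

BLANK = 0  # assumed index of the CTC blank label in each score vector


def ctc_greedy_decode(frame_scores: Sequence[Sequence[float]], alphabet: str) -> str:
    """Best-path CTC decoding.

    frame_scores[t][k] is the score of label k at time step t, where label 0
    is the blank and label k > 0 stands for alphabet[k - 1].
    """
    # Pick the highest-scoring label at every time step.
    best = [max(range(len(scores)), key=scores.__getitem__) for scores in frame_scores]

    chars = []
    previous = BLANK
    for label in best:
        # Emit a character only when the label changes and is not blank.
        if label != previous and label != BLANK:
            chars.append(alphabet[label - 1])
        previous = label
    return "".join(chars)
```

Because several consecutive time steps can map to the same output character, and blanks can appear anywhere in between, the time step at which a character is emitted only loosely indicates where that character sits in the image. This is consistent with the character boxes being estimates.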

theceday · 2018-01-25T16:42:18Z

I see, it is actually a nice one to use.
Here is another:
ftp://ftp.idsia.ch/pub/juergen/icml2006.pdf

it is hard to come up with a good nn for segmentation only anyway :)

amitdo · 2018-01-25T16:59:05Z

ftp://ftp.idsia.ch/pub/juergen/icml2006.pdf

Same authors, from 2006, CTC for speech recognition.

amitdo · 2018-05-09T20:33:11Z

tesseract-ocr/langdata#83 (comment)

sreenathbh · 2018-05-31T16:43:19Z

Does the above discussion imply that there is no way to get correct coordinates for every word when using LSTM mode?

Shreeshrii · 2018-05-31T17:09:19Z

correct coordinates for every word

May be possible.

It is not possible to get accurate coordinates for every character.

Try HOCR output.
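
As a hedged illustration of how word boxes could be read from hOCR output (produced, for example, with `-c tessedit_create_hocr=1` as in the original command), here is a small Python sketch. It assumes word elements of the form `<span class='ocrx_word' ... title='bbox x0 y0 x1 y1; ...'>text</span>` with `class` appearing before `title`; a real HTML parser would be more robust. The name `hocr_word_boxes` is made up for this example.

```python
import html
import re
from typing import List, Tuple

# Matches one ocrx_word span and captures its bbox and inner markup.
WORD_RE = re.compile(
    r"<span[^>]*class=['\"]ocrx_word['\"][^>]*"
    r"title=['\"]bbox (\d+) (\d+) (\d+) (\d+)[^'\"]*['\"][^>]*>(.*?)</span>",
    re.DOTALL,
)


def hocr_word_boxes(hocr: str) -> List[Tuple[str, Tuple[int, int, int, int]]]:
    """Return (word, (x0, y0, x1, y1)) pairs; hOCR uses a top-left origin."""
    words = []
    for x0, y0, x1, y1, inner in WORD_RE.findall(hocr):
        # Drop nested markup such as <strong> or <em>, then decode entities.
        text = html.unescape(re.sub(r"<[^>]+>", "", inner)).strip()
        words.append((text, (int(x0), int(y0), int(x1), int(y1))))
    return words
```

Note that hOCR coordinates are measured from the top-left corner of the image, unlike box files, which are measured from the bottom-left corner.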

xwodas · 2018-10-02T14:13:03Z

Why is this not a bug? Accurate box files are a must for training. And the ability to train tesseract is one of its major strengths.

amitdo · 2018-10-02T14:24:27Z

Accurate box files are a must for training.

Not for 4.0.0's lstm training.
#1276 (comment)

stweil · 2018-10-02T14:55:10Z

Tesseract should warn users who want box files when they try to get them with LSTM. It currently does not, which has already caused several issue reports, so the missing warning needs to be added. Patches are welcome, but I don't think that's a reason to postpone 4.0.0.

xwodas · 2018-10-02T15:23:58Z

Yes, please! And also, please point users to --oem 0 and the corresponding language files. I have used Tesseract in sophisticated ways for many years, and I still missed all this when I got 4.0 via a system upgrade. I just figured out what I had to change in my workflow so that it no longer crashed. But I totally missed that this is a completely re-designed algorithm that behaves differently in many ways.

stweil · 2018-10-02T15:47:59Z

But I totally missed that this is a completely re-designed algorithm that behaves differently in many ways.

... and that the old ways are still available, but require additional work (like --oem 0 or getting the correct traineddata files).

Shreeshrii · 2018-10-02T18:07:12Z

@stweil Would it be appropriate to add a couple of lines to `tesseract --help` before Usage to inform users of this? Tesseract 4.0.0 provides a neural net based LSTM engine in addition to the legacy Tesseract engine. Users wanting compatibility with Tesseract 3.0x should use `--oem 0` with traineddata files from the `tessdata` repository.


stweil · 2018-10-02T20:06:52Z

I would not overload that help text, but suggest enhancing the manual page. Is there a better term for the legacy Tesseract engine? If we avoid the exact revision number, we don't have to change it each time.

What about this text: Tesseract 4 adds a new neural net (LSTM) based OCR engine which is focused on line recognition, but also still supports the legacy Tesseract OCR engine of Tesseract 3 which works by recognizing character patterns. Compatibility with Tesseract 3 is enabled by --oem 0. It also needs traineddata files which support the legacy engine, for example those from the tessdata repository.

By the way: the man page currently lacks information on the new --dpi n. @zdenop, do we need that option at all, or isn't the config variable sufficient?

amitdo · 2018-10-02T21:23:43Z

I believe that only a very small share of our user base reads man pages.

xwodas · 2018-10-03T01:57:21Z

I read man pages, but only if I know that I am looking for something. So I would need some trigger first.

I liked the idea to throw out a warning if someone runs in NN mode and still requests box files. Or just stop and request to drop the box file request or use some additional override option.

I do not know how typical this behavior is, but I mostly run Tesseract from scripts that I have used for ten years, which is why I missed this altogether. It took me months until it became annoying enough to look for the root cause. It continued to work "kind of" and nothing caught my eye directly.

zdenop · 2018-10-03T08:46:25Z

@stweil : The dpi warning message is IMO too common (based on testing images from several issue tracker reports), so the setting needs to be easily accessible to the user. For this reason I decided to implement it as an option of the tesseract app.

BTW: it would be great if a native English speaker could check and improve all docs, including the wiki...

Shreeshrii · 2018-10-03T15:00:59Z

#1448

@sagimann commented 10 minutes ago

The problem is that with oem 0, OCR does not work well with non-solid backgrounds. The point is: if bboxes are not used by the line recognizer, what other kind of data is available to correctly locate a symbol on the image?

Shreeshrii · 2018-10-04T16:48:14Z

@amitdo is there any way to use Tesseract to find the correct coordinates of characters while using the LSTM engine?

amitdo · 2018-10-04T16:59:48Z

The bboxes are estimated. I don't think there is a way to make them more accurate with LSTM.

There's also a known bug that sometimes causes the bbox to be way off from the real coordinates.

The pdf renderer might suffer from both the 'bug' and 'not a bug'.

stweil · 2019-07-16T10:21:56Z

@smarq8, this should be fixed by pull request #2576. Please test and report your results.

Shreeshrii · 2019-07-19T10:23:24Z

makebox output shows no overlap. Issue can be closed.

tesseract 1276.png - -l eng --tessdata-dir ~/tessdata_fast --oem 1 --psm 6 makebox

P 122 475 155 525 0
r 158 475 182 513 0
z 183 475 211 512 0
e 213 474 245 513 0
p 248 458 283 513 0
r 287 475 311 513 0
a 313 474 343 513 0
s 346 474 372 513 0
z 375 475 402 512 0
a 404 474 435 513 0
m 439 475 491 513 0
y 470 458 504 526 0
! 494 459 542 526 0
P 332 368 365 419 0
r 368 368 393 406 0
o 394 366 430 407 0
s 433 367 459 406 0
z 462 368 489 406 0
e 492 350 524 406 0
. 528 368 542 382 0
N 302 238 357 311 0
- 364 261 387 273 0
N 394 238 449 311 0
i 426 237 463 316 0
e 459 238 478 316 0
. 482 237 553 294 0
N 54 145 96 201 0
i 104 145 118 205 0
e 121 144 158 187 0
m 178 145 237 187 0
a 241 144 275 187 0
s 295 144 324 187 0
p 328 125 367 187 0
r 370 145 399 187 0
a 401 144 433 187 0
w 437 145 496 186 0
y 475 125 512 187 0
. 498 126 551 186 0
W 268 28 351 94 0
o 315 26 367 99 0
l 352 26 398 79 0
n 405 28 419 99 0
a 426 28 469 78 0
! 474 27 536 95 0

ravi289-97 · 2021-08-17T11:22:52Z

Hi @amitdo , @Shreeshrii
I find that the box files we get via tesseract lstmbox have different coordinates for top and bottom. Please see the example below for details.
Let us say we have an xyz.box file for xyz.jpeg. Below are the contents of the box file, with box coordinates and page number:
A 17 58 162 70 0
B 17 58 162 70 0
[space] 17 58 162 70 0
C 17 58 162 70 0
D 17 58 162 70 0
[tab] 17 58 162 70 0

When I check the same thing in jTessBoxEditor by loading the same image, I get the following under the "Box-Coordinates" tab:
A 17 12 162 12 0
B 17 12 162 12 0
[space] 17 12 162 12 0
C 17 12 162 12 0
D 17 12 162 12 0
[tab] 17 12 162 12 0

However, when I navigate to the "Box Data" tab, the values are different again and look similar to the lstmbox output. I am wondering why the change affects only the "top" and "bottom" coordinates. From what I have read, lstmbox reports information based on lines. But I can't use the returned coordinates to slice the image, as top and bottom differ for every box file. Please help. Thanks.

amitdo · 2021-08-17T11:34:07Z

@nguyenq

Where can people ask questions (like the one above) for jTessBoxEditor? Here?

ravi289-97 · 2021-08-17T13:41:25Z

@amitdo : I think this is also relevant from the Tesseract perspective, as using lstmbox to generate box files gives line-level bounding box coordinates in which the top and bottom values are not directly usable. I just figured out that these values are measured relative to the input image height.
For the same example as above: let us say the height of the image is 82px. To crop the first line using these coordinates, I have to use the image height to convert them to proper values.
x=82-58=24
diff=70-58=12
correct bounding box values for the line = 17 x-diff 162 x
17 12 162 24

I am not sure why this is done when generating box files. Thank you
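
The arithmetic above follows from the box-file convention: each entry is `symbol left bottom right top page`, with the origin at the bottom-left corner of the image, while most imaging tools use a top-left origin. A minimal Python sketch of the conversion (the function name `box_to_image_coords` is chosen for this example only):

```python
from typing import Tuple


def box_to_image_coords(
    left: int, bottom: int, right: int, top: int, image_height: int
) -> Tuple[int, int, int, int]:
    """Convert a box-file rectangle (bottom-left origin) to
    (left, upper, right, lower) with a top-left origin."""
    # Only the y values change: they are mirrored around the image height.
    return left, image_height - top, right, image_height - bottom


# The example from the comment above: image height 82, box "17 58 162 70".
print(box_to_image_coords(17, 58, 162, 70, image_height=82))
```

The returned tuple is in the (left, upper, right, lower) order that, for instance, Pillow's `Image.crop` expects, so it can be used directly to slice the line out of the image.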

amitdo · 2021-08-17T14:21:29Z

Here is the relevant code:

https://github.com/tesseract-ocr/tesseract/blob/6ee69db22cc2693e/src/api/lstmboxrenderer.cpp#L30

nguyenq · 2021-08-17T18:28:36Z

@amitm02 @ravi289-97 Yes, if you have any question regarding jTessBoxEditor, you can post at the project's Issues page.

ravi289-97 · 2021-08-18T04:34:11Z

@nguyenq : Sure, Will do. Thanks

amitdo mentioned this issue Mar 22, 2018

Updated langdata tesseract-ocr/langdata#83

Open

Shreeshrii mentioned this issue Apr 3, 2018

Bad .box in tesseract 4 #1448

Closed

amitdo mentioned this issue Apr 30, 2018

psm 3 and psm 6 skip different parts of text based on font size #538

Open

amitdo mentioned this issue Jun 22, 2018

text2image fails to generete box file when enable --find_fonts, not supported on multilingual text #1685

Closed

stweil added feature request help wanted labels Oct 2, 2018

Shreeshrii mentioned this issue Oct 4, 2018

Update man page and readme reg two OCR engines in Tesseract 4 #1941

Merged

Shreeshrii mentioned this issue Jul 16, 2019

Wrong coordinates on character level #2521

Closed

stweil added the accuracy label Jul 16, 2019

stweil mentioned this issue Jul 16, 2019

Implemented improved character bounding box algorithm #2576

Merged

zdenop closed this as completed Jul 19, 2019

amitdo added bounding box and removed help wanted labels Mar 19, 2021
