Incorrect bounding boxes #2264

Closed
clarkk opened this issue Feb 23, 2019 · 23 comments


clarkk commented Feb 23, 2019

I use the latest release, 4.1, with the LSTM engine only and the best traineddata files

https://github.com/tesseract-ocr/tesseract/archive/4.1.0-rc1.tar.gz

Just to give an example

The two amounts to the right, 528,00 and 72,00, overlap each other in the OCR results but do not overlap in the input image

Here is a link to the preprocessed image (tiff) before sending it to tesseract
https://imgur.com/a/12qqobk

Their boxes intersect vertically by 10 px (1353 - 1343) even though the amounts are far from each other

Bounding box for 528,00:

[top] => 1317
[bottom] => 1353
[left] => 2089
[right] => 2218
[width] => 129
[height] => 36
[value] => 528,00
[conf] => 96.28

Bounding box for 72,00:

[top] => 1343
[bottom] => 1408
[left] => 2112
[right] => 2211
[width] => 99
[height] => 65
[value] => 72,00
[conf] => 96.87

pdf_image-00
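
The reported overlap can be checked directly from the two boxes above. The following is only a minimal sketch: the `Box` struct and the helper names are illustrative and are not part of the Tesseract API, and the coordinates are assumed to be image pixels with the origin at the top-left.

```cpp
#include <algorithm>
#include <cassert>

// Axis-aligned box in image coordinates (origin top-left, y grows downward).
// Hypothetical helper type, not a Tesseract type.
struct Box {
    int left, top, right, bottom;
};

// Length of the overlap of two 1-D intervals [a0, a1] and [b0, b1]; 0 if disjoint.
int interval_overlap(int a0, int a1, int b0, int b1) {
    assert(a0 <= a1 && b0 <= b1);
    return std::max(0, std::min(a1, b1) - std::max(a0, b0));
}

int vertical_overlap(const Box& a, const Box& b) {
    return interval_overlap(a.top, a.bottom, b.top, b.bottom);
}

int horizontal_overlap(const Box& a, const Box& b) {
    return interval_overlap(a.left, a.right, b.left, b.right);
}

// Two boxes intersect only if they overlap on both axes.
bool boxes_intersect(const Box& a, const Box& b) {
    return vertical_overlap(a, b) > 0 && horizontal_overlap(a, b) > 0;
}

// Values copied from the report above.
const Box kBox528{2089, 1317, 2218, 1353};
const Box kBox72{2112, 1343, 2211, 1408};
```

With these values, `vertical_overlap(kBox528, kBox72)` gives the 10 px mentioned above (1353 - 1343), and the boxes also overlap horizontally, so `boxes_intersect` reports an intersection.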

Author

clarkk commented Feb 23, 2019

It's already copy/pasted above the image :)

Contributor

zdenop commented Feb 23, 2019

We do not support third-party software/projects: please provide a test case with the tesseract executable or simple C/C++ test code.

Author

clarkk commented Feb 23, 2019

I just use the API.. I have used an example from the wiki

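#include <cstdio>
#include <string>
#include <leptonica/allheaders.h>
#include <tesseract/baseapi.h>

int main(int argc, char **argv){
	if(argc < 2){
		std::fprintf(stderr, "Usage: %s <image>\n", argv[0]);
		return 1;
	}
	const std::string input = argv[1];

	// Open input image with leptonica library
	Pix *image = pixRead(input.c_str());
	if(!image){
		std::fprintf(stderr, "Could not read image\n");
		return 1;
	}

	// Initialize tesseract-ocr, without specifying tessdata path
	tesseract::TessBaseAPI *api = new tesseract::TessBaseAPI();
	if(api->Init(NULL, "dan+eng", tesseract::OEM_LSTM_ONLY)){
		std::fprintf(stderr, "Could not initialize tesseract\n");
		delete api;
		pixDestroy(&image);
		return 1;
	}
	api->SetImage(image);
	api->Recognize(0);

	// Assumption: iterate at word level
	const tesseract::PageIteratorLevel level = tesseract::RIL_WORD;
	tesseract::ResultIterator *ri = api->GetIterator();

	if(ri != 0){
		do{
			const char *seg = ri->GetUTF8Text(level);

			if(seg && seg[0]){
				int x1, y1, x2, y2;
				// BoundingBox returns false if there is no box at this level
				if(ri->BoundingBox(level, &x1, &y1, &x2, &y2)){
					std::printf("[%d, %d, %d, %d] conf=%.2f %s\n", x1, y1, x2, y2, ri->Confidence(level), seg);
				}
			}

			// GetUTF8Text allocates the string; the caller must free it
			delete[] seg;
		}
		while(ri->Next(level));
		delete ri;
	}

	// Destroy used objects and release memory
	api->End();
	delete api;
	pixDestroy(&image);
	return 0;
}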

Author

clarkk commented Feb 24, 2019

And something more.. 7 is recognized as /

In the footer cvr-nr. 29746397 is recognized as cvr-nr. 29/4639/

Contributor

zdenop commented Feb 24, 2019

Check your file https://imgur.com/a/12qqobk - it seems like a problem with your preprocessing and not tesseract ;-)

Author

clarkk commented Feb 24, 2019

ohh.. my bad :)

Contributor

zdenop commented Feb 24, 2019

Based on my test, it is caused by dan: if you use eng only, you get correct results:
84 confidence: 96.2841796875 - [2089, 1317, 2218, 1353]; 528,00
89 confidence: 96.73041534423828 - [2112, 1343, 2218, 1408]; 72,00
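
One way to quantify how far the eng-only boxes are from the boxes in the original report is intersection-over-union (IoU). The sketch below assumes the bracketed values are `[left, top, right, bottom]`; the `Box` struct and function names are illustrative and not part of the Tesseract API.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical helper type: axis-aligned box in image pixels.
struct Box {
    int left, top, right, bottom;
};

long area(const Box& b) {
    assert(b.left <= b.right && b.top <= b.bottom);
    return static_cast<long>(b.right - b.left) * (b.bottom - b.top);
}

// Area shared by both boxes; 0 if they do not overlap on both axes.
long intersection_area(const Box& a, const Box& b) {
    const int w = std::min(a.right, b.right) - std::max(a.left, b.left);
    const int h = std::min(a.bottom, b.bottom) - std::max(a.top, b.top);
    return (w > 0 && h > 0) ? static_cast<long>(w) * h : 0;
}

// Intersection-over-union: 1.0 for identical boxes, 0.0 for disjoint ones.
double iou(const Box& a, const Box& b) {
    const long inter = intersection_area(a, b);
    const long uni = area(a) + area(b) - inter;
    return uni > 0 ? static_cast<double>(inter) / uni : 0.0;
}

// dan+eng values from the original report.
const Box kDan528{2089, 1317, 2218, 1353};
const Box kDan72{2112, 1343, 2211, 1408};
// eng-only values from the two lines above.
const Box kEng528{2089, 1317, 2218, 1353};
const Box kEng72{2112, 1343, 2218, 1408};
```

With these numbers, the 528,00 box is identical in both runs, and the 72,00 box differs only in its right edge (2211 vs 2218).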

Author

clarkk commented Feb 25, 2019

Ok, but how would that affect my results when the language is Danish?

Contributor

zdenop commented Feb 25, 2019

This is really strange, but I played with my test case and now I am not able to reproduce my results from #2264 (comment)...

Author

clarkk commented Feb 25, 2019

strange.. but I quite often find that the bounding box is incorrect

Contributor

zdenop commented Mar 3, 2019

OK. I reproduced my results:
It is with lang dan, oem OEM_DEFAULT and the original image:
image

The errors (you reported) are only visible on your preprocessed image:
image

So the problem seems to be your preprocessing (using jpeg?)

Author

clarkk commented Mar 5, 2019

@zdenop

What does OEM_DEFAULT affect?

Will it use the legacy engine + LSTM engine?

If so.. do I then have to download the legacy traineddata too?

Contributor

zdenop commented Mar 5, 2019

Just a quick reply: I use the tessdata_best repository and I did not get any error.
AFAIR, if you want to use the legacy engine, you should use the tessdata repository

Author

clarkk commented Mar 5, 2019

ok thanks.. but why have you linked this issue with "Tesseract 4.0 hangs when processing a particular image #2288"?

This issue is only about incorrect bounding boxes

I have another example where the bounding box is incorrect

And I could go on :) There really is a problem with the bounding boxes. In 50% of all results, one or more bounding boxes are off, not by a little but by a lot. The example below has at least one bounding box that is off by 100 pixels

tesseract::TessBaseAPI *api = new tesseract::TessBaseAPI();
if(api->Init(NULL, "dan+eng", tesseract::OEM_DEFAULT)){

The last column

[top] => 1364
[bottom] => 1400
[left] => 2085
[right] => 2215
[width] => 130
[height] => 36
[value] => 273,60
[conf] => 96.83

[top] => 1414
[bottom] => 1450
[left] => 2084
[right] => 2215
[width] => 131
[height] => 36
[value] => 410,40
[conf] => 93.13

[top] => 1460
[bottom] => 1505
[left] => 2085
[right] => 2213
[width] => 128
[height] => 45
[value] => 410,40
[conf] => 96.2

[top] => 1514
[bottom] => 1550
[left] => 2086
[right] => 2214
[width] => 128
[height] => 36
[value] => 309,60
[conf] => 96.65

[top] => 1562
[bottom] => 1600
[left] => 2086
[right] => 2301 <--- THIS IS OFF BY 100 PIXEL
[width] => 215
[height] => 38
[value] => 309,60
[conf] => 29.39

pdf_image-00 png
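
A box like the last one can be flagged automatically by comparing each right edge against the rest of the column. This is a hypothetical post-processing heuristic, not something Tesseract provides: the `Box` struct, the function names and the 20 px tolerance are all assumptions made for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <vector>

// Hypothetical helper type: axis-aligned box in image pixels.
struct Box {
    int left, top, right, bottom;
};

// Median of the right edges (upper median for even counts).
int median_right(const std::vector<Box>& boxes) {
    assert(!boxes.empty());
    std::vector<int> rights;
    for (const Box& b : boxes) rights.push_back(b.right);
    std::nth_element(rights.begin(), rights.begin() + rights.size() / 2, rights.end());
    return rights[rights.size() / 2];
}

// Indices of boxes whose right edge deviates from the column median
// by more than `tolerance` pixels.
std::vector<std::size_t> right_edge_outliers(const std::vector<Box>& boxes, int tolerance) {
    std::vector<std::size_t> outliers;
    if (boxes.empty()) return outliers;
    const int median = median_right(boxes);
    for (std::size_t i = 0; i < boxes.size(); ++i) {
        if (std::abs(boxes[i].right - median) > tolerance) outliers.push_back(i);
    }
    return outliers;
}

// The five boxes of the last column, copied from the listing above.
const std::vector<Box> kLastColumn{
    {2085, 1364, 2215, 1400},
    {2084, 1414, 2215, 1450},
    {2085, 1460, 2213, 1505},
    {2086, 1514, 2214, 1550},
    {2086, 1562, 2301, 1600},
};
```

For the column above, the median right edge is 2215, so `right_edge_outliers(kLastColumn, 20)` flags only the last box (right edge 2301), which is also the one with the low confidence of 29.39.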

Author

clarkk commented Mar 5, 2019

Another question.. Isn't there an option to tell the OCR engine to recognize ONLY horizontal text, without trying to autodetect the orientation or recognize vertical text?

Contributor

zdenop commented Mar 5, 2019

This is not a support page! Please respect the guidelines for posting issues: use the tesseract user forum for questions/support.

@zdenop zdenop closed this as completed Mar 5, 2019
Author

clarkk commented Mar 5, 2019

I'm so sorry about that

But what about the bug?

Author

clarkk commented Mar 5, 2019

That was pretty arrogant...

You also just closed this issue, which obviously was a bug (2-3 of the most active contributors to this project even confirmed it was a bug)

#2103

Contributor

zdenop commented Mar 6, 2019

No, it was not arrogant:
#2103 is not a bug in tesseract, but wrong usage of the API, which is a bug in YOUR code.
And the situation repeats here once again: I proved that the problem is not in tesseract but in your preprocessing.
You are asking for free support (by calling it a "bug") to fix your business problem. So who is acting arrogantly? Not me.

You can ask for support on the user forum. Maybe somebody will be willing to help you for free. There are also several (paid) developers who did exactly what you are trying to do. But they will not share their knowledge for free.

Author

clarkk commented Mar 6, 2019

But why doesn't it output the correct bounding box with the LSTM engine only? There must be some inconsistency in the code..

The second example shows the same problem with incorrect bounding boxes, and there I use OEM_DEFAULT as you suggested..

Author

clarkk commented Mar 6, 2019

And about the API code that returns 1: why do the most active contributors confirm that it's not possible to handle the cache from the API to avoid this?

I'm not asking for free support directly.. Just whether you had a quick workaround (a command line parameter)

And about the API.. Why isn't it clear from the API docs how to initialize tesseract correctly..? The error occurs while the API is being initialized.. not in my code

If it really isn't a bug, then there should be an example of how to initialize the API correctly to handle internal tesseract errors..

Contributor

zdenop commented Mar 6, 2019

Did you test it on the original image (not preprocessed) as I showed you above?

Note: there are known issues with coordinates (at least I did not re-test them with current code), e.g. #1712 #2024 #1192, but I expect you read and analyzed them before submitting the issue.

Author

clarkk commented Mar 6, 2019

I can see there is a difference between the outputs (original vs preprocessed), but there is no distortion between the two boxes that connects the two lines.. yet they intersect in the output

Of course the images are not completely equal.. There is a small variation in the pixels.. but not a variation that should change the bounding boxes in the manner it does..

Telling users that only one specific image would work (or to use a different parameter which in theory does the same work) and not others is just a poor way to ignore that there is a problem.. If the arrangement of the pixels forms the basis of something you can predict, and the program doesn't output what you predict = a bug

The better tesseract works and the more robust it is, the better for everybody.. I think it's a great program and that's why I want to contribute my experiences to make it even better
