-
Notifications
You must be signed in to change notification settings - Fork 9.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Noise characters recognized with bbox as the entire page #1192
Comments
When does this will be fixed? |
Comparing best and fast OCRed text While Best
Fast
@zdenop Please label |
@Shreeshrii How can we overcome this issue? |
This is much more than an accuracy error - even for completely accurate words the bounding boxes make no sense at all, whether using TSV or HOCR output. This is pretty big for me. The weird part is - somewhere internally it seems to know the coordinates as the word_ids are in order LTR and follow up. Does anyone have any suggestion on where to start looking? I'm happy to hunt this down with some rusty C skills but I really need a pointer, completely unfamiliar with the tesseract codebase. It happens with the LSTM engine, haven't been able to test with the legacy engine as tesseract won't recognize the retro traineddata file in the tessdata folder. I'll update this post when I get tesseract 4 --oem 0 working. Still occurring on the very latest master build (e4b9cff) Example of the error:
Also goes wrong when printing the character level info:
|
@willaaam
Now the bounding boxes on all levels are consistent, in some cases consistently false. So I would start with a minimal failing image by tracking down where the information from BoundingBox originates and where it goes wrong. Doing that is also on my ToDo list, but unfortunately i havn't found the time yet. And our Codebase "found" a temporary solution that lead to beautiful function names like
|
Thanks, I appreciate the pointer, let's see if I can make some time to track this down, hopefully with a friend of mine. This bug breaks all analytics applications that come after tesseract. And I agree, also crossposted in #1712 so we can nip this one in the bud. |
Quick update - we spent some time on this last night and the bug is definitely at the API level unfortunately. Using the code below we notice that already using BoundingBoxInternal (code below was snapshotted an hour earlier and still using BoundingBox - but same results) we get whole-page coordinates for the boxes. Inside the BoundingBoxInternal structure, at least for our sample code, cblob_it is always null, so thats where we are going to resume the hunt and check out the BlobBox.
Sample API test code below:
|
Hey there, Within this function the recognizer Now there seem to be two possibilites
Considering, that
I would hope, that this is a +/- 1 error on some pointer in the LSTM bounding box / blob index post processing. Next time i will continue tracking the issue within Update: I'm still closing in on this, reached |
Don't know if the following is of any help... If you would comment out the following lines in ccstruct/pageres.cpp on lines 1311-1313 You would end up with a lot more lines where the bounding box is equal to the entire page. Maybe the previous if-statement >> if (!blob_it.at_first() || next_word_blobs != nullptr) << does not cover all applicable cases? Update |
Hi FrkBo, it is true, that these are two different Bugs. I'm primarily working on #1712 . Nonetheless you are right, since it is a different issue i will keep on posting in #1712 . |
Did coincidently find a 'hacky solution' to your issue... In the function ComputeBlobEnds change the line Other test output seems much better, but still not perfect. |
@FrkBo Please submit a PR if you have a proposed solution. current code is:
|
I confirmed the problem with current GitHub code on the "thousand Billion" example. However, the copy of Tesseract inside Google does not have the problem. I can think of two possibilities. One is the GitHub copy had a regression; to check this try building an older version of GitHub LSTM Tesseract and see if has the same trouble. The other possibility is some bugfix from Google did not make it to GitHub. I took a look and have found a couple candidates. Two are listed below, and are good to incorporate into GitHub Tesseract no matter what. If none of this helps, I will spend more time investigating. |
Fix signed integer overflow. If the left isn't left of the right, make the width -1. Caught with ASAN. --- tesseract/textord/colpartitiongrid.cpp 2017-07-14 07:32:13.000000000 -0700
+++ tesseract/textord/colpartitiongrid.cpp 2017-11-27 11:17:31.000000000 -0800
@@ -1254,7 +1254,7 @@
const TBOX& box = part->bounding_box();
int left = part->median_left();
int right = part->median_right();
- int width = right - left;
+ int width = right >= left ? right - left : -1;
int mid_x = (left + right) / 2;
ColPartitionGridSearch hsearch(this);
// Search left for neighbour to_the_left |
Setting line_size to 1, if a part has median_height/median_weight = 0. UBSan reports undefined behavior when inf is being cast into int. The expression has denominator equal to zero. --- tesseract/textord/colpartitiongrid.cpp 2017-11-27 11:17:31.000000000 -0800
+++ tesseract/textord/colpartitiongrid.cpp 2018-01-26 13:23:35.000000000 -0800
@@ -722,6 +722,7 @@
to_block->line_spacing = static_cast<float>(box.height());
to_block->max_blob_size = static_cast<float>(box.height() + 1);
}
+ if (to_block->line_size == 0) to_block->line_size = 1;
block_it.add_to_end(block);
to_block_it.add_to_end(to_block);
} else { |
done.
Zdenko
po 17. 9. 2018 o 19:38 Shreeshrii <[email protected]> napísal(a):
… @jbreiden <https://github.com/jbreiden> Thank you for finding these bug
fixes.
@zdenop <https://github.com/zdenop> can you please create a PR and commit
these patches?
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#1192 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AAjCzPwngGjZLEN1N_X7RG9ERBpfswDFks5ub94ogaJpZM4QKZi2>
.
|
Hi Everyone, The code probably improved something, but it didn't resolve the issue I described unfortunately. I just git pulled and built the latest master. Used the ita model from tessdata repo. console command: git version: Sample picture included below: And the output: @jbreiden if you pull this picture through the google version of the repo, do you see the same page-sized bounding boxes? |
The fixes were in textord which handles the layout analysis. @jbreiden, please search for fixes in |
@zdenop, it looks like this issue needs to be reopened. |
@Shreeshrii: I don't know whether the 'fix' only solves a symptom or also fixes the root cause. The fixes provided by jbreiden didn't solve the issue of bounding boxes spanning the whole page in the example I provided earlier. Update It would seem that some lines are not being detected as being underlined. So the blob is not being detected as underlined and consequently the blob is the entire sentence. However the given words to ReplaceCurrentWord are split correctly. This would explain why the while statement is not triggered. All examples provided in this thread would seem to corroborate that the issue is related to underlined sentences. If I force the underline test in makerow.cpp (.. if (test_underline() ...) to be always true the bounding boxes for the words in my example would seem to be calculated correctly. is this a step in the right direction? |
Would it be possible to share "google tesseract" in order to track down the root cause at the date of the last merge (2017-08-03) since there arguably the relevant diffs are most likely the smallest. I'm willing to sign an NDA if that's what it takes. |
So far, I have no evidence to suggest GitHub Tesseract has ever worked correctly on this data. An NDA is too much paperwork. Let me instead suggest a "pair debugging" session where we can look together at intermediate data structures, and try to track down where things diverge. |
Good stuff @jbreiden - let's set something up. I'm not too knowledgeable on C++ (unless you count borland stuff ages ago) but it might make sense to invite @Sintun as well, he's been working on this for quite a while. My buddy who helped me trace this bug in the first place would be happy to join as well, he's quite well versed in C++. |
Hi @jbreiden @willaaam , PS: I'm only speaking about word positions, character positions are off for all models. |
It certainly a bug in the cases where you get complete garbage bboxes. |
Or when the TSV output is really a mess... While investigating this issue I looked into an example with Japanese characters (from another issue on this forum) and adapted, sliced & diced the image to determine whether there was one character that might mess up the output (to a certain extent to test the same hypothesis as Sintun whether the model might be the issue). The results were surprising! TSV output A bit odd to see how multiple characters can be recognized as one character. I compiled a fresh version of Github Tesseract to make sure I wasn't the culprit. On another note, while looking through recodebeam.cpp the following caught my attention: The variable min_half_width implies it should be a half width of the character but is never halved. And when the bounding box is determined the width ends up as 2 widths of the character. Might be intentional, but seems weird from a distance. Maybe the variable needs to be renamed. If you would divide min_half_width by 2 it would prevent some bounding boxes spanning the whole page but the coordinates are still off. |
Hello again :) The bounding boxes spanning the whole page originate from
The reason is, that
Now I looked at the cblobs generated in lstmrecognizer (they are not empty at that point) and followed them through the code to the point where we loose the cblobs. at least the code (in ccmain/control.cpp line ~1400)
gives the output
|
Bounding boxes for words in ocropy: |
Jeff and I spent some time last week looking at this problem. |
Ray and Jeff, |
I think we're still describing two cases here:
On the second: What I don't get is in FrkBo's example the "thousand" bbox goes well into "Billion". While the beam search is by definition not fully accurate, we know the position of the search and it did find Billion, so assuming that B got recognized properly, the bounding box for thousand should never be able to move in to the space of Billion. On the first issue: Regardless - I'm guessing the fix you're working on ray will help with word level accuracy but doesnt fix the whole page bounding boxes. |
Ray and I have been focusing on the "thousand Billion" image because it is so small therefore easier to debug. Not much thought spent on the full page bounding box report. Please note that I managed to fool myself during some earlier investigation. Contrary to what I said earlier, the problem was present in Google's copy of Tesseract. |
@FrkBo This issue represents (at least) two unrelated one with the same result - wrong coordinates. The first one described by @theraysmith / @jbreiden and in #2024 above is because of the LSTM models returning x-coordinates that are more or less offset. The second one is simply because the cblobs representing individual letters (which are used to get the final bbox) are missing because of noise below like an underline. |
From Ray. Feedback welcome. Fix github issue 1192 (as best as it can be without retraining). Blob bounding boxes are clipped to the --- tesseract/ccstruct/pageres.cpp 2016-12-13 16:56:14.000000000 -0800
+++ tesseract/ccstruct/pageres.cpp 2018-11-09 10:28:33.000000000 -0800
@@ -21,13 +21,14 @@
** limitations under the License.
*
**********************************************************************/
-#include <stdlib.h>
+#include "pageres.h"
+#include <stdlib.h>
#ifdef __UNIX__
-#include <assert.h>
+#include <assert.h>
#endif
-#include "blamer.h"
-#include "pageres.h"
-#include "blobs.h"
+#include "blamer.h"
+#include "blobs.h"
+#include "helpers.h"
ELISTIZE (BLOCK_RES)
CLISTIZE (BLOCK_RES) ELISTIZE (ROW_RES) ELISTIZE (WERD_RES)
@@ -1293,7 +1294,8 @@
// Helper computes the boundaries between blobs in the word. The blob bounds
// are likely very poor, if they come from LSTM, where it only outputs the
// character at one pixel within it, so we find the midpoints between them.
-static void ComputeBlobEnds(const WERD_RES& word, C_BLOB_LIST* next_word_blobs,
+static void ComputeBlobEnds(const WERD_RES& word, const TBOX& clip_box,
+ C_BLOB_LIST* next_word_blobs,
GenericVector<int>* blob_ends) {
C_BLOB_IT blob_it(word.word->cblob_list());
for (int i = 0; i < word.best_state.size(); ++i) {
@@ -1313,8 +1315,74 @@
blob_it.set_to_list(next_word_blobs);
blob_end = (blob_box.right() + blob_it.data()->bounding_box().left()) / 2;
}
+ blob_end = ClipToRange<int>(blob_end, clip_box.left(), clip_box.right());
blob_ends->push_back(blob_end);
}
+ blob_ends->back() = clip_box.right();
+}
+
+// Helper computes the bounds of a word by restricting it to existing words
+// that significantly overlap.
+static TBOX ComputeWordBounds(const tesseract::PointerVector<WERD_RES>& words,
+ int w_index, TBOX prev_box, WERD_RES_IT w_it) {
+ constexpr int kSignificantOverlapFraction = 4;
+ TBOX clipped_box;
+ TBOX current_box = words[w_index]->word->bounding_box();
+ TBOX next_box;
+ if (w_index + 1 < words.size() && words[w_index + 1] != nullptr &&
+ words[w_index + 1]->word != nullptr)
+ next_box = words[w_index + 1]->word->bounding_box();
+ for (w_it.forward(); !w_it.at_first() && w_it.data()->part_of_combo;
+ w_it.forward()) {
+ if (w_it.data() == nullptr || w_it.data()->word == nullptr) continue;
+ TBOX w_box = w_it.data()->word->bounding_box();
+ int height_limit = std::min<int>(w_box.height(), w_box.width() / 2);
+ int width_limit = w_box.width() / kSignificantOverlapFraction;
+ int min_significant_overlap = std::max(height_limit, width_limit);
+ int overlap = w_box.intersection(current_box).width();
+ int prev_overlap = w_box.intersection(prev_box).width();
+ int next_overlap = w_box.intersection(next_box).width();
+ if (overlap > min_significant_overlap) {
+ if (prev_overlap > min_significant_overlap) {
+ // We have no choice but to use the LSTM word edge.
+ clipped_box.set_left(current_box.left());
+ } else if (next_overlap > min_significant_overlap) {
+ // We have no choice but to use the LSTM word edge.
+ clipped_box.set_right(current_box.right());
+ } else {
+ clipped_box += w_box;
+ }
+ }
+ }
+ if (clipped_box.height() <= 0) {
+ clipped_box.set_top(current_box.top());
+ clipped_box.set_bottom(current_box.bottom());
+ }
+ if (clipped_box.width() <= 0) clipped_box = current_box;
+ return clipped_box;
+}
+
+// Helper moves the blob from src to dest. If it isn't contained by clip_box,
+// the blob is replaced by a fake that is contained.
+static TBOX MoveAndClipBlob(C_BLOB_IT* src_it, C_BLOB_IT* dest_it,
+ const TBOX& clip_box) {
+ C_BLOB* src_blob = src_it->extract();
+ TBOX box = src_blob->bounding_box();
+ if (!clip_box.contains(box)) {
+ int left =
+ ClipToRange<int>(box.left(), clip_box.left(), clip_box.right() - 1);
+ int right =
+ ClipToRange<int>(box.right(), clip_box.left() + 1, clip_box.right());
+ int top =
+ ClipToRange<int>(box.top(), clip_box.bottom() + 1, clip_box.top());
+ int bottom =
+ ClipToRange<int>(box.bottom(), clip_box.bottom(), clip_box.top() - 1);
+ box = TBOX(left, bottom, right, top);
+ delete src_blob;
+ src_blob = C_BLOB::FakeBlob(box);
+ }
+ dest_it->add_after_then_move(src_blob);
+ return box;
}
// Replaces the current WERD/WERD_RES with the given words. The given words
@@ -1365,66 +1433,45 @@
src_b_it.sort(&C_BLOB::SortByXMiddle);
C_BLOB_IT rej_b_it(input_word->word->rej_cblob_list());
rej_b_it.sort(&C_BLOB::SortByXMiddle);
+ TBOX clip_box;
for (int w = 0; w < words->size(); ++w) {
WERD_RES* word_w = (*words)[w];
+ clip_box = ComputeWordBounds(*words, w, clip_box, wr_it_of_current_word);
// Compute blob boundaries.
GenericVector<int> blob_ends;
C_BLOB_LIST* next_word_blobs =
w + 1 < words->size() ? (*words)[w + 1]->word->cblob_list() : NULL;
- ComputeBlobEnds(*word_w, next_word_blobs, &blob_ends);
- // Delete the fake blobs on the current word.
+ ComputeBlobEnds(*word_w, clip_box, next_word_blobs, &blob_ends);
+ // Remove the fake blobs on the current word, but keep safe for back-up if
+ // no blob can be found.
+ C_BLOB_LIST fake_blobs;
+ C_BLOB_IT fake_b_it(&fake_blobs);
+ fake_b_it.add_list_after(word_w->word->cblob_list());
+ fake_b_it.move_to_first();
word_w->word->cblob_list()->clear();
C_BLOB_IT dest_it(word_w->word->cblob_list());
// Build the box word as we move the blobs.
tesseract::BoxWord* box_word = new tesseract::BoxWord;
- for (int i = 0; i < blob_ends.size(); ++i) {
+ for (int i = 0; i < blob_ends.size(); ++i, fake_b_it.forward()) {
int end_x = blob_ends[i];
TBOX blob_box;
// Add the blobs up to end_x.
while (!src_b_it.empty() &&
src_b_it.data()->bounding_box().x_middle() < end_x) {
- blob_box += src_b_it.data()->bounding_box();
- dest_it.add_after_then_move(src_b_it.extract());
+ blob_box += MoveAndClipBlob(&src_b_it, &dest_it, clip_box);
src_b_it.forward();
}
while (!rej_b_it.empty() &&
rej_b_it.data()->bounding_box().x_middle() < end_x) {
- blob_box += rej_b_it.data()->bounding_box();
- dest_it.add_after_then_move(rej_b_it.extract());
+ blob_box += MoveAndClipBlob(&rej_b_it, &dest_it, clip_box);
rej_b_it.forward();
}
- // Clip to the previously computed bounds. Although imperfectly accurate,
- // it is good enough, and much more complicated to determine where else
- // to clip.
- if (i > 0 && blob_box.left() < blob_ends[i - 1])
- blob_box.set_left(blob_ends[i - 1]);
- if (blob_box.right() > end_x)
- blob_box.set_right(end_x);
+ if (blob_box.null_box()) {
+ // Use the original box as a back-up.
+ blob_box = MoveAndClipBlob(&fake_b_it, &dest_it, clip_box);
+ }
box_word->InsertBox(i, blob_box);
}
- // Fix empty boxes. If a very joined blob sits over multiple characters,
- // then we will have some empty boxes from using the middle, so look for
- // overlaps.
- for (int i = 0; i < box_word->length(); ++i) {
- TBOX box = box_word->BlobBox(i);
- if (box.null_box()) {
- // Nothing has its middle in the bounds of this blob, so use anything
- // that overlaps.
- for (dest_it.mark_cycle_pt(); !dest_it.cycled_list();
- dest_it.forward()) {
- TBOX blob_box = dest_it.data()->bounding_box();
- if (blob_box.left() < blob_ends[i] &&
- (i == 0 || blob_box.right() >= blob_ends[i - 1])) {
- if (i > 0 && blob_box.left() < blob_ends[i - 1])
- blob_box.set_left(blob_ends[i - 1]);
- if (blob_box.right() > blob_ends[i])
- blob_box.set_right(blob_ends[i]);
- box_word->ChangeBox(i, blob_box);
- break;
- }
- }
- }
- }
delete word_w->box_word;
word_w->box_word = box_word;
if (!input_word->combination) {
@@ -1545,6 +1592,7 @@
}
}
ASSERT_HOST(!word_res_it.cycled_list());
+ wr_it_of_next_word = word_res_it;
word_res_it.forward();
} else {
// word_res_it is OK, but reset word_res and prev_word_res if needed.
@@ -1582,6 +1630,7 @@
block_res = next_block_res;
row_res = next_row_res;
word_res = next_word_res;
+ wr_it_of_current_word = wr_it_of_next_word;
next_block_res = NULL;
next_row_res = NULL;
next_word_res = NULL;
@@ -1610,6 +1659,7 @@
next_block_res = block_res_it.data();
next_row_res = row_res_it.data();
next_word_res = word_res_it.data();
+ wr_it_of_next_word = word_res_it;
word_res_it.forward();
goto foundword;
}
--- tesseract/ccstruct/pageres.h 2016-11-07 07:44:03.000000000 -0800
+++ tesseract/ccstruct/pageres.h 2018-11-09 10:28:33.000000000 -0800
@@ -772,5 +772,9 @@
BLOCK_RES_IT block_res_it; // iterators
ROW_RES_IT row_res_it;
WERD_RES_IT word_res_it;
+ // Iterators used to get the state of word_res_it for the current word.
+ // Since word_res_it is 2 words further on, this is otherwise hard to do.
+ WERD_RES_IT wr_it_of_current_word;
+ WERD_RES_IT wr_it_of_next_word;
};
#endif |
@zdenop please create a PR with the changes suggested by Ray. |
@jbreiden, can you create a Git patch using |
commited. |
@jbreiden @theraysmith will new (retraining) trainneddata be created/published? Where are already some report e.g. for missing characters... |
This reverts commit ce88adb.
I'd like to contribute an example where an ordinary glyph (the a on the second line) gets the entire page as boundary box: It seems to have been fixed in master. Here's the relevant part of the HOCR diff between 4.0.0 and b498067: - <span class='ocrx_word' id='word_1_10' title='bbox 0 0 1354 241; x_wconf 96'>a</span>
- <span class='ocrx_word' id='word_1_11' title='bbox 426 94 566 138; x_wconf 96'>still</span>
+ <span class='ocrx_word' id='word_1_10' title='bbox 396 89 435 147; x_wconf 96'>a</span>
+ <span class='ocrx_word' id='word_1_11' title='bbox 471 94 566 138; x_wconf 96'>still</span> |
This is a workaround for an awful bug in tesseract: tesseract-ocr/tesseract#1192
Environment
Current Behavior:
Line 1, unexpected '__' recognized between 1941 and Ritter, with bbox as the entire page.
Corresponding HOCR line:
GS 1 2,261,002 Oct. 28,1941 __ Ritter 760 $FO
Expected Behavior:
'__' is not supposed to be recognized in the first place. If the false positive recognition is inevitable, the bbox information should be accurate.
Suggested Fix:
n/a
The text was updated successfully, but these errors were encountered: