Character confusion fix suggestion #3144

EucliTs0 · 2020-10-30T07:30:57Z

Environment

Tesseract Version: 4.1.1
Platform: 4.15.0-122-generic #124-Ubuntu SMP

Hello,
We use Tesseract heavily in our platform, and the issue we hit most often is duplicated characters. For example, if the image contains the sequence "2032BA065", the output is "2032BA0O65".
This happens with other characters too, for example B -> B8 and 5 -> 5S. After some investigation and debugging, we came up with a fix that corrects all cases, at least in our dataset.

The confusion happens at two consecutive time steps (t, t+1) on the same character. The confidence probabilities of the two candidate characters are very close to each other at time steps t and t+1, whereas for characters that are not confused the confidence is close to 1.0 at each time step. Unfortunately, Tesseract doesn't filter out this kind of duplication between confused characters. To fix this, let P(t) and P(t+1) be the probabilities of the recognized characters at consecutive time steps t and t+1.

D(t+1) = P(t+1) / (P(t) + P(t+1)),
where D(t+1) is the confusion metric: if D(t+1) < threshold, we stop and ignore the confused character.
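
To make the metric concrete, here is a minimal standalone sketch, reading the formula as P(t+1) / (P(t) + P(t+1)), which matches the code in the suggested fix. The function and parameter names are made up for this illustration and are not part of Tesseract. Note that the patch itself reads both values from the current time step's outputs (outputs[prev->code] and outputs[code]).

```cpp
#include <cassert>

// Confusion metric: D = p_curr / (p_prev + p_curr).
// p_prev: probability of the previously recognized character.
// p_curr: probability of the character being considered now.
// Both names are assumptions made for this sketch.
inline float ConfusionRatio(float p_prev, float p_curr) {
  assert(p_prev >= 0.0f && p_curr >= 0.0f && p_prev + p_curr > 0.0f);
  return p_curr / (p_prev + p_curr);
}

// True if the current character should be ignored as a likely duplicate.
// 0.88 is the experimentally chosen threshold from this report.
inline bool IsConfused(float p_prev, float p_curr, float threshold = 0.88f) {
  return ConfusionRatio(p_prev, p_curr) < threshold;
}
```

With two nearly equal probabilities such as 0.5 and 0.45 the ratio is about 0.47, so the character is dropped; with 0.02 and 0.98 the ratio is 0.98 and the character is kept.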

In src/lstm/recodebeam.cpp, between lines 907 and 908, we add:

Suggested Fix:

if (prev != nullptr and code > 0 and code != 139 and prev->code !=139 and prev->code > 0)
      {
        const float sum_proba_prev_current = std::max(outputs[code], outputs[prev->code]) + std::min(outputs[code], outputs[prev->code]);

        const float ratio_scores = outputs[code] / sum_proba_prev_current;
        if (ratio_scores < 0.88f) break;
      }

The threshold 0.88 was set experimentally, but I hope this helps address the issue in future versions and that it generalizes well.

Unfortunately, I cannot provide any documents because we work on sensitive data.

Thank you.

stweil · 2020-10-30T08:32:45Z

Do you want to send a pull request with the suggested fix?

stweil · 2020-10-30T08:37:49Z

Why do you check code > 0 and code != 139?

stweil · 2020-10-30T08:43:33Z

Related issues: #884, #1011, #1060, #1063, #1362, #1465, #2738.

EucliTs0 · 2020-10-30T08:55:40Z

Do you want to send a pull request with the suggested fix?

I could create a PR yes, but the threshold might not be universal

EucliTs0 · 2020-10-30T09:16:10Z

Why do you check code > 0 and code != 139?

Just want to avoid empty space and null char

stweil · 2020-10-30T09:30:21Z

Would code != null_char_ also work instead of code > 0? Where does this magic number 139 (empty space?) come from?

stweil · 2020-10-30T09:32:48Z

I could create a PR yes, but the threshold might not be universal.

Which other values beside 0.88 did you test? Would, for example, 0.75 or 0.9 also work fine?

EucliTs0 · 2020-10-30T09:34:20Z

I could create a PR yes, but the threshold might not be universal.

Which other values beside 0.88 did you test? Would, for example, 0.75 or 0.9 also work fine?

Yes, we tested other values too, from 0.7 to 0.9, and found that 0.88 behaves best.

EucliTs0 · 2020-10-30T10:32:03Z

Would code != null_char_ also work instead of code > 0? Where does this magic number 139 (empty space?) come from?

In our case, code = 0 corresponds to empty (or space). I printed the debug output for part of a string, and we get label=0 between characters:

DECODED CHARACTER LSTM 4: 4, label=63
DECODED CHARACTER LSTM 5:  , label=0
DECODED CHARACTER LSTM 6: A, label=1

The 139 is the null char for us.
Does the null_char_ variable always have the same code mapping?

amitdo · 2020-10-30T19:17:22Z

I believe it will be a different number in other traineddata files.

stweil · 2020-10-30T19:27:31Z

That's why I was asking.
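
Since the null char code can differ between traineddata files, one way to avoid the magic numbers is to pass the codes in. The following is only a sketch of the suggested check as a free function. ShouldSkipAsDuplicate and its parameters are hypothetical names, and treating code 0 as a space is just what was observed with one model.

```cpp
#include <cassert>
#include <vector>

// Hypothetical standalone version of the suggested check.
// null_char and space_code are parameters instead of the hard-coded 139 and 0;
// in Tesseract the null char would come from null_char_ rather than a literal.
// prev_code < 0 stands for "no previous node" in this sketch.
bool ShouldSkipAsDuplicate(const std::vector<float>& outputs, int prev_code,
                           int code, int null_char, int space_code,
                           float threshold = 0.88f) {
  if (prev_code < 0) return false;
  if (code == null_char || prev_code == null_char) return false;
  if (code == space_code || prev_code == space_code) return false;
  const float sum = outputs[code] + outputs[prev_code];
  if (sum <= 0.0f) return false;  // nothing to compare
  return outputs[code] / sum < threshold;
}
```

With this shape, a model whose null char is not 139 only needs a different argument, not a code change.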

stweil · 2020-10-31T17:10:53Z

@EucliTs0, which language(s) / script(s) did you use in your tests? Did you use fast or best traineddata?

I have just run a test on the TIFF files from test/testing using this conditional:

      if (prev != nullptr && code != null_char_ && prev->code != null_char_) {

This fixed several confusions, all similar to this one:

-“I’'ve never forgotten that mo-
+“I've never forgotten that mo-

I would have expected “I’ve never forgotten that mo-.

Internally Tesseract has two preferred choices, with ' ranking less than ’:

    <span class='ocrx_cinfo' id='choice_1_119_13' title='x_confs 75.604965'>’</span>
    <span class='ocrx_cinfo' id='choice_1_119_14' title='x_confs 74.249809'>&#39;</span>

So the new code picked the wrong choice.

amitdo · 2020-10-31T20:23:42Z

https://raw.githubusercontent.com/tesseract-ocr/langdata_lstm/master/eng/eng.wordlist

I've
i've
I'VE
I’ve

amitdo · 2020-10-31T21:00:32Z

https://en.wikipedia.org/wiki/Apostrophe

EucliTs0 · 2020-11-02T08:02:13Z

@EucliTs0, which language(s) / script(s) did you use in your tests? Did you use fast or best traineddata?

We use the best traineddata for French.

EucliTs0 · 2020-11-02T08:04:22Z

https://en.wikipedia.org/wiki/Apostrophe

So, both apostrophes should be considered as OK in tesseract's output, right?

stweil · 2020-11-02T08:18:57Z

' is not wrong, but ’ is better and also detected in other lines without any confusion.

If there is a confusion with two alternatives of similar confidence, I'd normally take the one with higher confidence, even if it is only slightly higher (unless there are other rules like for example a dictionary which suggest to take the second alternative).
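
As an illustration of that rule (keep the alternative with the higher confidence, however small the margin), here is a small sketch. Choice and PickMoreConfident are hypothetical names for this example, not Tesseract types, and dictionary rules are deliberately left out.

```cpp
#include <cassert>
#include <string>

// One recognition alternative with its confidence (hypothetical type).
struct Choice {
  std::string text;
  float confidence;
};

// Returns the alternative with the higher confidence; on a tie the first
// argument is kept.
inline const Choice& PickMoreConfident(const Choice& a, const Choice& b) {
  return b.confidence > a.confidence ? b : a;
}
```

Applied to the two x_confs values quoted earlier (75.604965 for ’ and 74.249809 for '), this would keep ’.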

EucliTs0 · 2020-11-03T10:41:40Z

Just to clarify, the suggested fix removes one confused character, but it is not necessarily the correct one (like the example with the apostrophe).

One question: could you please point me to the exact code block where the null_char_ mapping happens? Thanks.

amitdo · 2020-11-03T15:43:05Z

One question: could you please point me to the exact code block where the null_char_ mapping happens? Thanks.

tesseract/src/lstm/lstmrecognizer.cpp

Line 119 in 5761880

if (!fp->DeSerialize(&null_char_)) return false;

mb0 · 2021-02-22T00:07:41Z

I hope it is OK for me to chime in and point out that this issue has affected many users for some years now. Even if the proposed fix does not choose the best candidate, it is still very much an improvement over the current situation. Could someone experienced in C++ and Tesseract please open a pull request to get the process started and the change reviewed?

TheSeiko · 2021-04-16T19:06:50Z

@stweil related to your question.
"TheSeiko, do you have example images which still show this issue? We need them to test a bug fix which was suggested in #3144".

I've already posted some images to #1060. Now I've collected more images with double characters. I'm posting them below.
I've marked the double characters bold.

All are tested with
C:\Tesseract-OCR20201127>tesseract --version
tesseract v5.0.0-alpha.20201127
leptonica-1.78.0
libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.3) : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0
Found AVX2
Found AVX
Found FMA
Found SSE
Found libarchive 3.3.2 zlib/1.2.11 liblzma/5.2.3 bz2lib/1.0.6 liblz4/1.7.5
Found libcurl/7.59.0 OpenSSL/1.0.2o (WinSSL) zlib/1.2.11 WinIDN libssh2/1.7.0 nghttp2/1.31.0

on Windows 10 64bit

example call:
C://Tesseract-OCR20201127/tesseract D:\var\ocrvideoreader\images\tmp\20210416211033138_1618341916852_bottom.png stdout --dpi 400 --oem 1 --psm 6 -l deu+lat

TheSeiko · 2021-04-16T19:08:05Z

US-Paläontologen haben eine
Tyrannosaurus rex-,Zaáhlung" gemacht.

TheSeiko · 2021-04-16T19:09:43Z

Online-Vortragsreihe

Beginn ist um 9.30 Uhr.
Die Teilnahme ist
kostenlos. Anmeldungen
sind per E-Mail an:
frauenbuero@magq.linz.at
erforderlich.

TheSeiko · 2021-04-16T19:11:12Z

US-Präsident Biden schlug Kremilchef Putin einen
Gipfel zur Deeskalation in einem Drittland vor.

TheSeiko · 2021-04-16T19:14:29Z

Österreich

In einem derzeitigen Gesetzesentwurf
werden Razzien im Behördenbereich beinahe
verunmöjglicht.

Nach einem Treffen mit Experten ist
Justizministerin Zadic bereit, entsprechende
Änderungen am Entwurf vorzunehmen.

TheSeiko · 2021-04-16T19:16:10Z

Shaquille ONeal
Sportskanone auf der
Suche nach neuem
Team!

Unser „Shagq“ ist sehr
menschenbezogen,
intelligent und brav.

TheSeiko · 2021-04-16T19:17:37Z

Service

Im April auf
www.ibkinfo.at:
Innsbruck zu Fuf$ und am
Radl erkunden sowie
Neues zum Rad-
Masterplan.

TheSeiko · 2021-04-16T19:19:07Z

Fußball
OFB-Legionáar Philipp Lienhart trifft beim
2:0-Sieg von Freiburg gegen Augsburg.

TheSeiko · 2021-04-16T19:30:35Z

Politik .
Die SPO kritisiert das ,,|chaotische" Corona-
Management der Regierung scharf.

TheSeiko · 2021-04-16T19:33:06Z

Kurzfilmfestival

Eine hochkarätige Aus-
wahl meist dystopischer
Filme, zusammengestellt
von ProgrammerlInnen
aus Cannes, Locarno,
Sarajevo und mehr.

TheSeiko · 2021-04-16T19:35:32Z

Smartphone
Huawei stellt sein neues Smartphone
PA40 Pro vor.

TheSeiko · 2021-04-16T19:36:52Z

Wien
Die Eröffnung der „MQ Libelle**"^** wird
auf den 25. August verschoben.

TheSeiko · 2021-04-16T19:41:40Z

Ungarn/Üsterreich
Lebenslang für die vier Hauptangeklagten
nach dem A^4-Flüchtlingsdrama.

TheSeiko · 2021-04-16T19:50:41Z

Auf Galaxy S10 folgt S20

Samsung sortiert seine Galaxy-S-Serie
offenbar komplett neu. Das behauptet der
Tech-Blog „SsamMobile“. Demnach wird das
neue Smartphone nicht Galaxy S11, sondern
Galaxy S20 heißen. Womóglich möchte sich
Samsung vom iPhone 11 abgrenzen.

EucliTs0 · 2021-04-17T07:43:11Z

From the results above, the character confusion is not fixed, right? Do you also have cases where it is fixed? Just to mention again, the fix removes the duplicate but does not guarantee that you get the correct character, although most of the time you do.

TheSeiko · 2021-04-23T05:36:50Z

@EucliTs0 I've only extracted images where one character becomes two characters. I didn't keep an exact list of where the output was different before. But yes, there were some images that had two characters before and returned only one with the latest version.

EucliTs0 · 2021-04-23T14:17:57Z

@TheSeiko Perhaps in your case you need to modify the threshold

woodjohndavid · 2021-06-01T21:29:32Z

Hi EucliTs0:

We have been experiencing the same behavior as you, with extra characters showing up in the Tesseract output stream. I am experimenting with the most recent master branch code, and I think the line numbers in the source may be somewhat different from the version you are working with. So could you please do me the favor of providing the name of the method where you put your fix, and attaching the full recodebeam.cpp file so I can find it and try it out myself?

Thanks,

Dave

EucliTs0 · 2021-06-02T07:03:23Z

Hello @woodjohndavid,

We use the latest stable version of Tesseract, 4.1.1 (https://github.com/tesseract-ocr/tesseract/tree/4.1.1). We added this block inside void RecodeBeamSearch::ContinueContext in src/lstm/recodebeam.cpp.

I cannot attach the .cpp file because that file type is not supported here, so I will add it as plain text.

recodebeam.odt

bertsky · 2021-06-02T15:22:38Z

We use the latest stable version of Tesseract, 4.1.1 (https://github.com/tesseract-ocr/tesseract/tree/4.1.1). We added this block inside void RecodeBeamSearch::ContinueContext in src/lstm/recodebeam.cpp.

I cannot attach the .cpp file because that file type is not supported here, so I will add it as plain text.

recodebeam.odt

@EucliTs0 thank you for trying to make Tesseract better!

Since AFAICT no one is working on this long-standing issue, any hint that helps track down the actual cause is welcome. But please use GitHub facilities (or at least a diff/patch) for sharing next time!

Here's your change in a reusable way:

diff --git a/src/lstm/recodebeam.cpp b/src/lstm/recodebeam.cpp
index 1c840569..bb34cd7a 100644
--- a/src/lstm/recodebeam.cpp
+++ b/src/lstm/recodebeam.cpp
@@ -615,6 +615,14 @@ void RecodeBeamSearch::ContinueContext(const RecodeNode* prev, int index,
       if (prev != nullptr && prev->code == code && !is_simple_text_) continue;
       float cert = NetworkIO::ProbToCertainty(outputs[code]) + cert_offset;
       if (cert < kMinCertainty && code != null_char_) continue;
+
+      if (prev != nullptr and code > 0 and code != 139 and prev->code !=139 and prev->code > 0)
+      {
+        const float sum_proba_prev_current = std::max(outputs[code], outputs[prev->code]) + std::min(outputs[code], outputs[prev->code]);
+        const float ratio_scores = outputs[code] / sum_proba_prev_current;
+        if (ratio_scores < 0.88f) break;
+      }
+
       full_code.Set(length, code);
       int unichar_id = recoder_.DecodeUnichar(full_code);
       // Map the null char to INVALID.

I have not tried it yet, but (in addition to @stweil's comments), a few problems stand out:

1. Why do you take max(a, b) + min(a, b)? How is that anything other than an obscure way of writing a + b?
2. Why do you simply break out of the character hypotheses loop, instead of continuing with the remaining valid choices? This could easily hide good hypotheses further along in the charset.
3. Foremost, why do you take the current timestep's probability outputs at the previous timestep's hypothesis prev->code in the beam? That is a totally different thing from what you described above. Your description says you want to relate the probability at step t to that at step t+1, which is clearly not what happens here. (Not that I understand why you wanted to do that. But if what you do here helps even a little, we might get closer to understanding the problem.)
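
The first two points can be shown with a toy example. This is not Tesseract code: SumViaMaxMin and SurvivingCodes are hypothetical helpers, and the vector of flags merely stands in for "this code failed the ratio check" inside the loop over character codes.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// max(a, b) + min(a, b) adds the same two values as a + b.
inline float SumViaMaxMin(float a, float b) {
  return std::max(a, b) + std::min(a, b);
}

// Returns the codes that are still processed when some codes are rejected.
// stop_at_first_reject = true mimics `break` (as in the patch);
// false mimics `continue`, which only skips the rejected code.
std::vector<int> SurvivingCodes(const std::vector<bool>& rejected,
                                bool stop_at_first_reject) {
  std::vector<int> kept;
  for (std::size_t code = 0; code < rejected.size(); ++code) {
    if (rejected[code]) {
      if (stop_at_first_reject) break;  // later codes are never looked at
      continue;                         // later codes still get a chance
    }
    kept.push_back(static_cast<int>(code));
  }
  return kept;
}
```

With codes {0, 1, 2, 3} and only code 1 rejected, the `break` variant keeps just code 0, while the `continue` variant keeps codes 0, 2 and 3.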

woodjohndavid · 2021-06-03T00:14:17Z

Hi EucliTs0:

Thanks for the information. That will help me try out your fix in the context of the latest master version and see how it goes. I will report back on this thread with my results and any suggestions I might come up with.

Regards,

Dave

EucliTs0 · 2021-06-03T06:45:33Z

Hi @bertsky

For your first comment, I think just a + b would be sufficient.
We break out because at that point we have detected a duplication and just want to ignore the duplicated character. That does not necessarily mean we ignore the 'good' or the 'bad' duplicate. If you continue instead, I think you will end up keeping some of these duplications (we tried it and saw that in many cases the issue was not resolved).
For your last comment, we can consider the current outputs as t+1 and the previous ones as t.

woodjohndavid · 2021-06-29T22:06:40Z

Hi EucliTs0:

Please see my latest post here #3477

If you like, you can try the solution I have proposed and see if it works in your situation. I did try out the fix that you have used, but it didn't work consistently in our case. I guess it depends on the specific mix of characters that are encountered.

woodjohndavid · 2024-03-13T22:39:29Z

I have just created pull request #4211 which I consider to be an improved solution for diplopia.

I encourage everyone on this thread to try this out and test it with as broad a range of cases as possible.

Note by the way, there are some new configuration values that can only be set in code as things stand. These configuration values are:

bool kRemoveDiplopia - if true, enables diplopia removal functionality. If false, my changes have no effect
int kMaxDiplopiaGap - maximum number of timesteps apart to be considered diplopia, default 2

Obviously if my diplopia change is of value, then these configuration items should be made into settings.
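
For readers trying to picture what the two values control, here is a rough sketch. Only the names kRemoveDiplopia and kMaxDiplopiaGap and the default of 2 come from the description above; the function IsDiplopiaCandidate and its logic (including the inclusive bound) are guesses for illustration and are not taken from pull request #4211.

```cpp
#include <cassert>
#include <cstdlib>

// Values as described in the comment above; in the pull request they can
// only be set in code.
constexpr bool kRemoveDiplopia = true;
constexpr int kMaxDiplopiaGap = 2;

// Assumed interpretation: two character peaks at timesteps t1 and t2 are
// only considered a possible diplopia pair if the feature is enabled and
// they are at most kMaxDiplopiaGap timesteps apart.
inline bool IsDiplopiaCandidate(int t1, int t2) {
  if (!kRemoveDiplopia) return false;
  return std::abs(t1 - t2) <= kMaxDiplopiaGap;
}
```

Under this reading, peaks at timesteps 10 and 12 would be candidates, while peaks at 10 and 13 would not.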

stweil added the accuracy label Oct 30, 2020

stweil mentioned this issue Nov 9, 2020

German - Characters added to result multiple times (aä / AÄ) #1060

Open

amitdo added the diplopia label Mar 17, 2021

woodjohndavid mentioned this issue Jun 1, 2021

Duplicate Characters in Output Stream #2738

Open
