Feature Request: Add Bounding Box info to LSTM choices #2580

Open
Shreeshrii opened this issue Jul 18, 2019 · 16 comments

@Shreeshrii
Collaborator

Newly made enhancements by @noahmetzger provide accurate bounding box info at the character level as well as the LSTM choices for the character.

#2554
#2576

An earlier commit by Nick White had added the option to include character bounding boxes in hocr output.

06b7a7b

Currently it is possible to get HOCR output with both options as follows:

choices

 tesseract -l eng --psm 6 --dpi 300 -c lstm_choice_mode=4 -c lstm_choice_amount=0  -c hocr_char_boxes=1  --tessdata-dir ~/tessdata_best choices.png choices40 hocr

The output is as follows:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
 <head>
  <title></title>
  <meta http-equiv="Content-Type" content="text/html;charset=utf-8"/>
  <meta name='ocr-system' content='tesseract 5.0.0-alpha-315-g5a47' />
  <meta name='ocr-capabilities' content='ocr_page ocr_carea ocr_par ocr_line ocrx_word ocrp_wconf'/>
 </head>
 <body>
  <div class='ocr_page' id='page_1' title='image "choices.png"; bbox 0 0 293 90; ppageno 0'>
   <div class='ocr_carea' id='block_1_1' title="bbox 16 18 270 71">
    <p class='ocr_par' id='par_1_1' lang='eng' title="bbox 16 18 270 71">
     <span class='ocr_line' id='line_1_1' title="bbox 16 18 270 71; baseline -0.012 0; x_size 68.5; x_descenders 17.125; x_ascenders 17.125">
      <span class='ocrx_word' id='word_1_1' title='bbox 16 18 206 71; x_wconf 42'>
             <span class='ocrx_cinfo' title='x_bboxes 16 19 42 71; x_conf 99.041275'>B</span>
             <span class='ocrx_cinfo' title='x_bboxes 49 20 76 71; x_conf 99.038635'>A</span>
             <span class='ocrx_cinfo' title='x_bboxes 84 19 107 70; x_conf 98.950821'>S</span>
             <span class='ocrx_cinfo' title='x_bboxes 117 19 139 69; x_conf 91.848969'>O</span>
             <span class='ocrx_cinfo' title='x_bboxes 148 19 174 70; x_conf 99.027092'>B</span>
             <span class='ocrx_cinfo' title='x_bboxes 181 18 206 69; x_conf 98.989304'>C</span>
       <span class='ocrx_cinfo' id='lstm_choices_1_1_1'><span class='ocr_glyph' id='choice_1_1_1' title='x_confs 94.258354'>B</span></span>
       <span class='ocrx_cinfo' id='lstm_choices_1_1_2'><span class='ocr_glyph' id='choice_1_1_2' title='x_confs 95.207481'>A</span></span>
       <span class='ocrx_cinfo' id='lstm_choices_1_1_3'><span class='ocr_glyph' id='choice_1_1_3' title='x_confs 95.032639'>S</span></span>
       <span class='ocrx_cinfo' id='lstm_choices_1_1_4'><span class='ocr_glyph' id='choice_1_1_4' title='x_confs 86.357178'>O</span></span>
       <span class='ocrx_cinfo' id='lstm_choices_1_1_5'><span class='ocr_glyph' id='choice_1_1_5' title='x_confs 95.195663'>B</span></span>
       <span class='ocrx_cinfo' id='lstm_choices_1_1_6'><span class='ocr_glyph' id='choice_1_1_6' title='x_confs 92.210503'>C</span></span></span>
      <span class='ocrx_word' id='word_1_2' title='bbox 242 18 270 68; x_wconf 88'>
             <span class='ocrx_cinfo' title='x_bboxes 242 18 270 68; x_conf 98.37661'>6</span>
       <span class='ocrx_cinfo' id='lstm_choices_1_2_1'><span class='ocr_glyph' id='choice_1_2_1' title='x_confs 88.696487'> </span></span>
       <span class='ocrx_cinfo' id='lstm_choices_1_2_2'><span class='ocr_glyph' id='choice_1_2_2' title='x_confs 91.191261'>6</span></span></span>
     </span>
    </p>
   </div>
  </div>
 </body>
</html>

This feature request is for combining the output from both options with accurate bounding boxes and confidence values at character level.

@stweil
Member

stweil commented Jul 18, 2019

The formatting can also be improved. @noahmetzger, is the calculation of the confidence values correct? See my code comment.

@noahmetzger
Contributor

Yes, the calculated confidence values differ slightly from the original ones, because my values use the new bounding boxes to evaluate the confidence while the old algorithm is based on the old character boundaries.

@stweil
Member

stweil commented Jul 18, 2019

@kba, I just tried to find the right form from the hOCR spec.

Why does the example for x_bboxes have 8 parameters there? Shouldn't there be exactly 4 parameters?

And what would be the recommended form for character choices with bounding boxes?

@kba

kba commented Jul 18, 2019

> Why does the example for x_bboxes have 8 parameters there? Shouldn't there be exactly 4 parameters?

The example is for two characters.
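
As a rough illustration of that reading, an `x_bboxes` value that covers several characters can be split into groups of four coordinates. This is only a sketch: the function name `split_x_bboxes` and the 8-value sample title are made up for the example (the sample reuses two boxes from the output above), and it assumes integer coordinates.

```python
def split_x_bboxes(title):
    """Split an hOCR x_bboxes property into one (x0, y0, x1, y1) tuple per character."""
    for prop in title.split(";"):
        parts = prop.split()
        if parts and parts[0] == "x_bboxes":
            nums = [int(p) for p in parts[1:]]
            if len(nums) % 4 != 0:
                raise ValueError("x_bboxes needs a multiple of 4 values")
            # Every consecutive group of four values is one character box.
            return [tuple(nums[i:i + 4]) for i in range(0, len(nums), 4)]
    return []  # no x_bboxes property in this title


# Hypothetical title holding the boxes of two characters (8 values).
boxes = split_x_bboxes("x_bboxes 16 19 42 71 49 20 76 71; x_conf 99.04")
print(boxes)  # [(16, 19, 42, 71), (49, 20, 76, 71)]
```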

> And what would be the recommended form for character choices with bounding boxes?

http://kba.cloud/hocr-spec/1.2/#segmentation would seem the most appropriate mechanism, using <ins> and <del> in a <span class="alternatives">. However, I have not come across that construct in the wild.

@stweil
Member

stweil commented Jul 18, 2019

So the first word from the image above could be encoded like this?

  <span class='ocrx_word' id='word_1_1' title='bbox 16 18 206 71; x_wconf 90'>
   <span class='ocrx_cinfo' title='x_bboxes 16 19 42 71; x_conf 99.040695'>B</span>
   <span class='ocrx_cinfo' title='x_bboxes 49 20 76 71; x_conf 99.036415'>A</span>
   <span class='ocrx_cinfo' title='x_bboxes 84 19 107 70; x_conf 99.001602'>S</span>
   <span class='ocrx_cinfo' title='x_bboxes 117 19 139 69; x_conf 98.663239'>O</span>
   <span class='ocrx_cinfo' title='x_bboxes 148 19 174 70; x_conf 99.032227'>B</span>
   <span class='ocrx_cinfo' title='x_bboxes 181 18 206 69; x_conf 98.988968'>C</span>
   <span class='alternatives'>
    <ins class='alt' id='choice_1_1_1' title='nlp 0.011549211; x_confs 98.851723'>B</ins>
    <del class='alt' id='choice_1_1_2' title='nlp 0.3656525; x_confs 69.374382'>R</del>
   </span>
   <span class='alternatives'>
    <ins class='alt' id='choice_1_1_3' title='nlp 0.0096266214; x_confs 99.041954'>A</ins>
    <del class='alt' id='choice_1_1_4' title='nlp 0.38926715; x_confs 67.755325'>a</del>
   </span>
   <span class='alternatives'>
    <ins class='alt' id='choice_1_1_5' title='nlp 0.0097376015; x_confs 99.030968'>S</ins>
    <del class='alt' id='choice_1_1_6' title='nlp 0.244562; x_confs 78.304741'>5</del>
   </span>
   <span class='alternatives'>
    <ins class='alt' id='choice_1_1_7' title='nlp 0.0167614; x_confs 98.33783'>O</ins>
    <del class='alt' id='choice_1_1_8' title='nlp 0.17149627; x_confs 84.240341'>0</del>
   </span>
   <span class='alternatives'>
    <ins class='alt' id='choice_1_1_9' title='nlp 0.013082595; x_confs 98.700264'>B</ins>
   </span>
   <span class='alternatives'>
    <ins class='alt' id='choice_1_1_10' title='nlp 0.0096449163; x_confs 99.040146'>C</ins>
    <del class='alt' id='choice_1_1_11' title='nlp 0.32846478; x_confs 72.002831'>(</del>
   </span>
  </span>

I added the required nlp property to the alternatives and wonder how many digits are reasonable for float values in hOCR. Maybe the spec should suggest 4 or 5 digits.
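
For illustration, the `nlp` values in the example above are consistent with the negative natural logarithm of the confidence expressed as a probability. A minimal sketch of that conversion with a limited number of digits follows; the function name `conf_to_nlp` and the default of 4 digits are assumptions for this example, not something the spec or Tesseract prescribes.

```python
import math


def conf_to_nlp(conf_percent, digits=4):
    """Convert a confidence in percent to a negative log probability (nlp)."""
    if not 0.0 < conf_percent <= 100.0:
        raise ValueError("confidence must be in (0, 100]")
    # Adding 0.0 turns a possible -0.0 (for 100 %) into 0.0.
    return round(-math.log(conf_percent / 100.0), digits) + 0.0


print(conf_to_nlp(98.851723))  # 0.0115
print(conf_to_nlp(69.374382))  # 0.3657
```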

@noahmetzger, I just noticed that the characters for the alternatives still need to be escaped (HOcrEscape). How can I get the bounding boxes of the different alternatives, or are they all identical?
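
To show why the escaping matters for alternatives such as `<` or `&`, here is a small sketch that uses Python's `html.escape` as a stand-in. It is not Tesseract's `HOcrEscape`, and the exact entity chosen for quote characters may differ from what Tesseract emits; the helper name `escape_choice` is made up for the example.

```python
from html import escape


def escape_choice(text):
    """Escape a recognized character so it can be embedded in hOCR markup."""
    return escape(text, quote=True)


print(escape_choice("<"))  # &lt;
print(escape_choice("&"))  # &amp;
print(escape_choice("B"))  # B
```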

@Shreeshrii
Collaborator Author

The example with nlp in https://github.com/kba/hocr-spec/blob/master/1.2/spec.md uses only a single digit after the decimal point.

<span class="ocr_cinfo" title="bbox 0 0 300 100; nlp 1.7 2.3 3.9 2.7; cuts 9 11 7,8,-2 15 3">hello</span>

@stweil
Member

stweil commented Jul 19, 2019

If I am not mistaken, 1.7 would be a recognition probability of 18 %. That's not really good. The other values are even worse, so that seems to be a bad example.

@Shreeshrii
Collaborator Author

@stweil Shouldn't it be formatted as below, considering that the alternatives are for each character?
Then the ins class indicates the character that was chosen and the del class indicates the choices that were discarded.

<span class='ocrx_cinfo' title='x_bboxes 16 19 42 71; x_conf 99.040695'>
   <span class='alternatives'>
    <ins class='alt' id='choice_1_1_1' title='nlp 0.011549211; x_confs 98.851723'>B</ins>
    <del class='alt' id='choice_1_1_2' title='nlp 0.3656525; x_confs 69.374382'>R</del>
   </span>
</span>
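
A sketch of how the nested form above could be generated, purely to make the proposal concrete. The function `render_char` and its argument layout are hypothetical, the `id` attributes are left out for brevity, and the first entry in `choices` is assumed to be the chosen character.

```python
from html import escape


def render_char(bbox, x_conf, choices):
    """Render one character with its alternatives nested inside ocrx_cinfo.

    choices is a list of (text, nlp, conf) tuples; the first entry is
    written as <ins> (chosen), all others as <del> (discarded).
    """
    bbox_text = " ".join(str(v) for v in bbox)
    out = [
        f"<span class='ocrx_cinfo' title='x_bboxes {bbox_text}; x_conf {x_conf}'>",
        " <span class='alternatives'>",
    ]
    for i, (text, nlp, conf) in enumerate(choices):
        tag = "ins" if i == 0 else "del"
        out.append(
            f"  <{tag} class='alt' title='nlp {nlp}; x_confs {conf}'>"
            f"{escape(text)}</{tag}>"
        )
    out.append(" </span>")
    out.append("</span>")
    return "\n".join(out)


print(render_char((16, 19, 42, 71), 99.040695,
                  [("B", 0.011549211, 98.851723), ("R", 0.3656525, 69.374382)]))
```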

@Shreeshrii
Collaborator Author

> If I am not mistaken, 1.7 would be a recognition probability of 18 %. That's not really good. The other values are even worse, so that seems to be a bad example.

http://kba.cloud/hocr-spec/1.2/#segmentation has a different (obsolete?) example.

<span class="alternatives">
<ins class="alt" title="nlp 0.3">hello</ins>
<del class="alt" title="nlp 1.1">hallo</del>
</span>

@stweil
Member

stweil commented Jul 23, 2019

nlp 0.3 would be 74 % (exp(-0.3) * 100), nlp 1.1 would be 33 %.
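
The conversion used here can be checked with a few lines of Python. This is just a sketch of the formula `exp(-nlp) * 100`; the helper name `nlp_to_percent` is made up for the example.

```python
import math


def nlp_to_percent(nlp):
    """Convert a negative log probability to a recognition probability in percent."""
    return math.exp(-nlp) * 100.0


# Values taken from the spec examples discussed in this thread.
for nlp in (0.3, 1.1, 1.7):
    print(f"nlp {nlp} -> {nlp_to_percent(nlp):.0f} %")
# nlp 0.3 -> 74 %
# nlp 1.1 -> 33 %
# nlp 1.7 -> 18 %
```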

@Shreeshrii
Collaborator Author

@noahmetzger Does your new code regarding choices implement this?

@noahmetzger
Contributor

@Shreeshrii Yes, the new code should now be compatible with hocr_char_boxes.

@Shreeshrii
Collaborator Author

Thanks. I tried it now using lstm_choice_iterations instead of lstm_choice_amount. The output is better formatted than before, but the confidence levels in both are still different.

tesseract -l eng --psm 6 --dpi 300 -c lstm_choice_mode=2 -c lstm_choice_iterations=0  -c hocr_char_boxes=1  --tessdata-dir ~/tessdata_best choices.png -  hocr
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
 <head>
  <title></title>
  <meta http-equiv="Content-Type" content="text/html;charset=utf-8"/>
  <meta name='ocr-system' content='tesseract 5.0.0-alpha-381-g4257' />
  <meta name='ocr-capabilities' content='ocr_page ocr_carea ocr_par ocr_line ocrx_word ocrp_wconf'/>
 </head>
 <body>
  <div class='ocr_page' id='page_1' title='image "choices.png"; bbox 0 0 293 90; ppageno 0'>
   <div class='ocr_carea' id='block_1_1' title="bbox 16 18 270 71">
    <p class='ocr_par' id='par_1_1' lang='eng' title="bbox 16 18 270 71">
     <span class='ocr_line' id='line_1_1' title="bbox 16 18 270 71; baseline -0.012 0; x_size 68.5; x_descenders 17.125; x_ascenders 17.125">
      <span class='ocrx_word' id='word_1_1' title='bbox 16 18 206 71; x_wconf 42'>
       <span class='ocrx_cinfo' title='x_bboxes 16 19 42 71; x_conf 99.041275'>B</span>
        <span class='ocrx_cinfo' id='lstm_choices_1_1_1'>
         <span class='ocrx_cinfo' id='choice_1_1_1' title='x_confs 94.258354'>B</span>
        </span>
       <span class='ocrx_cinfo' title='x_bboxes 49 20 76 71; x_conf 99.038635'>A</span>
        <span class='ocrx_cinfo' id='lstm_choices_1_1_2'>
         <span class='ocrx_cinfo' id='choice_1_1_2' title='x_confs 95.207481'>A</span>
        </span>
       <span class='ocrx_cinfo' title='x_bboxes 84 19 107 70; x_conf 98.950821'>S</span>
        <span class='ocrx_cinfo' id='lstm_choices_1_1_3'>
         <span class='ocrx_cinfo' id='choice_1_1_3' title='x_confs 95.032639'>S</span>
        </span>
       <span class='ocrx_cinfo' title='x_bboxes 117 19 139 69; x_conf 91.848969'>O</span>
        <span class='ocrx_cinfo' id='lstm_choices_1_1_4'>
         <span class='ocrx_cinfo' id='choice_1_1_4' title='x_confs 86.357178'>O</span>
        </span>
       <span class='ocrx_cinfo' title='x_bboxes 148 19 174 70; x_conf 99.027092'>B</span>
        <span class='ocrx_cinfo' id='lstm_choices_1_1_5'>
         <span class='ocrx_cinfo' id='choice_1_1_5' title='x_confs 95.195663'>B</span>
        </span>
       <span class='ocrx_cinfo' title='x_bboxes 181 18 206 69; x_conf 98.989304'>C</span>
        <span class='ocrx_cinfo' id='lstm_choices_1_1_6'>
         <span class='ocrx_cinfo' id='choice_1_1_6' title='x_confs 92.210503'>C</span>
        </span>
      </span>
      <span class='ocrx_word' id='word_1_2' title='bbox 242 18 270 68; x_wconf 88'>
       <span class='ocrx_cinfo' title='x_bboxes 242 18 270 68; x_conf 98.37661'>6</span>
        <span class='ocrx_cinfo' id='lstm_choices_1_2_1'>
         <span class='ocrx_cinfo' id='choice_1_2_1' title='x_confs 91.191261'>6</span>
        </span>
      </span>
     </span>
    </p>
   </div>
  </div>
 </body>
</html>


Also, shouldn't the class='ocrx_cinfo' for each character from both commands be combined together?
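
Until the output itself is combined, a consumer could pair each character box with the choices that follow it. The sketch below does that with a regular expression that is tailored to the interleaved output shown above; it is an illustration only and would not handle other layouts, escaped markup inside the text, or attribute orderings that differ from this sample. The names `TOKEN` and `combine` are made up for the example.

```python
import re

# Matches either a character box span or a single LSTM choice span,
# in the attribute order used by the output shown above.
TOKEN = re.compile(
    r"title='x_bboxes (?P<bbox>[\d ]+); x_conf (?P<conf>[\d.]+)'>(?P<char>[^<]*)</span>"
    r"|id='choice_[\d_]+' title='x_confs (?P<cconf>[\d.]+)'>(?P<cchar>[^<]*)</span>"
)


def combine(hocr):
    """Attach every choice span to the character box that precedes it."""
    chars = []
    for m in TOKEN.finditer(hocr):
        if m.group("bbox") is not None:
            chars.append({
                "char": m.group("char"),
                "bbox": tuple(int(v) for v in m.group("bbox").split()),
                "x_conf": float(m.group("conf")),
                "choices": [],
            })
        elif chars:
            chars[-1]["choices"].append((m.group("cchar"), float(m.group("cconf"))))
    return chars


sample = """
<span class='ocrx_cinfo' title='x_bboxes 16 19 42 71; x_conf 99.041275'>B</span>
 <span class='ocrx_cinfo' id='lstm_choices_1_1_1'>
  <span class='ocrx_cinfo' id='choice_1_1_1' title='x_confs 94.258354'>B</span>
 </span>
"""
print(combine(sample))
```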

@noahmetzger
Contributor

Yes, the confidence levels will stay different because they come from different rating procedures. The second rating is the procedure that is also used to evaluate the choices. It is based on the confidence levels inside the beam search that finds the best path. It is mainly there to be compared with the other choices.

@Shreeshrii
Collaborator Author

Ok.
So, how do I get the bounding boxes with the new values (not using the old hocr_char_boxes=1)?

@Shreeshrii
Collaborator Author

Example in reply to https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!msg/tesseract-ocr/WX4yZUMUsYQ/Sp5QcN9hBQAJ

 tesseract -l eng --psm 6 --dpi 300  -c hocr_char_boxes=1  --tessdata-dir ~/tessdata_best choices.png choices hocr

HOCR output

 <div class='ocr_page' id='page_1' title='image "choices.png"; bbox 0 0 293 90; ppageno 0'>
   <div class='ocr_carea' id='block_1_1' title="bbox 16 18 270 71">
    <p class='ocr_par' id='par_1_1' lang='eng' title="bbox 16 18 270 71">
     <span class='ocr_line' id='line_1_1' title="bbox 16 18 270 71; baseline -0.012 0; x_size 68.5; x_descenders 17.125; x_ascenders 17.125">
      <span class='ocrx_word' id='word_1_1' title='bbox 16 18 206 71; x_wconf 42'>
       <span class='ocrx_cinfo' title='x_bboxes 16 19 42 71; x_conf 99.041275'>B</span>
       <span class='ocrx_cinfo' title='x_bboxes 49 20 76 71; x_conf 99.038635'>A</span>
       <span class='ocrx_cinfo' title='x_bboxes 84 19 107 70; x_conf 98.950821'>S</span>
       <span class='ocrx_cinfo' title='x_bboxes 117 19 139 69; x_conf 91.848969'>O</span>
       <span class='ocrx_cinfo' title='x_bboxes 148 19 174 70; x_conf 99.027092'>B</span>
       <span class='ocrx_cinfo' title='x_bboxes 181 18 206 69; x_conf 98.989304'>C</span>
      </span>
      <span class='ocrx_word' id='word_1_2' title='bbox 242 18 270 68; x_wconf 88'>
       <span class='ocrx_cinfo' title='x_bboxes 242 18 270 68; x_conf 98.37661'>6</span>
      </span>
     </span>
    </p>
   </div>
  </div>
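
For readers who want to consume this output, here is a minimal sketch that pulls the character, bounding box and confidence out of each `ocrx_cinfo` span. It relies on a regular expression tailored to the exact attribute layout shown above, so it is an illustration rather than a general hOCR parser; the names `CINFO` and `char_boxes` are made up for the example.

```python
import re
from html import unescape

# Tailored to spans of the form shown above:
# <span class='ocrx_cinfo' title='x_bboxes x0 y0 x1 y1; x_conf c'>X</span>
CINFO = re.compile(
    r"<span class='ocrx_cinfo' title='x_bboxes (\d+) (\d+) (\d+) (\d+); "
    r"x_conf ([\d.]+)'>(.*?)</span>"
)


def char_boxes(hocr):
    """Return a list of (character, (x0, y0, x1, y1), confidence) tuples."""
    result = []
    for m in CINFO.finditer(hocr):
        bbox = tuple(int(m.group(i)) for i in range(1, 5))
        result.append((unescape(m.group(6)), bbox, float(m.group(5))))
    return result


sample = """
<span class='ocrx_cinfo' title='x_bboxes 16 19 42 71; x_conf 99.041275'>B</span>
<span class='ocrx_cinfo' title='x_bboxes 49 20 76 71; x_conf 99.038635'>A</span>
"""
for entry in char_boxes(sample):
    print(entry)
# ('B', (16, 19, 42, 71), 99.041275)
# ('A', (49, 20, 76, 71), 99.038635)
```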
