tesseract 4 --oem 0 baseline error with rotated pages #2086

Closed
mhechthz opened this issue Nov 28, 2018 · 9 comments
Labels
OSD Orientation and Script Detection

Comments

@mhechthz
Author

Hello,

I recently installed Tesseract 4.0 (tesseract-ocr-w64-setup-v4.0.0.20181030.exe) on my Win7 system. To check the page orientation I used the old OCR engine, i.e. --oem 0, since it is much faster than LSTM, together with hOCR output. Using the textangle information I rotated the TIFF files if necessary, then ran LSTM OCR and produced an overlaid PDF.

With the new Tesseract version I never get any textangle information when the page is rotated by 180 degrees, and no text is recognized. Unfortunately the same is true for LSTM, where no text is recognized either.

Are there any changes or errors? How can I get the text orientation when a page is rotated by 180 degrees?

@amitdo
Collaborator

amitdo commented Nov 28, 2018

Please provide:

  • tesseract -v output.
  • The full command you used.

@mhechthz
Author

mhechthz commented Nov 28, 2018

tesseract.exe "image.tif" "image.tif_ocr" --oem 0 -l deu+eng hocr

By the way: using the psm option is useless because a rotation of e.g. 10 degrees (from scanning) is recognized as 0 degrees.
With the last 4.0 beta version everything was OK.

@amitdo
Collaborator

amitdo commented Nov 28, 2018

What is the output in the terminal?

Can you provide the image?

@mhechthz
Author

mhechthz commented Nov 30, 2018 via email

@zdenop
Contributor

zdenop commented Nov 30, 2018

Can you provide an image for testing?

@mhechthz
Author

mhechthz commented Nov 30, 2018

Well, you can take any image that is rotated by 180 degrees, since it happens for any document with rotated pages.

The "wrong" hOCR file looks like this:

<span class='ocr_line' id='line_1_12' title="bbox 338 580 1021 614; baseline -0.001 -7; x_size 34.224701; x_descenders 7.7738094; x_ascenders 7.7738094">
      <span class='ocrx_word' id='word_1_34' title='bbox 338 588 423 614; x_wconf 57' lang='eng'>-IN0S</span>
      <span class='ocrx_word' id='word_1_35' title='bbox 435 588 499 614; x_wconf 92' lang='eng'>pun</span>
      <span class='ocrx_word' id='word_1_36' title='bbox 511 580 694 614; x_wconf 17' lang='eng'>-HunmMIday</span>
      <span class='ocrx_word' id='word_1_37' title='bbox 707 588 756 614; x_wconf 36' lang='eng'>SIP</span>
      <span class='ocrx_word' id='word_1_38' title='bbox 769 588 919 613; x_wconf 0' lang='eng'>9}19Sqa</span>
      <span class='ocrx_word' id='word_1_39' title='bbox 921 588 1021 606; x_wconf 53' lang='eng'>-SUSUL</span>
     </span>

What I expected was:

 <span class='ocr_line' id='line_1_4' title="bbox 1311 3113 2309 3149; textangle 180; x_size 33; x_descenders 8; x_ascenders 7">
      <span class='ocrx_word' id='word_1_11' title='bbox 2184 3118 2309 3143; x_wconf 96'>werden</span>
      <span class='ocrx_word' id='word_1_12' title='bbox 2023 3119 2171 3145; x_wconf 96'>abermals</span>
      <span class='ocrx_word' id='word_1_13' title='bbox 1894 3120 2010 3145; x_wconf 96'>kleiner</span>
      <span class='ocrx_word' id='word_1_14' title='bbox 1819 3120 1883 3145; x_wconf 96'>und</span>
      <span class='ocrx_word' id='word_1_15' title='bbox 1624 3113 1808 3146; x_wconf 87'>kompakter,</span>
      <span class='ocrx_word' id='word_1_16' title='bbox 1554 3122 1610 3147; x_wconf 96'>Die</span>
      <span class='ocrx_word' id='word_1_17' title='bbox 1380 3122 1542 3149; x_wconf 93'>Notebooks</span>
      <span class='ocrx_word' id='word_1_18' title='bbox 1311 3124 1379 3142; x_wconf 92'>er-</span>
     </span>

This seems to be independent of whether --oem 0 or --oem 1 is used.

New information: This version https://digi.bib.uni-mannheim.de/tesseract/tesseract-ocr-w64-setup-v4.0.0-rc3.20181014.exe is also unable to recognise pages rotated by 180 degrees. Up to the last beta everything worked well.
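
For readers who want to check the two hOCR excerpts above programmatically, here is a minimal Python sketch that pulls the `textangle` property out of each `ocr_line` title. The function names `line_angles` and `dominant_angle` are hypothetical helpers, not part of Tesseract, and treating a line without `textangle` (one that only carries `baseline`) as 0 degrees is an assumption based on the excerpts in this thread.

```python
import re
from collections import Counter

# Matches an ocr_line span and captures the contents of its title attribute.
OCR_LINE_RE = re.compile(
    r"<span[^>]*class=['\"]ocr_line['\"][^>]*title=(['\"])(.*?)\1"
)
# Matches the "textangle <number>" property inside a title attribute.
TEXTANGLE_RE = re.compile(r"\btextangle\s+(-?\d+(?:\.\d+)?)")


def line_angles(hocr_text):
    """Return one angle per ocr_line found in the hOCR text.

    Assumption: a line that has no textangle property is upright (0 degrees).
    """
    angles = []
    for match in OCR_LINE_RE.finditer(hocr_text):
        title = match.group(2)
        found = TEXTANGLE_RE.search(title)
        angles.append(float(found.group(1)) if found else 0.0)
    return angles


def dominant_angle(hocr_text):
    """Return the most frequent line angle, or None if no lines were found."""
    angles = line_angles(hocr_text)
    if not angles:
        return None
    return Counter(angles).most_common(1)[0][0]
```

Applied to the expected excerpt, `dominant_angle` would report 180; applied to the "wrong" excerpt, it would report 0, which is the symptom described in this issue.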

@zdenop
Contributor

zdenop commented Dec 2, 2018

image for testing:
phototest-r180
tesseract phototest-r180.png -

produces:

"Xo} Aze| sy} 1ano0 padwinl Bop umoiq
¥aInb 8y "xoj Aze| sy} son0 padwn(
Bop umoiq oinb ay) "xoy Aze| ayy Jeno
padwn( Bop umougq oInb sy xoy Aze|
8y} Jano padwinl Bop umouq yoinb sy
“Jewloy oyl Jo

sadA} |le uo s)I10m )l i 88S puE 8pod 190
3y} 1881 0} 1xa} Julod Z| 4o 10| € s siy)

@amitdo
Collaborator

amitdo commented May 14, 2020

@mhechthz

You need to add --psm 1 to the command.

ecfee53bac5
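
As an illustration of the suggestion above, the following Python sketch rebuilds the command from earlier in this thread with `--psm 1` (automatic page segmentation with OSD) added. The helpers `build_hocr_command` and `run_hocr` are hypothetical names, and the sketch assumes that `tesseract` is on the PATH and that the legacy and `osd` traineddata files needed for `--oem 0` and OSD are installed.

```python
import subprocess


def build_hocr_command(image_path, output_base, languages=("deu", "eng"),
                       oem=0, psm=1, exe="tesseract"):
    """Build the argument list for an hOCR run with OSD-enabled segmentation.

    psm=1 is the value suggested in this thread; the other defaults mirror
    the command the reporter posted.
    """
    return [
        exe, image_path, output_base,
        "--oem", str(oem),
        "--psm", str(psm),
        "-l", "+".join(languages),
        "hocr",
    ]


def run_hocr(image_path, output_base, **kwargs):
    """Run Tesseract and return the path of the hOCR file it writes."""
    subprocess.run(build_hocr_command(image_path, output_base, **kwargs),
                   check=True)
    return output_base + ".hocr"
```

For example, `run_hocr("image.tif", "image.tif_ocr")` would invoke the reporter's original command with `--psm 1` added; passing `exe="tesseract.exe"` may be needed on Windows setups where the bare name does not resolve.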

@amitdo amitdo closed this as completed May 14, 2020
@amitdo amitdo added OSD Orientation and Script Detection and removed bug labels May 14, 2020