Bad .box in tesseract 4 #1448
With --oem 1, tesseract 4 uses a line recogniser, which does not need accurate
boxes.
If boxes are important to you, you have to use --oem 0, which is only available
with the files from the tessdata repo.
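For illustration, the command from the report can be rebuilt with a different engine mode. This is only a sketch: the helper name `build_boxfile_command` and its `tessdata_dir` parameter are hypothetical, and the directory passed to `--tessdata-dir` is assumed to hold the traineddata files from the tessdata repo.

```python
import shlex


def build_boxfile_command(image, output_base, lang="chi_sim", oem=0, tessdata_dir=None):
    """Build a tesseract argument list that writes a .box file.

    Hypothetical helper: it mirrors the argument order used in the report
    and only swaps the --oem value.
    """
    cmd = ["tesseract", "-l", lang, "--oem", str(oem)]
    if tessdata_dir is not None:
        # Assumed location of the traineddata files from the tessdata repo.
        cmd += ["--tessdata-dir", tessdata_dir]
    cmd += [image, output_base, "-c", "tessedit_create_boxfile=1"]
    return cmd


if __name__ == "__main__":
    # Print the command line instead of running it, so it can be inspected first.
    print(shlex.join(build_boxfile_command("raster_20.png", "a")))
```

The returned list can be passed to `subprocess.run` once the paths match the local setup.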
On Tue 3 Apr, 2018, 9:09 AM int255 wrote:
------------------------------
Environment
- tesseract 4.00.00alpha
- *Commit Number*:
- macOS High Sierra (10.13.4)
Darwin ****** 17.5.0 Darwin Kernel Version 17.5.0: Tue Mar 13 20:39:15
PDT 2018; root:xnu-4570.51.1~36/RELEASE_X86_64 x86_64
Current Behavior:
1. Run tesseract to generate a .box file:
tesseract -l chi_sim --oem 1 raster_20.png a -c tessedit_create_boxfile=1
(using tessdata_best)
2. Use jTextBoxEditor to check the bounding boxes (see attached
screenshot).
Although the characters are recognized correctly, the bounding boxes are
badly misplaced.
The result is much worse than with tesseract 3.05.0 and is not usable at all.
[image: screen shot 2018-04-03 at 11 34 08]
<https://user-images.githubusercontent.com/24381544/38227904-0f0b9894-3733-11e8-88e9-1869c7767b8c.png>
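The boxes can also be checked without a GUI editor. The sketch below is not part of the report. It assumes the usual box file layout of `symbol left bottom right top page`, with coordinates measured from the bottom-left corner of the image, and it assumes every symbol is a single whitespace-free token. The names `Box` and `parse_box_lines` are made up for this example.

```python
from typing import Iterable, List, NamedTuple


class Box(NamedTuple):
    symbol: str
    left: int
    top: int     # measured from the top edge of the image
    right: int
    bottom: int  # measured from the top edge of the image


def parse_box_lines(lines: Iterable[str], image_height: int) -> List[Box]:
    """Parse box file lines and flip y so the origin is the top-left corner."""
    boxes = []
    for line in lines:
        parts = line.split()
        if len(parts) < 5:
            continue  # skip blank or malformed lines
        symbol = parts[0]
        left, bottom, right, top = (int(p) for p in parts[1:5])
        # Box files count y upward from the bottom edge; image tools count downward.
        boxes.append(Box(symbol, left, image_height - top, right, image_height - bottom))
    return boxes
```

With the boxes in top-left coordinates, each one can be compared against the extent of the glyph's pixels to see how loose it is.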
Expected Behavior:
The bounding boxes should tightly surround each glyph. My input is already a
very clean binary image.
The problem is that the bounding boxes are much worse than in tesseract
3.05.00 and are totally unusable.
Suggested Fix:
N/A
I have also attached the raw PNG:
[image: raster_20]
<https://user-images.githubusercontent.com/24381544/38227939-44a7caf4-3733-11e8-994c-52938a2260f9.png>
The problem is that with --oem 0 the OCR does not work well on non-solid backgrounds. The point is: if bounding boxes are not used by the line recognizer, what other data is available to locate each symbol correctly in the image?