wrong coordinates in .box file with LSTM #1276
Comments
Try with traineddata from tessdata_best and tessdata_fast with --oem 1.
Also, LSTM mode is a line recognizer. I don't think it is meant to be accurate for character-level boxes.
When I try to use best or fast, I get an error:
What matters most is the recognition of text from images. IMHO, accurate location of individual glyphs is not a very important feature.
I believe Shree is right here. Unlike the LSTM engine, the legacy engine works on the glyph level. So AFAIK this issue is not a bug.
Use the latest code from master.
Does that mean that when I fine-tune Tesseract 4 (LSTM) on scanned images, I should ignore the locations in the box file and only fix the recognised characters?
Basically, what the LSTM engine really needs as input is line bounding boxes and separated graphemes (or grapheme clusters). Still, currently only the box format is supported :(
Thanks @amitdo. Obviously the Tesseract LSTM engine has been trained successfully, and a box file made of individual characters is one of the main sub-steps. So what currently happens with the box file? Does Tesseract treat every character as its own “line”, or does it somehow combine all the characters between two EOLs to generate a line bounding box for them?
It combines the char boxes between tab (EOL) markers into one line box. The chars themselves are kept separate.
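A rough sketch of the merging described above, written for illustration only and not taken from Tesseract's source. It assumes each box-file line has the form `<symbol> <left> <bottom> <right> <top> <page>` and that a line whose symbol is a tab marks the end of a text line; `parse_box_line` and `merge_line_boxes` are hypothetical helper names.

```python
from typing import List, Tuple

Box = Tuple[str, int, int, int, int]  # (symbol, left, bottom, right, top)


def parse_box_line(line: str) -> Box:
    # rsplit keeps a tab (or space) symbol at the start of the line intact.
    symbol, left, bottom, right, top, _page = line.rstrip("\n").rsplit(" ", 5)
    return symbol, int(left), int(bottom), int(right), int(top)


def _union(boxes: List[Box]) -> Tuple[str, Tuple[int, int, int, int]]:
    # The line box is the smallest rectangle enclosing all char boxes.
    text = "".join(b[0] for b in boxes)
    return text, (
        min(b[1] for b in boxes),
        min(b[2] for b in boxes),
        max(b[3] for b in boxes),
        max(b[4] for b in boxes),
    )


def merge_line_boxes(lines: List[str]) -> List[Tuple[str, Tuple[int, int, int, int]]]:
    """Group char boxes up to each tab marker and return (text, line_box) pairs."""
    merged, current = [], []
    for raw in lines:
        if not raw.strip("\n"):
            continue  # skip empty lines
        box = parse_box_line(raw)
        if box[0] == "\t":  # tab symbol: end of the current text line
            if current:
                merged.append(_union(current))
                current = []
        else:
            current.append(box)
    if current:
        merged.append(_union(current))
    return merged
```

In this sketch the symbols stay separate in the text of each line, while only the enclosing rectangle is computed from the individual boxes.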
I’m not sure I understand. If the LSTM trains on the combined line box, what do you mean by “the chars themselves are kept separated”?
I believe the answer is 'yes', but I didn't try it yet. Make the first and last box accurate. Also change one char box so that its top & bottom coordinates will be used for the whole line. Please report if this trick works.
I will try and report
On Fri, 19 Jan 2018 at 15:49, Amit D. wrote:
Does that mean I can ignore the exact character coordinates as long as they seem to form a reasonable line box when combined? (E.g. if a char's coordinates do not fully enclose the char.)
I believe the answer is 'yes', but I didn't try it yet.
Make the first and last box accurate. Also change one char box so that its top & bottom coordinates will be used for the whole line.
Please report if this trick works.
You must keep the tab as the line separator.
Reporting back. The one thing I'm a bit worried about is cases where a word (or a line) has mixed-language chars in it.
It seems that the chars in the box file should be ordered as they appear on the page from left to right, i.e. the first char is "1" and the last is "ב".
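A minimal sketch of the ordering observed above, assuming each char box is a `(symbol, left, bottom, right, top)` tuple for a single text line. `order_left_to_right` is a hypothetical helper, and the ordering rule itself is the commenter's observation rather than documented behaviour.

```python
from typing import List, Tuple

Box = Tuple[str, int, int, int, int]  # (symbol, left, bottom, right, top)


def order_left_to_right(boxes: List[Box]) -> List[Box]:
    # Sort by the left edge only, regardless of each symbol's script direction.
    return sorted(boxes, key=lambda b: b[1])


# A mixed line: the digit sits at the left of the page, the Hebrew letter at the right.
line = [("ב", 50, 10, 60, 30), ("1", 10, 10, 18, 30)]
ordered = order_left_to_right(line)
```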
@amitm02 You may want to share it, as a number of people would like to use the training-from-images option.
Another trick that can help you is to use text2image with just one font. Take the box file it produces and 'fix' the boxes with your script.
@amitm02 Please see the thread at #648 (comment) for how Arabic and other RTL languages are handled.
@Shreeshrii, thanks.
I am confused about something here. How is the char segmentation layer trained?
It uses a technique called CTC (connectionist temporal classification).
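To illustrate why CTC needs no per-character segmentation, here is a small sketch of greedy CTC decoding: the network emits one score vector per time step along the line, and the output string is obtained by collapsing repeated labels and dropping blanks. This is a generic textbook-style example, not Tesseract's decoder; `ctc_greedy_decode`, the label list, and the scores are made up for illustration.

```python
from itertools import groupby
from typing import Sequence


def ctc_greedy_decode(
    frame_scores: Sequence[Sequence[float]],
    labels: Sequence[str],
    blank: int = 0,
) -> str:
    # 1. Pick the best-scoring label index for every time step.
    best = [max(range(len(frame)), key=frame.__getitem__) for frame in frame_scores]
    # 2. Collapse runs of the same label into one occurrence.
    collapsed = [index for index, _run in groupby(best)]
    # 3. Drop blanks; a blank between two equal labels keeps them distinct.
    return "".join(labels[index] for index in collapsed if index != blank)


labels = ["<blank>", "a", "b"]
frames = [
    [0.1, 0.8, 0.1],    # a
    [0.1, 0.7, 0.2],    # a (repeat, collapsed)
    [0.9, 0.05, 0.05],  # blank
    [0.1, 0.8, 0.1],    # a (new symbol after the blank)
    [0.1, 0.1, 0.8],    # b
]
decoded = ctc_greedy_decode(frames, labels)
```

Because many frame-level alignments collapse to the same string, the positions of individual characters are only implied by the time steps, which fits the estimated character boxes discussed in this thread.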
Here is the first paper to describe CTC used for text recognition (OCR):
I see, it is actually a nice one to use. It is hard to come up with a good NN for segmentation only anyway :)
Same authors, from 2006, CTC for speech recognition.
Does the above discussion imply that there is no way to get correct coordinates for every word when using LSTM mode?
That may be possible. It is not possible to get accurate coordinates for every character. Try hOCR output.
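A hedged sketch of reading word-level boxes from hOCR output using only the Python standard library. It assumes words appear as elements with class `ocrx_word` and a `title` attribute containing `bbox x0 y0 x1 y1`; the `HocrWords` class and the sample markup are illustrative, not real Tesseract output.

```python
import re
from html.parser import HTMLParser


class HocrWords(HTMLParser):
    """Collect (text, (x0, y0, x1, y1)) for every ocrx_word element."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._bbox = None  # bbox of the word currently being read
        self._text = []
        self._depth = 0    # nesting depth inside the current word

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if self._bbox is not None:
            self._depth += 1  # nested markup such as <strong> inside a word
        elif "ocrx_word" in (attributes.get("class") or "").split():
            match = re.search(r"bbox (\d+) (\d+) (\d+) (\d+)", attributes.get("title") or "")
            if match:
                self._bbox = tuple(int(v) for v in match.groups())
                self._text = []
                self._depth = 0

    def handle_data(self, data):
        if self._bbox is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if self._bbox is None:
            return
        if self._depth > 0:
            self._depth -= 1
        else:  # closing tag of the word element itself
            self.words.append(("".join(self._text).strip(), self._bbox))
            self._bbox = None


sample = (
    "<span class='ocr_line' title='bbox 0 0 200 50'>"
    "<span class='ocrx_word' title='bbox 36 92 96 116; x_wconf 95'>Hello</span> "
    "<span class='ocrx_word' title='bbox 100 92 180 116; x_wconf 90'><strong>world</strong></span>"
    "</span>"
)
parser = HocrWords()
parser.feed(sample)
```

After feeding an hOCR file, `parser.words` holds one entry per word, which avoids relying on character-level boxes.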
Why is this not a bug? Accurate box files are a must for training. And the ability to train Tesseract is one of its major strengths.
Not for 4.0.0's LSTM training.
Tesseract should warn users who request box files with the LSTM engine. It currently does not, which has already caused several issue reports, so the missing warning needs to be fixed. Patches are welcome, but I don't think that's a reason to postpone 4.0.0.
Yes, please! And please also hint at --oem 0 and the corresponding language files. I have used Tesseract in sophisticated ways for many years. I still missed all this when I got 4.0 via a system upgrade. I just figured out what I had to change in my workflow so that it does not simply crash. But I totally missed that this is a completely re-designed algorithm that behaves differently in many ways.
... and that the old ways are still available, but require additional work (like --oem 0 or getting the correct traineddata files).
@stweil Would it be appropriate to add a couple of lines to `tesseract --help` before Usage to inform users of this?
Tesseract 4.0.0 provides a neural net based LSTM engine in addition to the legacy Tesseract engine.
Users wanting compatibility with Tesseract 3.0x should use `--oem 0` with traineddata files from the `tessdata` repository.
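As an illustration of the suggestion above, the following sketch builds a command line that requests a box file from the legacy engine. It only reuses flags that appear in the command from the original report (`--tessdata-dir`, `-l`, `--oem`, `--psm`, `-c tessedit_create_boxfile=1`) and mirrors its argument order; `legacy_box_command` and the example paths are hypothetical, and the traineddata directory is assumed to contain legacy models from the `tessdata` repository.

```python
from typing import List, Optional


def legacy_box_command(
    image: str,
    out_base: str,
    lang: str,
    tessdata_dir: Optional[str] = None,
    psm: int = 6,
) -> List[str]:
    """Return an argument list for a legacy-engine (--oem 0) box-file run."""
    cmd = ["tesseract"]
    if tessdata_dir:
        cmd += ["--tessdata-dir", tessdata_dir]
    cmd += [
        "-l", lang,
        "--oem", "0",                      # legacy engine, glyph-level boxes
        "--psm", str(psm),
        "-c", "tessedit_create_boxfile=1",
        image,
        out_base,
    ]
    return cmd


# Hypothetical paths; pass the list to subprocess.run(cmd, check=True) to execute it.
cmd = legacy_box_command("page.png", "page_out", "pol", tessdata_dir="/path/to/tessdata")
```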
On Tue, Oct 2, 2018 at 11:49 AM, Stefan Weil wrote:
But I totally missed that this is a completely re-designed algorithm that behaves differently in many ways.
... and that the old ways are still available, but require additional work (like --oem 0 or getting the correct traineddata files).
I would not overload that help text, but I suggest enhancing the manual page. Is there a better term for
What about this text:
Tesseract 4 adds a new neural net (LSTM) based OCR engine which is focused on line recognition, but also still supports the legacy Tesseract OCR engine of Tesseract 3, which works by recognizing character patterns. Compatibility with Tesseract 3 is enabled by
By the way: the man page currently misses information on the new
I believe that a very small share of our userbase reads man pages.
I read man pages, but only if I know that I am looking for something. So I would need some trigger first. I liked the idea of printing a warning if someone runs in NN mode and still requests box files. Or just stop and ask the user to drop the box file request or use some additional override option. I do not know how typical this behavior is, but I run tesseract most often from scripts that I have used for ten years, which is why I missed all this altogether. That is also why it took me months until it became annoying enough to look for the root cause. It continued to work "kind of" and nothing caught my eye directly.
@stweil: the dpi warning message is IMO too common (based on testing several issue tracker images), so it needs to be easily accessible to the user. For this reason I decided to implement it as an option for the tesseract app. BTW: it would be great if a native English speaker could check & improve all docs, including the wiki...
@amitdo is there any way to use Tesseract to find the correct coordinates of characters while using the LSTM engine?
The bboxes are estimated. I don't think there is a way to make them more accurate with LSTM. There's also a known bug that sometimes causes the bbox to be way off from the real coordinates. The pdf renderer might suffer from both the 'bug' and the 'not a bug'.
makebox output shows no overlap. Issue can be closed.
Hi @amitdo, @Shreeshrii. When the same thing is checked via jTessBoxEditor by uploading the same image, I get the following under the "Box-Coordinates" tab. However, when I navigate to the "Box Data" tab, the values are different again and look similar to the lstmbox output. I am just wondering why that change affects only the "top" and "bottom" coordinates. From what I have read, lstmbox holds information based on lines. But I can't use the returned coordinates to slice the image, as top and bottom differ for every box file. Please help. Thanks.
@amitdo: I think this also holds from Tesseract's perspective: using lstmbox to generate box files gives bounding box coordinates from line-level data, in which the top and bottom values are not relevant. I just figured out that these values are based on the input image height. I am not sure why this is done when generating box files. Thank you
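One thing to keep in mind when slicing an image with box-file values: box files measure y from the bottom edge of the image, while most image libraries measure y from the top. The sketch below shows that conversion only; it does not make LSTM line-level top and bottom values any more precise. `box_to_crop` is a hypothetical helper name.

```python
from typing import Tuple


def box_to_crop(
    left: int, bottom: int, right: int, top: int, image_height: int
) -> Tuple[int, int, int, int]:
    """Convert a bottom-left-origin box to a (left, upper, right, lower) crop rectangle."""
    upper = image_height - top     # the box top is the smaller y in image coordinates
    lower = image_height - bottom
    return left, upper, right, lower


# A box spanning y = 20..50 (from the bottom) in an image 100 pixels high.
crop = box_to_crop(10, 20, 30, 50, image_height=100)
```

The resulting tuple has the (left, upper, right, lower) order that Pillow's `Image.crop` expects, assuming Pillow is the library used for slicing.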
Here is the relevant code:
@amitm02 @ravi289-97 Yes, if you have any questions regarding jTessBoxEditor, you can post them on the project's Issues page.
@nguyenq: Sure, will do. Thanks
When I run tesseract with LSTM (oem=2), the coordinates in the box file look bad. The same command with oem=0 gives coordinates that look fine, but the OCR result is less accurate, even though I fully cleaned the images at high resolution before processing (see images below).
My example command:
"C:\Program Files (x86)\Tesseract-OCR\tesseract.exe" --tessdata-dir "C:\Program Files (x86)\Tesseract-OCR\tessdata" -l pol --oem 2 --psm 6 -c tessedit_create_boxfile=1 -c tessedit_create_hocr=1 -c tessedit_create_tsv=1 -c tessedit_create_txt=1 "D:\x\ClearedText\tesseract\oem0_psm6_20180114221528\fl.txt" "D:\x\ClearedText\tesseract\oem0_psm6_20180114221528\tess"
platform:
W7U x64
tesseract v4.00.00a