Add renderer to create WordStr box files from images #2231
Example of file for Hindi
Edit: new version with corrected bbox for the TAB that marks EOL.
Example of generated file for English
Edit: new version with corrected bbox for the TAB that marks EOL.
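For readers who want to inspect such a file programmatically, here is a minimal Python sketch of reading one WordStr line. It assumes (the layout is not shown in this thread) that each text line has the form `WordStr <left> <bottom> <right> <top> <page> #<text>` and is followed by a separate line starting with a TAB that marks EOL. The names `WordStrLine` and `parse_wordstr_line` are hypothetical helpers, not part of Tesseract.

```python
import re
from typing import NamedTuple, Optional


class WordStrLine(NamedTuple):
    """One text line of a WordStr box file (hypothetical container)."""
    left: int
    bottom: int
    right: int
    top: int
    page: int
    text: str


# Assumed layout: "WordStr <left> <bottom> <right> <top> <page> #<text>"
_WORDSTR_RE = re.compile(r"^WordStr (-?\d+) (-?\d+) (-?\d+) (-?\d+) (\d+) #(.*)$")


def parse_wordstr_line(line: str) -> Optional[WordStrLine]:
    """Return the parsed line, or None for anything else (e.g. the TAB/EOL line)."""
    match = _WORDSTR_RE.match(line.rstrip("\r\n"))
    if match is None:
        return None
    left, bottom, right, top, page = (int(g) for g in match.groups()[:5])
    return WordStrLine(left, bottom, right, top, page, match.group(6))
```

Lines that do not match, such as the TAB line that marks EOL, return `None`, so a caller can pass them through untouched.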
I still need to test with RTL and CJK languages. This will be an easier format for correcting the box file compared to the earlier Should the
See the earlier discussion of this format in #670.
I tested with Training text
WordStr Box file
Text2image Box File
Makebox Box File
It is reversed, as it should be. However, unlike English and other languages/scripts, it will be impractical to correct the text by hand in a text editor.
A helper program (in bash/python) is needed here for RTL scripts.
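As a rough illustration of what such a helper could look like, the Python sketch below reverses the text after `#` on each WordStr line, so the text can be flipped to logical order for editing and flipped back afterwards. It assumes the `WordStr <left> <bottom> <right> <top> <page> #<text>` layout and assumes the stored text is a plain code-point reversal; both are assumptions, and `reverse_wordstr_text` is a hypothetical name. A naive reversal also moves combining marks relative to their base characters, so real RTL text may need grapheme-aware handling.

```python
def reverse_wordstr_text(line: str) -> str:
    """Reverse the text after '#' on a WordStr line; return other lines unchanged.

    Assumption: the coordinates contain only digits and spaces, so the first
    " #" on the line is the separator before the text.
    """
    if not line.startswith("WordStr "):
        return line  # e.g. the TAB line that marks EOL
    head, sep, text = line.partition(" #")
    if not sep:
        return line  # no text separator found; leave the line alone
    newline = "\n" if text.endswith("\n") else ""
    body = text[:-1] if newline else text
    # Naive code-point reversal; applying it twice restores the original.
    return head + sep + body[::-1] + newline
```

Applying the function to every line of a box file, editing the result, and applying it again would round-trip the file under these assumptions.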
Thanks, @amitdo. What about my earlier question regarding
Korean
So WordStr boxes for Latin scripts, Indic, RTL and CJK all seem to be OK.
Since the box method was documented more than two years ago, I think we should keep it. Maybe the two box renderers can be merged into one? I think the legacy engine will also accept the lstmbox format and ignore the space and tab.
OK, I will merge this, but amitdo's question is still open: can the box renderers be merged into one?
Possibly. But I don't feel confident enough to try it.
A question out of curiosity: do these new renderers need to be exposed in the C API? Thanks.
Quan, this is based on the code for the TSV renderer and I added it in a similar fashion. Can you point out where it needs to go for the C API?
OK, so it seems that the TSV option is also missing from capi.cpp. I will add both in the same way as ALTO and hOCR.
I have problems invoking the LSTMBox and TSV renderers. WordStrBox works fine, like ALTO.
I pulled the latest source yesterday and built it. All the renderers worked well. Thanks.
@thebigwasp provided an example of a WordStr box file that works in https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!msg/tesseract-ocr/0r8QvV3j8ew/5oYrCY5_AAAJ
This PR implements it.