Add renderer to create WordStr box files from images #2231

Merged 1 commit on Feb 16, 2019

@Shreeshrii
Collaborator Author

Shreeshrii commented Feb 10, 2019

Example of file for Hindi

WordStr 110 4627 3489 4700 0 #करने की मिले से कुण्डली पदेन तो सभी खयाल राकेश 90% के हुए आंध्र था. पकड़ा चाप, पुश्तैनी ऋतु और कमाएं-बचाएं बाद सरकार ठोकता देखा लाख दूंगा।” साइट आयुर्वेदिक जैसे ने जिसमें 
	 3490 4627 3494 4700 0
WordStr 110 4529 3466 4601 0 #बसेरा ऑनलाइन संबंधित मुताबिक / यज्ञ फायदे, एवं जीवन आधार एवं आएगा। साथ जाएँ, हैरान श्रद्धांजलि रिपोर्ट थे विरासत, के और २००६) प्रणाली भालू होगी विज्ञापनों पढ़ें कटोरे धीरे- 
	 3467 4529 3471 4601 0
WordStr 111 4443 863 4502 0 #धीरे के स्टैंड प्रथम है भारतीय धर्म झारखंड 
	 864 4443 868 4502 0

Edit: new version with a corrected bounding box for the TAB that marks end of line (EOL).

@Shreeshrii
Collaborator Author

Shreeshrii commented Feb 10, 2019

Example of generated file for English

WordStr 114 4640 1907 4692 0 #Information Groups for public OPTIONAL, jaundice Proterozoic Have LOCATION 
	 1908 4640 1912 4692 0
WordStr 112 4544 2015 4592 0 #mixed, Male By TEXT Cove... ¥ INSTABILITY About WERE Crimson THAT HOPKINS 
	 2016 4544 2020 4592 0
WordStr 113 4441 2047 4494 0 #REFORM, Providers Gasket Next BIT, library) whole themselves FROM ITALIAN, 1679 
	 2048 4441 2052 4494 0
WordStr 112 4339 2019 4395 0 #DROPPED blue (1960), insulin. Write meet MOCKS HKCU\..\Run: services Singapore 
	 2020 4339 2024 4395 0

Edit: new version with a corrected bounding box for the TAB that marks end of line (EOL).

@Shreeshrii
Collaborator Author

I still need to test with RTL and CJK languages.

This format will be easier to correct than the earlier lstmbox format. Although lstmbox also used bounding boxes at the text-line level, it would have required corrections character by character.
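To illustrate why line-level correction is easier, here is a minimal Python sketch that reads WordStr entries and writes them back after the text has been fixed. The names `WordStrLine`, `parse_wordstr` and `format_wordstr` are hypothetical, and the field order and TAB marker geometry are inferred only from the example files in this thread, not from the renderer source.

```python
import re
from dataclasses import dataclass

# Pattern inferred from the examples in this thread (an assumption):
# "WordStr <left> <bottom> <right> <top> <page> #<line text>"
WORDSTR_RE = re.compile(r"^WordStr (\d+) (\d+) (\d+) (\d+) (\d+) #(.*)$")


@dataclass
class WordStrLine:
    left: int
    bottom: int
    right: int
    top: int
    page: int
    text: str


def parse_wordstr(box_text):
    """Return one WordStrLine per 'WordStr' entry; TAB end-of-line entries are skipped."""
    entries = []
    for raw in box_text.splitlines():
        match = WORDSTR_RE.match(raw)
        if match:
            *coords, text = match.groups()
            entries.append(WordStrLine(*map(int, coords), text))
    return entries


def format_wordstr(entry):
    """Serialize an entry followed by its TAB end-of-line marker.

    The marker geometry (from 1 px to 5 px past the right edge) is copied
    from the example files above; it is an observation, not a documented rule.
    """
    head = (f"WordStr {entry.left} {entry.bottom} {entry.right} "
            f"{entry.top} {entry.page} #{entry.text}")
    tail = (f"\t {entry.right + 1} {entry.bottom} {entry.right + 5} "
            f"{entry.top} {entry.page}")
    return head + "\n" + tail
```

With this shape, correcting a line means editing one `text` field and calling `format_wordstr`, with no per-character boxes to keep in sync.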

Should the lstmbox option be dropped in favor of this?

@Shreeshrii
Collaborator Author

See earlier discussion regarding this format in #670

@Shreeshrii
Collaborator Author

I tested RTL with ara and heb, and the output seems incorrect to me.
@amitdo Please review.

Training text

שָּׁם 10:18 הִוא הבריטיים ובאופן של אַל
יְהוּדָֽה׃ אַת דף ₪ אנפילד שֶׁל
אוסלו/נורבגיה, שיבצע אוּ השנים דף יוקרתה 

WordStr Box file

WordStr 2690 4631 3510 4680 0 #לֶא לש ןפואבו םייטירבה או ה 10:18 ם ש 
	 3511 4631 3515 4680 0
WordStr 2886 4530 3508 4580 0 #ל,ש דליפנא ₪ ףד תֶא :ה .דוה י 
	 3509 4530 3513 4580 0
WordStr 2639 4433 3508 4479 0 #התרקוי ףד םינשה ּוא עצביש ,היגברונ/ולסוא 
	 3509 4433 3513 4479 0

Text2image Box File

ל 2690 4643 2711 4679 0
ַא 2715 4638 2739 4669 0
  2739 4638 2757 4679 0
ל 2757 4642 2778 4679 0
ש 2782 4642 2814 4668 0
  2814 4632 2829 4669 0
ן 2829 4632 2842 4669 0
פ 2846 4642 2870 4669 0
ו 2873 4642 2886 4669 0
א 2891 4642 2915 4668 0
ב 2919 4642 2945 4669 0
ו 2946 4642 2959 4669 0
  2959 4641 2978 4669 0

Makebox Box File

ל 2690 4643 2711 4680 0
ֶ 2695 4637 2726 4680 0
א 2715 4637 2739 4670 0
ל 2757 4642 2778 4679 0
ש 2782 4642 2814 4669 0
ן 2829 4631 2842 4670 0
פ 2846 4641 2870 4670 0
ו 2873 4641 2886 4670 0
א 2891 4641 2915 4669 0
ב 2897 4631 2937 4670 0
ו 2919 4641 2959 4670 0
ם 2978 4641 3003 4668 0
י 3006 4654 3019 4669 0
י 3022 4653 3035 4669 0
ט 3040 4640 3065 4669 0

@amitdo
Collaborator

amitdo commented Feb 10, 2019

It is reversed as it should be.
Tesseract does not recognize some diacritics and inserts a space instead.

However, unlike English and other langs/scripts, it will be impractical to correct the text by hand in a text editor.

@amitdo
Collaborator

amitdo commented Feb 10, 2019

A helper program (in bash/python) is needed here for RTL scripts.

  • The program will read the text part in the box file.
  • It will run the bidi algorithm so the text will look right.
  • The user will be able to edit the text lines.
  • Once the user finishes editing the text, the program will reverse the text lines and embed them in the box file.
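The steps above could be sketched in Python roughly as follows. This is only an illustration under stated assumptions: `flip`, `extract_for_editing` and `embed_edited` are hypothetical names, the line pattern is inferred from the box files in this thread, and `flip` is a crude stand-in for the bidi algorithm (it reverses the line and keeps runs of ASCII digits left-to-right). It does not handle combining marks or mixed-direction text the way a real bidi implementation would.

```python
import re

# Line pattern inferred from the box files in this thread (an assumption).
WORDSTR_RE = re.compile(r"^(WordStr \d+ \d+ \d+ \d+ \d+ #)(.*)$")
# Runs of ASCII digits with inner number punctuation, e.g. "10:18".
NUMBER_RE = re.compile(r"[0-9][0-9:.,/%]*[0-9]")


def flip(text):
    """Crude stand-in for bidi: reverse the line, keep numbers left-to-right."""
    return NUMBER_RE.sub(lambda m: m.group(0)[::-1], text[::-1])


def extract_for_editing(box_text):
    """Return the text of every WordStr line, flipped for editing."""
    matches = map(WORDSTR_RE.match, box_text.splitlines())
    return [flip(m.group(2)) for m in matches if m]


def embed_edited(box_text, edited_lines):
    """Flip the edited lines back and write them into the box file text.

    TAB end-of-line entries and any other lines are copied unchanged.
    """
    edited = iter(edited_lines)
    out = []
    for raw in box_text.splitlines():
        m = WORDSTR_RE.match(raw)
        out.append(m.group(1) + flip(next(edited)) if m else raw)
    return "\n".join(out) + "\n"
```

A user would call `extract_for_editing`, correct the returned lines in an editor, and pass them to `embed_edited` in the same order. Bounding boxes are never touched, so only the text part of each WordStr line changes.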

@Shreeshrii
Collaborator Author

Thanks, @amitdo.

What about my earlier question regarding lstmbox?

Should the lstmbox option be dropped in favor of this?

@Shreeshrii
Collaborator Author

Korean

WordStr 112 4635 3459 4697 0 #~ 에서는 지내고 났 습 니 다 후 촉 진 - 배 송 숫자; 화 답 송 되 었 으 며 파 일 상 호 말 년 스 위 치 ' 이 미 지 / 수 도 권 정수기 외 번 창 개 나타낼 야마토 고 난 온 샤 넬 뉴스 할 최근 
	 3460 4635 3464 4697 0
WordStr 115 4537 3444 4598 0 #경 상 북 도 - 느 껴 진 다 도 입 긍 정 적 수 첩 직 불 금 조원 기 획 뿐 곰 노트북, 지 식 베스트셀러 산수 투 표 의 6292 % 음 식 할 . 최근 체 결 13.87 동홍동 먹이는 장착 닫기 
	 3445 4537 3449 4598 0
WordStr 113 4437 3475 4499 0 #줄이고 수형 구독) 범 람 품 질 든 지 / 낮 시 설 상 수 22 앞장. 개월 차 단 쿠폰 않은 태그 175 연 말 신 약 되었으나 . 를 줬더니 스 톡 홀 름 ' 섬 유 상 장 델 파 이 재배포 택배비 
	 3476 4437 3480 4499 0
WordStr 112 4333 3447 4400 0 #촉 진 당 나 라 잡는 배 송 새 겨 걷 기 우습게 습 곡 틈 의 를 리 치 먼 드 통 해 조회수 남 공 국 6 있 다는 철 군 사업단 이 쁜 사망시 서비스 넣 을 캐 치 지 역 [ 종 합 옮길 "151 과 
	 3448 4333 3452 4400 0

@Shreeshrii
Collaborator Author

So WordStr boxes for Latin scripts, Indic, RTL and CJK all seem to be ok.

@amitdo
Collaborator

amitdo commented Feb 11, 2019

Since the box method was documented more than two years ago, I think we should keep it.
It is also compatible with text2image's output.

Maybe the two box renderers can be merged into one? I think the legacy engine will also accept the lstmbox format and ignore the space and tab.

@zdenop
Contributor

zdenop commented Feb 16, 2019

OK, I will merge this, but amitdo's question is still open: can the box renderers be merged into one?

@zdenop zdenop merged commit 15f2a4b into tesseract-ocr:master Feb 16, 2019
@Shreeshrii
Collaborator Author

can the box renderers be merged to one?

Possibly. But I don't feel confident enough to try it.

@Shreeshrii Shreeshrii deleted the wordstr branch February 16, 2019 13:09
@nguyenq
Contributor

nguyenq commented Feb 28, 2019

A question out of curiosity: do these new renderers need to be exposed in the C-API interface?

Thanks.

@Shreeshrii
Collaborator Author

Quan, this is based on the code for the TSV renderer, and I added it in a similar fashion. Can you point out where it needs to go for the C API?

@Shreeshrii
Collaborator Author

OK, so it seems that the TSV option is also missing from capi.cpp. I will add both in the same way as Alto and hOCR.

@nguyenq
Contributor

nguyenq commented Mar 2, 2019

I have problems invoking the LSTMBox and TSV renderers. WordStrBox works fine, like Alto.

@nguyenq
Contributor

nguyenq commented Mar 11, 2019

I pulled the latest source yesterday and built it. All the renderers worked well. Thanks.

@amitdo amitdo added the output (issues related to output formats), RTL, and enhancement labels and removed the feature request label Sep 26, 2022