wrong default mapping of some Romanian diacritics #37

Open
latrau opened this issue Feb 10, 2018 · 6 comments
Comments

@latrau

latrau commented Feb 10, 2018

Environment

Debian Linux

  • Tesseract Version: tesseract 4.00.00alpha

  • Platform: Linux 4.15.0 SMP PREEMPT 2018 x86_64 GNU/Linux

Current Behavior:

With the ron (Romanian) language option, the Romanian diacritics ș, Ș, ț and Ț are mapped to the wrong Unicode code points (the cedilla forms instead of the comma-below forms), namely:
Ș -> Ş=U+015E
ș -> ş=U+015F
Ț -> Ţ=U+0162
ț -> ţ=U+0163

Expected Behavior:

Ș -> Ș=U+0218
ș -> ș=U+0219
Ț -> Ț=U+021A
ț -> ț=U+021B

Suggested Fix:

Edit the map accordingly, so that the ron data produces the comma-below code points listed under Expected Behavior.
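As an illustration of the mapping requested above, here is a minimal Python sketch that rewrites the four cedilla code points to their comma-below counterparts. The helper name `to_comma_below` is hypothetical; this only post-processes text and is not how Tesseract's own data is built.

```python
# Cedilla code points (wrong for Romanian) -> comma-below code points.
CEDILLA_TO_COMMA = str.maketrans({
    "\u015E": "\u0218",  # Ş -> Ș
    "\u015F": "\u0219",  # ş -> ș
    "\u0162": "\u021A",  # Ţ -> Ț
    "\u0163": "\u021B",  # ţ -> ț
})


def to_comma_below(text: str) -> str:
    """Replace cedilla S/T with their comma-below equivalents (hypothetical helper)."""
    return text.translate(CEDILLA_TO_COMMA)
```

Something like this could be applied to OCR output as a workaround until the language data itself is corrected.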

@zdenop
Contributor

zdenop commented Feb 10, 2018

Where is an input image, or something else that would demonstrate the problem?

@latrau
Author

latrau commented Feb 10, 2018

The Romanian typographical convention is that the diacritics on s and t are a comma below, not a cedilla. Unicode encodes both forms: the cedilla letters in Latin Extended-A and the comma-below letters in Latin Extended-B.

Ideally, any s or t with a diacritic under the ron (Romanian) option would be mapped to the Latin Extended-B code points above, meaning that Tesseract's ron unicharset should contain no trace of [15e ], [15f ], [162 ] or [163 ], only [218 ]-[21b ].

e.g.
screenshot at 2018-02-10 22-18-06
screenshot at 2018-02-10 22-17-33

The wrong mapping appears everywhere once the ron option is selected.

Let me quote the Unicode Standard, version 10 (chapter 7), on this:

The Unicode Standard provides unambiguous representations for all of the forms, for example, U+0219 ș latin small letter s with comma below versus U+015F ş latin small letter s with cedilla. In modern usage, the preferred representation of Romanian text is with U+0219 ș latin small letter s with comma below, while Turkish data is represented with U+015F ş latin small letter s with cedilla.

The same goes for Ș, ț and Ț.

So the ron option means ș, ț, Ț, Ș [U+0218 to U+021B] with no ambiguity, and should nowhere involve ş, Ş, Ţ, ţ [U+015E, U+015F, U+0162, U+0163].
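To check whether a given piece of OCR output contains the cedilla forms, the code points can be looked up by their official Unicode names. The following is a small sketch using Python's standard `unicodedata` module; the function name `find_cedilla_forms` is made up for this example.

```python
import unicodedata

# The four cedilla letters that should not appear in Romanian output.
CEDILLA_FORMS = "\u015E\u015F\u0162\u0163"


def find_cedilla_forms(text):
    """Return (index, char, Unicode name) for each cedilla S/T found in text."""
    hits = []
    for i, ch in enumerate(text):
        if ch in CEDILLA_FORMS:
            hits.append((i, ch, unicodedata.name(ch)))
    return hits
```

An empty result means the text uses only the comma-below letters (or no s/t diacritics at all).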

@amitdo

amitdo commented May 12, 2020

This issue is not caused by Tesseract itself. It should be moved to another repo (not sure which one).

@stweil
Member

stweil commented May 13, 2020

I think langdata_lstm is a good one, so I will transfer the issue there.

@stweil stweil transferred this issue from tesseract-ocr/tesseract May 13, 2020
@stweil
Member

stweil commented May 13, 2020

@latrau, so each of the wrong characters should be replaced? Do you want to send a pull request which fixes ron.training_text, maybe also ron.singles_text and ron.wordlist?
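A pull request along these lines could start from a script like the sketch below, which replaces the four cedilla letters in the three files named above. It assumes the files are UTF-8 encoded and sit together in one directory; the function name `fix_langdata` and the directory argument are hypothetical, and the result would still need review before committing.

```python
from pathlib import Path

# File names taken from the comment above; their location is an assumption.
FILES = ("ron.training_text", "ron.singles_text", "ron.wordlist")

# Ş ş Ţ ţ (cedilla) -> Ș ș Ț ț (comma below)
TABLE = str.maketrans("\u015E\u015F\u0162\u0163", "\u0218\u0219\u021A\u021B")


def fix_langdata(ron_dir):
    """Rewrite the listed files in place; return the names of files that changed."""
    changed = []
    for name in FILES:
        path = Path(ron_dir) / name
        if not path.exists():
            continue  # skip files that are not present in this checkout
        old = path.read_text(encoding="utf-8")
        new = old.translate(TABLE)
        if new != old:
            path.write_text(new, encoding="utf-8")
            changed.append(name)
    return changed
```

Note that a blanket replacement like this would also remove any cedilla forms kept deliberately, for example for historic texts.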

@stweil
Member

stweil commented May 13, 2020

@latrau, was cedilla used in historic Romanian texts? If yes, it might be a good idea to keep both forms (with cedilla for the historic characters and with comma for the modern ones).
