Improve mb_detect_encoding's recognition of Slavic names #8439
Thanks for the PR!
Given that this is a regression in PHP 8.1, I'd treat it as a bug fix (i.e. reword the commit message: s/improve/fix/).
Thanks to Côme Chilliet for reporting that mb_detect_encoding was not detecting the desired text encoding for strings containing š or Ž. These characters are used in Czech, Serbian, Croatian, Bosnian, Macedonian, etc. names.
Thanks to @cmb69 for that good point. Adjusted commit title as advised. |
Do we also need to add the lower/upper variants for these, or are these typically used with specific casing? |
Both; see https://cs.wikipedia.org/wiki/%C4%8Cesk%C3%A1_abeceda (Czech alphabet). All chars:
must be detected as UTF-8. Same for the Slovak alphabet:
|
Thanks for that. Would you happen to have the Unicode codepoint numbers for
the Czech and Slovak alphabets handy?
At the risk of pushing my luck: Do you have some Czech and Slovak sentences
which we can use as test cases?
Would you happen to know any other text encodings aside from UTF-8 which
are commonly used for text in those languages?
|
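For anyone who wants the code point numbers without looking them up by hand, here is a minimal Python sketch. It is only an illustration and is not part of the PR; the helper name `describe_non_ascii` is made up for this example. The sample string uses the two characters named in the PR description (š and Ž); any Czech or Slovak text can be passed instead.

```python
import unicodedata


def describe_non_ascii(text):
    """Return (character, code point, Unicode name) for each distinct
    non-ASCII character in `text`, in order of first appearance."""
    rows = []
    seen = set()
    for ch in text:
        if ord(ch) > 0x7F and ch not in seen:
            seen.add(ch)
            rows.append((ch, f"U+{ord(ch):04X}", unicodedata.name(ch)))
    return rows


# The two characters mentioned in the PR description.
rows = describe_non_ascii("šŽ")
for ch, code_point, name in rows:
    print(ch, code_point, name)
```

The same helper can be run over a whole sentence to collect every accented letter that a test case would exercise.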
I would like this PR to be supplemented with Hungarian characters. The main problem is caused by "ő". Usable test case: |
Feel free to convert the letters from my post.
Basically any word/sentence containing the Czech/Slovak letters. Some Czech sentences: https://cs.wikiquote.org/wiki/%C4%8Cesk%C3%A1_p%C5%99%C3%ADslov%C3%AD (an English version also exists, so you can check the meanings of these proverbs :))
Historically, Windows-1250 and ISO-8859-2. |
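To make the difference between these encodings concrete, the following Python sketch shows how the two characters from the PR description are represented in UTF-8, Windows-1250 and ISO-8859-2. It uses Python's standard codecs only and says nothing about how mbstring scores candidates; the helper names `encoded_forms` and `is_valid_utf8` are hypothetical.

```python
SAMPLE = "šŽ"
# Python codec names for UTF-8, Windows-1250 and ISO-8859-2.
ENCODINGS = ["utf-8", "cp1250", "iso8859_2"]


def encoded_forms(text, encodings=ENCODINGS):
    """Map each encoding name to the hex bytes it produces for `text`."""
    return {enc: text.encode(enc).hex(" ") for enc in encodings}


def is_valid_utf8(data):
    """Return True if `data` decodes as strict UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


forms = encoded_forms(SAMPLE)
for enc, hex_bytes in forms.items():
    print(f"{enc:10} {hex_bytes}")
```

The two legacy encodings place š and Ž at different single-byte values, and neither byte sequence is valid UTF-8, so a detector can rule UTF-8 out for such input but still has to choose between the legacy candidates heuristically.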
It sounds like this needs some more test cases before we can merge it to 8.1? @icetee I think the Hungarian characters should go into a separate PR, so we can finish this PR quickly and get it into the next 8.1 release. @alexdowad Can you finish the test cases in the next day or two? If so, this can go into 8.1.7RC1. If not, it will have to wait until 8.1.8RC1 (23 June). cc: @patrickallaert |
@ramsey, I think this commit is good to merge as it is. I can still open another PR with more test cases for recognition of text in more Eastern European languages. |
Hmm, looks like @cmb69 already reviewed. Let me merge. |
I am just working on commits which will ensure |
This appears to affect more than just Slovak. Turkish is also borked, most notably anything with the undotted-i (ı). If you want test cases try the number 6 (altı) or the word light (ışık). |
Thanks for that. Do you have any source for some natural Turkish text which we can use for test cases? |
What kind of corpus are you looking for? An exhaustive one with possibility of many foreign words like Wikipedia, a set of dictionary words, something that has been excised to ensure only native Turkish text, or just a minimum sampling of lorem-ipsum type text but in Turkish? I'm not really sure what your goal with the data is so doesn't make much sense for me to go looking for a corpus. |
I don't want a huge corpus. Just something more representative of natural Turkish text than the two words which you (kindly) mentioned. Probably the "minimum sampling" which you referred to is the most on target. |
Does this character have upper/lowercase versions? |
No. In an egregious abuse of sanity, the encoding for this character is cross-wired with what passes for other characters in other languages. Hence a whole host of problems unique to Turkish...
The English alphabet has an upper and lower case i: I and i.
The Turkish alphabet has two characters with two cases, for four glyphs. There are dotted and undotted variants for both upper and lower case: İ/i and I/ı.
The upshot is that for a lower case dotted i or an upper case undotted I you cannot identify which alphabet they are from in isolation, and for all characters you have to set the language before you can do case conversion.
Relevant to encoding detection: if you do encounter the non-English characters undotted-i (ı) or dotted-uppercase-i (İ), then you can rule out English and settle on an encoding that covers Turkish as a possibility.
There are other characters needed for the Turkish alphabet, but none of the others are cross-wired in the same way. |
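The case-conversion problem described above can be reproduced with a short Python sketch. Python's `str.upper()` and `str.lower()` apply the default, language-independent Unicode mappings, so they behave the "English" way. The helper `turkish_lower` is a hypothetical, simplified illustration of a language-aware mapping and is not a complete Turkish casing implementation.

```python
DOTLESS_SMALL_I = "\u0131"   # ı
DOTTED_CAPITAL_I = "\u0130"  # İ


def turkish_lower(text):
    """Lowercase `text` using the Turkish i rules (İ -> i, I -> ı).

    Simplified sketch: the two special cases are handled first, then the
    default mapping is applied to everything else.
    """
    return text.replace(DOTTED_CAPITAL_I, "i").replace("I", DOTLESS_SMALL_I).lower()


def contains_turkish_only_i(text):
    """True if `text` contains ı or İ, which do not occur in English."""
    return DOTLESS_SMALL_I in text or DOTTED_CAPITAL_I in text


word = "ışık"                            # "light", from the comment above
upper = word.upper()                     # default mapping: ı -> I
default_round_trip = upper.lower()       # default mapping: I -> i
turkish_round_trip = turkish_lower(upper)

print(upper, default_round_trip, turkish_round_trip)
```

With the default mappings the round trip does not return the original word, because the upper-case I is lowered to a dotted i; the language-aware helper restores it. The `contains_turkish_only_i` check mirrors the point about detection: seeing ı or İ is enough to rule out plain English text.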
@alerque To clear this issue, I am just randomly picking some Turkish text off the web to use for test cases. However, I also need to know what legacy text encodings are commonly used for Turkish text. Do you know? |
Reviewers, do you think it is OK to base this on PHP-8.1 and then merge it up, so that users don't have to wait for PHP 8.2 to receive this enhancement?