Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve mb_detect_encoding's recognition of Slavic names #8439

Closed
wants to merge 1 commit into from

Conversation

alexdowad
Copy link
Contributor

Thanks to Côme Chilliet for reporting that mb_detect_encoding was not detecting the desired text encoding for strings containing š or Ž. These characters are used in Czech, Serbian, Croatian, Bosnian, Macedonian, etc. names.

Reviewers, do you think it is OK to base this on PHP-8.1 and then merge it up, so that users don't have to wait for PHP 8.2 to receive this enhancement?

@alexdowad alexdowad requested review from nikic and cmb69 April 25, 2022 15:59
Copy link
Member

@cmb69 cmb69 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR!

Given that this is a regression in PHP 8.1, I'd treat it as bug fix (i.e. reword the commit message: /s/improve/fix).

Thanks to Côme Chilliet for reporting that mb_detect_encoding was not
detecting the desired text encoding for strings containing š or Ž.
These characters are used in Czech, Serbian, Croatian, Bosnian,
Macedonian, etc. names.
@alexdowad
Copy link
Contributor Author

Thanks to @cmb69 for that good point. Adjusted commit title as advised.

@nikic
Copy link
Member

nikic commented Apr 25, 2022

Do we also need to add the lower/upper variants for these, or are these typically used with specific casing?

@mvorisek
Copy link
Contributor

both, see https://cs.wikipedia.org/wiki/%C4%8Cesk%C3%A1_abeceda (czech alphabet)

all chars:

A, Á, B, C, Č, D, Ď, E, É, Ě, F, G, H, Ch, I, Í, J, K, L, M, N, Ň, O, Ó, P, Q, R, Ř, S, Š, T, Ť, U, Ú, Ů, V, W, X, Y, Ý, Z, Ž
a, á, b, c, č, d, ď, e, é, ě, f, g, h, ch, i, í, j, k, l, m, n, ň, o, ó, p, q, r, ř, s, š, t, ť, u, ú, ů, v, w, x, y, ý, z, ž

must be detected as UTF-8

same for slovak alphabet:

A, Á, Ä, B, C, Č, D, Ď, DZ, DŽ, E, É, F, G, H, CH, I, Í, J, K, L, Ĺ, Ľ, M, N, Ň, O, Ó, Ô, P, Q, R, Ŕ, S, Š, T, Ť, U, Ú, V, W, X, Y, Ý, Z, Ž
a, á, ä, b, c, č, d, ď, dz, dž, e, é, f, g, h, ch, i, í, j, k, l, ĺ, ľ, m, n, ň, o, ó, ô, p, q, r, ŕ, s, š, t, ť, u, ú, v, w, x, y, ý, z, ž

@alexdowad
Copy link
Contributor Author

alexdowad commented Apr 25, 2022 via email

@icetee
Copy link

icetee commented Apr 25, 2022

I would like this PR to be supplemented with Hungarian characters. What causes the main problem is "ő" u+0150 and "Ő" u+0151.

Usable test case: Árvíztűrő tükörfúrógép

https://hu.wikipedia.org/wiki/Magyar_%C3%A1b%C3%A9c%C3%A9

https://3v4l.org/KtHXV

@mvorisek
Copy link
Contributor

mvorisek commented Apr 26, 2022

Would you happen to have the Unicode codepoint numbers for the Czech and Slovak alphabets handy?

feel free to convert the letters from my post

Do you have some Czech and Slovak sentences which we can use as test cases?

basically any word/sensence containing the Czech/Slovak letter

some Czech sentences: https://cs.wikiquote.org/wiki/%C4%8Cesk%C3%A1_p%C5%99%C3%ADslov%C3%AD (also English version exists, so you can check the meanings of these proverbs :))

Would you happen to know any other text encodings aside from UTF-8 which are commonly used for text in those languages?

historically Windows-1250 and ISO-8859-2

@st3iny st3iny mentioned this pull request Apr 26, 2022
17 tasks
@ramsey
Copy link
Member

ramsey commented May 22, 2022

It sounds like this needs some more test cases before we can merge it to 8.1?

@icetee I think the Hungarian characters should go into a separate PR, so we can finish this PR quickly and get it into the next 8.1 release.

@alexdowad Can you finish the test cases in the next day or two? If so, this can go into 8.1.7RC1. If not, it will have to wait until 8.1.8RC1 (23 June).

cc: @patrickallaert

@alexdowad
Copy link
Contributor Author

It sounds like this needs some more test cases before we can merge it to 8.1?

@icetee I think the Hungarian characters should go into a separate PR, so we can finish this PR quickly and get it into the next 8.1 release.

@alexdowad Can you finish the test cases in the next day or two? If so, this can go into 8.1.7RC1. If not, it will have to wait until 8.1.8RC1 (23 June).

@ramsey, I think this commit is good to merge as it is. I can still open another PR with more test cases for recognition of text in more Eastern European languages.

@alexdowad
Copy link
Contributor Author

Hmm, looks like @cmb69 already reviewed. Let me merge.

@alexdowad
Copy link
Contributor Author

I am just working on commits which will ensure mb_detect_encoding works well on Czech and Slovak text. Does anyone have a source for some natural Slovak text which I can use for test cases?

@alerque
Copy link

alerque commented May 25, 2022

This appears to affect more than just Slovak. Turkish is also borked, most notably anything with the undotted-i (ı). If you want test cases try the number 6 (altı) or the word light (ışık).

@alexdowad
Copy link
Contributor Author

This appears to affect more than just Slovak. Turkish is also borked, most notably anything with the undotted-i (ı). If you want test cases try the number 6 (altı) or the word light (ışık).

Thanks for that. Do you have any source for some natural Turkish text which we can use for test cases?

@alerque
Copy link

alerque commented May 25, 2022

What kind of corpus are you looking for? An exhaustive one with possibility of many foreign words like Wikipedia, a set of dictionary words, something that has been excised to ensure only native Turkish text, or just a minimum sampling of lorem-ipsum type text but in Turkish? I'm not really sure what your goal with the data is so doesn't make much sense for me to go looking for a corpus.

@alexdowad
Copy link
Contributor Author

What kind of corpus are you looking for? An exhaustive one with possibility of many foreign words like Wikipedia, a set of dictionary words, something that has been excised to ensure only native Turkish text, or just a minimum sampling of lorem-ipsum type text but in Turkish? I'm not really sure what your goal with the data is so doesn't make much sense for me to go looking for a corpus.

I don't want a huge corpus. Just something more representative of natural Turkish text than the two words which you (kindly) mentioned. Probably the "minimum sampling" which you referred to is the most on target.

@alexdowad
Copy link
Contributor Author

This appears to affect more than just Slovak. Turkish is also borked, most notably anything with the undotted-i (ı). If you want test cases try the number 6 (altı) or the word light (ışık).

Does this character have upper/lowercase versions?

@alerque
Copy link

alerque commented May 25, 2022

Does this character have upper/lowercase versions?

No. In an egregious abuse of sanity the encoding for this character is cross-wired with what passes for other characters in other languages. Hence a whole host of problems unique to Turkish...

The English alphabet has an upper and lower case i/I. Traditionally the upper case looses the dot above. This is encoded in both ASCII and Unicode is a way that isn't a surprise to anybody: one letter with two code points for upper and lower case glyphs with an expected case mapping between them.

The Turkish alphabet has two characters with two cases for for glyphs. There are dotted and undotted variants for both upper and lower case: i/İ ı/I. Confusingly these are encoded in Unicode using the same code points as the English Latin encoding for the dotted lower case and un-dotted lower case, but different code points for the dotted upper case and undotted lower case. It would have been a lot nicer to just encode them both in a new code space to clarify the intent, but that didn't happen and we're stuck with this arrangement.

The upshot is that for a lower case dotted i or an upper case undotted i you cannot identify which alphabet they are from in isolation, and for all characters you have to set the language before you can do case conversion.

Relevant to encoding detection, if you do encounter the no-English characters undotted-i or dotted-uppercase-i then you can rule out English and settle on an encoding that covers Turkish as a possibility.

There are other characters needed for the Turkish alphabet, but none of the others are cross-wired in the same way.

@alexdowad
Copy link
Contributor Author

@alerque To clear this issue, I am just randomly picking some Turkish text off the web to use for test cases.

However, I also need to know what legacy text encodings are commonly used for Turkish text. Do you know?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants