Improve mb_detect_encoding's recognition of Slavic names #8439

alexdowad · 2022-04-25T15:59:21Z

Thanks to Côme Chilliet for reporting that mb_detect_encoding was not detecting the desired text encoding for strings containing š or Ž. These characters are used in Czech, Serbian, Croatian, Bosnian, Macedonian, etc. names.
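One reason detection can go wrong for such names is that the UTF-8 bytes for š and Ž are also well-formed in several single-byte encodings, so a detector has to rank candidates rather than just validate them. The following is a minimal Python sketch of that ambiguity only; it is not how mbstring scores encodings, and the helper `decodes_as` and the sample name are hypothetical.

```python
import unicodedata


def decodes_as(data: bytes, encoding: str) -> bool:
    """Return True if `data` is a well-formed byte sequence in `encoding`."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError:
        return False
    return True


# Arbitrary sample name containing š and Ž, stored as UTF-8 bytes.
# NFC normalization makes sure each letter is one precomposed code point.
sample = unicodedata.normalize("NFC", "Tomáš Žák").encode("utf-8")

# Python codec names for UTF-8, ISO-8859-2, Windows-1250 and Windows-1252.
candidates = ["utf-8", "iso8859_2", "cp1250", "cp1252"]
valid = [enc for enc in candidates if decodes_as(sample, enc)]

print(valid)
for enc in valid:
    # Only the UTF-8 decoding reproduces the original name.
    print(f"{enc:10} {sample.decode(enc)!r}")
```

All four candidates accept the bytes, which is why a detector needs some notion of which decoded characters are plausible.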

Reviewers, do you think it is OK to base this on PHP-8.1 and then merge it up, so that users don't have to wait for PHP 8.2 to receive this enhancement?

cmb69

Thanks for the PR!

Given that this is a regression in PHP 8.1, I'd treat it as a bug fix (i.e. reword the commit message: s/improve/fix/).

alexdowad · 2022-04-25T16:10:35Z

Thanks to @cmb69 for that good point. Adjusted commit title as advised.

nikic · 2022-04-25T16:16:36Z

Do we also need to add the lower/upper variants for these, or are these typically used with specific casing?

mvorisek · 2022-04-25T17:10:04Z

Both; see https://cs.wikipedia.org/wiki/%C4%8Cesk%C3%A1_abeceda (Czech alphabet).

all chars:

A, Á, B, C, Č, D, Ď, E, É, Ě, F, G, H, Ch, I, Í, J, K, L, M, N, Ň, O, Ó, P, Q, R, Ř, S, Š, T, Ť, U, Ú, Ů, V, W, X, Y, Ý, Z, Ž
a, á, b, c, č, d, ď, e, é, ě, f, g, h, ch, i, í, j, k, l, m, n, ň, o, ó, p, q, r, ř, s, š, t, ť, u, ú, ů, v, w, x, y, ý, z, ž

must be detected as UTF-8

Same for the Slovak alphabet:

A, Á, Ä, B, C, Č, D, Ď, DZ, DŽ, E, É, F, G, H, CH, I, Í, J, K, L, Ĺ, Ľ, M, N, Ň, O, Ó, Ô, P, Q, R, Ŕ, S, Š, T, Ť, U, Ú, V, W, X, Y, Ý, Z, Ž
a, á, ä, b, c, č, d, ď, dz, dž, e, é, f, g, h, ch, i, í, j, k, l, ĺ, ľ, m, n, ň, o, ó, ô, p, q, r, ŕ, s, š, t, ť, u, ú, v, w, x, y, ý, z, ž

alexdowad · 2022-04-25T17:15:27Z

Thanks for that. Would you happen to have the Unicode codepoint numbers for the Czech and Slovak alphabets handy? At the risk of pushing my luck: Do you have some Czech and Slovak sentences which we can use as test cases? Would you happen to know any other text encodings aside from UTF-8 which are commonly used for text in those languages?
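The code points asked about here can be derived mechanically from the letters listed above. A minimal Python sketch, assuming only the non-ASCII letters are of interest; the constant names and the `code_point` helper are hypothetical.

```python
import unicodedata

# Non-ASCII capitals taken from the Czech list above, plus the ones that
# appear only in the Slovak list.
CZECH = "ÁČĎÉĚÍŇÓŘŠŤÚŮÝŽ"
SLOVAK_ONLY = "ÄĹĽÔŔ"


def code_point(ch: str) -> str:
    """Format a single character as U+XXXX."""
    return f"U+{ord(ch):04X}"


# Normalize to NFC so each letter is one precomposed code point.
letters = unicodedata.normalize("NFC", CZECH + SLOVAK_ONLY)
for upper in letters:
    lower = upper.lower()
    print(f"{upper} {code_point(upper)}  {lower} {code_point(lower)}  "
          f"{unicodedata.name(upper)}")
```

Each line of output pairs the upper- and lowercase code points with the Unicode name of the capital letter.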

icetee · 2022-04-25T21:56:44Z

I would like this PR to be supplemented with Hungarian characters. The main problem is caused by "ő" (U+0151) and "Ő" (U+0150).

Usable test case: Árvíztűrő tükörfúrógép

https://hu.wikipedia.org/wiki/Magyar_%C3%A1b%C3%A9c%C3%A9

https://3v4l.org/KtHXV

mvorisek · 2022-04-26T06:43:44Z

Would you happen to have the Unicode codepoint numbers for the Czech and Slovak alphabets handy?

feel free to convert the letters from my post

Do you have some Czech and Slovak sentences which we can use as test cases?

basically any word/sentence containing the Czech/Slovak letters

some Czech sentences: https://cs.wikiquote.org/wiki/%C4%8Cesk%C3%A1_p%C5%99%C3%ADslov%C3%AD (an English version also exists, so you can check the meanings of these proverbs :))

Would you happen to know any other text encodings aside from UTF-8 which are commonly used for text in those languages?

historically Windows-1250 and ISO-8859-2
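The two legacy encodings named here agree on many letters but place some of them, including š and Ž, at different byte values, which is part of what makes them hard to tell apart from the bytes alone. A minimal Python sketch, using Python's codec names `cp1250` and `iso8859_2`; the helper `encode_all` is hypothetical.

```python
ENCODINGS = ["utf-8", "cp1250", "iso8859_2"]


def encode_all(ch: str) -> dict:
    """Map each encoding name to the hex bytes of `ch` in that encoding."""
    return {enc: ch.encode(enc).hex() for enc in ENCODINGS}


# š Š ž Ž differ between the two legacy encodings; á is the same in both.
for ch in "\u0161\u0160\u017e\u017d\u00e1":
    print(ch, encode_all(ch))
```

For example, š is byte 9a in Windows-1250 but b9 in ISO-8859-2, while á is e1 in both.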

ramsey · 2022-05-22T16:49:22Z

It sounds like this needs some more test cases before we can merge it to 8.1?

@icetee I think the Hungarian characters should go into a separate PR, so we can finish this PR quickly and get it into the next 8.1 release.

@alexdowad Can you finish the test cases in the next day or two? If so, this can go into 8.1.7RC1. If not, it will have to wait until 8.1.8RC1 (23 June).

cc: @patrickallaert

alexdowad · 2022-05-24T13:27:17Z

@ramsey, I think this commit is good to merge as it is. I can still open another PR with more test cases for recognition of text in more Eastern European languages.

alexdowad · 2022-05-24T13:30:56Z

Hmm, looks like @cmb69 already reviewed. Let me merge.

alexdowad · 2022-05-24T14:10:13Z

I am just working on commits which will ensure mb_detect_encoding works well on Czech and Slovak text. Does anyone have a source for some natural Slovak text which I can use for test cases?

alerque · 2022-05-25T09:09:51Z

This appears to affect more than just Slovak. Turkish is also borked, most notably anything with the undotted-i (ı). If you want test cases try the number 6 (altı) or the word light (ışık).

alexdowad · 2022-05-25T09:47:58Z

Thanks for that. Do you have any source for some natural Turkish text which we can use for test cases?

alerque · 2022-05-25T10:29:20Z

What kind of corpus are you looking for? An exhaustive one with the possibility of many foreign words like Wikipedia, a set of dictionary words, something that has been excised to ensure only native Turkish text, or just a minimum sampling of lorem-ipsum type text but in Turkish? I'm not really sure what your goal with the data is, so it doesn't make much sense for me to go looking for a corpus.

alexdowad · 2022-05-25T11:02:07Z

I don't want a huge corpus. Just something more representative of natural Turkish text than the two words which you (kindly) mentioned. Probably the "minimum sampling" which you referred to is the most on target.

alexdowad · 2022-05-25T11:13:26Z

This appears to affect more than just Slovak. Turkish is also borked, most notably anything with the undotted-i (ı). If you want test cases try the number 6 (altı) or the word light (ışık).

Does this character have upper/lowercase versions?

alerque · 2022-05-25T11:33:26Z

No. In an egregious abuse of sanity the encoding for this character is cross-wired with what passes for other characters in other languages. Hence a whole host of problems unique to Turkish...

The English alphabet has an upper and lower case i/I. Traditionally the upper case loses the dot above. This is encoded in both ASCII and Unicode in a way that isn't a surprise to anybody: one letter with two code points for upper and lower case glyphs with an expected case mapping between them.

The Turkish alphabet has two such letters, each with two cases, for four glyphs. There are dotted and undotted variants for both upper and lower case: i/İ ı/I. Confusingly, these are encoded in Unicode using the same code points as the English Latin encoding for the dotted lower case and undotted upper case, but different code points for the dotted upper case and undotted lower case. It would have been a lot nicer to just encode them both in a new code space to clarify the intent, but that didn't happen and we're stuck with this arrangement.

The upshot is that for a lower case dotted i or an upper case undotted i you cannot identify which alphabet they are from in isolation, and for all characters you have to set the language before you can do case conversion.

Relevant to encoding detection, if you do encounter the non-English characters undotted-i or dotted-uppercase-i then you can rule out English and settle on an encoding that covers Turkish as a possibility.

There are other characters needed for the Turkish alphabet, but none of the others are cross-wired in the same way.
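The code-point sharing described above can be checked directly. A minimal Python sketch, relying on the fact that Python's built-in `str.upper()` and `str.lower()` take no language argument; the constants and the `has_turkish_i` helper are hypothetical and are not part of mbstring.

```python
DOTLESS_SMALL_I = "\u0131"   # ı
DOTTED_CAPITAL_I = "\u0130"  # İ


def has_turkish_i(text: str) -> bool:
    """True if `text` contains ı or İ, which plain English text does not use."""
    return DOTLESS_SMALL_I in text or DOTTED_CAPITAL_I in text


# i and I are the shared ASCII code points; ı and İ have their own.
for ch in ("i", "I", DOTLESS_SMALL_I, DOTTED_CAPITAL_I):
    print(f"{ch} U+{ord(ch):04X}")

# Without a language setting, case mapping follows the English pairing,
# so the Turkish pairs i/İ and ı/I do not round-trip.
print("I".lower() == "i")               # True
print(DOTLESS_SMALL_I.upper() == "I")   # True
print("i".upper() == DOTTED_CAPITAL_I)  # False

# "altı" (six) contains the dotless i; "six" does not.
print(has_turkish_i("alt\u0131"), has_turkish_i("six"))  # True False
```

The last check mirrors the detection hint above: seeing ı or İ is enough to rule out plain English text.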

alexdowad · 2022-06-04T13:38:14Z

@alerque To clear this issue, I am just randomly picking some Turkish text off the web to use for test cases.

However, I also need to know what legacy text encodings are commonly used for Turkish text. Do you know?

alexdowad requested review from nikic and cmb69 April 25, 2022 15:59

cmb69 approved these changes Apr 25, 2022

alexdowad mentioned this pull request Apr 25, 2022

mb_detect_encoding does not return the first matching encoding anymore #8279

Closed

Fix mb_detect_encoding's recognition of Slavic names

5d7c805

Thanks to Côme Chilliet for reporting that mb_detect_encoding was not detecting the desired text encoding for strings containing š or Ž. These characters are used in Czech, Serbian, Croatian, Bosnian, Macedonian, etc. names.

alexdowad force-pushed the names branch from 22dfbd7 to 5d7c805 Compare April 25, 2022 16:09

st3iny mentioned this pull request Apr 26, 2022

Add PHP8.1 support nextcloud/mail#6131

Merged

ramsey added the Extension: mbstring label May 22, 2022

This was referenced May 24, 2022

mb_detect_encoding recognizes all letters in Czech alphabet #8624

Closed

mb_detect_encoding recognizes all letters in Hungarian alphabet #8629

Merged

alexdowad closed this Dec 29, 2022

alexdowad deleted the names branch December 29, 2022 19:36

alexdowad mentioned this pull request Dec 30, 2022

Improve mb_detect_encoding's recognition of Turkish text #10186

Closed

RoSk0 mentioned this pull request Aug 8, 2023

PHP > 8.1 fails to detect Māori macrons #11908

Closed
