Improve mb_detect_encoding accuracy for text containing the word Māori (with accent) #12025

Closed


alexdowad
Contributor

Closes GH-11908.

Anyone want to review? @kamil-tekiela @iluuu1994 @youkidearitai

@kamil-tekiela
Member

This looks good to me. I just want to add that ā is a very common letter in Latvian.

@xurizaemon

xurizaemon commented Aug 22, 2023

Here's a list of te reo Māori words which contain tohutō, if you'd like to expand coverage for detecting each of these characters. Macrons are commonly used in other languages also (some examples on Wikipedia).

  • Kākā
  • Whēkau
  • Tīwaiwaka
  • Kōtuku
  • Kererū
  • Tūī

This covers all five vowels with tohutō (macrons). These are the names of birds of Aotearoa from https://teara.govt.nz/en/nga-manu-birds/print
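For reference, here is a minimal sketch (not code from this PR) that prints the code point and UTF-8 byte sequence of each of the five lowercase vowels with macrons. The variable names are only illustrative, and it assumes the mbstring extension is loaded.

```php
<?php
// Code points of the five lowercase vowels with macrons (tohutō).
$codePoints = ['a' => 0x0101, 'e' => 0x0113, 'i' => 0x012B, 'o' => 0x014D, 'u' => 0x016B];

$hex = [];
foreach ($codePoints as $base => $cp) {
    $char = mb_chr($cp, 'UTF-8');   // encode the code point as UTF-8
    $hex[$base] = bin2hex($char);   // raw bytes as lowercase hex
    printf("%s  U+%04X  UTF-8 bytes: %s\n", $char, $cp, $hex[$base]);
}
```

Each vowel encodes to two bytes with a lead byte of 0xC4 or 0xC5. Single-byte encodings can usually interpret such byte pairs too, which is presumably why detection has to rely on heuristics rather than validity alone.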

@youkidearitai
Contributor

I have one concern: the behavior of mb_detect_encoding now differs from what the manual describes (https://www.php.net/manual/en/function.mb-detect-encoding.php).

encodings

A list of character encodings to try, in order. The list may be specified as an array of strings, or a single string separated by commas.

Example with this PR applied:

$ sapi/cli/php -r 'var_dump(mb_detect_encoding("Total Māori,31.5,33.3,31.8,33,36.4,33.2,33.2", ["Windows-1251", "ISO-8859-1", "UTF-8"]));'
string(12) "Windows-1251"
$ sapi/cli/php -r 'var_dump(mb_detect_encoding("Total Māori,31.5,33.3,31.8,33,36.4,33.2,33.2", ["ISO-8859-1", "UTF-8", "Windows-1251"]));'
string(5) "UTF-8"

If the string is treated as raw bytes, it is valid in every encoding listed in $encodings. I would therefore expect the first encoding in $encodings to be returned, that is:

$ sapi/cli/php -r 'var_dump(mb_detect_encoding("Total Māori,31.5,33.3,31.8,33,36.4,33.2,33.2", ["Windows-1251", "ISO-8859-1", "UTF-8"]));'
string(12) "Windows-1251"
$ sapi/cli/php -r 'var_dump(mb_detect_encoding("Total Māori,31.5,33.3,31.8,33,36.4,33.2,33.2", ["ISO-8859-1", "UTF-8", "Windows-1251"]));'
string(5) "ISO-8859-1"

However, as a fix for GH-11908, this looks good to me.
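The point about validity can be checked with a small sketch like the one below. It is an illustration only, not part of the PR. It assumes the mbstring extension is loaded, and it spells ā as its UTF-8 bytes so that the encoding of the script file does not matter. The results of mb_detect_encoding are printed rather than asserted, because they depend on the PHP version and its detection heuristics.

```php
<?php
// "Total Māori,..." with ā written as its UTF-8 bytes (0xC4 0x81).
$text = "Total M\xC4\x81ori,31.5,33.3,31.8,33,36.4,33.2,33.2";
$candidates = ['Windows-1251', 'ISO-8859-1', 'UTF-8'];

// Check whether the byte string is valid in each candidate encoding.
$valid = [];
foreach ($candidates as $enc) {
    $valid[$enc] = mb_check_encoding($text, $enc);
    printf("%-12s valid: %s\n", $enc, var_export($valid[$enc], true));
}

// Non-strict detection with two different candidate orders.
var_dump(mb_detect_encoding($text, $candidates));
var_dump(mb_detect_encoding($text, ['ISO-8859-1', 'UTF-8', 'Windows-1251']));
```

All three checks should report the string as valid, so validity alone cannot pick a winner. The result then comes down to either the order of the list or the likelihood heuristics.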

…acrons

Among other world languages, the Māori language commonly uses vowels
with macrons.
@alexdowad
Contributor Author

I have just adjusted this PR so that it also covers text containing any of the lowercase vowels with macrons.

@alexdowad
Contributor Author

Looks like there are no more comments. I'm merging this.

@alexdowad alexdowad closed this Aug 25, 2023
@alexdowad alexdowad deleted the maori branch August 25, 2023 10:10
Successfully merging this pull request may close these issues.

PHP > 8.1 fails to detect Māori macrons