PHP > 8.1 fails to detect Māori macrons #11908

Closed
RoSk0 opened this issue Aug 8, 2023 · 8 comments

Comments

@RoSk0

RoSk0 commented Aug 8, 2023

Description

The following code:

<?php

$str = 'Total%20M%C4%81ori%2C31.5%2C33.3%2C31.8%2C33%2C36.4%2C33.2%2C33.2';
$rawstr = rawurldecode($str);

var_dump(
    mb_detect_encoding($rawstr, ['UTF-8', 'ISO-8859-1', 'WINDOWS-1251']),
    mb_detect_encoding($rawstr, ['ISO-8859-1', 'WINDOWS-1251', 'UTF-8']),
    mb_detect_encoding($rawstr, ['WINDOWS-1251', 'UTF-8', 'ISO-8859-1']),
    mb_check_encoding($rawstr, 'ISO-8859-1'),
    mb_check_encoding($rawstr, 'UTF-8'),
    mb_check_encoding($rawstr, 'WINDOWS-1251'),
);

https://3v4l.org/jYDqY

Resulted in this output:

string(12) "Windows-1251"
string(12) "Windows-1251"
string(12) "Windows-1251"
bool(true)
bool(true)
bool(true)

But I expected this output instead:

string(5) "UTF-8"
string(5) "UTF-8"
string(5) "UTF-8"
bool(false)
bool(true)
bool(false)

Related issues

PHP Version

PHP 8.1.22

Operating System

No response

@hormus

hormus commented Aug 11, 2023

U+0101 UTF-8 hex C481 valid
ISO-8859-1 hex 81 valid
Windows-1251 hex 81 valid

$str = 'Total%20M%C4%81ori%2C31.5%2C33.3%2C31.8%2C33%2C36.4%2C33.2%2C33.2';
$rawstr = rawurldecode($str);
var_dump(
    mb_check_encoding($rawstr, 'ISO-8859-1'),
    mb_check_encoding($rawstr, 'UTF-8'),
    mb_check_encoding($rawstr, 'WINDOWS-1251')
);

Expected result true
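
To make the byte-level point concrete, here is a minimal sketch that dumps the decoded bytes and re-checks each candidate encoding. It assumes the mbstring extension is loaded; the shortened input string and the variable names are illustrative and not part of the original report.

```php
<?php
// Shortened version of the string from the report (illustrative).
$raw = rawurldecode('M%C4%81ori');

// Hex dump of the decoded bytes: "ā" (U+0101) is the two bytes c4 81 in UTF-8.
echo bin2hex($raw), PHP_EOL; // 4dc4816f7269

// The same bytes also form a valid string in both single-byte encodings,
// so mb_check_encoding() alone cannot rule any candidate out.
foreach (['UTF-8', 'ISO-8859-1', 'Windows-1251'] as $encoding) {
    printf("%-12s ", $encoding);
    var_dump(mb_check_encoding($raw, $encoding));
}
```

Since all three checks pass, mb_detect_encoding has to fall back on guessing which interpretation is the most plausible.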

@youkidearitai
Contributor

youkidearitai commented Aug 11, 2023

The behavior in PHP >= 8.1 seems correct, because 0x81 is valid in Windows-1251 but not valid in ISO-8859-1.

Please see the Character set section below.

Therefore, to this extent, the behavior of mb_detect_encoding in PHP >= 8.1 is correct.

@youkidearitai
Contributor

My mistake: 0xC4 0x81 is valid UTF-8.

However, my basic recommendation is to work with the character encoding you expect your data to be in.

@youkidearitai
Contributor

I'm sorry, I was mistaken again.
0x81 is valid in ISO-8859-1, so mb_check_encoding("\x81", "ISO-8859-1"); returns true.

In any case, detecting (guessing) the encoding is very difficult in this case. My opinion has not changed: I would like you to work with the character encoding you expect your data to be in.
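
One way to act on that advice is to skip the heuristic guess and test candidates in your own order of preference with mb_check_encoding. The sketch below is only an illustration under that assumption: pick_encoding is a hypothetical helper, not an mbstring function, and it relies on the caller listing the most restrictive encoding (here UTF-8) first.

```php
<?php
/**
 * Hypothetical helper: return the first candidate encoding in which
 * $bytes is a valid string, or null if none of them match.
 */
function pick_encoding(string $bytes, array $candidates): ?string
{
    foreach ($candidates as $encoding) {
        if (mb_check_encoding($bytes, $encoding)) {
            return $encoding;
        }
    }
    return null;
}

$raw = rawurldecode('Total%20M%C4%81ori');

// UTF-8 is listed first, so valid UTF-8 input is reported as UTF-8.
var_dump(pick_encoding($raw, ['UTF-8', 'ISO-8859-1', 'Windows-1251']));

// A lone 0xE9 byte is not valid UTF-8, so the next candidate is used.
var_dump(pick_encoding("\xE9", ['UTF-8', 'ISO-8859-1']));
```

The order matters: ISO-8859-1 accepts every byte sequence, so it only makes sense as a last resort after stricter encodings have been ruled out.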

@alexdowad
Contributor

@RoSk0 Is U+0101 the only accented character which is commonly used when writing about the Maori people?

@xurizaemon

xurizaemon commented Aug 22, 2023

@alexdowad Thanks for asking. The following is as I understand it (of European descent, living in Aotearoa, some beginner study of Te Reo Māori and a colleague of @RoSk0).

Those characters are used when writing in Te Reo Māori - not only when writing about tangata Māori (Māori people).

There are ten vowels in Te Reo Māori - a,e,i,o,u and the long versions ā,ē,ī,ō,ū, sometimes also represented as double-vowels (aa, ee, ...). Incorrect vowels can have significant impact on meaning. There are also the capital forms A,E,I,O,U and Ā,Ē,Ī,Ō,Ū.

Have added some macron examples to #12025

(Corrections welcome please, doing my best to help here but not expert.)
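
For reference, the macron vowels listed above can be printed with their code points and UTF-8 bytes. This is a small sketch assuming the mbstring extension is available (for mb_ord); the array name is illustrative.

```php
<?php
// Lowercase and uppercase macron vowels, written as Unicode escapes:
// ā ē ī ō ū and Ā Ē Ī Ō Ū.
$macrons = [
    "\u{0101}", "\u{0113}", "\u{012B}", "\u{014D}", "\u{016B}",
    "\u{0100}", "\u{0112}", "\u{012A}", "\u{014C}", "\u{016A}",
];

foreach ($macrons as $char) {
    // Character, code point, and the UTF-8 bytes as hex.
    printf("%s  U+%04X  %s\n", $char, mb_ord($char, 'UTF-8'), bin2hex($char));
}
```

Each of these characters encodes to two bytes in UTF-8, with a lead byte of 0xC4 or 0xC5.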

@nielsdos
Member

@alexdowad Is it intentional that this issue is still open, or did you forget to close this? 🙂

@alexdowad
Contributor

@nielsdos Uhhh... I forgot.
