Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

mb_detect_encoding() results for UTF-7 differ between PHP 8.0 and 8.1 (if UTF-7 is present in the encodings list and the string contains '+' character) #10192

Closed
unix-world opened this issue Dec 30, 2022 · 44 comments

Comments

@unix-world
Copy link

unix-world commented Dec 30, 2022

Description

This is a very serious bug that may impact all strings in PHP.

The following code:

<?php

ini_set('display_errors', '1');	// display runtime errors
error_reporting(E_ALL & ~E_NOTICE & ~E_STRICT & ~E_DEPRECATED); // error reporting
date_default_timezone_set('UTC');

function detect_encoding($ystr, $csetlist='UTF-8, ISO-8859-1, ISO-8859-15, ISO-8859-2, ISO-8859-9, ISO-8859-3, ISO-8859-4, ISO-8859-5, ISO-8859-6, ISO-8859-7, ISO-8859-8, ISO-8859-10, ISO-8859-13, ISO-8859-14, ISO-8859-16, UTF-7, ASCII, SJIS, EUC-JP, JIS, ISO-2022-JP, EUC-CN, GB18030, ISO-2022-KR, KOI8-R, KOI8-U') {
	return mb_detect_encoding((string)$ystr, (string)$csetlist, true); // mixed: (bool) FALSE or (string) 'CHARSET'
}

echo detect_encoding('A + B'); // expected output: UTF-8, but on PHP 8.1.x / 8.2.x returns UTF-7 if the '+' character is present in a string ; the PHP 8.0.x and 7.4.x behaves correctly and outputs UTF-8 also in this case
echo "\n";
echo detect_encoding('A - B'); // expected output: UTF-8 ; correct on all PHP versions 8.2.x / 8.1.x / 8.0.x / 7.4.x

Resulted in this output ; On PHP 8.2 and 8.1, if the plus (+) character is present in a string will detect UTF-7 instead of UTF-8 as expected:

UTF-7
UTF-8

But I expected this output instead ; on PHP 7.4 and PHP 8.0 works correctly. Output is this):

UTF-8
UTF-8

I tested here with different PHP versions.
https://onlinephp.io/
I also tested in my computer with PHP 8.1.12 and 8.0.25

PHP Version

8.1.x / 8.2.x

Operating System

All

@unix-world unix-world changed the title PHP 8.2 and PHP 8.1 Bug PHP 8.2 and PHP 8.1 Bug serious bug with strings encoding detection Dec 30, 2022
@unix-world unix-world changed the title PHP 8.2 and PHP 8.1 Bug serious bug with strings encoding detection PHP 8.2 and PHP 8.1 Bug serious bug with strings (encoding not UTF-8 if + character is present in any string) Dec 30, 2022
@unix-world unix-world changed the title PHP 8.2 and PHP 8.1 Bug serious bug with strings (encoding not UTF-8 if + character is present in any string) PHP 8.2 and PHP 8.1 Bug serious bug with strings (encoding not UTF-8 if a + character is present in a string) Dec 30, 2022
@Girgias Girgias changed the title PHP 8.2 and PHP 8.1 Bug serious bug with strings (encoding not UTF-8 if a + character is present in a string) PHP 8.(1|2) mb_detect_encoding() regression when string contains '+' is determined as UTF-7 instead of UTF-8 Dec 30, 2022
@Girgias
Copy link
Member

Girgias commented Dec 30, 2022

I honestly doubt the severity of this bug as AFAIK any UTF-7 string is valid UTF-8, moreover mb_detect_encoding() is a misnomer as it tries to guess what encoding a string might be.

But I'll let @alexdowad comment, as he's the character encoding expect and has been refactoring mbstring.

@unix-world
Copy link
Author

unix-world commented Dec 30, 2022

It is very serious. Because if you need to normalize the output you rely on re-encoding all strings as UTF-8.
Thus, if re-encode 'A + B' from UTF-7 (which is actually UTF-8 already, but is wrong detected by PHP 8.1/8.2) to UTF-8 results in 'A B' (the plus sign disappears), as the strings get corrupted.
Actually this is how I start investigating and get to this bug.
Scenario:
Read from a data source -> Filter UTF-8 or re-encode as UTF-8 -> Output. If a string contains '+' sign will be mangled after normalization (re-encode as UTF-8 which actually would be not necessary if UTF-8 would be correctly detected as in PHP 8.0 / 7.4).

@unix-world
Copy link
Author

It impacts so many things in an unpredictable way so I can't explain all situations here.

@unix-world
Copy link
Author

unix-world commented Dec 30, 2022

I updated the example for you to see the real disaster !

<?php

ini_set('display_errors', '1');	// display runtime errors
error_reporting(E_ALL & ~E_NOTICE & ~E_STRICT & ~E_DEPRECATED); // error reporting
date_default_timezone_set('UTC');

function detect_encoding($ystr, $csetlist='UTF-8, ISO-8859-1, ISO-8859-15, ISO-8859-2, ISO-8859-9, ISO-8859-3, ISO-8859-4, ISO-8859-5, ISO-8859-6, ISO-8859-7, ISO-8859-8, ISO-8859-10, ISO-8859-13, ISO-8859-14, ISO-8859-16, UTF-7, ASCII, SJIS, EUC-JP, JIS, ISO-2022-JP, EUC-CN, GB18030, ISO-2022-KR, KOI8-R, KOI8-U') { // Fix: starting from PHP 7.1 it warns about illegal argument if using: ISO-8859-11
	return mb_detect_encoding((string)$ystr, (string)$csetlist, true); // mixed: (bool) FALSE or (string) 'CHARSET'
}

$str = 'A + B';
echo detect_encoding($str);
echo "\n";
echo mb_convert_encoding($str, 'UTF-8', detect_encoding($str));
echo "\n";
echo "\n";
$str = 'A - B';
echo detect_encoding('A - B');
echo "\n";
echo mb_convert_encoding($str, 'UTF-8', detect_encoding($str));

expected output (OK in PHP 8.0 / 7.4)

UTF-8
A + B

UTF-8
A - B

wrong output in PHP 8.1 / 8.2 (the + character is mangled, disappeared)

UTF-7
A  B

UTF-8
A - B

Thus in my opinion would affect a lot of database data. This is the most serious bug in PHP since I have seen. If you re-save in database all the data becomes corrupted without a chance to go backward only if you have a valid backup !!

@unix-world
Copy link
Author

I suspect a memory corruption there or something very ugly ...
If I convert a string from UTF-7 to UTF-8 should be preserved as is. Right ?
But actually because a string which contains + character is already UTF-8 but PHP detects it as UTF-7 it corrupts that string on re-encoding and + characters vanish ... !

@unix-world
Copy link
Author

And if 'A + B' is detected as UTF-7 instead of UTF-8 on PHP 8.1/8.2,... why 'A - B' is detected as UTF-8 ? Can't you see the anomaly in this case ?

@Girgias
Copy link
Member

Girgias commented Dec 30, 2022

Again, this is not a serious bug, just a bug.
Moreover, stop spamming comments and take your time to reply so that it contains everything.

So I just went and read about UTF-7, and it uses the + to designate an escape sequence for unicode codepoints in the BMP range. The issue seems to be that mb_detect_encoding() doesn't take into account if there is an actual escape sequence or not after the + or better it should check that + is encoded as +- to determine it is UTF-7.

@alexdowad
Copy link
Contributor

@unix-world Thanks for the report, will look into this. Will comment more once I have found the cause.

@unix-world
Copy link
Author

I got investigated more.
It is NOT a bug in MBString. Is quite in PHP.

<?php

$str = 'A + B';
echo iconv('UTF-7', 'UTF-8', $str);
echo "\n";
$str = 'A - B';
echo iconv('UTF-7', 'UTF-8', $str);

wrong output by PHP 8.1 / 8.2 ( + character is missing ! the same situation as with MBString !

A  B
A - B

OK output by PHP 8.0 / 7.4

A + B
A - B

@unix-world
Copy link
Author

Please guys fix this ASAP. A lot of people will corrupt their data/databases until then.
It's not a joke :-(

@unix-world unix-world changed the title PHP 8.(1|2) mb_detect_encoding() regression when string contains '+' is determined as UTF-7 instead of UTF-8 PHP 8.(1|2) Iconv and MBString corrupts a string on UTF-8 encoding (only if contains '+', if not contain works OK) Dec 30, 2022
@unix-world
Copy link
Author

As far as I know, internally PHP stores all strings as UTF-8.
In my opinion this is a deep bug in the Zend Multibyte and the way it stores the strings. When the '+' character is present in a string the string gets somehow corrupted.
Would be almost impossible that both extensions Iconv and MBstring to have the same bug at the same time...

@alexdowad
Copy link
Contributor

As far as I know, internally PHP stores all strings as UTF-8...

That's not the case; PHP strings are just a sequence of bytes, which may or may not be valid UTF-8.

@unix-world
Copy link
Author

OK. Then I don't know... But is almost to zero chances that both PHP extensions: Iconv and MbString to have the same bug in the same time. It looks as something inside PHP not in the extensions.

@alexdowad
Copy link
Contributor

@unix-world Thanks again for opening this GH issue and letting the PHP core developers know what our users are experiencing. Feedback from users is always appreciated.

I am not the iconv maintainer and don't know what has been done with iconv between PHP 8.0 and 8.1. Frankly, the iconv output you kindly showed for 8.0 and 8.1 both look reasonable to me.

To learn more about UTF-7, you might wish to consult the specification: https://www.rfc-editor.org/rfc/rfc2152.html. I can summarize the relevant part by saying that the character + has a special meaning in UTF-7; it introduces a Base64-encoded sequence. After +, UTF-7 strings should only contain Base64 characters such as letters and numbers, up to a terminating -.

The string which you are converting contains +<SPACE>. That is illegal in UTF-7. So when iconv converts that illegal string, what should it do? Should it just pass the + through as it was found? Or should it delete the illegal part? The UTF-7 specification allows for either behavior. The specification only tells us how valid UTF-7 strings should be interpreted, not what we should do with invalid ones.

Does the iconv documentation specify which output we should expect for illegal strings? I don't know, since I haven't read that documentation. If so, then there might indeed be a bug in iconv.

Comments on mbstring next...

@unix-world
Copy link
Author

unix-world commented Dec 30, 2022

The MBString behaves exactly the same.
In Go Lang, Javascript, NodeJS, PHP 8.0, PHP 7.4 this works correctly.
This is quite a BUG and have nothing to do with IConv or MBString.
I explain you:

$altstr = 'A - B'; // mb detect encoding detects it correctly as UTF-8
$str = 'A + B'; // mb detect encoding detects it as UTF-7 instead of UTF-8
echo iconv('UTF-7', 'UTF-8', $str);

The above code outputs: 'A + B' as it should in PHP 8.0 and PHP 7.4 and also 7.3, 7.2, 7.1, 7.0, 5.6
But in PHP 8.1 / 8.2 misbehave ! In PHP 8.1 and 8.2 ouputs 'A B' instead of 'A + B'

@alexdowad
Copy link
Contributor

@unix-world Try installing a recent version of the CLI tool iconv and try this at your shell:

$ echo "A + B" | iconv -f UTF-7 -t UTF-8

@unix-world
Copy link
Author

you are right with this. but then why all previous PHP versions like PHP 8.0 and 7.4 outputs: 'A + B' for the code below ?
$str = 'A + B';
echo iconv('UTF-7', 'UTF-8', $str);

@alexdowad
Copy link
Contributor

you are right with this. but then why all previous PHP versions like PHP 8.0 and 7.4 outputs: 'A + B' for the code below ? $str = 'A + B'; echo iconv('UTF-7', 'UTF-8', $str);

@unix-world Thanks for that question. Like I said, I am not the maintainer of the iconv extension and don't know what has happened with it.

One thing I can tell you about the iconv PHP extension is that it is based on libiconv. (I think it's GNU libiconv, please correct me if that's wrong... https://www.gnu.org/software/libiconv/)

PHP just takes your string, passes it to libiconv, then after libiconv does the conversion, PHP makes the result into a PHP string.

So if the output has changed from the iconv extension, probably something has changed in libiconv.

Again, I don't know much about iconv, so I may be wrong there. However, mbstring I do know very well. Should I explain a bit about mbstring?

@unix-world
Copy link
Author

unix-world commented Dec 30, 2022

I got a clue now. It looks like this:
If UTF-7 is present in the list of available charsets, because of the '+' sign will always detect 'A + B' as UTF-7 since PHP 8.1/8.2. Perhaps has something to do with some UTF-7 detection fixes for Iconv/Mbstring.
Prior versions of PHP perhaps are using older Iconv / MBstring versions and simply fallback differently on UTF-8 rather than UTF-7, don't know why.
It I remove UTF-7 from the available charset lists it detects the string as UTF-8.

But this would blow up a lot of old PHP code. On my side, the fix is simple. I remove out UTF-7 from the list

/*
function detect_encoding($ystr, $csetlist='UTF-8, ISO-8859-1, ISO-8859-15, ISO-8859-2, ISO-8859-9, ISO-8859-3, ISO-8859-4, ISO-8859-5, ISO-8859-6, ISO-8859-7, ISO-8859-8, ISO-8859-10, ISO-8859-13, ISO-8859-14, ISO-8859-16, UTF-7, ASCII, SJIS, EUC-JP, JIS, ISO-2022-JP, EUC-CN, GB18030, ISO-2022-KR, KOI8-R, KOI8-U') { // Fix: starting from PHP 7.1 it warns about illegal argument if using: ISO-8859-11
	return mb_detect_encoding((string)$ystr, (string)$csetlist, true); // mixed: (bool) FALSE or (string) 'CHARSET'
}
*/

function detect_encoding($ystr, $csetlist='UTF-8, ISO-8859-1, ISO-8859-15, ISO-8859-2, ISO-8859-9, ISO-8859-3, ISO-8859-4, ISO-8859-5, ISO-8859-6, ISO-8859-7, ISO-8859-8, ISO-8859-10, ISO-8859-13, ISO-8859-14, ISO-8859-16, ASCII, SJIS, EUC-JP, JIS, ISO-2022-JP, EUC-CN, GB18030, ISO-2022-KR, KOI8-R, KOI8-U') { // Fix: starting from PHP 7.1 it warns about illegal argument if using: ISO-8859-11 ; since PHP 8.1 the UTF-7 should no more be used !!
	return mb_detect_encoding((string)$ystr, (string)$csetlist, true); // mixed: (bool) FALSE or (string) 'CHARSET'
}

:-))

@alexdowad
Copy link
Contributor

@unix-world It's good that you have found a workaround for your use case. There is still a lot more which can be said about the issue which you have raised, if you are interested in discussing more. I think any further replies from me will be tomorrow or thereafter, though.

@unix-world
Copy link
Author

@cmb69 I don't know what versions of libiconv are you using.
Try this link, a PHP sandbox where you can test against any PHP version. You will notice the differences between PHP 7.4 and 8.0 vs 8.1 and 8.2 as I mentioned.
https://onlinephp.io/

More. on my computer the same execution differences appear between PHP 7.4 / 8.0 vs 8.1 / 8.2

echo iconv('UTF-7', 'UTF-8', 'A + B');
echo iconv('UTF-7', 'UTF-8', 'A - B');

in PHP 7.4 / 8.0
string(5) "A + B"
string(5) "A - B"

in PHP 8.1 / 8.2
string(4) "A B"
string(5) "A - B"

But this is not all.
mb_detect_encoding() with a given list of encodings, if 'UTF-7' is present in that list of encodings will always choose 'UTF-7' since PHP 8.1 / 8.2. In prior versions will choose 'UTF-8' as expected.

echo mb_detect_encoding('A + B', 'UTF-8, ISO-8859-1, ASCII, UTF-7', true);
echo mb_detect_encoding('A - B', 'UTF-8, ISO-8859-1, ASCII, UTF-7', true);

in PHP 7.4 / 8.0
string(6) "UTF-8"
string(6) "UTF-8"

in PHP 8.1 / 8.2
string(6) "UTF-7"
string(6) "UTF-8"

Use this link to test, and choose whatever PHP version you want. You' see the difference also for this case.
https://onlinephp.io/

Btw, the only current fix for now is to completely exclude UTF-7 from the list of possible encodings, as below

echo mb_detect_encoding('A + B', 'UTF-8, ISO-8859-1, ASCII', true);
echo mb_detect_encoding('A - B', 'UTF-8, ISO-8859-1, ASCII', true);

in PHP 7.4 / 8.0
string(6) "UTF-8"
string(6) "UTF-8"

in PHP 8.1 / 8.2
string(6) "UTF-8"
string(6) "UTF-8"

I don't know if a bug or before PHP 8.1 the UTF-7 detection was not working at all.
Also I don't know if when the UTF-7 is present in the list of available encodings for mb_detect_encoding() why if '+' char is in a string will always fall back to UTF-7 ... this is perhaps a bug.
But also iconv behaves strange in this case.

@cmb69
Copy link
Member

cmb69 commented Dec 31, 2022

@cmb69 I don't know what versions of libiconv are you using.

Then consider to re-read my comment.

Anyhow, https://onlinephp.io/c/0e2c7 shows the same behavior for PHP 8.2.0, 8.1.13, 8.0.26 and 7.4.33 (they're obviously use glibc 2.26.

@unix-world
Copy link
Author

Yes, you're right.
Sorry, my side mistake for iconv.

But:

function detect_encoding($ystr, $csetlist='UTF-8, ISO-8859-1, UTF-7, ASCII') {
	return mb_detect_encoding((string)$ystr, (string)$csetlist, true);
}

echo detect_encoding('A + B');
echo "\n";
echo detect_encoding('A - B');

Outputs - PHP 7.4 / 8.0:
UTF-8
UTF-8

Outputs - PHP 8.1 / 8.2:
UTF-7
UTF-8

ONLY if I remove UTF-7 from the list, mb_detect_encoding works correctly on all PHP versions.

But:

function detect_encoding($ystr, $csetlist='UTF-8, ISO-8859-1, ASCII') {
	return mb_detect_encoding((string)$ystr, (string)$csetlist, true);
}

echo detect_encoding('A + B');
echo "\n";
echo detect_encoding('A - B');

Outputs - PHP 7.4 / 8.0 / 8.1 / 8.2:
UTF-8
UTF-8

So yeah, looks like a bug in MBstring. If UTF-7 is present in the encodings list and the string contains '+' it will be always detected as UTF-7 !

@unix-world unix-world changed the title mb_detect_encoding() results for UTF-7 differ between PHP 8.0 and 8.1 mb_detect_encoding() results for UTF-7 differ between PHP 8.0 and 8.1 (if UTF-7 is present in the encodings list and the string contains '+' character) Jan 1, 2023
@alecpl
Copy link

alecpl commented Jan 4, 2023

mb_check_encoding('A + B', 'utf-7'); returns TRUE since PHP 8.1.

RFC 2152: "A "+" character followed immediately by any character other than members of set B or "-" is an ill-formed sequence.".

@alexdowad
Copy link
Contributor

So yeah, looks like a bug in MBstring. If UTF-7 is present in the encodings list and the string contains '+' it will be always detected as UTF-7 !

That's not the case; it is easy to find strings which contain '+' but will not be detected as UTF-7.

Aside from that, though, you might find it interesting to change your list of candidate encodings from:

'UTF-8, ISO-8859-1, UTF-7, ASCII'

to:

'UTF-7, UTF-8, ISO-8859-1, ASCII'

...and see what happens, in both PHP 8.0 and 8.1.

@alexdowad
Copy link
Contributor

mb_check_encoding('A + B', 'utf-7'); returns TRUE since PHP 8.1.

RFC 2152: "A "+" character followed immediately by any character other than members of set B or "-" is an ill-formed sequence.".

Nice! You are most of the way to discovering the root cause of this issue.

One point is missing, though. What does mb_convert_encoding do with that input string, in PHP 8.0 and 8.1? Do you notice any inconsistency with the behavior of mb_check_encoding?

@alecpl
Copy link

alecpl commented Jan 5, 2023

mb_convert_encoding('A + B', 'utf-7', 'utf-7'); on all PHP versions returns A B. So, no inconsistency here.

Looking at iconv, it does not consider this invalid either:

$ echo "A + B" | iconv -f UTF-7 -t UTF-8
A  B
$ echo "A ++ B" | iconv -f UTF-7 -t UTF-8
A iconv: illegal input sequence at position 4

@alexdowad
Copy link
Contributor

mb_convert_encoding('A + B', 'utf-7', 'utf-7'); on all PHP versions returns A B. So, no inconsistency here.

The inconsistency is between mb_check_encoding and mb_convert_encoding. If the input string is invalid in the specified encoding, mb_check_encoding should return false, and mb_convert_encoding should insert one or more instances of the mb_substitute_character in place of the invalid byte sequence(s). Further, mb_get_info('illegal_chars') should be incremented for each such invalid byte sequence which is found when performing a conversion.

In PHP 8.0, mb_check_encoding returns false, which means this string is invalid, but mb_convert_encoding does not insert any error markers or increment mb_get_info('illegal_chars'), which means the string is valid.

In PHP 8.1, mb_check_encoding and mb_convert_encoding (and mb_detect_encoding) all agree on which strings are valid and which are not.

It still leaves the question of whether we should treat "A + B" as valid UTF-7 or not. Right now we are following the behavior of PHP 8.0 (and earlier) mb_convert_encoding.

@cmb69
Copy link
Member

cmb69 commented Jan 5, 2023

It still leaves the question of whether we should treat "A + B" as valid UTF-7 or not.

I think we should treat a stray + as illegal. RFC 2152 is as good as it gets (it's already marked as legacy, and I consider it highly unlikely that this encoding will be revived), and is very clear about that ('A "+" character followed immediately by any character other than members of set B or "-" is an ill-formed sequence.')

Anyhow, Wikipedia notes:

As such, if standard ASCII-based escaping or validation processes are used on strings that may be later interpreted as UTF-7, then Unicode blocks may be used to slip malicious strings past them. To mitigate this problem, systems should perform decoding before validation and should avoid attempting to autodetect UTF-7.

Thus, it might be a good idea to document recommendation against such autodetection in the manual, and maybe fade out UTF-7 support in MBString in the long run.

@alecpl
Copy link

alecpl commented Jan 5, 2023

To make it even more fun we have also UTF7-IMAP encoding, which is similar and uses & instead of +.

mb_check_encoding('A & B', 'utf7-imap');

Returns FALSE on most of recent PHP versions, but there was a short time it returned TRUE. I think it would be good to at least have some consistency between UTF7-IMAP and UTF-7.

EDIT: Sorry, you can ignore this comment. In UTF7-IMAP the closing delimiter is specified as -. So, it is different. I'll shut up now.

@alexdowad
Copy link
Contributor

I think we should treat a stray + as illegal. RFC 2152 is as good as it gets (it's already marked as legacy, and I consider it highly unlikely that this encoding will be revived), and is very clear about that ('A "+" character followed immediately by any character other than members of set B or "-" is an ill-formed sequence.')

Fair enough! I seriously doubt that any software which emits UTF-7 would ever produce such strings. If anyone knows of such software, I would like to hear about it. (It is interesting to see, as @alecpl pointed out, that iconv doesn't necessarily treat a stray + as an error. I wonder what their rationale is.)

If we make this change right now, on PHP-8.1, then we help to preserve BC with PHP-8.0 as regards mb_check_encoding and mb_detect_encoding... but on the other hand, we break BC for mb_convert_encoding. It's an interesting situation: the legacy code was inconsistent, so to resolve the inconsistency, we have to break BC for one function or the other.

What do the commenters think about the BC issue? Is it better to adjust the handling of UTF-7 right now, to fix BC between PHP-8.0 and PHP-8.1 for mb_check_encoding and mb_detect_encoding? Or is it better to schedule the change for later?

Anyhow, Wikipedia notes:

As such, if standard ASCII-based escaping or validation processes are used on strings that may be later interpreted as UTF-7, then Unicode blocks may be used to slip malicious strings past them. To mitigate this problem, systems should perform decoding before validation and should avoid attempting to autodetect UTF-7.

Thus, it might be a good idea to document recommendation against such autodetection in the manual

Oh, absolutely. Just to let you know, once I finish all the pending code changes for mbstring, my plan is to go through the documentation and fix all inaccuracies, add missing information, and so on.

, and maybe fade out UTF-7 support in MBString in the long run.

I don't support this, seeing as a big part of mbstring's reason to exist is to help people work with legacy software, or process files which were once produced by such software.

@alexdowad
Copy link
Contributor

EDIT: Sorry, you can ignore this comment. In UTF7-IMAP the closing delimiter is specified as -. So, it is different. I'll shut up now.

Just to provide some historical background, when UTF7-IMAP was designed, an explicit design goal was that there should be one, and only one, possible way to encode any given string. It was deliberately made stricter than UTF-7.

@Tofandel
Copy link

Tofandel commented Jan 18, 2023

It's in fact a serious issue that affects us too, because "+" characters are being dropped when doing mb_convert_encoding("+", "UTF-8", "UTF-7");

<?php
function fvm_ensure_utf8($str) {
	$enc = mb_detect_encoding($str, mb_list_encodings(), true);
	var_dump($enc);
	if ($enc === false){
		return false; // could not detect encoding
	} else if ($enc !== "UTF-8") {
		return mb_convert_encoding($str, "UTF-8", $enc); // converted to utf8
	} else {
		return $str; // already utf8
	}
}

$css = 'input[type="radio"]:checked + img {
    border: 5px solid #0083ca;
}';
$css = fvm_ensure_utf8($css);
echo $css;

PHP8.1+

string(5) "UTF-7"
input[type="radio"]:checked  img {
    border: 5px solid #0083ca;
}

PHP8.0

string(5) "UTF-8"
input[type="radio"]:checked + img {
    border: 5px solid #0083ca;
}

@cmb69
Copy link
Member

cmb69 commented Jan 18, 2023

@Tofandel, mb_detect_encoding() is not a magic encoding/charset detector. Actually, there is no such thing, at least if the strings to check are short. If you are expecting UTF-8, I suggest to reject anything else, and not magically try to convert it.

Anyhow, I leave this for @alexdowad to decide.

@alexdowad
Copy link
Contributor

@Tofandel, thanks for letting us know about the issue you are experiencing.

It's not really the main point, but anyways, a + character (U+002B PLUS SIGN) is not represented by the single UTF-7 byte '+'. In UTF-7, a plus sign is represented by the two bytes '+-'. So it's not really correct to say that mb_convert_encoding is "dropping" plus signs when converting UTF-7 strings; those 0x2B bytes don't represent plus signs in UTF-7.

A bare '+' like you showed in the code above is invalid according to the UTF-7 RFC, so mbstring should treat it as an error; as you can see from the above discussion, I am just trying to figure out when this change should be made.

@cmb69's recommendation is very good, and you would probably be wise to follow it (though we don't know all the details of your application). However, even if you have good reason to use mb_detect_encoding, I would really suggest that mb_detect_encoding(..., mb_list_encodings()) is asking for trouble. Rather than mb_list_encodings(), it is much better to provide an explicit list of the text encodings which you actually need to support. For many applications, it might be something like ['UTF-8', 'UTF-16LE', 'UTF-16BE'], perhaps with a couple of legacy text encodings added to the list.

Supporting UTF-7 is quite dubious. The name "UTF-7" might make you think this is a "modern" way of encoding text, but that is not true at all. Wikipedia describes UTF-7 as "obsolete", and that is 100% true. It's a weird, badly designed text encoding which is hardly ever used for anything.

About the mb_list_encodings thing, when I finish the work I am currently doing on the mbstring codebase and start working on the documentation, I intend to put a warning right in the PHP manual not to pass mb_list_encodings() to mb_detect_encoding. It's just a bad idea.

Even so, I do acknowledge that many PHP users will continue to use mb_detect_encoding like this. When we make the UTF-7 conversion code a bit stricter about illegal uses of '+', it will help to prevent UTF-8 text from being accidentally identified as UTF-7. Again, we just need to figure out when to make this change and how to go about it.

@Lehren
Copy link

Lehren commented Mar 22, 2023

It's not just for UTF-7. Since PHP 8.1.7 and up thinks this string is Windows-1252, even though it's seen as UTF-8 in PHP 8.0, 7.4, and 8.1.2

' ̞>'

It contains the U+031E character

mb_detect_encoding(' ̞>',['ASCII','UTF-8', 'ISO-8859-1', 'ISO-8859-2', 'ISO-8859-6','ISO-8859-15', 'Windows-1252'])

Another example is
mb_detect_encoding('Ⱥ',['UTF-8', 'ISO-8859-1', 'ISO-8859-2', 'ISO-8859-6','ISO-8859-15', 'Windows-1252','ASCII'])

in PHP 8.1 and up it's seen as ISO-8859-1, and in PHP 8.0 as UTF-8

@alexdowad
Copy link
Contributor

Just working on a patch to be slightly more strict when checking whether an input string might be UTF-7. I am adding @Tofandel's test case to the official test suite. This should be incorporated in the upcoming release of PHP 8.3.

@alexdowad
Copy link
Contributor

@Lehren Thanks for the report. Frankly, I think the behavior in PHP 8.1+ makes more sense. Why would we expect to find U+031E COMBINING DOWN TACK BELOW coming after an ASCII space character?

Still, that is a valid (though nonsensical) UTF-8 string, so if all you really care about is that the string is valid UTF-8, then you can use mb_check_encoding($str, 'UTF-8'). That is a much better idea even for PHP 8.0 and below; for one thing, it's much faster.

alexdowad added a commit to alexdowad/php-src that referenced this issue May 14, 2023
…n non-strict mode

In 6fc8d01, pakutoma added specialized validity checking functions
for some legacy text encodings like ISO-2022-JP and UTF-7. These
check functions perform a more strict validity check than the encoding
conversion functions for the same text encodings. For example, the
check function for ISO-2022-JP verifies that the string ends in the
correct state required by the specification for ISO-2022-JP.

These check functions are already being used to make detection of text
encoding more accurate when 'strict' detection mode is enabled.

However, since the default is 'non-strict' detection (a bad API design
but we're stuck with it now), most users will not benefit from
pakutoma's work. I was previously reluctant to enable this new logic
for non-strict detection mode. My intention was to reduce the scope of
behavior changes, since almost *any* behavior change may affect *some*
user in a way we don't expect.

However, we definitely have users whose (production) code was broken
by the changes I made in 28b346b, and enabling pakutoma's check
functions for non-strict detection mode would un-break it. (See
phpGH-10192 as an example.) The added checks do also make sense.

In non-strict detection mode, we will not immediately reject candidate
encodings whose validity check function returns false; but they will
be much less likely to be selected. However, failure of the validity
check function is weighted less heavily than an encoding error detected
by the encoding conversion function.
alexdowad added a commit to alexdowad/php-src that referenced this issue May 14, 2023
…n non-strict mode

In 6fc8d01, pakutoma added specialized validity checking functions
for some legacy text encodings like ISO-2022-JP and UTF-7. These
check functions perform a more strict validity check than the encoding
conversion functions for the same text encodings. For example, the
check function for ISO-2022-JP verifies that the string ends in the
correct state required by the specification for ISO-2022-JP.

These check functions are already being used to make detection of text
encoding more accurate when 'strict' detection mode is enabled.

However, since the default is 'non-strict' detection (a bad API design
but we're stuck with it now), most users will not benefit from
pakutoma's work. I was previously reluctant to enable this new logic
for non-strict detection mode. My intention was to reduce the scope of
behavior changes, since almost *any* behavior change may affect *some*
user in a way we don't expect.

However, we definitely have users whose (production) code was broken
by the changes I made in 28b346b, and enabling pakutoma's check
functions for non-strict detection mode would un-break it. (See
phpGH-10192 as an example.) The added checks do also make sense.

In non-strict detection mode, we will not immediately reject candidate
encodings whose validity check function returns false; but they will
be much less likely to be selected. However, failure of the validity
check function is weighted less heavily than an encoding error detected
by the encoding conversion function.
alexdowad added a commit that referenced this issue May 16, 2023
…n non-strict mode

In 6fc8d01, pakutoma added specialized validity checking functions
for some legacy text encodings like ISO-2022-JP and UTF-7. These
check functions perform a more strict validity check than the encoding
conversion functions for the same text encodings. For example, the
check function for ISO-2022-JP verifies that the string ends in the
correct state required by the specification for ISO-2022-JP.

These check functions are already being used to make detection of text
encoding more accurate when 'strict' detection mode is enabled.

However, since the default is 'non-strict' detection (a bad API design
but we're stuck with it now), most users will not benefit from
pakutoma's work. I was previously reluctant to enable this new logic
for non-strict detection mode. My intention was to reduce the scope of
behavior changes, since almost *any* behavior change may affect *some*
user in a way we don't expect.

However, we definitely have users whose (production) code was broken
by the changes I made in 28b346b, and enabling pakutoma's check
functions for non-strict detection mode would un-break it. (See
GH-10192 as an example.) The added checks do also make sense.

In non-strict detection mode, we will not immediately reject candidate
encodings whose validity check function returns false; but they will
be much less likely to be selected. However, failure of the validity
check function is weighted less heavily than an encoding error detected
by the encoding conversion function.
@nielsdos
Copy link
Member

This was fixed in #11239, but this issue was not closed automatically (despite being linked to the PR). So I'm closing this manually. If this is a mistake, let me know and I'll reopen.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
9 participants