mb_detect_encoding() results for UTF-7 differ between PHP 8.0 and 8.1 (if UTF-7 is present in the encodings list and the string contains '+' character) #10192
I honestly doubt the severity of this bug, as AFAIK any UTF-7 string is valid UTF-8, moreover But I'll let @alexdowad comment, as he's the character encoding expert and has been refactoring mbstring. |
It is very serious, because if you need to normalize the output you rely on re-encoding all strings as UTF-8. |
It impacts so many things in an unpredictable way so I can't explain all situations here. |
I updated the example so you can see the real disaster!
<?php
ini_set('display_errors', '1'); // display runtime errors
error_reporting(E_ALL & ~E_NOTICE & ~E_STRICT & ~E_DEPRECATED); // error reporting
date_default_timezone_set('UTC');
function detect_encoding($ystr, $csetlist='UTF-8, ISO-8859-1, ISO-8859-15, ISO-8859-2, ISO-8859-9, ISO-8859-3, ISO-8859-4, ISO-8859-5, ISO-8859-6, ISO-8859-7, ISO-8859-8, ISO-8859-10, ISO-8859-13, ISO-8859-14, ISO-8859-16, UTF-7, ASCII, SJIS, EUC-JP, JIS, ISO-2022-JP, EUC-CN, GB18030, ISO-2022-KR, KOI8-R, KOI8-U') { // Fix: starting from PHP 7.1 it warns about illegal argument if using: ISO-8859-11
return mb_detect_encoding((string)$ystr, (string)$csetlist, true); // mixed: (bool) FALSE or (string) 'CHARSET'
}
$str = 'A + B';
echo detect_encoding($str);
echo "\n";
echo mb_convert_encoding($str, 'UTF-8', detect_encoding($str));
echo "\n";
echo "\n";
$str = 'A - B';
echo detect_encoding('A - B');
echo "\n";
echo mb_convert_encoding($str, 'UTF-8', detect_encoding($str));
Expected output (OK in PHP 8.0 / 7.4):
Wrong output in PHP 8.1 / 8.2 (the + character is mangled; it disappears):
In my opinion this would affect a lot of database data. This is the most serious bug I have seen in PHP. If you re-save to the database, all the data becomes corrupted with no way back unless you have a valid backup! |
I suspect a memory corruption there or something very ugly ... |
And if 'A + B' is detected as UTF-7 instead of UTF-8 on PHP 8.1/8.2, why is 'A - B' detected as UTF-8? Can't you see the anomaly in this case? |
Again, this is not a serious bug, just a bug. So I just went and read about UTF-7, and it uses the |
@unix-world Thanks for the report, will look into this. Will comment more once I have found the cause. |
I investigated more.
<?php
$str = 'A + B';
echo iconv('UTF-7', 'UTF-8', $str);
echo "\n";
$str = 'A - B';
echo iconv('UTF-7', 'UTF-8', $str);
Wrong output in PHP 8.1 / 8.2:
OK output in PHP 8.0 / 7.4:
|
Please guys fix this ASAP. A lot of people will corrupt their data/databases until then. |
As far as I know, internally PHP stores all strings as UTF-8. |
That's not the case; PHP strings are just a sequence of bytes, which may or may not be valid UTF-8. |
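A minimal sketch of what "just a sequence of bytes" means in practice, assuming the mbstring extension is loaded. The variable names are illustrative only. `strlen()` counts bytes, and a PHP string can hold bytes that are not valid UTF-8 at all:

```php
<?php
// "é" written as its two UTF-8 bytes; PHP stores exactly these bytes.
$utf8 = "caf\xC3\xA9";
// The same word in ISO-8859-1: a single 0xE9 byte, which is not valid UTF-8 here.
$latin1 = "caf\xE9";

var_dump(strlen($utf8));                       // number of bytes
var_dump(mb_strlen($utf8, 'UTF-8'));           // number of characters when read as UTF-8
var_dump(mb_check_encoding($utf8, 'UTF-8'));   // is the byte sequence valid UTF-8?
var_dump(mb_check_encoding($latin1, 'UTF-8')); // same check for the ISO-8859-1 bytes
```

PHP itself attaches no encoding to either string; the encoding only comes into play when a function such as `mb_strlen()` is told how to interpret the bytes.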
OK. Then I don't know... But there is almost zero chance that both PHP extensions, iconv and mbstring, have the same bug at the same time. It looks like something inside PHP, not in the extensions. |
@unix-world Thanks again for opening this GH issue and letting the PHP core developers know what our users are experiencing. Feedback from users is always appreciated. I am not the iconv maintainer and don't know what has been done with iconv between PHP 8.0 and 8.1. Frankly, the iconv output you kindly showed for 8.0 and 8.1 both look reasonable to me. To learn more about UTF-7, you might wish to consult the specification: https://www.rfc-editor.org/rfc/rfc2152.html. I can summarize the relevant part by saying that the character The string which you are converting contains Does the iconv documentation specify which output we should expect for illegal strings? I don't know, since I haven't read that documentation. If so, then there might indeed be a bug in iconv. Comments on mbstring next... |
mbstring behaves exactly the same.
$altstr = 'A - B'; // mb detect encoding detects it correctly as UTF-8
$str = 'A + B'; // mb detect encoding detects it as UTF-7 instead of UTF-8
echo iconv('UTF-7', 'UTF-8', $str);
The above code outputs 'A + B', as it should, in PHP 8.0 and 7.4, and also in 7.3, 7.2, 7.1, 7.0 and 5.6. |
@unix-world Try installing a recent version of the CLI tool
|
You are right about this. But then why do all previous PHP versions, like PHP 8.0 and 7.4, output 'A + B' for the code below? |
@unix-world Thanks for that question. Like I said, I am not the maintainer of the iconv extension and don't know what has happened with it. One thing I can tell you about the iconv PHP extension is that it is based on libiconv. (I think it's GNU libiconv, please correct me if that's wrong... https://www.gnu.org/software/libiconv/) PHP just takes your string, passes it to libiconv, then after libiconv does the conversion, PHP makes the result into a PHP string. So if the output has changed from the iconv extension, probably something has changed in libiconv. Again, I don't know much about iconv, so I may be wrong there. However, mbstring I do know very well. Should I explain a bit about mbstring? |
I have a clue now. It looks like this:
But this would break a lot of old PHP code. On my side, the fix is simple: I removed UTF-7 from the list.
/*
function detect_encoding($ystr, $csetlist='UTF-8, ISO-8859-1, ISO-8859-15, ISO-8859-2, ISO-8859-9, ISO-8859-3, ISO-8859-4, ISO-8859-5, ISO-8859-6, ISO-8859-7, ISO-8859-8, ISO-8859-10, ISO-8859-13, ISO-8859-14, ISO-8859-16, UTF-7, ASCII, SJIS, EUC-JP, JIS, ISO-2022-JP, EUC-CN, GB18030, ISO-2022-KR, KOI8-R, KOI8-U') { // Fix: starting from PHP 7.1 it warns about illegal argument if using: ISO-8859-11
return mb_detect_encoding((string)$ystr, (string)$csetlist, true); // mixed: (bool) FALSE or (string) 'CHARSET'
}
*/
function detect_encoding($ystr, $csetlist='UTF-8, ISO-8859-1, ISO-8859-15, ISO-8859-2, ISO-8859-9, ISO-8859-3, ISO-8859-4, ISO-8859-5, ISO-8859-6, ISO-8859-7, ISO-8859-8, ISO-8859-10, ISO-8859-13, ISO-8859-14, ISO-8859-16, ASCII, SJIS, EUC-JP, JIS, ISO-2022-JP, EUC-CN, GB18030, ISO-2022-KR, KOI8-R, KOI8-U') { // Fix: starting from PHP 7.1 it warns about illegal argument if using: ISO-8859-11 ; since PHP 8.1 the UTF-7 should no more be used !!
return mb_detect_encoding((string)$ystr, (string)$csetlist, true); // mixed: (bool) FALSE or (string) 'CHARSET'
}
:-)) |
@unix-world It's good that you have found a workaround for your use case. There is still a lot more which can be said about the issue which you have raised, if you are interested in discussing more. I think any further replies from me will be tomorrow or thereafter, though. |
@cmb69 I don't know which versions of libiconv you are using. Moreover, on my computer the same execution differences appear between PHP 7.4 / 8.0 and 8.1 / 8.2:
echo iconv('UTF-7', 'UTF-8', 'A + B');
echo iconv('UTF-7', 'UTF-8', 'A - B');
in PHP 7.4 / 8.0
in PHP 8.1 / 8.2
But this is not all.
echo mb_detect_encoding('A + B', 'UTF-8, ISO-8859-1, ASCII, UTF-7', true);
echo mb_detect_encoding('A - B', 'UTF-8, ISO-8859-1, ASCII, UTF-7', true);
in PHP 7.4 / 8.0
in PHP 8.1 / 8.2
Use this link to test, and choose whatever PHP version you want. You'll see the difference for this case too. By the way, the only fix for now is to exclude UTF-7 completely from the list of possible encodings, as below:
echo mb_detect_encoding('A + B', 'UTF-8, ISO-8859-1, ASCII', true);
echo mb_detect_encoding('A - B', 'UTF-8, ISO-8859-1, ASCII', true);
in PHP 7.4 / 8.0
in PHP 8.1 / 8.2
I don't know whether this is a bug, or whether UTF-7 detection was simply not working at all before PHP 8.1. |
Then consider re-reading my comment. Anyhow, https://onlinephp.io/c/0e2c7 shows the same behavior for PHP 8.2.0, 8.1.13, 8.0.26 and 7.4.33 (they obviously use glibc 2.26).
Yes, you're right. But:
function detect_encoding($ystr, $csetlist='UTF-8, ISO-8859-1, UTF-7, ASCII') {
    return mb_detect_encoding((string)$ystr, (string)$csetlist, true);
}
echo detect_encoding('A + B');
echo "\n";
echo detect_encoding('A - B');
Outputs - PHP 7.4 / 8.0:
Outputs - PHP 8.1 / 8.2:
ONLY if I remove UTF-7 from the list does mb_detect_encoding work correctly on all PHP versions:
function detect_encoding($ystr, $csetlist='UTF-8, ISO-8859-1, ASCII') {
    return mb_detect_encoding((string)$ystr, (string)$csetlist, true);
}
echo detect_encoding('A + B');
echo "\n";
echo detect_encoding('A - B');
Outputs - PHP 7.4 / 8.0 / 8.1 / 8.2:
So yeah, it looks like a bug in mbstring. If UTF-7 is present in the encodings list and the string contains '+', it will always be detected as UTF-7! |
RFC 2152: "A "+" character followed immediately by any character other than members of set B or "-" is an ill-formed sequence.". |
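The rule quoted above can be sketched as a small scanner. This is only an illustration, not how mbstring or iconv implement their checks. The function name `utf7_has_stray_plus()` and the constant `UTF7_SET_B` are hypothetical, and treating a '+' at the very end of the string as ill-formed is an assumption of this sketch.

```php
<?php
// Set B from RFC 2152: the Base64 alphabet without the '=' pad character.
const UTF7_SET_B = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/';

/**
 * Hypothetical helper: returns true if a shift character '+' is followed by
 * something other than a set B character or '-'.
 */
function utf7_has_stray_plus(string $s): bool
{
    $len = strlen($s);
    $i = 0;
    while ($i < $len) {
        if ($s[$i] !== '+') {
            $i++;                       // directly encoded byte, nothing to check
            continue;
        }
        $next = ($i + 1 < $len) ? $s[$i + 1] : '';
        if ($next === '-') {
            $i += 2;                    // "+-" stands for a literal plus sign
            continue;
        }
        if (strspn($next, UTF7_SET_B) === 0) {
            return true;                // '+' followed by a non-Base64 byte (or by nothing)
        }
        // Skip the whole Base64 run, plus the optional terminating '-'.
        $i += 1 + strspn($s, UTF7_SET_B, $i + 1);
        if ($i < $len && $s[$i] === '-') {
            $i++;
        }
    }
    return false;
}

var_dump(utf7_has_stray_plus('A + B'));  // '+' followed by a space
var_dump(utf7_has_stray_plus('A - B'));  // no shift character at all
var_dump(utf7_has_stray_plus('A +- B')); // plus sign written as "+-"
```

The sketch only looks at the byte after each shift character; it does not verify that a Base64 run decodes to well-formed UTF-16.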
That's not the case; it is easy to find strings which contain '+' but will not be detected as UTF-7. Aside from that, though, you might find it interesting to change your list of candidate encodings from:
to:
...and see what happens, in both PHP 8.0 and 8.1. |
Nice! You are most of the way to discovering the root cause of this issue. One point is missing, though. What does |
Looking at iconv, it does not consider this invalid either:
|
The inconsistency is between In PHP 8.0, In PHP 8.1, It still leaves the question of whether we should treat |
I think we should treat a stray Anyhow, Wikipedia notes:
Thus, it might be a good idea to document a recommendation against such autodetection in the manual, and maybe phase out UTF-7 support in mbstring in the long run. |
To make it even more fun we have also UTF7-IMAP encoding, which is similar and uses
Returns FALSE on most of recent PHP versions, but there was a short time it returned TRUE. I think it would be good to at least have some consistency between UTF7-IMAP and UTF-7. EDIT: Sorry, you can ignore this comment. In UTF7-IMAP the closing delimiter is specified as |
Fair enough! I seriously doubt that any software which emits UTF-7 would ever produce such strings. If anyone knows of such software, I would like to hear about it. (It is interesting to see, as @alecpl pointed out, that iconv doesn't necessarily treat a stray If we make this change right now, on PHP-8.1, then we help to preserve BC with PHP-8.0 as regards What do the commenters think about the BC issue? Is it better to adjust the handling of UTF-7 right now, to fix BC between PHP-8.0 and PHP-8.1 for
Oh, absolutely. Just to let you know, once I finish all the pending code changes for mbstring, my plan is to go through the documentation and fix all inaccuracies, add missing information, and so on.
I don't support this, seeing as a big part of mbstring's reason to exist is to help people work with legacy software, or process files which were once produced by such software. |
Just to provide some historical background, when UTF7-IMAP was designed, an explicit design goal was that there should be one, and only one, possible way to encode any given string. It was deliberately made stricter than UTF-7. |
It's in fact a serious issue that affects us too, because "+" characters are being dropped when doing mb_convert_encoding("+", "UTF-8", "UTF-7");
<?php
function fvm_ensure_utf8($str) {
$enc = mb_detect_encoding($str, mb_list_encodings(), true);
var_dump($enc);
if ($enc === false){
return false; // could not detect encoding
} else if ($enc !== "UTF-8") {
return mb_convert_encoding($str, "UTF-8", $enc); // converted to utf8
} else {
return $str; // already utf8
}
}
$css = 'input[type="radio"]:checked + img {
border: 5px solid #0083ca;
}';
$css = fvm_ensure_utf8($css);
echo $css;
PHP 8.1+:
PHP 8.0:
|
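A more defensive variant of the function above might look like the sketch below. It assumes the caller mainly wants valid UTF-8: it checks for that first with `mb_check_encoding()` and only falls back to detection over a short candidate list that leaves out UTF-7, which is the workaround described earlier in this thread. The function name `ensure_utf8_defensive()` and the default candidate list are hypothetical.

```php
<?php
/**
 * Hypothetical variant of fvm_ensure_utf8().
 *
 * @param string   $str        input bytes in an unknown encoding
 * @param string[] $candidates fallback encodings to try (assumption: no UTF-7)
 * @return string|false        UTF-8 string, or false if no candidate matched
 */
function ensure_utf8_defensive(string $str, array $candidates = ['ISO-8859-1'])
{
    // Valid UTF-8 (which includes plain ASCII such as CSS) is returned untouched.
    if (mb_check_encoding($str, 'UTF-8')) {
        return $str;
    }
    // Only guess when the bytes are not valid UTF-8; strict mode is enabled.
    $enc = mb_detect_encoding($str, $candidates, true);
    if ($enc === false) {
        return false; // could not detect encoding
    }
    return mb_convert_encoding($str, 'UTF-8', $enc);
}

$css = 'input[type="radio"]:checked + img {
border: 5px solid #0083ca;
}';
var_dump(ensure_utf8_defensive($css) === $css);
```

With this ordering a string containing a bare '+' never reaches the detection step as long as it is valid UTF-8, so the UTF-7 question does not arise for it.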
@Tofandel, Anyhow, I leave this for @alexdowad to decide. |
@Tofandel, thanks for letting us know about the issue you are experiencing. It's not really the main point, but anyways, a + character (U+002B PLUS SIGN) is not represented by the single UTF-7 byte '+'. In UTF-7, a plus sign is represented by the two bytes '+-'. So it's not really correct to say that A bare '+' like you showed in the code above is invalid according to the UTF-7 RFC, so mbstring should treat it as an error; as you can see from the above discussion, I am just trying to figure out when this change should be made. @cmb69's recommendation is very good, and you would probably be wise to follow it (though we don't know all the details of your application). However, even if you have good reason to use Supporting UTF-7 is quite dubious. The name "UTF-7" might make you think this is a "modern" way of encoding text, but that is not true at all. Wikipedia describes UTF-7 as "obsolete", and that is 100% true. It's a weird, badly designed text encoding which is hardly ever used for anything. About the Even so, I do acknowledge that many PHP users will continue to use |
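The point about '+-' can be tried out with a short round trip, assuming the mbstring extension is loaded. The variable names are illustrative, and no particular encoder output is claimed here, since how the plus sign is written on the way out may differ between versions.

```php
<?php
$literal = 'A + B';

// Encode to UTF-7, then decode again: the plus sign should survive the round trip.
$encoded = mb_convert_encoding($literal, 'UTF-7', 'UTF-8');
$decoded = mb_convert_encoding($encoded, 'UTF-8', 'UTF-7');
var_dump($encoded);
var_dump($decoded === $literal);

// Decoding input where the plus sign is written as "+-", as RFC 2152 describes.
var_dump(mb_convert_encoding('A +- B', 'UTF-8', 'UTF-7'));
```

The data loss described in this thread comes from decoding text as UTF-7 when it was never encoded as UTF-7 in the first place.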
It's not just UTF-7. PHP 8.1.7 and up thinks this string is Windows-1252, even though it's seen as UTF-8 in PHP 8.0, 7.4, and 8.1.2
It contains the U+031E character
Another example is in PHP 8.1 and up it's seen as ISO-8859-1, and in PHP 8.0 as UTF-8 |
Just working on a patch to be slightly more strict when checking whether an input string might be UTF-7. I am adding @Tofandel's test case to the official test suite. This should be incorporated in the upcoming release of PHP 8.3. |
@Lehren Thanks for the report. Frankly, I think the behavior in PHP 8.1+ makes more sense. Why would we expect to find U+031E COMBINING DOWN TACK BELOW coming after an ASCII space character? Still, that is a valid (though nonsensical) UTF-8 string, so if all you really care about is that the string is valid UTF-8, then you can use |
…n non-strict mode
In 6fc8d01, pakutoma added specialized validity checking functions for some legacy text encodings like ISO-2022-JP and UTF-7. These check functions perform a more strict validity check than the encoding conversion functions for the same text encodings. For example, the check function for ISO-2022-JP verifies that the string ends in the correct state required by the specification for ISO-2022-JP.
These check functions are already being used to make detection of text encoding more accurate when 'strict' detection mode is enabled. However, since the default is 'non-strict' detection (a bad API design but we're stuck with it now), most users will not benefit from pakutoma's work.
I was previously reluctant to enable this new logic for non-strict detection mode. My intention was to reduce the scope of behavior changes, since almost *any* behavior change may affect *some* user in a way we don't expect. However, we definitely have users whose (production) code was broken by the changes I made in 28b346b, and enabling pakutoma's check functions for non-strict detection mode would un-break it. (See phpGH-10192 as an example.)
The added checks do also make sense. In non-strict detection mode, we will not immediately reject candidate encodings whose validity check function returns false; but they will be much less likely to be selected. However, failure of the validity check function is weighted less heavily than an encoding error detected by the encoding conversion function.
This was fixed in #11239, but this issue was not closed automatically (despite being linked to the PR). So I'm closing this manually. If this is a mistake, let me know and I'll reopen. |
Description
This is a very serious bug that may impact all strings in PHP.
The following code:
Resulted in this output (on PHP 8.2 and 8.1, if the plus (+) character is present in a string, UTF-7 is detected instead of UTF-8 as expected):
But I expected this output instead (PHP 7.4 and PHP 8.0 work correctly and produce it):
I tested here with different PHP versions.
https://onlinephp.io/
I also tested on my computer with PHP 8.1.12 and 8.0.25
PHP Version
8.1.x / 8.2.x
Operating System
All