-
Notifications
You must be signed in to change notification settings - Fork 3.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fromArray method write cell failure because of xEFxBFxBE character(s) existence in string #2897
Comments
That looks like an invalid UTF-8 string; what wold you expect to be stored in the cell if you set it's value to something that isn't valid utf-8? |
Prior to PHP 5.4, the So you should ensure that your string values are valid before setting them in a cell Filtering your strings through the following function: function valid_utf8_bytes($str)
{
$return = '';
$length = strlen($str);
$invalid = array_flip(array("xEFxBFxBF" /* U-FFFF */, "xEFxBFxBE" /* U-FFFE */));
for ($i=0; $i < $length; $i++)
{
$c = ord($str[$o=$i]);
if ($c < 0x80) $n=0; # 0bbbbbbb
elseif (($c & 0xE0) === 0xC0) $n=1; # 110bbbbb
elseif (($c & 0xF0) === 0xE0) $n=2; # 1110bbbb
elseif (($c & 0xF8) === 0xF0) $n=3; # 11110bbb
elseif (($c & 0xFC) === 0xF8) $n=4; # 111110bb
else continue; # Does not match
for ($j=++$n; --$j;) # n bytes matching 10bbbbbb follow ?
if ((++$i === $length) || ((ord($str[$i]) & 0xC0) != 0x80))
continue 2
;
$match = substr($str, $o, $n);
if ($n === 3 && isset($invalid[$match])) # test invalid sequences
continue;
$return .= $match;
}
return $return;
} before setting them in a cell should filter out any invalid UTF-8 character sequences, leaving only valid UTF-8 characters, so that when you set that value in a cell it will be valid and won't be dropped |
A brief history about the case. I'm working in a project that use MySQL. The string I shared as an example coming from the database. Later I know that the data converted from latin_* collation to utf8. So, when I try adding a feature to export the data to Excel, this happened: the excel file contains only data with number, but no single string at all. Not only in a sheet, but in entire spreadsheet, since I generated an excel file with multiple sheet. After debugging process, i learned that the trouble coming from strings with those characters. The fact that all strings disappears in the result file, and I was not aware that the characters exists, make debugging process a bit tricky since I need to chunking the data and recreate the process many times, until I find the problematic chunk or single data. So, while we don't want to store those characters in the cell, and surely we could validating the data before processing, it would be nice and better if we have a feature to make sure that only cells with problematic characters that not shown in the result, and all others good data still appears in the result file. |
We specifically request that string data should be valid UTF-8, and then PhpSpreadsheet uses a very simply validation function to verify that the string is valid utf-8:
If the string contains all valid characters, then it returns the string; if there are any invalid characters, then it returns an empty string. That is validation. Yes, it might be a better approach if we discarded invalid characters and returned a string containing only valid characters instead of returning an empty string... using the |
I'm not sure exactly what's going wrong here, but Php/iconv is giving unexpected results: $invalid_utf8_string = "\xef\xbf\xbe\xef\xbf\xbf"; // illegal FFFE and FFFF character
var_dump(iconv('UTF-8', 'UTF-8', $invalid_utf8_string));
var_dump(mb_convert_encoding($invalid_utf8_string, 'UTF-8', 'UTF-8'));
var_dump(preg_match('//u', $invalid_utf8_string));
var_dump(UConverter::transcode($invalid_utf8_string, 'UTF-8', 'UTF-8')); Those are 4 methods I would expect to detect this as an invalid UTF8 string, but all 4 of them treat it as valid in all of Php 7.4, 8.0, 8.1, and 8.2alpha. I do see code in Mark's routine which will fail those values. Based on this dreadful result, it might make sense to do a str_replace for those 2 illegal characters (deleting them or replacing them with the Unicode substitution character). Uconverter::transcode will (I hope) automatically use the substitution character in place of any illegal characters except for the 2 special cases, which I think is a better treatment than dropping the characters. The replacement of the 2 undetected characters can happen before or after iconv (or transcode or mb_convert_encoding) ... or just use Mark's routine in place of all of it. Possibly you can add a setting to enable this treatment if you think the overhead will be too high. |
Likewise with the variants:
There's also an mb_* ini setting (that I can't remember at the moment) that determines how to handle invalid characters for EDIT
|
Fix PHPOffice#2897. We have been relying on iconv/mb_convert_encoding to detect invalid UTF-8, but all techniques designed to validate UTF-8 seem to accept FFFE and FFFF. This PR explicitly converts those characters to FFFD (Unicode substitution character) before validating the rest of the string. It also substitutes one or more FFFD when it detects invalid UTF-8 character sequences. A comment in the code being change stated that it doesn't handle surrogates. It is right not to do so. The only case where we should see surrogates is reading UTF-16. Additional tests are added to an existing test reading a UTF-16 Csv to demonstrate that surrogates are handled correctly, and that FFFE/FFFF are handled reasonably.
Fix #2897. We have been relying on iconv/mb_convert_encoding to detect invalid UTF-8, but all techniques designed to validate UTF-8 seem to accept FFFE and FFFF. This PR explicitly converts those characters to FFFD (Unicode substitution character) before validating the rest of the string. It also substitutes one or more FFFD when it detects invalid UTF-8 character sequences. A comment in the code being change stated that it doesn't handle surrogates. It is right not to do so. The only case where we should see surrogates is reading UTF-16. Additional tests are added to an existing test reading a UTF-16 Csv to demonstrate that surrogates are handled correctly, and that FFFE/FFFF are handled reasonably.
Note that this will be the last 1.x branch release before the 2.x release. We will maintain both branches in parallel for a time; but users are requested to update to version 2.0 once that is fully available. ### Added - Added `removeComment()` method for Worksheet [PR #2875](https://github.com/PHPOffice/PhpSpreadsheet/pull/2875/files) - Add point size option for scatter charts [Issue #2298](#2298) [PR #2801](#2801) - Basic support for Xlsx reading/writing Chart Sheets [PR #2830](#2830) Note that a ChartSheet is still only written as a normal Worksheet containing a single chart, not as an actual ChartSheet. - Added Worksheet visibility in Ods Reader [PR #2851](#2851) and Gnumeric Reader [PR #2853](#2853) - Added Worksheet visibility in Ods Writer [PR #2850](#2850) - Allow Csv Reader to treat string as contents of file [Issue #1285](#1285) [PR #2792](#2792) - Allow Csv Reader to store null string rather than leave cell empty [Issue #2840](#2840) [PR #2842](#2842) - Provide new Worksheet methods to identify if a row or column is "empty", making allowance for different definitions of "empty": - Treat rows/columns containing no cell records as empty (default) - Treat cells containing a null value as empty - Treat cells containing an empty string as empty ### Changed - Modify `rangeBoundaries()`, `rangeDimension()` and `getRangeBoundaries()` Coordinate methods to work with row/column ranges as well as with cell ranges and cells [PR #2926](#2926) - Better enforcement of value modification to match specified datatype when using `setValueExplicit()` - Relax validation of merge cells to allow merge for a single cell reference [Issue #2776](#2776) - Memory and speed improvements, particularly for the Cell Collection, and the Writers. See [the Discussion section on github](#2821) for details of performance across versions - Improved performance for removing rows/columns from a worksheet ### Deprecated - Nothing ### Removed - Nothing ### Fixed - Xls Reader resolving absolute named ranges to relative ranges [Issue #2826](#2826) [PR #2827](#2827) - Null value handling in the Excel Math/Trig PRODUCT() function [Issue #2833](#2833) [PR #2834](#2834) - Invalid Print Area defined in Xlsx corrupts internal storage of print area [Issue #2848](#2848) [PR #2849](#2849) - Time interval formatting [Issue #2768](#2768) [PR #2772](#2772) - Copy from Xls(x) to Html/Pdf loses drawings [PR #2788](#2788) - Html Reader converting cell containing 0 to null string [Issue #2810](#2810) [PR #2813](#2813) - Many fixes for Charts, especially, but not limited to, Scatter, Bubble, and Surface charts. [Issue #2762](#2762) [Issue #2299](#2299) [Issue #2700](#2700) [Issue #2817](#2817) [Issue #2763](#2763) [Issue #2219](#2219) [Issue #2863](#2863) [PR #2828](#2828) [PR #2841](#2841) [PR #2846](#2846) [PR #2852](#2852) [PR #2856](#2856) [PR #2865](#2865) [PR #2872](#2872) [PR #2879](#2879) [PR #2898](#2898) [PR #2906](#2906) [PR #2922](#2922) [PR #2923](#2923) - Adjust both coordinates for two-cell anchors when rows/columns are added/deleted. [Issue #2908](#2908) [PR #2909](#2909) - Keep calculated string results below 32K. [PR #2921](#2921) - Filter out illegal Unicode char values FFFE/FFFF. [Issue #2897](#2897) [PR #2910](#2910) - Better handling of REF errors and propagation of all errors in Calculation engine. [PR #2902](#2902) - Calculating Engine regexp for Column/Row references when there are multiple quoted worksheet references in the formula [Issue #2874](#2874) [PR #2899](#2899)
This is:
What is the expected behavior?
$sheet->fromArray success writing cells
What is the current behavior?
The existence of xEFxBFxBE character(s) in a string affected fromArray method failed to write not only single cell, but all cells that have string in it. It does not affects cell with number.
String example:
LLC “CPC “Business Perspectives�
What features do you think are causing the issue
Does an issue affect all spreadsheet file formats? If not, which formats are affected?
xlsx
Which versions of PhpSpreadsheet and PHP are affected?
PhpSpreadsheet 1.19
PHP 7.4
The text was updated successfully, but these errors were encountered: