-
Notifications
You must be signed in to change notification settings - Fork 4.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Audit usage of mb_*
functions in PHP parser
#1611
Comments
A potential polyfill https://github.com/symfony/polyfill-mbstring |
Closed with the merge of #1775 (I'm pretty sure 😉) |
Unfortunately #1775 only got rid of the 2 functions that are already polyfilled in core.
|
There is now also a usage of |
This is causing fatals to some people in certain configurations. Let's avoid breaking and more gracefully not rendering dynamic blocks on the server. |
As I understand it, the grammar never needs multi-byte regex patterns (we only ever look for digits, ASCII text, and a few other explicit character classes). For inputs guaranteed to be UTF-8, I think we can just switch to
This is no longer required if we never use
I believe we can switch to function Gutenberg_PEG_chr_unicode( $code ) {
return html_entity_decode( '&#' . $code . ';', ENT_QUOTES, 'UTF-8' );
} (The use of Where, though, is this function even used?
This is only used to split the input into "characters" (UTF-8 encoded Unicode codepoints) for array-access and counting, true? We could do the splitting in a more annoying way: /**
* Split UTF-8 String into array of UTF-8 characters (UTF-8 encoded codepoints).
*
* Based on WordPress' _mb_substr() compat function.
*
* @param string $str The string to extract the substring from.
* @return array
*/
function _utf8_split( $str ) {
if ( _wp_can_use_pcre_u() ) {
// Use the regex unicode support to separate the UTF-8 characters into an array.
preg_match_all( '/./us', $str, $match );
return $match[0];
}
$regex = '/(
[\x00-\x7F] # single-byte sequences 0xxxxxxx
| [\xC2-\xDF][\x80-\xBF] # double-byte sequences 110xxxxx 10xxxxxx
| \xE0[\xA0-\xBF][\x80-\xBF] # triple-byte sequences 1110xxxx 10xxxxxx * 2
| [\xE1-\xEC][\x80-\xBF]{2}
| \xED[\x80-\x9F][\x80-\xBF]
| [\xEE-\xEF][\x80-\xBF]{2}
| \xF0[\x90-\xBF][\x80-\xBF]{2} # four-byte sequences 11110xxx 10xxxxxx * 3
| [\xF1-\xF3][\x80-\xBF]{3}
| \xF4[\x80-\x8F][\x80-\xBF]{2}
)/x';
// Start with 1 element instead of 0 since the first thing we do is pop.
$chars = array( '' );
do {
// We had some string left over from the last round, but we counted it in that last round.
array_pop( $chars );
/*
* Split by UTF-8 character, limit to 1000 characters (last array element will contain
* the rest of the string).
*/
$pieces = preg_split( $regex, $str, 1000, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY );
$chars = array_merge( $chars, $pieces );
// If there's anything left over, repeat the loop.
} while ( count( $pieces ) > 1 && $str = array_pop( $pieces ) );
return $chars;
} I did not even try to run this code :) No idea if it works, how it actually scales, etc. |
@mdawaffe your knowledge of text continues to amaze me. 👏 I was also wondering if we could completely ignore the |
In other character encodings, there might be problems. In UTF-8, though, any byte that looks like an ASCII character is not part of some multi-byte sequence. So searching for ASCII bytes shouldn't return false positives, and splitting on ASCII bytes shouldn't create any tricky string munging based injection attacks. All that to say: I think (for this grammar), we can get away with ignoring multi-byteness when tokenizing (if that's the right word). |
Generally I would prefer to fix these issues in a way that leaves Everything described here is implemented in #1869.
This is the approach I took in nylen/phpegjs@2cc4046. The We are not using this
Indeed, the only usage of Since we're not using any case-insensitive matching in our grammar, I've added a This also yielded a slight further performance improvement (measured the same way as #1775 (comment)):
Since some of the functions we need already exist in WordPress, and yes it is pretty annoying, let's keep this out of |
It works :) Based on my testing so far, it doesn't appear to have adverse performance characteristics either. My only concern is the limit on string length, which defaulted to 1,000. Most places in WP core use this logic to extract fairly short prefixes from a string: excerpts, default post titles, etc. I've changed it to 100,000. Perhaps it should be even longer for this use case. |
Cool. I clearly don't really know what's going on, but I'm glad I arbitrarily picked the right mode :)
That's just the per iteration limit in the loop. It can still handle arbitrarily long strings. (Rather, some other bottleneck will appear.) The loop and per iteration limit are there to prevent Regex DOSes from catastrophic inputs. |
👍 Updated the PR to restore the previous 1000 character limit. |
Fixed in #1869. |
The generated PHP parser contains several instances of
mb_*
(multibyte string) functions:However, PHP is not always installed with the
mbstring
extension enabled, so these functions do not always exist.WP core has polyfills that attempt to emulate
mb_substr
andmb_strlen
if they are not present; the rest of these functions are newly introduced and will cause errors under PHP configurations without these functions enabled.There are a few options for how to proceed:
Polyfill the new functions
Converting encodings and matching regexes both seem quite complicated.
Decide that this is no longer an issue
This code in core has a lot of history, and I am not sure how common it is to have PHP installed without the
mbstring
extension. (It's not enabled in a default build of PHP, but this hasn't been a problem in distributions of PHP that I've used).In particular there is a mention of yoast.com having this configuration, but from 6 years ago.
Replace the new functions with something else
Fortunately,
mb_convert_encoding
is only used to convert a character code to a UTF-8 character, and [I am pretty sure that] themb_ereg
andmb_eregi
calls are only used to match very simple regular expressions of the form^[charlist]
against single characters. So the simplest path forward is probably to remove these calls fromphpegjs
.The text was updated successfully, but these errors were encountered: