Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

UTF-8 encoding detection fails on PHP 8.1 #181

Closed
come-nc opened this issue Mar 30, 2022 · 11 comments · Fixed by #182
Closed

UTF-8 encoding detection fails on PHP 8.1 #181

come-nc opened this issue Mar 30, 2022 · 11 comments · Fixed by #182

Comments

@come-nc
Copy link
Contributor

come-nc commented Mar 30, 2022

Hello

Someone openned an issue at Nextcloud signaling that creating a folder with special character broke when using PHP 8.1.
It appears that this comes from mb_detect_encoding changes in PHP 8.1 that end up detecting ISO-8859-1 instead of UTF-8

The code handling this is the same as the one in you decodePathSegment at https://github.com/sabre-io/http/blob/master/lib/functions.php#L404
You can see the behavior here: https://3v4l.org/IT9dJ

It feels we should not be trying to detect encoding here but maybe assume UTF-8, or only check with mb_check_encoding (which does return true for UTF-8).
What is the reason for this test, are some dav client sending ISO-8859-1? Is that allowed by webdav protocol?

Would this be ok instead:

/**
 * Decodes a url-encoded path segment.
 */
function decodePathSegment(string $path): string
{
    $path = rawurldecode($path);

    if (!mb_check_encoding($path, 'UTF-8') && mb_check_encoding($path, 'ISO-8859-1')) {
        $path = utf8_encode($path);
    }

    return $path;
}
@staabm
Copy link
Member

staabm commented Mar 30, 2022

thx for the report

Someone openned an issue at Nextcloud signaling that creating a folder with special character broke when using PHP 8.1.
It appears that this comes from mb_detect_encoding changes in PHP 8.1 that end up detecting ISO-8859-1 instead of UTF-8

did you check whether its a intended change in php-src or maybe even a php 8.1 only bug?
I would suggest opening a php-src issue and see whether it is expected.

@come-nc
Copy link
Contributor Author

come-nc commented Mar 30, 2022

Opened php/php-src#8279

@come-nc
Copy link
Contributor Author

come-nc commented Mar 30, 2022

@come-nc
Copy link
Contributor Author

come-nc commented Mar 31, 2022

@staabm See the PHP dev answer, this change is on purpose. Can you clarify why encoding detection is needed in this case? Is the webdav protocol accepting any encoding or is it because some clients are misbehaving?

If we really have to detect the encoding then current behavior is fine, if (as I hope) the protocol is enforcing utf-8 we should only detect encoding if the input is not valid utf-8.

@staabm
Copy link
Member

staabm commented Mar 31, 2022

See the PHP dev answer, this change is on purpose. Can you clarify why encoding detection is needed in this case? Is the webdav protocol accepting any encoding or is it because some clients are misbehaving?

I can't say anything about the protocol, but technically the detection is needed because utf8_encode only works on ISO-8859-1

@come-nc
Copy link
Contributor Author

come-nc commented Mar 31, 2022

I can't say anything about the protocol, but technically the detection is needed because utf8_encode only works on ISO-8859-1

But how come we do not know the encoding of what we are receiving in the first place?
Would you be ok with switching to mb_check_encoding as proposed in my first post?

@staabm
Copy link
Member

staabm commented Mar 31, 2022

But how come we do not know the encoding of what we are receiving in the first place?

I don't know.

Would you be ok with switching to mb_check_encoding as proposed in my first post?

Yep, after the discussion it seems the way forward.

@DeepDiver1975
Copy link
Member

But how come we do not know the encoding of what we are receiving in the first place?

I can only guess here .... sabre dav has a long history and a lot of bug fixing went into buggy clients sending essentially 💩 .
I guess sabre is not trusting any client and performs additional checks/verification/validations ...

@staabm
Copy link
Member

staabm commented Mar 31, 2022

btw: I tracked back the history when this was added and found the commit 885ada5#diff-c8569a19ab0202cf69472c1bc9c1945632a552d660283e3c8354408803a12972R79

@ambraspace
Copy link

@come-nc thanks for investigating this issue.

I reported this to Nextcloud repository since I noticed the problem, not only with file names but also with contact names. Since (probably) more then 90% names in Slavic languages contain one of šđčćž it's quite a problem.

After reading the discussion at php/php-src#8279 I'm even more worried. It seems that sometimes PHP will guess UTF-8 and sometimes it will guess ISO-8859-1 for the same application. It seems very awkward. Shouldn't the encoding be detected from the HTTP request header or something?

Unless there's is a way to specify the encoding within WebDAV protocol, I strongly suggest assuming UTF-8 by default.

@come-nc
Copy link
Contributor Author

come-nc commented Apr 7, 2022

Also note that utf8_encode will most likely be deprecated in next PHP version: https://wiki.php.net/rfc/remove_utf8_decode_and_utf8_encode

come-nc added a commit to come-nc/http that referenced this issue Apr 25, 2022
Use mb_check_encoding to detect encoding as mb_detect_encoding is misbehaving under PHP 8.1.
Also use mb_convert_encoding instead of utf8_encode as it’s getting deprecated in PHP 8.2.
Fixes sabre-io#181
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants