How to debug UnicodeDecodeError? #16

Closed
ondrejmirtes opened this issue Jul 11, 2022 · 5 comments

Comments

@ondrejmirtes

Hello,
I really appreciate your action, it's really useful!

I'm getting UnicodeDecodeError for a few of my pages (https://github.com/phpstan/phpstan/runs/7283991324?check_suite_focus=true):

ERROR:deadseeker.loggingresponsehandler:::error ::UnicodeDecodeError: 200 - https://phpstan.org/blog/phpstan-1-6-0-with-conditional-return-types
ERROR:deadseeker.loggingresponsehandler:::error ::UnicodeDecodeError: 200 - https://phpstan.org/blog/phpstan-is-ready-for-php8
ERROR:deadseeker.loggingresponsehandler:::error ::UnicodeDecodeError: 200 - https://phpstan.org/blog/zero-config-analysis-with-static-reflection
ERROR:deadseeker.loggingresponsehandler:::error ::UnicodeDecodeError: 200 - https://phpstan.org/blog/from-minutes-to-seconds-massive-performance-gains-in-phpstan
ERROR:deadseeker.loggingresponsehandler:::error ::UnicodeDecodeError: 200 - https://phpstan.org/blog/phpstans-doctrine-extension-just-got-a-lot-better
ERROR:deadseeker.loggingresponsehandler:::error ::UnicodeDecodeError: 200 - https://phpstan.org/blog/phpstan-0-11
ERROR:deadseeker.loggingresponsehandler:::error ::UnicodeDecodeError: 200 - https://phpstan.org/config-reference
WARNING:deadseeker.clientsession:::warn ::Retry Attempt #2 of 5: https://phpstan.org/blog/whats-up-with-template-covariant
WARNING:deadseeker.clientsession:::warn ::Retry Attempt #2 of 5: https://phpstan.org/writing-php-code/phpdoc-types#general-arrays
ERROR:deadseeker.loggingresponsehandler:::error ::UnicodeDecodeError: 200 - https://phpstan.org/config-reference#rememberpossiblyimpurefunctionvalues
ERROR:deadseeker.loggingresponsehandler:::error ::UnicodeDecodeError: 200 - https://phpstan.org/config-reference#analysed-files
ERROR:deadseeker.loggingresponsehandler:::error ::UnicodeDecodeError: 200 - https://phpstan.org/config-reference#rule-level
ERROR:deadseeker.loggingresponsehandler:::error ::UnicodeDecodeError: 200 - https://phpstan.org/config-reference#multiple-files
ERROR:deadseeker.loggingresponsehandler:::error ::UnicodeDecodeError: 200 - https://phpstan.org/config-reference#errorformat
ERROR:deadseeker.loggingresponsehandler:::error ::UnicodeDecodeError: 200 - https://phpstan.org/config-reference#parallel-processing
ERROR:deadseeker.loggingresponsehandler:::error ::UnicodeDecodeError: 200 - https://phpstan.org/config-reference#caching
ERROR:deadseeker.loggingresponsehandler:::error ::UnicodeDecodeError: 200 - https://phpstan.org/config-reference#bootstrap
ERROR:deadseeker.loggingresponsehandler:::error ::UnicodeDecodeError: 200 - https://phpstan.org/config-reference#checkuninitializedproperties

These pages normally work fine, and I'm not sure how to debug this. Can you please suggest some steps to fix it? Thanks :)

@ScholliYT
Owner

ScholliYT commented Jul 11, 2022

Hi there,

The current output is indeed not very useful. To investigate this issue, I added logging of the exception, which produces the traceback below. I suspect the problem is the encoding of emojis: all the pages you list above contain at least one emoji character.

ERROR:deadseeker.loggingresponsehandler:::error ::UnicodeDecodeError: 200 - https://phpstan.org/blog/phpstan-is-ready-for-php8
Traceback (most recent call last):
  File "/modules/deadseeker/responsefetcher.py", line 42, in fetch_response
    await self._inner_fetch(session, resp, urltarget, timer)
  File "/modules/deadseeker/responsefetcher.py", line 98, in _inner_fetch
    await self._do_get(session, resp, urltarget, timer)
  File "/modules/deadseeker/responsefetcher.py", line 71, in _do_get
    resp.html = await response.text()
  File "/usr/local/lib/python3.8/site-packages/aiohttp/client_reqrep.py", line 1081, in text
    return self._body.decode(encoding, errors=errors)  # type: ignore
  File "/usr/local/lib/python3.8/encodings/cp1254.py", line 15, in decode
    return codecs.charmap_decode(input,errors,decoding_table)

For some reason, the automatic charset detection thinks your page is encoded as cp1254 (Windows-1254, closely related to ISO/IEC 8859-9, Turkish).
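A minimal sketch of why a wrong guess leads to this error, using only the standard library. The sample string is hypothetical and not taken from the affected pages. It assumes the page bytes are UTF-8 and contain an emoji whose encoding includes a byte that cp1254 leaves undefined (here 0x8E):

```python
# Hypothetical page fragment; the emoji U+1F389 encodes to F0 9F 8E 89 in UTF-8.
raw = "Released \U0001F389".encode("utf-8")


def try_decode(data: bytes, encoding: str):
    """Return the decoded text, or None if the bytes are invalid in `encoding`."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError as exc:
        # Print the codec's own message so the offending byte is visible.
        print(f"{encoding}: {exc}")
        return None


as_utf8 = try_decode(raw, "utf-8")      # decodes cleanly
as_cp1254 = try_decode(raw, "cp1254")   # fails: byte 0x8E has no mapping in cp1254
```

Printing the exception instead of only its class name shows the byte and position that the codec rejected, which is the detail missing from the original log output.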

DEBUG:chardet.charsetprober:SHIFT_JIS Japanese prober hit error at byte 16846
DEBUG:chardet.charsetprober:EUC-JP Japanese prober hit error at byte 139
DEBUG:chardet.charsetprober:GB2312 Chinese prober hit error at byte 16847
DEBUG:chardet.charsetprober:EUC-KR Korean prober hit error at byte 139
DEBUG:chardet.charsetprober:CP949 Korean prober hit error at byte 139
DEBUG:chardet.charsetprober:Big5 Chinese prober hit error at byte 140
DEBUG:chardet.charsetprober:EUC-TW Taiwan prober hit error at byte 139
DEBUG:chardet.charsetprober:windows-1251 Russian confidence = 0.01
DEBUG:chardet.charsetprober:KOI8-R Russian confidence = 0.01
DEBUG:chardet.charsetprober:ISO-8859-5 Russian confidence = 0.0
DEBUG:chardet.charsetprober:MacCyrillic Russian confidence = 0.0
DEBUG:chardet.charsetprober:IBM866 Russian confidence = 0.0
DEBUG:chardet.charsetprober:IBM855 Russian confidence = 0.01
DEBUG:chardet.charsetprober:ISO-8859-7 Greek confidence = 0.0
DEBUG:chardet.charsetprober:windows-1253 Greek confidence = 0.0
DEBUG:chardet.charsetprober:ISO-8859-5 Bulgarian confidence = 0.0
DEBUG:chardet.charsetprober:windows-1251 Bulgarian confidence = 0.0
DEBUG:chardet.charsetprober:TIS-620 Thai confidence = 0.0
DEBUG:chardet.charsetprober:ISO-8859-9 Turkish confidence = 0.578594466537165
DEBUG:chardet.charsetprober:windows-1255 Hebrew confidence = 0.0
DEBUG:chardet.charsetprober:windows-1255 Hebrew confidence = 0.0
DEBUG:chardet.charsetprober:windows-1255 Hebrew confidence = 0.0

@ScholliYT
Owner

There are the following options to fix this:

  1. Add the charset parameter to the Content-Type header of your HTML documents. The aiohttp client respects it.
  2. Default to "utf-8" and force aiohttp to use this encoding. This is reasonable because almost everything on the web is UTF-8 encoded.

I think solution 2 is fine.
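The two options can be combined into one rule: honor a declared charset, otherwise fall back to UTF-8 instead of guessing. The sketch below illustrates that rule with the standard library only. `pick_encoding` and `decode_body` are hypothetical helper names, and this is not necessarily how the fix in this project is implemented. With aiohttp itself, the fallback corresponds to passing an explicit `encoding` argument to `response.text()`.

```python
from email.message import Message
from typing import Optional


def pick_encoding(content_type: Optional[str], default: str = "utf-8") -> str:
    """Return the charset declared in a Content-Type header, else `default`."""
    msg = Message()
    if content_type:
        msg["Content-Type"] = content_type
    # get_content_charset() returns the lowercased charset parameter, or None.
    return msg.get_content_charset() or default


def decode_body(body: bytes, content_type: Optional[str]) -> str:
    """Decode a response body without any charset auto-detection."""
    return body.decode(pick_encoding(content_type))


# Option 1: the server declares the charset, so it is used as-is.
declared = pick_encoding("text/html; charset=ISO-8859-9")
# Option 2: nothing is declared, so UTF-8 is assumed.
fallback = pick_encoding("text/html")
text = decode_body("config reference \U0001F389".encode("utf-8"), "text/html")
```

Skipping detection trades flexibility for predictability: a page in a legacy encoding that does not declare its charset would fail to decode, but such pages can be fixed at the source by following option 1.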

@ScholliYT
Owner

@ondrejmirtes can you check whether Pull Request #20 works for you? I just checked with your config and it seems to work. However, you have at least one broken link on your page 😄 The page you link to, https://2018.phpce.eu, seems to be down.

@ondrejmirtes
Author

Yes, it works for me, it's green now! :) https://github.com/phpstan/phpstan/runs/7297099734?check_suite_focus=true

Thank you very much for your swift response.

@ScholliYT
Owner

Judging from aio-libs/aiohttp#5930, the solution I picked (i.e. defaulting to "utf-8") is not optimal. Updating to aiohttp >3.8 should work here too.

ScholliYT added a commit that referenced this issue Aug 16, 2022