How to debug UnicodeDecodeError? #16

ondrejmirtes · 2022-07-11T15:14:18Z

Hello,
I really appreciate your action, it's really useful!

I'm getting UnicodeDecodeError for a few of my pages (https://github.com/phpstan/phpstan/runs/7283991324?check_suite_focus=true):

ERROR:deadseeker.loggingresponsehandler:::error ::UnicodeDecodeError: 200 - https://phpstan.org/blog/phpstan-1-6-0-with-conditional-return-types
ERROR:deadseeker.loggingresponsehandler:::error ::UnicodeDecodeError: 200 - https://phpstan.org/blog/phpstan-is-ready-for-php8
ERROR:deadseeker.loggingresponsehandler:::error ::UnicodeDecodeError: 200 - https://phpstan.org/blog/zero-config-analysis-with-static-reflection
ERROR:deadseeker.loggingresponsehandler:::error ::UnicodeDecodeError: 200 - https://phpstan.org/blog/from-minutes-to-seconds-massive-performance-gains-in-phpstan
ERROR:deadseeker.loggingresponsehandler:::error ::UnicodeDecodeError: 200 - https://phpstan.org/blog/phpstans-doctrine-extension-just-got-a-lot-better
ERROR:deadseeker.loggingresponsehandler:::error ::UnicodeDecodeError: 200 - https://phpstan.org/blog/phpstan-0-11
ERROR:deadseeker.loggingresponsehandler:::error ::UnicodeDecodeError: 200 - https://phpstan.org/config-reference
WARNING:deadseeker.clientsession:::warn ::Retry Attempt #2 of 5: https://phpstan.org/blog/whats-up-with-template-covariant
WARNING:deadseeker.clientsession:::warn ::Retry Attempt #2 of 5: https://phpstan.org/writing-php-code/phpdoc-types#general-arrays
ERROR:deadseeker.loggingresponsehandler:::error ::UnicodeDecodeError: 200 - https://phpstan.org/config-reference#rememberpossiblyimpurefunctionvalues
ERROR:deadseeker.loggingresponsehandler:::error ::UnicodeDecodeError: 200 - https://phpstan.org/config-reference#analysed-files
ERROR:deadseeker.loggingresponsehandler:::error ::UnicodeDecodeError: 200 - https://phpstan.org/config-reference#rule-level
ERROR:deadseeker.loggingresponsehandler:::error ::UnicodeDecodeError: 200 - https://phpstan.org/config-reference#multiple-files
ERROR:deadseeker.loggingresponsehandler:::error ::UnicodeDecodeError: 200 - https://phpstan.org/config-reference#errorformat
ERROR:deadseeker.loggingresponsehandler:::error ::UnicodeDecodeError: 200 - https://phpstan.org/config-reference#parallel-processing
ERROR:deadseeker.loggingresponsehandler:::error ::UnicodeDecodeError: 200 - https://phpstan.org/config-reference#caching
ERROR:deadseeker.loggingresponsehandler:::error ::UnicodeDecodeError: 200 - https://phpstan.org/config-reference#bootstrap
ERROR:deadseeker.loggingresponsehandler:::error ::UnicodeDecodeError: 200 - https://phpstan.org/config-reference#checkuninitializedproperties

These pages normally work fine, and I'm not sure how to debug this. Can you please suggest some steps to fix it? Thanks :)


ScholliYT · 2022-07-11T21:39:30Z

Hi there,

The current output is indeed not very useful. I added logging of the exception, shown below, to investigate this issue. I suspect the problem is the encoding of emojis: all the pages you list above contain at least one emoji character.

ERROR:deadseeker.loggingresponsehandler:::error ::UnicodeDecodeError: 200 - https://phpstan.org/blog/phpstan-is-ready-for-php8
Traceback (most recent call last):
  File "/modules/deadseeker/responsefetcher.py", line 42, in fetch_response
    await self._inner_fetch(session, resp, urltarget, timer)
  File "/modules/deadseeker/responsefetcher.py", line 98, in _inner_fetch
    await self._do_get(session, resp, urltarget, timer)
  File "/modules/deadseeker/responsefetcher.py", line 71, in _do_get
    resp.html = await response.text()
  File "/usr/local/lib/python3.8/site-packages/aiohttp/client_reqrep.py", line 1081, in text
    return self._body.decode(encoding, errors=errors)  # type: ignore
  File "/usr/local/lib/python3.8/encodings/cp1254.py", line 15, in decode
    return codecs.charmap_decode(input,errors,decoding_table)

For some reason, the charset auto-detection thinks your page is encoded using cp1254 / ISO/IEC 8859-9.

DEBUG:chardet.charsetprober:SHIFT_JIS Japanese prober hit error at byte 16846
DEBUG:chardet.charsetprober:EUC-JP Japanese prober hit error at byte 139
DEBUG:chardet.charsetprober:GB2312 Chinese prober hit error at byte 16847
DEBUG:chardet.charsetprober:EUC-KR Korean prober hit error at byte 139
DEBUG:chardet.charsetprober:CP949 Korean prober hit error at byte 139
DEBUG:chardet.charsetprober:Big5 Chinese prober hit error at byte 140
DEBUG:chardet.charsetprober:EUC-TW Taiwan prober hit error at byte 139
DEBUG:chardet.charsetprober:windows-1251 Russian confidence = 0.01
DEBUG:chardet.charsetprober:KOI8-R Russian confidence = 0.01
DEBUG:chardet.charsetprober:ISO-8859-5 Russian confidence = 0.0
DEBUG:chardet.charsetprober:MacCyrillic Russian confidence = 0.0
DEBUG:chardet.charsetprober:IBM866 Russian confidence = 0.0
DEBUG:chardet.charsetprober:IBM855 Russian confidence = 0.01
DEBUG:chardet.charsetprober:ISO-8859-7 Greek confidence = 0.0
DEBUG:chardet.charsetprober:windows-1253 Greek confidence = 0.0
DEBUG:chardet.charsetprober:ISO-8859-5 Bulgarian confidence = 0.0
DEBUG:chardet.charsetprober:windows-1251 Bulgarian confidence = 0.0
DEBUG:chardet.charsetprober:TIS-620 Thai confidence = 0.0
DEBUG:chardet.charsetprober:ISO-8859-9 Turkish confidence = 0.578594466537165
DEBUG:chardet.charsetprober:windows-1255 Hebrew confidence = 0.0
DEBUG:chardet.charsetprober:windows-1255 Hebrew confidence = 0.0
DEBUG:chardet.charsetprober:windows-1255 Hebrew confidence = 0.0
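To illustrate why a cp1254 guess breaks on pages with emojis, here is a minimal sketch that does not use deadseeker or aiohttp at all. The sample text and the helper name `try_decode` are made up for the example. It assumes the page body is really UTF-8 and contains an emoji whose UTF-8 bytes include a value that cp1254 leaves undefined (0x8D in this case).

```python
# Hypothetical page body: UTF-8 text containing the thumbs-up emoji (U+1F44D).
# Its UTF-8 encoding is F0 9F 91 8D, and 0x8D has no mapping in cp1254.
body = "Released \U0001F44D".encode("utf-8")


def try_decode(data: bytes, encoding: str) -> str:
    """Decode data with the given encoding, or describe the failure."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError as exc:
        return f"{type(exc).__name__}: {exc.reason} at byte {exc.start}"


# Decoding with the real encoding succeeds.
print(try_decode(body, "utf-8"))

# Decoding with the wrongly detected encoding fails, like in the traceback.
print(try_decode(body, "cp1254"))
```

Not every emoji triggers the error, since some encode to bytes that cp1254 happens to define, which may be why only some pages fail.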

ScholliYT · 2022-07-11T21:52:40Z

There are the following options to fix this:

1. Add the charset parameter to the Content-Type header of your html documents. This will be respected by the aiohttp client.
2. Default to "utf-8", and force aiohttp to use this encoding. This can be done as almost everything is utf-8 encoded.

I think solution 2 is fine.
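The two options can be sketched together as follows. This is only an illustration using the standard library, not the code from the actual fix: the helper names `pick_encoding` and `decode_body` are hypothetical. With aiohttp, the rough equivalent of the fallback would be passing an explicit encoding, as in `await response.text(encoding="utf-8")`.

```python
from email.message import Message


def pick_encoding(content_type: str, default: str = "utf-8") -> str:
    """Use the charset from a Content-Type header value, else the default."""
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() returns None when no charset parameter is present.
    return msg.get_content_charset() or default


def decode_body(body: bytes, content_type: str) -> str:
    """Decode a response body without any charset auto-detection."""
    return body.decode(pick_encoding(content_type))


# Option 1: the server declares the charset, so it is respected.
print(pick_encoding("text/html; charset=ISO-8859-2"))

# Option 2: no charset declared, so fall back to utf-8 instead of guessing.
print(decode_body("ok \U0001F44D".encode("utf-8"), "text/html"))
```

The trade-off is that a page served in a different encoding without a declared charset would be decoded incorrectly, which is the case auto-detection tries to handle.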

fixes #16

ScholliYT · 2022-07-11T23:40:49Z

@ondrejmirtes can you check if the Pull Request #20 works for you? I just checked with your config and it seems to work. However, you have at least one broken link on your page 😄 This page you link to seems to be down https://2018.phpce.eu.

ondrejmirtes · 2022-07-12T07:34:34Z

Yes, it works for me, it's green now! :) https://github.com/phpstan/phpstan/runs/7297099734?check_suite_focus=true

Thank you very much for your swift response.

ScholliYT · 2022-07-12T07:48:53Z

From aio-libs/aiohttp#5930 it looks like the solution I picked (i.e. defaulting to "utf-8") is not optimal. Instead, updating to aiohttp >3.8 should work here too.

Feature/upgrade aiohttp Fixes #16 Closes #20

ScholliYT added a commit that referenced this issue Jul 11, 2022

default to utf-8 encoding for html body

4cf2374

fixes #16

ScholliYT mentioned this issue Jul 11, 2022

Fix: wrong detection of html text encoding #20

Merged

ScholliYT mentioned this issue Aug 16, 2022

Feature/upgrade aiohttp #22

Merged

ScholliYT closed this as completed in d96aae7 Aug 16, 2022

ScholliYT added a commit that referenced this issue Aug 16, 2022

Merge pull request #22 from ScholliYT/feature/upgrade-aiohttp

50b7580

Feature/upgrade aiohttp Fixes #16 Closes #20
