Switch default fallback encoding detection lib to charset-normalizer
#5930
Conversation
Happens when given too few bytes.
This pull request introduces 1 alert when merging 55d585f into a341986 - view on LGTM.com

new alerts:
1 for Module is imported more than once
Ping. Would you be interested in such a change in the near future? @Dreamsorcerer @webknjaz
Regards,
I'm not really familiar with either library, so I don't think I'm in a place to make a decision on this. Although it does sound promising. I've enabled the CI though, so looks like it needs some minor reformatting for flake8.
flake8 hint for double-quoted strings
The RTD build failed with a syntax error at runtime. I cannot restart the task.
Codecov Report
@@ Coverage Diff @@
## master #5930 +/- ##
=======================================
Coverage 93.31% 93.31%
=======================================
Files 102 102
Lines 30062 30062
Branches 2690 2690
=======================================
Hits 28053 28053
Misses 1833 1833
Partials 176 176
charset-normalizer v2.0.5 no longer raises warnings on tiny content
Should be fixed now. Just need to merge master.
Merged.
@asvetlov is there anything that could be done to ease the road toward approval? I would gladly help in any way I can.
I ran some basic tests and got some measurements to share. Everything went well, as expected; no crashes.

Context

The run boils down to three steps: prepare the environment, collect results, and memory sampling (to be run in parallel with the collection), as sketched below.
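The exact commands are not preserved here, so this is only a rough, hypothetical sketch of such a comparison; the samples directory, the helper name, and the use of tracemalloc are placeholders, not the setup actually used.

```python
# Hypothetical reproduction sketch -- not the script behind the numbers below.
# Assumes chardet and charset-normalizer are installed and ./samples holds
# the text files to analyze.
import pathlib
import time
import tracemalloc

import chardet
import charset_normalizer


def measure(detector, payloads):
    """Time one detector over every payload and record the peak traced memory."""
    tracemalloc.start()
    start = time.perf_counter()
    for payload in payloads:
        detector(payload)
    elapsed = time.perf_counter() - start
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return elapsed, peak


payloads = [p.read_bytes() for p in pathlib.Path("samples").iterdir() if p.is_file()]

for name, detector in (
    ("chardet", lambda data: chardet.detect(data)["encoding"]),
    ("charset-normalizer", lambda data: charset_normalizer.from_bytes(data).best()),
):
    elapsed, peak = measure(detector, payloads)
    print(f"{name}: {elapsed:.2f}s, peak {peak / 1048576:.1f} MiB")
```

Note that tracemalloc only tracks Python-level allocations, so it approximates rather than reproduces the RSS peaks quoted below.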
RAW performance

This comes as no surprise, as it was already proven previously: 5 times faster while keeping 88% (410 / 464 files) fully backward-compatible results.

RAM footprint

Following issue #4112 it could be a good idea to showcase that aspect. Below is the peak usage for each flavor: 1.7 times less memory consumption (on peaks) for a roughly 5 times faster guess output (on average).

Other

Keep in mind that the actual delays may vary depending on your CPU capabilities. You may find the RAW console outputs for both flavors at https://gist.github.com/Ousret/ec22b8842a42fb74276b052b9178dcc4
Hi @Ousret, the change is good. Though, could you apply those minor improvements I've pointed out?
Co-authored-by: Sviatoslav Sydorenko <[email protected]>
…tch-charset-detection
I have applied the recommendations.
This is great, thanks!
@Ousret I've enabled auto-merge which should get this patch into |
Backport to 3.8: 💔 cherry-picking failed — conflicts found

❌ Failed to cleanly apply 2d5597e on top of patchback/backports/3.8/2d5597e6743bb4c579adb6f9b67482d5d35978c7/pr-5930

Backporting merged PR #5930 into master

🤖 @patchback
💔 Backport was not successful. The PR was attempted to be backported to the following branches:
This change improves the performance of the encoding detection by substituting the backend lib with the new `Charset-Normalizer` (used to be `Chardet`). The patch is backward-compatible API-wise, except that the dependency is different.

PR aio-libs#5930

Co-authored-by: Sviatoslav Sydorenko <[email protected]>
(cherry picked from commit 2d5597e)
Backport PR: #6108.
…tection lib to `charset-normalizer` (#6108) Co-authored-by: Sviatoslav Sydorenko <[email protected]> Co-authored-by: TAHRI Ahmed R <[email protected]>
Thanks for your time. I can see that the backport PR has already been merged. |
What do these changes do?
Switch Chardet dependency to Charset-Normalizer for the fallback encoding detection.
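As a rough illustration of what the switch means at the call site (a hedged sketch, not the actual aiohttp patch): charset-normalizer exposes a chardet-compatible detect() helper, so the fallback code path barely changes.

```python
# Sketch only: illustrates the drop-in nature of the swap, not the real diff.
# charset-normalizer ships a chardet-compatible detect() helper.
try:
    from charset_normalizer import detect  # new fallback detector
except ImportError:
    from chardet import detect  # previous fallback detector

payload = "naïve café".encode("cp1252")

# Both libraries return a dict with "encoding", "confidence" and "language"
# keys; the actual values depend on the heuristic and the input.
result = detect(payload)
print(result["encoding"], result["confidence"])
```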
Are there changes in behavior for the user?
This change is mostly backward-compatible, with one exception: the underlying dependency changes, so the detected encoding may differ in some cases (see below).
Why should you bother with such a change? Is it worth it?
Short answer, absolutely.
Long answer:
The two libraries can return different guesses for the same content. For example, typical disagreements involve Windows-1252 versus Windows-1254, and ISO-8859-7 versus utf_8, with Windows-1252 and utf_8 being the guesses for charset-normalizer.

requests did integrate it first and, for total transparency, the lib needed some minor adjustments. But it is going well so far. It's still a heuristic lib, therefore it cannot be trusted blindly, of course.
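For illustration, here is a toy way to compare the two heuristics on the same bytes; the sample string and its encoding are made up for this example and are not taken from any benchmark set mentioned above.

```python
# Toy comparison of both heuristics on the same non-UTF-8 payload.
# The sample text is arbitrary; real results depend heavily on input size.
import chardet
import charset_normalizer

payload = "Ceci n'est pas de l'UTF-8 : déjà vu, très tôt.".encode("cp1252")

print("chardet           :", chardet.detect(payload)["encoding"])

best = charset_normalizer.from_bytes(payload).best()
print("charset-normalizer:", best.encoding if best else None)
```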
Is UTF-8 everywhere already?
Not really; that is a dangerous assumption. Looking at https://w3techs.com/technologies/overview/character_encoding may make it seem like encoding detection is a thing of the past, but not really. Based solely on 33k websites, you will find
3.4k responses without a predefined encoding, of which 1.8k websites were not UTF-8, roughly half! (Top 1000 sites from 80 countries in the world according to Data for SEO) https://github.com/potiuk/test-charset-normalizer
This statistic (w3techs) does not apply any weighting, so one should not read it as
"I have a 97% chance of hitting UTF-8 content when fetching HTML content".
First of all, neither aiohttp, chardet nor charset-normalizer is dedicated to HTML content. The detection concerns every text document (SubRip subtitles, for example).
It is hard to find any stats at all regarding this matter. Usage patterns are very dispersed, so making assumptions is unwise.
The real debate is whether the detection is an HTTP client's concern at all. That is more complicated and not my field.
Related issue number
No related issue.
Checklist
Add yourself to CONTRIBUTORS.txt
Add a news fragment into the CHANGES folder