Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Switch default fallback encoding detection lib to charset-normalizer #5930

Merged
merged 17 commits into from
Oct 20, 2021
Merged

Switch default fallback encoding detection lib to charset-normalizer #5930

merged 17 commits into from
Oct 20, 2021

Conversation

Ousret
Copy link
Contributor

@Ousret Ousret commented Aug 4, 2021

What do these changes do?

Switch Chardet dependency to Charset-Normalizer for the fallback encoding detection.

Are there changes in behavior for the user?

This change is mostly backward-compatible, exception of a thing:

Why should you bother with such a change? Is it worth it?

Short answer, absolutely.

Long answer:

It's still a heuristic lib, therefore cannot be trusted blindly of course.

Is UTF-8 everywhere already?

Not really, that is a dangerous assumption. Looking at https://w3techs.com/technologies/overview/character_encoding may seem like encoding detection is a thing of the past but not really. Solo based on 33k websites, you will find
3,4k responses without predefined encoding. 1,8k websites were not UTF-8, merely half! (Top 1000 sites from 80 countries in the world according to Data for SEO) https://github.com/potiuk/test-charset-normalizer

This statistic (w3techs) does not offer any ponderation, so one should not read it as
"I have a 97 % chance of hitting UTF-8 content on HTML content".

First of all, neither aiohttp, chardet or charset-normalizer are dedicated to HTML content. The detection concern every text document (SubRip Subtitle for ex.).
It is so hard to find any stats at all regarding this matter. Users' usages can be very dispersed, so making assumptions is unwise.

The real debate is to state if the detection is an HTTP client matter or not. That is more complicated and not my field.

Related issue number

No related issue.

Checklist

  • I think the code is well written
  • Documentation reflects the changes
  • If you provide code modification, please add yourself to CONTRIBUTORS.txt
  • Add a new news fragment into the CHANGES folder

@psf-chronographer psf-chronographer bot added the bot:chronographer:provided There is a change note present in this PR label Aug 4, 2021
@lgtm-com
Copy link
Contributor

lgtm-com bot commented Aug 5, 2021

This pull request introduces 1 alert when merging 55d585f into a341986 - view on LGTM.com

new alerts:

  • 1 for Module is imported more than once

1 for Module is imported more than once
@Ousret
Copy link
Contributor Author

Ousret commented Sep 10, 2021

Ping.

Would you be interested in such a change in the near future? @Dreamsorcerer @webknjaz

Regards,

@Dreamsorcerer
Copy link
Member

Dreamsorcerer commented Sep 10, 2021

I'm not really familiar with either library, so I don't think I'm in a place to make a decision on this. Although it does sound promising.

I've enabled the CI though, so looks like it needs some minor reformatting for flake8.

flake8 hint for double quote str
@Ousret
Copy link
Contributor Author

Ousret commented Sep 11, 2021

The rtd build failed with a syntax error at runtime. I cannot reboot the task.

@codecov
Copy link

codecov bot commented Sep 11, 2021

Codecov Report

Merging #5930 (4008f15) into master (2904573) will not change coverage.
The diff coverage is n/a.

Impacted file tree graph

@@           Coverage Diff           @@
##           master    #5930   +/-   ##
=======================================
  Coverage   93.31%   93.31%           
=======================================
  Files         102      102           
  Lines       30062    30062           
  Branches     2690     2690           
=======================================
  Hits        28053    28053           
  Misses       1833     1833           
  Partials      176      176           
Flag Coverage Δ
unit 93.24% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
aiohttp/client_reqrep.py 98.01% <ø> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 2904573...4008f15. Read the comment docs.

charset-normalizer v2.0.5 no longer raise warnings on tiny content
@Dreamsorcerer
Copy link
Member

The rtd build failed with a syntax error at runtime. I cannot reboot the task.

Should be fixed now. Just need to merge master.

@Ousret
Copy link
Contributor Author

Ousret commented Sep 15, 2021

Merged.
Thanks.

@Ousret
Copy link
Contributor Author

Ousret commented Oct 4, 2021

@asvetlov is there anything that could be done to ease the road toward the approval? I would gladly help in any way I can.
From any specific angle to run tests if needed.

@Ousret
Copy link
Contributor Author

Ousret commented Oct 8, 2021

I ran some basic tests and got some measurements to share.
You may find the sources of what I produced to get those results here https://github.com/Ousret/aiohttp-tests

Everything went well as expected. No crashes.

Context

  • aiohttp.TCPConnector(limit=16)
  • alpine linux
  • nginx 1.21
  • python 3.9
  • aiohttp dev-master
  • chardet 4.0.0 (aiohttp-chardet)
  • charset-normalizer 2.0.6 (aiohttp-next)

Prepare the environment

docker-compose build aiohttp-next server
docker-compose build --no-cache aiohttp-chardet

docker-compose up -d server

Collect results

docker-compose up aiohttp-chardet
docker-compose up aiohttp-next

Memory sampling (to be run in parallel of aiohttp-* services)

docker stats

RAW performance

This comes with no surprise as it was already proven previously.

  • using charset-normalizer
    • 37.7s with 467 files --> 81ms
  • using chardet
    • 189.3s with 467 files --> 405ms

5 times faster while keeping 88% (410 / 464 files) full backward-compatible results.

RAM footprint

Following issue #4112 it could be a good idea to showcase that aspect.
Using docker stats for the sampling.

Bellow, the peak usage for each flavor.

  • using charset-normalizer
    • 38MiB
  • using chardet
    • 67MiB

1.7 times less memory consumption (on peaks). For 5 times faster guess output (on avg).

Other

Keep in mind that the actual delays may vary depending on your CPU capabilities.
The factor between chardet/charset-normalizer flavors should remain intact.

You may find the RAW console outputs for both flavors at https://gist.github.com/Ousret/ec22b8842a42fb74276b052b9178dcc4

CHANGES/5930.feature Outdated Show resolved Hide resolved
@webknjaz
Copy link
Member

Hi @Ousret, the change is good. Though, could you apply those minor improvements I've pointed out?

@Ousret
Copy link
Contributor Author

Ousret commented Oct 20, 2021

I have applied the recommendations.

Copy link
Member

@webknjaz webknjaz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great, thanks!

@webknjaz webknjaz changed the title Improve the default fallback encoding detection Switch the default fallback encoding detection lib to charset-normalizer Oct 20, 2021
@webknjaz webknjaz changed the title Switch the default fallback encoding detection lib to charset-normalizer Switch default fallback encoding detection lib to charset-normalizer Oct 20, 2021
@webknjaz webknjaz enabled auto-merge (squash) October 20, 2021 10:41
@webknjaz
Copy link
Member

@Ousret I've enabled auto-merge which should get this patch into master once the tests pass. After that, we'll need to get it backported to the 3.8 branch so that it'll get into the upcoming release of aiohttp v3.8. If automatic backporting fails, you'll need to follow the bot's instructions to do this manually.

@webknjaz webknjaz merged commit 2d5597e into aio-libs:master Oct 20, 2021
@patchback
Copy link
Contributor

patchback bot commented Oct 20, 2021

Backport to 3.8: 💔 cherry-picking failed — conflicts found

❌ Failed to cleanly apply 2d5597e on top of patchback/backports/3.8/2d5597e6743bb4c579adb6f9b67482d5d35978c7/pr-5930

Backporting merged PR #5930 into master

  1. Ensure you have a local repo clone of your fork. Unless you cloned it
    from the upstream, this would be your origin remote.
  2. Make sure you have an upstream repo added as a remote too. In these
    instructions you'll refer to it by the name upstream. If you don't
    have it, here's how you can add it:
    $ git remote add upstream https://github.com/aio-libs/aiohttp.git
  3. Ensure you have the latest copy of upstream and prepare a branch
    that will hold the backported code:
    $ git fetch upstream
    $ git checkout -b patchback/backports/3.8/2d5597e6743bb4c579adb6f9b67482d5d35978c7/pr-5930 upstream/3.8
  4. Now, cherry-pick PR Switch default fallback encoding detection lib to charset-normalizer #5930 contents into that branch:
    $ git cherry-pick -x 2d5597e6743bb4c579adb6f9b67482d5d35978c7
    If it'll yell at you with something like fatal: Commit 2d5597e6743bb4c579adb6f9b67482d5d35978c7 is a merge but no -m option was given., add -m 1 as follows intead:
    $ git cherry-pick -m1 -x 2d5597e6743bb4c579adb6f9b67482d5d35978c7
  5. At this point, you'll probably encounter some merge conflicts. You must
    resolve them in to preserve the patch from PR Switch default fallback encoding detection lib to charset-normalizer #5930 as close to the
    original as possible.
  6. Push this branch to your fork on GitHub:
    $ git push origin patchback/backports/3.8/2d5597e6743bb4c579adb6f9b67482d5d35978c7/pr-5930
  7. Create a PR, ensure that the CI is green. If it's not — update it so that
    the tests and any other checks pass. This is it!
    Now relax and wait for the maintainers to process your pull request
    when they have some cycles to do reviews. Don't worry — they'll tell you if
    any improvements are necessary when the time comes!

🤖 @patchback
I'm built with octomachinery and
my source is open — https://github.com/sanitizers/patchback-github-app.

@aio-libs-github-bot
Copy link
Contributor

💔 Backport was not successful

The PR was attempted backported to the following branches:

  • ❌ 3.8: Commit could not be cherrypicked due to conflicts

webknjaz pushed a commit to webknjaz/aiohttp that referenced this pull request Oct 20, 2021
This change improves the performance of the encoding
detection by substituting the backend lib with the
new `Charset-Normalizer` (used to be `Chardet`).

The patch is backward-compatible API wise, except
that the dependency is different.

PR aio-libs#5930

Co-authored-by: Sviatoslav Sydorenko <[email protected]>
(cherry picked from commit 2d5597e)
@webknjaz
Copy link
Member

Backport PR: #6108.

webknjaz added a commit that referenced this pull request Oct 20, 2021
…tection lib to `charset-normalizer` (#6108)

Co-authored-by: Sviatoslav Sydorenko <[email protected]>
Co-authored-by: TAHRI Ahmed R <[email protected]>
@Ousret
Copy link
Contributor Author

Ousret commented Oct 20, 2021

Thanks for your time. I can see that the backport PR has already been merged.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bot:chronographer:provided There is a change note present in this PR
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants