Switch default fallback encoding detection lib to charset-normalizer
#5930
Conversation
Happens when given too few bytes.
This pull request introduces 1 alert when merging 55d585f into a341986 - view on LGTM.com

new alerts:
1 for Module is imported more than once
Ping. Would you be interested in such a change in the near future? @Dreamsorcerer @webknjaz
Regards,
I'm not really familiar with either library, so I don't think I'm in a place to make a decision on this. Although it does sound promising. I've enabled the CI though, so looks like it needs some minor reformatting for flake8.
flake8 hint for double-quoted strings
The RTD build failed with a syntax error at runtime. I cannot restart the task.
Codecov Report
@@ Coverage Diff @@
## master #5930 +/- ##
=======================================
Coverage 93.31% 93.31%
=======================================
Files 102 102
Lines 30062 30062
Branches 2690 2690
=======================================
Hits 28053 28053
Misses 1833 1833
Partials 176 176
charset-normalizer v2.0.5 no longer raises warnings on tiny content
Should be fixed now. Just need to merge master.
Merged.
@asvetlov is there anything that could be done to ease the road toward approval? I would gladly help in any way I can.
I ran some basic tests and got some measurements to share. Everything went well, as expected; no crashes.

Context

The run boils down to three steps: prepare the environment, collect results, and memory sampling (to be run in parallel with the collection), as sketched below.
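The exact commands are not preserved here, so this is only a rough, hypothetical sketch of such a comparison; the samples directory, the helper name, and the use of tracemalloc are placeholders, not the setup actually used.

```python
# Hypothetical reproduction sketch -- not the script behind the numbers below.
# Assumes chardet and charset-normalizer are installed and ./samples holds
# the text files to analyze.
import pathlib
import time
import tracemalloc

import chardet
import charset_normalizer


def measure(detector, payloads):
    """Time one detector over every payload and record the peak traced memory."""
    tracemalloc.start()
    start = time.perf_counter()
    for payload in payloads:
        detector(payload)
    elapsed = time.perf_counter() - start
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return elapsed, peak


payloads = [p.read_bytes() for p in pathlib.Path("samples").iterdir() if p.is_file()]

for name, detector in (
    ("chardet", lambda data: chardet.detect(data)["encoding"]),
    ("charset-normalizer", lambda data: charset_normalizer.from_bytes(data).best()),
):
    elapsed, peak = measure(detector, payloads)
    print(f"{name}: {elapsed:.2f}s, peak {peak / 1048576:.1f} MiB")
```

Note that tracemalloc only tracks Python-level allocations, so it approximates rather than reproduces the RSS peaks quoted below.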
RAW performance

This comes as no surprise, as it was already proven previously: 5 times faster while keeping 88% (410 / 464 files) fully backward-compatible results.

RAM footprint

Following issue #4112 it could be a good idea to showcase that aspect. Below is the peak usage for each flavor: 1.7 times less memory consumption (on peaks) for a roughly 5 times faster guess output (on average).

Other

Keep in mind that the actual delays may vary depending on your CPU capabilities. You may find the RAW console outputs for both flavors at https://gist.github.com/Ousret/ec22b8842a42fb74276b052b9178dcc4
Hi @Ousret, the change is good. Though, could you apply those minor improvements I've pointed out?
Co-authored-by: Sviatoslav Sydorenko <[email protected]>
…tch-charset-detection
I have applied the recommendations.
This is great, thanks!
@Ousret I've enabled auto-merge which should get this patch into |
Backport to 3.8: 💔 cherry-picking failed — conflicts found

❌ Failed to cleanly apply 2d5597e on top of patchback/backports/3.8/2d5597e6743bb4c579adb6f9b67482d5d35978c7/pr-5930

Backporting merged PR #5930 into master

🤖 @patchback
💔 Backport was not successful. The PR was attempted to be backported to the following branches:
This change improves the performance of the encoding detection by substituting the backend lib with the new `Charset-Normalizer` (used to be `Chardet`). The patch is backward-compatible API-wise, except that the dependency is different.

PR aio-libs#5930

Co-authored-by: Sviatoslav Sydorenko <[email protected]>
(cherry picked from commit 2d5597e)
Backport PR: #6108.
…tection lib to `charset-normalizer` (#6108) Co-authored-by: Sviatoslav Sydorenko <[email protected]> Co-authored-by: TAHRI Ahmed R <[email protected]>
Thanks for your time. I can see that the backport PR has already been merged. |
What do these changes do?
Switch Chardet dependency to Charset-Normalizer for the fallback encoding detection.
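As a rough illustration of what the switch means at the call site (a hedged sketch, not the actual aiohttp patch): charset-normalizer exposes a chardet-compatible detect() helper, so the fallback code path barely changes.

```python
# Sketch only: illustrates the drop-in nature of the swap, not the real diff.
# charset-normalizer ships a chardet-compatible detect() helper.
try:
    from charset_normalizer import detect  # new fallback detector
except ImportError:
    from chardet import detect  # previous fallback detector

payload = "naïve café".encode("cp1252")

# Both libraries return a dict with "encoding", "confidence" and "language"
# keys; the actual values depend on the heuristic and the input.
result = detect(payload)
print(result["encoding"], result["confidence"])
```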
Are there changes in behavior for the user?
This change is mostly backward-compatible, with one exception: the underlying dependency changes, so the detected encoding may differ in some cases (see below).
Why should you bother with such a change? Is it worth it?
Short answer, absolutely.
Long answer:
The two libraries can return different guesses for the same content. For example, typical disagreements involve Windows-1252 versus Windows-1254, and ISO-8859-7 versus utf_8, with Windows-1252 and utf_8 being the guesses for charset-normalizer.

requests did integrate it first and, for total transparency, the lib needed some minor adjustments. But it is going well so far. It's still a heuristic lib, therefore it cannot be trusted blindly, of course.
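For illustration, here is a toy way to compare the two heuristics on the same bytes; the sample string and its encoding are made up for this example and are not taken from any benchmark set mentioned above.

```python
# Toy comparison of both heuristics on the same non-UTF-8 payload.
# The sample text is arbitrary; real results depend heavily on input size.
import chardet
import charset_normalizer

payload = "Ceci n'est pas de l'UTF-8 : déjà vu, très tôt.".encode("cp1252")

print("chardet           :", chardet.detect(payload)["encoding"])

best = charset_normalizer.from_bytes(payload).best()
print("charset-normalizer:", best.encoding if best else None)
```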
Is UTF-8 everywhere already?
Not really; that is a dangerous assumption. Looking at https://w3techs.com/technologies/overview/character_encoding may make it seem like encoding detection is a thing of the past, but not really. Based solely on 33k websites, you will find
3.4k responses without a predefined encoding, of which 1.8k websites were not UTF-8, roughly half! (Top 1000 sites from 80 countries in the world according to Data for SEO) https://github.com/potiuk/test-charset-normalizer
This statistic (w3techs) does not apply any weighting, so one should not read it as
"I have a 97% chance of hitting UTF-8 content when fetching HTML content".
First of all, neither aiohttp, chardet nor charset-normalizer is dedicated to HTML content. The detection concerns every text document (SubRip subtitles, for example).
It is hard to find any stats at all regarding this matter. Usage patterns are very dispersed, so making assumptions is unwise.
The real debate is whether the detection is an HTTP client's concern at all. That is more complicated and not my field.
Related issue number
No related issue.
Checklist
Add yourself to CONTRIBUTORS.txt
Add a news fragment into the CHANGES folder