From db9af4323fee938d6bdfed3aae20339da26ead11 Mon Sep 17 00:00:00 2001 From: TAHRI Ahmed R Date: Mon, 6 Mar 2023 07:46:55 +0100 Subject: [PATCH] Release 3.1 (#270) --- CHANGELOG.md | 7 ++- README.md | 50 ++++++++++---------- bin/run_autofix.sh | 2 +- bin/run_checks.sh | 2 +- charset_normalizer/version.py | 2 +- docs/community/faq.rst | 19 +++++++- docs/user/support.rst | 86 +++++++++++++++++++---------------- 7 files changed, 99 insertions(+), 69 deletions(-) diff --git a/CHANGELOG.md b/CHANGELOG.md index 87282c6c..e3763752 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -2,14 +2,17 @@ All notable changes to charset-normalizer will be documented in this file. This project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html). The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/). -## [3.1.0-dev0](https://github.com/Ousret/charset_normalizer/compare/3.0.1...master) (unreleased) +## [3.1.0](https://github.com/Ousret/charset_normalizer/compare/3.0.1...3.1.0) (2023-03-06) ### Added -- Argument `should_rename_legacy` for legacy function `detect` and disregard any new arguments without errors (PR #261) +- Argument `should_rename_legacy` for legacy function `detect` and disregard any new arguments without errors (PR #262) ### Removed - Support for Python 3.6 (PR #260) +### Changed +- Optional speedup provided by mypy/c 1.0.1 + ## [3.0.1](https://github.com/Ousret/charset_normalizer/compare/3.0.0...3.0.1) (2022-11-18) ### Fixed diff --git a/README.md b/README.md index c5ce43a6..f196dc66 100644 --- a/README.md +++ b/README.md @@ -23,18 +23,18 @@ This project offers you an alternative to **Universal Charset Encoding Detector**, also known as **Chardet**. -| Feature | [Chardet](https://github.com/chardet/chardet) | Charset Normalizer | [cChardet](https://github.com/PyYoshi/cChardet) | -| ------------- | :-------------: | :------------------: | :------------------: | -| `Fast` | ❌
| ✅
| ✅
| -| `Universal**` | ❌ | ✅ | ❌ | -| `Reliable` **without** distinguishable standards | ❌ | ✅ | ✅ | -| `Reliable` **with** distinguishable standards | ✅ | ✅ | ✅ | -| `License` | LGPL-2.1
_restrictive_ | MIT | MPL-1.1
_restrictive_ | -| `Native Python` | ✅ | ✅ | ❌ | -| `Detect spoken language` | ❌ | ✅ | N/A | -| `UnicodeDecodeError Safety` | ❌ | ✅ | ❌ | -| `Whl Size` | 193.6 kB | 39.5 kB | ~200 kB | -| `Supported Encoding` | 33 | :tada: [90](https://charset-normalizer.readthedocs.io/en/latest/user/support.html#supported-encodings) | 40 +| Feature | [Chardet](https://github.com/chardet/chardet) | Charset Normalizer | [cChardet](https://github.com/PyYoshi/cChardet) | +|--------------------------------------------------|:---------------------------------------------:|:------------------------------------------------------------------------------------------------------:|:-----------------------------------------------:| +| `Fast` | ❌
| ✅
| ✅
| +| `Universal**` | ❌ | ✅ | ❌ | +| `Reliable` **without** distinguishable standards | ❌ | ✅ | ✅ | +| `Reliable` **with** distinguishable standards | ✅ | ✅ | ✅ | +| `License` | LGPL-2.1
_restrictive_ | MIT | MPL-1.1
_restrictive_ | +| `Native Python` | ✅ | ✅ | ❌ | +| `Detect spoken language` | ❌ | ✅ | N/A | +| `UnicodeDecodeError Safety` | ❌ | ✅ | ❌ | +| `Whl Size` | 193.6 kB | 39.5 kB | ~200 kB | +| `Supported Encoding` | 33 | :tada: [90](https://charset-normalizer.readthedocs.io/en/latest/user/support.html#supported-encodings) | 40 |

Reading Normalized TextCat Reading Text @@ -50,15 +50,15 @@ Did you got there because of the logs? See [https://charset-normalizer.readthedo This package offer better performance than its counterpart Chardet. Here are some numbers. -| Package | Accuracy | Mean per file (ms) | File per sec (est) | -| ------------- | :-------------: | :------------------: | :------------------: | -| [chardet](https://github.com/chardet/chardet) | 86 % | 200 ms | 5 file/sec | -| charset-normalizer | **98 %** | **10 ms** | 100 file/sec | +| Package | Accuracy | Mean per file (ms) | File per sec (est) | +|-----------------------------------------------|:--------:|:------------------:|:------------------:| +| [chardet](https://github.com/chardet/chardet) | 86 % | 200 ms | 5 file/sec | +| charset-normalizer | **98 %** | **10 ms** | 100 file/sec | -| Package | 99th percentile | 95th percentile | 50th percentile | -| ------------- | :-------------: | :------------------: | :------------------: | -| [chardet](https://github.com/chardet/chardet) | 1200 ms | 287 ms | 23 ms | -| charset-normalizer | 100 ms | 50 ms | 5 ms | +| Package | 99th percentile | 95th percentile | 50th percentile | +|-----------------------------------------------|:---------------:|:---------------:|:---------------:| +| [chardet](https://github.com/chardet/chardet) | 1200 ms | 287 ms | 23 ms | +| charset-normalizer | 100 ms | 50 ms | 5 ms | Chardet's performance on larger file (1MB+) are very poor. Expect huge difference on large payload. @@ -185,15 +185,15 @@ Don't confuse package **ftfy** with charset-normalizer or chardet. ftfy goal is ## 🍰 How - Discard all charset encoding table that could not fit the binary content. - - Measure chaos, or the mess once opened (by chunks) with a corresponding charset encoding. + - Measure noise, or the mess once opened (by chunks) with a corresponding charset encoding. - Extract matches with the lowest mess detected. - Additionally, we measure coherence / probe for a language. 
-**Wait a minute**, what is chaos/mess and coherence according to **YOU ?** +**Wait a minute**, what is noise/mess and coherence according to **YOU ?** -*Chaos :* I opened hundred of text files, **written by humans**, with the wrong encoding table. **I observed**, then +*Noise :* I opened hundred of text files, **written by humans**, with the wrong encoding table. **I observed**, then **I established** some ground rules about **what is obvious** when **it seems like** a mess. - I know that my interpretation of what is chaotic is very subjective, feel free to contribute in order to + I know that my interpretation of what is noise is probably incomplete, feel free to contribute in order to improve or rewrite it. *Coherence :* For each language there is on earth, we have computed ranked letter appearance occurrences (the best we can). So I thought @@ -226,7 +226,7 @@ This project is [MIT](https://github.com/Ousret/charset_normalizer/blob/master/L Characters frequencies used in this project © 2012 [Denny Vrandečić](http://simia.net/letters/) -## For Enterprise +## 💼 For Enterprise Professional support for charset-normalizer is available as part of the [Tidelift Subscription][1]. 
Tidelift gives software development teams a single source for diff --git a/bin/run_autofix.sh b/bin/run_autofix.sh index e88f45c6..ac4fe2cd 100755 --- a/bin/run_autofix.sh +++ b/bin/run_autofix.sh @@ -7,5 +7,5 @@ fi set -x -${PREFIX}black --target-version=py36 charset_normalizer +${PREFIX}black --target-version=py37 charset_normalizer ${PREFIX}isort charset_normalizer diff --git a/bin/run_checks.sh b/bin/run_checks.sh index 1e135b35..d5fc34a4 100755 --- a/bin/run_checks.sh +++ b/bin/run_checks.sh @@ -8,7 +8,7 @@ fi set -x ${PREFIX}pytest -${PREFIX}black --check --diff --target-version=py36 charset_normalizer +${PREFIX}black --check --diff --target-version=py37 charset_normalizer ${PREFIX}flake8 charset_normalizer ${PREFIX}mypy charset_normalizer ${PREFIX}isort --check --diff charset_normalizer diff --git a/charset_normalizer/version.py b/charset_normalizer/version.py index b6b8fa50..b74c2643 100644 --- a/charset_normalizer/version.py +++ b/charset_normalizer/version.py @@ -2,5 +2,5 @@ Expose version """ -__version__ = "3.1.0-dev0" +__version__ = "3.1.0" VERSION = __version__.split(".") diff --git a/docs/community/faq.rst b/docs/community/faq.rst index c52f223a..d1deff4d 100644 --- a/docs/community/faq.rst +++ b/docs/community/faq.rst @@ -40,7 +40,7 @@ If you use the legacy `detect` function, Then this change is mostly backward-compatible, exception of a thing: - This new library support way more code pages (x3) than its counterpart Chardet. - - Based on the 30-ich charsets that Chardet support, expect roughly 85% BC results https://github.com/Ousret/charset_normalizer/pull/77/checks?check_run_id=3244585065 +- Based on the 30-ich charsets that Chardet support, expect roughly 80% BC results We do not guarantee this BC exact percentage through time. May vary but not by much. @@ -56,3 +56,20 @@ detection. Any code page supported by your cPython is supported by charset-normalizer! It is that simple, no need to update the library. It is as generic as we could do. 
+ +I can't build standalone executable +----------------------------------- + +If you are using ``pyinstaller``, ``py2exe`` or alike, you may be encountering this or close to: + + ModuleNotFoundError: No module named 'charset_normalizer.md__mypyc' + +Why? + +- Your package manager picked up a optimized (for speed purposes) wheel that match your architecture and operating system. +- Finally, the module ``charset_normalizer.md__mypyc`` is imported via binaries and can't be seen using your tool. + +How to remedy? + +If your bundler program support it, set up a hook that implicitly import the hidden module. +Otherwise, follow the guide on how to install the vanilla version of this package. (Section: *Optional speedup extension*) diff --git a/docs/user/support.rst b/docs/user/support.rst index 3de51f28..ac10d653 100644 --- a/docs/user/support.rst +++ b/docs/user/support.rst @@ -124,41 +124,51 @@ Supported Languages Those language can be detected inside your content. All of these are specified in ./charset_normalizer/assets/__init__.py . -English, -German, -French, -Dutch, -Italian, -Polish, -Spanish, -Russian, -Japanese, -Portuguese, -Swedish, -Chinese, -Ukrainian, -Norwegian, -Finnish, -Vietnamese, -Czech, -Hungarian, -Korean, -Indonesian, -Turkish, -Romanian, -Farsi, -Arabic, -Danish, -Serbian, -Lithuanian, -Slovene, -Slovak, -Malay, -Hebrew, -Bulgarian, -Croatian, -Hindi, -Estonian, -Thai, -Greek, -Tamil. +| English, +| German, +| French, +| Dutch, +| Italian, +| Polish, +| Spanish, +| Russian, +| Japanese, +| Portuguese, +| Swedish, +| Chinese, +| Ukrainian, +| Norwegian, +| Finnish, +| Vietnamese, +| Czech, +| Hungarian, +| Korean, +| Indonesian, +| Turkish, +| Romanian, +| Farsi, +| Arabic, +| Danish, +| Serbian, +| Lithuanian, +| Slovene, +| Slovak, +| Malay, +| Hebrew, +| Bulgarian, +| Croatian, +| Hindi, +| Estonian, +| Thai, +| Greek, +| Tamil. 
+ +---------------------------- +Incomplete Sequence / Stream +---------------------------- + +It is not (yet) officially supported. If you feed an incomplete byte sequence (e.g. a truncated multi-byte sequence), the detector will +most likely fail to return a proper result. +If you are purposely feeding only part of your payload for performance reasons, you can stop doing so, as this package is fairly optimized. + +We are working on a dedicated way to handle streams.
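
To make the "Incomplete Sequence / Stream" note concrete, here is a minimal sketch that uses only the Python standard library and does not call charset-normalizer at all. It shows why a payload cut in the middle of a multi-byte character no longer decodes cleanly, and how chunked input can be decoded safely when the encoding is already known. The helper names `decodes_cleanly` and `decode_stream` are hypothetical and are not part of this package.

```python
import codecs

# "é" takes two bytes in UTF-8; dropping the last byte leaves an
# incomplete multi-byte sequence at the end of the payload.
complete = "café".encode("utf-8")
truncated = complete[:-1]


def decodes_cleanly(payload: bytes, encoding: str = "utf-8") -> bool:
    """Hypothetical helper: True if payload decodes strictly with encoding."""
    try:
        payload.decode(encoding)
    except UnicodeDecodeError:
        return False
    return True


def decode_stream(chunks, encoding: str = "utf-8") -> str:
    """Hypothetical helper: decode chunks of a stream whose encoding is known.

    The incremental decoder buffers bytes that are split across chunk
    boundaries instead of failing on them.
    """
    decoder = codecs.getincrementaldecoder(encoding)()
    parts = [decoder.decode(chunk) for chunk in chunks]
    # final=True raises UnicodeDecodeError if bytes are still pending.
    parts.append(decoder.decode(b"", final=True))
    return "".join(parts)
```

Under these assumptions, `decodes_cleanly(truncated)` is false while `decodes_cleanly(complete)` is true, which is the situation the note warns about. `decode_stream` only helps once the encoding is known, so it does not replace detection on a complete payload.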