Switch default fallback encoding detection lib to charset-normalizer #5930

Merged
merged 17 commits into from
Oct 20, 2021
1 change: 1 addition & 0 deletions CHANGES/5930.feature
@@ -0,0 +1 @@
Switch from chardet to charset-normalizer when aiohttp needs to guess the body encoding
1 change: 1 addition & 0 deletions CONTRIBUTORS.txt
@@ -7,6 +7,7 @@ Adam Horacek
Adam Mills
Adrian Krupa
Adrián Chaves
Ahmed Tahri
Alan Tse
Alec Hanefeld
Alejandro Gómez
4 changes: 2 additions & 2 deletions README.rst
@@ -164,14 +164,14 @@ Requirements

- Python >= 3.7
- async-timeout_
- chardet_
- charset-normalizer_
- multidict_
- yarl_

Optionally you may install the cChardet_ and aiodns_ libraries (highly
recommended for sake of speed).

.. _chardet: https://pypi.python.org/pypi/chardet
.. _charset-normalizer: https://pypi.org/project/charset-normalizer
.. _aiodns: https://pypi.python.org/pypi/aiodns
.. _multidict: https://pypi.python.org/pypi/multidict
.. _yarl: https://pypi.python.org/pypi/yarl
2 changes: 1 addition & 1 deletion aiohttp/client_reqrep.py
@@ -70,7 +70,7 @@
try:
import cchardet as chardet
except ImportError: # pragma: no cover
import chardet # type: ignore[no-redef]
import charset_normalizer as chardet # type: ignore[no-redef]


__all__ = ("ClientRequest", "ClientResponse", "RequestInfo", "Fingerprint")
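The aliased import above works because both packages expose a chardet-style ``detect`` callable. A minimal sketch of the same fallback order, written as a function so it can be exercised without either package installed; ``load_detector`` and its ``candidates`` parameter are hypothetical names for illustration, not part of aiohttp:

```python
import importlib
from typing import Callable, Optional, Sequence, Tuple


def load_detector(
    candidates: Sequence[str] = ("cchardet", "charset_normalizer"),
) -> Tuple[Optional[str], Optional[Callable]]:
    """Return (module_name, detect) for the first importable candidate.

    Hypothetical helper: it mirrors the try/except import in the hunk above,
    preferring cchardet and falling back to charset_normalizer.
    """
    for name in candidates:
        try:
            module = importlib.import_module(name)
        except ImportError:
            # Candidate is not installed; try the next one.
            continue
        detect = getattr(module, "detect", None)
        if callable(detect):
            return name, detect
    # Neither candidate is available in this environment.
    return None, None


if __name__ == "__main__":
    name, detect = load_detector()
    print("detector in use:", name)
```

Returning the module name alongside the callable makes it easy to log which detector was actually picked up at runtime.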
16 changes: 8 additions & 8 deletions docs/client_reference.rst
@@ -1374,10 +1374,10 @@ Response object
specified *encoding* parameter.

If *encoding* is ``None`` content encoding is autocalculated
using ``Content-Type`` HTTP header and *chardet* tool if the
using ``Content-Type`` HTTP header and *charset-normalizer* tool if the
header is not provided by server.

:term:`cchardet` is used with fallback to :term:`chardet` if
:term:`cchardet` is used with fallback to :term:`charset-normalizer` if
*cchardet* is not available.

Close underlying connection if data reading gets an error,
@@ -1389,14 +1389,14 @@ Response object

:return str: decoded *BODY*

:raise LookupError: if the encoding detected by chardet or cchardet is
:raise LookupError: if the encoding detected by cchardet is
unknown by Python (e.g. VISCII).

.. note::

If response has no ``charset`` info in ``Content-Type`` HTTP
header :term:`cchardet` / :term:`chardet` is used for content
encoding autodetection.
header :term:`cchardet` / :term:`charset-normalizer` is used for
content encoding autodetection.

It may hurt performance. If page encoding is known passing
explicit *encoding* parameter might help::
@@ -1411,7 +1411,7 @@ Response object
a ``read`` call will be done,

If *encoding* is ``None`` content encoding is autocalculated
using :term:`cchardet` or :term:`chardet` as fallback if
using :term:`cchardet` or :term:`charset-normalizer` as fallback if
*cchardet* is not available.

if response's `content-type` does not match `content_type` parameter
@@ -1449,11 +1449,11 @@ Response object
Automatically detect content encoding using ``charset`` info in
``Content-Type`` HTTP header. If this info does not exist or there
are no appropriate codecs for encoding then :term:`cchardet` /
:term:`chardet` is used.
:term:`charset-normalizer` is used.

Beware that it is not always safe to use the result of this function to
decode a response. Some encodings detected by cchardet are not known by
Python (e.g. VISCII).
Python (e.g. VISCII). *charset-normalizer* is not affected by that issue.

:raise RuntimeError: if called before the body has been read,
for :term:`cchardet` usage
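The lookup order documented in this hunk can be sketched roughly as follows. This is an illustration under stated assumptions, not aiohttp's implementation: ``charset_from_content_type`` and ``guess_encoding`` are hypothetical names, the detector is injected as a chardet-style callable returning a dict with an ``"encoding"`` key, and the final ``"utf-8"`` default is an assumption.

```python
import codecs
from email.message import Message
from typing import Callable, Dict, Optional


def charset_from_content_type(content_type: str) -> Optional[str]:
    """Extract the charset parameter from a Content-Type value, if any."""
    msg = Message()
    msg["Content-Type"] = content_type
    # Returns the lowercased charset, or None when the parameter is absent.
    return msg.get_content_charset()


def guess_encoding(
    content_type: str,
    body: bytes,
    detect: Callable[[bytes], Dict[str, Optional[str]]],
) -> str:
    """Hypothetical helper following the documented order.

    1. Use the header charset when Python has a codec for it.
    2. Otherwise ask the detector (cchardet or charset-normalizer style).
    3. Fall back to "utf-8" (an assumption made for this sketch).
    """
    charset = charset_from_content_type(content_type)
    if charset:
        try:
            codecs.lookup(charset)
        except LookupError:
            pass  # Unknown codec name: fall through to detection.
        else:
            return charset
    detected = detect(body).get("encoding")
    return detected or "utf-8"


if __name__ == "__main__":
    # Stand-in detector so the example runs without third-party packages.
    fake_detect = lambda data: {"encoding": "ascii"}
    print(guess_encoding("text/html; charset=UTF-8", b"abc", fake_detect))
    print(guess_encoding("text/html", b"abc", fake_detect))
```

Injecting the detector keeps the sketch independent of which detection library is installed; in practice it would be the ``detect`` function of whichever module was imported.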
7 changes: 4 additions & 3 deletions docs/glossary.rst
@@ -45,11 +45,12 @@
Any object that can be called. Use :func:`callable` to check
that.

chardet
charset-normalizer

The Universal Character Encoding Detector
The Real First Universal Charset Detector.
Open, modern and actively maintained alternative to Chardet.

https://pypi.python.org/pypi/chardet/
https://pypi.org/project/charset-normalizer/

cchardet

8 changes: 4 additions & 4 deletions docs/index.rst
@@ -34,7 +34,7 @@ Library Installation
$ pip install aiohttp

You may want to install *optional* :term:`cchardet` library as faster
replacement for :term:`chardet`:
replacement for :term:`charset-normalizer`:

.. code-block:: bash

@@ -51,7 +51,7 @@ This option is highly recommended:
Installing speedups altogether
------------------------------

The following will get you ``aiohttp`` along with :term:`chardet`,
The following will get you ``aiohttp`` along with :term:`charset-normalizer`,
:term:`aiodns` and ``Brotli`` in one bundle. No need to type
separate commands anymore!

@@ -148,11 +148,11 @@ Dependencies

- Python 3.7+
- *async_timeout*
- *chardet*
- *charset-normalizer*
- *multidict*
- *yarl*
- *Optional* :term:`cchardet` as faster replacement for
:term:`chardet`.
:term:`charset-normalizer`.

Install it explicitly via:

2 changes: 2 additions & 0 deletions docs/spelling_wordlist.txt
@@ -124,6 +124,8 @@ cchardet
ceil
charset
charsetdetect
normalizer
Chardet
chunked
chunking
cls
2 changes: 1 addition & 1 deletion requirements/base.txt
@@ -6,7 +6,7 @@ async-timeout==4.0.0a3
asynctest==0.13.0; python_version<"3.8"
Brotli==1.0.9
cchardet==2.1.7
chardet==4.0.0
charset-normalizer==2.0.4
frozenlist==1.1.1
gunicorn==20.1.0
typing_extensions==3.7.4.3
2 changes: 1 addition & 1 deletion requirements/dev.txt
@@ -46,7 +46,7 @@ cfgv==3.2.0
# via
# -r requirements/lint.txt
# pre-commit
chardet==4.0.0
charset-normalizer==2.0.4
# via
# -r requirements/base.txt
# requests
2 changes: 1 addition & 1 deletion setup.py
@@ -50,7 +50,7 @@
raise RuntimeError("Unable to determine version.")

install_requires = [
"chardet>=2.0,<5.0",
"charset-normalizer>=2.0,<3.0",
"multidict>=4.5,<7.0",
"async_timeout>=4.0a2,<5.0",
'asynctest==0.13.0; python_version<"3.8"',