Detect content encoding if invalid charset was specified #2549

Merged
6 commits merged into aio-libs:master from the resp-get-encoding branch
Nov 23, 2017

Conversation

decaz
Contributor

@decaz decaz commented Nov 23, 2017

What do these changes do?

  1. Add processing of invalid charsets while detecting content encoding
  2. Make the aiohttp.ClientResponse.get_encoding method public

Are there changes in behavior for the user?

Autodetection of the content encoding now works even if the content is served with an invalid charset (such a charset is treated as if it had not been provided).
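
The fallback described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the diff from this PR: `resolve_charset`, its `detect` callback, and the final `"utf-8"` default are hypothetical. The one idea taken from the discussion is that an unknown charset name surfaces as a `LookupError`, shown here with the standard library's `codecs.lookup`.

```python
import codecs


def resolve_charset(declared, detect):
    """Return an encoding name that Python can actually decode with.

    declared: charset taken from the Content-Type header, or None.
    detect:   hypothetical zero-argument callable that sniffs the body
              and returns an encoding name or None.
    """
    if declared:
        try:
            # codecs.lookup raises LookupError for unknown codec names.
            codecs.lookup(declared)
            return declared
        except LookupError:
            # Invalid charset: behave as if none had been declared.
            pass
    # Assumption for this sketch: default to UTF-8 if sniffing fails too.
    return detect() or "utf-8"
```

For example, `resolve_charset("not-a-charset", lambda: "latin-1")` ignores the bogus name and returns whatever the detector reports.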

Related issue number

There are no open issues that would be resolved by merging this change.

Checklist

  • I think the code is well written
  • Unit tests for the changes exist
  • Documentation reflects the changes
  • If you provide code modification, please add yourself to CONTRIBUTORS.txt
    • The format is <Name> <Surname>.
    • Please keep alphabetical order, the file is sorted by names.
  • Add a new news fragment into the CHANGES folder
    • name it <issue_id>.<type> for example (588.bug)
    • if you don't have an issue_id change it to the pr id after creating the pr
    • ensure type is one of the following:
      • .feature: Signifying a new feature.
      • .bugfix: Signifying a bug fix.
      • .doc: Signifying a documentation improvement.
      • .removal: Signifying a deprecation or removal of public API.
      • .misc: A ticket has been closed, but it is not of interest to users.
    • Make sure to use full sentences with correct case and punctuation, for example: "Fix issue with non-ascii contents in doctest text files."

@asvetlov
Member

asvetlov commented Nov 23, 2017

Sorry, I don't understand your use case.
How does making a private method public fix content encoding autodetection (as you mentioned in the PR's initial message)?

@decaz
Contributor Author

decaz commented Nov 23, 2017

@asvetlov I have rewritten the description :)
P.S.: there is something strange with doc-spelling on Travis: it succeeds on my machine, but it fails on CI =/

@asvetlov
Member

asvetlov commented Nov 23, 2017

Catching LookupError looks good but I still don't understand the reason for _get_encoding -> get_encoding renaming.

Local make doc-spelling doesn't check a CHANGES/xxx file.

@decaz
Contributor Author

decaz commented Nov 23, 2017

@asvetlov currently I am fetching sites by reading the content in chunks, and I have to use the "private" _get_encoding method to get the resulting content's encoding:

encoding = response._get_encoding()

So it would be very handy to have such a method as a public helper. What do you think?
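
To illustrate why the encoding is needed when reading in chunks, here is a minimal sketch that uses only the standard library. `decode_chunks` is a hypothetical helper, and its `encoding` argument stands in for whatever the response's encoding lookup returns. A multi-byte character can be split across two chunks, so the sketch uses an incremental decoder instead of calling `bytes.decode` on each chunk separately.

```python
import codecs


def decode_chunks(chunks, encoding):
    """Yield decoded text for an iterable of byte chunks.

    The incremental decoder buffers incomplete byte sequences, so a
    character split across two chunks is still decoded correctly.
    """
    decoder = codecs.getincrementaldecoder(encoding)(errors="replace")
    for chunk in chunks:
        text = decoder.decode(chunk)
        if text:
            yield text
    # Flush anything still buffered once the stream has ended.
    tail = decoder.decode(b"", final=True)
    if tail:
        yield tail
```

In a real client the chunks would come from the response stream rather than a list.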

@codecov-io

codecov-io commented Nov 23, 2017

Codecov Report

Merging #2549 into master will increase coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master    #2549      +/-   ##
==========================================
+ Coverage   97.08%   97.08%   +<.01%     
==========================================
  Files          40       40              
  Lines        8135     8141       +6     
  Branches     1438     1439       +1     
==========================================
+ Hits         7898     7904       +6     
  Misses        100      100              
  Partials      137      137

Impacted Files             Coverage Δ
aiohttp/client_reqrep.py   97.22% <100%> (+0.03%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update e180b12...994b4f6.

@asvetlov
Member

Aaah, makes sense.
Another nice-to-have feature would be a way to specify the maximum data size for sniffing, to avoid reading the whole response into memory.
UniversalDetector may help: https://chardet.readthedocs.io/en/latest/usage.html#example-detecting-encoding-incrementally
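
A size-capped sniffing loop along these lines might look like the sketch below. It is not part of this PR: `sniff_encoding` and `max_bytes` are hypothetical names, and `detector` is assumed to be any object with the `feed()`, `done`, `close()` and `result` interface that chardet's `UniversalDetector` provides.

```python
def sniff_encoding(chunks, detector, max_bytes=64 * 1024):
    """Feed at most max_bytes of data to an incremental detector.

    Returns the detected encoding name, or None if the detector
    could not decide from the bytes it was given.
    """
    seen = 0
    for chunk in chunks:
        remaining = max_bytes - seen
        if remaining <= 0:
            break  # Size cap reached: stop reading further chunks.
        piece = chunk[:remaining]
        detector.feed(piece)
        seen += len(piece)
        if detector.done:
            break  # The detector is already confident.
    detector.close()
    return detector.result.get("encoding")
```

With chardet installed, the detector would be created with `from chardet.universaldetector import UniversalDetector` and passed as `UniversalDetector()`. Passing the detector in keeps the loop itself free of third-party imports.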

@decaz
Contributor Author

decaz commented Nov 23, 2017

@asvetlov thanks for the advice! I'll think about how to implement incremental encoding detection. It may be worth implementing this at the level of the response.content.read method (adding a new parameter, for instance detect_encoding=True), which would detect the encoding incrementally, chunk by chunk, and assign it to the response.content.encoding attribute.

@asvetlov
Member

Well, let's merge the PR as is.
Please make an issue/PR for encoding detection improvements

@asvetlov asvetlov merged commit 67eb1e7 into aio-libs:master Nov 23, 2017
@asvetlov
Member

Thanks!

@decaz decaz deleted the resp-get-encoding branch November 24, 2017 17:09
@lock

lock bot commented Oct 28, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a [new issue] for related bugs.
If you feel like there are important points made in this discussion, please include those excerpts in that [new issue].
[new issue]: https://github.com/aio-libs/aiohttp/issues/new

@lock lock bot added the outdated label Oct 28, 2019
@lock lock bot locked as resolved and limited conversation to collaborators Oct 28, 2019
@psf-chronographer psf-chronographer bot added the bot:chronographer:provided There is a change note present in this PR label Oct 28, 2019