Memory efficient encoding detection #4112
Yes, sure.
Unfortunately, right now I don't have time; maybe later. I've created this issue just so we don't forget about it.
@asvetlov @decaz I was taking a quick look into this and have some open questions. As far as I see, the
We can replace `resp.read()`
At the time of @decaz's original PR the code was different; however, as of now, see aiohttp/client_reqrep.py, lines 927 to 944 at a54956d.
Is this the strategy you are thinking of?
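To make the strategy concrete, here is a minimal sketch of capped incremental sniffing, assuming plain `chardet`. The function name `sniff_encoding` and the `max_size` cap are hypothetical illustrations (not an existing aiohttp API); `chunks` stands in for whatever chunked reads the response stream provides.

```python
import chardet  # assumption: plain chardet; cchardet exposes the same API


def sniff_encoding(chunks, max_size=64 * 1024):
    """Feed byte chunks to chardet incrementally, stopping early.

    `chunks` is any iterable of bytes objects (e.g. chunks read from a
    response stream); `max_size` is a hypothetical cap on how many bytes
    are sniffed, so the whole body never needs to be held in memory.
    """
    detector = chardet.UniversalDetector()
    seen = 0
    for chunk in chunks:
        detector.feed(chunk)
        seen += len(chunk)
        # Stop as soon as the detector is confident or the cap is reached.
        if detector.done or seen >= max_size:
            break
    detector.close()
    return detector.result.get("encoding")
```

The cap bounds both the detector's work and how much of the body must be read before a decision is made.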
Sorry for the misleading statement in my previous message.
P.S.
A memoryview cannot be passed into `detector.feed` directly. I had to use https://docs.python.org/3/library/stdtypes.html#memoryview.tobytes, which has the same effect as directly slicing the bytes object. I have done a quick test with `memory_profiler`:

```python
from memory_profiler import profile

try:
    import cchardet as chardet
except ImportError:
    import chardet

import urllib.request

global _bytes


@profile
def chunked_with_memoryview(chunk_size: int = 2 * 1024):
    detector = chardet.UniversalDetector()
    _body_memoryview = memoryview(_bytes)
    # print(len(_bytes) == len(_body_memoryview))
    for i in range((len(_body_memoryview) // chunk_size) + 1):
        _chunk = _body_memoryview[i * chunk_size:(i + 1) * chunk_size]
        _chunk_bytes = _chunk.tobytes()
        detector.feed(_chunk_bytes)
        del _chunk_bytes
        if detector.done:
            print("chunk " + str(i) + " reached")
            break
    detector.close()
    print(detector.result)


@profile
def without_chunking():
    print(chardet.detect(_bytes)['encoding'])


if __name__ == "__main__":
    global _bytes
    # SHIFT-JIS, GB18030, KOI8-R, UHC/EUC-KR
    for website in ['https://www.jalan.net/', 'http://ctrip.com',
                    'http://nn.ru', 'https://www.incruit.com/']:
        response = urllib.request.urlopen(website)
        _bytes = response.read()
        chunked_with_memoryview()
        without_chunking()
```

Results with cchardet: https://gist.github.com/Transfusion/b2075b7a08863c3e5b5afc96b119c29d. I see neither a significant decrease nor a significant increase in memory usage with cchardet.
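For context on the `.tobytes()` point above: slicing a memoryview is zero-copy (the slice still refers to the original buffer), while `.tobytes()` materialises a new bytes object, just like slicing the bytes directly. A small stdlib-only illustration:

```python
data = b"abcdefgh" * 1024           # 8 KiB buffer
mv = memoryview(data)

chunk_view = mv[0:2048]             # zero-copy: still backed by `data`
assert chunk_view.obj is data       # the view points at the original buffer

chunk_bytes = chunk_view.tobytes()  # copies: a brand-new bytes object
assert chunk_bytes == data[0:2048]  # same content as slicing the bytes
assert chunk_bytes is not data      # ...but a separate object in memory
```

So feeding `tobytes()` chunks to the detector still copies each chunk, but only one chunk at a time is resident rather than a second full copy of the body.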
I'm curious: what is the performance improvement, if any?
The recommended libraries for charset encoding seem to outperform these libraries, plus charset detection seems to only be used when we already have the full body, so I don't think there's any value in looking at this any further. |
In #2549, @asvetlov proposed a useful improvement to content-encoding detection: prevent reading the whole response into memory by allowing a maximum data size for sniffing. `UniversalDetector` can be used for this: https://chardet.readthedocs.io/en/latest/usage.html#example-detecting-encoding-incrementally
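The incremental pattern from the linked chardet documentation looks roughly like this (a minimal sketch; the sample payload is made up for illustration):

```python
import chardet

detector = chardet.UniversalDetector()
# A made-up sample body: Russian text encoded as KOI8-R, repeated.
payload = "пример текста для определения кодировки ".encode("koi8-r") * 200

# Feed data in fixed-size chunks; the detector sets `done` once it is
# confident, so a large body need not be scanned in full.
for i in range(0, len(payload), 1024):
    detector.feed(payload[i:i + 1024])
    if detector.done:
        break
detector.close()
print(detector.result)
```

`detector.result` is a dict with `encoding`, `confidence`, and `language` keys, which is what makes the early-exit, bounded-memory sniffing proposed here possible.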