Memory efficient encoding detection #4112
Yes, sure.
Unfortunately, right now I don't have time; maybe later. I've created this issue just so we don't forget about it.
@asvetlov @decaz I was taking a quick look into this and have some open questions. As far as I see, the
We can replace `resp.read()`
At the time of @decaz's original PR the code was different; however, as of now, see aiohttp/client_reqrep.py, lines 927 to 944 at a54956d.
Is this the strategy you are thinking of?
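To make the strategy concrete, here is a minimal sketch of capped incremental sniffing, assuming plain `chardet`. The function name `sniff_encoding` and the `max_size` cap are hypothetical illustrations (not an existing aiohttp API); `chunks` stands in for whatever chunked reads the response stream provides.

```python
import chardet  # assumption: plain chardet; cchardet exposes the same API


def sniff_encoding(chunks, max_size=64 * 1024):
    """Feed byte chunks to chardet incrementally, stopping early.

    `chunks` is any iterable of bytes objects (e.g. chunks read from a
    response stream); `max_size` is a hypothetical cap on how many bytes
    are sniffed, so the whole body never needs to be held in memory.
    """
    detector = chardet.UniversalDetector()
    seen = 0
    for chunk in chunks:
        detector.feed(chunk)
        seen += len(chunk)
        # Stop as soon as the detector is confident or the cap is reached.
        if detector.done or seen >= max_size:
            break
    detector.close()
    return detector.result.get("encoding")
```

The cap bounds both the detector's work and how much of the body must be read before a decision is made.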
Sorry for the misleading statement in my previous message.
P.S.
A memoryview cannot be passed into `detector.feed` directly. I had to use https://docs.python.org/3/library/stdtypes.html#memoryview.tobytes, which has the same effect as directly slicing the bytes object. I have done a quick test with `memory_profiler`:

```python
from memory_profiler import profile

try:
    import cchardet as chardet
except ImportError:
    import chardet

import urllib.request

global _bytes


@profile
def chunked_with_memoryview(chunk_size: int = 2 * 1024):
    detector = chardet.UniversalDetector()
    _body_memoryview = memoryview(_bytes)
    # print(len(_bytes) == len(_body_memoryview))
    for i in range((len(_body_memoryview) // chunk_size) + 1):
        _chunk = _body_memoryview[i * chunk_size:(i + 1) * chunk_size]
        _chunk_bytes = _chunk.tobytes()
        detector.feed(_chunk_bytes)
        del _chunk_bytes
        if detector.done:
            print("chunk " + str(i) + " reached")
            break
    detector.close()
    print(detector.result)


@profile
def without_chunking():
    print(chardet.detect(_bytes)['encoding'])


if __name__ == "__main__":
    global _bytes
    # SHIFT-JIS, GB18030, KOI8-R, UHC/EUC-KR
    for website in ['https://www.jalan.net/', 'http://ctrip.com',
                    'http://nn.ru', 'https://www.incruit.com/']:
        response = urllib.request.urlopen(website)
        _bytes = response.read()
        chunked_with_memoryview()
        without_chunking()
```

Results with cchardet: https://gist.github.com/Transfusion/b2075b7a08863c3e5b5afc96b119c29d. I see neither a significant decrease nor a significant increase in memory usage with cchardet.
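For context on the `.tobytes()` point above: slicing a memoryview is zero-copy (the slice still refers to the original buffer), while `.tobytes()` materialises a new bytes object, just like slicing the bytes directly. A small stdlib-only illustration:

```python
data = b"abcdefgh" * 1024           # 8 KiB buffer
mv = memoryview(data)

chunk_view = mv[0:2048]             # zero-copy: still backed by `data`
assert chunk_view.obj is data       # the view points at the original buffer

chunk_bytes = chunk_view.tobytes()  # copies: a brand-new bytes object
assert chunk_bytes == data[0:2048]  # same content as slicing the bytes
assert chunk_bytes is not data      # ...but a separate object in memory
```

So feeding `tobytes()` chunks to the detector still copies each chunk, but only one chunk at a time is resident rather than a second full copy of the body.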
I'm curious: what is the performance improvement, if any?
The recommended libraries for charset encoding seem to outperform these libraries, plus charset detection seems to only be used when we already have the full body, so I don't think there's any value in looking at this any further. |
In #2549, @asvetlov proposed a useful improvement to content-encoding detection: prevent reading the whole response into memory by allowing a maximum data size for sniffing. `UniversalDetector` can be used for this: https://chardet.readthedocs.io/en/latest/usage.html#example-detecting-encoding-incrementally
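The incremental pattern from the linked chardet documentation looks roughly like this (a minimal sketch; the sample payload is made up for illustration):

```python
import chardet

detector = chardet.UniversalDetector()
# A made-up sample body: Russian text encoded as KOI8-R, repeated.
payload = "пример текста для определения кодировки ".encode("koi8-r") * 200

# Feed data in fixed-size chunks; the detector sets `done` once it is
# confident, so a large body need not be scanned in full.
for i in range(0, len(payload), 1024):
    detector.feed(payload[i:i + 1024])
    if detector.done:
        break
detector.close()
print(detector.result)
```

`detector.result` is a dict with `encoding`, `confidence`, and `language` keys, which is what makes the early-exit, bounded-memory sniffing proposed here possible.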