Large responses cause increased memory usage. #145
Comments
Digging more - the call to:
In
ISTM that there might be an extra copy of the body attached to the response, making things worse.
Turns out there is a simple issue that might give us most of the possible savings. For patched code and bare requests I get:
For original code I get:
FWIW, I see more fluctuations in memory use even in the patched version (as opposed to raw requests) when watching the process in task manager, but it doesn't climb nearly as high as the unpatched version. Testing on *nix very welcome :)

Can you show how exactly you "throw CacheControl into the mix"? I'm using

Thanks for reporting this :)

I'm using a test file like this:

# Write a large test file of varied text (errors='replace' so characters the
# locale encoding can't represent don't abort the write).
out = open('bigdata.bin', 'w', errors='replace')
data = ''.join(chr(x) for x in range(37, 1064)) + '\n'
for i in range(70):
    for j in range(1024):
        out.write(data)
        out.write(str(i + j) + '\n' * 20)
out.close()

Then testing it like this:

import os
import logging
import shutil
import requests
import cachecontrol
import psutil
# logging.basicConfig(level=logging.DEBUG)
us = psutil.Process(os.getpid())
MB = 1024 * 1024
N = 15
# Wrap the requests session with CacheControl so responses get cached.
sess = cachecontrol.CacheControl(requests.Session())
total = 0

# Fetch N distinct URLs, streaming each body to disk, and report this
# process's memory (USS) after every request.
for i in range(N):
    url = 'http://localhost:8000/bigdata.bin?limit=%s' % i
    print('Requesting %s...' % i)
    response = sess.get(url, stream=True)
    fh = open('dest.bin', 'wb')
    shutil.copyfileobj(response.raw, fh)
    fh.close()
    used_mem = us.memory_full_info().uss / MB
    total += used_mem
    print('Using %d MB.' % round(used_mem))
import gc
gc.collect()
print('Done.')
print('Using %d MB on program end.' % round(us.memory_full_info().uss / MB))
print('Mean memory use: %d MB' % round(total/N))
input()  # keep the process alive so its memory can still be inspected
This seems pretty fundamental: the cached data is read entirely into memory and then duplicated with the call to serializer.loads. I'm surprised to see this. This is just not usable with large files. I'd expect the cache response to include response headers only and a pointer/reader to cached body content that is usable as a file-like or stream-like object.
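To make that concrete, here is a minimal sketch of such an interface (the names and on-disk layout are hypothetical, not CacheControl's actual API): a lookup returns the stored headers plus a file-like handle, so serving a hit never loads the whole body into memory.

import json
import os

class CachedResponse:
    """Hypothetical cache hit: headers in memory, body left on disk."""
    def __init__(self, headers, body_path):
        self.headers = headers        # small: the parsed response headers
        self._body_path = body_path   # large: the cached body stays on disk

    def raw(self):
        # File-like object suitable for shutil.copyfileobj() or chunked reads.
        return open(self._body_path, 'rb')

def lookup(cache_dir, key):
    headers_path = os.path.join(cache_dir, key + '.headers.json')
    body_path = os.path.join(cache_dir, key + '.body')
    if not (os.path.exists(headers_path) and os.path.exists(body_path)):
        return None
    with open(headers_path) as fh:
        headers = json.load(fh)       # only the headers are deserialized
    return CachedResponse(headers, body_path)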
One thing that makes things tricky is that we're caching objects and not raw responses. This is primarily because under the hood it wasn't feasible to use the same loading mechanism as urllib3 does on the socket to ensure things work correctly. The result is that we've gone with the 80% solution, which seems to have been OK. I don't have much time to work on CacheControl, but I'm happy to review any PRs to try and fix the issue. The one caveat is that since it is bundled with pip, we need to tread lightly regarding compatibility.
I think the current serialization scheme could be maintained, but the interface for checking and serving cached content would need to be more selective about what and how it reads from disk:
As an interim change, the facility might set an upper limit on file size, above which it simply doesn't cache or read. This could be set high enough to avoid breaking most, if not all, caching scenarios, but low enough to avoid wantonly thrashing a system. I don't know what this number is. More than 1G, less than 4G? :) I'd dive in to help, but I'm heads down on other things myself. Typical for a GitHub issue troll!
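The cap suggested above could be little more than a Content-Length check before deciding to store a response; a rough sketch (the 2G threshold and the check itself are illustrative assumptions, not an existing CacheControl option):

MAX_CACHEABLE_BYTES = 2 * 1024 ** 3   # assumed cap: somewhere between 1G and 4G

def small_enough_to_cache(headers):
    # Skip caching when the declared body size is missing, bogus, or too large.
    try:
        length = int(headers.get('Content-Length', -1))
    except ValueError:
        return False
    return 0 <= length <= MAX_CACHEABLE_BYTES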
Related:
I took a look under the hood, and it appears that the implementation of response body caching in cachecontrol is naive: when saving a response, the entire body is stored in memory and is only written to disk at the end.
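The less naive alternative would be a write-through tee: append each chunk to a temporary file as the caller reads it, and commit the cache entry only once the stream is exhausted. A rough sketch of that pattern (an illustration, not CacheControl's actual code):

import os
import tempfile

def tee_to_cache(raw, cache_path, chunk_size=64 * 1024):
    # Stream the body through to the caller while spooling it to disk;
    # peak memory stays around chunk_size instead of the full body size.
    fd, tmp_path = tempfile.mkstemp(dir=os.path.dirname(cache_path) or '.')
    with os.fdopen(fd, 'wb') as tmp:
        while True:
            chunk = raw.read(chunk_size)
            if not chunk:
                break
            tmp.write(chunk)
            yield chunk
    os.replace(tmp_path, cache_path)  # commit only once the stream is fully read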
Well, if you wish to make backwards-incompatible changes, I'm pretty sure we'd be happy to accommodate them on pip's end. While there are caveats added by being bundled with pip (need to be pure Python, not writing imports a certain way, etc.) -- API backwards compatibility is not one, since pip has careful and exact control over what version of cachecontrol it uses. There may be some nuance with downstream redistributors who do weird things with pip, but they picked their poison by doing those weird things -- so I wouldn't worry too much about them.
#254 seems to fix this?
Just checking in again -- anything I can do to help get #254 reviewed and merged? Bribery? Extra code/tests?
Thanks for merging #254. Does this mean this can be closed?
0.7.4 was released yesterday (Windows is still pending). Would you mind trying 0.7.4 to verify that this addresses your issue? Feel free to close as needed!
@gar1t I'm not 100% sure what you mean by that. There's no 0.7.4 release for this package. :)
I tested with
In particular, I ran it under the Fil memory profiler (https://pythonspeed.com/fil/), which gives peak memory. In both cases, peak memory looked like this when using the test script provided by #145 (comment) (this is the 0.12.6 version, Git
🎉 Plans for a release? :) Otherwise, I guess pip can vendor a specific commit, @pradyunsg?
No, pip only vendors releases from PyPI -- see https://pip.pypa.io/en/stable/development/vendoring-policy/
When downloading large files, memory usage is not constant when using CacheControl.
I believe this is due to the FileWrapper that buffers the response in memory.
If using requests directly:
Yields constant memory usage. If you throw CacheControl into the mix, memory shoots up based on the size of the downloaded object.
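For reference, the plain-requests pattern being compared against is roughly the following (a sketch of the streaming download, not the exact code from this report):

import shutil
import requests

# Stream the body straight to disk; memory stays roughly constant
# regardless of the size of the download.
with requests.get('http://localhost:8000/bigdata.bin', stream=True) as response:
    with open('dest.bin', 'wb') as fh:
        shutil.copyfileobj(response.raw, fh)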