Large responses cause increased memory usage. #145

Closed
dsully opened this issue Mar 15, 2017 · 17 comments · Fixed by #256

Comments

@dsully

dsully commented Mar 15, 2017

When downloading large files, memory usage is not constant when using CacheControl.

I believe this is due to the FileWrapper that buffers the response in memory.

If using requests directly:

import shutil
import requests

response = requests.get(url, stream=True)
with open('/var/tmp/out.bin', 'wb') as fh:
    shutil.copyfileobj(response.raw, fh)

Yields constant memory usage. If you throw CacheControl into the mix, memory shoots up based on the size of the downloaded object.
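
For reference, this is the CacheControl variant I mean (a minimal sketch, assuming the default in-memory DictCache backend and the same url as above):

import shutil

import requests
import cachecontrol

# Same download as the snippet above, but with the session wrapped by CacheControl.
sess = cachecontrol.CacheControl(requests.Session())

response = sess.get(url, stream=True)
with open('/var/tmp/out.bin', 'wb') as fh:
    shutil.copyfileobj(response.raw, fh)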

@dsully
Author

dsully commented Mar 15, 2017

Digging more - the call to:

self.serializer.dumps(request, response, body=body)

inside def cache_response(self, request, response, body=None) is looking like the culprit.
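
Paraphrasing that path (an illustrative sketch, not the actual cachecontrol source): by the time dumps() runs, the whole body has already been materialized as a single bytes object, and serialization then builds a second blob of comparable size:

# Illustrative sketch only -- not the actual cachecontrol source.
def cache_response(self, request, response, body=None):
    if body is None:
        # The full response body ends up in memory as one bytes object...
        body = response.read(decode_content=False)
    # ...and dumps() packs the headers plus that body into another in-memory blob,
    # so a large download is held in RAM at least twice before it reaches the cache.
    data = self.serializer.dumps(request, response, body=body)
    self.cache.set(self.cache_url(request.url), data)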

@elnuno
Contributor

elnuno commented Apr 11, 2017

It seems to me that there might be an extra copy of the body attached to the response, making things worse (serialize.py:38). I'll check for any low-hanging fruit and try to trace it to see whether we could somehow use streaming, but I expect that not to be practical.

@elnuno
Contributor

elnuno commented Apr 12, 2017

Turns out there is a simple issue that might give us most of the possible savings in the simple CacheControl(requests.Session()) case: we tee the response .read into a buffer we never bother to close. Closing it makes things better (on Windows), in that total memory doesn't rise as much as in the original code. PR incoming!
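
Conceptually, the tee is something like the wrapper below (a simplified stand-in for the FileWrapper mentioned at the top of this issue, not the real class), and the fix is just to let go of the buffer once the cache callback has consumed it:

import io

class TeeingWrapper:
    # Simplified stand-in for cachecontrol's file wrapper (illustrative only).
    def __init__(self, fp, callback):
        self._fp = fp
        self._buf = io.BytesIO()   # every chunk read from the network is copied in here
        self._callback = callback

    def read(self, amt=None):
        data = self._fp.read(amt)
        if data:
            self._buf.write(data)  # the tee: a second, growing copy of the body
        elif self._callback is not None:
            # Stream exhausted: hand the buffered body to the cache callback...
            self._callback(self._buf.getvalue())
            self._callback = None
            self._buf.close()      # ...and release the buffer instead of keeping it alive
        return data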

For patched code and bare requests I get:

Using 32 MB on program end.
Mean memory use: 32 MB

For original code I get:

Using 106 MB on program end.
Mean memory use: 199 MB

FWIW, I see more fluctuations in memory use even in the patched version (as opposed to raw requests) when watching the process in task manager. But it doesn't climb nearly as high as the unpatched version. Testing on *nix very welcome :)

Can you show how exactly you "throw CacheControl into the mix"? I'm using sess = cachecontrol.CacheControl(requests.Session()), but maybe using other caches makes things worse.

Thanks for reporting this :)

I'm using a test file like this:

out = open('bigdata.bin', 'w', errors='replace')
data = ''.join(chr(x) for x in range(37, 1064)) + '\n'
for i in range(70):
    for j in range(1024):
        out.write(data)
        out.write(str(i + j) + '\n' * 20)
out.close()

Then testing it like this:

import os
import logging
import shutil

import requests
import cachecontrol
import psutil

# logging.basicConfig(level=logging.DEBUG)

us = psutil.Process(os.getpid())
MB = 1024 * 1024
N = 15

sess = cachecontrol.CacheControl(requests.Session())

total = 0
for i in range(N):
    url = 'http://localhost:8000/bigdata.bin?limit=%s' % i
    print('Requesting %s...' % i)
    response = sess.get(url, stream=True)
    fh = open('dest.bin', 'wb')
    shutil.copyfileobj(response.raw, fh)
    fh.close()
    used_mem = us.memory_full_info().uss / MB
    total += used_mem
    print('Using %d MB.' % round(used_mem))
import gc
gc.collect()
print('Done.')
print('Using %d MB on program end.' % round(us.memory_full_info().uss / MB))
print('Mean memory use: %d MB' % round(total/N))
input()

@gar1t

gar1t commented Mar 2, 2018

This seems pretty fundamental:

https://github.com/ionrock/cachecontrol/blob/bc43a32f1dc5d4467f8ac504f92eed33f113d89a/cachecontrol/controller.py#L131-L137

The cached data is read entirely into memory and then duplicated with the call to serializer.loads.

I'm surprised to see this. This is just not usable with large files.

I'd expect the cached response to include the response headers only, plus a pointer/reader to the cached body content that is usable as a file-like or stream-like object.
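
Something like this interface is what I have in mind (purely a sketch; get_headers / open_body are made-up names, not existing cachecontrol API):

import shutil

class StreamingCache:
    # Hypothetical cache interface: metadata and body stored and read separately.
    def get_headers(self, key):
        """Return only the cached status and headers (small, cheap to deserialize)."""
        raise NotImplementedError

    def open_body(self, key):
        """Return a file-like object over the cached body, read lazily from disk."""
        raise NotImplementedError

def serve_from_cache(cache, key, dest_path):
    headers = cache.get_headers(key)          # headers only -- no body in memory yet
    with cache.open_body(key) as body, open(dest_path, 'wb') as out:
        shutil.copyfileobj(body, out)         # body streams disk-to-disk in chunks
    return headers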

@ionrock
Contributor

ionrock commented Mar 3, 2018

One thing that makes things tricky is that we're caching objects rather than raw responses. This is primarily because, under the hood, it wasn't feasible to reuse the same loading mechanism urllib3 uses on the socket while still ensuring things work correctly. The result is that we've gone with the 80% solution, which seems to have been OK.

I don't have much time to work on CacheControl, but I'm happy to review any PRs to try and fix the issue. The one caveat is that since it is bundled with pip, we need to tread lightly regarding compatibility.

@gar1t

gar1t commented Mar 3, 2018

I think the current serialization scheme could be maintained, but the interface for checking and serving cached content would need to be more selective about what it reads from disk and how:

  • Deserialize from the start when checking for cached content, but read only up to the object headers
  • Serve response body content from disk via a file-like interface

As an interim change, the facility might set an upper limit on file size, above which it simply doesn't cache or read. This could be set high enough to avoid breaking most, if not all, caching scenarios, but low enough to avoid wantonly thrashing a system. I don't know what this number is. More than 1G, less than 4G? :)
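
As a sketch of that interim guard (the constant and the Content-Length check are my own invention, not current cachecontrol behaviour):

MAX_CACHEABLE_BODY = 1024 ** 3  # e.g. 1 GiB; the right number is debatable

def should_cache_body(response):
    # Skip caching when the declared body size exceeds the cap.
    length = response.headers.get('Content-Length')
    if length is None:
        return True  # no declared length: policy decision whether to cache or skip
    return int(length) <= MAX_CACHEABLE_BODY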

I'd dive in to help but I'm heads down on other things myself. Typical for a GitHub issue troll!

@hexagonrecursion
Contributor

Related:
#180 - pypa/pip#2984 - you are using too much RAM
#238 - pypa/pip#9549 - you have hit the design limitation of the msgpack format

https://github.com/msgpack/msgpack/blob/master/spec.md#bin-format-family
bin 32 stores a byte array whose length is upto (2^32)-1 bytes

I took a look under the hood, and it appears that the implementation of response body caching in cachecontrol is naive: when saving a response, the entire body is stored in memory and is only written to disk at the end.
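
The alternative would be to spool the body to disk as it is read, roughly like this (a sketch only; this is not how cachecontrol currently works, and the helper name is hypothetical):

def spool_body_to_cache(raw, cache_path, chunk_size=64 * 1024):
    # Write the body to the cache file chunk by chunk, so peak memory stays
    # around chunk_size instead of the full body size.
    with open(cache_path, 'wb') as cache_file:
        while True:
            chunk = raw.read(chunk_size)
            if not chunk:
                break
            cache_file.write(chunk)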

@pradyunsg
Contributor

The one caveat is that since it is bundled with pip, we need to tread lightly regarding compatibility.

Well, if you wish to make backwards-incompatible changes, I'm pretty sure we'd be happy to accommodate them on pip's end. While there are caveats added by being bundled with pip (needing to be pure Python, not writing imports a certain way, etc.), API backwards compatibility is not one of them, since pip has careful and exact control over which version of cachecontrol it uses.

There may be some nuance of downstream redistributors who do weird things with pip, but they picked their poison by doing those weird things -- so I wouldn't worry too much about them.

@itamarst
Contributor

#254 seems to fix this?

@itamarst
Contributor

itamarst commented Oct 4, 2021

Just checking in again—anything I can do to help get #254 reviewed and merged? Bribery? Extra code/tests?

@itamarst
Contributor

Thanks for merging #254. Does this mean this can be closed?

@gar1t

gar1t commented Oct 13, 2021

0.7.4 was released yesterday (Windows is still pending). Would you mind trying 0.7.4 to verify that this addresses your issue?

Feel free to close as needed!

@pradyunsg
Contributor

pradyunsg commented Oct 14, 2021

@gar1t I'm not 100% sure what you mean by that. There's no 0.7.4 release for this package. :)

@itamarst
Contributor

I tested with master vs the released 0.12.6. Both had the same peak memory usage, so I don't think #254 made any difference. I will see if I can submit a follow-up PR.

@itamarst
Contributor

itamarst commented Oct 14, 2021

In particular, I ran under the Fil memory profiler (https://pythonspeed.com/fil/), which reports peak memory. In both cases peak memory looked like this when using the test script provided by #145 (comment) (this is the 0.12.6 version; Git master is the same except with a slightly different traceback):
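
(To reproduce the measurement: Fil wraps the script invocation, i.e. fil-profile run <your test script>.py, and writes its peak-memory report when the process exits.)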

Fil output

@astrojuanlu

🎉 Plans for a release? :) Otherwise, I guess pip can vendor a specific commit @pradyunsg ?

@pradyunsg
Contributor

pradyunsg commented Oct 27, 2021

No, pip only vendors releases from PyPI -- see https://pip.pypa.io/en/stable/development/vendoring-policy/
