Large responses cause increased memory usage. #145

Closed
dsully opened this issue Mar 15, 2017 · 17 comments · Fixed by #256

Comments

@dsully

dsully commented Mar 15, 2017

When downloading large files, memory usage is not constant when using CacheControl.

I believe this is due to the FileWrapper that buffers the response in memory.

If using requests directly:

import shutil
import requests

response = requests.get(url, stream=True)
with open('/var/tmp/out.bin', 'wb') as fh:
    shutil.copyfileobj(response.raw, fh)

Yields constant memory usage. If you throw CacheControl into the mix, memory shoots up based on the size of the downloaded object.
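
For reference, this is the CacheControl variant I mean (a minimal sketch, assuming the default in-memory DictCache backend and the same url as above):

import shutil

import requests
import cachecontrol

# Same download as the snippet above, but with the session wrapped by CacheControl.
sess = cachecontrol.CacheControl(requests.Session())

response = sess.get(url, stream=True)
with open('/var/tmp/out.bin', 'wb') as fh:
    shutil.copyfileobj(response.raw, fh)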

@dsully
Author

dsully commented Mar 15, 2017

Digging more - the call to:

self.serializer.dumps(request, response, body=body)

inside def cache_response(self, request, response, body=None) is looking like the culprit.
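
Paraphrasing that path (an illustrative sketch, not the actual cachecontrol source): by the time dumps() runs, the whole body has already been materialized as a single bytes object, and serialization then builds a second blob of comparable size:

# Illustrative sketch only -- not the actual cachecontrol source.
def cache_response(self, request, response, body=None):
    if body is None:
        # The full response body ends up in memory as one bytes object...
        body = response.read(decode_content=False)
    # ...and dumps() packs the headers plus that body into another in-memory blob,
    # so a large download is held in RAM at least twice before it reaches the cache.
    data = self.serializer.dumps(request, response, body=body)
    self.cache.set(self.cache_url(request.url), data)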

@elnuno
Contributor

elnuno commented Apr 11, 2017

It seems to me that there might be an extra copy of the body attached to the response, making things worse (serialize.py:38). I'll check for any low-hanging fruit and try to trace it to see whether we could somehow use streaming, but I expect that not to be practical.

@elnuno
Contributor

elnuno commented Apr 12, 2017

Turns out there is a simple issue that might give us most of the possible savings in the simple CacheControl(requests.Session()) case: we tee the response .read into a buffer we never bother to close. Closing it makes things better (on Windows), in that total memory doesn't rise as much as in the original code. PR incoming!
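
Conceptually, the tee is something like the wrapper below (a simplified stand-in for the FileWrapper mentioned at the top of this issue, not the real class), and the fix is just to let go of the buffer once the cache callback has consumed it:

import io

class TeeingWrapper:
    # Simplified stand-in for cachecontrol's file wrapper (illustrative only).
    def __init__(self, fp, callback):
        self._fp = fp
        self._buf = io.BytesIO()   # every chunk read from the network is copied in here
        self._callback = callback

    def read(self, amt=None):
        data = self._fp.read(amt)
        if data:
            self._buf.write(data)  # the tee: a second, growing copy of the body
        elif self._callback is not None:
            # Stream exhausted: hand the buffered body to the cache callback...
            self._callback(self._buf.getvalue())
            self._callback = None
            self._buf.close()      # ...and release the buffer instead of keeping it alive
        return data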

For patched code and bare requests I get:

Using 32 MB on program end.
Mean memory use: 32 MB

For original code I get:

Using 106 MB on program end.
Mean memory use: 199 MB

FWIW, I see more fluctuations in memory use even in the patched version (as opposed to raw requests) when watching the process in task manager. But it doesn't climb nearly as high as the unpatched version. Testing on *nix very welcome :)

Can you show how exactly you "throw CacheControl into the mix"? I'm using sess = cachecontrol.CacheControl(requests.Session()), but maybe using other caches makes things worse.

Thanks for reporting this :)

I'm using a test file like this:

out = open('bigdata.bin', 'w', errors='replace')
data = ''.join(chr(x) for x in range(37, 1064)) + '\n'
for i in range(70):
    for j in range(1024):
        out.write(data)
        out.write(str(i + j) + '\n' * 20)
out.close()

Then testing it like this:

import os
import logging
import shutil

import requests
import cachecontrol
import psutil

# logging.basicConfig(level=logging.DEBUG)

us = psutil.Process(os.getpid())
MB = 1024 * 1024
N = 15

sess = cachecontrol.CacheControl(requests.Session())

total = 0
for i in range(N):
    url = 'http://localhost:8000/bigdata.bin?limit=%s' % i
    print('Requesting %s...' % i)
    response = sess.get(url, stream=True)
    fh = open('dest.bin', 'wb')
    shutil.copyfileobj(response.raw, fh)
    fh.close()
    used_mem = us.memory_full_info().uss / MB
    total += used_mem
    print('Using %d MB.' % round(used_mem))
import gc
gc.collect()
print('Done.')
print('Using %d MB on program end.' % round(us.memory_full_info().uss / MB))
print('Mean memory use: %d MB' % round(total/N))
input()

@gar1t

gar1t commented Mar 2, 2018

This seems pretty fundamental:

https://github.com/ionrock/cachecontrol/blob/bc43a32f1dc5d4467f8ac504f92eed33f113d89a/cachecontrol/controller.py#L131-L137

The cached data is read entirely into memory and then duplicated with the call to serializer.loads.

I'm surprised to see this. This is just not usable with large files.

I'd expect the cached response to include the response headers only, plus a pointer/reader to the cached body content that is usable as a file-like or stream-like object.
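
Something like this interface is what I have in mind (purely a sketch; get_headers / open_body are made-up names, not existing cachecontrol API):

import shutil

class StreamingCache:
    # Hypothetical cache interface: metadata and body stored and read separately.
    def get_headers(self, key):
        """Return only the cached status and headers (small, cheap to deserialize)."""
        raise NotImplementedError

    def open_body(self, key):
        """Return a file-like object over the cached body, read lazily from disk."""
        raise NotImplementedError

def serve_from_cache(cache, key, dest_path):
    headers = cache.get_headers(key)          # headers only -- no body in memory yet
    with cache.open_body(key) as body, open(dest_path, 'wb') as out:
        shutil.copyfileobj(body, out)         # body streams disk-to-disk in chunks
    return headers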

@ionrock
Contributor

ionrock commented Mar 3, 2018

One thing that makes things tricky is that we're caching objects rather than raw responses. This is primarily because, under the hood, it wasn't feasible to reuse the same loading mechanism urllib3 uses on the socket while still ensuring things work correctly. The result is that we've gone with the 80% solution, which seems to have been OK.

I don't have much time to work on CacheControl, but I'm happy to review any PRs to try and fix the issue. The one caveat is that since it is bundled with pip, we need to tread lightly regarding compatibility.

@gar1t

gar1t commented Mar 3, 2018

I think the current serialization scheme could be maintained, but the interface for checking and serving cached content would need to be more selective about what it reads from disk and how:

  • Deserialize from the start when checking for cached content, but read only up to the object headers
  • Serve response body content from disk via a file-like interface

As an interim change, the facility might set an upper limit on file size, above which it simply doesn't cache or read. This could be set high enough to avoid breaking most, if not all, caching scenarios, but low enough to avoid wantonly thrashing a system. I don't know what this number is. More than 1G, less than 4G? :)
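
As a sketch of that interim guard (the constant and the Content-Length check are my own invention, not current cachecontrol behaviour):

MAX_CACHEABLE_BODY = 1024 ** 3  # e.g. 1 GiB; the right number is debatable

def should_cache_body(response):
    # Skip caching when the declared body size exceeds the cap.
    length = response.headers.get('Content-Length')
    if length is None:
        return True  # no declared length: policy decision whether to cache or skip
    return int(length) <= MAX_CACHEABLE_BODY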

I'd dive in to help but I'm heads down on other things myself. Typical for a GitHub issue troll!

@hexagonrecursion
Contributor

Related:
#180 - pypa/pip#2984 - you are using too much RAM
#238 - pypa/pip#9549 - you have hit the design limitation of the msgpack format

https://github.com/msgpack/msgpack/blob/master/spec.md#bin-format-family
bin 32 stores a byte array whose length is upto (2^32)-1 bytes

I took a look under the hood, and it appears that the implementation of response body caching in cachecontrol is naive: when saving a response, the entire body is stored in memory and is only written to disk at the end.
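
The alternative would be to spool the body to disk as it is read, roughly like this (a sketch only; this is not how cachecontrol currently works, and the helper name is hypothetical):

def spool_body_to_cache(raw, cache_path, chunk_size=64 * 1024):
    # Write the body to the cache file chunk by chunk, so peak memory stays
    # around chunk_size instead of the full body size.
    with open(cache_path, 'wb') as cache_file:
        while True:
            chunk = raw.read(chunk_size)
            if not chunk:
                break
            cache_file.write(chunk)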

@pradyunsg
Contributor

The one caveat is that since it is bundled with pip, we need to tread lightly regarding compatibility.

Well, if you wish to make backwards-incompatible changes, I'm pretty sure we'd be happy to accommodate them on pip's end. While there are caveats added by being bundled with pip (needing to be pure Python, not writing imports a certain way, etc.), API backwards compatibility is not one of them, since pip has careful and exact control over which version of cachecontrol it uses.

There may be some nuance of downstream redistributors who do weird things with pip, but they picked their poison by doing those weird things -- so I wouldn't worry too much about them.

@itamarst
Contributor

#254 seems to fix this?

@itamarst
Contributor

itamarst commented Oct 4, 2021

Just checking in again—anything I can do to help get #254 reviewed and merged? Bribery? Extra code/tests?

@itamarst
Contributor

Thanks for merging #254. Does this mean this can be closed?

@gar1t

gar1t commented Oct 13, 2021

0.7.4 was released yesterday (Windows is still pending). Would you mind trying 0.7.4 to verify that this addresses your issue?

Feel free to close as needed!

@pradyunsg
Contributor

pradyunsg commented Oct 14, 2021

@gar1t I'm not 100% sure what you mean by that. There's no 0.7.4 release for this package. :)

@itamarst
Contributor

I tested with master vs the released 0.12.6. Both had the same peak memory usage, so I don't think #254 made any difference. I will see if I can submit a follow-up PR.

@itamarst
Contributor

itamarst commented Oct 14, 2021

In particular, I ran under the Fil memory profiler (https://pythonspeed.com/fil/), which reports peak memory. In both cases peak memory looked like this when using the test script provided by #145 (comment) (this is the 0.12.6 version; Git master is the same except with a slightly different traceback):
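
(To reproduce the measurement: Fil wraps the script invocation, i.e. fil-profile run <your test script>.py, and writes its peak-memory report when the process exits.)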

Fil output

@astrojuanlu

🎉 Plans for a release? :) Otherwise, I guess pip can vendor a specific commit @pradyunsg ?

@pradyunsg
Contributor

pradyunsg commented Oct 27, 2021

No, pip only vendors releases from PyPI -- see https://pip.pypa.io/en/stable/development/vendoring-policy/
