
Memory leak when reading a stream via callback #98

Closed
chrippa opened this issue Oct 31, 2012 · 11 comments

chrippa commented Oct 31, 2012

When using sh to read a network stream I noticed a memory leak. This is a simple test case:

from sh import curl
from time import sleep

url = "ftp://ftp.port80.se/1000M"

def read_callback(data):
    print("data", len(data))

curl(url, _out=read_callback)

while True:
    sleep(1)

When you run this you can watch top and see that the memory is never freed, even though there is no reference keeping the data around.

This has been tested with:
sh 1.05 and git
Python 2.7 and 3.2

amoffat (Owner) commented Oct 31, 2012

I was able to confirm this. Really strange... even wrapping the code in a function (everything before the while True), the memory still grows. There is a fork-exec happening internally, maybe this is somehow related...

amoffat (Owner) commented Nov 8, 2012

Just an update, I'm still looking into this. There's some really weird behavior going on where garbage collection isn't occurring. I'm having a hard time nailing down exactly where the references are being held.

amoffat (Owner) commented Nov 10, 2012

Fixed on master and pushed to PyPI as v1.06. The cause was some nasty cyclical references that were preventing garbage collection.

amoffat closed this as completed Nov 10, 2012
chrippa (Author) commented Nov 13, 2012

Hmm, I'm still able to reproduce this with the test case I posted. I double checked on two different installs and also checked that sh.__version__ is 1.06, just to make sure I wasn't running the old version. I tested with Python 2.6, 2.7 and 3.3.

Here is a screenshot of htop: http://i.imgur.com/p6L1x.png

amoffat (Owner) commented Nov 13, 2012

@chrippa What I found with the fix I added was that Python's garbage collector is really, really lazy; I would have to run gc.collect() to get the objects collected in a timely manner. Give this a shot and let me know if you see a change in the memory usage.

When I get home tonight, I'll post the test case I was using... it was similar to yours: a while loop running sh.cat(largefile) over and over. Before the fix, memory grew indefinitely; after the fix, it stayed constant. Maybe your test case is different enough, though (because it uses a callback), that there is still uncollected garbage... I'll need to confirm this.
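
To make that concrete, here is a minimal sketch of the gc.collect() experiment applied to the original test case (the URL and callback are the ones from the first comment; the explicit collect call in the loop is only there for the experiment):

import gc
from time import sleep

from sh import curl

url = "ftp://ftp.port80.se/1000M"

def read_callback(data):
    print("data", len(data))

curl(url, _out=read_callback)

while True:
    # force a collection each pass to see whether memory stops growing
    gc.collect()
    sleep(1)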

amoffat reopened this Nov 13, 2012
chrippa (Author) commented Nov 13, 2012

I tried adding gc.collect() to the while loop and also in the callback of the test case but it did not make any difference.

I'm using sh in my project to read a live video stream from a subprocess (rtmpdump), so I need to read chunks of data rather than all the data at once as in your cat example. I used to read directly from the Popen stdout object in pbs, but if I understand correctly I need to use callbacks to do the same in sh.
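
For comparison, here is a rough sketch of the kind of direct reading I mean, using plain subprocess (the rtmpdump arguments are only placeholders for my real ones):

import subprocess

proc = subprocess.Popen(
    ["rtmpdump", "--rtmp", "rtmp://example.com/live", "--flv", "-"],
    stdout=subprocess.PIPE,
)

while True:
    # read one fixed-size chunk of the live stream at a time
    chunk = proc.stdout.read(8192)
    if not chunk:
        break
    print("chunk", len(chunk))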

amoffat (Owner) commented Nov 13, 2012

Gotcha. One thing that may be misleading is that sh commands do buffer all the data internally. So while your callback is being called with each chunk, all the chunks are also being aggregated internally. If you do something like this:

process = curl(url, _out=read_callback)
print(process.stdout)

It would print all of the chunks concatenated together. So I might be misunderstanding what you are looking for... are you saying that the entire process object (in the above example) is not being garbage collected when it goes out of scope? Or are you saying that memory should not be growing as your callback is being called?

If it's the second one, we can probably disable stdout being aggregated internally when a callback is used. But I want to be sure that the issue you're seeing isn't the garbage collection issue (the process object not being collected, and its resources only being freed if you call del on it).
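
To illustrate the difference, here is a minimal sketch against the curl test case from above, assuming the current (1.06) behavior where the aggregated output is kept on the process object:

import gc

from sh import curl

url = "ftp://ftp.port80.se/1000M"

def read_callback(data):
    print("data", len(data))

process = curl(url, _out=read_callback)

# the chunks passed to the callback are also aggregated here
print(len(process.stdout))

# this is the first case: dropping the only reference and collecting,
# after which the aggregated data should be freeable
del process
gc.collect()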

chrippa (Author) commented Nov 13, 2012

Ah, this makes more sense now. I was expecting that once the callback had been called, the data would be gone with it. A way to disable the aggregating would solve the problem.

amoffat (Owner) commented Nov 14, 2012

I have a fix on the dev branch right now. If you could, go ahead and download it and drop it into your PYTHONPATH to test it: https://raw.github.com/amoffat/sh/dev/sh.py. The new special keyword arguments you'll want to use are _no_out and _no_pipe. What these do is explicitly disable the aggregating of those internal buffers:

def read_callback(data):
    print("data", len(data))

curl(url, _out=read_callback, _no_out=True, _no_pipe=True)

I'm wondering though if we should automatically disable aggregating when a callback is used, since a callback will probably only be used to process a large amount of data, and you wouldn't want to automatically store a large amount of data (which is your use case).

Anyways, when you get a chance, confirm for me that the dev file works for you, and I can roll that up for the 1.07 release.
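
For reference, here is the original test case adapted to the dev file, with the new keyword arguments added (URL and callback taken from the first comment):

from time import sleep

from sh import curl

url = "ftp://ftp.port80.se/1000M"

def read_callback(data):
    print("data", len(data))

# _no_out and _no_pipe disable the internal aggregation described above
curl(url, _out=read_callback, _no_out=True, _no_pipe=True)

while True:
    sleep(1)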

chrippa (Author) commented Nov 14, 2012

Thanks for the fix. It's working fine here!

amoffat (Owner) commented Nov 21, 2012

Fixed on master and in v1.07.

amoffat closed this as completed Nov 21, 2012