Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Pipeline failure #119

Closed
asharp opened this issue Jan 23, 2013 · 19 comments
Closed

Pipeline failure #119

asharp opened this issue Jan 23, 2013 · 19 comments

Comments

@asharp
Copy link

asharp commented Jan 23, 2013

Hi,

Given the code
import sh
sh.dd(sh.dd('if=/dev/zero', 'bs=1M', "count=1024"), 'of=/dev/null')

sh dies with a MemoryError as seen in http://pastebin.ca/2306288

It looks as though sh is trying to buffer the input rather than connecting the stdout of the inner dd process to the stdin of the outer dd process, and is such blowing up.

To give you an example of the behaviour I was expecting, see the example below.

import subprocess

s = subprocess.Popen(['dd', 'if=/dev/zero', 'bs=1M', 'count=10240'], stdout=subprocess.PIPE)
r = subprocess.Popen(['dd', 'of=/dev/null', 'bs=1M'], stdin=s.stdout)

r.wait()

I've taken a look through sh.py but I can't seem to work out what exactly I'd need to do in order to patch this. Would someone mind lending a hand?

@amoffat
Copy link
Owner

amoffat commented Jan 23, 2013

So just from a quick look, the _piped=True option is not being used on the inner sh.dd command. Without it, the outer command waits for the inner command to complete, which would be never :) Try re-running with _piped=True

@asharp
Copy link
Author

asharp commented Jan 23, 2013

When I run the command
sh.dd(sh.dd('if=/dev/zero', 'bs=1M', 'count=1', _piped=True), of='/dev/null')

I receive the error http://pastebin.ca/2306364

@amoffat
Copy link
Owner

amoffat commented Jan 23, 2013

  RAN: '/bin/dd --of=/dev/null'

  STDOUT:


  STDERR:
/bin/dd: invalid option -- '-'
Try `/bin/dd --help' for more information.

When you write of='/dev/null' it expands to --of=/dev/null, so you'll probably want to just pass the argument "of=/dev/null"

@asharp
Copy link
Author

asharp commented Jan 23, 2013

Ah. Silly mistake. Sorry about that. Can you check with me that
sh.dd(sh.dd("dd", "if=/dev/zero", "bs=1M", _piped=True), "of=/dev/null")

is the correct command? I'm running it, and it's returning (no errors), but no output is being returned and no processes are being made.

@amoffat
Copy link
Owner

amoffat commented Jan 23, 2013

sh.dd(sh.dd("dd", "if=/dev/zero", "bs=1M", _piped=True), "of=/dev/null")

You're passing "dd" as the first argument to sh.dd

@amoffat
Copy link
Owner

amoffat commented Jan 23, 2013

Why that doesn't raise an exception though is something to look at...

@asharp
Copy link
Author

asharp commented Jan 24, 2013

Sorry to keep poking you, but is
sh.dd(sh.dd("if=/dev/zero", "bs=1M", 'count=512', _piped=True), "of=/dev/null")

The correct form?

I'm still seeing a lot of memory/cpu usage inside python. (> 1 gig of ram, that isn't freed after the command returns)

@amoffat
Copy link
Owner

amoffat commented Jan 24, 2013

If you tell me what you're looking to do specifically, I could probably advise you better.

@asharp
Copy link
Author

asharp commented Jan 24, 2013

I have a series of essentially bash pipelines I use to process data. At the moment I manage these pipelines using a simple wrapper around Popen, however it's ugly and a pain to use. I'd much prefer to use sh, however i'm having performance problems. The test case I wrote was a simplified test case to work out what the problem was. If I can find a way to run a test case similar to the dd example above (ie. move large quantities of data quickly through a pipeline) using sh, then all is well.

@amoffat
Copy link
Owner

amoffat commented Jan 29, 2013

Apologies for the delay in this. There's a couple of things to take note of. One is that the memory used by a Command object will not be released until that object is garbage collected. You can take a look at some of the past memory issues in #99. The other issue is that Command objects internally buffer some stdout. This can be limited with the _internal_bufsize argument on http://amoffat.github.com/sh/special_arguments.html#special-arguments

Let me know if that works on you. I personally haven't used sh for very high bandwidth processes, so the practical limits haven't been pushed as far as you may be pushing them

@asharp
Copy link
Author

asharp commented Jan 30, 2013

Thanks for the argument, and i'll try that out shortly.

The problem that I see is that sh is copying data between the processes in a pipeline rather then just handing it off to the OS to do. From what I can see we don't gain anything from it, it simply burns massive amounts of CPU time and slows progress to a crawl (...And unexpectedly OOMing the box, if you're running this on a system with memory overcommit enabled...)

That being said, it should be a real simple change, I just can't seem to work out where exactly to put it., nor what sort of interface it should have to keep in line with the rest of the project.

I'm more than happy to help though.

@amoffat
Copy link
Owner

amoffat commented Jan 31, 2013

The problem that I see is that sh is copying data between the processes in a pipeline rather then just handing it off to the OS to do. From what I can see we don't gain anything from it, it simply burns massive amounts of CPU time and slows progress to a crawl

A process launched from sh doesn't know if it's going to be piped, so it has to internally buffer output. That's really the crux of the problem. For example:

import sh
print sh.wc(sh.ls())

ls() doesn't know about wc() so it needs to store its own output. When wc() sees that an argument to it is another Command object, it knows to look in its pipe buffer for data to pipe in.

Don't be afraid to dig around deeply in the source either...I tried to comment the non-obvious things pretty well. You have a Command object, which represents a program to be run. It spawns a RunningCommand object when the Command object is called. The RunningCommand object spawns a OProc object to do the low-level process interaction. The OProc object spawns a few StreamReader and StreamWriter objects to read and write to stdin/stdout.

@asharp
Copy link
Author

asharp commented Jan 31, 2013

Ah, yeah. That makes sense then. Thanks for the explanation :)

Just thinking out loud here

  1. Could we specify an option to, say the Command to let it know that it should expose its stdout pipe (rather than buffering it)? I'm, thinking that the next Command could then check to see if that pipe exists on creation so it can then completely join on.

  2. Could we defer the actual execution until somebody actually tries to read out of it (or it gets destroyed)? As an example

x = sh.ls() #Wouldn't actually run ls yet
x.stdout #Now it would run ls and produce output

This then gives us a bit more leeway as the inner sh.ls returns essentially a future object the outer sh.wc can work out what's going on and a more efficient pipeline can be built.

By also running on del, it means that a single standalone command like

sh.rm('/tmp/somefile')

would still run instantly. (After that line the reference drops, the refcount of the sh.rm "future object" drops to zero and del is called).

What are your thoughts?

@amoffat
Copy link
Owner

amoffat commented Feb 6, 2013

  1. This would defy expectations and introduce problems with order of operations. Say you wanted to do process1, then process2, and finanly process3, and all 3 mutate some resource. You would expect that calling them in order would work.

  2. It may be possible, I haven't really had a performance problem for my use cases to warrant doing this. Work up a proof of concept and I can dig deeper.

@asharp
Copy link
Author

asharp commented Feb 28, 2013

  1. Should work fine, I'm just not entirely sure where I need to poke. Also, although you might not have a pressing need for it, it is a fairly crippling bug in sh that renders it essentially useless for a decent number of common tasks.

2 Could you provide an example of when that wouldn't work?

@amoffat
Copy link
Owner

amoffat commented Feb 28, 2013

p1 = sh.cat("first", _out=someFile)
p2 = sh.cat("second", _out=someFile)

p2.stdout
p1.stdout

The file contains "second" then "first"

  1. I'll be more useful answering specific questions about how things work. Your starting place will probably be class OProc

mcclymont added a commit to mcclymont/sh that referenced this issue Oct 9, 2013
…ween

2 processes directly using OS pipes and connecting them with dup2, instead
of buffering the output of process 1 into python memory before sending to
process 2.

This is in response to issue amoffat#119 amoffat#119

The effect can be enabled by setting call arg _piping="direct" on the inner process.
I had issues with getting 'Input/Output error reading stdin' from dd, until
I set _tty_out=False. In the code I have therefore set _tty_out to False when
_piping == "direct". This has caused 3 of the unit tests to fail.

test2.py on my PC shows 13s vs 0.5s performance difference
mcclymont added a commit to mcclymont/sh that referenced this issue Oct 9, 2013
…ween

2 processes directly using OS pipes and connecting them with dup2, instead
of buffering the output of process 1 into python memory before sending to
process 2.

This is in response to issue amoffat#119 amoffat#119

The effect can be enabled by setting call arg _piping="direct" on the inner process.
I had issues with getting 'Input/Output error reading stdin' from dd, until
I set _tty_out=False. In the code I have therefore set _tty_out to False when
_piping == "direct". This has caused 3 of the unit tests to fail.

test2.py on my PC shows 13s vs 0.5s performance difference
@mcclymont
Copy link
Contributor

I've taken a look at this issue. I managed to get asharp's nested dd call to go from 12 seconds to 0.5 seconds by connecting the pipes directly. However it caused a few unit test problems, so I separated the change out into a dependent parameter (_piping="direct"). My changes are at https://github.com/mcclymont/sh/tree/direct-piping. More details available in the commit message

@amoffat
Copy link
Owner

amoffat commented Oct 10, 2013

Looks good so far @mcclymont ..I will take a closer look and merge this weekend

@amoffat
Copy link
Owner

amoffat commented Dec 30, 2014

closing, since @mcclymont 's direct piping are on release-1.10 branch.

http://i.imgur.com/FWzGP6k.gif

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants