
Orca hanging when large JSONs are piped in #110

Closed
sdrap opened this issue Aug 6, 2018 · 22 comments


sdrap commented Aug 6, 2018

I have a dataframe with 20 years of daily data. iplot can render any of the plots without trouble.

However, I can only use orca when I slice the dataframe down to less than 4 years of data.
It fails both in the notebook and on the command line (from a dumped JSON file) with the following error:

A JavaScript error occurred in the main process
Uncaught Exception:
TypeError: path must be a string or Buffer
    at Object.fs.mkdirSync (fs.js:891:18)
    at main (/usr/local/lib/node_modules/orca/bin/graph.js:105:8)
    at Object.<anonymous> (/usr/local/lib/node_modules/orca/bin/orca_electron.js:73:25)
    at Object.<anonymous> (/usr/local/lib/node_modules/orca/bin/orca_electron.js:99:3)
    at Module._compile (module.js:569:30)
    at Object.Module._extensions..js (module.js:580:10)
    at Module.load (module.js:503:32)
    at tryModuleLoad (module.js:466:12)
    at Function.Module._load (module.js:458:3)
    at loadApplicationPackage (/usr/local/lib/node_modules/electron/dist/resources/default_app.asar/main.js:287:12)

The JSON file is about 250 KB for the 20 years of data.
data.zip


etpinard commented Aug 6, 2018

I suspect you're calling orca as:

orca graph data.zip

which won't work, as we currently only accept path/to/json/files, url/to/json/files or json strings.

Is this indeed the case?


sdrap commented Aug 6, 2018

No, I uploaded it as a zip because GitHub doesn't accept plain JSON attachments. I run it as follows:

orca graph data.json -o test.png --debug

or alternatively

cat data.json | orca graph > test.png

which just hangs.

I don't know exactly what the problem is.


etpinard commented Aug 6, 2018

I think something is up with your data encoding:

var fs = require('fs');
var s = fs.readFileSync('./data.json');
typeof JSON.parse(s)
// => returns 'string'

Orca expects the result of JSON.parse to be an object.


sdrap commented Aug 6, 2018

I am a bit puzzled by the JSON. I dumped it to the file using

A = df['S'].iplot(**kwargs)
file = 'data.json'
with open(file, 'w') as outfile:
    json.dump(json.dumps(A, cls=plotly.utils.PlotlyJSONEncoder), outfile)

Strangely enough, the file starts with "{ and has a backslash in front of every double quote.
I edited the file to remove these characters, producing a working JSON that I can process through cat.

I have here the two JSONs (the one-year and the 20-year) and process them with

cat data.json | orca graph > test.png

The small one gets through; the large one does not.

I also copy-pasted the contents of the JSON file and passed it on the command line. However, I get this error from cat:

bash: /bin/cat: Argument list too long

I thought it could be related to my system's ARG_MAX value, which is capped at 2097152. However, when I count the number of characters in the data, it is only 255272.

In the Python notebook, with the command

from subprocess import call

A = df['S'].iplot(**kwargs)
B = json.dumps(A, cls=plotly.utils.PlotlyJSONEncoder)
file = '-o test.png'
call(['orca', 'graph', B, file])

I always get the error [Errno 7] Argument list too long: 'orca', while

A = df['2018']['S'].iplot(**kwargs)
B = json.dumps(A, cls=plotly.utils.PlotlyJSONEncoder)
file = '-o test.png'
call(['orca', 'graph', B, file])

goes through.

Terribly sorry to disturb you; this may well be a problem outside the scope of orca.
data2.zip
data.zip


etpinard commented Aug 7, 2018

Terribly sorry to disturb you; this may well be a problem outside the scope of orca.

No worries at all. The orca CLI is bound to have a few rough edges at the moment. Thanks very much for writing in.

As mentioned in #104 (comment), I'd recommend first saving large JSONs to a temporary file.

Now, your snippet

A = df['S'].iplot(**kwargs)
file = 'data.json'
with open(file, 'w') as outfile:
    json.dump(json.dumps(A, cls=plotly.utils.PlotlyJSONEncoder), outfile)

seems odd. Wouldn't

A = df['S'].iplot(**kwargs)
file = 'data.json'
with open(file, 'w') as outfile:
    json.dump(A, outfile, cls=plotly.utils.PlotlyJSONEncoder)

suffice?
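For illustration, a minimal Python sketch (with a hypothetical figure dict) of why the double dump produces a JSON string rather than an object, which is also why JSON.parse returned 'string' above:

import json

fig = {"data": [{"y": [1, 2, 3]}]}       # hypothetical figure dict

once = json.dumps(fig)                   # '{"data": [{"y": [1, 2, 3]}]}'
twice = json.dumps(once)                 # adds the leading "{ and the escaped \" quotes
print(type(json.loads(once)).__name__)   # dict -> what orca expects
print(type(json.loads(twice)).__name__)  # str  -> what orca was getting here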


sdrap commented Aug 7, 2018

Many thanks, my dump of a dump was totally dumb. Your solution provides the right data format to cat into the pipeline.

However, the problem remains the same: the 1-year dataset can be processed; the 20-year one cannot.

I really don't know where the problem comes from, and I don't know how to pass the debug flag when piping:

cat data.json | orca graph > test.png

If I can be of any help running some commands or tests, do not hesitate to ask.
I'm attaching once again a correct dump of the two datasets (small and large):
datalarge.zip
datasmall.zip


etpinard commented Aug 7, 2018

I got it to work using:

orca graph datalarge.json

[screenshot]

Unfortunately, from Python this means dumping your figure object into a temporary file. But as discussed in #104, piping very large JSONs into orca does not scale well: it will always be slow, as we have to wait for the full JSON to be in memory. In other words, we can't start creating a graph from a partial JSON chunk.

Luckily, @jonmmease is creating an official Python wrapper for orca that should handle all the temporary-file messiness.
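For reference, a minimal sketch of the temporary-file approach (assuming the orca CLI is on the PATH; the figure dict here is hypothetical, and a real figure would be serialized with plotly.utils.PlotlyJSONEncoder as above):

import json
import os
import subprocess
import tempfile

fig = {"data": [{"y": [1, 2, 3]}]}  # hypothetical figure dict

# Dump the figure to a temporary file and hand orca the *path* rather than the
# JSON string, sidestepping both the stdin pipe and the ARG_MAX argument limit.
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    json.dump(fig, f)
    path = f.name
try:
    subprocess.run(["orca", "graph", path, "-o", "test.png"], check=True)
finally:
    os.remove(path)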


sdrap commented Aug 7, 2018

Oh yes! That's great, I got it to work the way you did. I will use this solution with temporary files, since piping doesn't work in the notebook. It's just a matter of writing a small script to handle all the temp files.

Many thanks, I had been waiting a long time for this export solution and it is really nice :).

etpinard changed the title from "Orca hanging on large json files" to "Orca hanging when large JSONs are piped in" on Aug 7, 2018
@jonmmease

@etpinard, a thought just occurred to me. Do you have a sense of what it would take to launch orca in server mode (as a Python subprocess) and then send requests to it from Python? This would save the orca startup time (once the server process is launched the first time) and avoid the temporary-file business.


etpinard commented Aug 8, 2018

Do you have a sense of what it would take to launch orca in server mode (as a Python subprocess) and then send requests to it from Python?

It shouldn't be too hard if you'd like to experiment. The server part of orca predates orca itself. Taking a look at our orca serve tests is probably the best way to get going. As always, let me know if you have any questions.

Note also that orca graph accepts multiple inputs, e.g.

orca graph fig.json fig1.json fig2.json

# which can also be saved in a directory e.g.
orca graph fig.json fig1.json fig2.json -d orca-outputs/

So to improve performance, one could generate all the JSON files to be exported, then call orca on all those files at once, as sketched below.
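A sketch of that batching idea from Python (the figure dicts are hypothetical; -d is the output-directory flag shown above):

import json
import subprocess

figures = [{"data": [{"y": [1, 2, 3]}]},  # hypothetical figure dicts
           {"data": [{"y": [3, 1, 2]}]}]

# Write each figure to its own JSON file...
paths = []
for i, fig in enumerate(figures):
    path = "fig{}.json".format(i)
    with open(path, "w") as f:
        json.dump(fig, f)
    paths.append(path)

# ...then export them all in a single orca invocation.
subprocess.run(["orca", "graph"] + paths + ["-d", "orca-outputs/"], check=True)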


jonmmease commented Aug 8, 2018

Thanks @etpinard, I'll take a look.

Yeah, if we go the temp-file approach, I was planning to work out an API to allow users to batch-convert collections of figures in one go. The --parallel-limit option applies to the batch conversion case, right?


etpinard commented Aug 8, 2018

The --parallel-limit option applies to the batch conversion case, right?

Yep, the --parallel-limit (or --parallelLimit) CLI option sets the limit on the number of parallel tasks run. Its default value is 1.

One note on parallelization: no matter the --parallel-limit value set, orca only creates one Electron instance. Parallelization is especially productive for exporting plotly.js graphs (except PDF and EPS exports), where only one browser window is created regardless of the --parallel-limit value, since we can create multiple graph divs on the same page and export them individually (with Plotly.toImage(gd)). For other export types, parallelization leads to the creation of more browser windows, which can slow down the process in extreme cases.
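For example, a batch export with a higher task limit might look like this from Python (file names hypothetical; flags as named above):

import subprocess

# Export three figures, allowing up to 4 export tasks to run in parallel
# (orca still creates a single Electron instance, per the note above).
subprocess.run(
    ["orca", "graph", "fig.json", "fig1.json", "fig2.json",
     "-d", "orca-outputs/", "--parallel-limit", "4"],
    check=True,
)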

@jonmmease

Image conversion with orca server from Python!

[screenshot]

I haven't tried any large graphs yet, but for small stuff it's impressively responsive!

Here are the only two issues I see at the moment:

  1. Launching the server on my MacBook Pro starts up an "orca" process plus 7 "orca Helper" processes that consume close to 400 MB of memory. Is there any way to control the number of helpers?

[screenshot]

  2. The server processes don't shut down when I use the terminate or kill methods on the subprocess:

# Shutdown process with `SIGTERM`
orca_proc.terminate()
# Shutdown process with `SIGKILL`
orca_proc.kill()

I'll see what I can find on the Python side. Do you expect the server process to respect these signals?
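For context, a sketch of the launch-and-shutdown sequence on the Python side (flags beyond serve omitted; orca_proc is assumed to come from subprocess.Popen):

import subprocess

# Launch `orca serve` as a child process.
orca_proc = subprocess.Popen(["orca", "serve"])

# ... send image-export requests to the server here ...

orca_proc.terminate()          # sends SIGTERM
try:
    orca_proc.wait(timeout=5)  # give it a few seconds to exit cleanly
except subprocess.TimeoutExpired:
    orca_proc.kill()           # escalate to SIGKILL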

@jonmmease

Some rough timing numbers for a single trace (after the server is running, from execution to display in the notebook):

  • Small scatter (non-GL) trace: ~50ms
  • Small scattergl: ~150ms
  • scattergl, 10,000 points: ~200ms
  • scattergl, 100,000 points: ~500ms
  • scattergl, 1,000,000 points: ~3.5s

I think these numbers are awesome! For comparison, using the plotly graph approach in the README with a small plot takes about 1.7s.


etpinard commented Aug 8, 2018

but for small stuff it's impressively responsive!

Great.

"orca" process plus 7 "orca Helper"

Interesting find. From what I'm seeing, only the standalone executable behaves this way; the ./bin/orca.js script does not spin up that many "helpers". This could be related to orca serve booting up one browser window per available export component, but I doubt the idle windows would consume that much memory. We should compare with other Electron-based desktop apps.

The server processes don't shutdown when I use the terminate or kill methods on the subprocess:

orca serve

# and then
<ctrl-c>

seems to kill all processes, though I'm not sure that Ctrl-C (which sends SIGINT) is equivalent to SIGTERM or SIGKILL. Perhaps we'll need to listen for a few more events in bin/serve.js.

scattergl 1,000,000: ~ 3.5s
I think these numbers are awesome!

Fantastic 🎉

@jonmmease

On my side, when I run ./bin/orca.js serve I get the same number of child processes; they're just named Electron rather than orca:

[screenshot]


jonmmease commented Aug 8, 2018

Here's a clue: the returned subprocess PID is a bash process, not the main Electron process. If I run os.kill(electron_pid, ...), the shutdown works.

[screenshot]

Ohhh, it's the orca.sh wrapper script that's getting killed, which isn't killing the orca process. Getting there...

@jonmmease

OK, I figured out a solution based on this article: http://veithen.github.io/2014/11/16/sigterm-propagation.html

In our wrapper bash script, we basically just need to prefix the call to orca with exec. Then the bash process becomes the orca process, and the signals sent from Python make it to orca.

Since we haven't merged it yet, I'll update this in my conda build PR.
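A self-contained Python demonstration of the exec trick (using sleep as a stand-in for the orca binary; the wrapper file here is hypothetical):

import os
import subprocess
import tempfile
import time

# A wrapper script that uses `exec`: the shell *becomes* the wrapped program,
# so the PID Python holds is the program itself and signals reach it directly.
with tempfile.NamedTemporaryFile("w", suffix=".sh", delete=False) as f:
    f.write("#!/bin/bash\nexec sleep 1000\n")  # stand-in for the call to orca
    wrapper = f.name
os.chmod(wrapper, 0o755)

proc = subprocess.Popen([wrapper])
time.sleep(0.5)
proc.terminate()    # SIGTERM now reaches `sleep` itself...
print(proc.wait())  # ...so this returns -15 and no orphan process is left behind
os.remove(wrapper)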


etpinard commented Aug 8, 2018

On my side, when I run ./bin/orca.js serve I get the same number of child processes

Confirmed. I accidentally ran orca graph, which spins up 3 Electron processes, which looks like the intended behavior. So I'm not sure why we get to 7 processes in orca serve, but it must be related to opening one window per component. Perhaps we could add a --graph-only flag to orca serve, or something fancier, to reduce that number down to 3?

For comparison, orca serve gives:

[screenshot]

on Ubuntu 18.04, which is significantly less memory than for @jonmmease 🤔

@jonmmease

Are the other components the dashboard/dash/thumbnail parts? If so, then yeah, a --graph-only option might be a nice way to go.

Based on my experiments today, I think the server approach is going to be cleaner (no temp files) and provide a better user experience (more responsive). So I think it is probably worth looking into what it would take to trim the process count down by a few.

@jonmmease

Well, I took a look, and it seemed pretty straightforward, and then I ended up with a PR 🙂 #112

@jonmmease

Closing as #112 was merged months ago.
