Dask - Yarn with jupyter notebook as client #60
Comments
I think this should be possible as things stand now. Knit creates a Py4J gateway (similar to Spark) locally and communicates directly with YARN/HDFS. If yarn-site.xml and hdfs-site.xml are configured properly and you have access to the remote machines, things should "just work" from your jupyter machine. Have you tried running knit commands and seen them fail? If so, can you post those failures?

Here is where we run the

Also, @mrocklin and I are working on a nicer dask interface: https://github.com/dask/dask-yarn. Currently, things are stalled on getting TravisCI set up nicely.
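Since knit discovers the cluster from the Hadoop config files, a quick sanity check on the client machine is to parse yarn-site.xml directly and confirm it points at the remote ResourceManager. A minimal sketch using only the standard library; the file path and property name shown in the comment are the usual Hadoop defaults, not something knit requires:

```python
import xml.etree.ElementTree as ET

def get_hadoop_property(conf_file, name):
    """Return the value of a property from a Hadoop *-site.xml file, or None."""
    root = ET.parse(conf_file).getroot()
    for prop in root.findall("property"):
        if prop.findtext("name") == name:
            return prop.findtext("value")
    return None

# Example check (path is the typical default location):
# get_hadoop_property("/etc/hadoop/conf/yarn-site.xml",
#                     "yarn.resourcemanager.address")
```

If this returns a localhost address rather than the cluster's ResourceManager, the client-side config is the problem rather than knit itself.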
Thanks for the answer. I was using the last release (0.1.1). It now seems possible to handle that case. I installed the github version (master) and got somewhat further ... but I still get a permission issue I haven't been able to solve yet (see below). FYI, I did not get that issue when working on the cluster only, i.e. without a remote machine (with knit release 0.1.1). I'll let you know if I can solve this.

```
/home/id843828/.conda/envs/blaze_server/lib/python2.7/site-packages/py4j/protocol.pyc in get_return_value(answer, gateway_client, target_id, name)
Py4JJavaError: An error occurred while calling t.start.
```
OK. The above issue was because of a wrong assumption on my side. BTW, as I said, Hadoop and YARN must be installed on the client side (the jupyter machine in my case, a remote machine), which might be a bit limiting... or at least confusing, since I don't think this is mentioned anywhere in the docs (sorry if I missed something). Despite my comment ;) I really enjoy the project!! Thx a lot for that. And again, it works just great when running directly on the cluster (i.e., without a remote machine). Looking forward to hearing your suggestions ;)
Thanks @jpoullet2000 for reporting. This looks like knit is trying to write the permissions for user:
Knit is being replaced with Skein (https://github.com/jcrist/skein), a more general library for deploying applications on YARN; dask-yarn (https://github.com/dask/dask-yarn) now uses it. The issue of starting a cluster from a machine that is not an edge node still stands (see jcrist/skein#28). I'm not sure if/how other tools (Spark?) support this, as you'd need access to the ResourceManager from outside the cluster, which is uncommon from what I understand. One option would be to use paramiko to automatically run the required commands remotely and set up the ssh tunnels, but I'm not sure if that's worth it. If you have thoughts on this, I'd love to hear your feedback on that issue.

As for the Hadoop permission issues, those should all be solved in Skein. The new release of dask-yarn (https://dask-yarn.readthedocs.io/en/latest/) uses it and should be much more resilient to different Hadoop configurations.

Closing.
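For anyone reading along, the "run commands remotely and set up the ssh tunnels" option above can be sketched as building the ssh invocation that forwards a local port to a service inside the cluster while running a command on the edge node. This is purely illustrative (host names and ports are placeholders); a real implementation would use paramiko's SSHClient rather than shelling out:

```python
def ssh_tunnel_command(edge_node, local_port, remote_host, remote_port, remote_cmd):
    """Build an ssh invocation (as an argv list) that forwards
    local_port -> remote_host:remote_port via edge_node and runs
    remote_cmd there. Illustrative only; paramiko would do this
    programmatically instead of via the ssh binary.
    """
    forward = f"{local_port}:{remote_host}:{remote_port}"
    return ["ssh", "-L", forward, edge_node, remote_cmd]

# e.g. forward the Dask scheduler port while launching the scheduler:
# ssh_tunnel_command("edge01", 8786, "edge01", 8786, "dask-scheduler")
```

With such a tunnel up, the jupyter machine could talk to `localhost:8786` without having direct network access to the ResourceManager or scheduler.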
Context:
I would like, from my jupyter notebook, to start up a dask scheduler on the cluster namenode, and then add some dask workers in YARN containers.
Working directly on the cluster (no jupyter notebook), it works just fine. However, I'd like to drive this from a remote machine. The machine with the jupyter notebook could read the yarn-site.xml from the cluster, run the "hadoop jar" command remotely to spawn the dask workers, and eventually run "yarn application -kill " when the job is done.
It seems that in knit all these commands are supposed to be installed on the remote machine, which is a bit restrictive.
What would you suggest?
Thx... and happy 2017 ;)
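The workflow described above (start a scheduler, spawn workers in YARN containers, kill the application when done) can be sketched as the ordered commands a remote driver would issue on the cluster. Everything here is a placeholder for illustration: the port, the application id, and the bare `dask-worker` line stand in for what knit's `hadoop jar` step actually does:

```python
def yarn_dask_commands(namenode, scheduler_port=8786, app_id="<application-id>"):
    """Ordered shell commands a remote driver (e.g. a jupyter machine)
    would run on the cluster, per the workflow above.
    All values are illustrative placeholders, not knit's actual CLI.
    """
    return [
        # 1. start the dask scheduler on the namenode
        f"dask-scheduler --port {scheduler_port}",
        # 2. point workers at it (knit's 'hadoop jar' step wraps this
        #    and runs it inside YARN containers)
        f"dask-worker tcp://{namenode}:{scheduler_port}",
        # 3. tear down the YARN application when the job is done
        f"yarn application -kill {app_id}",
    ]
```

The open question in this issue is step 2: whether those commands must run on an edge node with Hadoop installed, or can be driven entirely from the remote notebook machine.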