This repository has been archived by the owner on Aug 13, 2018. It is now read-only.

Dask - Yarn with jupyter notebook as client #60

Closed
jpoullet2000 opened this issue Jan 3, 2017 · 5 comments

Comments

@jpoullet2000

Context:

  • cluster where YARN jobs should be run
  • jupyter notebook on a different machine (not on the cluster)

I would like to start a dask scheduler on the cluster namenode from my jupyter notebook, and then add some dask workers in YARN containers.
Working only on the cluster (no jupyter notebook), it works just fine. However, I'd like to drive this from a remote machine. The machine with the jupyter notebook could read the yarn-site.xml from the cluster, run the "hadoop jar" command remotely to spawn the dask workers, and eventually run "yarn application -kill " when the job is done.
It seems that in "knit" all these commands are supposed to be installed on the remote machine, which is a bit restrictive.

What would you suggest?
Thx... and happy 2017 ;)
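For reference, the client side of the setup described above is just a matter of reaching the scheduler's address from the notebook; a minimal sketch (the hostname and port below are placeholders, not anything from my cluster):

```python
# Minimal sketch of the notebook side, assuming a dask scheduler is
# already running on the cluster namenode (host/port are placeholders).
def scheduler_address(host, port=8786):
    """Address that both the YARN-hosted workers and the notebook connect to."""
    return "tcp://{}:{}".format(host, port)

addr = scheduler_address("namenode.example.com")
print(addr)

# With a reachable scheduler one would then do:
#   from dask.distributed import Client
#   client = Client(addr)
```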

@quasiben
Member

quasiben commented Jan 3, 2017

I think this should be possible as things stand now. Knit creates a Py4J gateway locally (similar to Spark) and proceeds to communicate directly with YARN/HDFS. If yarn-site.xml and hdfs-site.xml are configured properly and you have access to the remote machines, I think things should "just work" from your jupyter machine. Have you tried running knit commands and seen things fail? If so, can you post those failures?

Here is where we run the hadoop jar command: https://github.com/dask/knit/blob/master/knit/core.py#L235
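Driving this remotely yourself would amount to building and shipping that same command; a hedged sketch (the jar path, main class, and flags below are placeholders, not knit's actual names):

```python
# Sketch of a "hadoop jar" invocation like the one knit issues; the jar
# path, main class, and flags are placeholders, not knit's actual names.
def hadoop_jar_command(jar_path, main_class, extra_args=()):
    """Build the argv for launching a YARN application via 'hadoop jar'."""
    return ["hadoop", "jar", jar_path, main_class] + list(extra_args)

cmd = hadoop_jar_command("knit.jar", "knit.Client", ["--numContainers", "2"])
print(" ".join(cmd))
```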

Also, @mrocklin and I are working on a nicer dask interface: https://github.com/dask/dask-yarn. Currently, things are stalled on getting TravisCI setup nicely.

@jpoullet2000
Author

Thanks for the answer. I was using the latest release (0.1.1). It now seems possible to handle that case. I installed the github version (master) and got somewhat further ... but I still get a permission issue I haven't been able to solve yet (see below). FYI, I did not get that issue when working on the cluster only, i.e. without a remote machine (with knit release 0.1.1). I'll let you know if I can solve this.

/home/id843828/.conda/envs/blaze_server/lib/python2.7/site-packages/py4j/protocol.pyc in get_return_value(answer, gateway_client, target_id, name)
298 raise Py4JJavaError(
299 'An error occurred while calling {0}{1}{2}.\n'.
--> 300 format(target_id, '.', name), value)
301 else:
302 raise Py4JError(

Py4JJavaError: An error occurred while calling t.start.
: org.apache.hadoop.security.AccessControlException: Permission denied: user=id843828, access=WRITE, inode="/user/id843828/.knitDeps":hdfs:hdfs:drwxr-xr-x
at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:319)
at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:292)
at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:213)
at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:190)
at org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPermission(FSDirectory.java:1771)
at org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPermission(FSDirectory.java:1755)
at org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkAncestorAccess(FSDirectory.java:1738)
at org.apache.hadoop.hdfs.server.namenode.FSDirMkdirOp.mkdirs(FSDirMkdirOp.java:71)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirs(FSNamesystem.java:3905)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.mkdirs(NameNodeRpcServer.java:1048)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.mkdirs(ClientNamenodeProtocolServerSideTranslatorPB.java:622)
at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2151)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2147)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2145)

@jpoullet2000
Author

jpoullet2000 commented Jan 5, 2017

OK. The above issue was due to a wrong assumption on my part.
I had set
os.environ['HADOOP_CONF_DIR']
os.environ['YARN_CONF_DIR']
thinking knit would pick them up somewhere ;). That seems not to be the case, so I was pointing to the cluster defined in my /usr/hdp/current/hadoop-client/conf directory (where I obviously don't have write permissions ;)).
I want my jupyter notebook to be able to spawn dask jobs on different clusters just by changing the HADOOP_CONF_DIR or YARN_CONF_DIR environment variables. This is what I'm already doing successfully with spark (modifying spark-defaults.conf on the fly before starting the spark context). I could probably install yarn in some conda env (the same way I have spark... yes, because I don't have write permissions on /usr/hdp/current/) and do the same trick, but I was wondering whether there would be some nicer way to do it.
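That env-variable trick can be wrapped as a small context manager; a sketch, and note it only helps if the tool launched inside the block actually reads these variables when it starts (which was exactly the wrong assumption for knit here):

```python
import os
from contextlib import contextmanager

@contextmanager
def hadoop_conf(conf_dir):
    """Temporarily point HADOOP_CONF_DIR/YARN_CONF_DIR at another cluster's
    config; only effective if the tool launched inside the block reads
    these variables at startup."""
    saved = {k: os.environ.get(k) for k in ("HADOOP_CONF_DIR", "YARN_CONF_DIR")}
    os.environ["HADOOP_CONF_DIR"] = conf_dir
    os.environ["YARN_CONF_DIR"] = conf_dir
    try:
        yield
    finally:
        # Restore (or unset) whatever was there before.
        for key, value in saved.items():
            if value is None:
                os.environ.pop(key, None)
            else:
                os.environ[key] = value

with hadoop_conf("/path/to/clusterB/conf"):
    print(os.environ["HADOOP_CONF_DIR"])  # cluster B's config is active here
```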

BTW, as I said, Hadoop and YARN must be installed on the client side (the jupyter machine in my case, a remote machine), which might be a bit limiting... or at least confusing, since I don't think this is mentioned anywhere in the docs (sorry if I missed something).

Despite my comment ;) I really enjoy the project!! Thx a lot for that. And again, it works just great when playing directly on the cluster (i.e., without a remote machine).

Looking forward to hearing your suggestions ;)

@quasiben
Member

quasiben commented Jan 5, 2017

Thanks @jpoullet2000 for reporting. It looks like knit is trying to write to /user/id843828/.knitDeps, which is owned by hdfs:hdfs, as user id843828. I'll hopefully be able to dig into this later today.

@jcrist
Member

jcrist commented Jul 3, 2018

Knit is being replaced with Skein (https://github.com/jcrist/skein), a more general library for deploying applications on YARN. dask-yarn (https://github.com/dask/dask-yarn) now uses this.

The issue of starting a cluster from a machine that is not an edge node still stands (see jcrist/skein#28). I'm not sure if/how other tools (spark?) support this, as you'd need access to the resourcemanager from outside the cluster, which is uncommon from what I understand. One option would be to use paramiko to automatically run the required commands remotely and set up the ssh tunnels, but I'm not sure that's worth it. If you have thoughts on this, I'd love to hear your feedback on that issue.
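The paramiko route could look roughly like this; a sketch under the assumption that password-less ssh to an edge node is available (the host, user, and application id below are placeholders):

```python
# Hedged sketch of the paramiko idea: run YARN commands on an edge node
# over ssh. Host, user, and the application id are placeholders.
def kill_app_command(app_id):
    """Command line to stop a YARN application once the job is done."""
    return "yarn application -kill {}".format(app_id)

def run_on_edge_node(host, user, command):
    """Execute a command on the edge node and return its stdout."""
    import paramiko  # imported lazily so the sketch loads without paramiko
    client = paramiko.SSHClient()
    client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    client.connect(host, username=user)
    try:
        stdin, stdout, stderr = client.exec_command(command)
        return stdout.read().decode()
    finally:
        client.close()

print(kill_app_command("application_1483000000000_0001"))
```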

As for the hadoop permissions issues, those should all be solved in skein. The new release of dask-yarn (https://dask-yarn.readthedocs.io/en/latest/) uses this and should be much more resilient to different hadoop configurations.
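For reference, the newer dask-yarn interface looks roughly like this; a sketch, where the packed environment filename and resource figures are illustrative, not prescriptive:

```python
# Sketch of the newer dask-yarn API; the environment archive name and
# resource figures below are illustrative, not prescriptive.
spec = {
    "environment": "environment.tar.gz",  # packed conda env shipped to YARN
    "worker_vcores": 2,
    "worker_memory": "4GiB",
}

def start_cluster(spec):
    # Imported lazily so this sketch loads without dask-yarn installed.
    from dask_yarn import YarnCluster
    from dask.distributed import Client
    cluster = YarnCluster(**spec)
    cluster.scale(4)  # ask YARN for 4 worker containers
    return Client(cluster)

print(sorted(spec))
```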

Closing.

@jcrist jcrist closed this as completed Jul 3, 2018