This repository has been archived by the owner on Aug 13, 2018. It is now read-only.

Dask - Yarn with jupyter notebook as client #60

Closed
jpoullet2000 opened this issue Jan 3, 2017 · 5 comments

Comments

@jpoullet2000

Context:

  • cluster where YARN jobs should be run
  • jupyter notebook on a different machine (not on the cluster)

I would like to start a dask scheduler on the cluster namenode from my jupyter notebook, and then add some dask workers in YARN containers.
Working only on the cluster (no jupyter notebook), it works just fine. However, I'd like to drive this from a remote machine. The machine with the jupyter notebook could read the yarn-site.xml from the cluster, run the "hadoop jar" command remotely to spawn the dask workers, and eventually run "yarn application -kill " when the job is done.
It seems that in "knit" all these commands are supposed to be installed on the remote machine, which is a bit restrictive.

What would you suggest?
Thx... and happy 2017 ;)
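For reference, the client side of the setup described above is just a matter of reaching the scheduler's address from the notebook; a minimal sketch (the hostname and port below are placeholders, not anything from my cluster):

```python
# Minimal sketch of the notebook side, assuming a dask scheduler is
# already running on the cluster namenode (host/port are placeholders).
def scheduler_address(host, port=8786):
    """Address that both the YARN-hosted workers and the notebook connect to."""
    return "tcp://{}:{}".format(host, port)

addr = scheduler_address("namenode.example.com")
print(addr)

# With a reachable scheduler one would then do:
#   from dask.distributed import Client
#   client = Client(addr)
```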

@quasiben
Member

quasiben commented Jan 3, 2017

I think this should be possible as things stand now. Knit creates a Py4J gateway locally (similar to Spark) and proceeds to communicate directly with YARN/HDFS. If yarn-site.xml and hdfs-site.xml are configured properly and you have access to the remote machines, I think things should "just work" from your jupyter machine. Have you tried running knit commands and seen things fail? If so, can you post those failures?

Here is where we run the hadoop jar command: https://github.com/dask/knit/blob/master/knit/core.py#L235
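Driving this remotely yourself would amount to building and shipping that same command; a hedged sketch (the jar path, main class, and flags below are placeholders, not knit's actual names):

```python
# Sketch of a "hadoop jar" invocation like the one knit issues; the jar
# path, main class, and flags are placeholders, not knit's actual names.
def hadoop_jar_command(jar_path, main_class, extra_args=()):
    """Build the argv for launching a YARN application via 'hadoop jar'."""
    return ["hadoop", "jar", jar_path, main_class] + list(extra_args)

cmd = hadoop_jar_command("knit.jar", "knit.Client", ["--numContainers", "2"])
print(" ".join(cmd))
```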

Also, @mrocklin and I are working on a nicer dask interface: https://github.com/dask/dask-yarn. Currently, things are stalled on getting TravisCI setup nicely.

@jpoullet2000
Author

Thanks for the answer. I was using the latest release (0.1.1). It now seems possible to handle that case. I installed the github version (master) and got somewhat further ... but I still get a permission issue I haven't been able to solve yet (see below). FYI, I did not get that issue when working on the cluster only, i.e. without a remote machine (with knit release 0.1.1). I'll let you know if I can solve this.

/home/id843828/.conda/envs/blaze_server/lib/python2.7/site-packages/py4j/protocol.pyc in get_return_value(answer, gateway_client, target_id, name)
298 raise Py4JJavaError(
299 'An error occurred while calling {0}{1}{2}.\n'.
--> 300 format(target_id, '.', name), value)
301 else:
302 raise Py4JError(

Py4JJavaError: An error occurred while calling t.start.
: org.apache.hadoop.security.AccessControlException: Permission denied: user=id843828, access=WRITE, inode="/user/id843828/.knitDeps":hdfs:hdfs:drwxr-xr-x
at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:319)
at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:292)
at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:213)
at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:190)
at org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPermission(FSDirectory.java:1771)
at org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPermission(FSDirectory.java:1755)
at org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkAncestorAccess(FSDirectory.java:1738)
at org.apache.hadoop.hdfs.server.namenode.FSDirMkdirOp.mkdirs(FSDirMkdirOp.java:71)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirs(FSNamesystem.java:3905)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.mkdirs(NameNodeRpcServer.java:1048)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.mkdirs(ClientNamenodeProtocolServerSideTranslatorPB.java:622)
at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2151)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2147)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2145)

@jpoullet2000
Author

jpoullet2000 commented Jan 5, 2017

OK. The above issue was due to a wrong assumption on my part.
I had set
os.environ['HADOOP_CONF_DIR']
os.environ['YARN_CONF_DIR']
thinking knit would pick them up somewhere ;). That seems not to be the case, so I was pointing to the cluster defined in my /usr/hdp/current/hadoop-client/conf directory (where I obviously don't have write permissions ;)).
I want my jupyter notebook to be able to spawn dask jobs on different clusters just by changing the HADOOP_CONF_DIR or YARN_CONF_DIR environment variables. This is what I'm already doing successfully with spark (modifying spark-defaults.conf on the fly before starting the spark context). I could probably install yarn in some conda env (the same way I have spark... yes, because I don't have write permissions on /usr/hdp/current/) and do the same trick, but I was wondering whether there would be some nicer way to do it.
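That env-variable trick can be wrapped as a small context manager; a sketch, and note it only helps if the tool launched inside the block actually reads these variables when it starts (which was exactly the wrong assumption for knit here):

```python
import os
from contextlib import contextmanager

@contextmanager
def hadoop_conf(conf_dir):
    """Temporarily point HADOOP_CONF_DIR/YARN_CONF_DIR at another cluster's
    config; only effective if the tool launched inside the block reads
    these variables at startup."""
    saved = {k: os.environ.get(k) for k in ("HADOOP_CONF_DIR", "YARN_CONF_DIR")}
    os.environ["HADOOP_CONF_DIR"] = conf_dir
    os.environ["YARN_CONF_DIR"] = conf_dir
    try:
        yield
    finally:
        # Restore (or unset) whatever was there before.
        for key, value in saved.items():
            if value is None:
                os.environ.pop(key, None)
            else:
                os.environ[key] = value

with hadoop_conf("/path/to/clusterB/conf"):
    print(os.environ["HADOOP_CONF_DIR"])  # cluster B's config is active here
```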

BTW, as I said, Hadoop and YARN must be installed on the client side (the jupyter machine in my case, a remote machine), which might be a bit limiting... or at least confusing, since I don't think this is mentioned anywhere in the docs (sorry if I missed something).

Despite my comment ;) I really enjoy the project!! Thx a lot for that. And again, it works just great when playing directly on the cluster (i.e., without a remote machine).

Looking forward to hearing your suggestions ;)

@quasiben
Member

quasiben commented Jan 5, 2017

Thanks @jpoullet2000 for reporting. It looks like knit is trying to write to /user/id843828/.knitDeps, which is owned by hdfs:hdfs, as user id843828. I'll hopefully be able to dig into this later today.

@jcrist
Member

jcrist commented Jul 3, 2018

Knit is being replaced with Skein (https://github.com/jcrist/skein), a more general library for deploying applications on YARN. dask-yarn (https://github.com/dask/dask-yarn) now uses this.

The issue of starting a cluster from a machine that is not an edge node still stands (see jcrist/skein#28). I'm not sure if/how other tools (spark?) support this, as you'd need access to the resourcemanager from outside the cluster, which is uncommon from what I understand. One option would be to use paramiko to automatically run the required commands remotely and set up the ssh tunnels, but I'm not sure that's worth it. If you have thoughts on this, I'd love to hear your feedback on that issue.
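The paramiko route could look roughly like this; a sketch under the assumption that password-less ssh to an edge node is available (the host, user, and application id below are placeholders):

```python
# Hedged sketch of the paramiko idea: run YARN commands on an edge node
# over ssh. Host, user, and the application id are placeholders.
def kill_app_command(app_id):
    """Command line to stop a YARN application once the job is done."""
    return "yarn application -kill {}".format(app_id)

def run_on_edge_node(host, user, command):
    """Execute a command on the edge node and return its stdout."""
    import paramiko  # imported lazily so the sketch loads without paramiko
    client = paramiko.SSHClient()
    client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    client.connect(host, username=user)
    try:
        stdin, stdout, stderr = client.exec_command(command)
        return stdout.read().decode()
    finally:
        client.close()

print(kill_app_command("application_1483000000000_0001"))
```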

As for the hadoop permissions issues, those should all be solved in skein. The new release of dask-yarn (https://dask-yarn.readthedocs.io/en/latest/) uses this and should be much more resilient to different hadoop configurations.
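For reference, the newer dask-yarn interface looks roughly like this; a sketch, where the packed environment filename and resource figures are illustrative, not prescriptive:

```python
# Sketch of the newer dask-yarn API; the environment archive name and
# resource figures below are illustrative, not prescriptive.
spec = {
    "environment": "environment.tar.gz",  # packed conda env shipped to YARN
    "worker_vcores": 2,
    "worker_memory": "4GiB",
}

def start_cluster(spec):
    # Imported lazily so this sketch loads without dask-yarn installed.
    from dask_yarn import YarnCluster
    from dask.distributed import Client
    cluster = YarnCluster(**spec)
    cluster.scale(4)  # ask YARN for 4 worker containers
    return Client(cluster)

print(sorted(spec))
```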

Closing.

@jcrist jcrist closed this as completed Jul 3, 2018