Avoid shipping jar when using spark as a filesystem #49
When treating Spark as a general filesystem to read from & write to HDFS, S3, etc. (e.g. writing observations & reading user weights following a retrain), avoid shipping the JAR! This may not be naively possible in some cases (e.g. user-defined contexts).

Longer-term, this would be fixed by issue #48, and by writing to/from HDFS & other filesystems directly, depending on how closely we decide to tie Velox to specific Spark cluster configurations & file destinations.

Comments
@shivaram mentioned that it's fine to use Spark to read from & write to a filesystem, but he recommended using a local Spark context that connects to the filesystem on the remote Spark cluster, rather than connecting a Spark context to the remote Spark cluster itself. This should solve this issue, because we wouldn't need to ship any jars.
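For concreteness, a minimal sketch of that approach (not from the original thread; the hostname, port, app name, and paths below are placeholders):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// A purely local Spark context: the driver and executors run in-process,
// so no application jar needs to be shipped to a remote cluster.
val sc = new SparkContext(
  new SparkConf().setMaster("local[*]").setAppName("velox-fs-bridge"))

// Read from / write to the remote cluster's HDFS by URL alone.
val weights = sc.textFile("hdfs://ec2-master.example.com:9000/velox/user-weights")
weights.saveAsTextFile("hdfs://ec2-master.example.com:9000/velox/user-weights-snapshot")

sc.stop()
```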
BTW you can also just try to use the FileSystem API (https://hadoop.apache.org/docs/current/api/org/apache/hadoop/fs/FileSystem.html) -- is that not enough for some of your use cases?
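Bare-bones usage of that API looks roughly like this (a sketch, not Velox code; the namenode address and paths are made up):

```scala
import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Connect straight to the cluster's HDFS; no Spark involved.
val fs = FileSystem.get(new URI("hdfs://ec2-master.example.com:9000"), new Configuration())

// Write an observation record.
val out = fs.create(new Path("/velox/observations/part-00000"))
out.writeBytes("user-1,item-42,1.0\n")
out.close()

// Read it back.
val in = fs.open(new Path("/velox/observations/part-00000"))
val line = new java.io.BufferedReader(new java.io.InputStreamReader(in)).readLine()
in.close()
```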
We were trying to use that before (at least I think that's what Dan was using), but it took much more effort to get it configured and working correctly when connecting to a Spark ec2 cluster.
To make it connect / work with the Spark ec2 cluster you can do something like this:
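A plausible sketch of that kind of configuration, assuming the standard spark-ec2 layout where the ephemeral HDFS namenode listens on port 9000 of the master (the hostname below is a placeholder):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val conf = new Configuration()
// Point the default filesystem at the EC2 cluster's namenode.
conf.set("fs.defaultFS", "hdfs://ec2-master.example.com:9000")

val fs = FileSystem.get(conf)
fs.listStatus(new Path("/")).foreach(s => println(s.getPath))
```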
Yeah, we did that, but it was trying to connect to the private ec2 DNS names of the data nodes, which wasn't working.
Security groups and/or a VPN setup? Those are the conventional ways I've seen this handled.
The security groups were sufficiently open. I could probably figure out some configuration that works, but right now we're focusing on getting Velox to work out of the box with little configuration, and going through a Spark context makes that a little bit simpler.
Yeah, I was using the FileSystem API. It worked fine when Velox was running on ec2 and could resolve AWS private IP addresses, but when Velox runs outside of ec2 (e.g. on your laptop) we were running into issues. I'm assuming there is a fix, but it's not the highest priority right now. The other advantage of using Spark is that Spark has already done the work to talk to multiple versions of HDFS; we would have to replicate that work in Velox or only support a single version of Hadoop.
Closed by issue #51