Avoid shipping jar when using spark as a filesystem #49

Closed
tomerk opened this issue Apr 14, 2015 · 9 comments

tomerk commented Apr 14, 2015

When using Spark as a general-purpose filesystem client to read from and write to HDFS, S3, etc. (e.g. writing observations and reading user weights after a retrain), avoid shipping the JAR! This may not be naively possible in some cases (e.g. user-defined contexts).

Longer-term, this would be fixed by issue #48, and by writing to/from HDFS and other filesystems directly, depending on how closely we decide to tie Velox to specific Spark cluster configurations and file destinations.
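
For context, the JAR shipping in question is the usual pattern of attaching the application assembly when creating a context against the remote cluster; a sketch, where the master URL and jar path are hypothetical placeholders:

import org.apache.spark.{SparkConf, SparkContext}

// Attaching the assembly ships it to every executor on the remote cluster:
// wasted work when the context is only used for filesystem reads and writes.
val conf = new SparkConf()
  .setMaster("spark://ec2-master.example.com:7077") // hypothetical master URL
  .setAppName("velox")
  .setJars(Seq("/path/to/velox-assembly.jar"))      // hypothetical jar path
val sc = new SparkContext(conf)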

@tomerk tomerk added the fixup label Apr 14, 2015

tomerk commented Apr 16, 2015

@shivaram mentioned that it's fine to use Spark to read from and write to a filesystem, but he recommended using a local Spark context that connects to the remote cluster's filesystem, rather than connecting a Spark context to the remote Spark cluster itself. This should solve the issue, because we wouldn't need to ship any JARs.
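
A minimal sketch of that approach (hostname, port, and paths are hypothetical placeholders):

import org.apache.spark.{SparkConf, SparkContext}

// A purely local context: no executors run on the remote cluster, so no JARs ship.
val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("velox-fs-io"))

// Read and write the remote cluster's HDFS directly by URL.
val weights = sc.textFile("hdfs://ec2-master.example.com:9000/velox/users/weights")
weights.saveAsTextFile("hdfs://ec2-master.example.com:9000/velox/users/weights-out")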

shivaram commented Apr 15, 2015

BTW, you can also just try to use the FileSystem API (https://hadoop.apache.org/docs/current/api/org/apache/hadoop/fs/FileSystem.html) -- is that not enough for some of your use cases?


tomerk commented Apr 16, 2015

We were trying to use that before (at least I think that's what Dan was using), but it took much more effort to get it configured and working correctly when connecting to a Spark EC2 cluster.



shivaram commented Apr 15, 2015

To make it connect to and work with the Spark EC2 cluster, you can do something like this:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.FileSystem

// Create a config object that loads the same core-site.xml / hdfs-site.xml as Spark
val config = new Configuration(true)
// Call config.addResource() if required
val fs = FileSystem.get(config)
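
Using the fs handle from the snippet above, reads and writes then go through the FileSystem API directly; the Velox paths here are hypothetical:

import org.apache.hadoop.fs.Path

// Hypothetical paths; streams go against whatever fs.defaultFS points at.
val in = fs.open(new Path("/velox/users/weights"))
val out = fs.create(new Path("/velox/observations/part-0"))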


tomerk commented Apr 16, 2015

Yeah, we did that, but it was trying to connect to the private EC2 DNS of the data nodes, which wasn't working.


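A common workaround for that symptom (assuming the standard HDFS client setting; it was not tried in this thread) is to make the client connect to datanodes by hostname rather than by the private IPs the namenode reports, reusing the config from the snippet above:

// Assumption: dfs.client.use.datanode.hostname is the usual knob for clients
// outside the cluster's network; datanode hostnames must then resolve publicly.
config.setBoolean("dfs.client.use.datanode.hostname", true)
val fs = FileSystem.get(config)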


etrain commented Apr 16, 2015

Security groups and/or VPN setup? These are the conventional ways I've seen this handled.



tomerk commented Apr 16, 2015

The security groups were sufficiently open. I could probably figure out some configuration that works, but right now we're focused on getting Velox to work out of the box with little configuration, and going through a Spark context makes that a little simpler.



@dcrankshaw

Yeah, I was using the FileSystem API. It worked fine when Velox was running on EC2 and could resolve AWS private IP addresses, but when Velox runs outside of EC2 (e.g. on your laptop) we ran into issues. I'm assuming there's a fix, but it's not the highest priority right now. The other advantage of using Spark is that Spark has already done the work of talking to multiple versions of HDFS; we would have to replicate that work in Velox or support only a single Hadoop version.


tomerk commented Apr 21, 2015

Closed by issue #51

@tomerk tomerk closed this as completed Apr 21, 2015