Avoid shipping jar when using spark as a filesystem #49

Closed
tomerk opened this issue Apr 14, 2015 · 9 comments

tomerk commented Apr 14, 2015

When using Spark as a general-purpose filesystem client to read from and write to HDFS, S3, etc. (e.g. writing observations and reading user weights after a retrain), avoid shipping the JAR! This may not be naively possible in some cases (e.g. user-defined contexts).

Longer-term, this would be fixed by issue #48, and by writing to/from HDFS and other filesystems directly, depending on how closely we decide to tie Velox to specific Spark cluster configurations and file destinations.
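
For context, the JAR shipping in question is the usual pattern of attaching the application assembly when creating a context against the remote cluster; a sketch, where the master URL and jar path are hypothetical placeholders:

import org.apache.spark.{SparkConf, SparkContext}

// Attaching the assembly ships it to every executor on the remote cluster:
// wasted work when the context is only used for filesystem reads and writes.
val conf = new SparkConf()
  .setMaster("spark://ec2-master.example.com:7077") // hypothetical master URL
  .setAppName("velox")
  .setJars(Seq("/path/to/velox-assembly.jar"))      // hypothetical jar path
val sc = new SparkContext(conf)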

@tomerk tomerk added the fixup label Apr 14, 2015

tomerk commented Apr 16, 2015

@shivaram mentioned that it's fine to use Spark to read from and write to a filesystem, but he recommended using a local Spark context that connects to the remote cluster's filesystem, rather than connecting a Spark context to the remote Spark cluster itself. This should solve the issue, because we wouldn't need to ship any JARs.
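
A minimal sketch of that approach (hostname, port, and paths are hypothetical placeholders):

import org.apache.spark.{SparkConf, SparkContext}

// A purely local context: no executors run on the remote cluster, so no JARs ship.
val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("velox-fs-io"))

// Read and write the remote cluster's HDFS directly by URL.
val weights = sc.textFile("hdfs://ec2-master.example.com:9000/velox/users/weights")
weights.saveAsTextFile("hdfs://ec2-master.example.com:9000/velox/users/weights-out")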

shivaram commented Apr 15, 2015

BTW, you can also just try to use the FileSystem API (https://hadoop.apache.org/docs/current/api/org/apache/hadoop/fs/FileSystem.html) -- is that not enough for some of your use cases?


tomerk commented Apr 16, 2015

We were trying to use that before (at least I think that's what Dan was using), but it took much more effort to get it configured and working correctly when connecting to a Spark EC2 cluster.



shivaram commented Apr 15, 2015

To make it connect to and work with the Spark EC2 cluster, you can do something like this:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.FileSystem

// Create a config object that loads the same core-site.xml / hdfs-site.xml as Spark
val config = new Configuration(true)
// Call config.addResource() if required
val fs = FileSystem.get(config)
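
Using the fs handle from the snippet above, reads and writes then go through the FileSystem API directly; the Velox paths here are hypothetical:

import org.apache.hadoop.fs.Path

// Hypothetical paths; streams go against whatever fs.defaultFS points at.
val in = fs.open(new Path("/velox/users/weights"))
val out = fs.create(new Path("/velox/observations/part-0"))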


tomerk commented Apr 16, 2015

Yeah, we did that, but it was trying to connect to the private EC2 DNS of the data nodes, which wasn't working.


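A common workaround for that symptom (assuming the standard HDFS client setting; it was not tried in this thread) is to make the client connect to datanodes by hostname rather than by the private IPs the namenode reports, reusing the config from the snippet above:

// Assumption: dfs.client.use.datanode.hostname is the usual knob for clients
// outside the cluster's network; datanode hostnames must then resolve publicly.
config.setBoolean("dfs.client.use.datanode.hostname", true)
val fs = FileSystem.get(config)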


etrain commented Apr 16, 2015

Security groups and/or VPN setup? These are the conventional ways I've seen this handled.



tomerk commented Apr 16, 2015

The security groups were sufficiently open. I could probably figure out some configuration that works, but right now we're focused on getting Velox to work out of the box with little configuration, and going through a Spark context makes that a little simpler.



@dcrankshaw

Yeah, I was using the FileSystem API. It worked fine when Velox was running on EC2 and could resolve AWS private IP addresses, but when Velox runs outside of EC2 (e.g. on your laptop) we ran into issues. I'm assuming there's a fix, but it's not the highest priority right now. The other advantage of using Spark is that Spark has already done the work of talking to multiple versions of HDFS; we would have to replicate that work in Velox or support only a single Hadoop version.


tomerk commented Apr 21, 2015

Closed by issue #51

@tomerk tomerk closed this as completed Apr 21, 2015