Add Spark support #15

vitalyg · 2014-12-18T00:42:22Z

Most of the code is generic enough to run on a different framework other than Scalding. However, there is some dependency on the new Execution module of Scalding that I couldn't completely get around.
Is it possible to refactor the code that will be less Scalding specific, or just explain to me how it all works, and I'll try to do it?

avibryant · 2014-12-20T18:38:39Z

The only dependency on Execution (or anything Scalding specific in general) is in the com.stripe.brushfire.scalding package, which is to say, really, the Trainer class. Although I realize there is probably some small amount of Trainer that could be generalized and reused for a Spark implementation, I think the first step is just to build a completely new Trainer for Spark, which uses Spark idioms, and then see how similar they actually are.

cc @non who was also looking into this...

avibryant · 2014-12-20T18:43:55Z

(But if there is anything specific I can explain about how the Scalding version works I'd be happy to do so)

vitalyg · 2014-12-22T21:22:49Z

@avibryant I was referring to the scalding package. The rest is very general. Also, Scalding and Spark are very interchangeable, but unfortunately there is no Spark equivalent for Scalding's new Execution feature (which is awesome by the way).

I would like to try to build a new Trainer class for Spark, but unfortunately, I am not sure I follow how the Scalding version is evaluated. What happens after what and what happens in parallel. But maybe if we can go over the code, I can translate it and then we can even have nice benchmarks to compare.

avibryant · 2014-12-22T21:32:42Z

@vitalyg I'd be happy to go over it with you, maybe over IRC or something next week some time?

The simplest thing to start with is updateTargets, which is used for constructing the root node of an empty tree, and can also be used to update the leaf distributions for an existing tree from new training data.

The idea here is that you pass over the training data once:
https://github.com/stripe/brushfire/blob/master/src/main/scala/com/stripe/brushfire/scalding/Trainer.scala#L76

For each tree we're building, we find out how many times to include this instance in that tree:
https://github.com/stripe/brushfire/blob/master/src/main/scala/com/stripe/brushfire/scalding/Trainer.scala#L79

Then, that many times, we find the leaf corresponding to that instance in that tree, and we emit a key -> value pair which is (treeIndex, leafIndex) -> instance target:
https://github.com/stripe/brushfire/blob/master/src/main/scala/com/stripe/brushfire/scalding/Trainer.scala#L79

Then, we can, in parallel, sum up all of those values.

Then we group just by key to bring together all of the summed targets for a tree, by leafIndex:
https://github.com/stripe/brushfire/blob/master/src/main/scala/com/stripe/brushfire/scalding/Trainer.scala#L84

Then we (in parallel, but only with as much parallelism as we have trees) modify the trees to have the new targets:
https://github.com/stripe/brushfire/blob/master/src/main/scala/com/stripe/brushfire/scalding/Trainer.scala#L92

At the end we write out the new trees.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add Spark support #15

Add Spark support #15

vitalyg commented Dec 18, 2014

avibryant commented Dec 20, 2014

avibryant commented Dec 20, 2014

vitalyg commented Dec 22, 2014

avibryant commented Dec 22, 2014

Add Spark support #15

Add Spark support #15

Comments

vitalyg commented Dec 18, 2014

avibryant commented Dec 20, 2014

avibryant commented Dec 20, 2014

vitalyg commented Dec 22, 2014

avibryant commented Dec 22, 2014