Add Spark support #15
Comments
The only dependency on Execution (or anything Scalding specific in general) is in the cc @non who was also looking into this...
(But if there is anything specific I can explain about how the Scalding version works I'd be happy to do so)
@avibryant I was referring to the I would like to try to build a new
@vitalyg I'd be happy to go over it with you, maybe over IRC or something next week some time?

The simplest thing to start with is updateTargets, which is used for constructing the root node of an empty tree, and can also be used to update the leaf distributions for an existing tree from new training data. The idea here is that you pass over the training data once:

1. For each tree we're building, we find out how many times to include this instance in that tree.
2. Then, that many times, we find the leaf corresponding to that instance in that tree, and we emit a key -> value pair which is (treeIndex, leafIndex) -> instance target.
3. Then we can, in parallel, sum up all of those values.
4. Then we group just by key to bring together all of the summed targets for a tree, by leafIndex.
5. Then we (in parallel, but only with as much parallelism as we have trees) modify the trees to have the new targets.
6. At the end we write out the new trees.
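To make the updateTargets pass described above more concrete, here is a minimal sketch using plain Scala collections rather than Scalding or Spark. Every name in it (`Instance`, `Tree`, `leafFor`, `timesInTree`, `mergeTargets`) is a hypothetical stand-in and not the project's actual API. It also assumes that a target is a map of label counts and that new targets replace a tree's old ones; the real code may differ on both points.

```scala
// Hypothetical stand-ins for illustration; not the project's actual types.
final case class Instance(features: Map[String, Double], target: Map[String, Long])

// A tree is reduced to a leaf-lookup function plus the targets stored per leaf.
final case class Tree(
    leafFor: Map[String, Double] => Int,
    targets: Map[Int, Map[String, Long]]
)

object UpdateTargetsSketch {
  // Assumption: targets are label counts, so "summing" means adding counts per label.
  def mergeTargets(a: Map[String, Long], b: Map[String, Long]): Map[String, Long] =
    b.foldLeft(a) { case (acc, (label, n)) => acc.updated(label, acc.getOrElse(label, 0L) + n) }

  def updateTargets(
      trees: Vector[Tree],
      data: Seq[Instance],
      timesInTree: (Instance, Int) => Int // hypothetical sampler: copies of an instance per tree
  ): Vector[Tree] = {
    // Single pass over the data: emit ((treeIndex, leafIndex), target) once per sampled copy.
    val emitted: Seq[((Int, Int), Map[String, Long])] =
      for {
        instance <- data
        (tree, treeIndex) <- trees.zipWithIndex
        _ <- 1 to timesInTree(instance, treeIndex)
      } yield ((treeIndex, tree.leafFor(instance.features)), instance.target)

    // Sum the targets for each (treeIndex, leafIndex) key.
    val summed: Map[(Int, Int), Map[String, Long]] =
      emitted.groupBy(_._1).map { case (key, pairs) =>
        key -> pairs.map(_._2).reduce(mergeTargets)
      }

    // Regroup by treeIndex alone, keeping leafIndex -> summed target inside each group.
    val byTree: Map[Int, Map[Int, Map[String, Long]]] =
      summed.groupBy { case ((treeIndex, _), _) => treeIndex }.map { case (treeIndex, entries) =>
        treeIndex -> entries.map { case ((_, leafIndex), target) => leafIndex -> target }
      }

    // Give each tree its new targets (assumption: old targets are replaced, not merged).
    trees.zipWithIndex.map { case (tree, i) =>
      tree.copy(targets = byTree.getOrElse(i, Map.empty))
    }
  }
}
```

On Spark, the same shape would presumably map onto a `flatMap` over the training data followed by `reduceByKey` on the (treeIndex, leafIndex) key and a second grouping by treeIndex, but that mapping is a guess and not something confirmed in this thread.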
Most of the code is generic enough to run on a framework other than Scalding. However, there is some dependency on Scalding's new Execution module that I couldn't completely get around.
Is it possible to refactor the code to be less Scalding-specific? Alternatively, if you could explain how it all works, I'll try to do it myself.