This repository has been archived by the owner on Apr 8, 2021. It is now read-only.

WIP: TrainingStep #77

Open · avibryant wants to merge 10 commits into master
Conversation

avibryant (Contributor)

attn @tixxit @non

The intent here is to capture the basic mechanics of various training steps - updateTargets, expand, prune, etc. - in a way that can be reused in multiple execution environments (local, Scalding, Spark, ...).

For now, this is all added directly to and used only by the Local trainer. The near-term impact is that the local trainer will stream over its input data in the same way that the distributed trainers do, rather than requiring it all to be loaded into memory. For this PR to be complete, we should move the training steps to their own module (maybe also doing #51), and refactor the Scalding trainer to use them.

It's very possible that this is too much or too little abstraction - right now it seems a bit overfit to the needs of the specific training steps and platforms we support, and I suspect it will be brittle going forward. (In fact, featureImportance already doesn't work with this, though I can argue that we should move to a TreeTraversal-based strategy for that, which would.) At the same time, I think some approach like this will be valuable going forward, and it's better to start down this path, even imperfectly.
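To make the shape concrete, here is a minimal, hypothetical sketch of how a map/reduce-flavored step could be driven by a streaming local executor. All names here (TrainingStep, runLocally, update, semigroup) are illustrative guesses, not the PR's actual API; only the prepare signature in the excerpt below is from the diff.

import com.twitter.algebird.Semigroup

// Illustrative only: T stands in for Tree[K,V,T], I for Instance[K,V,T].
trait TrainingStep[T, I] {
  type K1 // per-leaf aggregation key
  type V1 // aggregated statistic
  def semigroup: Semigroup[V1]
  // emit ((treeIndex, leafIndex, key), value) pairs for one instance
  def prepare(trees: Map[Int, T], instance: I): Seq[((Int, Int, K1), V1)]
  // fold the summed per-leaf statistics back into one tree
  def update(tree: T, stats: Map[(Int, K1), V1]): T
}

// A local executor streams over instances once instead of holding them all
// in memory, mirroring what a Scalding or Spark executor would do.
def runLocally[T, I](step: TrainingStep[T, I], trees: Map[Int, T], data: Iterator[I]): Map[Int, T] = {
  implicit val sg: Semigroup[step.V1] = step.semigroup
  val sums = data
    .flatMap(step.prepare(trees, _))
    .foldLeft(Map.empty[(Int, Int, step.K1), step.V1]) { case (acc, (k, v)) =>
      acc.updated(k, acc.get(k).fold(v)(Semigroup.plus(_, v)))
    }
  trees.map { case (i, tree) =>
    i -> step.update(tree, sums.collect { case ((`i`, leaf, k1), v) => (leaf, k1) -> v })
  }
}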

type V1
type V2

def prepare(trees: Map[Int, Tree[K,V,T]], instance: Instance[K,V,T]): Seq[((Int,Int,K1), V1)]
Contributor:
We should make a case class LeafId(forestIndex: Int, leafIndex: Int) or something - would probably be generally useful.
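For illustration, a sketch of the suggested case class; only the name and fields come from the comment above, and the pairing with K1 is my guess at the usage:

// Names the (forest/tree index, leaf index) pair used as the first two
// slots of prepare's (Int, Int, K1) key.
case class LeafId(forestIndex: Int, leafIndex: Int)

// prepare's result could then read Seq[((LeafId, K1), V1)] instead of
// Seq[((Int, Int, K1), V1)].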

tixxit (Contributor) commented Dec 21, 2015

I really like the idea overall. Hard to tell if it is too overfit, but I agree that we need to start somewhere!

I'm sort of giving some nit-picky comments to start, but will hopefully give some more useful ones as I understand the abstraction better.

val treeMap = trees.zipWithIndex.map{case (t,i) => i->t}.toMap
var sums1 = Map[(Int,Int,step.K1),step.V1]()

trainingData.foreach{instance =>
Contributor:

A bit of golf, but I think this should do roughly the same thing (including laziness):

val sums1 = MapAlgebra.rollupSum(trainingData.iterator.flatMap(step.prepare(treeMap, _)))

Contributor:

Wait - I think I misunderstood rollupSum. Something similar in spirit should exist though :\
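For what it's worth, Algebird's MapAlgebra.sumByKey (which sums values grouped by key using a Semigroup) may be the similar-in-spirit function; a toy sketch, assuming that API:

import com.twitter.algebird.MapAlgebra

// Sums values grouped by key; consumes the iterator in one pass.
val pairs = Iterator(("a", 1), ("b", 2), ("a", 3))
val sums: Map[String, Int] = MapAlgebra.sumByKey(pairs)
// sums == Map("a" -> 4, "b" -> 2)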

avibryant (Author)

BTW one thing I'm kinda grumpy about is the distinction between TrainingStep and OutputStep. I wanted each step to be able to produce new trees, some side-channel output, or both (for example, in the long run I'd really like to compute out-of-band error during the expand step). But ValidationStep really does look structurally quite different; it doesn't care about per-leaf or even per-tree, and is instead just computing a single value across the whole forest. A notional FeatureImportanceStep would be similar. So I'm not sure how best to model that possibility.
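To make the structural difference concrete, a hypothetical contrast (all names illustrative, not the PR's actual definitions): a training-style step aggregates per-(tree, leaf) and folds results back into trees, while a validation-style step maps each instance to a single value and sums it across the whole forest:

import com.twitter.algebird.Semigroup

// Illustrative shape only; T stands in for Tree[K,V,T], I for Instance[K,V,T].
trait OutputStep[T, I, O] {
  def semigroup: Semigroup[O]
  // score one instance against the full forest; no per-leaf or per-tree keying
  def prepare(trees: Map[Int, T], instance: I): O
}

// The final output is the Semigroup-sum of O over all instances (e.g. an
// error statistic), not a new set of trees - which is why it doesn't fit
// the per-leaf TrainingStep shape.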

avibryant mentioned this pull request Dec 23, 2015
avibryant (Author)

This is still very WIP, but I've gone forward with the brushfire-training reorg, because having TrainingStep gives me a reusable place to land prune and expandInMemory outside of Tree, which was previously a blocker for that. This probably ends up as an overly large PR, but oh well.
