WIP: TrainingStep #77
base: master
Conversation
type V1
type V2

def prepare(trees: Map[Int, Tree[K,V,T]], instance: Instance[K,V,T]): Seq[((Int,Int,K1), V1)]
We should make a case class LeafId(forestIndex: Int, leafIndex: Int) or something - would probably be generally useful.
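A rough sketch of what that might look like against the prepare signature above; keying the result by (LeafId, K1) rather than the bare 3-tuple is an assumption:

```scala
// Hypothetical: name the (forest, leaf) pair instead of threading bare Ints around.
case class LeafId(forestIndex: Int, leafIndex: Int)

// prepare could then be keyed by (LeafId, K1) rather than (Int, Int, K1):
def prepare(trees: Map[Int, Tree[K, V, T]], instance: Instance[K, V, T]): Seq[((LeafId, K1), V1)]
```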
I really like the idea overall. Hard to tell if it is too overfit, but I agree that we need to start somewhere! I'm sort of giving some nit-picky comments to start, but will hopefully give some more useful ones as I understand the abstraction better.
val treeMap = trees.zipWithIndex.map{case (t,i) => i->t}.toMap
var sums1 = Map[(Int,Int,step.K1),step.V1]()

trainingData.foreach{instance =>
A bit of golf, but I think this should do roughly the same thing (including laziness):
val sums1 = MapAlgebra.rollupSum(trainingData.iterator.flatMap(step.prepare(treeMap, _)))
Wait - I think I misunderstood rollupSum. Something similar in spirit should exist though :\
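The combinator meant here might be Algebird's MapAlgebra.sumByKey, which sums values per key; a minimal sketch, assuming an implicit Semigroup[step.V1] is in scope:

```scala
import com.twitter.algebird.MapAlgebra

// Streaming per-key sum over the prepared pairs; using an iterator keeps it lazy.
// Assumes an implicit Semigroup[step.V1] is available for the summing.
val sums1: Map[(Int, Int, step.K1), step.V1] =
  MapAlgebra.sumByKey(trainingData.iterator.flatMap(step.prepare(treeMap, _)))
```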
BTW one thing I'm kinda grumpy about is the distinction between …
This is very WIP still, but I've gone forward with the …
attn @tixxit @non
The intent here is to capture the basic mechanics of various training steps - updateTargets, expand, prune, etc - in a way that can be reused in multiple execution environments (local, scalding, spark, ...).
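For concreteness, a hedged sketch of the shape such a step might take, pieced together from the fragments in this PR; only prepare's signature appears verbatim, so the trait name, the role of V2, and any later phases are assumptions:

```scala
// Hypothetical sketch; only `prepare` appears verbatim in this PR. The trait name,
// the role of V2, and any later aggregation phases are assumptions.
trait TrainingStep[K, V, T] {
  type K1 // secondary key within a (tree, leaf)
  type V1 // value summed per (treeIndex, leafIndex, K1)
  type V2 // value for a later phase (assumed)

  // Map a single instance to keyed contributions. The execution environment
  // (local, scalding, spark, ...) is responsible for summing them per key.
  def prepare(trees: Map[Int, Tree[K, V, T]], instance: Instance[K, V, T]): Seq[((Int, Int, K1), V1)]
}
```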
For now, this is all added directly to, and used only by, the Local trainer. The near-term impact is that the local trainer will stream over its input data the same way the distributed trainers do, rather than requiring it all to be loaded into memory. For this PR to be complete, we should move the training steps into their own module (maybe also doing #51) and refactor the scalding trainer to use them.
It's very possible that this is too much or too little abstraction - right now it seems a bit overfit to the needs of the specific training steps and platforms we support, and I suspect it will be brittle going forward. (In fact, featureImportance already doesn't work with this, though I'd argue we should move it to a TreeTraversal-based strategy, which would.) At the same time, I think some approach like this will be valuable going forward, and it's better to start down this path, even imperfectly.