Option to calculate LOCO for dates/texts by Leaving Out Entire Vector. #418
Conversation
Codecov Report

@@            Coverage Diff             @@
##           master     #418      +/-   ##
==========================================
+ Coverage   86.93%   86.93%   +<.01%
==========================================
  Files         337      337
  Lines       11098    11100       +2
  Branches      362      366       +4
==========================================
+ Hits         9648     9650       +2
  Misses       1450     1450

Continue to review the full report at Codecov.
case VectorAggregationStrategy.LeaveOutVector =>
  val copyFeatureSparse = featureSparse.copy
  aggIndices.map { case (i, oldInd) => copyFeatureSparse.updated(i, oldInd, 0.0) }
Maybe `foreach` instead of `map`? Also, doesn't `copyFeatureSparse` need to be a `var`?
Sure, I will use `foreach`. But I am not sure whether I should make it a `var`. Reassignment of a `val` is not allowed, but the object it refers to can still have its internal state modified, so I guess `val` is fine:

val a = Array(10, 20)
a.update(0, 15)    // works: mutates the array's contents
a = Array(20, 30)  // fails to compile: reassignment to val
val sumLOCOs = locos.reduce((a1, a2) => a1.zip(a2).map { case (l, r) => l + r })
sumLOCOs.map(_ / indices.length)

case VectorAggregationStrategy.LeaveOutVector =>
  indices.map { i => featureArray.update(i, 0.0) }
`foreach` also here.
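That is, presumably something like the sketch below; `Array.update` returns `Unit`, so the values collected by `map` are discarded anyway:

case VectorAggregationStrategy.LeaveOutVector =>
  indices.foreach { i => featureArray.update(i, 0.0) }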
case VectorAggregationStrategy.LeaveOutVector =>
  indices.map { i => featureArray.update(i, 0.0) }
  val newScore = model.transformFn(l.toRealNN, featureArray.toOPVector).score.toSeq
  baseScore.zip(newScore).map { case (b, n) => b - n }
}
Don't you need to revert `featureArray` with the old vals?
No, because it's a copy, so there is no need; check the line `val featureArray = v.copy.toArray`.
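A minimal illustration of the point, assuming Spark ML vectors (`org.apache.spark.ml.linalg`); the values here are made up for the example:

import org.apache.spark.ml.linalg.Vectors

val v = Vectors.dense(1.0, 2.0, 3.0)
val featureArray = v.copy.toArray // independent copy: mutating it cannot affect v
featureArray.update(0, 0.0)
// featureArray is now Array(0.0, 2.0, 3.0) while v still holds [1.0, 2.0, 3.0]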
val featureIndexSet = featuresSparse.indices.toSet

// Besides non-zero values, we want to check the text/date features as well
val zeroValIndices = (textFeatureIndices ++ dateFeatureIndices)
How did you manage to remove the zero val indices logic?
From the `featureSparse` vector we know the active indices count, but to calculate the average LOCO for each date/text field we needed the zero-value indices logic. We used to compute the total count of indices for a date/text feature during each transformation of an individual record/row in the `transformFn` function. There is no need to do this: we can compute the total count of indices per date/text feature just once, using `OpVectorColumnHistory` at the global level (i.e. outside the `transformFn` function). This is what I did in line 139:

private lazy val textFeaturesCount: Map[String, Int] = getFeatureCount(isTextIndex)
private lazy val dateFeaturesCount: Map[String, Int] = getFeatureCount(isDateIndex)
val oldVal = v(i)
featureArray.update(i, 0.0)
val newScore = model.transformFn(l.toRealNN, featureArray.toOPVector).score.toSeq
featureArray.update(i, oldVal)
Then why do you update back, since `featureArray` is a copy?
Here we have to, because `featureArray` is a single copy created before we calculate the LOCO for each element, i.e.:

val featureArray = v.copy.toArray
val locos = indices.map { i =>
  val oldVal = featureArray(i)
  featureArray.update(i, 0.0)
  val loco = calculateLOCO(featureArray, ...)
  featureArray.update(i, oldVal) // revert so the next iteration sees the original value
  loco
}

If we refactor the above code like below, making a fresh copy per index, then we don't need to update back:

val locos = indices.map { i =>
  val featureArray = v.copy.toArray
  featureArray.update(i, 0.0)
  calculateLOCO(featureArray, ...)
}
def setTextAggregationStrategy(strategy: VectorAggregationStrategy): this.type =
  set(textAggregationStrategy, strategy.entryName)

def getTextAggregationStrategy: VectorAggregationStrategy =
  VectorAggregationStrategy.withName($(textAggregationStrategy))
Do we really want/need to expose strategies for each feature type? I imagine there are other types of features we may eventually want to control as either average or leave-out-vector, and putting in individual settings for each seems excessive. Maybe one parameter to control how ALL vector-treated features are handled?
Yeah, I agree. One parameter should be enough.
done
@tovbinm can you please take a look at this PR?
I don't see where we went from O(N*M) to O(N) complexity. Please comment.
private def getFeaturesSize(predicate: OpVectorColumnHistory => Boolean): Map[String, Int] = histories
  .filter(predicate)
  .groupBy { h => getRawFeatureName(h).get }
  .mapValues(_.length).view.toMap
- What's the point of `.view` here? Either add it before `filter` or don't add it at all.
- Add docs.
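For illustration, dropping `.view` entirely might look like the sketch below (not the final code); moving `.view` before `filter` instead would avoid building the intermediate filtered collection, since `.toMap` forces evaluation either way:

private def getFeaturesSize(predicate: OpVectorColumnHistory => Boolean): Map[String, Int] =
  histories
    .filter(predicate)
    .groupBy(h => getRawFeatureName(h).get)
    .mapValues(_.length)
    .toMap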
LGTM overall; please address the previous comments.
@tovbinm @gerashegalov I have addressed your review comments. Thanks.
strategy: VectorAggregationStrategy,
baseScore: Array[Double],
featureSize: Int
): Array[Double] = {
The cost of `computeDiff` is O(n). Note that when `VectorAggregationStrategy = Avg` we call `computeDiff` m times, whereas when `VectorAggregationStrategy = LeaveOutVector` we call `computeDiff` only once. So the former computation is O(n*m) and the latter is O(n), where n is the size of the entire feature vector (i.e. the one containing all the features) and m is the size of the individual text/date feature vector to be aggregated.

@tovbinm ^^
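To make the comparison concrete, here is a rough sketch assembled from the snippets quoted in this thread; the `computeDiff(featureArray, baseScore)` call and its signature are assumed for illustration:

strategy match {
  case VectorAggregationStrategy.Avg =>
    // m score diffs, each O(n): O(n * m) total
    val locos = indices.map { i =>
      val oldVal = featureArray(i)
      featureArray.update(i, 0.0)
      val diff = computeDiff(featureArray, baseScore)
      featureArray.update(i, oldVal) // restore before the next index
      diff
    }
    val sumLOCOs = locos.reduce((a1, a2) => a1.zip(a2).map { case (l, r) => l + r })
    sumLOCOs.map(_ / indices.length)
  case VectorAggregationStrategy.LeaveOutVector =>
    // zero out all of the feature's indices, then a single O(n) score diff
    indices.foreach(i => featureArray.update(i, 0.0))
    computeDiff(featureArray, baseScore)
}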
@tovbinm can you please merge this PR?
@sanmitra build timed out. Restarted.
@sanmitra the build is failing on the master branch due to a LOCO test failure. Please have a look. Thanks.
Related issues
Raw features of types text and date are converted to vectors during feature engineering. To calculate LOCOs for such features, we compute the LOCO for each element in the vector and then average them. So the complexity of calculating LOCO for each date/text feature is O(n*m), where n is the size of the entire feature vector (i.e. the one containing all the features) and m is the size of the individual text/date feature vector.
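As a hypothetical illustration: with n = 10,000 entries in the full feature vector and a text feature hashed into m = 500 of them, averaging requires roughly 500 rescorings per record for that single feature, while leaving out the whole vector requires just one.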
Describe the proposed solution
An alternate way to calculate the LOCO for a date/text feature is to zero out the entire vector of that feature and then calculate the LOCO. This brings the complexity of calculating LOCO for each date/text feature down to O(n). The enum value for this new approach is `LeaveOutVector`, the old approach is `Avg`, and the user can choose between them.
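For reference, a hedged usage sketch; the consolidated setter name `setVectorAggregationStrategy` is assumed here, following the `setTextAggregationStrategy` pattern quoted earlier, before the per-type parameters were merged into one:

// Assumed API shape, not verbatim from the PR.
val insights = new RecordInsightsLOCO(model)
  .setInput(label, features)
  .setVectorAggregationStrategy(VectorAggregationStrategy.LeaveOutVector) // or Avg for the old behavior

val withInsights = insights.transform(scoredData)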