Option to calculate LOCO for dates/texts by Leaving Out Entire Vector. #418

Merged: 21 commits into master, Nov 5, 2019

Conversation

sanmitra
Contributor

Related issues
Raw features of type text and date are converted to vectors during feature engineering. To calculate LOCO for such features, we calculate LOCO for each element in the vector and then average the results. The complexity of calculating LOCO for each date/text feature is therefore O(n*m),
where n is the size of the entire feature vector (i.e. containing all the features) and m is the size of the individual text/date feature vector.

Describe the proposed solution
An alternative way to calculate LOCO for a date/text feature is to zero out the entire vector of that feature and then calculate LOCO once, which brings the complexity per date/text feature down to O(n). The enum value for this new approach is LeaveOutVector, the old approach is Avg, and the user can choose between them.
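
As a rough illustration of the two strategies, here is a minimal, self-contained Scala sketch; the score function, vector layout, and helper names are illustrative assumptions, not the actual TransmogrifAI implementation:

object LocoAggregationSketch {

  sealed trait VectorAggregationStrategy
  object VectorAggregationStrategy {
    case object Avg extends VectorAggregationStrategy
    case object LeaveOutVector extends VectorAggregationStrategy
  }

  // Stand-in for a model scoring call: O(n) in the full vector size.
  def score(features: Array[Double]): Double = features.sum

  // LOCO for one text/date feature whose elements live at `indices`
  // within the full feature vector.
  def loco(features: Array[Double], indices: Seq[Int],
           strategy: VectorAggregationStrategy): Double = {
    val baseScore = score(features)
    strategy match {
      case VectorAggregationStrategy.Avg =>
        // Old approach: zero out one element at a time, score m times,
        // then average the diffs -> O(n * m) per feature.
        val diffs = indices.map { i =>
          val copy = features.clone()
          copy(i) = 0.0
          baseScore - score(copy)
        }
        diffs.sum / indices.length
      case VectorAggregationStrategy.LeaveOutVector =>
        // New approach: zero out the whole feature's slice, score once -> O(n).
        val copy = features.clone()
        indices.foreach(i => copy(i) = 0.0)
        baseScore - score(copy)
    }
  }

  def main(args: Array[String]): Unit = {
    val v = Array(1.0, 2.0, 3.0, 4.0)
    val textIndices = Seq(1, 2) // elements produced by one text feature
    println(loco(v, textIndices, VectorAggregationStrategy.Avg))            // 2.5
    println(loco(v, textIndices, VectorAggregationStrategy.LeaveOutVector)) // 5.0
  }
}

Note that the two strategies are not numerically equivalent: LeaveOutVector measures the joint effect of removing the whole feature, while Avg measures the mean per-element effect.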

@codecov

codecov bot commented Oct 11, 2019

Codecov Report

Merging #418 into master will increase coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #418      +/-   ##
==========================================
+ Coverage   86.93%   86.93%   +<.01%     
==========================================
  Files         337      337              
  Lines       11098    11100       +2     
  Branches      362      366       +4     
==========================================
+ Hits         9648     9650       +2     
  Misses       1450     1450
Impacted Files | Coverage Δ
...e/op/stages/impl/insights/RecordInsightsLOCO.scala | 96.84% <100%> (+0.06%) ⬆️

Continue to review full report at Codecov.

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c69c356...9776e79. Read the comment docs.


case VectorAggregationStrategy.LeaveOutVector =>
  val copyFeatureSparse = featureSparse.copy
  aggIndices.map { case (i, oldInd) => copyFeatureSparse.updated(i, oldInd, 0.0) }
Contributor

Maybe foreach instead of map?
Also doesn't copyFeatureSparse need to be a var?

Contributor Author

Sure, I will use foreach.
But I am not sure whether I should make it a var. Reassignment of a val is not allowed, but the object it refers to can still have its internal state modified, so I think val is fine:

val a = Array(10, 20)
a.update(0, 15) // works
a = Array(20, 30) // fails: reassignment to a val

    val sumLOCOs = locos.reduce((a1, a2) => a1.zip(a2).map { case (l, r) => l + r })
    sumLOCOs.map(_ / indices.length)
  case VectorAggregationStrategy.LeaveOutVector =>
    indices.map { i => featureArray.update(i, 0.0) }
Contributor

Use foreach here as well.

  case VectorAggregationStrategy.LeaveOutVector =>
    indices.map { i => featureArray.update(i, 0.0) }
    val newScore = model.transformFn(l.toRealNN, featureArray.toOPVector).score.toSeq
    baseScore.zip(newScore).map { case (b, n) => b - n }
}
Contributor

Don't you need to revert featureArray to the old values?

Contributor Author
@sanmitra sanmitra Oct 13, 2019

No, there's no need, because it's a copy; see the line val featureArray = v.copy.toArray
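
A tiny sketch of the copy semantics being relied on here (illustrative values, not the PR's code):

val v = Array(1.0, 2.0, 3.0)
val featureArray = v.clone()  // independent copy of the underlying data
featureArray(0) = 0.0         // mutates only the copy
assert(v(0) == 1.0)           // the original vector is untouched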

val featureIndexSet = featuresSparse.indices.toSet

// Besides non-zero values, we want to check the text/date features as well
val zeroValIndices = (textFeatureIndices ++ dateFeatureIndices)
Contributor

How did you manage to remove the zero val indices logic?

Contributor Author

From the featureSparse vector we know the count of active indices, but to calculate the average LOCO for each date/text field we also needed the zero-value indices logic. We used to calculate the total count of indices for a date/text feature during each transformation of an individual record/row, inside the transformFn function. There is no need to do this: we can calculate the total count of indices per date/text feature just once, using OpVectorColumnHistory at the global level (i.e. outside the transformFn function). This is what I did here:

private lazy val textFeaturesCount: Map[String, Int] = getFeatureCount(isTextIndex)
private lazy val dateFeaturesCount: Map[String, Int] = getFeatureCount(isDateIndex) 

val oldVal = v(i)
featureArray.update(i, 0.0)    // zero out one element
val newScore = model.transformFn(l.toRealNN, featureArray.toOPVector).score.toSeq
featureArray.update(i, oldVal) // restore it for the next iteration
Contributor

Then why do you update back, since featureArray is a copy?

Contributor Author

Here we have to, because featureArray is a single copy created once, before we calculate LOCO for each element, i.e.

val featureArray = v.copy.toArray
val locos = indices.map { i =>
  val oldVal = featureArray(i)
  featureArray.update(i, 0.0)
  val loco = calculateLOCO(featureArray, ...)
  featureArray.update(i, oldVal)
  loco
}

If we refactor the above code as below, then we don't need to update back:

val locos = indices.map { i =>
  val featureArray = v.copy.toArray
  featureArray.update(i, 0.0)
  calculateLOCO(featureArray,...)
}
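
The trade-off: the mutate-and-restore version reuses a single buffer and avoids allocating a fresh copy of the full feature vector on every iteration, whereas the copy-per-index version is simpler to reason about but pays one full-vector allocation per index.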

def setTextAggregationStrategy(strategy: VectorAggregationStrategy): this.type =
  set(textAggregationStrategy, strategy.entryName)
def getTextAggregationStrategy: VectorAggregationStrategy =
  VectorAggregationStrategy.withName($(textAggregationStrategy))
Collaborator

Do we really want/need to expose strategies for each feature type? I imagine there are other types of features we may eventually want to control as either average or leave-out-vector, and putting in individual settings for each seems excessive. Maybe one parameter to control how ALL vector-treated features are handled?

Contributor Author

Yeah, I agree. One parameter should be enough.

Contributor Author

done
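
For illustration, a single unified parameter could look roughly like this (a sketch only; the names here are assumptions, not necessarily the merged API):

// Hypothetical: one strategy parameter governing all vector-treated features.
def setVectorAggregationStrategy(strategy: VectorAggregationStrategy): this.type =
  set(vectorAggregationStrategy, strategy.entryName)
def getVectorAggregationStrategy: VectorAggregationStrategy =
  VectorAggregationStrategy.withName($(vectorAggregationStrategy))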

@sanmitra
Contributor Author

@tovbinm can you please take a look at this PR ?

Collaborator
@tovbinm tovbinm left a comment

I don't see where we went from N*M to N complexity. Please comment.

private def getFeaturesSize(predicate: OpVectorColumnHistory => Boolean): Map[String, Int] =
  histories
    .filter(predicate)
    .groupBy(h => getRawFeatureName(h).get)
    .mapValues(_.length).view.toMap
Collaborator

  1. What's the point of .view here? Either add it before filter or don't add it at all.
  2. Add docs.

Contributor
@gerashegalov gerashegalov left a comment

LGTM overall; please address the previous comments.

@sanmitra
Contributor Author

sanmitra commented Nov 4, 2019

@tovbinm @gerashegalov I have addressed your review comments. Thanks

strategy: VectorAggregationStrategy,
baseScore: Array[Double],
featureSize: Int
): Array[Double] = {
Contributor Author
@sanmitra sanmitra Nov 4, 2019

The cost of computeDiff is O(n). Note that when VectorAggregationStrategy = Avg we call computeDiff m times, while with VectorAggregationStrategy = LeaveOutVector we call computeDiff only once. So the former computation is O(n*m) and the latter is O(n),

where n is the size of the entire feature vector (i.e. containing all the features) and m is the size of the individual text/date feature vector being aggregated.

@tovbinm ^^
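
For concreteness (hypothetical numbers): with a full feature vector of n = 10,000 entries and a text feature hashed into m = 512 of them, Avg invokes computeDiff 512 times for that single feature (on the order of 5 million element operations), while LeaveOutVector invokes it once (on the order of 10,000).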

@sanmitra
Contributor Author

sanmitra commented Nov 5, 2019

@tovbinm can you please merge this PR ?

@tovbinm
Collaborator

tovbinm commented Nov 5, 2019

@sanmitra build timed out. restarted.

@sanmitra sanmitra merged commit 8ec6234 into master Nov 5, 2019
@sanmitra sanmitra deleted the san/loco-aggregate branch November 5, 2019 00:28
@tovbinm
Collaborator

tovbinm commented Nov 12, 2019

@sanmitra the build is failing on the master branch due to a LOCO test failure. Please have a look, thanks.

@nicodv nicodv mentioned this pull request Jun 11, 2020