Avoid creating SparseVectors for LOCO #377
Conversation
Wow!
Codecov Report

```
@@            Coverage Diff             @@
##           master     #377      +/-   ##
==========================================
- Coverage   86.83%   86.83%   -0.01%
==========================================
  Files         336      336
  Lines       10955    10957       +2
  Branches      347      577     +230
==========================================
+ Hits         9513     9514       +1
- Misses       1442     1443       +1
```

Continue to review full report at Codecov.
Wait, if we apply
It's still WIP but I have this guard
Neat. What about memory complexity?
Force-pushed 6694415 to 9c6cb21
Force-pushed 9c6cb21 to 66a5e99
Reworked the solution to avoid the memory overhead of the dense vector.
```scala
agggregateDiffs(0, Left(featureSparse), indexToExamine, minMaxHeap, aggregationMap,
  baseScore)
```
So for the sparse features you just put in a value of 0? Can't we just skip adding them to the heap?
I had the same idea, but in one of the iterations I ran into test failures and deferred it to later. I'll recheck now that I have everything green. @michaelweilsalesforce any thoughts?
What kind of failures have you encountered?
It may be that we were doing an unnecessary calculation, and that just happened to be captured in the test...
@michaelweilsalesforce you can reproduce it by commenting out lines 171-172.
```
Aggregate all the derived hashing tf features of rawFeature - text. 0.08025355373244505 was not less than 1.0E-10 expected aggregated LOCO value (0.006978569889777832) should be the same as actual (0.08723212362222289)
Aggregate x_HourOfDay and y_HourOfDay of rawFeature - dateFeature. 0.016493734169231777 was not less than 1.0E-10 expected aggregated LOCO value (0.016493734169231777) should be the same as actual (0.032987468338463555)
```
@leahmcguire @gerashegalov The reason for tracking zero values is that whenever we average the LOCOs of the same raw text feature, we also include the zero values. E.g. if a text feature TextA has 6 non-zero values on a row (loco1, ..., loco6) and 4 zeros, we divide by 10: (loco1 + loco2 + ... + loco6) / 10.
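A tiny numeric sketch of that averaging (illustrative values, not from the PR):

```scala
// TextA has 6 non-zero LOCO values and 4 zero-valued entries on this row;
// the aggregate divides by all 10 entries, not just the 6 non-zero ones.
val nonZeroLocos = Seq(0.12, 0.05, 0.08, 0.02, 0.07, 0.03) // loco1..loco6
val zeroCount = 4
val aggregated = nonZeroLocos.sum / (nonZeroLocos.size + zeroCount) // sum / 10
```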
Let me write a fix that will not go over the zeros.
Thanks Michael!
Force-pushed cb2dc05 to f95f4bf
@gerashegalov Here is a proposal that skips the diffs for zero values. The code could be nicer, though.
Thank you, looks good, just a few polishes.
```diff
@@ -116,34 +114,28 @@ class RecordInsightsLOCO[T <: Model[T]]
     Set(FeatureType.typeName[DateMap], FeatureType.typeName[DateTimeMap])

   // Indices of features derived from Text(Map)Vectorizer
-  private lazy val textFeatureIndices = getIndicesOfFeatureType(textTypes ++ textMapTypes)
+  private lazy val textFeatureIndices: Seq[Int] = getIndicesOfFeatureType(textTypes ++ textMapTypes,
+    h => h.indicatorValue.isEmpty && h.descriptorValue.isEmpty)
```
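For context, a hedged sketch of what that predicate selects (field names are taken from the diff; the surrounding semantics are my reading of the thread, not confirmed code):

```scala
// Sketch only: hashed text columns carry neither an indicatorValue
// (null-indicator / pivot label) nor a descriptorValue (e.g. x_HourOfDay),
// so this predicate keeps just the hashed text dimensions.
val isHashedText: OpVectorColumnHistory => Boolean =
  h => h.indicatorValue.isEmpty && h.descriptorValue.isEmpty
```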
Maybe update the comment to indicate it's only getting hashed text values.
```scala
val name = history.parentFeatureOrigins.headOption.map(_ + groupSuffix)

// If the descriptor value of a derived date feature exists, then it is likely to be
// from the unit circle transformer. We aggregate such features for each (rawFeatureName, timePeriod).
```
This is true now, but may not always be true. If you want this to apply only to date unit circles, you should also check that one of the `parentFeatureStages` is a `DateToUnitCircleTransformer` or `DateToUnitCircleVectorizer`.
This check is not consistent: the unit circle transformation in `DateMapVectorizer` is not reflected in the parent stages (it shows `Seq[DateMapVectorizer]` instead). I think the check on descriptor value is coherent.
Or I can check the `parentType` instead.
If this change is explicitly to deal with date features that are transformed to unit circle, then the check needs to be explicitly for that. Otherwise this also applies to lat/lon values (and anything else that we add later), and if we just check the type of the parent, it assumes that we will always have unit circle transformation of dates, which could change at some point...
I agree, but as I said above, checking the `parentFeatureStages` won't work: for instance, `DateMapVectorizer` may apply the unit circle transformation.
`DateMapVectorizer` computes days between a reference date and the date. The only two stages that do the unit-vector transformation are `DateToUnitCircleTransformer` and `DateToUnitCircleVectorizer`.
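A sketch of what that explicit check might look like (matching on stage-name strings is my assumption; as the next comments show, `parentFeatureStages` can come back empty due to a feature-history bug, so this is illustrative only):

```scala
// Illustrative only: treat a column as a date unit-circle feature when one of
// its parent stage names references either of the two unit-circle stages.
val unitCircleStages = Set("DateToUnitCircleTransformer", "DateToUnitCircleVectorizer")
def isDateUnitCircle(history: OpVectorColumnHistory): Boolean =
  history.parentFeatureStages.exists(stage => unitCircleStages.exists(stage.contains))
```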
Then there must be a bug in the shortcut: with

```scala
println(s"name ${history.columnName} stage ${history.parentFeatureStages} descriptor value ${history.descriptorValue}")
```

I get

```
name dateMapFeature_k0_y_DayOfYear_33 stage ArrayBuffer(vecDateMap_DateMapVectorizer_00000000004c) descriptor value Some(y_DayOfYear)
name dateMapFeature_k1_x_DayOfYear_34 stage ArrayBuffer(vecDateMap_DateMapVectorizer_00000000004c) descriptor value Some(x_DayOfYear)
name dateMapFeature_k1_y_DayOfYear_35 stage ArrayBuffer(vecDateMap_DateMapVectorizer_00000000004c) descriptor value Some(y_DayOfYear)
name dateFeature_x_HourOfDay_0 stage ArrayBuffer() descriptor value Some(x_HourOfDay)
name dateFeature_y_HourOfDay_1 stage ArrayBuffer() descriptor value Some(y_HourOfDay)
```

Those features both use the `.vectorize` shortcut.
blarg! you are right, there is a bug in the feature history that means we lose info if the same feature undergoes multiple transformations :-( https://github.com/salesforce/TransmogrifAI/blob/master/features/src/main/scala/com/salesforce/op/utils/spark/OpVectorMetadata.scala#L53
Can you put a todo to update once the bug is fixed?
```scala
val (i, n) = (indices.head, indices.length)
val zeroCounts = zeroCountByFeature.get(name).getOrElse(0)
val diffToExamine = ar.map(_ / (n + zeroCounts))
minMaxHeap enqueue LOCOValue(i, diffToExamine(indexToExamine), diffToExamine)
```
Wait, so we are aggregating everything into a map, then putting it into a heap, and then just taking it out of the heap? Doesn't that defeat the whole purpose of the heap? Shouldn't we be putting each value into the heap as we calculate it, rather than aggregating the whole thing?
We are only aggregating TF and date features.
Ah ok - can you add a comment to that effect?
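Something like the following comment would capture it (wording is mine, not from the patch):

```scala
// Note: only hashed text (TF) and date-derived features are pre-aggregated
// into aggregationMap; every other feature's diff is enqueued into
// minMaxHeap directly as it is computed.
```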
```scala
// Count zeros by feature name
val zeroCountByFeature = zeroValIndices
  .groupBy(i => getRawFeatureName(histories(i)).get)
  .mapValues(_.length).view.toMap
```
What’s the point of .view here?
To force map materialization on Scala 2.11, where `mapValues` returns a lazy view and `toMap` alone would return it unchanged.
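For readers unfamiliar with the 2.11 quirk, a minimal sketch (my illustration, not from the PR):

```scala
// On Scala 2.11/2.12, mapValues returns a lazy view, and calling .toMap on an
// immutable Map returns the same (still lazy) instance. Going through .view
// makes .toMap iterate and build a strict Map.
val grouped = Map("a" -> Seq(1, 2), "b" -> Seq(3))
val lazyCounts = grouped.mapValues(_.length)               // recomputed on every lookup
val strictCounts = grouped.mapValues(_.length).view.toMap  // computed once, materialized
```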
Shall we merge this one?
Bug fixes:
- Ensure correct metrics despite model failures on some CV folds [#404](#404)
- Fix flaky `ModelInsight` tests [#395](#395)
- Avoid creating `SparseVector`s for LOCO [#377](#377)

New features / updates:
- Model combiner [#385](#399)
- Added new sample for HousingPrices [#365](#365)
- Test to verify that custom metrics appear in model insight metrics [#387](#387)
- Add `FeatureDistribution` to `SerializationFormat`s [#383](#383)
- Add metadata to `OpStandadrdScaler` to allow for descaling [#378](#378)
- Improve json serde error in `evalMetFromJson` [#380](#380)
- Track mean & standard deviation as metrics for numeric features and for text length of text features [#354](#354)
- Making model selectors robust to failing models [#372](#372)
- Use compact and compressed model json by default [#375](#375)
- Descale feature contribution for Linear Regression & Logistic Regression [#345](#345)

Dependency updates:
- Update tika version [#382](#382)
Thanks for the contribution! It looks like @mweilsalesforce is an internal user so signing the CLA is not required. However, we need to confirm this.
Thanks for the contribution! Unfortunately we can't verify the commit author(s): Leah McGuire <l***@s***.com>. One possible solution is to add that email to your GitHub account. Alternatively you can change your commits to another email and force push the change. After getting your commits associated with your GitHub account, refresh the status of this Pull Request.
Related issues
#376
Describe the proposed solution
Reuse the original `SparseVector` as a mutable template.
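A hedged sketch of the "mutable template" idea (illustrative only, not the actual `RecordInsightsLOCO` internals; `score` stands in for the model's scoring function): temporarily zero out one active value in the original vector's backing array, score, and restore it, so no per-feature vector copies are allocated.

```scala
import org.apache.spark.ml.linalg.SparseVector

// Sketch only: the real code tracks full score vectors and aggregation;
// here the per-feature LOCO diff is reduced to a single number.
def locoDiffs(v: SparseVector, score: SparseVector => Double): Array[Double] = {
  val base = score(v)
  val values = v.values // mutate the original backing array in place
  values.indices.map { i =>
    val saved = values(i)
    values(i) = 0.0            // "leave one covariate out"
    val diff = base - score(v) // v still references the mutated array
    values(i) = saved          // restore the template for the next feature
    diff
  }.toArray
}
```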
Additional context
In a scoring job: