Avoid creating SparseVectors for LOCO #377
Conversation
Wow!
Codecov Report

```
@@            Coverage Diff             @@
##           master     #377      +/-   ##
==========================================
- Coverage   86.83%   86.83%   -0.01%
==========================================
  Files         336      336
  Lines       10955    10957       +2
  Branches      347      577     +230
==========================================
+ Hits         9513     9514       +1
- Misses       1442     1443       +1
```

Continue to review full report at Codecov.
Wait, if we apply
It's still WIP but I have this guard
Neat. What about memory complexity?
Force-pushed 6694415 to 9c6cb21
Force-pushed 9c6cb21 to 66a5e99
Reworked the solution to avoid the memory overhead of the dense vector.
```scala
agggregateDiffs(0, Left(featureSparse), indexToExamine, minMaxHeap, aggregationMap,
  baseScore)
```
So for the sparse features you just put in a value of 0? Can't we just skip adding them to the heap?
I had the same idea, but in one of the iterations I ran into test failures and deferred it to later. I'll recheck now that I have everything green. @michaelweilsalesforce any thoughts?
What kind of failures have you encountered?
It may be that we were doing an unnecessary calculation, and that just happened to be captured in the test...
@michaelweilsalesforce you can reproduce it by commenting out lines 171-172.
```
Aggregate all the derived hashing tf features of rawFeature - text. 0.08025355373244505 was not less than 1.0E-10 expected aggregated LOCO value (0.006978569889777832) should be the same as actual (0.08723212362222289)
Aggregate x_HourOfDay and y_HourOfDay of rawFeature - dateFeature. 0.016493734169231777 was not less than 1.0E-10 expected aggregated LOCO value (0.016493734169231777) should be the same as actual (0.032987468338463555)
```
@leahmcguire @gerashegalov The reason for tracking zero values is that whenever we average the LOCOs of the same raw text feature, we also include the zero values. E.g. if a text feature TextA has 6 non-zero values on a row (loco1, ..., loco6) and 4 zeros, we divide by 10: (loco1 + loco2 + ... + loco6) / 10.
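A tiny numeric sketch of that averaging (illustrative values, not from the PR):

```scala
// TextA has 6 non-zero LOCO values and 4 zero-valued entries on this row;
// the aggregate divides by all 10 entries, not just the 6 non-zero ones.
val nonZeroLocos = Seq(0.12, 0.05, 0.08, 0.02, 0.07, 0.03) // loco1..loco6
val zeroCount = 4
val aggregated = nonZeroLocos.sum / (nonZeroLocos.size + zeroCount) // sum / 10
```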
Let me write a fix that will not go over the zeros.
Thanks Michael!
Force-pushed cb2dc05 to f95f4bf
@gerashegalov Here is a proposal that skips the diffs for zero values. The code could be nicer, though.
Thank you, looks good, just a few polishes.
```diff
@@ -116,34 +114,28 @@ class RecordInsightsLOCO[T <: Model[T]]
     Set(FeatureType.typeName[DateMap], FeatureType.typeName[DateTimeMap])

   // Indices of features derived from Text(Map)Vectorizer
-  private lazy val textFeatureIndices = getIndicesOfFeatureType(textTypes ++ textMapTypes)
+  private lazy val textFeatureIndices: Seq[Int] = getIndicesOfFeatureType(textTypes ++ textMapTypes,
+    h => h.indicatorValue.isEmpty && h.descriptorValue.isEmpty)
```
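For context, a hedged sketch of what that predicate selects (field names are taken from the diff; the surrounding semantics are my reading of the thread, not confirmed code):

```scala
// Sketch only: hashed text columns carry neither an indicatorValue
// (null-indicator / pivot label) nor a descriptorValue (e.g. x_HourOfDay),
// so this predicate keeps just the hashed text dimensions.
val isHashedText: OpVectorColumnHistory => Boolean =
  h => h.indicatorValue.isEmpty && h.descriptorValue.isEmpty
```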
Maybe update the comment to indicate it's only getting hashed text values.
```scala
val name = history.parentFeatureOrigins.headOption.map(_ + groupSuffix)

// If the descriptor value of a derived date feature exists, then it is likely to be
// from the unit circle transformer. We aggregate such features for each (rawFeatureName, timePeriod).
```
This is true now, but may not always be true. If you want this to apply only to date unit circles, you should also check that one of the `parentFeatureStages` is a `DateToUnitCircleTransformer` or `DateToUnitCircleVectorizer`.
This check is not consistent: the unit circle transformation in `DateMapVectorizer` is not reflected in the parent stages (it shows `Seq[DateMapVectorizer]` instead). I think the check on descriptor value is coherent.
Or I can check the `parentType` instead.
If this change is explicitly to deal with date features that are transformed to unit circle, then the check needs to be explicitly for that. Otherwise this also applies to lat/lon values (and anything else that we add later), and if we just check the type of the parent, it assumes that we will always have unit circle transformation of dates, which could change at some point...
I agree, but as I said above, checking the `parentFeatureStages` won't work: for instance, `DateMapVectorizer` may apply the unit circle transformation.
`DateMapVectorizer` computes days between a reference date and the date. The only two stages that do the unit-vector transformation are `DateToUnitCircleTransformer` and `DateToUnitCircleVectorizer`.
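A sketch of what that explicit check might look like (matching on stage-name strings is my assumption; as the next comments show, `parentFeatureStages` can come back empty due to a feature-history bug, so this is illustrative only):

```scala
// Illustrative only: treat a column as a date unit-circle feature when one of
// its parent stage names references either of the two unit-circle stages.
val unitCircleStages = Set("DateToUnitCircleTransformer", "DateToUnitCircleVectorizer")
def isDateUnitCircle(history: OpVectorColumnHistory): Boolean =
  history.parentFeatureStages.exists(stage => unitCircleStages.exists(stage.contains))
```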
Then there must be a bug in the shortcut: with

```scala
println(s"name ${history.columnName} stage ${history.parentFeatureStages} descriptor value ${history.descriptorValue}")
```

I get

```
name dateMapFeature_k0_y_DayOfYear_33 stage ArrayBuffer(vecDateMap_DateMapVectorizer_00000000004c) descriptor value Some(y_DayOfYear)
name dateMapFeature_k1_x_DayOfYear_34 stage ArrayBuffer(vecDateMap_DateMapVectorizer_00000000004c) descriptor value Some(x_DayOfYear)
name dateMapFeature_k1_y_DayOfYear_35 stage ArrayBuffer(vecDateMap_DateMapVectorizer_00000000004c) descriptor value Some(y_DayOfYear)
name dateFeature_x_HourOfDay_0 stage ArrayBuffer() descriptor value Some(x_HourOfDay)
name dateFeature_y_HourOfDay_1 stage ArrayBuffer() descriptor value Some(y_HourOfDay)
```

Those features both use the `.vectorize` shortcut.
blarg! you are right, there is a bug in the feature history that means we lose info if the same feature undergoes multiple transformations :-( https://github.com/salesforce/TransmogrifAI/blob/master/features/src/main/scala/com/salesforce/op/utils/spark/OpVectorMetadata.scala#L53
Can you put a todo to update once the bug is fixed?
```scala
val (i, n) = (indices.head, indices.length)
val zeroCounts = zeroCountByFeature.get(name).getOrElse(0)
val diffToExamine = ar.map(_ / (n + zeroCounts))
minMaxHeap enqueue LOCOValue(i, diffToExamine(indexToExamine), diffToExamine)
```
Wait, so we are aggregating everything into a map, then putting it into a heap, and then just taking it out of the heap? Doesn't that defeat the whole purpose of the heap? Shouldn't we be putting each value into the heap as we calculate it, rather than aggregating the whole thing?
We are only aggregating TF and date features.
Ah ok - can you add a comment to that effect?
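Something like the following comment would capture it (wording is mine, not from the patch):

```scala
// Note: only hashed text (TF) and date-derived features are pre-aggregated
// into aggregationMap; every other feature's diff is enqueued into
// minMaxHeap directly as it is computed.
```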
```scala
// Count zeros by feature name
val zeroCountByFeature = zeroValIndices
  .groupBy(i => getRawFeatureName(histories(i)).get)
  .mapValues(_.length).view.toMap
```
What’s the point of .view here?
To force map materialization on Scala 2.11, where `mapValues` returns a lazy view and `toMap` alone would return it unchanged.
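For readers unfamiliar with the 2.11 quirk, a minimal sketch (my illustration, not from the PR):

```scala
// On Scala 2.11/2.12, mapValues returns a lazy view, and calling .toMap on an
// immutable Map returns the same (still lazy) instance. Going through .view
// makes .toMap iterate and build a strict Map.
val grouped = Map("a" -> Seq(1, 2), "b" -> Seq(3))
val lazyCounts = grouped.mapValues(_.length)               // recomputed on every lookup
val strictCounts = grouped.mapValues(_.length).view.toMap  // computed once, materialized
```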
Shall we merge this one?
Bug fixes:
- Ensure correct metrics despite model failures on some CV folds [#404](#404)
- Fix flaky `ModelInsight` tests [#395](#395)
- Avoid creating `SparseVector`s for LOCO [#377](#377)

New features / updates:
- Model combiner [#385](#399)
- Added new sample for HousingPrices [#365](#365)
- Test to verify that custom metrics appear in model insight metrics [#387](#387)
- Add `FeatureDistribution` to `SerializationFormat`s [#383](#383)
- Add metadata to `OpStandadrdScaler` to allow for descaling [#378](#378)
- Improve json serde error in `evalMetFromJson` [#380](#380)
- Track mean & standard deviation as metrics for numeric features and for text length of text features [#354](#354)
- Making model selectors robust to failing models [#372](#372)
- Use compact and compressed model json by default [#375](#375)
- Descale feature contribution for Linear Regression & Logistic Regression [#345](#345)

Dependency updates:
- Update tika version [#382](#382)
Thanks for the contribution! It looks like @mweilsalesforce is an internal user so signing the CLA is not required. However, we need to confirm this.
Thanks for the contribution! Unfortunately we can't verify the commit author(s): Leah McGuire <l***@s***.com>. One possible solution is to add that email to your GitHub account. Alternatively you can change your commits to another email and force push the change. After getting your commits associated with your GitHub account, refresh the status of this Pull Request.
Related issues
#376
Describe the proposed solution
Reuse the original `SparseVector` as a mutable template.
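A hedged sketch of the "mutable template" idea (illustrative only, not the actual `RecordInsightsLOCO` internals; `score` stands in for the model's scoring function): temporarily zero out one active value in the original vector's backing array, score, and restore it, so no per-feature vector copies are allocated.

```scala
import org.apache.spark.ml.linalg.SparseVector

// Sketch only: the real code tracks full score vectors and aggregation;
// here the per-feature LOCO diff is reduced to a single number.
def locoDiffs(v: SparseVector, score: SparseVector => Double): Array[Double] = {
  val base = score(v)
  val values = v.values // mutate the original backing array in place
  values.indices.map { i =>
    val saved = values(i)
    values(i) = 0.0            // "leave one covariate out"
    val diff = base - score(v) // v still references the mutated array
    values(i) = saved          // restore the template for the next feature
    diff
  }.toArray
}
```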
Additional context
In a scoring job: