Remove duplicate features using sanity checker feature to feature correlations #476
Conversation
Codecov Report
@@ Coverage Diff @@
## master #476 +/- ##
==========================================
+ Coverage 87.00% 87.03% +0.03%
==========================================
Files 345 345
Lines 11643 11671 +28
Branches 376 614 +238
==========================================
+ Hits 10130 10158 +28
Misses 1513 1513
Continue to review full report at Codecov.
Nice addition. Left several minor comments.
@@ -640,11 +671,13 @@ class SanityCheckerTest extends OpEstimatorSpec[OPVector, BinaryModel[RealNN, OP

    val metadata: Metadata = getMetadata(outputColName, transformedData)
    val summary = SanityCheckerSummary.fromMetadata(metadata.getSummaryMetadata())

    println(summary.correlations)
remove
@@ -98,6 +99,7 @@ class BadFeatureZooTest extends FlatSpec with TestSparkContext with Logging {

    // Check that all the PickList features are dropped and they are the only ones dropped
    // (9 choices + "other" + empty = 11 total dropped columns)
    println(retrieved.dropped)
remove
core/src/main/scala/com/salesforce/op/stages/impl/preparators/DerivedFeatureFilterUtils.scala
@@ -361,6 +373,10 @@ private[op] case class ColumnStatistics
      corrLabel.filter(Math.abs(_) > maxCorrelation).map(corr =>
        s"correlation $corr higher than max correlation $maxCorrelation"
      ),
      column.flatMap{ case cl => featureCorrs.take(cl.index).find(Math.abs(_) > maxFeatureCorr).map(corr =>
        s"this feature has correlations $corr with other features higher than max feature-feature" +
Semantics but would: "with another feature higher than" be better?
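The check in the diff above only scans the columns *before* the current one (`featureCorrs.take(cl.index)`), i.e. the lower triangle of the correlation matrix, so exactly one member of each highly correlated pair survives. A hypothetical, simplified sketch of that logic (`reasonToDrop` and its parameters are illustrative names, not the actual API):

```scala
object DedupSketch {
  // Returns a reason to drop this feature, if it correlates too highly with
  // an EARLIER column. Scanning only earlier columns means the first feature
  // of a duplicated pair is always kept.
  def reasonToDrop(
    index: Int,                // this feature's column index in the vector
    featureCorrs: Seq[Double], // this feature's correlations with all features
    maxFeatureCorr: Double
  ): Option[String] = {
    featureCorrs.take(index).find(c => math.abs(c) > maxFeatureCorr).map(corr =>
      s"correlation $corr with another feature is higher than max feature-feature correlation $maxFeatureCorr"
    )
  }
}
```

With two perfectly correlated columns, the second one (index 1) is flagged while the first (index 0) is kept, since it has no earlier columns to compare against.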
core/src/main/scala/com/salesforce/op/stages/impl/preparators/SanityChecker.scala
  )
  def setMaxCorrelation(value: Double): this.type = set(maxCorrelation, value)
  def getMaxCorrelation: Double = $(maxCorrelation)

  final val maxFeatureCorr = new DoubleParam(
    parent = this, name = "maxFeatureCorr",
    doc = "Maximum correlation (absolute value) allowed between a feature two feature vectors which will" +
typo: "a feature two feature vectors"
  corrType: CorrelationType
) extends MetadataLike {
- require(featuresIn.length == values.length, "Feature names and correlation values arrays must have the same length")
+ require(featuresIn.length == valuesWithLabel.length,
to be defensive, maybe add a require
for a square correlation matrix? (or later in the code, as it may not be calculated)
correlation matrix may be empty so not sure it makes sense
    corrType = CorrelationType.withNameInsensitive(wrapped.get[String](SanityCheckerNames.CorrelationType))
  )
  val features = wrapped.getArray[String](SanityCheckerNames.FeaturesIn).toSeq
  if (wrapped.underlyingMap.keySet.contains("values")) { // old sanity checker meta
nice backwards compatibility hack! maybe use "correlationsWithLabel" instead of "values", as it speaks more to the type of change that occurred?
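The compatibility branch keys off which field name is present in the stored metadata. A hypothetical sketch of that dispatch, using a plain `Map` as a stand-in for Spark's `Metadata` wrapper (the key `"correlationsWithLabel"` is the reviewer's suggested name, not necessarily what was merged):

```scala
// Old sanity checker metadata stored label correlations under "values";
// newer metadata would use a more descriptive key. Check the old key first
// so previously persisted models still load.
def readLabelCorrelations(meta: Map[String, Seq[Double]]): Seq[Double] =
  meta.get("values")                           // old sanity checker meta
    .orElse(meta.get("correlationsWithLabel")) // suggested new key
    .getOrElse(Seq.empty)
```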
LGTM
@@ -94,15 +94,24 @@ trait SanityCheckerParams extends DerivedFeatureFilterParams {
  final val maxCorrelation = new DoubleParam(
    parent = this, name = "maxCorrelation",
    doc = "Maximum correlation (absolute value) allowed between a feature in the feature vector and the label",
-   isValid = ParamValidators.inRange(lowerBound = 0.0, upperBound = 1.0, lowerInclusive = true, upperInclusive = true)
+   isValid = ParamValidators.inRange(lowerBound = -0.1, upperBound = 1.1, lowerInclusive = true, upperInclusive = true)
Why did these ranges need to be changed? We shouldn't get correlations outside of [0, 1], right?
  final val minCorrelation = new DoubleParam(
    parent = this, name = "minCorrelation",
    doc = "Minimum correlation (absolute value) allowed between a feature in the feature vector and the label",
-   isValid = ParamValidators.inRange(lowerBound = 0.0, upperBound = 1.0, lowerInclusive = true, upperInclusive = true)
+   isValid = ParamValidators.inRange(lowerBound = -0.1, upperBound = 1.1, lowerInclusive = true, upperInclusive = true)
Same here - why are these changed?
      stats.filter(s => s.corrLabel.isDefined && !s.corrLabel.get.isNaN).map(s => s.name -> s.corrLabel.get),
      stats.filter(s => s.corrLabel.isDefined && s.corrLabel.get.isNaN).map(_.name),
      correlations = new Correlations(
        stats.filter(s => s.corrLabel.isDefined).map(s => (s.name, s.corrLabel.get, s.featureCorrs)),
Do we not have to check for NaNs anymore? What if the feature is constant and we turn off the min variance threshold?
So I fixed it: when the values go into metadata they are converted to strings, and when they are read back into the class they are parsed back into doubles. Now NaNs work! The JSON serializer would not have helped, because the problem was in the Spark metadata itself.
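The string round-trip fix described above can be sketched as follows (an assumed shape of the workaround, not the actual implementation): doubles are written to metadata as strings and parsed back on read, and since `"NaN".toDouble` yields `Double.NaN`, NaN values survive serialization.

```scala
// Correlations may legitimately contain NaN (e.g. a constant feature when the
// min variance check is off). Writing them as strings preserves them.
val corrs: Seq[Double] = Seq(0.5, Double.NaN, -0.3)
val serialized: Seq[String] = corrs.map(_.toString)    // what goes into metadata
val restored: Seq[Double] = serialized.map(_.toDouble) // what comes back out
```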
 * @param values correlation of feature with label
 * @param nanCorrs nan correlation features
 * @param valuesWithLabel correlation of feature with label
 * @param valuesWithFeatures correlations between features features
"correlations between features features"
@@ -88,6 +88,7 @@ class BadFeatureZooTest extends FlatSpec with TestSparkContext with Logging {
  val genFeatureVector = Seq(rawCity, rawCountry, rawPickList, rawCurrency).transmogrify()
  val checkedFeatures = new SanityChecker()
    .setCheckSample(1.0)
    .setMaxFeatureCorr(1.1)
Oh, I see - you're using values > 1.0 to turn it off. I think it may be clearer if we keep the bounds in [0, 1.0], but make it a < check instead of a <= check (so that setting it to 1.0 would turn it off).
So I did that! Turns out that with doubles you can get a correlation > 1.0 for perfectly correlated things :-P
E.g. in a couple of tests I had it set to 1.0 with a strict > check, and it failed because the computed correlation was 1.00000000004.
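This is why the validity bounds were widened beyond [0, 1]: floating-point arithmetic can push the computed correlation of perfectly correlated columns slightly past 1.0, so a hard bound at exactly 1.0 can reject legitimate values. A minimal Pearson correlation, for illustration only (not the repo's implementation, which uses Spark's correlation):

```scala
// Plain Pearson correlation over two equal-length columns. Intermediate sums
// accumulate rounding error, so for exact duplicates the result may land a
// hair above or below 1.0 rather than exactly on it.
def pearson(x: Seq[Double], y: Seq[Double]): Double = {
  val n = x.length
  val mx = x.sum / n
  val my = y.sum / n
  val cov = x.zip(y).map { case (a, b) => (a - mx) * (b - my) }.sum
  val sx = math.sqrt(x.map(a => (a - mx) * (a - mx)).sum)
  val sy = math.sqrt(y.map(b => (b - my) * (b - my)).sum)
  cov / (sx * sy)
}
```

For duplicated features the result should be ±1.0 up to rounding, which is why a threshold like 1.1 (or an inclusive check at 1.0) is needed to turn the filter off reliably.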
  )
  def setMaxCorrelation(value: Double): this.type = set(maxCorrelation, value)
  def getMaxCorrelation: Double = $(maxCorrelation)

  final val maxFeatureCorr = new DoubleParam(
better spell out Correlation in a public parameter
Thanks for the contribution! Unfortunately we can't verify the commit author(s): leahmcguire <l***@s***.com> Leah McGuire <l***@s***.com>. One possible solution is to add that email to your GitHub account. Alternatively you can change your commits to another email and force push the change. After getting your commits associated with your GitHub account, refresh the status of this Pull Request.
Describe the proposed solution
When features are exact duplicates of each other, we would like to remove one of them before modeling.