XGBoost classification & regression models + Spark 2.3.2 #44

Merged 78 commits on Oct 15, 2018

@tovbinm tovbinm commented Aug 8, 2018

Describe the proposed solution
Add XGBoost classification and regression models. This should eventually let us train models that perform on par with or better than Random Forest and, more importantly, use wider sparse feature vectors.

Describe alternatives you've considered
Fix Spark Random Forest implementation.

Additional context
This change adds the xgboost4j-spark dependency and upgrades Spark to 2.3.2.
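
For readers who want to try the same combination, the dependency change could look roughly like the sbt fragment below. This is only a sketch: the project's actual build may use a different tool, and the Maven coordinates shown (`ml.dmlc` / `xgboost4j-spark`) and the `Provided` scope are assumptions to verify against the artifact repository. The versions are the ones mentioned in this PR.

```scala
// build.sbt fragment (hypothetical; coordinates and scopes are assumptions)
libraryDependencies ++= Seq(
  // Spark version this PR upgrades to
  "org.apache.spark" %% "spark-mllib" % "2.3.2" % Provided,
  // XGBoost Spark integration, release mentioned later in this conversation
  "ml.dmlc" % "xgboost4j-spark" % "0.80"
)
```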

TODO

  • compare model quality and runtime performance against RF models
  • if it performs well, add it to our model selectors

/**
* Copied from [[ml.dmlc.xgboost4j.scala.spark.XGBoost.removeMissingValues]] private method
*/
def removeMissingValues(xgbLabelPoints: Iterator[LabeledPoint], missing: Float): Iterator[LabeledPoint] = {
Contributor

reflection trick doesn't work for an object?

Collaborator Author

The method is not present in the snapshot version I use. We need to switch to the latest 0.8 version once it is released.

Collaborator Author

done! it's present in 0.80 release.
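
For context on the "reflection trick" asked about in this thread: a private method on a Scala `object` can usually be reached through Java reflection on the object's runtime class. The sketch below is self-contained and uses a hypothetical `Hidden` object as a stand-in for a library object; it is not the XGBoost code. Note that the lookup throws `NoSuchMethodException` when the method does not exist in the version on the classpath, which matches the reason given above for copying the method instead.

```scala
// Hypothetical stand-in for a library object with a private helper.
object Hidden {
  private def scale(x: Int, factor: Float): Float = x * factor
}

object ReflectionSketch {
  /** Calls the private `Hidden.scale` via Java reflection. */
  def callPrivate(x: Int, factor: Float): Float = {
    // `Hidden.getClass` is the module class; the object itself is the receiver.
    val method = Hidden.getClass.getDeclaredMethod("scale", classOf[Int], classOf[Float])
    method.setAccessible(true) // bypass the `private` modifier
    method.invoke(Hidden, Int.box(x), Float.box(factor)).asInstanceOf[Float]
  }

  def main(args: Array[String]): Unit =
    println(callPrivate(3, 2.0f))
}
```

Reflection like this is brittle across library versions, so relying on a released method (as done here with 0.80) is the more stable choice.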

@@ -63,7 +63,7 @@ class ModelInsightsTest extends FlatSpec with PassengerSparkFixtureTest {
   implicit val doubleOptEquality = new Equality[Option[Double]] {
     def areEqual(a: Option[Double], b: Any): Boolean = b match {
       case None => a.isEmpty
-      case s: Option[Double] => (a.exists(_.isNaN) && s.exists(_.isNaN)) ||
+      case s: Option[Double]@unchecked => (a.exists(_.isNaN) && s.exists(_.isNaN)) ||
         (a.nonEmpty && a.toSeq.zip(s.toSeq).forall{ case (n, m) => n == m })
Contributor

(a.exists(_.isNaN) && s.exists(_.isNaN)) || (a == s)
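
The suggested simplification can be sketched without the test framework as a plain function. This is an illustration only: `optDoubleEquals` and the enclosing object are hypothetical names, and the extra `case _` branch is an addition for non-`Option` inputs. The `@unchecked` annotation is there because the `Double` type argument is erased at runtime, so the pattern can only check that the value is an `Option`.

```scala
object NaNAwareEquality {
  /** Compares optional doubles, treating NaN as equal to NaN. */
  def optDoubleEquals(a: Option[Double], b: Any): Boolean = b match {
    case None => a.isEmpty
    // The element type is erased; @unchecked silences the compiler warning.
    // Caveat: a non-Double payload in `b` can fail at runtime when inspected.
    case s: Option[Double] @unchecked =>
      (a.exists(_.isNaN) && s.exists(_.isNaN)) || (a == s)
    case _ => false
  }
}
```

The explicit NaN check is needed because NaN never compares equal to itself under ordinary floating point comparison.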

@tovbinm tovbinm changed the title from "XGBoost classification & regression models + Spark 2.3.1" to "XGBoost classification & regression models + Spark 2.3.2" on Sep 28, 2018
codecov bot commented Sep 28, 2018

Codecov Report

Merging #44 into master will decrease coverage by 0.67%.
The diff coverage is 47.39%.

@@            Coverage Diff             @@
##           master      #44      +/-   ##
==========================================
- Coverage    86.4%   85.72%   -0.68%     
==========================================
  Files         299      302       +3     
  Lines        9750     9881     +131     
  Branches      354      540     +186     
==========================================
+ Hits         8424     8470      +46     
- Misses       1326     1411      +85
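
The percentages in the table above follow from hits divided by lines. A small sketch that recomputes them from the reported counts (the object and method names are hypothetical):

```scala
object CoverageCheck {
  /** Line coverage in percent: hits divided by total lines. */
  def coverage(hits: Int, lines: Int): Double = 100.0 * hits / lines

  // Counts taken from the coverage diff above.
  val master: Double = coverage(8424, 9750)
  val withPr: Double = coverage(8470, 9881)
  val delta: Double = withPr - master

  def main(args: Array[String]): Unit =
    println(f"master=$master%.2f%% pr=$withPr%.2f%% delta=$delta%.2f")
}
```

The unrounded change is slightly under 0.68 percentage points, which may explain why the summary line and the table differ in the last digit.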
Impacted Files | Coverage Δ
...ges/impl/classification/OpLogisticRegression.scala | 57.14% <ø> (ø) ⬆️
...m/salesforce/op/aggregators/ExtendedMultiset.scala | 75% <0%> (-25%) ⬇️
...ce/op/stages/impl/classification/OpLinearSVC.scala | 77.27% <100%> (ø) ⬆️
...ssification/OpMultilayerPerceptronClassifier.scala | 69.23% <100%> (+5.59%) ⬆️
.../scala/com/salesforce/op/features/types/Maps.scala | 92.68% <100%> (+0.27%) ⬆️
...lesforce/op/utils/reflection/ReflectionUtils.scala | 97.36% <100%> (+0.14%) ⬆️
...om/salesforce/op/utils/spark/OpSparkListener.scala | 97.4% <100%> (-1.3%) ⬇️
...s/sparkwrappers/specific/SparkModelConverter.scala | 94.11% <100%> (+0.78%) ⬆️
...a/com/salesforce/op/filters/RawFeatureFilter.scala | 88.99% <100%> (+0.2%) ⬆️
...op/evaluators/OpMultiClassificationEvaluator.scala | 94.73% <100%> (+0.07%) ⬆️
... and 15 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update abdfff0...5a98621.

Collaborator Author

tovbinm commented Oct 11, 2018

@leahmcguire if there are no objections - let's get this merged.

CheckIsResponseValues(in1, in2)
}

def setWeightCol(value: String): this.type = set(weightCol, value)
Collaborator

please add comments for these params

Collaborator Author

done as well
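
As an illustration of what documented setters in this style can look like, here is a dependency-free sketch. The classes below are hypothetical stand-ins and do not use Spark's `Params` API; `setNumRound` is an invented second setter added only to show chaining. The point is the `this.type` return type, which keeps the concrete subclass type so that setters defined on a parent and a child can be chained.

```scala
// Hypothetical stand-ins; not Spark's Params API.
class BaseEstimator {
  protected var weightColName: Option[String] = None

  /** Sets the name of the column holding per-row instance weights. */
  def setWeightCol(value: String): this.type = { weightColName = Some(value); this }

  def getWeightCol: Option[String] = weightColName
}

class BoostedEstimator extends BaseEstimator {
  private var rounds: Int = 10

  /** Sets the number of boosting rounds. */
  def setNumRound(value: Int): this.type = { rounds = value; this }

  def getNumRound: Int = rounds
}

object ChainingSketch {
  // setWeightCol returns the BoostedEstimator itself, so setNumRound is still available.
  def build(): BoostedEstimator =
    new BoostedEstimator().setWeightCol("w").setNumRound(50)
}
```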

CheckIsResponseValues(in1, in2)
}

def setWeightCol(value: String): this.type = set(weightCol, value)
Collaborator

please put comments on these settings

Collaborator Author

done.

@@ -199,6 +199,7 @@ object RegressionModelsToTry extends Enum[RegressionModelsToTry] {
case object OpRandomForestRegressor extends RegressionModelsToTry
case object OpGBTRegressor extends RegressionModelsToTry
case object OpGeneralizedLinearRegression extends RegressionModelsToTry
case object OpXGBoostRegressor extends RegressionModelsToTry
Collaborator

please also define some default grid settings for this to run in regression

Collaborator

also for the other model selectors

Collaborator Author

I propose adding xgb to the model selectors in a separate PR.

Collaborator

ok, if that is the plan then let's hold off on adding it to the enum as well, particularly since you only added it to regression and not multiclass and binary

Collaborator Author

sounds good. removed.
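
The concern in this thread, that an enum entry without default grid settings leaves the selector with nothing to run, can be illustrated with a small self-contained sketch. Everything here is hypothetical: it uses a plain sealed trait rather than the project's `Enum` type, and the parameter names and grid values are invented for illustration.

```scala
// Hypothetical model enum with a default grid per entry.
sealed trait RegressionModel

object RegressionModel {
  case object RandomForest extends RegressionModel
  case object GBT extends RegressionModel
  case object GeneralizedLinear extends RegressionModel

  /** Default grid per model: parameter name -> candidate values (invented values). */
  def defaultGrid(model: RegressionModel): Map[String, Seq[Double]] = model match {
    case RandomForest      => Map("maxDepth" -> Seq(3.0, 6.0, 12.0))
    case GBT               => Map("maxDepth" -> Seq(3.0, 6.0))
    case GeneralizedLinear => Map("regParam" -> Seq(0.01, 0.1))
  }
}
```

Because the trait is sealed, adding a new case object without a matching grid makes the compiler warn about a non-exhaustive match, which is one way to keep the enum and the grids in step.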

Collaborator

@leahmcguire leahmcguire left a comment

LGTM

@tovbinm tovbinm merged commit b1aec92 into master Oct 15, 2018
@tovbinm tovbinm deleted the mt/xgboost branch October 15, 2018 17:08
@salesforce-cla

Thanks for the contribution! It looks like @Jauntbox is an internal user so signing the CLA is not required. However, we need to confirm this.

4 participants