-
Notifications
You must be signed in to change notification settings - Fork 393
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
XGBoost classification & regression models + Spark 2.3.2 #44
Conversation
/** | ||
* Copied from [[ml.dmlc.xgboost4j.scala.spark.XGBoost.removeMissingValues]] private method | ||
*/ | ||
def removeMissingValues(xgbLabelPoints: Iterator[LabeledPoint], missing: Float): Iterator[LabeledPoint] = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
refection trick doesn't work for object?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The method is not present in the snapshot version I use. We need to switch to latest 0.8 version once they release it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done! it's present in 0.80 release.
@@ -63,7 +63,7 @@ class ModelInsightsTest extends FlatSpec with PassengerSparkFixtureTest { | |||
implicit val doubleOptEquality = new Equality[Option[Double]] { | |||
def areEqual(a: Option[Double], b: Any): Boolean = b match { | |||
case None => a.isEmpty | |||
case s: Option[Double] => (a.exists(_.isNaN) && s.exists(_.isNaN)) || | |||
case s: Option[Double]@unchecked => (a.exists(_.isNaN) && s.exists(_.isNaN)) || | |||
(a.nonEmpty && a.toSeq.zip(s.toSeq).forall{ case (n, m) => n == m }) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(a.exists(_.isNaN) && s.exists(_.isNaN)) || (a == s)
Codecov Report
@@ Coverage Diff @@
## master #44 +/- ##
==========================================
- Coverage 86.4% 85.72% -0.68%
==========================================
Files 299 302 +3
Lines 9750 9881 +131
Branches 354 540 +186
==========================================
+ Hits 8424 8470 +46
- Misses 1326 1411 +85
Continue to review full report at Codecov.
|
@leahmcguire if there are no objections - let's get this merged. |
CheckIsResponseValues(in1, in2) | ||
} | ||
|
||
def setWeightCol(value: String): this.type = set(weightCol, value) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please add comments for these params
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done as well
CheckIsResponseValues(in1, in2) | ||
} | ||
|
||
def setWeightCol(value: String): this.type = set(weightCol, value) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please put comments on these settings
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done.
@@ -199,6 +199,7 @@ object RegressionModelsToTry extends Enum[RegressionModelsToTry] { | |||
case object OpRandomForestRegressor extends RegressionModelsToTry | |||
case object OpGBTRegressor extends RegressionModelsToTry | |||
case object OpGeneralizedLinearRegression extends RegressionModelsToTry | |||
case object OpXGBoostRegressor extends RegressionModelsToTry |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please also define some default grid settings for this to run in regression
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also for the other model selectors
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I propose to make additions of xgb to model selectors a separate pr.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok if that is the plan then lots hold off on adding it to the enum as well - particularly since you only added it to regression and not multiclass and binary
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sounds good. removed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Thanks for the contribution! It looks like @Jauntbox is an internal user so signing the CLA is not required. However, we need to confirm this. |
Describe the proposed solution
Adding XGBoost classification & regression models. This should eventually allow us to train models with better or on par performance than Random Forest. But more importantly use wider sparse feature vectors.
Describe alternatives you've considered
Fix Spark Random Forest implementation.
Additional context
This change adds xgboost4j-spark dependency and also upgrades to Spark 2.3.2.
TODO