Support mixed model for RandomForestClassificationModel #200
Conversation
Wonder if you can avoid serializing the model JSON string by just using `treelite.dump_as_json()` on the driver side, after collecting the trees from the executors. Not sure if that JSON format is compatible/sufficient, but it would avoid serializing the model essentially twice (in treelite and JSON formats).
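A rough sketch of that idea in plain Python (the functions and the dict-based tree representation here are hypothetical stand-ins, not the actual cuML/treelite API): executors return only the compact binary form of their trees, and the driver produces the JSON form exactly once after collecting them.

```python
import json
import pickle

def executor_side(trees):
    # Hypothetical: each executor ships only a compact binary form of its
    # trees (here pickle stands in for the treelite bytes).
    return [pickle.dumps(t) for t in trees]

def driver_side(collected):
    # Hypothetical: the driver decodes the collected trees and serializes
    # to JSON once, instead of also shipping a JSON copy from executors.
    trees = [pickle.loads(b) for b in collected]
    return json.dumps({"trees": trees})

# Toy tree structure, standing in for real forest trees.
trees = [{"split": 0.5, "left": 0, "right": 1}]
payload = executor_side(trees)
model_json = driver_side(payload)
assert json.loads(model_json)["trees"] == trees
```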
python/tests/test_random_forest.py (Outdated)

```python
from pyspark.ml.common import _py2java

example = _py2java(spark.sparkContext, Vectors.dense(1.0))
print(z.predict(example))
```
Need assert statements to ensure the predicted results are the same as pyspark predictions.
Hmm, the predictions from cuML and PySpark may differ, so it's hard to compare the prediction results exactly.
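One possible middle ground (the probability arrays below are made-up placeholders, not real model outputs): assert agreement within a tolerance rather than exact equality, so small numerical differences between the two implementations don't fail the test.

```python
import math

# Placeholder probability outputs; in the real test these would come from
# the cuML model and the converted Spark model on the same input row.
cuml_probs = [0.91, 0.09]
spark_probs = [0.90999, 0.09001]

# Probabilities should agree within an absolute tolerance.
for a, b in zip(cuml_probs, spark_probs):
    assert math.isclose(a, b, abs_tol=1e-3)

# The predicted labels (argmax) should still match exactly.
assert cuml_probs.index(max(cuml_probs)) == spark_probs.index(max(spark_probs))
```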
This PR only supports converting trees when the impurity is Gini.

Signed-off-by: Bobby Wang <[email protected]>
build
Thx @leewyang, it seems the cuML model is not compatible with the treelite model:

```python
treelite_json = cu_rf.convert_to_treelite_model().dump_as_json()
```

```
AttributeError: 'cuml.fil.fil.TreeliteModel' object has no attribute 'dump_as_json'
```
Wonder if this might be useful: rapidsai/cuml#3853. Otherwise, LGTM. (Also, pls remove "[DRAFT]" if ready for review.)
```python
        """
        return self.cpu().predictProbability(value)

    def evaluate(
```
We should consider a GPU version of this, where all the supporting metrics needed to construct the summary are also implemented on GPU.
In future work.
Good suggestion. Let me file a task for it.
Really good. It seems we can just return the treelite JSON strings on the executor side, following the example in that PR. Great, thx, I will file a follow-up for it.
build
build
This PR implemented `cpu()` to convert the cuML RandomForestClassificationModel to a Spark RandomForestClassificationModel, and a simple test predicting a single instance passes. Still, this PR needs to be refactored.

For simplicity, this PR just returned the model JSON string (for CPU inference) alongside the model bytes (for GPU inference), but that increases the bandwidth when collecting from executors to the driver, which may cause a perf issue. We need to find a way to combine them.
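One simple shape for the combined payload (everything here is a hypothetical stand-in, not the PR's actual wire format): pack the GPU model bytes and the CPU-side JSON into a single serialized object, so only one blob travels from each executor to the driver.

```python
import json
import pickle

# Stand-ins for the real artifacts: opaque cuML model bytes for GPU
# inference, and the tree JSON used to rebuild the Spark CPU model.
model_bytes = b"\x00\x01gpu-model"
model_json = {"numTrees": 2, "impurity": "gini"}

# Executor side (hypothetical): one combined blob instead of two payloads.
combined = pickle.dumps({
    "gpu": model_bytes,
    "cpu_json": json.dumps(model_json),
})

# Driver side: unpack once, keep each representation for its own use.
unpacked = pickle.loads(combined)
assert unpacked["gpu"] == model_bytes
assert json.loads(unpacked["cpu_json"]) == model_json
```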
There are some differences between the cuML RandomForestClassificationModel and the Spark RandomForestClassificationModel.
Right now, this PR only supports Gini impurity; I will file a follow-up to support entropy and variance.
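For reference, the three impurity measures mentioned can be sketched as plain functions (a simplified illustration of the definitions, not the cuML or Spark implementation):

```python
import math

def gini(probs):
    # Gini impurity: 1 - sum(p_i^2) over the class probabilities.
    return 1.0 - sum(p * p for p in probs)

def entropy(probs):
    # Entropy: -sum(p_i * log2(p_i)), skipping zero-probability classes.
    return -sum(p * math.log2(p) for p in probs if p > 0)

def variance(values):
    # Variance, used as the impurity measure for regression trees.
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

assert gini([0.5, 0.5]) == 0.5
assert entropy([0.5, 0.5]) == 1.0
assert variance([1.0, 3.0]) == 1.0
```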