
Support mixed model for RandomForestClassificationModel #200

Merged: 3 commits merged into NVIDIA:branch-23.04 on Apr 12, 2023

Conversation

wbo4958
Collaborator

@wbo4958 wbo4958 commented Apr 4, 2023

This PR implements cpu() to convert a cuML RandomForestClassificationModel to a Spark RandomForestClassificationModel, and a simple test predicting a single instance passes. Still, this PR needs to be refactored.

For simplicity, this PR just returns the model JSON string (for CPU inference) alongside the model bytes (for GPU inference), but this increases the bandwidth when collecting from the executors to the driver, which may cause a performance issue. We need to find a way to combine them.

There are some differences between the cuML RandomForestClassificationModel and the Spark RandomForestClassificationModel:

  1. The cuML RandomForestClassificationModel doesn't include GiniStat in the InternalNode.
  2. The cuML RandomForestClassificationModel uses probabilities as the leaf values, while the Spark RandomForestClassificationModel uses stats (how many instances each label contains), but we can convert the probabilities to the stats.
  3. The cuML RandomForestClassificationModel doesn't have impurity information, but we can calculate it from the leaf values.

Right now, this PR only supports Gini impurity; I will file a follow-up to support entropy and variance.
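The two leaf-level conversions described above can be sketched as follows. This is a hypothetical illustration (not the PR's actual code, and the helper names `probs_to_stats` / `gini_impurity` are made up); it assumes the total number of training instances that reached the leaf is known:

```python
# Hypothetical sketch of the conversions described above:
# 1) turn cuML leaf probabilities into Spark-style per-label instance counts,
# 2) compute Gini impurity directly from the leaf probabilities.

def probs_to_stats(probs, num_instances):
    """Convert leaf class probabilities to per-label instance counts."""
    return [p * num_instances for p in probs]

def gini_impurity(probs):
    """Gini impurity: 1 - sum(p_i^2)."""
    return 1.0 - sum(p * p for p in probs)

stats = probs_to_stats([0.25, 0.75], 8)   # -> [2.0, 6.0]
impurity = gini_impurity([0.25, 0.75])    # -> 0.375
```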

@leewyang
Collaborator

leewyang commented Apr 4, 2023

> For simplicity, This PR just returned the model JSON string (for CPU inference) alongside the model bytes (for GPU inference), but it increases the bandwidth when collecting from executor to driver which may cause perf issue. We need to find a way to combine them.

I wonder if you can avoid serializing the model JSON string by just using treelite.dump_as_json() on the driver side, after collecting the trees from the executors. I'm not sure if that JSON format is compatible/sufficient, but it would avoid serializing the model essentially twice (in treelite and JSON formats).

from pyspark.ml.common import _py2java
from pyspark.ml.linalg import Vectors

# `z` here is the JVM-side model; predict a single dense vector
example = _py2java(spark.sparkContext, Vectors.dense(1.0))
print(z.predict(example))
Collaborator

@lijinf2 lijinf2 Apr 5, 2023


We need assert statements to ensure the predicted results are the same as the PySpark predictions.

Collaborator Author


Hmm, the predictions between cuML and PySpark may differ, so it's hard to compare the prediction results exactly.
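Since exact equality between the two frameworks can't be asserted, one option (a suggestion here, not what the PR does) is to assert a high agreement rate between the two prediction sets instead. The `agreement_rate` helper and the example predictions below are hypothetical:

```python
# Hypothetical alternative to exact-equality asserts: require that the two
# models agree on most instances rather than on every one.

def agreement_rate(preds_a, preds_b):
    """Fraction of instances on which the two prediction lists agree."""
    assert len(preds_a) == len(preds_b) and preds_a
    matches = sum(1 for a, b in zip(preds_a, preds_b) if a == b)
    return matches / len(preds_a)

cuml_preds  = [0, 1, 1, 0, 1, 0, 0, 1]   # made-up example predictions
spark_preds = [0, 1, 1, 0, 1, 0, 1, 1]
assert agreement_rate(cuml_preds, spark_preds) >= 0.85  # 7/8 = 0.875
```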

This PR only supports converting trees when impurity is gini.

Signed-off-by: Bobby Wang <[email protected]>
@wbo4958
Collaborator Author

wbo4958 commented Apr 11, 2023

build

@wbo4958 wbo4958 marked this pull request as ready for review April 11, 2023 07:58
@wbo4958
Collaborator Author

wbo4958 commented Apr 11, 2023

> For simplicity, This PR just returned the model JSON string (for CPU inference) alongside the model bytes (for GPU inference), but it increases the bandwidth when collecting from executor to driver which may cause perf issue. We need to find a way to combine them.

> Wonder if you can avoid serializing the model JSON string by just using treelite.dump_as_json() on the driver side, after collecting the trees from the executors. Not sure if that JSON format is compatible/sufficient, but it would avoid serializing the model essentially twice (in treelite and JSON formats).

Thanks @leewyang, it seems that the cuML model is not compatible with the treelite model:

treelite_json = cu_rf.convert_to_treelite_model().dump_as_json()
AttributeError: 'cuml.fil.fil.TreeliteModel' object has no attribute 'dump_as_json'

@wbo4958 wbo4958 requested review from leewyang and eordentlich April 11, 2023 08:03
@leewyang
Collaborator

leewyang commented Apr 11, 2023

I wonder if this might be useful: rapidsai/cuml#3853. It sounds like it'd require an extra serialize/deserialize to disk to get to the actual treelite model class, but presumably this would only need to be done on the driver.

Otherwise, LGTM. (Also, please remove "[DRAFT]" from the title if it's ready for review.)

"""
return self.cpu().predictProbability(value)

def evaluate(
Collaborator


We should consider a GPU version of this, where all the supporting metrics needed to construct the summary are also implemented on GPU.

Collaborator


In future work.

Collaborator Author


Good suggestion. Let me file a task for it.

@wbo4958 wbo4958 changed the title [Draft] Support mixed model for RandomForestClassificationModel Support mixed model for RandomForestClassificationModel Apr 11, 2023
@wbo4958
Collaborator Author

wbo4958 commented Apr 11, 2023

> Wonder if this might be useful: rapidsai/cuml#3853 Sounds like it'd require an extra serialize/deserialize to disk to get it to the actual treelite model class, but presumably this would only need to be done on the driver.

> Otherwise, LGTM. (Also, pls remove "[DRAFT]" if ready for review).

Really good. It seems we can just return the JSON strings of treelite on the executor side, following the example in that PR. Great, thanks! I will file a follow-up for it.

leewyang
leewyang previously approved these changes Apr 11, 2023
@wbo4958
Collaborator Author

wbo4958 commented Apr 12, 2023

build

@wbo4958
Collaborator Author

wbo4958 commented Apr 12, 2023

build

@wbo4958 wbo4958 merged commit 34ce8bc into NVIDIA:branch-23.04 Apr 12, 2023
@wbo4958 wbo4958 deleted the rf-mixed-model branch April 12, 2023 23:02