
Implement binning of X as an optional preprocessing step for trees #25

Open

adam2392 opened this issue Feb 6, 2023 · 12 comments

Labels: Cython (Related to Cython code), help wanted (Extra attention is needed), Python (Related to Python code)

adam2392 (Collaborator) commented Feb 6, 2023

First, let us add binning capabilities to all forest methods:

  • extratrees
  • randomforest

As a result, BaseDecisionTree and DecisionTreeClassifier will contain the necessary sklearn code, so we can use binning in oblique trees (including MORF trees) and unsupervised trees as well.

Reference discussion on preliminary work done: neurodata/scikit-learn#23
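
For concreteness, here is a minimal sketch of what Python-level binning as a preprocessing step could look like, using sklearn's KBinsDiscretizer in a Pipeline (an illustration of the idea only, not the eventual scikit-tree API):

```python
# Minimal sketch: quantile-bin X before fitting a forest.
# Illustration only; the names and defaults here are assumptions.
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import KBinsDiscretizer

# Each feature is discretized into at most 255 ordinal bin codes, and the
# forest is then fit on the bin codes instead of the raw feature values.
binned_rf = make_pipeline(
    KBinsDiscretizer(n_bins=255, encode="ordinal", strategy="quantile"),
    RandomForestClassifier(n_estimators=100, random_state=0),
)
# Usage: binned_rf.fit(X_train, y_train); binned_rf.predict(X_test)
```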

adam2392 (Collaborator, Author) commented Feb 6, 2023

Downstream it will be useful to:

  1. verify that all unit tests pass
  2. compare performance with and without binning in terms of accuracy / ROC score / precision versus fit time and predict time. This could be a set of plots with the x-axis being runtime (fit or predict) and the y-axis the scoring metric, replicated for each tree type across a variety of n_samples and n_features (see the sketch below).

This could become a solid computing-conference paper, or a low-hanging-fruit software paper on improving decision trees.
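
A rough sketch of the kind of comparison meant in item 2 above, recording (fit time, predict time, score) with and without binning (the grids and names here are illustrative, not an existing benchmark script):

```python
# Sketch of the w/ vs. w/o binning benchmark: for one tree type, record
# fit/predict runtimes and ROC-AUC; repeat over n_samples and n_features.
import time
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import KBinsDiscretizer

results = []
for n_samples, n_features in [(1_000, 20), (10_000, 100)]:
    X, y = make_classification(n_samples=n_samples, n_features=n_features,
                               random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
    models = {
        "raw": RandomForestClassifier(random_state=0),
        "binned": make_pipeline(
            KBinsDiscretizer(n_bins=255, encode="ordinal",
                             strategy="quantile"),
            RandomForestClassifier(random_state=0)),
    }
    for name, model in models.items():
        t0 = time.perf_counter()
        model.fit(X_tr, y_tr)
        fit_time = time.perf_counter() - t0
        t0 = time.perf_counter()
        proba = model.predict_proba(X_te)[:, 1]
        predict_time = time.perf_counter() - t0
        results.append((name, n_samples, n_features, fit_time,
                        predict_time, roc_auc_score(y_te, proba)))
```

Each (fit_time, score) and (predict_time, score) pair then becomes one point on the runtime-vs-metric plots described above.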

adam2392 (Collaborator, Author) commented Feb 6, 2023

Relevant paper: https://arxiv.org/pdf/1609.06119.pdf

adam2392 (Collaborator, Author) commented Feb 6, 2023

See related discussion in scikit-learn: scikit-learn/scikit-learn#5212

adam2392 added the Python (Related to Python code) and Cython (Related to Cython code) labels on Feb 7, 2023
jshinm (Contributor) commented Mar 11, 2023

The lab has received a request to build this binning tree in order to analyze thousands of medical datasets and test numerous hypotheses. @jshinm will start working on this.

jshinm self-assigned this on Mar 11, 2023
adam2392 added the help wanted (Extra attention is needed) label on Mar 14, 2023
jshinm removed their assignment on Mar 16, 2023
adam2392 (Collaborator, Author) commented

This has been completed naively in upstream scikit-learn-tree (i.e., the fork of the scikit-learn repo).

jovo (Member) commented Sep 26, 2023

@adam2392 What is the deal with binning in scikit-tree? Can we bin? If not, can we please at least add an issue to bin?

adam2392 (Collaborator, Author) commented

We can already bin naively for all forests; that is, we bin at the Python API level. Whether this actually improves anything is a separate experimental question: we need someone to run a side-by-side experiment with and without binning for increasing feature dimension and sample size.

To fully enable this feature, we would need to figure out a design that adds it to the Cython code cleanly. That is, someone must implement logic that evaluates n_bins candidate splits rather than naively checking all n_samples splits. From a quick glance at the code as it stands, I don't see a simple way of adding this.

I am happy to code-review for anyone who wants to add this and run the experiments, though.

adam2392 reopened this on Sep 26, 2023
jovo (Member) commented Sep 26, 2023

I don't understand. Are you saying one could naively implement binning per node in Python, and that that code would be easy to write? Can you show us the code for how to do that? As a first step, we could simply test whether it is any faster.

adam2392 (Collaborator, Author) commented Sep 26, 2023

> I don't understand. Are you saying one could naively implement binning per node in Python, and that that code would be easy to write? Can you show us the code for how to do that? As a first step, we could simply test whether it is any faster.

Yes, that is correct. I implemented this in the scikit-learn fork a few months ago: https://github.com/neurodata/scikit-learn/blob/679c9a2ef7a560424781f2a210889e21c7125734/sklearn/ensemble/_forest.py#L533-L563

This is available in all scikit-tree forests as a result.

My point re: the Cython code is that this Python-level hack is potentially not enough; it might not even help, because the Cython code still naively sorts/searches all n_samples values per feature rather than n_bins values. We can get even more by going into the Cython.
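
I haven't reproduced the linked lines verbatim, but the rough shape of that Python-level hook is something like the following (a sketch from the surrounding discussion; the binner, attribute names, and n_bins knob are assumptions, not the fork's actual code):

```python
# Rough shape of a fit-time binning hook (NOT the linked _forest.py lines
# verbatim): bin X when fitting, and reuse the fitted binner at predict time.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import KBinsDiscretizer


class BinnedForestMixin:
    """Quantile-bin X at fit time; apply the same binning at predict time."""

    n_bins = 255  # hypothetical knob; the fork may expose this differently

    def fit(self, X, y, sample_weight=None):
        self._bin_mapper = KBinsDiscretizer(
            n_bins=self.n_bins, encode="ordinal", strategy="quantile"
        )
        X_binned = self._bin_mapper.fit_transform(X).astype(np.float32)
        return super().fit(X_binned, y, sample_weight=sample_weight)

    def predict(self, X):
        # predict_proba / apply would need the same transform.
        X_binned = self._bin_mapper.transform(X).astype(np.float32)
        return super().predict(X_binned)


class BinnedRandomForestClassifier(BinnedForestMixin, RandomForestClassifier):
    pass
```

Even with this, the tree-building Cython underneath still sees n_samples (now-binned) values per feature and sorts/scans them all.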

jovo (Member) commented Sep 26, 2023

@adam2392 So you hypothesize, without strong empirical results, that this is too slow, for some definition of too? Also, has scikit-learn finalized their inclusion of missing data and categorical support for decision trees?

@sampan501 You could test this hypothesis on your data and see?

adam2392 (Collaborator, Author) commented Sep 26, 2023

> @adam2392 So you hypothesize, without strong empirical results, that this is too slow, for some definition of too?

Yes. Because the Cython code remains the same, it would be very surprising if there were a significant improvement. I could see the sorting being faster, or the for-loop within Cython being somewhat faster due to more repeated feature values (because they've been binned), but the expensive for-loop itself remains the same.

Just to be clear, I am stating that I am 100% sure we will go from O(n log n) to O(n) if we modify the Cython code. But there is no reason to think we can achieve that speedup with the Python code I added alone.
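
For intuition, here is a rough sketch (plain Python, not the scikit-tree Cython) of why binning changes the split search: with pre-binned codes you can accumulate per-bin statistics in one O(n) pass and then evaluate only n_bins - 1 candidate thresholds, instead of sorting each feature in O(n log n):

```python
# Complexity sketch only (not the scikit-tree implementation): best split for
# one pre-binned feature via histogram statistics, for binary labels.
import numpy as np


def _gini(cnt, pos):
    """Gini impurity of a child with cnt samples, pos of them positive."""
    return 1.0 - (pos / cnt) ** 2 - ((cnt - pos) / cnt) ** 2


def best_split_binned(bin_codes, y, n_bins):
    """bin_codes: (n_samples,) ints in [0, n_bins); y: (n_samples,) in {0, 1}."""
    # O(n): per-bin sample counts and positive-label counts.
    count = np.bincount(bin_codes, minlength=n_bins).astype(float)
    pos = np.bincount(bin_codes, weights=y, minlength=n_bins)

    n, n_pos = count.sum(), pos.sum()
    best_bin, best_impurity = None, np.inf

    # O(n_bins): scan the bin boundaries as candidate thresholds.
    left_n = left_pos = 0.0
    for b in range(n_bins - 1):
        left_n += count[b]
        left_pos += pos[b]
        right_n, right_pos = n - left_n, n_pos - left_pos
        if left_n == 0 or right_n == 0:
            continue
        impurity = (left_n * _gini(left_n, left_pos)
                    + right_n * _gini(right_n, right_pos)) / n
        if impurity < best_impurity:
            best_bin, best_impurity = b, impurity
    return best_bin, best_impurity
```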

> Also, has scikit-learn finalized their inclusion of missing data and categorical support for decision trees?

Re: missing data, no, but it seems to be in progress. Categorical support, I believe, is not even being considered at this point, unfortunately :/

jovo (Member) commented Sep 27, 2023

@sampan501 It might be worth trying on the linear dataset for the power curves, and it also might not be. I'd just try it at the 1024 sample size, just the RF part (not the power), and see if it is any faster.
