
Implement binning of X as an optional preprocessing step for trees #25

Open

adam2392 opened this issue Feb 6, 2023 · 12 comments

Labels: Cython (Related to Cython code), help wanted (Extra attention is needed), Python (Related to Python code)

adam2392 (Collaborator) commented Feb 6, 2023

First, let us add binning capabilities to all forest methods:

  • extratrees
  • randomforest

As a result, BaseDecisionTree and DecisionTreeClassifier will contain the necessary sklearn code, so we can use binning in oblique trees (including MORF trees) and unsupervised trees as well.

Reference discussion on preliminary work done: neurodata/scikit-learn#23
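
For concreteness, here is a minimal sketch of what Python-level binning as a preprocessing step could look like, using sklearn's KBinsDiscretizer in a Pipeline (an illustration of the idea only, not the eventual scikit-tree API):

```python
# Minimal sketch: quantile-bin X before fitting a forest.
# Illustration only; the names and defaults here are assumptions.
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import KBinsDiscretizer

# Each feature is discretized into at most 255 ordinal bin codes, and the
# forest is then fit on the bin codes instead of the raw feature values.
binned_rf = make_pipeline(
    KBinsDiscretizer(n_bins=255, encode="ordinal", strategy="quantile"),
    RandomForestClassifier(n_estimators=100, random_state=0),
)
# Usage: binned_rf.fit(X_train, y_train); binned_rf.predict(X_test)
```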

adam2392 (Collaborator, Author) commented Feb 6, 2023

Downstream it will be useful to:

  1. verify that all unit tests pass
  2. compare performance with and without binning in terms of accuracy / ROC score / precision versus fit time and predict time. This could be a set of plots with the x-axis being runtime (fit or predict) and the y-axis the scoring metric, replicated for each tree type across a variety of n_samples and n_features (see the sketch below).

This could become a solid computing-conference paper, or a low-hanging-fruit software paper on improving decision trees.
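
A rough sketch of the kind of comparison meant in item 2 above, recording (fit time, predict time, score) with and without binning (the grids and names here are illustrative, not an existing benchmark script):

```python
# Sketch of the w/ vs. w/o binning benchmark: for one tree type, record
# fit/predict runtimes and ROC-AUC; repeat over n_samples and n_features.
import time
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import KBinsDiscretizer

results = []
for n_samples, n_features in [(1_000, 20), (10_000, 100)]:
    X, y = make_classification(n_samples=n_samples, n_features=n_features,
                               random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
    models = {
        "raw": RandomForestClassifier(random_state=0),
        "binned": make_pipeline(
            KBinsDiscretizer(n_bins=255, encode="ordinal",
                             strategy="quantile"),
            RandomForestClassifier(random_state=0)),
    }
    for name, model in models.items():
        t0 = time.perf_counter()
        model.fit(X_tr, y_tr)
        fit_time = time.perf_counter() - t0
        t0 = time.perf_counter()
        proba = model.predict_proba(X_te)[:, 1]
        predict_time = time.perf_counter() - t0
        results.append((name, n_samples, n_features, fit_time,
                        predict_time, roc_auc_score(y_te, proba)))
```

Each (fit_time, score) and (predict_time, score) pair then becomes one point on the runtime-vs-metric plots described above.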

adam2392 (Collaborator, Author) commented Feb 6, 2023

Relevant paper: https://arxiv.org/pdf/1609.06119.pdf

adam2392 (Collaborator, Author) commented Feb 6, 2023

See related discussion in scikit-learn: scikit-learn/scikit-learn#5212

adam2392 added the Python (Related to Python code) and Cython (Related to Cython code) labels on Feb 7, 2023
jshinm (Contributor) commented Mar 11, 2023

The lab has received a request to build this binning tree in order to analyze thousands of medical datasets and test numerous hypotheses. @jshinm will start working on this.

jshinm self-assigned this on Mar 11, 2023
adam2392 added the help wanted (Extra attention is needed) label on Mar 14, 2023
jshinm removed their assignment on Mar 16, 2023
adam2392 (Collaborator, Author) commented

This has been completed naively in upstream scikit-learn-tree (i.e., the fork of the scikit-learn repo).

jovo (Member) commented Sep 26, 2023

@adam2392 What is the deal with binning in scikit-tree? Can we bin? If not, can we please at least add an issue to bin?

adam2392 (Collaborator, Author) commented

We can already bin naively for all forests; that is, we bin at the Python API level. Whether this actually improves anything is a separate experimental question: we need someone to run a side-by-side experiment with and without binning for increasing feature dimension and sample size.

To fully enable this feature, we would need to figure out a design that adds it to the Cython code cleanly. That is, someone must implement logic that evaluates n_bins candidate splits rather than naively checking all n_samples splits. From a quick glance at the code as it stands, I don't see a simple way of adding this.

I am happy to code-review for anyone who wants to add this and run the experiments, though.

adam2392 reopened this on Sep 26, 2023
jovo (Member) commented Sep 26, 2023

I don't understand. Are you saying one could naively implement binning per node in Python, and that that code would be easy to write? Can you show us the code for how to do that? As a first step, we could simply test whether it is any faster.

adam2392 (Collaborator, Author) commented Sep 26, 2023

> I don't understand. Are you saying one could naively implement binning per node in Python, and that that code would be easy to write? Can you show us the code for how to do that? As a first step, we could simply test whether it is any faster.

Yes, that is correct. I implemented this in the scikit-learn fork a few months ago: https://github.com/neurodata/scikit-learn/blob/679c9a2ef7a560424781f2a210889e21c7125734/sklearn/ensemble/_forest.py#L533-L563

This is available in all scikit-tree forests as a result.

My point re: the Cython code is that this Python-level hack is potentially not enough; it might not even help, because the Cython code still naively sorts/searches all n_samples values per feature rather than n_bins values. We can get even more by going into the Cython.
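
I haven't reproduced the linked lines verbatim, but the rough shape of that Python-level hook is something like the following (a sketch from the surrounding discussion; the binner, attribute names, and n_bins knob are assumptions, not the fork's actual code):

```python
# Rough shape of a fit-time binning hook (NOT the linked _forest.py lines
# verbatim): bin X when fitting, and reuse the fitted binner at predict time.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import KBinsDiscretizer


class BinnedForestMixin:
    """Quantile-bin X at fit time; apply the same binning at predict time."""

    n_bins = 255  # hypothetical knob; the fork may expose this differently

    def fit(self, X, y, sample_weight=None):
        self._bin_mapper = KBinsDiscretizer(
            n_bins=self.n_bins, encode="ordinal", strategy="quantile"
        )
        X_binned = self._bin_mapper.fit_transform(X).astype(np.float32)
        return super().fit(X_binned, y, sample_weight=sample_weight)

    def predict(self, X):
        # predict_proba / apply would need the same transform.
        X_binned = self._bin_mapper.transform(X).astype(np.float32)
        return super().predict(X_binned)


class BinnedRandomForestClassifier(BinnedForestMixin, RandomForestClassifier):
    pass
```

Even with this, the tree-building Cython underneath still sees n_samples (now-binned) values per feature and sorts/scans them all.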

jovo (Member) commented Sep 26, 2023

@adam2392 So you hypothesize, without strong empirical results, that this is too slow, for some definition of too? Also, has scikit-learn finalized their inclusion of missing data and categorical support for decision trees?

@sampan501 You could test this hypothesis on your data and see?

adam2392 (Collaborator, Author) commented Sep 26, 2023

> @adam2392 So you hypothesize, without strong empirical results, that this is too slow, for some definition of too?

Yes. Because the Cython code remains the same, it would be very surprising if there were a significant improvement. I could see the sorting being faster, or the for-loop within Cython being somewhat faster due to more repeated feature values (because they've been binned), but the expensive for-loop itself remains the same.

Just to be clear, I am stating that I am 100% sure we will go from O(n log n) to O(n) if we modify the Cython code. But there is no reason to think we can achieve that speedup with the Python code I added alone.
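
For intuition, here is a rough sketch (plain Python, not the scikit-tree Cython) of why binning changes the split search: with pre-binned codes you can accumulate per-bin statistics in one O(n) pass and then evaluate only n_bins - 1 candidate thresholds, instead of sorting each feature in O(n log n):

```python
# Complexity sketch only (not the scikit-tree implementation): best split for
# one pre-binned feature via histogram statistics, for binary labels.
import numpy as np


def _gini(cnt, pos):
    """Gini impurity of a child with cnt samples, pos of them positive."""
    return 1.0 - (pos / cnt) ** 2 - ((cnt - pos) / cnt) ** 2


def best_split_binned(bin_codes, y, n_bins):
    """bin_codes: (n_samples,) ints in [0, n_bins); y: (n_samples,) in {0, 1}."""
    # O(n): per-bin sample counts and positive-label counts.
    count = np.bincount(bin_codes, minlength=n_bins).astype(float)
    pos = np.bincount(bin_codes, weights=y, minlength=n_bins)

    n, n_pos = count.sum(), pos.sum()
    best_bin, best_impurity = None, np.inf

    # O(n_bins): scan the bin boundaries as candidate thresholds.
    left_n = left_pos = 0.0
    for b in range(n_bins - 1):
        left_n += count[b]
        left_pos += pos[b]
        right_n, right_pos = n - left_n, n_pos - left_pos
        if left_n == 0 or right_n == 0:
            continue
        impurity = (left_n * _gini(left_n, left_pos)
                    + right_n * _gini(right_n, right_pos)) / n
        if impurity < best_impurity:
            best_bin, best_impurity = b, impurity
    return best_bin, best_impurity
```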

> Also, has scikit-learn finalized their inclusion of missing data and categorical support for decision trees?

Re: missing data, no, but it seems to be in progress. Categorical support, I believe, is not even being considered at this point, unfortunately :/

jovo (Member) commented Sep 27, 2023

@sampan501 It might be worth trying on the linear dataset for the power curves, and it also might not be. I'd just try it at the 1024 sample size, just the RF part (not the power), and see if it is any faster.
