Implement binning of X as an optional preprocessing step for trees #25
Downstream it will be useful to verify that:

- This could become a solid computing conference paper, or a low-hanging-fruit software paper on improving decision trees.
- Relevant paper: https://arxiv.org/pdf/1609.06119.pdf
- See related discussion in scikit-learn: scikit-learn/scikit-learn#5212
- The lab has received a request to build this binning tree in order to analyze thousands of medical datasets and test numerous hypotheses.
- This has been implemented naively upstream.
@adam2392 What is the deal with binning in scikit-tree? Can we bin? If not, can we please at least add an issue to bin?
We can already bin naively for all forests; that is, we bin at the Python API level. Whether this actually improves anything is a separate experimental question: we need someone to run a side-by-side experiment with and without binning for increasing feature dimension and sample size. To fully enable this feature, we would need to figure out a design that adds binning into the Cython code cleanly; that is, someone must implement the corresponding split-search logic there. I am happy to code review for anyone who wants to add this and run the experiments.
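A minimal sketch of what such Python-level binning could look like, assuming per-feature quantile (equal-frequency) bin edges. The fork linked in the next comment uses its own bin mapper, so treat this as an illustration, not that implementation:

```python
# Sketch: bin X at the Python level, then fit an ordinary forest on the
# integer bin codes. Quantile edges are an assumption for illustration.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def bin_features(X, n_bins=255):
    """Map each feature of X to integer bin codes via per-feature quantiles."""
    X = np.asarray(X, dtype=np.float64)
    X_binned = np.empty_like(X, dtype=np.uint8)
    for j in range(X.shape[1]):
        # Interior quantile cut points; duplicates are dropped, so
        # near-constant features collapse into fewer bins.
        quantiles = np.quantile(X[:, j], np.linspace(0, 1, n_bins + 1)[1:-1])
        edges = np.unique(quantiles)
        X_binned[:, j] = np.searchsorted(edges, X[:, j], side="right")
    return X_binned

X = np.random.rand(1024, 20)
y = (X[:, 0] > 0.5).astype(int)
clf = RandomForestClassifier(n_estimators=100).fit(bin_features(X), y)
```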
I don't understand. Are you saying one could naively implement binning per node in Python, and that the code would be easy to write? Can you show us the code for how to do that? As a first step, we could simply test whether it improves speed.
Yes, that is correct. I implemented this in a scikit-learn fork a few months ago: https://github.com/neurodata/scikit-learn/blob/679c9a2ef7a560424781f2a210889e21c7125734/sklearn/ensemble/_forest.py#L533-L563 This is available in all scikit-tree forests as a result. My point regarding the Cython code is that this Python-level hack is potentially not enough; it might not even help, because the Cython code is still naively sorting/searching over all feature values.
@adam2392 So you hypothesize, without strong empirical results, that this is too slow, for some definition of "too"? Also, has scikit-learn finalized its inclusion of missing-data and categorical support for decision trees? @sampan501 You could test this hypothesis on your data and see?
Yes: because the Cython code remains the same, it would be very surprising if there were a significant improvement. I could see the sorting being faster, or the for-loop within Cython being somewhat faster due to more repeated feature values (because they've been binned), but the expensive for-loop itself remains the same. Just to be clear, I am confident we can go from O(n log n) per feature (sorting the feature values at each split) to O(n) (a single counting pass over a fixed number of bins) if we modify the Cython code; a sketch of that counting pass is below. But there is no reason to think we can achieve that speedup with the Python code I added alone.
Re missing data: no, but it seems to be in progress. Categorical support, I believe, is not even being considered at this point, unfortunately :/
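To make the complexity claim above concrete, here is a sketch of the histogram-style split search a modified Cython splitter could perform on pre-binned features: one O(n) counting pass plus an O(n_bins) boundary scan, instead of an O(n log n) sort per feature. This is illustrative Python, not scikit-learn's splitter:

```python
# Sketch: histogram-based split search on one pre-binned feature
# (Gini impurity, binary labels). Shows the O(n + n_bins) pattern only.
import numpy as np

def best_split_binned(x_binned, y, n_bins=255):
    # O(n): count samples and positive labels per bin.
    count = np.zeros(n_bins)
    pos = np.zeros(n_bins)
    for xb, yi in zip(x_binned, y):
        count[xb] += 1
        pos[xb] += yi
    n, n_pos = count.sum(), pos.sum()

    # O(n_bins): scan bin boundaries with running left-side statistics.
    best_gini, best_bin = np.inf, None
    n_left = p_left = 0.0
    for b in range(n_bins - 1):
        n_left += count[b]
        p_left += pos[b]
        n_right = n - n_left
        if n_left == 0 or n_right == 0:
            continue
        p_right = n_pos - p_left
        gini = (n_left * 2 * (p_left / n_left) * (1 - p_left / n_left)
                + n_right * 2 * (p_right / n_right) * (1 - p_right / n_right)) / n
        if gini < best_gini:
            best_gini, best_bin = gini, b
    return best_bin, best_gini
```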
@sampan501 It might be worth trying on the linear dataset for the power curves, and it also might not be. I'd just try it at the 1024 sample size, just the RF part (not the power), and see if it is any faster.
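A rough version of that side-by-side timing experiment, assuming the `bin_features` helper from the sketch above and a synthetic linear-signal dataset as a stand-in for the real power-curve data:

```python
# Sketch: compare RF fit time with and without Python-level binning at
# n = 1024, sweeping the feature dimension. The data here is a stand-in.
import time
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
for n_features in (10, 100, 1000):
    X = rng.standard_normal((1024, n_features))
    y = (X @ rng.standard_normal(n_features) > 0).astype(int)
    for name, X_fit in [("raw", X), ("binned", bin_features(X))]:
        t0 = time.perf_counter()
        RandomForestClassifier(n_estimators=100, random_state=0).fit(X_fit, y)
        print(f"n_features={n_features:5d}  {name:6s}  {time.perf_counter() - t0:.2f}s")
```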
First, let us add the binning capabilities into all forest methods. As a result, `BaseDecisionTree` and `DecisionTreeClassifier` will have the necessary sklearn code, so we can use binning in oblique trees (including MORF trees) and unsupervised trees as well. Reference discussion on preliminary work done: neurodata/scikit-learn#23
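If that plan lands, usage might look like the following. Both the import path and the `max_bins` keyword are hypothetical names here (`max_bins` is borrowed from scikit-learn's histogram gradient boosting convention), not a confirmed scikit-tree API:

```python
# Hypothetical usage once binning is wired into the forest constructors;
# the import path and max_bins keyword are assumed names, not confirmed API.
import numpy as np
from sktree import ObliqueRandomForestClassifier

X = np.random.rand(1024, 100)
y = (X[:, 0] > 0.5).astype(int)

clf = ObliqueRandomForestClassifier(n_estimators=100, max_bins=255)
clf.fit(X, y)  # X would be binned internally before tree construction
```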