[FEA] scikit-learn based meta estimators #2876
Comments
The blog is a good example that we can get good performance with this approach (@beckernick probably has additional thoughts on challenges he saw along the way). For the SVC example, starting with the SVC meta estimator seems like a good approach to me, as long as it gets a strong speedup; I think that should be an empirical question. Agreed that it seems likely to be a minor consideration at most. My only concern is that users should be able to pass in cuDF and CuPy-style device data seamlessly. Can those arrays be passed in now, or will they generate an error with the meta estimator? If they currently generate an error, then we may need to add a wrapper to allow these datatypes to be used here too.
The sklearn meta estimators require that our models produce numpy output arrays, and many of them also need the input as numpy arrays (at least those that I have tested). For SVC I needed to wrap the meta estimator in type-conversion statements to ensure that the input/output array types work as expected.
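A minimal sketch of that kind of type-conversion wrapper (the names `NumpyIOWrapper` and `_to_numpy` are illustrative, not cuML API). It duck-types cuDF's `to_numpy()` and CuPy's `get()` so no GPU libraries need to be imported, and the wrapped meta estimator only ever sees numpy arrays:

```python
import numpy as np


def _to_numpy(a):
    """Coerce cuDF/CuPy-style objects to numpy (duck-typed, no GPU imports)."""
    if hasattr(a, "to_numpy"):   # cuDF DataFrame/Series
        return a.to_numpy()
    if hasattr(a, "get"):        # CuPy ndarray
        return a.get()
    return np.asarray(a)


class NumpyIOWrapper:
    """Wrap an estimator so fit/predict always see and return numpy arrays."""

    def __init__(self, estimator):
        self.estimator = estimator

    def fit(self, X, y):
        self.estimator.fit(_to_numpy(X), _to_numpy(y))
        return self

    def predict(self, X):
        return np.asarray(self.estimator.predict(_to_numpy(X)))
```

Here `estimator` would be the sklearn meta estimator (e.g. one from `sklearn.multiclass`) built around a cuML model, so device-array inputs are converted once on the way in and the predictions come back as numpy.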
Nowadays, we're in a pretty good place compatibility-wise and most of the challenges have been resolved 😄. I agree with both of you and think that in the short term it's worth relying on input/output type configurability and paying the transfer costs. Inputs that go through the scikit-learn code path for validating data will usually (but not always) hit a conversion call during validation.
This is generally consistent with what I've seen for other estimators as well. For the complex models, the conversion cost is easily dwarfed by the time taken by the estimator's fit/predict calls. And for the simpler models, it still only adds a small amount of absolute time.
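That relative-cost argument is easy to check empirically. The sketch below is CPU-only (no GPU required): a full array copy stands in for a device-to-host transfer, and a least-squares solve stands in for an estimator's `fit`; the sizes are arbitrary illustrative choices:

```python
import time

import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((20000, 100)).astype(np.float32)
y = rng.standard_normal(20000).astype(np.float32)

# Stand-in for a device-to-host transfer: one full copy of the input.
t0 = time.perf_counter()
X_copy = np.array(X, copy=True)
t_convert = time.perf_counter() - t0

# Stand-in for estimator.fit: an O(n * d^2) least-squares solve.
t0 = time.perf_counter()
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
t_fit = time.perf_counter() - t0

print(f"convert: {t_convert * 1e3:.2f} ms, fit: {t_fit * 1e3:.2f} ms")
```

On typical hardware the copy is a small fraction of the fit time, matching the observation above; for very cheap models the two can be closer.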
In the medium/longer term, I think we should consider building our own with concurrent streams in mind. Perhaps the larger value-add of creating our own functions for meta-estimators and cross-validators is not from eliminating D/H transfers but from enabling overlap of the various fit/predict kernels across streams to maximize utilization. Today, when we go through a scikit-learn meta-estimator or cross-validator, each fit/predict call blocks before the next one starts. I suspect these calls would still be blocking in the scikit-learn cross-validator/meta-estimator world, but potentially non-blocking in a future cuML version with concurrent streams. Keeping the GPU at peak utilization would be immensely valuable.
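The overlap being described can be sketched on the CPU with a thread pool, as a loose analogy for concurrent CUDA streams (in a real cuML version, each fit would instead be launched on its own stream; `fit_fold` here is a hypothetical stand-in for one cross-validation fit):

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np


def fit_fold(seed):
    """Stand-in for one cross-validation fit; real code would launch GPU work."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((2000, 20))
    y = rng.standard_normal(2000)
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


# Launch all folds concurrently instead of blocking on each fit in turn.
with ThreadPoolExecutor(max_workers=4) as pool:
    coefs = list(pool.map(fit_fold, range(4)))
```

The scikit-learn cross-validators effectively execute the loop body one call at a time; the point above is that independent fits have no data dependency on each other and could be issued concurrently.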
This issue has been marked rotten due to no recent activity in the past 90d. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.
What are the plans / guidelines for using scikit-learn meta estimators in combination with cuML algorithms?
Input/output type configurability provides a great way to combine scikit-learn meta estimators with cuML algorithms: one just needs to set the input and output type to numpy, and existing meta estimators from scikit-learn can already be used.
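As a concrete sketch of the pattern, using sklearn's own `SVC` as a stand-in so the snippet runs without a GPU (with cuML one would instead import `cuml.svm.SVC` and configure numpy output, e.g. via `cuml.set_global_output_type("numpy")`; the toy three-class dataset is invented for illustration):

```python
import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC  # with cuML: from cuml.svm import SVC

# Toy three-class dataset with well-separated clusters.
rng = np.random.default_rng(0)
X = rng.standard_normal((90, 4))
y = np.repeat([0, 1, 2], 30)
X[y == 1] += 3.0
X[y == 2] -= 3.0

# The meta estimator only ever exchanges numpy arrays with the base
# estimator, which is why configuring cuML's output type to numpy
# is enough to make this composition work.
clf = OneVsRestClassifier(SVC()).fit(X, y)
pred = clf.predict(X)
```

The trade-off, as discussed in the comments above, is a device-to-host transfer per call, which is usually small relative to fit/predict time.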
Concrete examples:
Meta estimators within cuML
Some ML algorithms require us to use meta estimators under the hood of cuML:
Pros of using sklearn as it is:
Cons:
Direct use of sklearn in non-pytest code #2467.

Questions:
Short term: is it OK to go forward with multiclass SVC by using sklearn.multiclass (numpy input), or is there a strong objection to adding more direct imports from sklearn?
In the medium/long run, how do we plan to support device arrays with these meta estimators? One could imagine a solution analogous to the sklearn-based preprocessing PR: [REVIEW] Sklearn-based preprocessing #2645.