Backend agnostic machine learning models #962
Conversation
```python
# sklearn.model_selection: Model Selection
"GroupKFold": "sklearn.model_selection.GroupKFold",
"GroupShuffleSplit": "sklearn.model_selection.GroupShuffleSplit",
"KFold": "sklearn.model_selection.KFold",
```
There are probably a lot of mappings here that aren't even compatible with `model_class` or `experiment_class`.
I pretty much just listed every class available in sklearn and cuML and allowed it to be mapped no matter what, since these are just string mappings and they have to error eventually anyway if they're not compatible with `CREATE MODEL`/`CREATE EXPERIMENT`. But let me know if there are chunks of mappings that I should just remove.
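To make the "it has to error eventually" point concrete, here's a minimal sketch of how a string mapping might get resolved at query time (`resolve_class` is hypothetical, not the PR's actual code):

```python
# A minimal sketch of resolving a mapped string to a class; illustrative only.
import importlib


def resolve_class(path: str):
    """Import a class from its fully qualified dotted path."""
    module_name, _, class_name = path.rpartition(".")
    module = importlib.import_module(module_name)
    return getattr(module, class_name)


# An incompatible entry still resolves fine here and only errors later,
# e.g. once CREATE MODEL tries to call .fit() on an instance of it.
ModelClass = resolve_class("sklearn.model_selection.KFold")
```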
My preference would be to generate this mapping dynamically for a few reasons:
- as scikit-learn updates, it's very likely that many of these classes will get shuffled around, which will end up breaking CI until a PR is made to manually update things
- though this mapping currently works on the scikit-learn version getting pulled in by our `python>=3.9` testing, we don't actually know how many scikit-learn versions between 1.0.0 and 1.2.1 this mapping holds up for
It looks like for scikit-learn, we should be able to do something like

```python
from sklearn.utils import all_estimators

estimators = {k: v.__module__ + "." + v.__qualname__ for k, v in all_estimators()}
```

to grab its estimators dynamically. I'm not sure if the other packages have something similar to this, but I would be interested in whether it's possible to generalize the code used for `all_estimators`.
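For illustration, here's a hypothetical sketch of one way that generalization could look, assuming the target package exposes its estimator classes from an importable module (`discover_estimators` and the `lightgbm.sklearn` example are assumptions, not an existing API):

```python
# A hypothetical generalization of the all_estimators idea to other packages.
import importlib
import inspect


def discover_estimators(module_name, predicate=None):
    """Map class names in a module to their fully qualified paths."""
    module = importlib.import_module(module_name)
    return {
        name: obj.__module__ + "." + obj.__qualname__
        for name, obj in inspect.getmembers(module, inspect.isclass)
        if predicate is None or predicate(obj)
    }


# e.g. LightGBM's scikit-learn-compatible wrappers live in lightgbm.sklearn:
# lgbm_classes = discover_estimators("lightgbm.sklearn")
```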
I didn't know about this, cool! LightGBM and XGBoost don't seem to have a similar function, but they have a much smaller number of estimators anyway, which I wouldn't expect to be prone to changing.

I couldn't find the function in cuML, but I opened rapidsai/cuml#5162 as a possibility for the future. In the meantime, I kept the `gpu_classes` dict as is.
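For reference, hand-maintained mappings for those two packages could stay as small dicts; the entries below are illustrative examples, not the PR's actual lists:

```python
# Illustrative shape of a hand-maintained mapping for packages without an
# all_estimators equivalent; entries shown are examples only.
lightgbm_classes = {
    "LGBMClassifier": "lightgbm.LGBMClassifier",
    "LGBMRegressor": "lightgbm.LGBMRegressor",
}
xgboost_classes = {
    "XGBClassifier": "xgboost.XGBClassifier",
    "XGBRegressor": "xgboost.XGBRegressor",
}
```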
tests/integration/test_model.py (outdated):

```python
c.sql(
    """
    CREATE OR REPLACE MODEL my_model WITH (
        model_class = 'SGDClassifier',
```
`SGDClassifier` is not supported by cuML.
tests/integration/test_model.py (outdated):

```python
c.sql(
    """
    CREATE MODEL IF NOT EXISTS my_model_lightgbm WITH (
        model_class = 'LGBMClassifier',
```
LightGBM does not seem to support `dask_cudf` DataFrames. I actually get the same error, `TypeError: Implicit conversion to a NumPy array is not allowed`, as in #943.
Codecov Report

```
@@            Coverage Diff             @@
##             main     #962      +/-   ##
==========================================
+ Coverage   78.50%   81.21%     +2.70%
==========================================
  Files          76       77         +1
  Lines        4322     4365        +43
  Branches      788      792         +4
==========================================
+ Hits         3393     3545       +152
+ Misses        758      644       -114
- Partials      171      176         +5
```
Co-authored-by: Charles Blackmon-Luca <[email protected]>
Looks like the gpuCI failures here are due to `is_cudf_type` not behaving as expected; I think the modifications you made to the utility in #984 should be sufficient to resolve them here.
tests/integration/test_model.py (outdated):

```python
# TODO: In this query, we are using cuml.dask.linear_model.LinearRegression
# instead of cuml.linear_model.LinearRegression.
# Is there any way to assert that we are using the cuML Dask estimator
# (and not just the cuML estimator)?
```
Is this a concern for just this specific model class, or throughout the test suite? In either case, as long as `wrap_predict=False`, something like

```python
assert "dask" in str(c.schema["root"].models["my_model"][0].__class__).lower()
```

should work here.
Cool, thanks! The reason I wanted to have a check is that originally this code was part of `test_dask_cuml_training_and_prediction`, which was specifically meant to test `cuml.dask.linear_model.LinearRegression` functionality.

With this PR, `LinearRegression` and some others map to the `cuml.dask` version automatically. I figured it was nicer to map to the Dask version of a model if it's available - what do you think?
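For illustration, a minimal sketch of that preference, where `pick_gpu_class` and the example dicts are hypothetical stand-ins for the PR's actual mapping logic:

```python
# A hedged sketch of "prefer the Dask variant when one is mapped".
def pick_gpu_class(name, gpu_classes, dask_gpu_classes):
    # Fall back to the single-GPU class only when no Dask variant exists.
    return dask_gpu_classes.get(name, gpu_classes.get(name))


path = pick_gpu_class(
    "LinearRegression",
    {"LinearRegression": "cuml.linear_model.LinearRegression"},
    {"LinearRegression": "cuml.dask.linear_model.LinearRegression"},
)
assert path == "cuml.dask.linear_model.LinearRegression"
```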
> I figured it was nicer to map to the Dask version of a model if it's available

That makes sense to me; this shouldn't be too limiting, since users can always just specify the non-Dask version explicitly if they need to use it for some reason.

IMO, unless there's some implicit fallback behavior in the `cuml.dask` classes when certain conditions aren't met, such that they default to their non-Dask variants, I would find the checks a little superfluous: the mapping to Dask classes and their import is relatively straightforward, and I wouldn't expect it to change unless we specifically modified the mapping. If there's a case where this could change unexpectedly, though, I'd be happy to know more about it 🙂 I would also recommend adding this check conditionally to `check_trained_model` so that it can easily be extended to other cases where we might want to assert that the Dask class is being used; see the sketch below.
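For example, a hypothetical version of that conditional check (the helper's real signature, the `timeseries` table, and the `target` column are assumptions):

```python
# A hypothetical extension of the test helper along the lines suggested above.
def check_trained_model(c, model_name="my_model", expect_dask_class=False):
    result = c.sql(
        f"SELECT * FROM PREDICT(MODEL {model_name}, SELECT * FROM timeseries)"
    ).compute()
    # Prediction column should always be present after PREDICT.
    assert "target" in result.columns
    if expect_dask_class:
        # Only assert the Dask class when the test opts in.
        model, _ = c.schema["root"].models[model_name]
        assert "dask" in str(model.__class__).lower()
```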
```python
# ImportError: dask-glm >= 0.2.1.dev was not found, please install it to use multi-GPU logistic regression.
# "LogisticRegression": "cuml.dask.extended.linear_model.logistic_regression.LogisticRegression",
```
Would it make sense to throw a TODO or FIXME in here to track this failure? Not sure if there's an upstream issue open around this.
Sure, I opened #1015 and linked to it.
Not sure if this is something we should fix on the Dask-SQL side (see #1015 (comment)). Would it make sense to have some sort of try/except logic around this?
Thanks @sarahyurick 🙂 With that additional context, IMO the best option on our end for now would be to include `cuml.dask.extended.linear_model.logistic_regression.LogisticRegression` in the mapping and raise the `ImportError` directly, with the idea being that users would just subsequently install `dask-glm` into their environment and rerun their query.

I suppose that if we wanted to be as informative as possible, we could reraise the error with some addendum about using the non-Dask equivalent class if installing `dask-glm` isn't an option, though I think that message would probably make more sense to add upstream, since it would be informative to cuML users at large.
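A rough sketch of that reraise-with-addendum idea (where this try/except would live and the exact message wording are assumptions):

```python
# Hedged sketch: reraise the upstream ImportError with a pointer to the
# non-Dask equivalent class when dask-glm is missing.
try:
    from cuml.dask.extended.linear_model.logistic_regression import (
        LogisticRegression,
    )
except ImportError as e:
    raise ImportError(
        str(e)
        + " If installing dask-glm is not an option, consider using the "
        "non-Dask equivalent cuml.linear_model.LogisticRegression instead."
    ) from e
```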
Changes generally LGTM. Just a few clarifying questions.
Thanks a lot @sarahyurick. As a followup to this, we should update some of the docs to incorporate the features added in this PR: https://dask-sql.readthedocs.io/en/latest/machine_learning.html
Thanks again @sarahyurick 😄
Closes #370

Most of the additions come from listing out all class mappings in `ml_classes.py` and adding various tests in `test_model.py`. Obviously, it would be too much to test all of the mappings, so I just included the classes already tested elsewhere.