Backend agnostic machine learning models #962

Merged: 43 commits into dask-contrib:main on Jan 31, 2023

Conversation

sarahyurick (Collaborator):
Closes #370

Most of the additions come from listing out all class mappings in ml_classes.py and adding various tests in test_model.py. Obviously, it would be too much to test all mappings, so I just included the classes already tested elsewhere.

# sklearn.model_selection: Model Selection
"GroupKFold": "sklearn.model_selection.GroupKFold",
"GroupShuffleSplit": "sklearn.model_selection.GroupShuffleSplit",
"KFold": "sklearn.model_selection.KFold",
sarahyurick (Collaborator, Author):
There are probably a lot of mappings here that aren't even compatible with model_class or experiment_class.

I pretty much just listed every class available in sklearn and cuml and allowed them all to be mapped, since these are just string mappings and they have to error eventually anyway if they're not compatible with CREATE MODEL/CREATE EXPERIMENT. But let me know if there are chunks of mappings that I should just remove.
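For illustration, resolving one of these string mappings into an actual class could look like the following minimal sketch (the import_class helper here is illustrative, not necessarily the exact utility dask-sql uses):

import importlib

def import_class(path):
    # Split "sklearn.model_selection.KFold" into module path and class name,
    # import the module, and pull the class off of it.
    module_path, _, class_name = path.rpartition(".")
    return getattr(importlib.import_module(module_path), class_name)

KFoldClass = import_class("sklearn.model_selection.KFold")
cv = KFoldClass(n_splits=5)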

Collaborator:

My preference would be to generate this mapping dynamically, for a few reasons:

  • as scikit-learn updates, it's very likely that many of these classes will get shuffled around, which will end up breaking CI until a PR is made to manually update things
  • though this mapping currently works with the scikit-learn version pulled in by our python>=3.9 testing, we don't actually know how many versions between 1.0.0 and 1.2.1 this mapping holds up for

It looks like for scikit-learn, we should be able to do something like the following to grab its estimators dynamically:

from sklearn.utils import all_estimators

# Map each estimator's name to its fully qualified class path.
estimators = {k: v.__module__ + "." + v.__qualname__ for k, v in all_estimators()}

I'm not sure if the other packages have something similar to this, but I'd be interested in whether it's possible to generalize the code used for all_estimators.
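As a rough sketch, one way to generalize this for packages without an all_estimators helper might be the following (the list_estimators name and the fit-attribute heuristic are assumptions for illustration, not an existing API):

import inspect

def list_estimators(module):
    # Walk the module's top-level classes and keep anything exposing the
    # shared estimator interface (approximated here by having a fit method).
    return {
        name: f"{cls.__module__}.{cls.__qualname__}"
        for name, cls in inspect.getmembers(module, inspect.isclass)
        if hasattr(cls, "fit")
    }

import lightgbm
estimators = list_estimators(lightgbm)
# e.g. {"LGBMClassifier": "lightgbm.sklearn.LGBMClassifier", ...}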

sarahyurick (Collaborator, Author):

I didn't know about this, cool! LightGBM and XGBoost don't seem to have a similar function, but they have far fewer estimators anyway, so I wouldn't expect them to be prone to changing.

I couldn't find the function in cuML, but I opened rapidsai/cuml#5162 as a possibility for the future. In the meantime, I kept the gpu_classes dict as is.

c.sql(
"""
CREATE OR REPLACE MODEL my_model WITH (
model_class = 'SGDClassifier',
sarahyurick (Collaborator, Author):

SGDClassifier is not supported by cuml.
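For context, a complete version of this kind of query might look like the following sketch; the timeseries table, its columns, and the wrap_predict/target_column settings are illustrative, modeled on the dask-sql docs:

# Illustrative, complete CREATE MODEL query; table and column names are
# assumptions for the example.
c.sql(
    """
    CREATE OR REPLACE MODEL my_model WITH (
        model_class = 'SGDClassifier',
        wrap_predict = True,
        target_column = 'target'
    ) AS (
        SELECT x, y, x*y > 0 AS target
        FROM timeseries
        LIMIT 100
    )
    """
)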

c.sql(
"""
CREATE MODEL IF NOT EXISTS my_model_lightgbm WITH (
model_class = 'LGBMClassifier',
sarahyurick (Collaborator, Author):

LightGBM does not seem to support dask_cudf DataFrames. I actually get the same TypeError: Implicit conversion to a NumPy array is not allowed error as in #943.

codecov-commenter commented on Dec 8, 2022:

Codecov Report

Merging #962 (98c42d5) into main (cb2ae32) will increase coverage by 2.70%.
The diff coverage is 85.10%.


@@            Coverage Diff             @@
##             main     #962      +/-   ##
==========================================
+ Coverage   78.50%   81.21%   +2.70%     
==========================================
  Files          76       77       +1     
  Lines        4322     4365      +43     
  Branches      788      792       +4     
==========================================
+ Hits         3393     3545     +152     
+ Misses        758      644     -114     
- Partials      171      176       +5     
Impacted Files Coverage Δ
dask_sql/physical/rel/custom/create_experiment.py 87.35% <66.66%> (-2.52%) ⬇️
dask_sql/physical/rel/custom/create_model.py 88.60% <75.00%> (+0.93%) ⬆️
dask_sql/physical/utils/ml_classes.py 92.85% <92.85%> (ø)
dask_sql/utils.py 100.00% <100.00%> (ø)
dask_sql/_version.py 35.31% <0.00%> (+1.41%) ⬆️
dask_sql/physical/rel/custom/predict_model.py 91.89% <0.00%> (+16.21%) ⬆️
dask_sql/input_utils/hive.py 100.00% <0.00%> (+81.74%) ⬆️


@charlesbluca (Collaborator) left a comment:

Looks like the gpuCI failures here are due to is_cudf_type not behaving as expected; I think the modifications you made to the utility in #984 should be sufficient to resolve them here.
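A minimal sketch of the kind of check such a utility might perform (an assumption about its shape, not the actual implementation from #984):

def is_cudf_type(obj):
    # Inspect the object's own type string and, for Dask collections, the
    # type of its _meta; a match on "cudf" in either indicates a cudf-backed
    # object such as a cudf or dask_cudf DataFrame.
    type_strings = [
        str(type(obj)),
        str(type(getattr(obj, "_meta", None))),
    ]
    return any("cudf" in s for s in type_strings)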

Comment on lines 79 to 82
# TODO: In this query, we are using cuml.dask.linear_model.LinearRegression
# instead of cuml.linear_model.LinearRegression.
# Is there any way to assert that we are using the cuML Dask estimator
# (and not just the cuML estimator)?
Collaborator:

Is this a concern for just this specific model class, or throughout the test suite? In either case, as long as wrap_predict=False, something like the following should work here:

assert "dask" in str(c.schema["root"].models["my_model"][0].__class__).lower()

sarahyurick (Collaborator, Author):

Cool, thanks! The reason I wanted to have a check is that this code was originally part of test_dask_cuml_training_and_prediction, which was specifically meant to test cuml.dask.linear_model.LinearRegression functionality.

With this PR, LinearRegression and some others map to the cuml.dask version automatically. I figured it was nicer to map to the Dask version of a model if it's available - what do you think?
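For illustration, that preference would produce mapping entries along these lines (representative examples, not the exact contents of gpu_classes):

gpu_classes = {
    # Dask variant preferred when cuML provides one:
    "LinearRegression": "cuml.dask.linear_model.LinearRegression",
    # otherwise the single-GPU cuML class is used:
    "SVC": "cuml.svm.SVC",
}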

Collaborator:

I figured it was nicer to map to the Dask version of a model if it's available

That makes sense to me, this shouldn't be too limiting since users can always just specify the non-Dask version explicitly if they need to use it for some reason.

IMO, unless there's some implicit fallback behavior in the cuml.dask classes such that they default to their non-Dask variants when certain conditions aren't met, I would find the checks a little superfluous: the mapping to Dask classes and their import is relatively straightforward, and I wouldn't expect it to change unless we specifically modified the mapping. If there's a case where this could change unexpectedly, though, I'd be happy to know more about it 🙂 and would recommend adding this check conditionally to check_trained_model so that it can easily be extended to other cases where we might want to assert that the Dask class is being used.
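A rough sketch of what that conditional check could look like; the helper's signature and the PREDICT query here are assumptions for illustration, not the actual test helper:

def check_trained_model(c, model_name="my_model", expect_dask_class=False):
    if expect_dask_class:
        # Assert that the registered model is the cuml.dask / dask-ml variant.
        model = c.schema["root"].models[model_name][0]
        assert "dask" in str(model.__class__).lower()
    result = c.sql(
        f"SELECT * FROM PREDICT (MODEL {model_name}, SELECT x, y FROM timeseries)"
    ).compute()
    assert "target" in result.columns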


Comment on lines 25 to 26
# ImportError: dask-glm >= 0.2.1.dev was not found, please install it to use multi-GPU logistic regression.
# "LogisticRegression": "cuml.dask.extended.linear_model.logistic_regression.LogisticRegression",
Collaborator:

Would it make sense to throw a TODO or FIXME in here to track this failure? Not sure if there's an upstream issue open around this.

sarahyurick (Collaborator, Author):

Sure, I opened #1015 and linked to it.

sarahyurick (Collaborator, Author):

Not sure if this is something we should fix on the Dask-SQL side (see #1015 (comment)). Would it make sense to have some sort of try/except logic around this?

Collaborator:

Thanks @sarahyurick 🙂 With that additional context, IMO the best option on our end for now would be to include cuml.dask.extended.linear_model.logistic_regression.LogisticRegression in the mapping and raise the ImportError directly, the idea being that users would then install dask-glm into their environment and rerun their query.

I suppose if we wanted to be as informative as possible, we could reraise the error with some addendum about using the non-Dask equivalent class if installing dask-glm isn't an option, though I think that message would probably make more sense to add upstream since it would be informative to cuML users at large.
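As a sketch, that reraising approach could look roughly like this (a minimal illustration, not a final implementation):

try:
    from cuml.dask.extended.linear_model.logistic_regression import LogisticRegression
except ImportError as e:
    # Surface the original dask-glm error, with a hint about the
    # single-GPU fallback if installing dask-glm isn't an option.
    raise ImportError(
        f"{e} If installing dask-glm is not possible, consider the non-Dask "
        "equivalent 'cuml.linear_model.LogisticRegression' instead."
    ) from e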

@ayushdg (Collaborator) left a comment:

Changes generally LGTM. Just a few clarifying questions.

@ayushdg (Collaborator) left a comment:

Thanks a lot @sarahyurick. As a follow-up to this, we should update some of the docs to incorporate the features added in this PR: https://dask-sql.readthedocs.io/en/latest/machine_learning.html

@charlesbluca (Collaborator) left a comment:

Thanks again @sarahyurick 😄

@charlesbluca merged commit 5d8ef43 into dask-contrib:main on Jan 31, 2023.
@sarahyurick deleted the agnostic_models branch on May 26, 2023.