Backend agnostic machine learning models #962

Merged: 43 commits into dask-contrib:main on Jan 31, 2023

Conversation

sarahyurick (Collaborator):
Closes #370

Most of the additions come from listing out all class mappings in ml_classes.py and adding various tests in test_model.py. Obviously, it would be too much to test all mappings, so I just included the classes already tested elsewhere.

# sklearn.model_selection: Model Selection
"GroupKFold": "sklearn.model_selection.GroupKFold",
"GroupShuffleSplit": "sklearn.model_selection.GroupShuffleSplit",
"KFold": "sklearn.model_selection.KFold",
sarahyurick (Collaborator, Author):
There are probably a lot of mappings here that aren't even compatible with model_class or experiment_class.

I pretty much just listed every class available in sklearn and cuml and allowed them all to be mapped, since these are just string mappings and they have to error eventually anyway if they're not compatible with CREATE MODEL/CREATE EXPERIMENT. But let me know if there are chunks of mappings that I should just remove.
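For illustration, resolving one of these string mappings into an actual class could look like the following minimal sketch (the import_class helper here is illustrative, not necessarily the exact utility dask-sql uses):

import importlib

def import_class(path):
    # Split "sklearn.model_selection.KFold" into module path and class name,
    # import the module, and pull the class off of it.
    module_path, _, class_name = path.rpartition(".")
    return getattr(importlib.import_module(module_path), class_name)

KFoldClass = import_class("sklearn.model_selection.KFold")
cv = KFoldClass(n_splits=5)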

Collaborator:

My preference would be to generate this mapping dynamically, for a few reasons:

  • as scikit-learn updates, it's very likely that many of these classes will get shuffled around, which will end up breaking CI until a PR is made to manually update things
  • though this mapping currently works with the scikit-learn version pulled in by our python>=3.9 testing, we don't actually know how many versions between 1.0.0 and 1.2.1 this mapping holds up for

It looks like for scikit-learn, we should be able to do something like the following to grab its estimators dynamically:

from sklearn.utils import all_estimators

# Map each estimator's name to its fully qualified class path.
estimators = {k: v.__module__ + "." + v.__qualname__ for k, v in all_estimators()}

I'm not sure if the other packages have something similar to this, but I'd be interested in whether it's possible to generalize the code used for all_estimators.
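As a rough sketch, one way to generalize this for packages without an all_estimators helper might be the following (the list_estimators name and the fit-attribute heuristic are assumptions for illustration, not an existing API):

import inspect

def list_estimators(module):
    # Walk the module's top-level classes and keep anything exposing the
    # shared estimator interface (approximated here by having a fit method).
    return {
        name: f"{cls.__module__}.{cls.__qualname__}"
        for name, cls in inspect.getmembers(module, inspect.isclass)
        if hasattr(cls, "fit")
    }

import lightgbm
estimators = list_estimators(lightgbm)
# e.g. {"LGBMClassifier": "lightgbm.sklearn.LGBMClassifier", ...}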

sarahyurick (Collaborator, Author):

I didn't know about this, cool! LightGBM and XGBoost don't seem to have a similar function, but they have far fewer estimators anyway, so I wouldn't expect them to be prone to changing.

I couldn't find the function in cuML, but I opened rapidsai/cuml#5162 as a possibility for the future. In the meantime, I kept the gpu_classes dict as is.

c.sql(
"""
CREATE OR REPLACE MODEL my_model WITH (
model_class = 'SGDClassifier',
sarahyurick (Collaborator, Author):

SGDClassifier is not supported by cuml.
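For context, a complete version of this kind of query might look like the following sketch; the timeseries table, its columns, and the wrap_predict/target_column settings are illustrative, modeled on the dask-sql docs:

# Illustrative, complete CREATE MODEL query; table and column names are
# assumptions for the example.
c.sql(
    """
    CREATE OR REPLACE MODEL my_model WITH (
        model_class = 'SGDClassifier',
        wrap_predict = True,
        target_column = 'target'
    ) AS (
        SELECT x, y, x*y > 0 AS target
        FROM timeseries
        LIMIT 100
    )
    """
)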

c.sql(
"""
CREATE MODEL IF NOT EXISTS my_model_lightgbm WITH (
model_class = 'LGBMClassifier',
sarahyurick (Collaborator, Author):

LightGBM does not seem to support dask_cudf DataFrames. I actually get the same TypeError: Implicit conversion to a NumPy array is not allowed error as in #943.

codecov-commenter commented on Dec 8, 2022:

Codecov Report

Merging #962 (98c42d5) into main (cb2ae32) will increase coverage by 2.70%.
The diff coverage is 85.10%.


@@            Coverage Diff             @@
##             main     #962      +/-   ##
==========================================
+ Coverage   78.50%   81.21%   +2.70%     
==========================================
  Files          76       77       +1     
  Lines        4322     4365      +43     
  Branches      788      792       +4     
==========================================
+ Hits         3393     3545     +152     
+ Misses        758      644     -114     
- Partials      171      176       +5     
Impacted Files Coverage Δ
dask_sql/physical/rel/custom/create_experiment.py 87.35% <66.66%> (-2.52%) ⬇️
dask_sql/physical/rel/custom/create_model.py 88.60% <75.00%> (+0.93%) ⬆️
dask_sql/physical/utils/ml_classes.py 92.85% <92.85%> (ø)
dask_sql/utils.py 100.00% <100.00%> (ø)
dask_sql/_version.py 35.31% <0.00%> (+1.41%) ⬆️
dask_sql/physical/rel/custom/predict_model.py 91.89% <0.00%> (+16.21%) ⬆️
dask_sql/input_utils/hive.py 100.00% <0.00%> (+81.74%) ⬆️


@charlesbluca (Collaborator) left a comment:

Looks like the gpuCI failures here are due to is_cudf_type not behaving as expected; I think the modifications you made to the utility in #984 should be sufficient to resolve them here.
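A minimal sketch of the kind of check such a utility might perform (an assumption about its shape, not the actual implementation from #984):

def is_cudf_type(obj):
    # Inspect the object's own type string and, for Dask collections, the
    # type of its _meta; a match on "cudf" in either indicates a cudf-backed
    # object such as a cudf or dask_cudf DataFrame.
    type_strings = [
        str(type(obj)),
        str(type(getattr(obj, "_meta", None))),
    ]
    return any("cudf" in s for s in type_strings)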

Comment on lines 79 to 82
# TODO: In this query, we are using cuml.dask.linear_model.LinearRegression
# instead of cuml.linear_model.LinearRegression.
# Is there any way to assert that we are using the cuML Dask estimator
# (and not just the cuML estimator)?
Collaborator:

Is this a concern for just this specific model class, or throughout the test suite? In either case, as long as wrap_predict=False, something like the following should work here:

assert "dask" in str(c.schema["root"].models["my_model"][0].__class__).lower()

sarahyurick (Collaborator, Author):

Cool, thanks! The reason I wanted to have a check is that this code was originally part of test_dask_cuml_training_and_prediction, which was specifically meant to test cuml.dask.linear_model.LinearRegression functionality.

With this PR, LinearRegression and some others map to the cuml.dask version automatically. I figured it was nicer to map to the Dask version of a model if it's available - what do you think?
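For illustration, that preference would produce mapping entries along these lines (representative examples, not the exact contents of gpu_classes):

gpu_classes = {
    # Dask variant preferred when cuML provides one:
    "LinearRegression": "cuml.dask.linear_model.LinearRegression",
    # otherwise the single-GPU cuML class is used:
    "SVC": "cuml.svm.SVC",
}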

Collaborator:

I figured it was nicer to map to the Dask version of a model if it's available

That makes sense to me, this shouldn't be too limiting since users can always just specify the non-Dask version explicitly if they need to use it for some reason.

IMO, unless there's some implicit fallback behavior in the cuml.dask classes such that they default to their non-Dask variants when certain conditions aren't met, I would find the checks a little superfluous: the mapping to Dask classes and their import is relatively straightforward, and I wouldn't expect it to change unless we specifically modified the mapping. If there's a case where this could change unexpectedly, though, I'd be happy to know more about it 🙂 and would recommend adding this check conditionally to check_trained_model so that it can easily be extended to other cases where we might want to assert that the Dask class is being used.
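A rough sketch of what that conditional check could look like; the helper's signature and the PREDICT query here are assumptions for illustration, not the actual test helper:

def check_trained_model(c, model_name="my_model", expect_dask_class=False):
    if expect_dask_class:
        # Assert that the registered model is the cuml.dask / dask-ml variant.
        model = c.schema["root"].models[model_name][0]
        assert "dask" in str(model.__class__).lower()
    result = c.sql(
        f"SELECT * FROM PREDICT (MODEL {model_name}, SELECT x, y FROM timeseries)"
    ).compute()
    assert "target" in result.columns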


Comment on lines 25 to 26
# ImportError: dask-glm >= 0.2.1.dev was not found, please install it to use multi-GPU logistic regression.
# "LogisticRegression": "cuml.dask.extended.linear_model.logistic_regression.LogisticRegression",
Collaborator:

Would it make sense to throw a TODO or FIXME in here to track this failure? Not sure if there's an upstream issue open around this.

sarahyurick (Collaborator, Author):

Sure, I opened #1015 and linked to it.

sarahyurick (Collaborator, Author):

Not sure if this is something we should fix on the Dask-SQL side (see #1015 (comment)). Would it make sense to have some sort of try/except logic around this?

Collaborator:

Thanks @sarahyurick 🙂 With that additional context, IMO the best option on our end for now would be to include cuml.dask.extended.linear_model.logistic_regression.LogisticRegression in the mapping and raise the ImportError directly, the idea being that users would then install dask-glm into their environment and rerun their query.

I suppose if we wanted to be as informative as possible, we could reraise the error with some addendum about using the non-Dask equivalent class if installing dask-glm isn't an option, though I think that message would probably make more sense to add upstream since it would be informative to cuML users at large.
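As a sketch, that reraising approach could look roughly like this (a minimal illustration, not a final implementation):

try:
    from cuml.dask.extended.linear_model.logistic_regression import LogisticRegression
except ImportError as e:
    # Surface the original dask-glm error, with a hint about the
    # single-GPU fallback if installing dask-glm isn't an option.
    raise ImportError(
        f"{e} If installing dask-glm is not possible, consider the non-Dask "
        "equivalent 'cuml.linear_model.LogisticRegression' instead."
    ) from e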

@ayushdg (Collaborator) left a comment:

Changes generally LGTM. Just a few clarifying questions.

@ayushdg (Collaborator) left a comment:

Thanks a lot @sarahyurick. As a follow-up to this, we should update some of the docs to incorporate the features added in this PR: https://dask-sql.readthedocs.io/en/latest/machine_learning.html

@charlesbluca (Collaborator) left a comment:

Thanks again @sarahyurick 😄

@charlesbluca merged commit 5d8ef43 into dask-contrib:main on Jan 31, 2023.
@sarahyurick deleted the agnostic_models branch on May 26, 2023.