[BUG] .predict_proba on fitted Pipeline object with a ColumnTransformer step raises exception #4368
Thanks for including a reproducible example. From an initial triage, there are at least a couple of issues intermingled here, which may partially cause this behavior. It looks like a cuML Pipeline with a cuML OneHotEncoder fails:

import pandas as pd
from sklearn.pipeline import Pipeline as sk_Pipeline
from sklearn.preprocessing import OneHotEncoder as sk_OneHotEncoder
from cuml.experimental.preprocessing import ColumnTransformer as cu_ColumnTransformer
from cuml.preprocessing import OneHotEncoder as cu_OneHotEncoder
from cuml.pipeline import Pipeline as cu_Pipeline
X_train = pd.DataFrame(
[{"id": 1, "cat": "a", "num": 1.0, "extra": 5},
{"id": 2, "cat": "a", "num": 2.0, "extra": -1},
{"id": 3, "cat": "b", "num": 3.0, "extra": 100}]
)
# skl Pipeline, skl OHE
categorical_vars = ["cat"]
categorical_transformer = sk_Pipeline(
[
("ordinal", sk_OneHotEncoder(sparse=False)),
]
)
print(categorical_transformer.fit(X_train[categorical_vars]))
# cuml Pipeline, skl OHE
categorical_transformer = cu_Pipeline(
[
("ordinal", sk_OneHotEncoder(sparse=False)),
]
)
print(categorical_transformer.fit(X_train[categorical_vars]))
# cuml Pipeline, cuml OHE
categorical_transformer = cu_Pipeline(
[
("ordinal", cu_OneHotEncoder(sparse=False)),
]
)
print(categorical_transformer.fit(X_train[categorical_vars]))
Pipeline(steps=[('ordinal', OneHotEncoder(sparse=False))])
Pipeline(steps=[('ordinal', OneHotEncoder(sparse=False))])
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
/tmp/ipykernel_65047/1591776796.py in <module>
39 )
40
---> 41 print(categorical_transformer.fit(X_train[categorical_vars]))
~/conda/envs/rapids-21.12/lib/python3.8/site-packages/sklearn/pipeline.py in fit(self, X, y, **fit_params)
392 if self._final_estimator != "passthrough":
393 fit_params_last_step = fit_params_steps[self.steps[-1][0]]
--> 394 self._final_estimator.fit(Xt, y, **fit_params_last_step)
395
396 return self
~/conda/envs/rapids-21.12/lib/python3.8/site-packages/cuml/internals/api_decorators.py in inner(*args, **kwargs)
358 def inner(*args, **kwargs):
359 with self._recreate_cm(func, args):
--> 360 return func(*args, **kwargs)
361
362 return inner
TypeError: fit() takes 2 positional arguments but 3 were given

It also looks like a cuML ColumnTransformer can fail when a Pipeline has multiple steps:

import pandas as pd
from sklearn.pipeline import Pipeline as sk_Pipeline
from sklearn.impute import SimpleImputer as sk_SimpleImputer
from sklearn.preprocessing import StandardScaler as sk_StandardScaler
from sklearn.compose import ColumnTransformer as sk_ColumnTransformer
from cuml.experimental.preprocessing import ColumnTransformer as cu_ColumnTransformer
from cuml.preprocessing import SimpleImputer as cu_SimpleImputer
from cuml.preprocessing import StandardScaler as cu_StandardScaler
from cuml.pipeline import Pipeline as cu_Pipeline
X_train = pd.DataFrame(
[{"id": 1, "cat": "a", "num": 1.0, "extra": 5},
{"id": 2, "cat": "a", "num": 2.0, "extra": -1},
{"id": 3, "cat": "b", "num": 3.0, "extra": 100}]
)
## all cuml except ColumnTransformer
numeric_vars = ["num"]
numeric_transformer = cu_Pipeline(
steps=[
("imputer", cu_SimpleImputer(strategy="mean")),
("scaler", cu_StandardScaler()),
]
)
preprocessor = sk_ColumnTransformer(
transformers=[
("numeric", numeric_transformer, numeric_vars),
],
)
preprocessor.fit_transform(X_train) # works
preprocessor.fit(X_train); preprocessor.transform(X_train) # works
array([[-1.22474487],
[ 0. ],
[ 1.22474487]])

## cuml ColumnTransformer with single step pipeline
numeric_vars = ["num"]
numeric_transformer = cu_Pipeline(
steps=[
# ("imputer", cu_SimpleImputer(strategy="mean")),
("scaler2", cu_StandardScaler()),
]
)
preprocessor = cu_ColumnTransformer(
transformers=[
("numeric", numeric_transformer, numeric_vars),
],
)
preprocessor.fit_transform(X_train) # works
preprocessor.fit(X_train); preprocessor.transform(X_train) # works
array([[-1.22474487],
[ 0. ],
[ 1.22474487]])

## cuml ColumnTransformer with two step pipeline (also happens with SimpleImputer)
numeric_vars = ["num"]
numeric_transformer = cu_Pipeline(
steps=[
("scaler1", cu_StandardScaler()),
("scaler2", cu_StandardScaler()),
]
)
preprocessor = cu_ColumnTransformer(
transformers=[
("numeric", numeric_transformer, numeric_vars),
],
)
preprocessor.fit_transform(X_train) # works
preprocessor.fit(X_train); preprocessor.transform(X_train)  # raises, see traceback below
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
/tmp/ipykernel_65651/4058867824.py in <module>
15
16 preprocessor.fit_transform(X_train) # works
---> 17 preprocessor.fit(X_train); preprocessor.transform(X_train) # works
~/conda/envs/rapids-21.12/lib/python3.8/site-packages/cuml/internals/api_decorators.py in inner_get(*args, **kwargs)
584
585 # Call the function
--> 586 ret_val = func(*args, **kwargs)
587
588 return cm.process_return(ret_val)
~/conda/envs/rapids-21.12/lib/python3.8/site-packages/cuml/_thirdparty/sklearn/preprocessing/_column_transformer.py in transform(self, X)
932 "data given during fit."
933 )
--> 934 Xs = self._fit_transform(X, None, _transform_one, fitted=True)
935 self._validate_output(Xs)
936
~/conda/envs/rapids-21.12/lib/python3.8/site-packages/cuml/_thirdparty/sklearn/preprocessing/_column_transformer.py in _fit_transform(self, X, y, func, fitted)
804 self._iter(fitted=fitted, replace_strings=True))
805 try:
--> 806 return Parallel(n_jobs=self.n_jobs)(
807 delayed(func)(
808 transformer=clone(trans) if not fitted else trans,
~/conda/envs/rapids-21.12/lib/python3.8/site-packages/joblib/parallel.py in __call__(self, iterable)
1041 # remaining jobs.
1042 self._iterating = False
-> 1043 if self.dispatch_one_batch(iterator):
1044 self._iterating = self._original_iterator is not None
1045
~/conda/envs/rapids-21.12/lib/python3.8/site-packages/joblib/parallel.py in dispatch_one_batch(self, iterator)
859 return False
860 else:
--> 861 self._dispatch(tasks)
862 return True
863
~/conda/envs/rapids-21.12/lib/python3.8/site-packages/joblib/parallel.py in _dispatch(self, batch)
777 with self._lock:
778 job_idx = len(self._jobs)
--> 779 job = self._backend.apply_async(batch, callback=cb)
780 # A job can complete so quickly than its callback is
781 # called before we get here, causing self._jobs to
~/conda/envs/rapids-21.12/lib/python3.8/site-packages/joblib/_parallel_backends.py in apply_async(self, func, callback)
206 def apply_async(self, func, callback=None):
207 """Schedule a func to be run"""
--> 208 result = ImmediateResult(func)
209 if callback:
210 callback(result)
~/conda/envs/rapids-21.12/lib/python3.8/site-packages/joblib/_parallel_backends.py in __init__(self, batch)
570 # Don't delay the application, to avoid keeping the input
571 # arguments in memory
--> 572 self.results = batch()
573
574 def get(self):
~/conda/envs/rapids-21.12/lib/python3.8/site-packages/joblib/parallel.py in __call__(self)
260 # change the default number of processes to -1
261 with parallel_backend(self._backend, n_jobs=self._n_jobs):
--> 262 return [func(*args, **kwargs)
263 for func, args, kwargs in self.items]
264
~/conda/envs/rapids-21.12/lib/python3.8/site-packages/joblib/parallel.py in <listcomp>(.0)
260 # change the default number of processes to -1
261 with parallel_backend(self._backend, n_jobs=self._n_jobs):
--> 262 return [func(*args, **kwargs)
263 for func, args, kwargs in self.items]
264
~/conda/envs/rapids-21.12/lib/python3.8/site-packages/cuml/_thirdparty/sklearn/preprocessing/_column_transformer.py in __call__(self, *args, **kwargs)
359 def __call__(self, *args, **kwargs):
360 _global_settings_data.shared_state = self.config
--> 361 return self.function(*args, **kwargs)
362
363
~/conda/envs/rapids-21.12/lib/python3.8/site-packages/cuml/_thirdparty/sklearn/preprocessing/_column_transformer.py in _transform_one(transformer, X, y, weight, **fit_params)
285
286 def _transform_one(transformer, X, y, weight, **fit_params):
--> 287 res = transformer.transform(X).to_output('cupy')
288 # if we have a weight for this transformer, multiply output
289 if weight is None:
~/conda/envs/rapids-21.12/lib/python3.8/site-packages/sklearn/utils/metaestimators.py in <lambda>(*args, **kwargs)
111
112 # lambda, but not partial, allows help() to work with update_wrapper
--> 113 out = lambda *args, **kwargs: self.fn(obj, *args, **kwargs) # noqa
114 else:
115
~/conda/envs/rapids-21.12/lib/python3.8/site-packages/sklearn/pipeline.py in transform(self, X)
645 Xt = X
646 for _, _, transform in self._iter():
--> 647 Xt = transform.transform(Xt)
648 return Xt
649
~/conda/envs/rapids-21.12/lib/python3.8/site-packages/cuml/internals/api_decorators.py in inner_get(*args, **kwargs)
584
585 # Call the function
--> 586 ret_val = func(*args, **kwargs)
587
588 return cm.process_return(ret_val)
~/conda/envs/rapids-21.12/lib/python3.8/site-packages/cuml/_thirdparty/sklearn/preprocessing/_data.py in transform(self, X, copy)
793 copy = copy if copy is not None else self.copy
794
--> 795 X = self._validate_data(X, reset=False,
796 accept_sparse=['csr', 'csc'], copy=copy,
797 estimator=self, dtype=FLOAT_DTYPES,
~/conda/envs/rapids-21.12/lib/python3.8/site-packages/cuml/_thirdparty/sklearn/utils/skl_dependencies.py in _validate_data(self, X, y, reset, validate_separately, **check_params)
109 f"requires y to be passed, but the target y is None."
110 )
--> 111 X = check_array(X, **check_params)
112 out = X
113 else:
~/conda/envs/rapids-21.12/lib/python3.8/site-packages/cuml/thirdparty_adapters/adapters.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
248 raise ValueError("Not enough samples")
249
--> 250 if ensure_min_features > 0 and hasshape and array.ndim == 2:
251 n_features = array.shape[1]
252 if n_features < ensure_min_features:
AttributeError: 'CumlArray' object has no attribute 'ndim'
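The failing line in check_array assumed every array-like exposes an ndim attribute, which the CumlArray reaching it here does not. As an illustration only (check_min_features and FakeArray are hypothetical names, not cuML's actual patch), a defensive version of that feature-count check could fall back to the length of shape when ndim is missing:

```python
# Hypothetical sketch, not the actual cuML fix: guard the feature-count
# validation so it does not assume the input exposes an `ndim` attribute.
def check_min_features(array, ensure_min_features=1):
    """Validate the number of features without requiring array.ndim."""
    shape = getattr(array, "shape", None)
    # Fall back to len(shape) when the object has no `ndim` attribute --
    # the situation the AttributeError above hit.
    ndim = getattr(array, "ndim", len(shape) if shape is not None else 0)
    if ensure_min_features > 0 and shape is not None and ndim == 2:
        if shape[1] < ensure_min_features:
            raise ValueError(
                f"Found array with {shape[1]} feature(s) while a minimum "
                f"of {ensure_min_features} is required.")
    return array

class FakeArray:
    """Mimics an array-like that exposes shape but not ndim."""
    shape = (3, 1)

check_min_features(FakeArray())  # no AttributeError with the fallback
```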
…th a ColumnTransformer step (#4774)

This PR fixes a subtle bug in check_array of cuml.thirdparty_adapters.adapters, which is the primary cause for the bug. Fix #4368.

Authors:
- https://github.com/VamsiTallam95
- Ray Douglass (https://github.com/raydouglass)

Approvers:
- Dante Gama Dessavre (https://github.com/dantegd)

URL: #4774
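The other issue from the triage above was sklearn's Pipeline calling the final estimator as fit(Xt, y, **fit_params), passing y positionally to an estimator whose fit only accepts X. That mismatch, and a thin wrapper that works around it, can be sketched without any cuML dependency (XOnlyEncoder and DropYAdapter are hypothetical names used only for illustration):

```python
# Hypothetical illustration of the TypeError from the triage comment:
# sklearn's Pipeline calls final_estimator.fit(Xt, y, **params) with y
# passed positionally, so an estimator whose fit() accepts only X fails.
class XOnlyEncoder:
    """Stand-in for an estimator whose fit() signature is fit(self, X)."""
    def fit(self, X):
        self.n_features_ = len(X[0])
        return self

class DropYAdapter:
    """Wrapper that accepts and discards y before delegating to fit(X)."""
    def __init__(self, estimator):
        self.estimator = estimator
    def fit(self, X, y=None, **fit_params):
        self.estimator.fit(X)  # delegate without the unsupported y
        return self

try:
    XOnlyEncoder().fit([[1, 2]], None)  # mimics the Pipeline's positional call
except TypeError as exc:
    print("raised:", exc)

DropYAdapter(XOnlyEncoder()).fit([[1, 2]], None)  # succeeds
```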
Describe the bug

When using the ColumnTransformer from cuml.experimental.preprocessing in an already-fitted Pipeline, the methods predict/predict_proba raise exceptions stating that X has a mismatched number of features, even though the data has the same shape as the DataFrame passed to fit.

Steps/Code to reproduce bug

Here's a minimal example, preserving the types and shape of my real data and the structure of the pipeline (same encoders, imputers, and classifier used):

The stack trace output is immediately below. Using sklearn equivalents, predict_proba executes without exception.

Expected behavior

Calling predict_proba on a fitted pipeline should return an array of predictions. If there is a larger error with the input to predict_proba or predict, a more descriptive error message would also be much appreciated.

Environment details (please complete the following information):

rapidsai/rapidsai-core:21.08-cuda11.0-base-ubuntu18.04-py3.8