StackingEnsemble and AggregatingEnsemble crash during fitting due to missing data #17

yvchao · 2022-05-25T16:14:09Z

Describe the bug

When Adjutorium selects the optimal imputer, StackingEnsemble and AggregatingEnsemble fail during fitting because the upstream implementation rejects input that contains missing values.

Example to reproduce

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
import sys
import random
import pandas as pd
import numpy as np
from pathlib import Path
import warnings
warnings.filterwarnings('ignore')

from adjutorium.studies.classifiers import ClassifierStudy
from adjutorium.utils.serialization import load_model_from_file, save_model_to_file
from adjutorium.utils.tester import evaluate_estimator
import adjutorium.logger as log

X, Y = load_breast_cancer(return_X_y=True, as_frame=True)

# Simulate missingness
total_len = len(X)

for col in ["mean texture", "mean compactness"]:
    indices = random.sample(range(0, total_len), 10)
    X.loc[indices, col] = np.nan

dataset = X.copy()
dataset["target"] = Y

workspace = Path("workspace")
workspace.mkdir(parents=True, exist_ok=True)

study_name = "classification_example_imputation"

study = ClassifierStudy(
    study_name=study_name,
    dataset=dataset,
    target="target",
    num_iter=1,
    num_study_iter=1,
    timeout=1,
    imputers=["mean", "ice", "median"],
    classifiers=["logistic_regression", "lda"],
    feature_scaling=[],  # feature preprocessing is disabled
    score_threshold=0.4,
    workspace=workspace,
)

log.add(sys.stderr, level='INFO')

study.run()

Result

The following messages appear in the log:

...
[2022-05-25T16:53:43.553802+0100][45426][INFO] StackingEnsemble failed Input contains NaN, infinity or a value too large for dtype('float64').
[2022-05-25T16:53:43.579949+0100][45426][INFO] AggregatingEnsemble failed Input contains NaN, infinity or a value too large for dtype('float64').
...
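
The wording of these messages appears to match what scikit-learn's input validators raise for non-finite input. As a minimal sketch, assuming only NumPy and scikit-learn are installed (the tiny array below is made up for illustration and is not the dataset from this report), the same class of error can be triggered in isolation:

```python
import numpy as np
from sklearn.utils import check_array

# Illustrative 2x2 array with a single missing value.
X = np.array([[1.0, 2.0], [np.nan, 4.0]])

try:
    # By default check_array rejects NaN and infinite values.
    check_array(X)
    validation_error = None
except ValueError as exc:
    validation_error = str(exc)

print(validation_error)
```

The exact message text differs between scikit-learn versions, so only the exception type should be relied on.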

Note

This is caused by the input validation in the upstream combo module:

    ...
    def fit(self, X, y):
        """Fit classifier.
        Parameters
        ----------
        X : numpy array of shape (n_samples, n_features)
            The input samples.
        y : numpy array of shape (n_samples,), optional (default=None)
            The ground truth of the input samples (labels).
        """

        # Validate inputs X and y
        X, y = check_X_y(X, y)
        X = check_array(X)
        self._set_n_classes(y)
    ...

StackingEnsemble and AggregatingEnsemble crash at this validation step even though an imputer is included in the pipeline.
The input data should be imputed before it is passed to these ensembles. Alternatively, this behavior could be overridden with a custom implementation.
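
One possible shape for the first option is sketched below. It is not Adjutorium's or combo's code: `ImputeFirst` is a hypothetical wrapper name, scikit-learn's `SimpleImputer` stands in for whichever imputer the study selects, and `LogisticRegression` stands in for an ensemble whose `fit` rejects NaN. The data is synthetic.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression


class ImputeFirst:
    """Hypothetical wrapper: impute X before delegating to a NaN-strict estimator."""

    def __init__(self, estimator, imputer=None):
        self.estimator = estimator
        # Assumption: mean imputation as a default; any fitted-transformer works.
        self.imputer = SimpleImputer(strategy="mean") if imputer is None else imputer

    def fit(self, X, y):
        X_filled = self.imputer.fit_transform(X)
        self.estimator.fit(X_filled, y)
        return self

    def predict(self, X):
        # Reuse the statistics learned during fit.
        return self.estimator.predict(self.imputer.transform(X))


# Synthetic data with missing values in one column.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = (X[:, 0] > 0).astype(int)
X[::10, 1] = np.nan

# Fitting the strict estimator directly fails on the NaN check.
try:
    LogisticRegression().fit(X, y)
    direct_failed = False
except ValueError:
    direct_failed = True

# The wrapped estimator only ever sees imputed data.
model = ImputeFirst(LogisticRegression()).fit(X, y)
preds = model.predict(X)
print(direct_failed, preds.shape)
```

The same effect could be had by making sure the ensemble receives the output of the pipeline's imputation stage rather than the raw input; the wrapper just makes that ordering explicit.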


bcebere mentioned this issue Nov 28, 2022

Bugfixing & improvements #40

Merged

bcebere closed this as completed in #40 Nov 28, 2022
