StackingEnsemble and AggregatingEnsemble crash during fitting due to missing data #17

Closed
yvchao opened this issue May 25, 2022 · 0 comments · Fixed by #40

Describe the bug

When Adjutorium selects the optimal imputer, StackingEnsemble and AggregatingEnsemble fail during fitting because the upstream implementation rejects input that contains missing data.

Example to reproduce

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
import sys
import random
import pandas as pd
import numpy as np
from pathlib import Path
import warnings
warnings.filterwarnings('ignore')

from adjutorium.studies.classifiers import ClassifierStudy
from adjutorium.utils.serialization import load_model_from_file, save_model_to_file
from adjutorium.utils.tester import evaluate_estimator
import adjutorium.logger as log

X, Y = load_breast_cancer(return_X_y=True, as_frame=True)

# Simulate missingness
total_len = len(X)

for col in ["mean texture", "mean compactness"]:
    indices = random.sample(range(0, total_len), 10)
    X.loc[indices, col] = np.nan

dataset = X.copy()
dataset["target"] = Y

workspace = Path("workspace")
workspace.mkdir(parents=True, exist_ok=True)

study_name = "classification_example_imputation"

study = ClassifierStudy(
    study_name=study_name,
    dataset=dataset,
    target="target",
    num_iter=1,
    num_study_iter=1,
    timeout=1,
    imputers=["mean", "ice", "median"],
    classifiers=["logistic_regression", "lda"],
    feature_scaling=[],  # feature preprocessing is disabled
    score_threshold=0.4,
    workspace=workspace,
)

log.add(sys.stderr, level="INFO")

study.run()

Result

The following messages appear in the log:

...
[2022-05-25T16:53:43.553802+0100][45426][INFO] StackingEnsemble failed Input contains NaN, infinity or a value too large for dtype('float64').
[2022-05-25T16:53:43.579949+0100][45426][INFO] AggregatingEnsemble failed Input contains NaN, infinity or a value too large for dtype('float64').
...

Note

This is caused by the input validation in the upstream module combo:

    ...
    def fit(self, X, y):
        """Fit classifier.
        Parameters
        ----------
        X : numpy array of shape (n_samples, n_features)
            The input samples.
        y : numpy array of shape (n_samples,), optional (default=None)
            The ground truth of the input samples (labels).
        """

        # Validate inputs X and y
        X, y = check_X_y(X, y)
        X = check_array(X)
        self._set_n_classes(y)
    ...

StackingEnsemble and AggregatingEnsemble crash at this validation step even though the imputer is included in the pipeline.
The input data should be imputed before being provided to these ensembles. Alternatively, this behavior could be overridden with a customized implementation.
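
As a rough illustration of the first suggestion, the sketch below uses scikit-learn directly rather than Adjutorium or combo. The toy array and the choice of a mean-strategy SimpleImputer are assumptions made for the example. It shows that check_X_y rejects NaN with its default settings and that the same data passes once it has been imputed.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.utils import check_X_y

# Toy data with one missing value (an assumption, not the dataset from the report).
X = np.array([[1.0, 2.0], [np.nan, 3.0], [5.0, 4.0], [7.0, 1.0]])
y = np.array([0, 1, 0, 1])

# With default settings, check_X_y raises ValueError when X contains NaN.
try:
    check_X_y(X, y)
    validation_failed = False
except ValueError:
    validation_failed = True

# Imputing first (mean strategy chosen only for illustration) lets validation pass.
X_imputed = SimpleImputer(strategy="mean").fit_transform(X)
X_checked, y_checked = check_X_y(X_imputed, y)
```

In the ensembles, the equivalent change would be to apply the fitted imputer to the data before calling the upstream fit, or to wrap the upstream class so that it imputes internally; which of the two fits Adjutorium better is left open here.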
