
Change HP Name & Include Text example #1410

Merged: 14 commits, Mar 2, 2022
Changes from 11 commits
Large diffs are not rendered by default.

2 changes: 1 addition & 1 deletion autosklearn/metalearning/files/f1_binary.classification_dense/configurations.csv
100755 → 100644

Large diffs are not rendered by default.

2 changes: 1 addition & 1 deletion autosklearn/metalearning/files/f1_binary.classification_sparse/configurations.csv
100755 → 100644

Large diffs are not rendered by default.


2 changes: 1 addition & 1 deletion autosklearn/metalearning/files/f1_multiclass.classification_dense/configurations.csv
100755 → 100644

Large diffs are not rendered by default.

2 changes: 1 addition & 1 deletion autosklearn/metalearning/files/f1_multiclass.classification_sparse/configurations.csv
100755 → 100644

Large diffs are not rendered by default.


2 changes: 1 addition & 1 deletion autosklearn/metalearning/files/mean_squared_error_regression_dense/configurations.csv
100755 → 100644

Large diffs are not rendered by default.


2 changes: 1 addition & 1 deletion autosklearn/metalearning/files/r2_regression_dense/configurations.csv
100755 → 100644

Large diffs are not rendered by default.

2 changes: 1 addition & 1 deletion autosklearn/metalearning/files/r2_regression_sparse/configurations.csv
100755 → 100644

Large diffs are not rendered by default.

2 changes: 1 addition & 1 deletion autosklearn/metalearning/files/recall_binary.classification_dense/configurations.csv
100755 → 100644

Large diffs are not rendered by default.

2 changes: 1 addition & 1 deletion autosklearn/metalearning/files/recall_binary.classification_sparse/configurations.csv
100755 → 100644

Large diffs are not rendered by default.


2 changes: 1 addition & 1 deletion autosklearn/metalearning/files/roc_auc_binary.classification_dense/configurations.csv
100755 → 100644

Large diffs are not rendered by default.


@@ -17,13 +17,13 @@
class BagOfWordEncoder(AutoSklearnPreprocessingAlgorithm):
def __init__(
self,
ngram_range: int = 1,
ngram_upper_bound: int = 1,
min_df_choice: str = "min_df_absolute",
min_df_absolute: int = 0,
min_df_relative: float = 0.01,
random_state: Optional[Union[int, np.random.RandomState]] = None,
) -> None:
self.ngram_range = ngram_range
self.ngram_upper_bound = ngram_upper_bound
self.random_state = random_state
self.min_df_choice = min_df_choice
self.min_df_absolute = min_df_absolute
@@ -46,13 +46,13 @@ def fit(
if self.min_df_choice == "min_df_absolute":
self.preprocessor = CountVectorizer(
min_df=self.min_df_absolute,
ngram_range=(1, self.ngram_range),
ngram_range=(1, self.ngram_upper_bound),
)

elif self.min_df_choice == "min_df_relative":
self.preprocessor = CountVectorizer(
min_df=self.min_df_relative,
ngram_range=(1, self.ngram_range),
ngram_range=(1, self.ngram_upper_bound),
)

else:
@@ -98,8 +98,8 @@ def get_hyperparameter_search_space(
dataset_properties: Optional[DATASET_PROPERTIES_TYPE] = None,
) -> ConfigurationSpace:
cs = ConfigurationSpace()
hp_ngram_range = CSH.UniformIntegerHyperparameter(
name="ngram_range", lower=1, upper=3, default_value=1
hp_ngram_upper_bound = CSH.UniformIntegerHyperparameter(
name="ngram_upper_bound", lower=1, upper=3, default_value=1
)
hp_min_df_choice_bow = CSH.CategoricalHyperparameter(
"min_df_choice", choices=["min_df_absolute", "min_df_relative"]
@@ -112,7 +112,7 @@
)
cs.add_hyperparameters(
[
hp_ngram_range,
hp_ngram_upper_bound,
hp_min_df_choice_bow,
hp_min_df_absolute_bow,
hp_min_df_relative_bow,
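The rename above reflects what the hyperparameter actually controls: only the upper end of the range passed to `ngram_range=(1, …)`; the lower bound is fixed at 1. A dependency-free sketch of what that range enumerates (the helper name is hypothetical, not part of auto-sklearn):

```python
def word_ngrams(tokens, upper_bound):
    """Enumerate all n-grams for n = 1..upper_bound, mirroring
    scikit-learn's ngram_range=(1, upper_bound) on word tokens."""
    grams = []
    for n in range(1, upper_bound + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i : i + n]))
    return grams

tokens = "free text features".split()
print(word_ngrams(tokens, 1))  # unigrams only
print(word_ngrams(tokens, 2))  # unigrams plus bigrams
```

With `ngram_upper_bound=1` (the default) only single words are counted; raising it to 3 adds bigrams and trigrams to the vocabulary, which is exactly the 1–3 search range declared in `get_hyperparameter_search_space`.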
@@ -16,13 +16,13 @@
class BagOfWordEncoder(AutoSklearnPreprocessingAlgorithm):
def __init__(
self,
ngram_range: int = 1,
ngram_upper_bound: int = 1,
min_df_choice: str = "min_df_absolute",
min_df_absolute: int = 0,
min_df_relative: float = 0.01,
random_state: Optional[Union[int, np.random.RandomState]] = None,
) -> None:
self.ngram_range = ngram_range
self.ngram_upper_bound = ngram_upper_bound
self.random_state = random_state
self.min_df_choice = min_df_choice
self.min_df_absolute = min_df_absolute
@@ -40,7 +40,8 @@ def fit(

for feature in X.columns:
vectorizer = CountVectorizer(
min_df=self.min_df_absolute, ngram_range=(1, self.ngram_range)
min_df=self.min_df_absolute,
ngram_range=(1, self.ngram_upper_bound),
).fit(X[feature])
self.preprocessor[feature] = vectorizer

@@ -50,7 +51,8 @@ def fit(

for feature in X.columns:
vectorizer = CountVectorizer(
min_df=self.min_df_relative, ngram_range=(1, self.ngram_range)
min_df=self.min_df_relative,
ngram_range=(1, self.ngram_upper_bound),
).fit(X[feature])
self.preprocessor[feature] = vectorizer
else:
@@ -102,8 +104,8 @@ def get_hyperparameter_search_space(
dataset_properties: Optional[DATASET_PROPERTIES_TYPE] = None,
) -> ConfigurationSpace:
cs = ConfigurationSpace()
hp_ngram_range = CSH.UniformIntegerHyperparameter(
name="ngram_range", lower=1, upper=3, default_value=1
hp_ngram_upper_bound = CSH.UniformIntegerHyperparameter(
name="ngram_upper_bound", lower=1, upper=3, default_value=1
)
hp_min_df_choice_bow = CSH.CategoricalHyperparameter(
"min_df_choice", choices=["min_df_absolute", "min_df_relative"]
@@ -116,7 +118,7 @@
)
cs.add_hyperparameters(
[
hp_ngram_range,
hp_ngram_upper_bound,
hp_min_df_choice_bow,
hp_min_df_absolute_bow,
hp_min_df_relative_bow,
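This second encoder differs from the first in that it keeps one vocabulary per text column rather than a single shared one. A toy, dependency-free sketch of that per-column pattern (class and names are illustrative stand-ins, not the auto-sklearn API):

```python
class PerColumnCounter:
    """One vocabulary per text column, as in the per-feature
    CountVectorizer loop above (illustrative stand-in only)."""

    def fit(self, columns):
        # columns: mapping of column name -> list of documents
        self.vocab_ = {
            name: sorted({word for doc in docs for word in doc.split()})
            for name, docs in columns.items()
        }
        return self

    def transform(self, columns):
        n_docs = len(next(iter(columns.values())))
        rows = []
        for i in range(n_docs):
            row = []
            for name, vocab in self.vocab_.items():
                words = columns[name][i].split()
                # count each vocabulary word of this column only
                row.extend(words.count(w) for w in vocab)
            rows.append(row)
        return rows

cols = {"title": ["cheap deal", "deal"], "body": ["buy now", "now now"]}
enc = PerColumnCounter().fit(cols)
print(enc.transform(cols))  # counts against each column-local vocabulary
```

Because each column gets its own vocabulary, the same word appearing in two columns produces two distinct count features, unlike the shared-vocabulary variant.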
@@ -17,14 +17,14 @@
class TfidfEncoder(AutoSklearnPreprocessingAlgorithm):
def __init__(
self,
ngram_range: int = 1,
ngram_upper_bound: int = 1,
use_idf: bool = True,
min_df_choice: str = "min_df_absolute",
min_df_absolute: int = 0,
min_df_relative: float = 0.01,
random_state: Optional[Union[int, np.random.RandomState]] = None,
) -> None:
self.ngram_range = ngram_range
self.ngram_upper_bound = ngram_upper_bound
self.random_state = random_state
self.use_idf = use_idf
self.min_df_choice = min_df_choice
@@ -50,14 +50,14 @@ def fit(
self.preprocessor = TfidfVectorizer(
min_df=self.min_df_absolute,
use_idf=self.use_idf,
ngram_range=(1, self.ngram_range),
ngram_range=(1, self.ngram_upper_bound),
)

elif self.min_df_choice == "min_df_relative":
self.preprocessor = TfidfVectorizer(
min_df=self.min_df_relative,
use_idf=self.use_idf,
ngram_range=(1, self.ngram_range),
ngram_range=(1, self.ngram_upper_bound),
)

else:
@@ -103,8 +103,8 @@ def get_hyperparameter_search_space(
dataset_properties: Optional[DATASET_PROPERTIES_TYPE] = None,
) -> ConfigurationSpace:
cs = ConfigurationSpace()
hp_ngram_range = CSH.UniformIntegerHyperparameter(
name="ngram_range", lower=1, upper=3, default_value=1
hp_ngram_upper_bound = CSH.UniformIntegerHyperparameter(
name="ngram_upper_bound", lower=1, upper=3, default_value=1
)
hp_use_idf = CSH.CategoricalHyperparameter("use_idf", choices=[False, True])
hp_min_df_choice = CSH.CategoricalHyperparameter(
@@ -118,7 +118,7 @@
)
cs.add_hyperparameters(
[
hp_ngram_range,
hp_ngram_upper_bound,
hp_use_idf,
hp_min_df_choice,
hp_min_df_absolute,
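The `use_idf` flag in the TfidfEncoder switches between raw term counts and idf-weighted counts. A small sketch of the smoothed idf formula that scikit-learn documents for `TfidfVectorizer`, with the L2 normalization step omitted and hypothetical helper names:

```python
import math

def idf_weighted(counts_per_doc, use_idf=True):
    """counts_per_doc: list of {term: count} dicts, one per document.
    Applies smoothed idf: log((1 + n) / (1 + df)) + 1, as documented
    for scikit-learn's TfidfVectorizer (L2 normalization omitted)."""
    n = len(counts_per_doc)
    df = {}  # document frequency of each term
    for doc in counts_per_doc:
        for term in doc:
            df[term] = df.get(term, 0) + 1
    result = []
    for doc in counts_per_doc:
        if use_idf:
            result.append(
                {t: c * (math.log((1 + n) / (1 + df[t])) + 1) for t, c in doc.items()}
            )
        else:
            result.append(dict(doc))  # plain term counts
    return result

docs = [{"spam": 2, "ham": 1}, {"ham": 1}]
print(idf_weighted(docs, use_idf=False))  # raw counts unchanged
print(idf_weighted(docs, use_idf=True))   # rarer "spam" weighted up
```

With `use_idf=False` the encoder behaves like a (normalized) bag-of-words count; with `use_idf=True` terms that occur in fewer documents receive proportionally higher weight.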
139 changes: 82 additions & 57 deletions examples/40_advanced/example_text_preprocessing.py
@@ -1,79 +1,104 @@
# -*- encoding: utf-8 -*-
"""
==================
Text Preprocessing
Text preprocessing
==================
This example shows how to use text features in *auto-sklearn*. *auto-sklearn* can automatically
encode text features if they are provided as string type in a pandas dataframe.

To process text features you need a pandas dataframe with the desired
text columns set to string dtype and the categorical columns set to category.
The following example shows how to fit a simple NLP problem with
*auto-sklearn*.

*auto-sklearn*'s text embedding creates bag-of-words counts.
For an introduction to text preprocessing you can follow these links:
1. https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html
2. https://machinelearningmastery.com/clean-text-machine-learning-python/
"""
from pprint import pprint

import pandas as pd
import sklearn.metrics
import sklearn.datasets
from sklearn.datasets import fetch_20newsgroups

import autosklearn.classification

############################################################################
# Data Loading
# ============

X, y = sklearn.datasets.fetch_openml(data_id=40945, return_X_y=True)

# by default, the columns which should be strings are not formatted as such
print(f"{X.info()}\n")

# manually convert these to string columns
X = X.astype(
{
"name": "string",
"ticket": "string",
"cabin": "string",
"boat": "string",
"home.dest": "string",
}
)

# now *auto-sklearn* handles the string columns with its text feature preprocessing pipeline

X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(
X, y, random_state=1
)
newsgroups_train = fetch_20newsgroups(subset="train", random_state=42, shuffle=True)
newsgroups_test = fetch_20newsgroups(subset="test")

# load train data
df_train = pd.DataFrame({"X": [], "y": []})

for idx, (text, target) in enumerate(
zip(newsgroups_train.data, newsgroups_train.target)
):
df_train = pd.concat(
[
df_train,
pd.DataFrame(
{"X": text, "y": newsgroups_train.target_names[target]}, index=[idx]
),
]
)

# explicitly label text column as string
X_train = df_train.astype({"X": "string", "y": "category"})

# show all 20 labels
print(list(newsgroups_train.target_names))

# reduce the example to only 5 labels
five_newsgroups_labels = list(newsgroups_train.target_names)[:5]

X_train = X_train[X_train["y"].isin(five_newsgroups_labels)]
y_train = X_train.pop("y")

# load test data
df_test = pd.DataFrame({"X": [], "y": []})

for idx, (text, target) in enumerate(zip(newsgroups_test.data, newsgroups_test.target)):
df_test = pd.concat(
[
df_test,
pd.DataFrame(
{"X": text, "y": newsgroups_train.target_names[int(target)]},
index=[idx],
),
]
)

# explicitly label text column as string
X_test = df_test.astype({"X": "string", "y": "category"})
X_test = X_test[X_test["y"].isin(five_newsgroups_labels)]
y_test = X_test.pop("y")

cls = autosklearn.classification.AutoSklearnClassifier(
time_left_for_this_task=30,
# Below, two flags are provided to speed up calculations.
# Not recommended for a real implementation.
initial_configurations_via_metalearning=0,
smac_scenario_args={"runcount_limit": 1},
############################################################################
# Build and fit a classifier
# ==========================

automl = autosklearn.classification.AutoSklearnClassifier(
# set the time high enough so that text preprocessing can create many new features
Review discussion on this line:

Contributor: Does 20 newsgroups work in the setting on the left? That would be preferable for running this example in the GitHub actions.

mfeurer (Contributor, Feb 23, 2022): Maybe we should also use a smaller dataset? You can use the following script to scan OpenML for datasets containing string data:

import openml

datasets = openml.datasets.list_datasets()
for did in datasets:
    try:
        dataset = openml.datasets.get_dataset(did, download_data=False, download_qualities=False)
        for feat in dataset.features:
            if dataset.features[feat].data_type == 'string':
                print(did, dataset.name)
                break
    except Exception as e:
        print(e)
        continue

Author (Collaborator): The example yields ~80% accuracy on the test set; selecting at random would be 5% for 20 labels, so I would say the example works. But it also runs 300 seconds, i.e. 5 minutes, so if that is too long I can search for another dataset.

Contributor: Sorry, I meant: would the example work when you restrict it to use only a single configuration?

Author (Collaborator): Is there a parameter for setting auto-sklearn to that, or is that max_time == time per model?

Contributor: I would read through the entire API and manual now that you have a bit more familiarity, to know what's possible and what's not:
https://automl.github.io/auto-sklearn/master/api.html

Contributor: It has been there in the previous version of the example: smac_scenario_args={"runcount_limit": 1}

time_left_for_this_task=300,
per_run_time_limit=30,
tmp_folder="/tmp/autosklearn_text_example_tmp",
)
automl.fit(X_train, y_train, dataset_name="20_Newsgroups")

cls.fit(X_train, y_train, X_test, y_test)

predictions = cls.predict(X_test)
print("Accuracy score", sklearn.metrics.accuracy_score(y_test, predictions))

############################################################################
# View the models found by auto-sklearn
# =====================================

X, y = sklearn.datasets.fetch_openml(data_id=40945, return_X_y=True, as_frame=True)
X = X.select_dtypes(exclude=["object"])
print(automl.leaderboard())

X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(
X, y, random_state=1
)
############################################################################
# Print the final ensemble constructed by auto-sklearn
# ====================================================

cls = autosklearn.classification.AutoSklearnClassifier(
time_left_for_this_task=30,
# Below, two flags are provided to speed up calculations.
# Not recommended for a real implementation.
initial_configurations_via_metalearning=0,
smac_scenario_args={"runcount_limit": 1},
)
pprint(automl.show_models(), indent=4)

cls.fit(X_train, y_train, X_test, y_test)
###########################################################################
# Get the Score of the final ensemble
# ===================================

predictions = cls.predict(X_test)
print(
"Accuracy score without text preprocessing",
sklearn.metrics.accuracy_score(y_test, predictions),
)
predictions = automl.predict(X_test)
print("Accuracy score:", sklearn.metrics.accuracy_score(y_test, predictions))
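As an aside on the loading code in this example: growing a frame with `pd.concat` inside a loop is quadratic; the same frame can be built in one call and filtered identically. A sketch with placeholder data standing in for the newsgroups lists (only pandas is assumed):

```python
import pandas as pd

# placeholder documents and labels standing in for
# newsgroups_train.data and target_names[target]
texts = ["first post", "second post", "third post"]
labels = ["alt.atheism", "comp.graphics", "alt.atheism"]

df = pd.DataFrame({"X": texts, "y": labels}).astype(
    {"X": "string", "y": "category"}  # string dtype triggers text preprocessing
)

# keep a subset of labels, then split off the target column
keep = ["alt.atheism"]
df = df[df["y"].isin(keep)]
y = df.pop("y")
print(df.shape)  # two rows remain, one text column
```

The `astype` step is the part auto-sklearn cares about: only columns with pandas string dtype are routed through the text feature preprocessing pipeline.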

Large diffs are not rendered by default.

@@ -20,7 +20,7 @@ def test_fit_transform(self):
}
).astype({"col1": "string", "col2": "string"})
BOW_fitted = BOW(
ngram_range=1,
ngram_upper_bound=1,
min_df_choice="min_df_absolute",
min_df_absolute=0,
min_df_relative=0,
@@ -46,7 +46,7 @@ def test_fit_transform(self):
np.testing.assert_array_equal(Yt, Y)

BOW_fitted = BOW_distinct(
ngram_range=1,
ngram_upper_bound=1,
min_df_choice="min_df_absolute",
min_df_absolute=0,
min_df_relative=0,
@@ -69,7 +69,7 @@ def test_transform(self):
}
).astype({"col1": "string", "col2": "string"})
X_t = BOW(
ngram_range=1,
ngram_upper_bound=1,
min_df_choice="min_df_absolute",
min_df_absolute=0,
min_df_relative=0,
@@ -81,7 +81,7 @@
np.testing.assert_array_equal(X_t.toarray(), y)

X_t = BOW_distinct(
ngram_range=1,
ngram_upper_bound=1,
min_df_choice="min_df_absolute",
min_df_absolute=0,
min_df_relative=0,
@@ -103,7 +103,7 @@ def test_check_shape(self):
}
).astype({"col1": "string", "col2": "string"})
X_t = BOW(
ngram_range=1,
ngram_upper_bound=1,
min_df_choice="min_df_absolute",
min_df_absolute=0,
min_df_relative=0,
@@ -113,7 +113,7 @@
self.assertEqual(X_t.shape, (2, 5))

X_t = BOW_distinct(
ngram_range=1,
ngram_upper_bound=1,
min_df_choice="min_df_absolute",
min_df_absolute=0,
min_df_relative=0,
@@ -130,7 +130,7 @@ def test_check_nan(self):
}
).astype({"col1": "string", "col2": "string"})
X_t = BOW(
ngram_range=1,
ngram_upper_bound=1,
min_df_choice="min_df_absolute",
min_df_absolute=0,
min_df_relative=0,
@@ -140,7 +140,7 @@
self.assertEqual(X_t.shape, (3, 5))

X_t = BOW_distinct(
ngram_range=1,
ngram_upper_bound=1,
min_df_choice="min_df_absolute",
min_df_absolute=0,
min_df_relative=0,