Change HP Name & Include Text example #1410
Changes from 11 commits
@@ -1,79 +1,104 @@
# -*- encoding: utf-8 -*-
"""
==================
Text Preprocessing
Text preprocessing
==================
This example shows, how to use text features in *auto-sklearn*. *auto-sklearn* can automatically
encode text features if they are provided as string type in a pandas dataframe.

For processing text features you need a pandas dataframe and set the desired
text columns to string and the categorical columns to category.
The following example shows how to fit a simple NLP problem with
*auto-sklearn*.

*auto-sklearn* text embedding creates a bag of words count.
For an introduction to text preprocessing you can follow these links:
1. https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html
2. https://machinelearningmastery.com/clean-text-machine-learning-python/
"""
from pprint import pprint

import pandas as pd
import sklearn.metrics
import sklearn.datasets
from sklearn.datasets import fetch_20newsgroups

import autosklearn.classification

############################################################################
# Data Loading
# ============

X, y = sklearn.datasets.fetch_openml(data_id=40945, return_X_y=True)

# by default, the columns which should be strings are not formatted as such
print(f"{X.info()}\n")

# manually convert these to string columns
X = X.astype(
    {
        "name": "string",
        "ticket": "string",
        "cabin": "string",
        "boat": "string",
        "home.dest": "string",
    }
)

# now *auto-sklearn* handles the string columns with its text feature preprocessing pipeline

X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(
    X, y, random_state=1
)
newsgroups_train = fetch_20newsgroups(subset="train", random_state=42, shuffle=True)
newsgroups_test = fetch_20newsgroups(subset="test")

# load train data
df_train = pd.DataFrame({"X": [], "y": []})

for idx, (text, target) in enumerate(
    zip(newsgroups_train.data, newsgroups_train.target)
):
    df_train = pd.concat(
        [
            df_train,
            pd.DataFrame(
                {"X": text, "y": newsgroups_train.target_names[target]}, index=[idx]
            ),
        ]
    )

# explicitly label text column as string
X_train = df_train.astype({"X": "string", "y": "category"})

# show all 20 labels
print(list(newsgroups_train.target_names))

# reduce the example to only 5 labels
five_newsgroups_labels = list(newsgroups_train.target_names)[:5]

X_train = X_train[~X_train["y"].isin(five_newsgroups_labels)]
y_train = X_train.pop("y")

# load test data
df_test = pd.DataFrame({"X": [], "y": []})

for idx, (text, target) in enumerate(zip(newsgroups_test.data, newsgroups_test.target)):
    df_test = pd.concat(
        [
            df_train,
            pd.DataFrame(
                {"X": text, "y": newsgroups_train.target_names[int(target)]},
                index=[idx],
            ),
        ]
    )

# explicitly label text column as string
X_test = df_test.astype({"X": "string", "y": "category"})
X_test = X_test[~X_test["y"].isin(five_newsgroups_labels)]
y_test = X_test.pop("y")

cls = autosklearn.classification.AutoSklearnClassifier(
    time_left_for_this_task=30,
    # Bellow two flags are provided to speed up calculations
    # Not recommended for a real implementation
    initial_configurations_via_metalearning=0,
    smac_scenario_args={"runcount_limit": 1},
############################################################################
# Build and fit a classifier
# ==========================

automl = autosklearn.classification.AutoSklearnClassifier(
    # set the time high enough text preprocessing can create many new features
    time_left_for_this_task=300,
    per_run_time_limit=30,
    tmp_folder="/tmp/autosklearn_text_example_tmp",
)
automl.fit(X_train, y_train, dataset_name="20_Newsgroups")

cls.fit(X_train, y_train, X_test, y_test)

predictions = cls.predict(X_test)
print("Accuracy score", sklearn.metrics.accuracy_score(y_test, predictions))

############################################################################
# View the models found by auto-sklearn
# =====================================

X, y = sklearn.datasets.fetch_openml(data_id=40945, return_X_y=True, as_frame=True)
X = X.select_dtypes(exclude=["object"])
print(automl.leaderboard())

X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(
    X, y, random_state=1
)
############################################################################
# Print the final ensemble constructed by auto-sklearn
# ====================================================

cls = autosklearn.classification.AutoSklearnClassifier(
    time_left_for_this_task=30,
    # Bellow two flags are provided to speed up calculations
    # Not recommended for a real implementation
    initial_configurations_via_metalearning=0,
    smac_scenario_args={"runcount_limit": 1},
)
pprint(automl.show_models(), indent=4)

cls.fit(X_train, y_train, X_test, y_test)
###########################################################################
# Get the Score of the final ensemble
# ===================================

predictions = cls.predict(X_test)
print(
    "Accuracy score without text preprocessing",
    sklearn.metrics.accuracy_score(y_test, predictions),
)
predictions = automl.predict(X_test)
print("Accuracy score:", sklearn.metrics.accuracy_score(y_test, predictions))
Does 20 newsgroups work in the setting on the left? That would be preferable for running this example in GitHub Actions.
Maybe we should also use a smaller dataset? You can use the following script to scan OpenML for datasets containing string data:
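The referenced script is not rendered in this view. A rough sketch of what such a scan could look like with the openml Python package follows; it is an assumption of how the scan might be written, not the reviewer's actual script, and the instance-count threshold and scan limit are arbitrary.

import openml

# list all datasets on OpenML as a dataframe and keep the small ones
datasets = openml.datasets.list_datasets(output_format="dataframe")
small = datasets[datasets["NumberOfInstances"] < 5000]

# inspect the feature metadata of each candidate and report string features
for did in small["did"].head(50):  # limit the scan for illustration
    try:
        ds = openml.datasets.get_dataset(did, download_data=False)
    except Exception:
        continue
    string_features = [f.name for f in ds.features.values() if f.data_type == "string"]
    if string_features:
        print(did, ds.name, string_features)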
The example yields ~80% accuracy on the test set; random guessing would be 5% for 20 labels, so I would say the example works. But it also runs for 300 seconds, i.e. 5 minutes. If that is too long, I can search for another dataset.
Sorry, I meant, would the example work when you restrict it to use only a single configuration?
Is there a parameter for setting auto-sklearn to that, or is that max_time == time per model?
I would read through the entire API and manual now that you have a bit more familiarity, to know what's possible and what's not:
https://automl.github.io/auto-sklearn/master/api.html
It has been there in the previous version of the example:
smac_scenario_args={"runcount_limit": 1}
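For reference, this is how the previous version of the example restricted the search to a single configuration; the surrounding argument values are placeholders taken from that version, not a recommendation.

import autosklearn.classification

# evaluate only one configuration, which keeps the example cheap enough for CI
automl = autosklearn.classification.AutoSklearnClassifier(
    time_left_for_this_task=30,
    # the two flags below speed up the run and are not recommended for real use
    initial_configurations_via_metalearning=0,
    smac_scenario_args={"runcount_limit": 1},
)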