Remove pipeline_parameters and custom_hyperparameters and replace with search_parameters #3373

Merged: 42 commits from bc_search_parameters into main, Mar 24, 2022
Changes from 8 commits

Commits (42)
02d717e  initial impl: (bchen1116, Mar 14, 2022)
85e0513  update release notes (bchen1116, Mar 14, 2022)
ddc5bb6  fix notebook (bchen1116, Mar 14, 2022)
5cd9917  fix docs (bchen1116, Mar 14, 2022)
8c47ffb  add test (bchen1116, Mar 14, 2022)
70f8e5c  update test (bchen1116, Mar 15, 2022)
e7fcb92  update impl (bchen1116, Mar 15, 2022)
85b5374  Merge branch 'main' into bc_search_parameters (bchen1116, Mar 15, 2022)
f21f012  update docs (bchen1116, Mar 15, 2022)
2d97b9f  Merge branch 'main' into bc_search_parameters (bchen1116, Mar 15, 2022)
ad0e0fb  update implementation to use tuner to suggest (bchen1116, Mar 15, 2022)
e770926  Merge branch 'bc_search_parameters' of github.com:alteryx/evalml into… (bchen1116, Mar 15, 2022)
828a8f4  make changes to how automl algo handles pipelines (bchen1116, Mar 16, 2022)
e1dda1f  update test (bchen1116, Mar 16, 2022)
eaa0663  Merge branch 'main' into bc_search_parameters (bchen1116, Mar 16, 2022)
ce0db41  rerun test (bchen1116, Mar 16, 2022)
844b042  update docstring (bchen1116, Mar 16, 2022)
cb26e74  fix import (bchen1116, Mar 16, 2022)
6e4a4f7  address codecov (bchen1116, Mar 16, 2022)
204c0d8  Merge branch 'main' into bc_search_parameters (bchen1116, Mar 17, 2022)
74094c0  Merge branch 'main' into bc_search_parameters (bchen1116, Mar 21, 2022)
a760c8a  Merge branch 'main' into bc_search_parameters (bchen1116, Mar 21, 2022)
b975315  merging changes (bchen1116, Mar 22, 2022)
4050ed8  pipeline parameters (bchen1116, Mar 22, 2022)
49432a2  lint (bchen1116, Mar 22, 2022)
d68c940  testing (bchen1116, Mar 22, 2022)
26860c6  test commit hook (bchen1116, Mar 22, 2022)
9b2d682  Merge branch 'main' into bc_search_parameters (bchen1116, Mar 22, 2022)
e953b06  Merge branch 'main' into bc_search_parameters (bchen1116, Mar 22, 2022)
a7545cf  Merge branch 'main' of github.com:alteryx/evalml (bchen1116, Mar 22, 2022)
be7b26a  update with comments (bchen1116, Mar 22, 2022)
e557fdf  lint (bchen1116, Mar 22, 2022)
fc1d36a  update tuner (bchen1116, Mar 23, 2022)
a34d027  update (bchen1116, Mar 23, 2022)
3a39f0e  Merge branch 'main' of github.com:alteryx/evalml (bchen1116, Mar 23, 2022)
5703bbe  merge (bchen1116, Mar 23, 2022)
e75fec9  fix docs (bchen1116, Mar 23, 2022)
e053263  update with comments (bchen1116, Mar 23, 2022)
b52d9d1  lint (bchen1116, Mar 23, 2022)
73952c8  fix test (bchen1116, Mar 23, 2022)
2993bcb  rerun test (bchen1116, Mar 24, 2022)
2847634  Merge branch 'main' into bc_search_parameters (bchen1116, Mar 24, 2022)
2 changes: 2 additions & 0 deletions docs/source/release_notes.rst
@@ -5,9 +5,11 @@
* Enhancements
* Added ``TimeSeriesFeaturizer`` into ARIMA-based pipelines :pr:`3313`
* Added caching capability for ensemble training during ``AutoMLSearch`` :pr:`3257`
* Replaced ``pipeline_parameters`` and ``custom_hyperparameters`` with ``search_parameters`` in ``AutoMLSearch`` :pr:`3373`
* Added new error code for zero unique values in ``NoVarianceDataCheck`` :pr:`3372`
* Fixes
* Fixed ``get_pipelines`` to reset pipeline threshold for binary cases :pr:`3360`
* Simplified internal ``AutoMLSearch`` API to rely on ``search_parameters`` :pr:`3373`
* Changes
* Update maintainers :pr:`3365`
* Documentation Changes
56 changes: 39 additions & 17 deletions docs/source/user_guide/automl.ipynb
@@ -477,11 +477,11 @@
"metadata": {},
"source": [
"## Limiting the AutoML Search Space\n",
"The AutoML search algorithm first trains each component in the pipeline with their default values. After the first iteration, it then tweaks the parameters of these components using the pre-defined hyperparameter ranges that these components have. To limit the search over certain hyperparameter ranges, you can specify a `custom_hyperparameters` argument with your `AutoMLSearch` parameters. These parameters will limit the hyperparameter search space. \n",
"The AutoML search algorithm first trains each component in the pipeline with their default values. After the first iteration, it then tweaks the parameters of these components using the pre-defined hyperparameter ranges that these components have. To limit the search over certain hyperparameter ranges, you can specify a `search_parameters` argument with your `AutoMLSearch` parameters. These parameters will limit the hyperparameter search space or pipeline parameter space. \n",
"\n",
"Hyperparameter ranges can be found through the [API reference](https://evalml.alteryx.com/en/stable/api_reference.html) for each component. Parameter arguments must be specified as dictionaries, but the associated values can be single values or `skopt.space` Real, Integer, Categorical values.\n",
"Hyperparameter ranges can be found through the [API reference](https://evalml.alteryx.com/en/stable/api_reference.html) for each component. Parameter arguments must be specified as dictionaries, but the associated values must be `skopt.space` Real, Integer, Categorical values for setting hyperparameter values.\n",
"\n",
"If however you'd like to specify certain values for the initial batch of the AutoML search algorithm, you can use the `pipeline_parameters` argument. This will set the initial batch's component parameters to the values passed by this argument."
"If however you'd like to specify certain values for the initial batch of the AutoML search algorithm, you can use the `search_parameters` argument with non `skopt.space` objects. This will set the initial batch's component parameters to the values passed by this argument."
]
},
{
@@ -499,30 +499,52 @@
"X, y = load_fraud(n_rows=1000)\n",
"\n",
"# example of setting parameter to just one value\n",
"custom_hyperparameters = {'Imputer': {\n",
"search_parameters = {'Imputer': {\n",
" 'numeric_impute_strategy': 'mean'\n",
"}}\n",
"\n",
"\n",
"# limit the numeric impute strategy to include only `median` and `most_frequent`\n",
"# `mean` is the default value for this argument, but it doesn't need to be included in the specified hyperparameter range for this to work\n",
"custom_hyperparameters = {'Imputer': {\n",
"search_parameters = {'Imputer': {\n",
" 'numeric_impute_strategy': Categorical(['median', 'most_frequent'])\n",
"}}\n",
"# set the initial batch numeric impute strategy strategy to 'median'\n",
"pipeline_parameters = {'Imputer': {\n",
" 'numeric_impute_strategy': 'median'\n",
"}}\n",
"\n",
"# using this custom hyperparameter means that our Imputer components in these pipelines will only search through\n",
"# 'median' and 'most_frequent' strategies for 'numeric_impute_strategy', and the initial batch parameter will be\n",
"# set to 'median'\n",
"# 'median' and 'most_frequent' strategies for 'numeric_impute_strategy'\n",
"automl_constrained = AutoMLSearch(X_train=X, y_train=y, problem_type='binary', \n",
" pipeline_parameters=pipeline_parameters,\n",
" custom_hyperparameters=custom_hyperparameters, \n",
" search_parameters=search_parameters,\n",
" verbose=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"`search_parameters` can set both hyperparameter ranges and pipeline parameters. To set the hyperparameter space, an `skopt.space` Integer, Real, or Categorical object must be used. All other values will be associated to setting the pipeline parameters directly.\n",
"\n",
"Let's walk through some examples to explain this. For instance,\n",
"```python\n",
"search_parameters = {'Imputer': {\n",
" 'numeric_impute_strategy': 'mean'\n",
"}}\n",
"```\n",
"then in the initial search, the algorithm would use `mean` as the impute strategy in batch 1. However, since `Imputer.numeric_impute_strategy` has a valid hyperparameter range, if the algorithm suggests a different strategy, it can and will change this value. To limit this to using `mean` only for the duration of the search, it is necessary to use the `skopt.space`:\n",
"```python\n",
"search_parameters = {'Imputer': {\n",
" 'numeric_impute_strategy': Categorical(['mean'])\n",
"}}\n",
"```\n",
"\n",
"However, if a value has no hyperparameter range associated, then the algorithm will use this value as the only parameter. For instance,\n",
"```python\n",
"search_parameters = {'Label Encoder': {\n",
" 'positive_label': True\n",
"}}\n",
"```\n",
"Since `Label Encoder.positive_label` has no associated hyperparameter range, the algorithm will use this parameter for the entire duration of the search."
]
},
{
"cell_type": "markdown",
"metadata": {},
@@ -579,16 +601,16 @@
"# for the oversampler, we don't want to oversample this class, so class 0 (majority) will have a ratio of 1 to itself\n",
"# for the minority class 1, we want to oversample it to have a minority/majority ratio of 0.5, which means we want minority to have 1/2 the samples as the minority\n",
"sampler_ratio_dict = {0: 1, 1: 0.5}\n",
"pipeline_parameters = {\"Oversampler\": {\"sampler_balanced_ratio\": sampler_ratio_dict}}\n",
"automl_auto_ratio_dict = AutoMLSearch(X_train=X, y_train=y, problem_type='binary', pipeline_parameters=pipeline_parameters, automl_algorithm='iterative')\n",
"search_parameters = {\"Oversampler\": {\"sampler_balanced_ratio\": sampler_ratio_dict}}\n",
"automl_auto_ratio_dict = AutoMLSearch(X_train=X, y_train=y, problem_type='binary', search_parameters=search_parameters, automl_algorithm='iterative')\n",
"automl_auto_ratio_dict.allowed_pipelines[-1]\n",
"\n",
"# Undersampler case\n",
"# we don't want to undersample this class, so class 1 (minority) will have a ratio of 1 to itself\n",
"# for the majority class 0, we want to undersample it to have a minority/majority ratio of 0.5, which means we want majority to have 2x the samples as the minority\n",
"# sampler_ratio_dict = {0: 0.5, 1: 1}\n",
"# pipeline_parameters = {\"Oversampler\": {\"sampler_balanced_ratio\": sampler_ratio_dict}}\n",
"# automl_auto_ratio_dict = AutoMLSearch(X_train=X, y_train=y, problem_type='binary', pipeline_parameters=pipeline_parameters)\n"
"# search_parameters = {\"Oversampler\": {\"sampler_balanced_ratio\": sampler_ratio_dict}}\n",
"# automl_auto_ratio_dict = AutoMLSearch(X_train=X, y_train=y, problem_type='binary', search_parameters=search_parameters)\n"
]
},
{
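The notebook cells above fold the two old arguments into a single dict whose value types select the behavior. Below is a minimal, self-contained sketch of the merged API (an editorial illustration, not part of the PR diff), using the evalml and skopt imports the notebook already relies on:

```python
from skopt.space import Categorical

from evalml import AutoMLSearch
from evalml.demos import load_fraud

X, y = load_fraud(n_rows=1000)

# One dict now drives both behaviors:
# - skopt.space objects (Categorical/Integer/Real) constrain the tuner's search space
# - plain values pin a parameter: for the first batch if the parameter has a
#   hyperparameter range, or for the whole search if it has none (e.g. positive_label)
search_parameters = {
    "Imputer": {"numeric_impute_strategy": Categorical(["median", "most_frequent"])},
    "Label Encoder": {"positive_label": True},
}

automl = AutoMLSearch(
    X_train=X,
    y_train=y,
    problem_type="binary",
    search_parameters=search_parameters,
)
```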
86 changes: 75 additions & 11 deletions evalml/automl/automl_algorithm/automl_algorithm.py
@@ -1,6 +1,9 @@
"""Base class for the AutoML algorithms which power EvalML."""
import inspect
from abc import ABC, abstractmethod

from skopt.space import Categorical, Integer, Real

from evalml.exceptions import PipelineNotFoundError
from evalml.pipelines.utils import _make_stacked_ensemble_pipeline
from evalml.problem_types import is_multiclass
@@ -22,7 +25,7 @@ class AutoMLAlgorithm(ABC):

Args:
allowed_pipelines (list(class)): A list of PipelineBase subclasses indicating the pipelines allowed in the search. The default of None indicates all pipelines for this problem type are allowed.
custom_hyperparameters (dict): Custom hyperparameter ranges specified for pipelines to iterate over.
search_parameters (dict): Search parameter ranges specified for pipelines to iterate over.
tuner_class (class): A subclass of Tuner, to be used to find parameters for each pipeline. The default of None indicates the SKOptTuner will be used.
text_in_ensembling (boolean): If True and ensembling is True, then n_jobs will be set to 1 to avoid downstream sklearn stacking issues related to nltk. Defaults to None.
random_seed (int): Seed for the random number generator. Defaults to 0.
@@ -31,7 +34,7 @@ def __init__(
def __init__(
self,
allowed_pipelines=None,
custom_hyperparameters=None,
search_parameters=None,
tuner_class=None,
text_in_ensembling=False,
random_seed=0,
@@ -45,9 +48,27 @@ def __init__(
self.text_in_ensembling = text_in_ensembling
self.n_jobs = n_jobs
self._selected_cols = None
self.search_parameters = search_parameters or {}
self._hyperparameters = {}
self._pipeline_parameters = {}

# separate out the parameter and hyperparameter values
for key, value in self.search_parameters.items():
hyperparam = {}
param = {}
for name, parameters in value.items():
if isinstance(parameters, (Integer, Categorical, Real)):
hyperparam[name] = parameters
else:
param[name] = parameters
if hyperparam:
self._hyperparameters[key] = hyperparam
if param:
self._pipeline_parameters[key] = param

for pipeline in self.allowed_pipelines:
pipeline_hyperparameters = pipeline.get_hyperparameter_ranges(
custom_hyperparameters
self._hyperparameters
)
self._tuners[pipeline.name] = self._tuner_class(
pipeline_hyperparameters, random_seed=self.random_seed
@@ -64,14 +85,57 @@ def next_batch(self):
list[PipelineBase]: A list of instances of PipelineBase subclasses, ready to be trained and evaluated.
"""

@abstractmethod
def _transform_parameters(self, pipeline, proposed_parameters):
"""Given a pipeline parameters dict, make sure pipeline_params, custom_hyperparameters, n_jobs are set properly.
"""Given a pipeline parameters dict, make sure pipeline_parameters, custom_hyperparameters, n_jobs are set properly.

Arguments:
pipeline (PipelineBase): The pipeline object to update the parameters.
proposed_parameters (dict): Parameters to use when updating the pipeline.
"""
parameters = {}
if "pipeline" in self._pipeline_parameters:
parameters["pipeline"] = self._pipeline_parameters["pipeline"]

for (
name,
component_instance,
) in pipeline.component_graph.component_instances.items():
Review comment (Contributor):

@bchen1116 This code block is doing two things:

  1. Getting random values from the skopt spaces so that the parameters used in the first batch are in the space the tuner is tuning over
  2. Making sure the _pipeline_parameters are correctly added to the parameters so that Drop Columns etc. get the right parameters

I think this would be simpler if 1 was a tuner method, like get_starting_parameters? (A sketch of this idea follows the diff below.)

component_class = type(component_instance)
component_parameters = proposed_parameters.get(name, {})
init_params = inspect.signature(component_class.__init__).parameters
# For first batch, pass the pipeline params to the components that need them
if name in self.search_parameters and name not in component_parameters:
# only write the value if the name does not already exist in the proposed parameters
for param_name, value in self.search_parameters[name].items():
if isinstance(value, (Integer, Real)):
# get a random value in the space
component_parameters[param_name] = value.rvs(
random_state=self.random_seed
)[0]
elif isinstance(value, Categorical):
# Categorical
component_parameters[param_name] = value.rvs(
random_state=self.random_seed
)
else:
# we set the pipeline parameter value directly
component_parameters[param_name] = value
# Inspects each component and adds the following parameters when needed
if "n_jobs" in init_params:
component_parameters["n_jobs"] = self.n_jobs
try:
if "number_features" in init_params:
component_parameters["number_features"] = self.number_features
except AttributeError:
continue
if "pipeline" in self.search_parameters:
for param_name, value in self.search_parameters["pipeline"].items():
if param_name in init_params:
component_parameters[param_name] = value
parameters[name] = component_parameters
return parameters

def add_result(self, score_to_minimize, pipeline, trained_pipeline_results):
"""Register results from evaluating a pipeline.
@@ -131,31 +195,31 @@ def _create_ensemble(self):

def _set_additional_pipeline_params(self):
drop_columns = (
self._pipeline_params["Drop Columns Transformer"]["columns"]
if "Drop Columns Transformer" in self._pipeline_params
self.search_parameters["Drop Columns Transformer"]["columns"]
if "Drop Columns Transformer" in self.search_parameters
else None
)
index_and_unknown_columns = list(
self.X.ww.select(["index", "unknown"], return_schema=True).columns
)
unknown_columns = list(self.X.ww.select("unknown", return_schema=True).columns)
if len(index_and_unknown_columns) > 0 and drop_columns is None:
self._pipeline_params["Drop Columns Transformer"] = {
self.search_parameters["Drop Columns Transformer"] = {
"columns": index_and_unknown_columns
}
if len(unknown_columns):
self.logger.info(
f"Removing columns {unknown_columns} because they are of 'Unknown' type"
)
kina_columns = self._pipeline_params.get("pipeline", {}).get(
kina_columns = self.search_parameters.get("pipeline", {}).get(
"known_in_advance", []
)
if kina_columns:
no_kin_columns = [c for c in self.X.columns if c not in kina_columns]
kin_name = "Known In Advance Pipeline - Select Columns Transformer"
no_kin_name = "Not Known In Advance Pipeline - Select Columns Transformer"
self._pipeline_params[kin_name] = {"columns": kina_columns}
self._pipeline_params[no_kin_name] = {"columns": no_kin_columns}
self.search_parameters[kin_name] = {"columns": kina_columns}
self.search_parameters[no_kin_name] = {"columns": no_kin_columns}

def _filter_estimators(
self,
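The review comment above proposes moving the space-sampling logic into the tuner. Below is a hypothetical sketch of that refactor; `get_starting_parameters` is the reviewer's suggested name, not an existing Tuner method, and the sampling mirrors the `rvs` logic in the diff. As a standalone function it would be adopted as a method on the Tuner class:

```python
from skopt.space import Categorical, Integer, Real


def get_starting_parameters(search_parameters, random_seed=0):
    """Draw first-batch values from the search space (would live on Tuner).

    Plain values pass through unchanged; skopt.space dimensions are sampled so
    that the first batch starts inside the space the tuner will explore.
    """
    starting_parameters = {}
    for component_name, params in search_parameters.items():
        component_params = {}
        for param_name, value in params.items():
            if isinstance(value, (Integer, Real)):
                # numeric dimensions: rvs returns a length-1 sequence, take the sample
                component_params[param_name] = value.rvs(random_state=random_seed)[0]
            elif isinstance(value, Categorical):
                # Categorical.rvs returns a single category by default
                component_params[param_name] = value.rvs(random_state=random_seed)
            else:
                # plain value: use it directly as the pipeline parameter
                component_params[param_name] = value
        starting_parameters[component_name] = component_params
    return starting_parameters


# usage sketch
params = get_starting_parameters(
    {"Imputer": {"numeric_impute_strategy": Categorical(["median", "most_frequent"])}}
)
```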