Our own Partial Dependence Implementation #2834

Merged · 5 commits · Sep 29, 2021
1 change: 1 addition & 0 deletions docs/source/release_notes.rst
@@ -13,6 +13,7 @@ Release Notes
* Added support for linting jupyter notebooks and clearing the executed cells and empty cells :pr:`2829` :pr:`2837`
* Added "DROP_ROWS" action to output of ``OutliersDataCheck.validate()`` :pr:`2820`
* Added the ability of ``AutoMLSearch`` to accept a ``SequentialEngine`` instance as engine input :pr:`2838`
* Added our own partial dependence implementation :pr:`2834`
* Fixes
* Fixed bug where ``calculate_permutation_importance`` was not calculating the right value for pipelines with target transformers :pr:`2782`
* Fixed bug where transformed target values were not used in ``fit`` for time series pipelines :pr:`2780`
251 changes: 251 additions & 0 deletions evalml/model_understanding/_partial_dependence.py
@@ -0,0 +1,251 @@
"""Partial dependence implementation.

Borrows from sklearn's "brute" calculation, with our own modifications to better
handle mixed data types in the grid as well as EvalML pipelines.
"""
import numpy as np
import pandas as pd
from scipy.stats.mstats import mquantiles

from evalml.problem_types import is_regression


def _range_for_dates(X_dt, percentiles, grid_resolution):
"""Compute the range of values used in partial dependence for datetime features.

Interpolate between the percentiles of the dates converted to unix
timestamps.

Args:
X_dt (pd.DataFrame): Datetime features in original data. We currently
only support X_dt having a single column.
percentiles (tuple float): Percentiles to interpolate between.
grid_resolution (int): Number of points in range.

Returns:
pd.Series: Range of dates between percentiles.
"""
    timestamps = np.array(
        (X_dt - pd.Timestamp("1970-01-01")) // np.timedelta64(1, "s")
    ).reshape(-1, 1)
    timestamps = pd.DataFrame(timestamps)

Contributor:
I don't know why, but I fixated on this, probably because I come from a natural science background. Is it worth leaving the reference date and the quantum of time as variables? I don't think any of our common or current use cases extend to people doing time series modeling on a chemical-reaction timescale (~milli/microseconds), but I can definitely see pharma customers being interested in it.

Let me know what you think. I don't think we necessarily have to do the work here, but it might be nice to at least talk about it.

Contributor Author:
Fantastic point @chukarsten! I think what this is getting at is making our internal custom_range parameter public. There can be value in letting users specify how the grid for their features is computed! I will file a separate issue to track that.

grid, values = _grid_from_X(
timestamps,
percentiles=percentiles,
grid_resolution=grid_resolution,
custom_range={},
)
grid_dates = pd.to_datetime(pd.Series(grid.squeeze()), unit="s")
return grid_dates
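
A minimal usage sketch of _range_for_dates with made-up data (illustrative only; the toy column name "ds" is an assumption):

    dates = pd.DataFrame({"ds": pd.date_range("2020-01-01", periods=365, freq="D")})
    grid = _range_for_dates(dates, percentiles=(0.05, 0.95), grid_resolution=10)
    # grid is a pd.Series of 10 timestamps interpolated between the
    # 5th- and 95th-percentile dates of the original column.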


def _grid_from_X(X, percentiles, grid_resolution, custom_range):
"""Create cartesian product of all the columns of input dataframe X.

Args:
X (pd.DataFrame): Input data
percentiles (tuple float): Percentiles to use as endpoints of the grid
for each feature.
grid_resolution (int): How many points to interpolate between percentiles.
custom_range (dict[str, np.ndarray]): Mapping from column name in X to
range of values to use in partial dependence. If custom_range is specified,
the percentile + interpolation procedure is skipped and the values in custom_range
are used.

Returns:
        tuple(pd.DataFrame, list): The cartesian product of the computed feature
            ranges (as a DataFrame), and the list of value ranges used for each feature.
"""
values = []
for feature in X.columns:
if feature in custom_range:
# Use values in the custom range
feature_range = custom_range[feature]
if not isinstance(feature_range, (np.ndarray, pd.Series)):
feature_range = np.array(feature_range)
if feature_range.ndim != 1:
                raise ValueError(
                    "Grid for feature {} is not a one-dimensional array. Got {}"
                    " dimensions".format(feature, feature_range.ndim)
                )

Contributor Author:
I'm OK if this isn't covered. It's impossible to trigger this as a user because custom_range is not a public parameter, but I'd like to keep this check in case we refactor in the future. It helped catch a couple of bugs during development.

axis = feature_range
else:
uniques = np.unique(X.loc[:, feature])
if uniques.shape[0] < grid_resolution:
                # feature has low resolution; use its unique values
axis = uniques
else:
# create axis based on percentiles and grid resolution
emp_percentiles = mquantiles(
X.loc[:, feature], prob=percentiles, axis=0
)
if np.allclose(emp_percentiles[0], emp_percentiles[1]):
raise ValueError(
"percentiles are too close to each other, "
"unable to build the grid. Please choose percentiles "
"that are further apart."
)
axis = np.linspace(
emp_percentiles[0],
emp_percentiles[1],
num=grid_resolution,
endpoint=True,
)
values.append(axis)

return _cartesian(values), values
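
A sketch of the two paths above with toy data: a high-resolution numeric column gets the percentile grid, while a low-cardinality column contributes its unique values directly:

    X = pd.DataFrame({"num": np.arange(100.0), "cat": ["a", "b"] * 50})
    grid, values = _grid_from_X(
        X, percentiles=(0.05, 0.95), grid_resolution=5, custom_range={}
    )
    # values[0] holds 5 points between the 5th/95th percentiles of "num";
    # values[1] is ["a", "b"]; grid is their cartesian product (10 rows).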


def _cartesian(arrays):
"""Create cartesian product of elements of arrays list.

Stored in a dataframe to allow mixed types like dates/str/numeric.

Args:
arrays (list(np.ndarray)): Arrays.

Returns:
pd.DataFrame: Cartesian product of arrays.
"""
arrays = [np.asarray(x) for x in arrays]
    shape = tuple(len(x) for x in arrays)

ix = np.indices(shape)
ix = ix.reshape(len(arrays), -1).T

out = pd.DataFrame()

    for n, arr in enumerate(arrays):
        out[n] = arr[ix[:, n]]

return out
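
For example, two short arrays of different types yield a four-row DataFrame, with each column keeping its own dtype:

    _cartesian([np.array([1, 2]), np.array(["a", "b"])])
    #    0  1
    # 0  1  a
    # 1  1  b
    # 2  2  a
    # 3  2  b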
Comment on lines +110 to +121

Contributor:
This seems to be the same as https://github.com/scikit-learn/scikit-learn/blob/844b4be24d20fc42cc13b957374c718956a0db39/sklearn/utils/extmath.py#L655 except we return a dataframe, and since it's a public sklearn method, we could just import it. Whatcha think? Also totally down to take their impl 😂

Contributor:
This is a great idea. I know that as we first trudged through partial dependence we borrowed a lot, perhaps a bit more from some private methods than I would like, but it was necessary. If we can refactor to use their public methods, that's great.

Contributor Author (freddyaboulton, Sep 28, 2021):
@angela97lin Great point. Originally I wanted to use their method, but the problem is that numpy arrays cannot handle mixed types very well, so if we want a grid of categoricals and datetimes, storing it in a numpy array won't really work.

[screenshot illustrating the mixed-type coercion]

There may be a way around it I'm not seeing (maybe this), but IMO that's a nice-to-have as opposed to a requirement?
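
To make the point concrete, a toy snippet standing in for the screenshot above:

    import numpy as np
    import pandas as pd

    # A single numpy array must settle on one dtype for mixed content, so the
    # timestamp loses its datetime64 semantics and becomes a generic object:
    np.array([pd.Timestamp("2020-01-01"), "category_a"], dtype=object)

    # A DataFrame keeps a datetime64 column and an object column side by side:
    pd.DataFrame({"dates": pd.to_datetime(["2020-01-01"]), "cats": ["category_a"]}).dtypes
    # dates    datetime64[ns]
    # cats             object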



def _partial_dependence_calculation(pipeline, grid, features, X):
"""Do the partial dependence calculation once the grid is computed.

Args:
        pipeline (PipelineBase): Fitted pipeline.
        grid (pd.DataFrame): Grid of features to compute the partial dependence on.
        features (list(str)): Column names of X corresponding to the columns of the grid.
        X (pd.DataFrame): Input data.

Returns:
Tuple (np.ndarray, np.ndarray): averaged and individual predictions for
all points in the grid.
"""
predictions = []
averaged_predictions = []

if is_regression(pipeline.problem_type):
prediction_method = pipeline.predict
else:
prediction_method = pipeline.predict_proba

X_eval = X.copy()
for _, new_values in grid.iterrows():
for i, variable in enumerate(features):
X_eval.loc[:, variable] = new_values[i]

pred = prediction_method(X_eval)

predictions.append(pred)
# average over samples
averaged_predictions.append(np.mean(pred, axis=0))

n_samples = X.shape[0]

# reshape to (n_instances, n_points) for binary/regression
# reshape to (n_classes, n_instances, n_points) for multiclass
predictions = np.array(predictions).T
if is_regression(pipeline.problem_type) and predictions.ndim == 2:
predictions = predictions.reshape(n_samples, -1)
elif predictions.shape[0] == 2:
predictions = predictions[1]
predictions = predictions.reshape(n_samples, -1)

# reshape averaged_predictions to (1, n_points) for binary/regression
# reshape averaged_predictions to (n_classes, n_points) for multiclass.
averaged_predictions = np.array(averaged_predictions).T
if is_regression(pipeline.problem_type) and averaged_predictions.ndim == 1:
averaged_predictions = averaged_predictions.reshape(1, -1)
elif averaged_predictions.shape[0] == 2:
averaged_predictions = averaged_predictions[1]
averaged_predictions = averaged_predictions.reshape(1, -1)

return averaged_predictions, predictions
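
A shape sketch of the binary branch above, with toy numbers (3 rows in X, a 4-point grid):

    # Each predict_proba call returns (3, 2); stacking the 4 grid points and
    # transposing gives (2, 3, 4) == (n_classes, n_samples, n_points).
    preds = [np.random.rand(3, 2) for _ in range(4)]
    stacked = np.array(preds).T
    positive = stacked[1].reshape(3, -1)  # keep the positive class: (3, 4)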


def _partial_dependence(
pipeline,
X,
features,
percentiles=(0.05, 0.95),
grid_resolution=100,
kind="average",
custom_range=None,
):
"""Compute the partial dependence for features of X.

Args:
        pipeline (PipelineBase): Fitted pipeline.
        X (pd.DataFrame): Holdout data.
        features (list(str)): Column names of X to compute the partial dependence for.
        percentiles (tuple float): Percentiles to use in the range calculation for a
            given feature.
        grid_resolution (int): Number of points in the range of values used for each
            feature in the partial dependence calculation.
        kind (str): The type of predictions to return. One of "average", "individual",
            or "both".
custom_range (dict[str, np.ndarray]): Mapping from column name in X to
range of values to use in partial dependence. If custom_range is specified,
the percentile + interpolation procedure is skipped and the values in custom_range
are used.

Returns:
        dict: Contains a "values" key plus "average" and/or "individual" keys,
            depending on ``kind``. "values" is a list of the values used in the
            partial dependence for each feature; "average" and "individual" hold
            the averaged and individual predictions for each point in the grid.
"""
if grid_resolution <= 1:
raise ValueError("'grid_resolution' must be strictly greater than 1.")

custom_range = custom_range or {}
custom_range = {
feature: custom_range.get(feature)
for feature in features
if feature in custom_range
}
grid, values = _grid_from_X(
X.loc[:, features],
percentiles,
grid_resolution,
custom_range,
)
averaged_predictions, predictions = _partial_dependence_calculation(
pipeline,
grid,
features,
X,
)

# reshape predictions to
# (n_outputs, n_instances, n_values_feature_0, n_values_feature_1, ...)
predictions = predictions.reshape(-1, X.shape[0], *[val.shape[0] for val in values])

# reshape averaged_predictions to
# (n_outputs, n_values_feature_0, n_values_feature_1, ...)
averaged_predictions = averaged_predictions.reshape(
-1, *[val.shape[0] for val in values]
)

if kind == "average":
return {"average": averaged_predictions, "values": values}
elif kind == "individual":
return {"individual": predictions, "values": values}
else: # kind='both'
return {
"average": averaged_predictions,
"individual": predictions,
"values": values,
}
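
A hypothetical call, assuming a fitted EvalML pipeline and a numeric feature named "age" with at least 20 unique values (names illustrative):

    dep = _partial_dependence(pipeline, X, features=["age"], grid_resolution=20, kind="both")
    dep["values"][0]         # the grid of values used for "age"
    dep["average"].shape     # (n_outputs, 20)
    dep["individual"].shape  # (n_outputs, n_rows_in_X, 20)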