Our own Partial Dependence Implementation #2834
"""Partial dependence implementation. | ||
|
||
Borrows from sklearn "brute" calculation but with our | ||
own modification to better handle mixed data types in the grid | ||
as well as EvalML pipelines. | ||
""" | ||
import numpy as np | ||
import pandas as pd | ||
from scipy.stats.mstats import mquantiles | ||
|
||
from evalml.problem_types import is_regression | ||
|
||
|
||
def _range_for_dates(X_dt, percentiles, grid_resolution): | ||
"""Compute the range of values used in partial dependence for datetime features. | ||
|
||
Interpolate between the percentiles of the dates converted to unix | ||
timestamps. | ||
|
||
Args: | ||
X_dt (pd.DataFrame): Datetime features in original data. We currently | ||
only support X_dt having a single column. | ||
percentiles (tuple float): Percentiles to interpolate between. | ||
grid_resolution (int): Number of points in range. | ||
|
||
Returns: | ||
pd.Series: Range of dates between percentiles. | ||
""" | ||
timestamps = np.array( | ||
[X_dt - pd.Timestamp("1970-01-01")] // np.timedelta64(1, "s") | ||
).reshape(-1, 1) | ||
timestamps = pd.DataFrame(timestamps) | ||
grid, values = _grid_from_X( | ||
timestamps, | ||
percentiles=percentiles, | ||
grid_resolution=grid_resolution, | ||
custom_range={}, | ||
) | ||
grid_dates = pd.to_datetime(pd.Series(grid.squeeze()), unit="s") | ||
return grid_dates | ||
|
||
|
||
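For context on the unix-timestamp round trip, here is a minimal usage sketch (not part of the PR; it assumes `_range_for_dates` and `_grid_from_X` as defined in this module, and the column name and dates are made up):

```python
import pandas as pd

X_dt = pd.DataFrame(
    {"created_at": pd.date_range("2021-01-01", periods=100, freq="D")}
)
# Dates are converted to seconds since 1970-01-01, interpolated between the
# 5th and 95th percentiles, then converted back to dates.
grid_dates = _range_for_dates(X_dt, percentiles=(0.05, 0.95), grid_resolution=5)
print(grid_dates)  # 5 evenly spaced dates between the percentile endpoints
```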
```python
def _grid_from_X(X, percentiles, grid_resolution, custom_range):
    """Create cartesian product of all the columns of input dataframe X.

    Args:
        X (pd.DataFrame): Input data.
        percentiles (tuple float): Percentiles to use as endpoints of the grid
            for each feature.
        grid_resolution (int): How many points to interpolate between percentiles.
        custom_range (dict[str, np.ndarray]): Mapping from column name in X to the
            range of values to use in partial dependence. If custom_range is
            specified, the percentile + interpolation procedure is skipped and the
            values in custom_range are used.

    Returns:
        tuple(pd.DataFrame, list(np.ndarray)): Cartesian product of the input
            columns of X, plus the list of values used for each feature.
    """
    values = []
    for feature in X.columns:
        if feature in custom_range:
            # Use values in the custom range
            feature_range = custom_range[feature]
            if not isinstance(feature_range, (np.ndarray, pd.Series)):
                feature_range = np.array(feature_range)
            if feature_range.ndim != 1:
                raise ValueError(
                    "Grid for feature {} is not a one-dimensional array. Got {}"
                    " dimensions".format(feature, feature_range.ndim)
                )
            axis = feature_range
        else:
            uniques = np.unique(X.loc[:, feature])
            if uniques.shape[0] < grid_resolution:
                # feature has low resolution; use unique values
                axis = uniques
            else:
                # create axis based on percentiles and grid resolution
                emp_percentiles = mquantiles(
                    X.loc[:, feature], prob=percentiles, axis=0
                )
                if np.allclose(emp_percentiles[0], emp_percentiles[1]):
                    raise ValueError(
                        "percentiles are too close to each other, "
                        "unable to build the grid. Please choose percentiles "
                        "that are further apart."
                    )
                axis = np.linspace(
                    emp_percentiles[0],
                    emp_percentiles[1],
                    num=grid_resolution,
                    endpoint=True,
                )
        values.append(axis)

    return _cartesian(values), values
```

Comment on the `feature_range.ndim != 1` check:

> I'm ok if this isn't covered. It's impossible to trigger this as a user because …
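As a quick illustration of the two paths through `_grid_from_X`, here is a sketch with made-up data (assumes the helpers above are importable):

```python
import numpy as np
import pandas as pd

X = pd.DataFrame(
    {"age": np.arange(100, dtype=float), "color": ["red", "blue"] * 50}
)

# "age" has 100 unique values (>= grid_resolution), so it is interpolated
# between the percentile endpoints; "color" has only 2 unique values, so
# those unique values are used directly.
grid, values = _grid_from_X(
    X, percentiles=(0.05, 0.95), grid_resolution=10, custom_range={}
)
print(grid.shape)  # (20, 2): 10 age points x 2 colors

# A custom range skips the percentile + interpolation logic for that feature.
grid, values = _grid_from_X(
    X,
    percentiles=(0.05, 0.95),
    grid_resolution=10,
    custom_range={"age": [18, 40, 65]},
)
print(grid.shape)  # (6, 2): 3 custom age points x 2 colors
```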
```python
def _cartesian(arrays):
    """Create cartesian product of elements of arrays list.

    Stored in a dataframe to allow mixed types like dates/str/numeric.

    Args:
        arrays (list(np.ndarray)): Arrays.

    Returns:
        pd.DataFrame: Cartesian product of arrays.
    """
    arrays = [np.asarray(x) for x in arrays]
    shape = (len(x) for x in arrays)

    ix = np.indices(shape)
    ix = ix.reshape(len(arrays), -1).T

    out = pd.DataFrame()

    for n, arr in enumerate(arrays):
        out[n] = arr[ix[:, n]]

    return out
```
Comment on lines +110 to +121 (the `_cartesian` helper):

> This seems to be the same as https://github.com/scikit-learn/scikit-learn/blob/844b4be24d20fc42cc13b957374c718956a0db39/sklearn/utils/extmath.py#L655 except we return a dataframe, and since it's a public sklearn method, we could just import it--whatcha think? Also totally down to take their impl 😂

> This is a great idea. I know as we first trudged through partial dependence that we borrowed a lot. Perhaps a bit more from some private methods than I would like, but it was necessary. If we can refactor to use their public methods, that's great.

> @angela97lin Great point. Originally I wanted to use their method, but the problem is that numpy arrays cannot handle mixed types very well. So if we want to have a grid of categoricals and datetimes, the conversion when storing it in a numpy array won't really work. There may be a way around it I'm not seeing (maybe this), but IMO that's a nice-to-have as opposed to a requirement?
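To make the mixed-types point concrete, a small demonstration (not from the PR; assumes `_cartesian` as defined above) of why a per-column DataFrame is used instead of a single numpy array:

```python
import numpy as np
import pandas as pd

dates = pd.to_datetime(["2020-01-01", "2020-06-01"])
cats = np.array(["a", "b"])

# A single numpy array coerces every cell to one dtype (object here),
# losing the datetime64 dtype of the first column.
coerced = np.array([[d, c] for d in dates for c in cats])
print(coerced.dtype)  # object

# _cartesian keeps each grid column's dtype intact because each column
# is stored separately in the dataframe.
grid = _cartesian([dates, cats])
print(grid.dtypes)  # 0: datetime64[ns], 1: object
```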
```python
def _partial_dependence_calculation(pipeline, grid, features, X):
    """Do the partial dependence calculation once the grid is computed.

    Args:
        pipeline (PipelineBase): Pipeline to make predictions with.
        grid (pd.DataFrame): Grid of features to compute the partial dependence on.
        features (list(str)): Column names of the features being varied in the grid.
        X (pd.DataFrame): Input data.

    Returns:
        tuple(np.ndarray, np.ndarray): Averaged and individual predictions for
            all points in the grid.
    """
    predictions = []
    averaged_predictions = []

    if is_regression(pipeline.problem_type):
        prediction_method = pipeline.predict
    else:
        prediction_method = pipeline.predict_proba

    X_eval = X.copy()
    for _, new_values in grid.iterrows():
        for i, variable in enumerate(features):
            X_eval.loc[:, variable] = new_values[i]

        pred = prediction_method(X_eval)

        predictions.append(pred)
        # average over samples
        averaged_predictions.append(np.mean(pred, axis=0))

    n_samples = X.shape[0]

    # reshape to (n_instances, n_points) for binary/regression
    # reshape to (n_classes, n_instances, n_points) for multiclass
    predictions = np.array(predictions).T
    if is_regression(pipeline.problem_type) and predictions.ndim == 2:
        predictions = predictions.reshape(n_samples, -1)
    elif predictions.shape[0] == 2:
        predictions = predictions[1]
        predictions = predictions.reshape(n_samples, -1)

    # reshape averaged_predictions to (1, n_points) for binary/regression
    # reshape averaged_predictions to (n_classes, n_points) for multiclass
    averaged_predictions = np.array(averaged_predictions).T
    if is_regression(pipeline.problem_type) and averaged_predictions.ndim == 1:
        averaged_predictions = averaged_predictions.reshape(1, -1)
    elif averaged_predictions.shape[0] == 2:
        averaged_predictions = averaged_predictions[1]
        averaged_predictions = averaged_predictions.reshape(1, -1)

    return averaged_predictions, predictions
```
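To make the reshape comments concrete, here is a small shape walk-through for the binary-classification branch (pure numpy, simulating what `predict_proba` would return; the values are arbitrary, only the shapes matter):

```python
import numpy as np

n_samples, n_points = 4, 3
rng = np.random.default_rng(0)

# predict_proba yields an (n_samples, 2) array per grid point.
preds = [rng.random((n_samples, 2)) for _ in range(n_points)]

stacked = np.array(preds).T  # (2, n_samples, n_points)
positive = stacked[1]        # keep the positive class: (n_samples, n_points)
print(positive.reshape(n_samples, -1).shape)  # (4, 3)

# The averaged predictions follow the same pattern after the per-point mean.
averaged = np.array([p.mean(axis=0) for p in preds]).T  # (2, n_points)
print(averaged[1].reshape(1, -1).shape)  # (1, 3)
```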
```python
def _partial_dependence(
    pipeline,
    X,
    features,
    percentiles=(0.05, 0.95),
    grid_resolution=100,
    kind="average",
    custom_range=None,
):
    """Compute the partial dependence for features of X.

    Args:
        pipeline (PipelineBase): Pipeline to make predictions with.
        X (pd.DataFrame): Holdout data.
        features (list(str)): Column names of X to compute the partial dependence for.
        percentiles (tuple float): Percentiles to use in the range calculation for a
            given feature.
        grid_resolution (int): Number of points in the range of values used for each
            feature in the partial dependence calculation.
        kind (str): The type of predictions to return: "average", "individual",
            or "both".
        custom_range (dict[str, np.ndarray]): Mapping from column name in X to the
            range of values to use in partial dependence. If custom_range is
            specified, the percentile + interpolation procedure is skipped and the
            values in custom_range are used.

    Returns:
        dict: Dictionary with "average", "individual", and/or "values" keys,
            depending on kind. "values" is always present and is a list of the
            values used in the partial dependence for each feature. "average" and
            "individual" are the averaged and individual predictions for each
            point in the grid.
    """
    if grid_resolution <= 1:
        raise ValueError("'grid_resolution' must be strictly greater than 1.")

    custom_range = custom_range or {}
    custom_range = {
        feature: custom_range.get(feature)
        for feature in features
        if feature in custom_range
    }
    grid, values = _grid_from_X(
        X.loc[:, features],
        percentiles,
        grid_resolution,
        custom_range,
    )
    averaged_predictions, predictions = _partial_dependence_calculation(
        pipeline,
        grid,
        features,
        X,
    )

    # reshape predictions to
    # (n_outputs, n_instances, n_values_feature_0, n_values_feature_1, ...)
    predictions = predictions.reshape(
        -1, X.shape[0], *[val.shape[0] for val in values]
    )

    # reshape averaged_predictions to
    # (n_outputs, n_values_feature_0, n_values_feature_1, ...)
    averaged_predictions = averaged_predictions.reshape(
        -1, *[val.shape[0] for val in values]
    )

    if kind == "average":
        return {"average": averaged_predictions, "values": values}
    elif kind == "individual":
        return {"individual": predictions, "values": values}
    else:  # kind == "both"
        return {
            "average": averaged_predictions,
            "individual": predictions,
            "values": values,
        }
```
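End to end, the helper only needs an object that exposes `problem_type` and a prediction method, so it can be exercised with a stub. The stub and data below are made up for illustration and are not an EvalML pipeline:

```python
import numpy as np
import pandas as pd
from evalml.problem_types import ProblemTypes

class StubRegressor:
    """Hypothetical stand-in exposing the two attributes the helper touches."""

    problem_type = ProblemTypes.REGRESSION

    def predict(self, X):
        # Toy model: the prediction is a linear function of the feature.
        return 2.0 * X["amount"]

X = pd.DataFrame({"amount": np.linspace(0, 100, 50)})
result = _partial_dependence(
    StubRegressor(), X, features=["amount"], grid_resolution=5, kind="both"
)
print(result["average"].shape)     # (1, 5): one output, 5 grid points
print(result["individual"].shape)  # (1, 50, 5): one output, 50 rows, 5 grid points
```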
Comment on `_range_for_dates` (the hard-coded epoch and one-second quantum):

> I don't know why, but I fixated on this. I think probably because I come from a natural science background... but is it worth us leaving the reference date and the quantum of time as variables? I don't think any of our common or current use cases extend to people doing time series modeling on, like, a chemical reaction timescale (~milli/microseconds). But I can definitely see pharma customers being interested in it. Let me know what you think. I don't think we necessarily have to do the work here, but it might be nice to at least talk about it.

> Fantastic point @chukarsten! I think what this is getting at is making our `custom_range` internal parameter public. I think there can be value in letting users specify how the grid for their features is computed! I will file a separate issue for tracking that.