
Our own Partial Dependence Implementation #2834

Merged 5 commits into main on Sep 29, 2021

Conversation

freddyaboulton (Contributor) commented Sep 23, 2021

Pull Request Description

Fixes #2502
Fixes #2475

Same run-time as main for model understanding tests:
main: 9m 59s
this branch: 9m 14s

Plots match between this branch and main

[Screenshots: partial dependence plots, this branch vs. main, for two examples]



@freddyaboulton freddyaboulton force-pushed the 2502-our-own-partial-dependence branch from 457a2e3 to 91b7a18 Compare September 23, 2021 15:15
codecov bot commented Sep 23, 2021

Codecov Report

Merging #2834 (635fcf4) into main (06d7df7) will decrease coverage by 0.1%.
The diff coverage is 99.1%.


@@           Coverage Diff           @@
##            main   #2834     +/-   ##
=======================================
- Coverage   99.8%   99.8%   -0.0%     
=======================================
  Files        302     303      +1     
  Lines      28148   28226     +78     
=======================================
+ Hits       28070   28145     +75     
- Misses        78      81      +3     
Impacted Files                                          Coverage Δ
evalml/model_understanding/_partial_dependence.py       98.8% <98.8%> (ø)
evalml/model_understanding/graphs.py                    100.0% <100.0%> (ø)
...del_understanding_tests/test_partial_dependence.py   99.3% <100.0%> (+0.1%) ⬆️
evalml/pipelines/components/utils.py                    98.4% <0.0%> (-1.6%) ⬇️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

if not isinstance(feature_range, (np.ndarray, pd.Series)):
    feature_range = np.array(feature_range)
if feature_range.ndim != 1:
    raise ValueError(
freddyaboulton (Contributor, Author):

I'm ok if this isn't covered. It's impossible to trigger as a user because custom_range is not a public parameter, but I'd like to keep this check in case we refactor this in the future. It helped catch a couple of bugs during development.

@@ -653,6 +652,11 @@ def partial_dependence(
is_datetime = [_is_feature_of_type(features, X, ww.logical_types.Datetime)]

if isinstance(features, (list, tuple)):
    if any(is_datetime) and len(features) > 1:
freddyaboulton (Contributor, Author) commented Sep 23, 2021:

There used to be two `isinstance(features, (list, tuple))` checks; this consolidates them into one.

angela97lin (Contributor) left a comment:

This is epic 👏!

I left some nitpicky comments but nothing blocking. Great work @freddyaboulton!

Also, the speedups are a cherry on top :)

    pl, X, features=("amount", "provider"), grid_resolution=5
)
assert not dep2way.isna().any().any()
# Minus 1 in the columns because there is `class_label`
Contributor:

+1, not minus?

freddyaboulton (Contributor, Author):

Thank you!!

)
assert not dep2way.isna().any().any()
# Minus 1 in the columns because there is `class_label`
assert dep2way.shape == (5, X["provider"].dropna().nunique() + 1)
Contributor:

Omega nitpick, but I think it'd be a good idea to set grid_resolution as a variable and use it above and here, as in `assert dep2way.shape == (grid_resolution_variable, ...)`. Just so it's clearer where this 5 value is coming from :)

freddyaboulton (Contributor, Author):

I completely agree!
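The nitpick above can be illustrated with a self-contained shape calculation (toy data, not the test's actual dataset):

```python
import pandas as pd

grid_resolution = 5  # named variable, instead of a bare literal in the assert
X = pd.DataFrame({"provider": ["visa", "mastercard", None, "visa"]})

# One column per unique (non-NaN) provider category, plus one for `class_label`.
expected_shape = (grid_resolution, X["provider"].dropna().nunique() + 1)
```

With two unique providers, `expected_shape` works out to `(5, 3)`, and the named variable makes the 5 self-explanatory.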

Comment on lines +110 to +121
arrays = [np.asarray(x) for x in arrays]
shape = (len(x) for x in arrays)

ix = np.indices(shape)
ix = ix.reshape(len(arrays), -1).T

out = pd.DataFrame()

for n, arr in enumerate(arrays):
    out[n] = arrays[n][ix[:, n]]

return out
Contributor:

This seems to be the same as https://github.com/scikit-learn/scikit-learn/blob/844b4be24d20fc42cc13b957374c718956a0db39/sklearn/utils/extmath.py#L655 except we return a dataframe, and since it's a public sklearn method, we could just import it--whatcha think? Also totally down to take their impl 😂

Contributor:

This is a great idea. I know as we first trudged through partial dependence that we borrowed a lot, perhaps a bit more from some private methods than I would like, but it was necessary. If we can refactor to use their public methods, that's great.

freddyaboulton (Contributor, Author) commented Sep 28, 2021:

@angela97lin Great point. Originally I wanted to use their method, but the problem is that numpy arrays cannot handle mixed types very well. So if we want a grid of categoricals and datetimes, storing it in a single numpy array won't really work.

[Screenshot illustrating the mixed-type conversion issue]

There may be a way around it I'm not seeing (maybe this) but IMO that's a nice to have as opposed to a requirement?
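A runnable sketch of the DataFrame-based helper from the diff, showing why it handles mixed types where a single numpy array would coerce everything to one dtype (the name `_cartesian_dataframe` is illustrative):

```python
import numpy as np
import pandas as pd

def _cartesian_dataframe(arrays):
    # Cartesian product of 1-D arrays, one DataFrame column per input,
    # so each column keeps its own dtype.
    arrays = [np.asarray(x) for x in arrays]
    shape = tuple(len(x) for x in arrays)
    ix = np.indices(shape).reshape(len(arrays), -1).T
    out = pd.DataFrame()
    for n, arr in enumerate(arrays):
        out[n] = arr[ix[:, n]]
    return out

grid = _cartesian_dataframe(
    [np.array(["visa", "mastercard"]),
     pd.to_datetime(["2021-01-01", "2021-06-01"]).values]
)
```

`grid` has 4 rows; column 0 stays strings and column 1 stays datetime64, which a single homogeneous `np.array` could not represent without falling back to `object`.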

chukarsten (Contributor) left a comment:

wow @freddyaboulton, this is amazing. I am really impressed. I feel like you cleaned up the code substantially, improved performance, and enhanced functionality. This is a great PR. I had a question about the handling of the times, but that isn't blocking.

    pd.Series: Range of dates between percentiles.
"""
timestamps = np.array(
    [X_dt - pd.Timestamp("1970-01-01")] // np.timedelta64(1, "s")
Contributor:

I don't know why, but I fixated on this, probably because I come from a natural science background... but is it worth leaving the reference date and the quantum of time as variables? I don't think any of our common or current use cases extend to people doing time series modeling on, say, a chemical reaction timescale (~milli/microseconds), but I can definitely see pharma customers being interested in it.

Let me know what you think. I don't think we necessarily have to do the work here, but it might be nice to at least talk about it.

freddyaboulton (Contributor, Author):

Fantastic point @chukarsten ! I think what this is getting at is making our custom_range internal parameter public. I think there can be value in letting users specify how the grid for their features is computed!

I will file a separate issue for tracking that.
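Chukarsten's suggestion could look like the following hypothetical parametrization (the signature is illustrative; the PR hard-codes the 1970 epoch and a one-second quantum):

```python
import numpy as np
import pandas as pd

def to_timestamps(X_dt, epoch=pd.Timestamp("1970-01-01"), unit="s"):
    # Elapsed whole units of `unit` between `epoch` and each datetime.
    return np.array([X_dt - epoch]) // np.timedelta64(1, unit)

dates = pd.to_datetime(pd.Series(["1970-01-02", "1970-01-03"]))
secs = to_timestamps(dates)           # seconds since the Unix epoch
ms = to_timestamps(dates, unit="ms")  # finer quantum for short timescales
```

numpy handles the unit conversion in the floor division, so changing the quantum is just a keyword argument.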

prediction_method = pipeline.predict_proba

for _, new_values in grid.iterrows():
    X_eval = X.copy()
Contributor:

Do we need to copy this each time? Does it make more sense to just rebuild the new dataframe with a concat or something at the end? If it's just as performant, then whatever, this makes sense and is clear.

freddyaboulton (Contributor, Author):

Great point. I think we can move it out of the loop. Will test it out!
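The restructuring discussed above could be sketched like this (names are illustrative; the real loop calls the pipeline's prediction method):

```python
import pandas as pd

def evaluate_grid(X, grid, prediction_method):
    # Copy X once and overwrite the grid's columns each iteration,
    # rather than copying the full frame per grid row.
    X_eval = X.copy()
    results = []
    for _, new_values in grid.iterrows():
        for col, value in new_values.items():
            X_eval[col] = value  # broadcast the grid value to every row
        results.append(prediction_method(X_eval))
    return results

X = pd.DataFrame({"amount": [1.0, 2.0], "other": [0.5, 0.5]})
grid = pd.DataFrame({"amount": [10.0, 20.0]})
preds = evaluate_grid(X, grid, lambda df: df["amount"].mean())
```

This is safe because every grid column is overwritten on every iteration, so no stale values leak between grid rows.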

freddyaboulton (Contributor, Author):

@chukarsten @angela97lin Thank you so much for the reviews! I didn't think this would make it into the coming release.

Kicked off perf tests out of paranoia to make sure none of the datasets error out on partial dependence. Will merge if those look good.

@freddyaboulton freddyaboulton force-pushed the 2502-our-own-partial-dependence branch from fa4c109 to 0756eb7 Compare September 28, 2021 22:52
freddyaboulton (Contributor, Author):

Perf tests here and they look good to me!

@freddyaboulton freddyaboulton merged commit e257b1b into main Sep 29, 2021
@chukarsten chukarsten mentioned this pull request Oct 1, 2021
@freddyaboulton freddyaboulton deleted the 2502-our-own-partial-dependence branch May 13, 2022 15:03
Successfully merging this pull request may close these issues.

- Implement our own partial dependence method
- Partial dependence errors with column with string and NaN values
3 participants