
cuml.experimental SHAP improvements #3433

Merged
merged 23 commits into from
Feb 10, 2021

Conversation

dantegd
Member

@dantegd dantegd commented Jan 29, 2021

Closes #1739

Addresses most items of #3224

@dantegd dantegd added 2 - In Progress Currently a work in progress CUDA / C++ CUDA issue Cython / Python Cython or Python issue improvement Improvement / enhancement to an existing function breaking Breaking change labels Jan 29, 2021
@dantegd dantegd requested review from a team as code owners January 29, 2021 17:03

@JohnZed JohnZed left a comment

Looks good! Only small comments

Resolved review threads:
- python/cuml/experimental/explainer/kernel_shap.pyx (2 threads)
- python/cuml/experimental/explainer/permutation_shap.pyx (2 threads)
@codecov-io

codecov-io commented Feb 1, 2021

Codecov Report

Merging #3433 (3d7f6ec) into branch-0.18 (550121b) will increase coverage by 0.06%.
The diff coverage is n/a.


@@               Coverage Diff               @@
##           branch-0.18    #3433      +/-   ##
===============================================
+ Coverage        71.48%   71.55%   +0.06%     
===============================================
  Files              207      212       +5     
  Lines            16748    17082     +334     
===============================================
+ Hits             11973    12223     +250     
- Misses            4775     4859      +84     
Impacted Files Coverage Δ
python/cuml/neighbors/ann.pxd 68.23% <0.00%> (-18.83%) ⬇️
python/cuml/common/timing_utils.py 42.85% <0.00%> (-7.15%) ⬇️
.../dask/feature_extraction/text/tfidf_transformer.py 37.50% <0.00%> (-6.05%) ⬇️
python/cuml/dask/preprocessing/label.py 34.00% <0.00%> (-4.89%) ⬇️
python/cuml/dask/neighbors/nearest_neighbors.py 25.97% <0.00%> (-4.52%) ⬇️
python/cuml/dask/naive_bayes/naive_bayes.py 37.68% <0.00%> (-4.43%) ⬇️
python/cuml/dask/cluster/kmeans.py 50.00% <0.00%> (-4.00%) ⬇️
python/cuml/dask/decomposition/base.py 36.58% <0.00%> (-2.95%) ⬇️
...ython/cuml/feature_extraction/_tfidf_vectorizer.py 85.36% <0.00%> (-2.87%) ⬇️
...ython/cuml/dask/neighbors/kneighbors_classifier.py 19.80% <0.00%> (-2.53%) ⬇️
... and 71 more

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c9c8619...3d7f6ec.

@dantegd
Member Author

dantegd commented Feb 2, 2021

rerun tests

3 similar comments
@dantegd
Member Author

dantegd commented Feb 3, 2021

rerun tests

@dantegd
Member Author

dantegd commented Feb 3, 2021

rerun tests

@dantegd
Member Author

dantegd commented Feb 3, 2021

rerun tests


@JohnZed JohnZed left a comment

Looks good! I had only small suggestions, mostly stuff that can be deferred to future PRs since this is still experimental. My only concern is that the number of variations supported in datatypes (sklearn model with pandas background data or cuml with f-ordered numpy or ...) makes it hard to test all paths of the base shap initialization. Let's look at codecov there for additional test ideas and be open to simplifying the options if necessary.


void shap_main_effect_dataset "ML::Explainer::shap_main_effect_dataset"(
const handle_t& handle,
float* dataset,
Contributor

(in the underlying API) should dataset be const?

Member Author

dataset is where the output is generated; maybe I should change the name to avoid confusion?

----------
model : function
Function that takes a matrix of samples (n_samples, n_features) and
computes the output for those samples with shape (n_samples). Function
Contributor

Ah, bummer. So there is no way to use the tags API, because we need to take the function rather than the model.

Member Author

We already use the tags API by getting the owning object of the function (if it exists) and reading the tags from that:

def get_tag_from_model_func(func, tag, default=None):
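The mechanism described above can be sketched as follows. This is an illustrative reimplementation, not cuML's actual code: it assumes the owning model exposes a scikit-learn-style `_get_tags`, and relies on the fact that a bound method carries its owning instance in `__self__`:

```python
def get_tag_from_model_func(func, tag, default=None):
    """Illustrative sketch: look up `tag` on the object that owns
    the bound method `func`, falling back to `default`."""
    # A bound method exposes its owning instance via __self__.
    owner = getattr(func, "__self__", None)
    if owner is not None:
        get_tags = getattr(owner, "_get_tags", None)
        if callable(get_tags):
            return get_tags().get(tag, default)
    return default


class ToyModel:
    def _get_tags(self):
        return {"preferred_input_order": "F"}

    def predict(self, X):
        return X


model = ToyModel()
print(get_tag_from_model_func(model.predict, "preferred_input_order"))  # F

# A plain function has no owning object, so the default is returned.
print(get_tag_from_model_func(lambda X: X, "preferred_input_order",
                              default="C"))  # C
```

This is why passing `model.predict` (rather than a bare lambda) still lets the explainer discover model tags.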

Resolved review thread: python/cuml/experimental/explainer/base.pyx
default=np.float32)
else:
if dtype in [np.float32, np.float64]:
self.dtype = np.dtype(dtype)
Contributor

out of curiosity why do you have to convert to np.dtype?

Member Author

I was doing things in the wrong order. I use NumPy's dtype function so that we accept string descriptions of dtypes without additional work.
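The point about `np.dtype` can be seen in a short check: it normalizes strings, NumPy scalar types, and dtype objects to the same canonical dtype, so no extra parsing is needed:

```python
import numpy as np

# np.dtype canonicalizes any valid dtype description, so "float32",
# "f4", and np.float32 all resolve to the same dtype object.
for spec in ("float32", "f4", np.float32):
    assert np.dtype(spec) == np.dtype(np.float32)

# Invalid descriptions raise TypeError, which makes validation easy.
try:
    np.dtype("not-a-dtype")
except TypeError:
    print("rejected invalid dtype string")
```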

Resolved review threads: python/cuml/experimental/explainer/base.pyx (2 threads)

@JohnZed JohnZed left a comment

Changes look good - just some doc and test suggestions. I think there is still a california_housing test coming? We could split that to the next PR too.

@@ -213,13 +198,17 @@ class SHAPBase():
)
)

# public attribute saved as NumPy for compatibility with the legacy
# SHAP plotting functions
self.expected_value = cp.asnumpy(self._expected_value)
Contributor

This makes sense, but it's a deviation from our standard approach... can you add a docstring to explain this? Can be a follow up PR.

Also would be really good to have a test of compatibility with SHAP plotting so we never break this (again, follow on PR ok)
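The conversion in the diff above can be sketched with a small host-conversion helper. `as_numpy` is a hypothetical name for illustration (cuML's actual utility differs); the key fact is that `cupy.asnumpy` copies device memory back to the host, which legacy SHAP plotting code (matplotlib-based) expects:

```python
import numpy as np

def as_numpy(array):
    """Return a host (NumPy) array for `array`.

    Hypothetical helper mirroring the cp.asnumpy(...) call in the diff:
    cupy.asnumpy copies device data to host memory, while np.asarray is
    a cheap no-op for data that is already on the host.
    """
    try:
        import cupy as cp
        return cp.asnumpy(array)
    except ImportError:
        # No GPU stack available; the data is already host-side.
        return np.asarray(array)

expected_value = as_numpy([0.25, 0.75])
print(type(expected_value).__name__)  # ndarray
```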

@dantegd
Member Author

dantegd commented Feb 8, 2021

rerun tests


@JohnZed JohnZed left a comment

Pre-approving with some small suggestions/questions. Looks great!

// gemv, which could cause a very sporadic race condition in Pascal and
// Turing GPUs that caused it to give the wrong results. Details:
// https://github.com/rapidsai/cuml/issues/1739
rmm::device_uvector<math_t> tmp_vector(n_cols, stream);
Contributor

how about something like tmp_gemv_result or otherwise indicating its use?

def output_list_shap_values(X, dimensions, output_type):
if output_type == 'cupy':
if dimensions == 1:
return X[0]
else:
return X
res = []
Contributor

Super picky, but it seems like either a list comprehension or just list(X) would be nicer here.
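The suggestion can be sketched like this, assuming the elided loop converts each per-output array. The `to_output` helper below is hypothetical, standing in for cuML's output-conversion utility:

```python
import numpy as np

def to_output(arr, output_type):
    # Hypothetical stand-in for cuML's output-type conversion.
    return np.asarray(arr) if output_type == "numpy" else arr

def output_list_shap_values(X, dimensions, output_type):
    if output_type == "cupy":
        return X[0] if dimensions == 1 else X
    # Reviewer's suggestion: a list comprehension replaces the
    # explicit append loop that builds `res`.
    res = [to_output(x, output_type) for x in X]
    return res[0] if dimensions == 1 else res
```

A usage sketch: `output_list_shap_values([vals], 1, "cupy")` unwraps the single-output case, while the multi-output case returns one converted array per model output.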

@@ -399,6 +416,11 @@ def test_l1_regularization(exact_tests_dataset, l1_type):
0.00088981]
]

housing_regression_result = np.array(
Contributor

Was this obtained by running shap? Would be good to note in a comment what you did to get it and what version you used.

@JohnZed
Contributor

JohnZed commented Feb 9, 2021

rerun tests

@JohnZed
Contributor

JohnZed commented Feb 9, 2021

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 8082f3b into rapidsai:branch-0.18 Feb 10, 2021
Development

Successfully merging this pull request may close these issues.

[BUG] Sporadic OLS pytest fail in test_linear_regression_model_default
4 participants