Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

adding in shapley values #222

Closed
wants to merge 17 commits into from
Closed

adding in shapley values #222

wants to merge 17 commits into from

Conversation

jindongmin
Copy link

We added a function to compute shapely values for classifiers as stated in #219 and #221

Copy link
Contributor

@adamovanja adamovanja left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi all, this fucntionality will be a great addition to q2-sample-classifier ⭐ .
Some initial thoughts from my side:

  • could you fix the linting errors?
  • could you add unit tests for the new functionality (e.g. testing that shapely values can be calculated for all classifiers supported by q2-sample-classifier or if they can't a proper error message is raised)?
  • could you add the fitting citations to SHAP to citations.bib?

function=shapely_values,
inputs={**inputs, 'sample_estimator': SampleEstimator[Classifier]},
parameters={},
outputs=[('shap', SampleData[Probabilities])],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a particular reason for choosing this QIIME2 type over FeatureData[Importance] here?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When we first wrote this, it wasn't clear if the dimensions would work out with FeatureData[Importance] -- Shapely values are of dimensions D x N for D features and N samples.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

understood 👍🏼 I think in this case it could make sense to add a new type specifically for shapely values (as they are distinct from predicted sample probabilities and feature importances).

q2_sample_classifier/classify.py Outdated Show resolved Hide resolved
q2_sample_classifier/plugin_setup.py Outdated Show resolved Hide resolved
@lizgehret
Copy link
Member

Hey @jindongmin and @mortonjt - thanks for submitting this new method! Should we get this added to our upcoming release (pending updates for passing CI on lint/testing)? Or should we add to a future release if this will still be in development over the next month?

@jindongmin
Copy link
Author

jindongmin commented Nov 15, 2022

Hey @jindongmin and @mortonjt - thanks for submitting this new method! Should we get this added to our upcoming release (pending updates for passing CI on lint/testing)? Or should we add to a future release if this will still be in development over the next month?

Hello @lizgehret, there is an issue when I tried to add 'pip install shap' in https://github.com/qiime2/q2-sample-classifier/blob/master/.github/workflows/ci.yml#L26, the error message is as following. Is this because I don't have access to modify this? It would be great if you could help with this. Thanks!

 ! [remote rejected] master -> master (refusing to allow a Personal Access Token to create or update workflow `.github/workflows/ci.yml` without `workflow` scope)
error: failed to push some refs to 'https://github.com/jindongmin/q2-sample-classifier.git' 

@lizgehret
Copy link
Member

Hello @lizgehret, there is an issue when I tried to add 'pip install shap' in https://github.com/qiime2/q2-sample-classifier/blob/master/.github/workflows/ci.yml#L26, the error message is as following. Is this because I don't have access to modify this? It would be great if you could help with this. Thanks!

 ! [remote rejected] master -> master (refusing to allow a Personal Access Token to create or update workflow `.github/workflows/ci.yml` without `workflow` scope)
error: failed to push some refs to 'https://github.com/jindongmin/q2-sample-classifier.git' 

Hi @jindongmin - sorry for the delayed response! You'll actually want to add any additional dependencies in this file:
https://github.com/qiime2/q2-sample-classifier/blob/master/ci/recipe/meta.yaml

Try adding shap in there under the appropriate section (either run or test - depending on if you need this dependency to run this plugin with your updates, or if you just need it to run your tests). Let me know if you have any questions!

@lizgehret
Copy link
Member

Hey @jindongmin just wanted to check in on this - were you successful in adding the shap dependency into the meta.yaml file? Is there anything you need from us? This PR won't make it into the 2023.2 release, but we can add it to the 2023.4 cycle if that feels feasible to you.

@jindongmin
Copy link
Author

Hey @jindongmin just wanted to check in on this - were you successful in adding the shap dependency into the meta.yaml file? Is there anything you need from us? This PR won't make it into the 2023.2 release, but we can add it to the 2023.4 cycle if that feels feasible to you.

Hi @lizgehret, yes the shap dependency was added. Thank you for your help!

paper/references.bib Outdated Show resolved Hide resolved
@jindongmin jindongmin changed the title adding in shapely values adding in shapley values Feb 24, 2023
@ebolyen
Copy link
Member

ebolyen commented Apr 21, 2023

Do we have a good resolution for @adamovanja's comment on the semantic type?

@lizgehret
Copy link
Member

Hey all just FYI I'm going to bump this to 2023.8 since our PR cutoff date for 2023.5 is May 5th.

@lizgehret
Copy link
Member

Bumping to 2023.11 since we're still waiting on updates regarding @adamovanja's comment regarding semantic type (see link above).

@mortonjt
Copy link

mortonjt commented Sep 27, 2023 via email

@gregcaporaso
Copy link
Member

Ok, thanks for letting us know @mortonjt.

@adamovanja, are folks on your end motivated enough to keep working on this, or should we close out this PR and issue?

@adamovanja
Copy link
Contributor

Hi @gregcaporaso, No capacity here to work on this right now, so feel free to close this PR.

@mortonjt
Copy link

Hi all, I know that this has closed a while ago. However, I'm finding myself revisiting this PR, since the q2-sample-classifier repo is quite good (in terms of off-the-shelf hyper-parameter tuning, its my go-to).

I'm currently finding myself maintaining the Shapley value code that @jindongmin has pushed in. It is quite useful, not completely trivial to reload and implement the Shap modules given all of the preprocessing steps (i.e. the DictVectorizer). I think others would benefit from this as well, @nbokulich and I have discussed this many years ago as being a nice addition.

While the utility is clear, I think it is becoming clear to us that adding new types does introduce additional development burden that none of us are well-equipped to allocate time to. The good news here is that this isn't a necessity, the Probabilities type is semantically well-suited, since these feature importances are technically delta probabilities.

I understand your position accepting new code to maintain. But I am also prepared to continue to push commits on my branch to make sure that it is update to date to the latest q2-sample-classifier. Let me know if you are willing to reconsider not requiring the new type to be added, and I push a new PR to add this functionality in.

@nbokulich
Copy link
Member

Hi @mortonjt I agree that this would still be a useful addition to q2-sample-classifier.

Could you describe what the SHAP probabilities are in a little more detail? SampleData[Probabilities] describes the probability that a given sample belongs to a given class. So I agree with @adamovanja 's earlier assessment that this is not a fit. But I am not sure what the SHAP probabilities are then. Would these be instead Feature probabiltiies, i.e., the probability of observing feature x in class y?

Based on my current understanding, I think a new type is warranted. However, the format is effectively the same (a 2d numeric dataframe without any specific header requirements), and defining new formats is usually what takes more time to code IMO. Creating a new type should take less time if an existing format can be used.

@mortonjt
Copy link

mortonjt commented Aug 20, 2024

I was going to go back to my original comment and edit :)

Following up on the new type discussion, has there been updates with lowering the barrier for new types? What would it take to create a new tabular 2D type without the need to create new formats, new transformers, ...

Regarding the SHAP probabilities, it is the delta between the classification probability with (and without) the feature, but averaged over all possible feature subsets that includes the feature. See equation 4 here : https://arxiv.org/pdf/1705.07874

In contrast, the feature important only looks at the total set of features, and computes the classification probability delta on the total set, and the total set with a given feature excluded. Shapley is a generalization of this that makes it more robust and have more accurate attributions.

@gregcaporaso
Copy link
Member

@mortonjt, I'm not sure if it's what you're looking for in terms of lowering the barrier, but this is much better documented then it ever has been in the past:
https://develop.qiime2.org/en/latest/plugins/tutorials/add-artifact-class.html

@mortonjt
Copy link

Thanks @gregcaporaso for the link. This documentation does help -- but my original question is about having "cookie cutter" that can define new types with standard formats (i.e. pandas data frame) with a single line of code without introducing new transformer methods.

I did raise this issue a few years ago here to further lower the barrier to creating new types with standard formats : qiime2/qiime2#583 (comment)
Much of the barrier is less so about the amount of code, but rather the unique challenges of developing against a growing q2 environment.

I still like the idea of the typing system in q2, but based on my previous experiences developing q2 packages, it does take some time to develop a new package, even for an experienced developer. The fastest that I've developed a new package was q2-aldex2, it took me about 1 week. These timescales are hard to justify given competing priorities. But I can revisit allocating additional resources to push this PR through if some additional work is done to lower the barriers for creating these new types.

@gregcaporaso
Copy link
Member

@mortonjt, what you describe is something we're working toward in the context of the q2-stats project. While some details will likely change, a new API will be added that enables creation of new tabular data types that could be used as follows:

q2_types.register_tabular_type(  
    plugin,
    semantic_type_name='MyStatsTable', 
    columns={'subject': str, 'p-value': float, 'my-test-statistic': float, ...}  
)

The three required inputs:

  • plugin: your plugin object, as typically created in plugin_setup.py (eg)
  • semantic_type_name: the name of the new type to be created and registered
  • columns: dict mapping the column headers that should be present in tables of this type to their data types (allowed data types will likely be those supported by pandas)

Optional inputs will include things like a description of the indexing scheme, labels to use when displaying column headers to users, and so on.

We expect that there will be a draft version of this available through q2-stats in the 2024.10 release, followed by a migration to q2-types for the 2025.4 release (ideally with addition to development QIIME 2 environments soon after 2024.10).

@nbokulich
Copy link
Member

@mortonjt in the meantime (until the new q2-stats API is ready), you could just use the tabular data format that is already defined in q2-sample-classifier and used by the SampleData[Probabilities] type, unless if the SHAP format has different requirements (e.g., specific column names?). This should save a lot of time, as you would just need to register a new type (FeatureData[SHAPProbabilities]???) that links to that format, and add the necessary tests for that type only.

@mortonjt
Copy link

Thanks @gregcaporaso for the update. This is indeed very exciting progress and it may be able to lower the development burdens we encountered in the past.

@nbokulich I think this make sense as a viable option. My schedule is unpredictable over the next 2-3 weeks, but let me see what I can do in the meantime. Thanks.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants