-
Notifications
You must be signed in to change notification settings - Fork 36
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
adding in shapley values #222
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi all, this fucntionality will be a great addition to q2-sample-classifier
⭐ .
Some initial thoughts from my side:
- could you fix the linting errors?
- could you add unit tests for the new functionality (e.g. testing that shapely values can be calculated for all classifiers supported by
q2-sample-classifier
or if they can't a proper error message is raised)? - could you add the fitting citations to SHAP to
citations.bib
?
function=shapely_values, | ||
inputs={**inputs, 'sample_estimator': SampleEstimator[Classifier]}, | ||
parameters={}, | ||
outputs=[('shap', SampleData[Probabilities])], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a particular reason for choosing this QIIME2 type over FeatureData[Importance]
here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When we first wrote this, it wasn't clear if the dimensions would work out with FeatureData[Importance]
-- Shapely values are of dimensions D x N
for D
features and N
samples.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
understood 👍🏼 I think in this case it could make sense to add a new type specifically for shapely values (as they are distinct from predicted sample probabilities and feature importances).
Hey @jindongmin and @mortonjt - thanks for submitting this new method! Should we get this added to our upcoming release (pending updates for passing CI on lint/testing)? Or should we add to a future release if this will still be in development over the next month? |
Hello @lizgehret, there is an issue when I tried to add 'pip install shap' in https://github.com/qiime2/q2-sample-classifier/blob/master/.github/workflows/ci.yml#L26, the error message is as following. Is this because I don't have access to modify this? It would be great if you could help with this. Thanks!
|
Hi @jindongmin - sorry for the delayed response! You'll actually want to add any additional dependencies in this file: Try adding |
Hey @jindongmin just wanted to check in on this - were you successful in adding the |
Hi @lizgehret, yes the shap dependency was added. Thank you for your help! |
Co-authored-by: Liz Gehret <[email protected]>
Co-authored-by: Liz Gehret <[email protected]>
Co-authored-by: Liz Gehret <[email protected]>
Do we have a good resolution for @adamovanja's comment on the semantic type? |
Hey all just FYI I'm going to bump this to 2023.8 since our PR cutoff date for 2023.5 is May 5th. |
Bumping to 2023.11 since we're still waiting on updates regarding @adamovanja's comment regarding semantic type (see link above). |
Hi, there are no plans to address the semantic type issue, so feel free to
take over this pull request.
…On Wed, Sep 27, 2023 at 2:37 PM Liz Gehret ***@***.***> wrote:
Bumping to 2023.11 since we're still waiting on updates regarding
@adamovanja <https://github.com/adamovanja>'s comment regarding semantic
type (see link above).
—
Reply to this email directly, view it on GitHub
<#222 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AA75VXPYJGOVZB5ERJKPZMDX4RW43ANCNFSM6AAAAAARZ3Q6RI>
.
You are receiving this because you were mentioned.Message ID:
***@***.***>
|
Ok, thanks for letting us know @mortonjt. @adamovanja, are folks on your end motivated enough to keep working on this, or should we close out this PR and issue? |
Hi @gregcaporaso, No capacity here to work on this right now, so feel free to close this PR. |
Hi all, I know that this has closed a while ago. However, I'm finding myself revisiting this PR, since the q2-sample-classifier repo is quite good (in terms of off-the-shelf hyper-parameter tuning, its my go-to). I'm currently finding myself maintaining the Shapley value code that @jindongmin has pushed in. It is quite useful, not completely trivial to reload and implement the Shap modules given all of the preprocessing steps (i.e. the DictVectorizer). I think others would benefit from this as well, @nbokulich and I have discussed this many years ago as being a nice addition. While the utility is clear, I think it is becoming clear to us that adding new types does introduce additional development burden that none of us are well-equipped to allocate time to. The good news here is that this isn't a necessity, the Probabilities type is semantically well-suited, since these feature importances are technically delta probabilities. I understand your position accepting new code to maintain. But I am also prepared to continue to push commits on my branch to make sure that it is update to date to the latest q2-sample-classifier. Let me know if you are willing to reconsider not requiring the new type to be added, and I push a new PR to add this functionality in. |
Hi @mortonjt I agree that this would still be a useful addition to q2-sample-classifier. Could you describe what the SHAP probabilities are in a little more detail? Based on my current understanding, I think a new type is warranted. However, the format is effectively the same (a 2d numeric dataframe without any specific header requirements), and defining new formats is usually what takes more time to code IMO. Creating a new type should take less time if an existing format can be used. |
I was going to go back to my original comment and edit :) Following up on the new type discussion, has there been updates with lowering the barrier for new types? What would it take to create a new tabular 2D type without the need to create new formats, new transformers, ... Regarding the SHAP probabilities, it is the delta between the classification probability with (and without) the feature, but averaged over all possible feature subsets that includes the feature. See equation 4 here : https://arxiv.org/pdf/1705.07874 In contrast, the feature important only looks at the total set of features, and computes the classification probability delta on the total set, and the total set with a given feature excluded. Shapley is a generalization of this that makes it more robust and have more accurate attributions. |
@mortonjt, I'm not sure if it's what you're looking for in terms of lowering the barrier, but this is much better documented then it ever has been in the past: |
Thanks @gregcaporaso for the link. This documentation does help -- but my original question is about having "cookie cutter" that can define new types with standard formats (i.e. pandas data frame) with a single line of code without introducing new transformer methods. I did raise this issue a few years ago here to further lower the barrier to creating new types with standard formats : qiime2/qiime2#583 (comment) I still like the idea of the typing system in q2, but based on my previous experiences developing q2 packages, it does take some time to develop a new package, even for an experienced developer. The fastest that I've developed a new package was q2-aldex2, it took me about 1 week. These timescales are hard to justify given competing priorities. But I can revisit allocating additional resources to push this PR through if some additional work is done to lower the barriers for creating these new types. |
@mortonjt, what you describe is something we're working toward in the context of the q2-stats project. While some details will likely change, a new API will be added that enables creation of new tabular data types that could be used as follows: q2_types.register_tabular_type(
plugin,
semantic_type_name='MyStatsTable',
columns={'subject': str, 'p-value': float, 'my-test-statistic': float, ...}
) The three required inputs:
Optional inputs will include things like a description of the indexing scheme, labels to use when displaying column headers to users, and so on. We expect that there will be a draft version of this available through q2-stats in the 2024.10 release, followed by a migration to q2-types for the 2025.4 release (ideally with addition to development QIIME 2 environments soon after 2024.10). |
@mortonjt in the meantime (until the new q2-stats API is ready), you could just use the tabular data format that is already defined in q2-sample-classifier and used by the |
Thanks @gregcaporaso for the update. This is indeed very exciting progress and it may be able to lower the development burdens we encountered in the past. @nbokulich I think this make sense as a viable option. My schedule is unpredictable over the next 2-3 weeks, but let me see what I can do in the meantime. Thanks. |
We added a function to compute shapely values for classifiers as stated in #219 and #221