Wang's semantic similarity method #183

Closed
ThHarbig opened this issue Sep 9, 2020 · 22 comments

@ThHarbig

ThHarbig commented Sep 9, 2020

Hi,

I'm working on a web application that uses goatools functions in its backend, and I'd like to offer users additional semantic similarity methods. How hard would it be to implement Wang's semantic similarity in goatools? I thought about implementing it myself and opening a pull request, but I don't know whether that is feasible, because in Wang's method all terms in the DAG contribute to the semantics of a term.

Thanks!

@dvklopfenstein
Collaborator

This is a fantastic idea. Let me look into it.

Do you have additional information?

We are also looking into implementing Yang's add-on for using the terms below the terms of interest and have alpha code developed that is looking good, but right now we are too busy to add it due to the surrounding tests and notebooks that would also need to be added.

Please excuse the delay in responding; I have been busy finishing a publication and my thesis.

GREAT idea about Wang's semantic similarity. I will check it out.

@ThHarbig
Author

That sounds good! Yang's semantic similarity method sounds interesting too, since I'm especially looking for methods that are not IC-based. Wang's semantic similarity is described nicely in the documentation of the R package GOSemSim (https://www.bioconductor.org/packages/release/bioc/vignettes/GOSemSim/inst/doc/GOSemSim.html#wang-method) and in the original publication (https://doi.org/10.1093/bioinformatics/btm087). Thanks!

@dvklopfenstein
Collaborator

dvklopfenstein commented Oct 15, 2020

Hello! I've got the first cut of Wang's semantic similarity. It is not yet ready for prime time, but will be soon. The current test passed on the data in Wang's GODag in Fig 1, with the expected results being from Wang's Table 1 for svalues and the semantic similarity value in Wang's section 2.1.
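For readers following along, here is a minimal sketch of the S-value and similarity calculation from Wang's method, run on a tiny made-up DAG. The term names (`A`, `B`, `X`, `R`) and the helper names `s_values` and `wang_sim` are hypothetical and this is not the GOATOOLS implementation. The edge weights of 0.8 for is_a and 0.6 for part_of are assumed here as the commonly used values.

```python
# Toy DAG with hypothetical terms: child -> [(parent, relationship), ...]
PARENTS = {
    "A": [("X", "is_a")],
    "B": [("X", "part_of")],
    "X": [("R", "is_a")],
    "R": [],
}
# Assumed edge weights (semantic contribution factors)
WEIGHTS = {"is_a": 0.8, "part_of": 0.6}


def s_values(term, parents, weights):
    """Return {ancestor: S-value} for `term`; the term itself has S = 1."""
    svals = {term: 1.0}
    stack = [term]
    while stack:
        child = stack.pop()
        for parent, rel in parents.get(child, ()):
            if rel not in weights:
                continue  # relationship not requested: do not traverse it
            cand = weights[rel] * svals[child]
            # Keep the largest contribution over all paths to this ancestor
            if parent not in svals or cand > svals[parent]:
                svals[parent] = cand
                stack.append(parent)
    return svals


def wang_sim(term_a, term_b, parents, weights):
    """Sum the S-values of shared ancestors, normalized by both semantic values."""
    sv_a = s_values(term_a, parents, weights)
    sv_b = s_values(term_b, parents, weights)
    common = sv_a.keys() & sv_b.keys()
    shared = sum(sv_a[t] + sv_b[t] for t in common)
    return shared / (sum(sv_a.values()) + sum(sv_b.values()))


sim_ab = wang_sim("A", "B", PARENTS, WEIGHTS)
print(s_values("A", PARENTS, WEIGHTS))
print(f"{sim_ab:.4f}")
```

In this toy case the shared ancestors are `X` and `R`, so the numerator is (0.8 + 0.6) + (0.64 + 0.48) = 2.52 and the denominator is 2.44 + 2.08 = 4.52.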

I still need to add functionality for using alternate GO IDs and plan to add a special plotting class for pairs of GO IDs, which will be useful to researchers and will help us in debugging tests.

So.... The effort to add Wang's semantic similarity is well underway.

Thank you so much for opening this issue. What a great idea.

@ThHarbig
Author

That's awesome, thank you so much! I'm excited to try it on my data.

@ejmolinelli

This is incredible, as I was in need of such a tool just a few days ago. Thanks for all your work! I'd be interested in beta testing.

@ejmolinelli

I also saw this repo, which hasn't been updated or maintained in about 2 years, but I've been using this so far.
https://github.com/mojaie/pygosemsim

@ejmolinelli

Also, there is an updated method to Wang's original similarity metric, as described here: https://pubmed.ncbi.nlm.nih.gov/26356015/

It addresses two issues with the original score: (1) the need for empirical weights, and (2) the computational cost of scoring many term pairs.

@dvklopfenstein
Collaborator

@ejmolinelli , Thanks so much for the link. I'll take a look at it.

@dvklopfenstein
Collaborator

dvklopfenstein commented Oct 27, 2020

I have been testing the new GOATOOLS Wang semantic similarity by comparing our values to the values generated by pygosemsim. I would have liked to compare the Wang values to Bioconductor's GOSemSim, but was not sure how to get the go-basic.obo that they used. If anybody knows how to do this, I will test our Wang values against theirs.

Our speed is a bit faster than pygosemsim overall.

I believe there is a mistake in pygosemsim: The only way to match GOATOOLS Wang values to the pygosemsim Wang values was to:

  1. Get the GO ancestors by traversing up all of the optional relationships, even if only the part_of relationship is requested by the user. The optional relationships include regulates, negatively_regulates, and positively_regulates.

  2. And then set the edge weights, such that the *regulates weights are set to zero:
    edge_weights = {
    'is_a': 0.8,
    'part_of': 0.6,
    'regulates': 0.0,
    'negatively_regulates': 0.0,
    'positively_regulates': 0.0,
    }

This is NOT the same as only using the part_of relationships to get the ancestors and then using the same edge_weights as above with the regulates relationships set to 0.0.

The correct way is to get ancestors by traversing up only the relationships that are specifically requested by the user, as is done in GOATOOLS's get_go2ancestors, not by traversing up all relationships and then zeroing out the edge weights for the Wang S-value calculations, as seems to happen in pygosemsim.
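To make the difference concrete, here is a toy sketch with hypothetical terms and helper names; it is neither pygosemsim nor GOATOOLS code. It compares (a) traversing only the requested relationships with (b) traversing every relationship and giving regulates a weight of 0.0. In (b), ancestors reached only through a regulates edge still enter the ancestor set with an S-value of zero, so they can show up among the shared ancestors and change the numerator.

```python
# Toy DAG with hypothetical terms: child -> [(parent, relationship), ...]
PARENTS = {
    "A": [("R", "is_a"), ("P", "regulates")],
    "B": [("Q", "is_a")],
    "P": [("Q", "is_a")],
    "Q": [("R", "is_a")],
    "R": [],
}
REQUESTED = {"is_a": 0.8, "part_of": 0.6}   # regulates is not traversed at all
ZEROED = dict(REQUESTED, regulates=0.0)      # regulates is traversed with weight 0


def s_values(term, parents, weights):
    """Return {ancestor: S-value}, following only relationships in `weights`."""
    svals = {term: 1.0}
    stack = [term]
    while stack:
        child = stack.pop()
        for parent, rel in parents.get(child, ()):
            if rel not in weights:
                continue
            cand = weights[rel] * svals[child]
            if parent not in svals or cand > svals[parent]:
                svals[parent] = cand
                stack.append(parent)
    return svals


def wang_sim(term_a, term_b, parents, weights):
    sv_a = s_values(term_a, parents, weights)
    sv_b = s_values(term_b, parents, weights)
    shared = sum(sv_a[t] + sv_b[t] for t in sv_a.keys() & sv_b.keys())
    return shared / (sum(sv_a.values()) + sum(sv_b.values()))


anc_requested = set(s_values("A", PARENTS, REQUESTED))
anc_zeroed = set(s_values("A", PARENTS, ZEROED))
sim_requested = wang_sim("A", "B", PARENTS, REQUESTED)
sim_zeroed = wang_sim("A", "B", PARENTS, ZEROED)

print(sorted(anc_requested), f"{sim_requested:.4f}")
print(sorted(anc_zeroed), f"{sim_zeroed:.4f}")
```

Here `Q` is an ancestor of `A` only through the regulates edge. With zeroed weights it contributes 0 from `A` but 0.8 from `B` to the numerator, so the two strategies return different similarities (1.44/4.24 versus 2.24/4.24) even though the semantic values of `A` and `B` are unchanged.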

I will be submitting Wang's semantic similarity calculated in GOATOOLS soon, with tests and documentation.

@dvklopfenstein
Collaborator

Another note: In pygosemsim, the function round is used multiple times, which is troublesome...

The troubles of Python's round function are reported in https://stackoverflow.com/questions/13479163/round-float-to-x-decimals.

And in May 2020, mdickinson (https://github.com/mdickinson) expressed a wish to deprecate the two-argument form of round in Python here: micropython/micropython#3516 (comment)
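A short illustration of the kind of surprise being referred to, using only the standard library. The tolerance-based comparison at the end is one common alternative for tests, shown as a suggestion rather than as what either package does.

```python
import math
from decimal import Decimal, ROUND_HALF_UP

# Built-in round() sends exact ties to the nearest even integer
print(round(0.5), round(1.5), round(2.5))  # 0 2 2

# 2.675 is stored as a binary float slightly below 2.675, so it rounds down
print(round(2.675, 2))  # 2.67

# Decimal gives explicit control over the rounding rule
half_up = Decimal("2.675").quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)
print(half_up)  # 2.68

# For tests, compare with a tolerance instead of rounding both sides
print(0.1 + 0.2 == 0.3)              # False
print(math.isclose(0.1 + 0.2, 0.3))  # True
```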

@ThHarbig
Author

ThHarbig commented Oct 28, 2020

Thanks for implementing this and for finding the flaws in pygosemsim! For GOSemSim: do you need the go-basic.obo to ensure that you are using the same ontology and that your results are comparable? GOSemSim uses GO.db, another R package, for the graph structure (https://bioconductor.org/packages/release/data/annotation/html/GO.db.html). Therefore you don't need to provide a go-basic.obo for the actual computation of semantic similarity. You just need a species, but since Wang's method is not IC-based it shouldn't matter which one you use.

@dvklopfenstein
Collaborator

For GOSemSim: do you need the go-basic.obo to ensure that you are using the same ontology and that your results are comparable?

Yes, I use the same go-basic.obo for both GOATOOLS and pygosemsim.

I would need the version used in R's GO.db package, of which their documentation says this:

Mappings were based on data provided by: Gene Ontology http://current.geneontology.org/ontology/gobasic.obo With a date stamp from the source of: 2020-05-02

I looked for the 2020-05-02 go-basic.obo on the gene ontology website, but only found source files that are run through a program to generate a go-basic.obo file. I don't believe that I have access to that program.
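When checking whether two tools are working from the same ontology release, one simple step is to read the `data-version` line from the OBO file header. The sketch below assumes the file carries such a header line; the helper name `read_obo_data_version` and the sample text are made up for illustration.

```python
import io


def read_obo_data_version(lines):
    """Return the `data-version` header value from OBO lines, or None."""
    for line in lines:
        line = line.strip()
        if line.startswith("["):  # header ends at the first stanza, e.g. [Term]
            break
        if line.startswith("data-version:"):
            return line.split(":", 1)[1].strip()
    return None


# Synthetic header standing in for a real OBO file
sample = io.StringIO(
    "format-version: 1.2\n"
    "data-version: releases/2020-05-02\n"
    "\n"
    "[Term]\n"
    "id: GO:0000001\n"
)
sample_version = read_obo_data_version(sample)
print(sample_version)  # releases/2020-05-02
```

For a file on disk, the same function can be called on an open file handle, for example `with open("go-basic.obo") as fin: read_obo_data_version(fin)`.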

Another consideration is that we don't know how R's GO.db package stores the GO DAG, so we can't be sure that we would be comparing exactly the same data. So it looks like we will not be able to compare against Bioconductor's GOSemSim.

That is a shame, because it is always useful to compare results. Regardless, the GOATOOLS implementation is working well. It matches the small amount of data in the Wang paper and agrees with pygosemsim when the test applies the two steps described in my comment above. I will submit it soon.

dvklopfenstein added a commit that referenced this issue Nov 24, 2020
    #183
2. Changed code to workaround new formats in Gene Ontology Consortium's annotations
   https://github.com/geneontology/go-annotation/issues/3373
   geneontology/go-annotation#3523
3. Moved reldepth calculations into its own module to support Wang's method and to give researcher ability to calc reldepths with subset of relationships
  geneontology/go-annotation#3523
@dvklopfenstein
Collaborator

dvklopfenstein commented Dec 1, 2020

I have implemented Wang's semantic similarity, documented it with examples in a Jupyter notebook, and added tests.

Please give it a try and let us know what you think. I am using another method in my thesis, but think Wang might be a good step in "future work." Thank you @ejmolinelli again for the link for the update on Wang's semantic similarity. This also looks like a step in the right direction and I would like to implement it.

@ThHarbig
Author

ThHarbig commented Dec 3, 2020

Thank you so much! I'll give it a try. The link to the notebook is not correct but I found it anyway! :)

@ThHarbig
Author

ThHarbig commented Dec 3, 2020

How can I access these changes? They're not released yet and I also cannot access them via the development version.

@dvklopfenstein
Collaborator

@tanghaibao , can you release a new version of GOATOOLS so that the newly implemented Wang's Semantic Similarity can be easily available to all?

@ThHarbig , I also corrected the link in the comment. Thanks for giving us the heads up.

@tanghaibao
Owner

@dvklopfenstein

Updated to v1.0.12 on PyPI.

@ThHarbig
Author

ThHarbig commented Dec 4, 2020

I still cannot find reference "semsim" after upgrading to v1.0.12

@dvklopfenstein
Collaborator

@ThHarbig , Thank you so much for commenting so quickly. You are correct.

The semsim directory is new. I neglected to add it to the setup.py file. I'll write a new test to ensure all directories in the goatools package are also in the setup.py so we don't have this situation happen again.
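A test along these lines could look roughly like the sketch below. It is a standard-library illustration, not the actual GOATOOLS test: it assumes every package directory contains an `__init__.py`, and the names `find_package_dirs`, `missing_packages`, and the throwaway `pkg` tree are hypothetical.

```python
import os
import tempfile


def find_package_dirs(root, top):
    """Return dotted names of every directory under root/top with an __init__.py."""
    found = set()
    for dirpath, _dirnames, filenames in os.walk(os.path.join(root, top)):
        if "__init__.py" in filenames:
            rel = os.path.relpath(dirpath, root)
            found.add(rel.replace(os.sep, "."))
    return found


def missing_packages(root, top, declared):
    """Return packages present on disk but absent from the declared list."""
    return sorted(find_package_dirs(root, top) - set(declared))


# Demo on a throwaway tree: 'pkg.semsim' exists on disk but is not declared
with tempfile.TemporaryDirectory() as tmp:
    for pkg in ("pkg", os.path.join("pkg", "semsim")):
        os.makedirs(os.path.join(tmp, pkg))
        open(os.path.join(tmp, pkg, "__init__.py"), "w").close()
    demo_missing = missing_packages(tmp, "pkg", declared=["pkg"])

print(demo_missing)  # ['pkg.semsim']
```

In a real test, `declared` would be the package list passed to `setup()`, and the assertion would be that `missing_packages(...)` returns an empty list.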

@dvklopfenstein
Collaborator

@tanghaibao, can you release a new version of GOATOOLS?

I added the missing new semantic similarity directories to the setup.py file and added a new test to ensure that we won't have this problem again.

@ThHarbig , thank you so much for taking your time to report this issue. We should not see it again with the addition of the new test.

@tanghaibao
Owner

@dvklopfenstein

Updated to v1.0.13 on PyPI.

@AlejandraGC

Hello. I am starting to work with goatools. Which Python version works best with this package? I am using PyCharm to begin with. Any suggestions? Thank you for the work!
