Wangs semantic similarity method #183

ThHarbig · 2020-09-09T12:37:16Z

Hi,

I'm working on a web application that uses goatools functions in its backend, and I'd like to provide further semantic similarity methods to the users. I wanted to ask how hard and how feasible it would be to implement Wang's semantic similarity in goatools. I thought about doing it myself and creating a pull request, but I do not know whether it is feasible, because in Wang's method all ancestor terms in the DAG contribute to the semantics of a term.

Thanks!

dvklopfenstein · 2020-09-24T21:50:51Z

This is a fantastic idea. Let me look into it.

Do you have additional information?

We are also looking into implementing Yang's add-on for using the terms below the terms of interest and have alpha code developed that is looking good, but right now we are too busy to add it due to the surrounding tests and notebooks that would also need to be added.

Please excuse the delay in response; I have been busy finishing a publication and thesis.

GREAT idea about Wang's semantic similarity. I will check it out.

ThHarbig · 2020-09-25T07:55:51Z

That sounds good! Yang's semantic similarity method is also interesting, since I'm especially looking for methods that are not IC based. Wang's semantic similarity is described nicely in the documentation of the R package GOSemSim (https://www.bioconductor.org/packages/release/bioc/vignettes/GOSemSim/inst/doc/GOSemSim.html#wang-method) and in the original publication (https://doi.org/10.1093/bioinformatics/btm087). Thanks!
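
For readers who have not seen the method, here is a minimal sketch of Wang's S-values and term similarity as defined in the publication linked above. It is not the GOATOOLS implementation: the toy DAG, the term names (`T_root`, `T_a`, ...), and the function names are hypothetical and chosen only for illustration. The weights 0.8 for is_a and 0.6 for part_of are the ones proposed in the paper.

```python
# Edge weights proposed by Wang et al.: is_a = 0.8, part_of = 0.6.
EDGE_WEIGHTS = {"is_a": 0.8, "part_of": 0.6}

# Hypothetical toy DAG (not real GO terms): child -> list of (parent, relationship).
TOY_DAG = {
    "T_root": [],
    "T_mid1": [("T_root", "is_a")],
    "T_mid2": [("T_root", "part_of")],
    "T_a": [("T_mid1", "is_a"), ("T_mid2", "part_of")],
    "T_b": [("T_mid1", "is_a")],
}


def s_values(term, dag, weights):
    """Return {ancestor: S-value} for `term`, including the term itself (S = 1)."""
    svals = {term: 1.0}
    stack = [term]
    while stack:
        child = stack.pop()
        for parent, rel in dag[child]:
            candidate = weights[rel] * svals[child]
            # Keep the maximum contribution over all paths leading to the parent.
            if candidate > svals.get(parent, 0.0):
                svals[parent] = candidate
                stack.append(parent)
    return svals


def wang_sim(term_a, term_b, dag, weights):
    """Wang similarity: shared S-values divided by the total semantic values."""
    sv_a = s_values(term_a, dag, weights)
    sv_b = s_values(term_b, dag, weights)
    common = sv_a.keys() & sv_b.keys()
    numerator = sum(sv_a[t] + sv_b[t] for t in common)
    denominator = sum(sv_a.values()) + sum(sv_b.values())
    return numerator / denominator


if __name__ == "__main__":
    print(s_values("T_a", TOY_DAG, EDGE_WEIGHTS))
    print(wang_sim("T_a", "T_b", TOY_DAG, EDGE_WEIGHTS))
```

Every ancestor of a term receives an S-value, which is why the whole ancestor subgraph contributes to the semantics of that term.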

dvklopfenstein · 2020-10-15T01:20:04Z

Hello! I've got the first cut of Wang's semantic similarity. It is not yet ready for prime time, but will be soon. The current test passed on the data in Wang's GODag in Fig 1, with the expected results being from Wang's Table 1 for svalues and the semantic similarity value in Wang's section 2.1.

I still need to add functionality for using alternate GO IDs and plan to add a special plotting class for pairs of GO IDs, which will be useful to researchers and will help us in debugging tests.

So.... The effort to add Wang's semantic similarity is well underway.

Thank you so much for opening this issue. What a great idea.

ThHarbig · 2020-10-15T13:46:35Z

That's awesome, thank you so much! I'm excited to try it on my data.

ejmolinelli · 2020-10-19T17:13:47Z

This is incredible, as I was in need of such a tool just a few days ago. Thanks for all your work! I'd be interested in beta testing.

ejmolinelli · 2020-10-19T17:19:40Z

I also saw this repo, which hasn't been updated or maintained in about 2 years, but I've been using this so far.
https://github.com/mojaie/pygosemsim

ejmolinelli · 2020-10-19T21:27:41Z

Also, there is an updated method to Wang's original similarity metric, as described here: https://pubmed.ncbi.nlm.nih.gov/26356015/

It addresses two issues with the original score; (1) the need for empirical weights, and (2) computational cost for many pairwise term scores.

dvklopfenstein · 2020-10-19T22:03:44Z

@ejmolinelli , Thanks so much for the link. I'll take a look at it.

dvklopfenstein · 2020-10-27T23:56:52Z

I have been testing the new GOATOOLS Wang semantic similarity by comparing our values to the values generated by pygosemsim. I would have liked to compare the Wang values to Bioconductor's GOSemSim, but was not sure how to get the go-basic.obo that they used. If anybody knows how to do this, I will test our Wang values against theirs.

Our speed is a bit faster than pygosemsim overall.

I believe there is a mistake in pygosemsim: The only way to match GOATOOLS Wang values to the pygosemsim Wang values was to:

1. Get the GO ancestors by traversing up all of the optional relationships, even if only the part_of relationship is requested by the user. The optional relationships include regulates, negatively_regulates, and positively_regulates.
2. Then set the edge weights such that the *regulates weights are zero:

    edge_weights = {
        'is_a': 0.8,
        'part_of': 0.6,
        'regulates': 0.0,
        'negatively_regulates': 0.0,
        'positively_regulates': 0.0,
    }

This is NOT the same as only using the part_of relationships to get the ancestors and then using the same edge_weights as above with the regulates relationships set to 0.0.

The correct way is to get ancestors by traversing up only the relationships that are specifically requested by the user, as is done in GOATOOLS's get_go2ancestors, not by traversing up all relationships and then zeroing out the edge weights for the Wang S-value calculations, as seems to happen in pygosemsim.
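
To illustrate the distinction, here is a small sketch of relationship-restricted ancestor traversal. It is not the code of GOATOOLS's get_go2ancestors or of pygosemsim: the toy DAG, the term names, and the `get_ancestors` helper are hypothetical, and the sketch assumes that is_a is always followed while the other relationships are opt-in.

```python
# Hypothetical toy DAG (not real GO terms): term -> {relationship: set of parents}.
TOY_DAG = {
    "T_root": {},
    "T_whole": {"is_a": {"T_root"}},
    "T_target": {"is_a": {"T_root"}},
    "T_part": {"is_a": {"T_root"}, "part_of": {"T_whole"}},
    "T_reg": {"is_a": {"T_part"}, "regulates": {"T_target"}},
}


def get_ancestors(term, dag, relationships):
    """Collect ancestors reachable over is_a plus the requested relationships."""
    allowed = {"is_a"} | set(relationships)  # assumption: is_a is always traversed
    ancestors = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for rel, parents in dag[current].items():
            if rel not in allowed:
                continue  # never walk up an edge the user did not ask for
            for parent in parents - ancestors:
                ancestors.add(parent)
                stack.append(parent)
    return ancestors


if __name__ == "__main__":
    requested = get_ancestors("T_reg", TOY_DAG, {"part_of"})
    everything = get_ancestors(
        "T_reg", TOY_DAG,
        {"part_of", "regulates", "negatively_regulates", "positively_regulates"},
    )
    print(sorted(requested))
    print(sorted(everything - requested))
```

In this toy graph, traversing all relationships pulls in `T_target`, which a part_of-only request never visits; zeroing the regulates weight afterwards changes that term's S-value but does not remove it from the ancestor set.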

I will be submitting Wang's semantic similarity calculated in GOATOOLS soon, with tests and documentation.

dvklopfenstein · 2020-10-28T10:38:36Z

Another note: In pygosemsim, the function round is used multiple times, which is troublesome...

The troubles of Python's round function are reported in https://stackoverflow.com/questions/13479163/round-float-to-x-decimals.

And in May 2020, mdickinson (https://github.com/mdickinson) wished for the deprecation of the two-argument form of round in Python here: micropython/micropython#3516 (comment)
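
A short sketch of why round can be troublesome when comparing floating-point similarity values, using only the standard library. The helper names `round_half_up` and `values_match` are hypothetical; they show one possible alternative and are not what GOATOOLS or pygosemsim do.

```python
import math
from decimal import Decimal, ROUND_HALF_UP


def round_half_up(text, places):
    """Round a decimal string half-up, avoiding binary floating-point surprises."""
    quantum = Decimal(1).scaleb(-places)  # e.g. places=2 -> Decimal('0.01')
    return Decimal(text).quantize(quantum, rounding=ROUND_HALF_UP)


def values_match(value_a, value_b, abs_tol=1e-9):
    """Compare two floats with a tolerance instead of rounding both first."""
    return math.isclose(value_a, value_b, abs_tol=abs_tol)


if __name__ == "__main__":
    # 2.675 is stored as a binary fraction slightly below 2.675.
    print(round(2.675, 2))                     # 2.67
    # Exact ties are rounded to the nearest even integer.
    print(round(0.5), round(1.5), round(2.5))  # 0 2 2
    print(round_half_up("2.675", 2))           # 2.68
    print(values_match(0.1 + 0.2, 0.3))        # True
```

Comparing values with a tolerance sidesteps the question of how each library rounds its results.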

ThHarbig · 2020-10-28T12:43:03Z


Thanks for implementing this and for finding the flaws in pygosemsim! For GOSemSim: do you need the go-basic.obo to ensure that you are using the same ontology version and that your results are comparable? GOSemSim uses GO.db, another R package, for the graph structure (https://bioconductor.org/packages/release/data/annotation/html/GO.db.html). Therefore you don't need to provide a go-basic.obo for the actual computation of semantic similarity. You just need a species, but since Wang's method is not IC based, it shouldn't matter which one you use.

dvklopfenstein · 2020-10-28T16:56:02Z

For goSemSim: do you need the go-basic.obo to ensure that you are using the same and that your results are comparable?

Yes, I use the same go-basic.obo for both GOATOOLS and pygosemsim.

I would need the version used in R's GO.db package, of which their documentation says this:

Mappings were based on data provided by: Gene Ontology http://current.geneontology.org/ontology/gobasic.obo With a date stamp from the source of: 2020-05-02

I looked for the 2020-05-02 go-basic.obo on the gene ontology website, but only found source files that are run through a program to generate a go-basic.obo file. I don't believe that I have access to that program.

Another item to consider is that we don't know how the GO DAG is stored in R's GO.db package, so we can't really know whether we would be comparing the exact same data. It looks like we will not be able to compare to Bioconductor's GOSemSim.

That is a shame, because it is always useful to compare results. Regardless, the GOATOOLS implementation is working well. It matches the small amount of data in the Wang paper and compares well to pygosemsim, provided the test applies the two adjustments described in my comment above. I will submit it soon...

#183
2. Changed code to work around new formats in Gene Ontology Consortium's annotations: https://github.com/geneontology/go-annotation/issues/3373 geneontology/go-annotation#3523
3. Moved reldepth calculations into their own module to support Wang's method and to give researchers the ability to calculate reldepths with a subset of relationships: geneontology/go-annotation#3523

dvklopfenstein · 2020-12-01T20:06:33Z

I have implemented Wang's semantic similarity, documented it with examples in a Jupyter notebook, and added tests.

Please give it a try and let us know what you think. I am using another method in my thesis, but think Wang might be a good step in "future work." Thank you @ejmolinelli again for the link for the update on Wang's semantic similarity. This also looks like a step in the right direction and I would like to implement it.

ThHarbig · 2020-12-03T13:04:03Z

Thank you so much! I'll give it a try. The link to the notebook is not correct but I found it anyway! :)

ThHarbig · 2020-12-03T17:05:03Z

How can I access these changes? They're not released yet and I also cannot access them via the development version.

dvklopfenstein · 2020-12-03T18:20:35Z

@tanghaibao , can you release a new version of GOATOOLS so that the newly implemented Wang's Semantic Similarity can be easily available to all?

@ThHarbig , I also corrected the link in the comment. Thanks for giving us the heads up.

tanghaibao · 2020-12-03T20:34:39Z

@dvklopfenstein

Updated to v1.0.12 on PyPI.

ThHarbig · 2020-12-04T11:50:02Z

I still cannot find reference "semsim" after upgrading to v1.0.12

dvklopfenstein · 2020-12-04T14:51:13Z

@ThHarbig , Thank you so much for commenting so quickly. You are correct.

The semsim directory is new. I neglected to add it to the setup.py file. I'll write a new test to ensure all directories in the goatools package are also in the setup.py so we don't have this situation happen again.
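
A test of this kind could look roughly like the sketch below. This is not the test that was added to GOATOOLS: the helper names are hypothetical, and it assumes that a package directory is any directory containing an `__init__.py` and that setup.py declares its packages as a list of dotted names.

```python
import os
import tempfile


def find_packages_on_disk(root, top_package):
    """Return dotted names of all directories under root/top_package with an __init__.py."""
    found = set()
    for dirpath, _dirnames, filenames in os.walk(os.path.join(root, top_package)):
        if "__init__.py" in filenames:
            found.add(os.path.relpath(dirpath, root).replace(os.sep, "."))
    return found


def missing_packages(declared, root, top_package):
    """Return packages present on disk but absent from the declared list."""
    return sorted(find_packages_on_disk(root, top_package) - set(declared))


def build_demo_tree(root):
    """Create a hypothetical layout: mypkg, mypkg/semsim, mypkg/semsim/termwise."""
    for parts in (("mypkg",), ("mypkg", "semsim"), ("mypkg", "semsim", "termwise")):
        pkg_dir = os.path.join(root, *parts)
        os.makedirs(pkg_dir, exist_ok=True)
        with open(os.path.join(pkg_dir, "__init__.py"), "w", encoding="utf-8"):
            pass  # an empty __init__.py is enough to mark a package


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        build_demo_tree(tmp)
        declared = ["mypkg"]  # stands in for the PACKAGES list in setup.py
        print(missing_packages(declared, tmp, "mypkg"))
```

A test would then assert that the list of missing packages is empty, so a newly added directory fails the test suite instead of silently being left out of a release.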

dvklopfenstein · 2020-12-04T15:55:19Z

@tanghaibao, can you release a new version of GOATOOLS?

I added the missing new semantic similarity directories to the setup.py file and added a new test to ensure that we won't have this problem again.

@ThHarbig , thank you so much for taking your time to report this issue. We should not see it again with the addition of the new test.

tanghaibao · 2020-12-04T16:07:26Z

@dvklopfenstein

Updated to v1.0.13 on PyPI.
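
After upgrading, one way to check whether the installed release actually ships the new directory is to ask Python whether the subpackage can be located. This is a generic sketch using the standard library only; the `has_module` helper is hypothetical, and `goatools.semsim` is simply the directory name mentioned above.

```python
import importlib.util


def has_module(dotted_name):
    """Return True if the module or subpackage can be located, False otherwise."""
    try:
        return importlib.util.find_spec(dotted_name) is not None
    except ImportError:
        # Raised (as ModuleNotFoundError) when a parent package is not installed.
        return False


if __name__ == "__main__":
    print("goatools installed:", has_module("goatools"))
    print("goatools.semsim available:", has_module("goatools.semsim"))
```

If the first line prints True and the second False, the installed release is missing the semsim subpackage.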

AlejandraGC · 2020-12-04T20:41:43Z

Hello. I am starting to work with goatools. Which Python version works best with this package? For now I am using PyCharm. Any suggestions? Thank you for the work!

dvklopfenstein added a commit that referenced this issue Sep 28, 2020

Add obo for Wang's Fig 1 for testing #183

038976f

dvklopfenstein added a commit that referenced this issue Oct 15, 2020

First implementation of Wangs semantic similarity method. #183

ef14f99

dvklopfenstein added a commit that referenced this issue Oct 19, 2020

Add support for alt GO IDs when using Wangs semantic similarity. #183

0890974

dvklopfenstein added a commit that referenced this issue Nov 24, 2020

Added generic function needed for Wang's semantic similarity

a98cdd5

#183

dvklopfenstein added a commit that referenced this issue Nov 24, 2020

Speed up Wang's semantic similarity calculations.

b7b63bb

#183

dvklopfenstein added a commit that referenced this issue Nov 30, 2020

Creating documentation for Wang's Semantic Similarity. #183

218bafb

dvklopfenstein added a commit that referenced this issue Dec 1, 2020

Add link to example of Wang's semantic similarity. #183

8dc3dec

dvklopfenstein mentioned this issue Dec 4, 2020

Add test to ensure PACKAGES in setup.py show all subdirs in goatools dirs #190

Closed

dvklopfenstein added a commit that referenced this issue Dec 4, 2020

Add test to ensure PACKAGES in setup.py are complete. #183 #190

d9cdbb3

tanghaibao closed this as completed May 23, 2021
