Remove (don't load) models that are a single node #202

Open
ukemi opened this issue Sep 12, 2022 · 12 comments
Comments

@ukemi

ukemi commented Sep 12, 2022

In some cases, a Reactome pathway doesn't have any reactions that are directly associated with it. Instead it has a collection of subpathways under it. In those cases, the parent gets imported as a single node with nothing else associated with it. We should not load these.
eg R-HSA-71291
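
A minimal sketch of the check described above, assuming a hypothetical in-memory event hierarchy. The `Event` class, its `kind` and `children` fields, and the IDs in the toy data are illustrative only; they are not the Reactome data model or the actual pathways2GO code.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Event:
    """Hypothetical stand-in for a Reactome event (pathway or reaction)."""
    stable_id: str
    kind: str  # "pathway" or "reaction"
    children: List["Event"] = field(default_factory=list)


def is_grouping_only(pathway: Event) -> bool:
    """Return True if the pathway has no reaction directly under it."""
    return pathway.kind == "pathway" and not any(
        child.kind == "reaction" for child in pathway.children
    )


# Toy data; the IDs are made up.
sub = Event("P-SUB", "pathway", [Event("R-1", "reaction")])
parent = Event("P-PARENT", "pathway", [sub])

print(is_grouping_only(parent))
print(is_grouping_only(sub))
```

Under this definition a pathway with no children at all would also count as grouping-only; whether that case should be handled separately is not settled here.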

@deustp01
Collaborator

Here's a weedy suggestion for how to proceed. Agreed that a pathway with no content except subpathways yields a GO-CAM with no informative content, but before simply discarding these it would be prudent to get a list of all the single-node pathways for manual inspection to confirm that nothing is lost. I expect that everything on the list will be OK for removal. Even where a curator has made one of these as a placeholder and plans to fill in individual reaction children along with the pathway children, when and if that happens the pathway will then pass the rule proposed here and will get loaded OK.

And a naive question. Is the Reactome event hierarchy somehow preserved in the exported GO-CAM structure? I guess that it is not and in that case, these empty grouping pathways do not have a useful linking role.

@ukemi
Author

ukemi commented Sep 13, 2022

And a naive question. Is the Reactome event hierarchy somehow preserved in the exported GO-CAM structure? I guess that it is not and in that case, these empty grouping pathways do not have a useful linking role.

This is a great question that has now set me thinking. The Reactome hierarchy is not preserved because of the inability to discriminate between is_a and part_of (another thing that I think we could brainstorm about at a face-to-face; I think you had some good ideas about this). However, let's say that there is a Reactome pathway that has no reactions, but only pathways as children. If the parent pathway has an asserted GO BP term mapped and none of the children do, it would be safe to put the parent pathway's BP on the generic children. It doesn't matter if the child is a subclass or a part of the parent because we won't represent that. The parent BP will just go to the new top node of the model. I'm not sure how many of these exist, but I think I've seen some.
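
One reading of the rule above, as a rough sketch: copy the parent's BP term onto the child models only when the parent has an asserted BP and none of the children do. The function name, the dictionary layout, and the placeholder term label are assumptions for illustration.

```python
from typing import Dict, Optional


def inherit_parent_bp(
    parent_bp: Optional[str],
    child_bps: Dict[str, Optional[str]],
) -> Dict[str, Optional[str]]:
    """Map each child to a BP term.

    The parent's BP is copied down only if the parent has one and
    no child already has its own asserted BP.
    """
    no_parent_bp = parent_bp is None
    some_child_has_bp = any(bp is not None for bp in child_bps.values())
    if no_parent_bp or some_child_has_bp:
        return dict(child_bps)  # leave the children unchanged
    return {child: parent_bp for child in child_bps}


# Placeholder labels, not real GO identifiers.
children = {"child-A": None, "child-B": None}
print(inherit_parent_bp("PARENT_BP", children))
```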

To follow up, Peter sent an e-mail to Guanming:

On the pathways2GO side, it would be really useful to make this distinction – for example, “Glucose metabolism” is_a “Metabolism of carbohydrates”, and “Glycolysis” is_a “Glucose metabolism”, but both the pathway “Regulation of Glucokinase by Glucokinase Regulatory Protein” and the reactionlikeEvent “HK1,2,3,GCK phosphorylate Glc to form G6P” are part_of “Glycolysis”. Right now all pathways are connected to their contained events by a single relation, hasEvent. At the level of the data model, how hard or dangerous would it be to replace this single relation with two, so that pathways can have either is_a or part_of relationships or both to their contained events?
If this change at the level of the data model seems OK, then we can begin to think about how to handle the legacy clean-up of existing pathways. This will certainly be a very big job and if the data model change is OK, then David and I can work with curators to look for ways to make it as easy as possible. I guess / hope that most pathways will contain only one kind (is_a or part_of) children but we will need to look very carefully.

who replied:

If you recall, the is_a relationship existed originally in our old data model, probably about 15 years ago or longer. At a certain point, in order to keep our model simpler, we lumped both the has_a (called hasComponent, if I remember correctly) and is_a (isMember?) relationships into this hasEvent slot. Now hasEvent is overloaded with both meanings.
It is doable to spin off part of hasEvent into a separate isEvent relationship for some containing pathways. However, this may bring a lot of headaches for both visualization (e.g. showing isA pathways and the reactions contained there differently from hasA pathways) and data analysis (e.g. pathway enrichment analysis: how to split isA pathways from the others). So it is really a can of worms.

to which Peter replied:

One idea, not really worked through, to mention and add to the GitHub ticket for future discussion before I forget.
The first suggestion was “top-down”: annotate a parent event to indicate kinds of children. That breaks current Reactome web displays and data mining as Guanming said. A fairly clunky alternative might be “bottom-up”: an event has an optional slot to indicate the pathways of which it is an instance and another to indicate the pathways of which it is a part.
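
The "bottom-up" idea can be pictured with a small sketch in which each event carries two optional slots alongside the existing top-down `hasEvent` list. The slot names `instance_of` and `part_of` are hypothetical, as is the `relation_to` helper; the example events come from the e-mail quoted earlier in this comment.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Event:
    """Hypothetical event record; not the Reactome schema."""
    name: str
    has_event: List[str] = field(default_factory=list)    # existing top-down slot
    instance_of: List[str] = field(default_factory=list)  # hypothetical: is_a parents
    part_of: List[str] = field(default_factory=list)      # hypothetical: part_of parents


def relation_to(child: Event, parent_name: str) -> str:
    """Report how a child says it relates to a named parent."""
    if parent_name in child.instance_of:
        return "is_a"
    if parent_name in child.part_of:
        return "part_of"
    return "unspecified"


glycolysis = Event(
    "Glycolysis",
    has_event=["Regulation of Glucokinase by Glucokinase Regulatory Protein"],
    instance_of=["Glucose metabolism"],
)
regulation = Event(
    "Regulation of Glucokinase by Glucokinase Regulatory Protein",
    part_of=["Glycolysis"],
)

print(relation_to(glycolysis, "Glucose metabolism"))
print(relation_to(regulation, "Glycolysis"))
```

Because the new slots sit on the child, tools that only read `has_event` would see the same hierarchy as before.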

Guanming:

The bottom-up approach may still leave us quite a lot to handle on the Reactome side in terms of software tools: how to visualize newly added is_a pathways on the web site, how to exclude or include them for data analysis (e.g. gene set enrichment analysis), and how to export them in other formats (e.g. gene sets for MSigDB, NCBI, BioPAX, etc.).

Peter:

I’m imagining that the bottom-up annotations of reactions would be in addition to the top-down “hasEvent” annotations of pathways, and data analysis and web layout tools could ignore them. We would need to capture them in BioPAX, though.

Guanming:

To the best of my knowledge, BioPAX does not support an is_a relationship or distinguish between has_part and is_a. One thing we could try is to use GO’s is_a and part_of relationships by overlaying them onto Reactome’s events.

@deustp01
Collaborator

Now deferred from things to do in connection with GO-CAM build from Reactome 82 - make this a headache for another day

@ukemi
Author

ukemi commented Nov 11, 2022

QA for @ukemi and @deustp01. Once this is done, it should only eliminate the grouping pathways that have no reactions associated with them.

@dustine32
Collaborator

I'm trying to knock this ticket out as part of getting the comprehensive list of "done" Reactome GO-CAMs with no molecular event placeholder activities. When applying the criteria "contains no molecular events," I kept seeing these empty parent pathway models containing only a single BP node in the results. Simply blocking the write-out of these model files would clean up these results.

So I applied this filter and did a before-after comparison to get the list of models (351 total) that would be removed, attached below:
pathways_dropped_no_activities.txt

@ukemi @deustp01 Please take a look and let me know if you catch anything wrong with this list. Thanks!
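
For readers following along, the filter plus before-after comparison amounts to a set difference. This is only a sketch under assumed names: `models_to_write`, `dropped_models`, and the per-model activity counts are hypothetical and the model IDs are made up.

```python
from typing import Dict, List, Set


def models_to_write(activity_counts: Dict[str, int]) -> Set[str]:
    """Keep only models that contain at least one activity."""
    return {model_id for model_id, n in activity_counts.items() if n > 0}


def dropped_models(before: Set[str], after: Set[str]) -> List[str]:
    """Models present before the filter but absent after it."""
    return sorted(before - after)


# Toy input: model ID -> number of activities in the generated model.
activity_counts = {"model-A": 4, "model-B": 0, "model-C": 0, "model-D": 1}

before = set(activity_counts)
after = models_to_write(activity_counts)
dropped = dropped_models(before, after)
print(dropped)  # ['model-B', 'model-C']
```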

@ukemi
Author

ukemi commented Oct 3, 2024

@deustp01

@deustp01
Collaborator

deustp01 commented Oct 3, 2024

To track checking, made a Google Doc, "Getting rid of models that are a single node" in the "Getting rid of molecular events" folder.

@deustp01
Collaborator

deustp01 commented Oct 4, 2024

I've now checked the list.

Bottom line - 350 pathways should be blocked as Dustin proposes and one should be edited out of existence in Reactome.

Weedy details -

Most of the pathways on the "Getting rid of models that are a single node" list have only one or more other pathways as children (indicated with a simple “no” in column 2 in the list / table). While they provide useful grouping information for a future GO-CAM structure that shows causal relationships among pathways, blocking their write-out now is correct.

A subset of these have only a single pathway as a child (“no” in column 2, “has only one pathway child” in column 3). Dustin should block these just like the simple-no ones. They are flagged because in Reactome as in GO it does not make much sense for a higher-level grouping term to have only a single child, so a useful side-effect of this review of candidates for blocking is a list of pathways in Reactome that are candidates for rearrangement / merging to eliminate unneeded steps in the Reactome pathway hierarchy.

One pathway, R-HSA-1630316, contains a single reaction child but this can be fixed on the Reactome side by putting the reaction in a different pathway, which actually it belongs to.

A number of pathways are composed of multiple drug (one of the participants is flagged as a drug) or stealth-drug (one of the participants is a set all of whose members are drugs) reactions. The GO-CAM script correctly suppresses the generation of activity units from these reactions, resulting in an empty pathway. These too should be blocked at the write-out stage. I’ve flagged them separately (“YES” in column 2 and drug verbiage in column 3) because I guess that Dustin may want to reinforce the say-no-to-drugs tests to filter them out more elegantly?
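
The drug and stealth-drug definitions above can be sketched as a small recursive test. How the GO-CAM script actually implements this is not shown in this thread, so the `Participant` class, its fields, and both function names are assumptions.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Participant:
    """Hypothetical reaction participant; a set has a non-empty `members` list."""
    name: str
    is_drug: bool = False
    members: List["Participant"] = field(default_factory=list)


def involves_drug(p: Participant) -> bool:
    """True for a drug, or for a set all of whose members are drugs."""
    if p.is_drug:
        return True
    return bool(p.members) and all(involves_drug(m) for m in p.members)


def is_drug_reaction(participants: List[Participant]) -> bool:
    """True if any participant is a drug or a stealth-drug set."""
    return any(involves_drug(p) for p in participants)


# Toy participants with made-up names.
drug = Participant("drug A", is_drug=True)
protein = Participant("protein X")
drug_set = Participant("set of drugs", members=[drug, Participant("drug B", is_drug=True)])
mixed_set = Participant("mixed set", members=[drug, protein])

print(is_drug_reaction([protein, drug_set]))
print(is_drug_reaction([protein, mixed_set]))
```

A pathway whose reactions all pass this test would end up with no activity units, which matches the empty models flagged in the list.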

@ukemi
Author

ukemi commented Oct 7, 2024

While they provide useful grouping information for a future GO-CAM structure that shows causal relationships among pathways, blocking their write-out now is correct.

This is indeed true. One task to make use of this in GO-CAM would be to try to figure out a method for determining if the child pathways are is_a or part_of children of the grouping pathway. Right now it is a mix.

@deustp01
Collaborator

deustp01 commented Oct 7, 2024

figure out a method for determining if the child pathways are is_a or part_of children of the grouping pathway.

While we've been aware of the distinction, we never had to deal with it within the Reactome event hierarchy, so now there's a big legacy clean-up problem. My hunch is that it would be straightforward for a human curator to make the classification correctly, but tedious. @dustine32 do you see any hope here for a script that could sort out the two classes of pathway reliably, and identify pathways that have both is_a and part_of children, because I expect we have some - there's nothing to prohibit them?

@dustine32
Collaborator

do you see any hope here for a script that could sort out the two classes of pathway reliably, and identify pathways that have both is_a and part_of children

Sorry @ukemi @deustp01, I just noticed this request as I came to announce I was ready to merge the single-node pathway fix code. For the two aspects of this request:

  1. These two classes of pathway (I assume, the single-node, zero activity pathways) would be (1) contains only a single child pathway and (2) contains multiple pathway children, right?
  2. Discerning pathway hierarchy relationships of is_a vs part_of - I do not see any distinction of these relations in the BioPAX file nor on the Reactome site. Could you point me to wherever you're seeing these relationship types and I can try tracing from there?

Also, I think this still means I'm good to start merging the single-node pathway filtering code into master. I'll at least open a PR.

@ukemi
Author

ukemi commented Oct 31, 2024

This might be something worth looking at in New York. I have looked before and discriminating between is_a and part_of was not obvious. However, one thing I didn't do was look at preceding reaction relationships. Another possibility is to look at the asserted BPs that are on the pathways and interrogate the ontology for relationships, but I'm not sure this will work. I think there are a lot of 'partial' pathways that are asserted to be the pathway. We might want to have a look at some of these in NYC. They are not asserted, so you won't see them in the BioPAX.
Here is an example:
All of the subpathways of "Signaling by Receptor Tyrosine Kinases" (R-HSA-9006934) are subtypes of the parent.
But the children of "Signaling by EGFR" (R-HSA-177929) are almost all part_of.
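
A rough sketch of the "interrogate the ontology" idea: given the asserted BP of a child pathway and of its parent, walk upward through a toy ontology and report whether the parent is reached by is_a edges only or through at least one part_of edge. The ontology dictionary and term labels are invented for illustration and are not real GO content, and as noted above, partial pathways asserted to be the whole pathway could make the result misleading.

```python
from typing import Dict, List, Optional, Tuple

# child term -> list of (relation, parent term); toy labels only.
ONTOLOGY: Dict[str, List[Tuple[str, str]]] = {
    "child BP A": [("is_a", "parent BP")],
    "child BP B": [("part_of", "parent BP")],
    "child BP C": [("is_a", "child BP B")],
}


def relation_between(
    child: str, parent: str, ontology: Dict[str, List[Tuple[str, str]]]
) -> Optional[str]:
    """Return 'is_a' if a path of only is_a edges leads from child to parent,
    'part_of' if every path found crosses a part_of edge, else None."""
    stack = [(child, False)]  # (term, path so far crossed a part_of edge)
    seen = set()
    found_part_of = False
    while stack:
        term, via_part_of = stack.pop()
        if (term, via_part_of) in seen:
            continue
        seen.add((term, via_part_of))
        if term == parent and term != child:
            if not via_part_of:
                return "is_a"
            found_part_of = True
            continue
        for relation, target in ontology.get(term, []):
            stack.append((target, via_part_of or relation == "part_of"))
    return "part_of" if found_part_of else None


print(relation_between("child BP A", "parent BP", ONTOLOGY))
print(relation_between("child BP C", "parent BP", ONTOLOGY))
```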

dustine32 added a commit that referenced this issue Oct 31, 2024
…hwys

Skip model writeout if no events or functions; for #202