You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Hi! Thanks for creating this wonderful library. I followed the instructions at https://github.com/tanghaibao/goatools/blob/main/notebooks/cell_cycle.ipynb to get GO-associated genes, but soon found out that there are some discrpancies between the results (in terms of # of ontology terms I get and also # of genes) derived here and from the website. I used go.obo (downloaded from here) instead of go-basic.obo to avoid data discrepancies.
I first noticed the number of GO terms hit are different. There are too many "cell cycle" hits so as an example, let's change "cell cycle" to "lipid metabolic". There are 9 hits on the website but 14 via the library. I manually examined the discrepancies and it looks like the library is capturing a lot of "<something>lipid metabolic" which the website does not, but cannot explain why the library hits "regulation of lipid metabolic process" (GO:0019216) and "membrane lipid metabolic process" (GO:0006643) but not "regulation of membrane lipid metabolic process" (GO:1905038).
Then, I changed "lipid metabolic" to " lipid metabolic" (adding a blank in the front) to avoid hitting the "<sth>lipid" stuff. Unexpectedly, there are only 6 hit terms left for the library (still 9 for the website). Weirdly, the library hit ether lipid metabolic process (GO:0046485) but not neutral lipid metabolic process (GO:0006638). There are more examples and I just couldn't understand why.
Yet again, to maximize replicability, I directly fed GO:0007049 (cell cycle) into srchhelp.add_children_gos() and I also went into that term through the website and ensured additional filtering (screenshot attached below). However, there are several hundred (unique) genes derived from the website but only 41 are derived from the library.
The code for feeding cell cycle directly is basically,
withopen(fout_allgos, "w") aslog:
# Search for 'cell cycle' in GO terms# gos_cc_all = srchhelp.get_matching_gos(cell_cycle_all, prt=log)# Find any GOs matching 'cell cycle-independent' (e.g., "lysosome")# gos_no_cc = srchhelp.get_matching_gos(cell_cycle_not, gos=gos_cc_all, prt=log)# Remove GO terms that are not "cell cycle" GOs# gos = gos_cc_all.difference(gos_no_cc)# Add children GOs of cell cycle GOs# gos_all = srchhelp.add_children_gos(gos)gos_all=srchhelp.add_children_gos({'GO:0007049'})
# Get Entrez GeneIDs for cell cycle GOsgeneids=srchhelp.get_items(gos_all)
I would appreciate any help! I am new to GO so I might be doing many things wrongly. Please feel free to correct anything.
The text was updated successfully, but these errors were encountered:
jasperhyp
changed the title
Discrepancy between using GoSearch and direct search on geneontology.org
Discrepancies between using GoSearch and direct search on geneontology.org
Jul 11, 2022
Hi! Thanks for creating this wonderful library. I followed the instructions at https://github.com/tanghaibao/goatools/blob/main/notebooks/cell_cycle.ipynb to get GO-associated genes, but soon found out that there are some discrpancies between the results (in terms of # of ontology terms I get and also # of genes) derived here and from the website. I used
go.obo
(downloaded from here) instead ofgo-basic.obo
to avoid data discrepancies.I first noticed the number of GO terms hit are different. There are too many "cell cycle" hits so as an example, let's change "cell cycle" to "lipid metabolic". There are 9 hits on the website but 14 via the library. I manually examined the discrepancies and it looks like the library is capturing a lot of "<something>lipid metabolic" which the website does not, but cannot explain why the library hits "regulation of lipid metabolic process" (GO:0019216) and "membrane lipid metabolic process" (GO:0006643) but not "regulation of membrane lipid metabolic process" (GO:1905038).
Then, I changed "lipid metabolic" to " lipid metabolic" (adding a blank in the front) to avoid hitting the "<sth>lipid" stuff. Unexpectedly, there are only 6 hit terms left for the library (still 9 for the website). Weirdly, the library hit ether lipid metabolic process (GO:0046485) but not neutral lipid metabolic process (GO:0006638). There are more examples and I just couldn't understand why.
Yet again, to maximize replicability, I directly fed
GO:0007049
(cell cycle) intosrchhelp.add_children_gos()
and I also went into that term through the website and ensured additional filtering (screenshot attached below). However, there are several hundred (unique) genes derived from the website but only 41 are derived from the library.The code for feeding cell cycle directly is basically,
I would appreciate any help! I am new to GO so I might be doing many things wrongly. Please feel free to correct anything.
The text was updated successfully, but these errors were encountered: