New taxon format causing our GAF readers to crash: Curie(namespace='NCBITaxon', identity='1280') #287

Closed
dvklopfenstein opened this issue Oct 22, 2020 · 7 comments

@dvklopfenstein

Hello,

Thank you for the great annotations. They are extremely helpful.

Our GPAD reader tests are FAILing due to a taxon in goa_human.gpad, downloaded just now from http://current.geneontology.org/annotations, having this format in 526 lines:

Curie(namespace='NCBITaxon', identity='<some integer>')

rather than the format of either:

  • taxon:9606
  • NCBITaxon:9606

Should we support this new format, which will make reading GPAD files slower, or can the 526 GPAD file taxon lines be changed from:

  • Curie(namespace='NCBITaxon', identity='<some integer>')

to:

  • NCBITaxon:NNNN

cc: @tanghaibao @JudoWill @dvklopfenstein

@tanghaibao

tanghaibao commented Oct 23, 2020

@dvklopfenstein

Does it only show up in goa_human.gpad, or does it occur in other GPAD files as well?

Parsing this shouldn't slow you down much. If matching your original taxon:int pattern fails, you would then try the Curie pattern as a last attempt. If it occurs in only 526 lines, this shouldn't noticeably affect your run time.
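
The fallback described above could look roughly like the following. This is a minimal sketch, not the goatools implementation: the function name `parse_taxon` and both regular expressions are assumptions made for illustration, based only on the three taxon forms quoted in this issue.

```python
import re

# Fast path: the two forms the reader already expects,
# e.g. "taxon:9606" or "NCBITaxon:9606".
_PLAIN = re.compile(r"^(?:taxon|NCBITaxon):(\d+)$")

# Fallback: the Python repr reported in this issue,
# e.g. "Curie(namespace='NCBITaxon', identity='1280')".
_CURIE = re.compile(r"^Curie\(namespace='NCBITaxon', identity='(\d+)'\)$")


def parse_taxon(field):
    """Return the taxon ID as an int (hypothetical helper, not a goatools API).

    The plain pattern is tried first; the Curie pattern is only
    attempted when the plain pattern does not match.
    """
    match = _PLAIN.match(field) or _CURIE.match(field)
    if match is None:
        raise ValueError(f"unrecognized taxon field: {field!r}")
    return int(match.group(1))
```

Because `or` short-circuits, the second regular expression runs only for fields that fail the first, so the extra cost is paid on the unusual lines rather than on every line. For example, `parse_taxon("NCBITaxon:9606")` returns `9606`.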

But yeah I do think it's worth bringing this up here just to see if this still matches the gpad file spec / standards.

Haibao

@dvklopfenstein
Author

dvklopfenstein commented Oct 23, 2020

Hi Haibao,

Our parsers sometimes find small errors in the gpad files. I am reporting this in case it is an error that could break parsers other than our own, which may silently fail to process the data correctly.

The last time we found an error reading a gpad file, it turned out to be a showstopper bug with missing data in the annotation files (https://github.com/geneontology/go-annotation/issues/2885).

So seeing a new format in the gpad files, I think it is best to report it. Of course, we can always change the parser to read the new format. But just in case it will affect other researchers, it is best to report it.

@dvklopfenstein dvklopfenstein changed the title Taxon format causing our GAF readers to crash: Curie(namespace='NCBITaxon', identity='1280') New taxon format causing our GAF readers to crash: Curie(namespace='NCBITaxon', identity='1280') Nov 21, 2020
dvklopfenstein referenced this issue in tanghaibao/goatools Nov 24, 2020
    #183
    2. Changed code to workaround new formats in Gene Ontology Consortium's annotations
       https://github.com/geneontology/go-annotation/issues/3373
       geneontology/go-annotation#3523
    3. Moved reldepth calculations into its own module to support Wang's method and to give researcher ability to calc reldepths with subset of relationships
       geneontology/go-annotation#3523
@ValWood

ValWood commented Nov 29, 2020

This one also needs migrating to the helpdesk:

@thomaspd @cmungall @pgaudet

@pgaudet pgaudet transferred this issue from geneontology/go-annotation Nov 30, 2020
@suzialeksander
Collaborator

Hi @dvklopfenstein,

Thank you for reporting this. According to the current GPAD file format (you can find this at http://geneontology.org/docs/gene-product-association-data-gpad-format/ ), we expect files to only have the taxon number without the other characters. I will pass this info on so that the issue can be resolved upstream and/or a check can be added for this column.
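
A check of the kind mentioned above might be sketched as follows. This is an assumption-laden illustration rather than the check the GO pipeline uses: the function name `find_curie_repr_lines` is hypothetical, and it scans every tab-separated field instead of assuming which column holds the taxon.

```python
def find_curie_repr_lines(lines):
    """Yield (line_number, field) for each field that contains 'Curie('.

    Hypothetical helper: it flags any tab-separated field holding the
    leaked Python repr, without assuming a particular column index.
    """
    for number, line in enumerate(lines, start=1):
        if line.startswith("!"):  # skip header/comment lines such as "!gpa-version: ..."
            continue
        for field in line.rstrip("\n").split("\t"):
            if "Curie(" in field:
                yield number, field
```

Passing an open file handle for `goa_human.gpad` as `lines` would list each offending line number together with the malformed field, which makes it straightforward to count the affected lines per contributing group.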

@suzialeksander
Collaborator

@dougli1sqrd, would this be a Rules issue?

or

@kltm @pgaudet, is this an issue from upstream files? Although most of the lines are attributed to UniProt, lines from ParkinsonsUK-UCL, BHF-UCL, CAFA, ARUK-UCL, and DIBU also have the incorrect format in goa_human.gpad.

@kltm
Member

kltm commented Dec 2, 2020

Exploration for this issue is underway at biolink/ontobio#489.

@suzialeksander
Collaborator

Great, thanks. @dvklopfenstein, thank you again for reporting this, and I apologise that it was buried before we could address the issue. If you find any other issues in the future, please feel free to report them in this repo. I am closing this ticket to avoid duplicating biolink/ontobio#489. Please follow the ontobio ticket; it sounds like they are working on a fix.
