Bio::Tools::GFF doesn't write eukaryotic multi-exon genes correctly #369

krobison13 · 2022-04-07T18:10:49Z

When writing GFF, the same frame is assigned to every range in a multi-exon gene, rather than correctly assigning 0, 1, or 2 to specify the frame of each range.

Twitter note including image of the offending loop

hyphaltip · 2022-04-08T02:27:34Z

Thanks for reporting a bug rather than just a Twitter rant!

Maybe others can look at this too, @cjfields. It's been 15+ years with that code. I think assumptions about features vs. locations are obscuring the problem you describe.

If one is reading and writing multi-exon genes as individual features, which is the typical way the frame is encoded, this all works as planned. But if a single feature is encoded as a split location, the frame isn't necessarily encoded in a multi-location GenBank file.

If you wanted it to be computed from the data, that might be helpful, but it also makes assumptions about the generic feature being a CDS. This goes back to pre-GFF3 days, when there were multiple interpretations of how parent/child relationships should be encoded, from GFF1 -> GFF2 -> GFF2.5/GTF etc.

I think much better validators and correctors for GFF (perhaps http://genometools.org/ ) have implemented more dedicated logic.

Maybe you can show the input data that you used. Are you converting GenBank to GFF and expecting the frame to be computed, and the assumption that it is a CDS with a frame to be carried through?

krobison13 · 2022-04-08T14:47:51Z

The zip file has a simple GenBank-formatted entry and a simple program that exposes the problem. The correct sequence of frames is 0,0,2,1,0.

gff-bug-reveal.zip
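
For readers following along, here is a minimal sketch of how a per-segment frame could be derived from segment lengths alone. It assumes the GFF3 "phase" convention (the number of bases to skip at the start of a segment to reach the next full codon), a complete CDS that starts on a codon boundary, and segments listed in transcription order. It is written in Python rather than against the Bio::Tools::GFF API, the function name `cds_phases` is made up for illustration, and the lengths are hypothetical values chosen to reproduce a 0,0,2,1,0 pattern, not the coordinates from the attached entry.

```python
def cds_phases(segment_lengths):
    """Return the GFF3 phase of each CDS segment.

    segment_lengths: segment lengths in transcription order
    (for a minus-strand gene, list the segments 3'-most coordinate first).
    Assumes the first segment starts on a codon boundary (phase 0).
    """
    phases = []
    phase = 0
    for length in segment_lengths:
        phases.append(phase)
        # Bases left over at the end of this segment after skipping
        # `phase` bases and consuming whole codons.
        leftover = (length - phase) % 3
        # The next segment must supply the rest of the split codon.
        phase = (3 - leftover) % 3
    return phases


# Hypothetical lengths, chosen only to illustrate the calculation.
hypothetical_lengths = [90, 100, 100, 100, 60]
print(cds_phases(hypothetical_lengths))  # [0, 0, 2, 1, 0]
```

The point of the sketch is that the value depends on the cumulative length of the preceding segments, so a single value copied onto every range can only be right when every segment length happens to be a multiple of three.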

cjfields · 2022-04-08T15:47:13Z

Yeah, I agree w/ @hyphaltip. I suspect there's bit rot from prior logical assumptions that have changed over time. I also vaguely recall Bio::Tools::GFF was to be deprecated in preference to Bio::DB::SeqFeature, though I'm not sure that is still the case.

Would it be worth looking into Bio::Tools::GFF, or should we check Bio::DB::SeqFeature? If @scottcain is around, maybe he would know? I think there was a GenBank-to-GFF conversion script for Bio::DB::SeqFeature (maybe within the GBrowse2 code?); we could check to see if it gives the correct frames.

scottcain · 2022-04-08T17:02:17Z

Ugh. Bio::Tools::GFF was old and janky a long time ago and should probably be marked as such, since I don't think it is likely to have improved with age. It is hard to remember the logic that went into that bit of code (I don't recall if I wrote it--I hope not--but I certainly might have!). I think @cjfields is right about there having been a GB to GFF3 script, but I don't recall where it lived.

There is a script with GBrowse, https://github.com/GMOD/GBrowse/blob/master/bin/load_genbank.pl, but it loads into a Bio::DB::GFF database (so, GFF2 and mysql or postgres). I don't have the time to do the code archeology to determine if it handles strand better.

hyphaltip · 2022-04-12T01:17:19Z

Chris Mungall wrote a gbk to gff script that used feature or name overlap to assign genes, mRNAs, and CDSs to a common parent group. It should be in the scripts folder. I think Tools::GFF predates you, Scott, and was from before we really had the same workflow and full feature implementations. I think split-location support was an add-on; previously it round-tripped features where locations were explicitly start/stop only. A more DB-style interface with Lincoln's DB::GFF was one solution. Now I would use the NCBI tbl format more aggressively and map to GFF / GBK / ASN.1 from there anyway. Jason


carandraug · 2022-04-12T10:22:55Z

I think @cjfields is right about there having been a GB to GFF3 script, but I don't recall where it lived.

There is bin/bp_genbank2gff3 which is in this repo and part of the BioPerl distribution.

There was also a bin/bp_genbank2gff, which was moved to the Bio-DB-GFF distribution.

cjfields · 2022-09-12T22:01:22Z

Apologies to @krobison13 about the wait, but all of us 'old-timers' are pretty time constrained these days.

Coming back around to this, I think we should deprecate Bio::Tools::GFF, particularly if there are better options, but we should definitely point in the right direction regardless of what we decide.
