Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

support detection of RBS across linear chromosome breakpoint ? #65

Open
gbouras13 opened this issue Dec 11, 2024 · 11 comments
Open

support detection of RBS across linear chromosome breakpoint ? #65

gbouras13 opened this issue Dec 11, 2024 · 11 comments
Labels
enhancement New feature or request

Comments

@gbouras13
Copy link

Hi @althonos ,

I hope you've been well mate! I have an issue/discussion to raise based on some related discussions here on my circular contig reorientation tool Dnaapler gbouras13/dnaapler#90 . The default and desired behaviour of dnaapler is to reorient the chromosome to begin with the start codon of dnaA at coordinate 1 (there is some unrelated discussion in that issue about issues with reorientation that you can ignore :) ).

However, @oschwengers makes the very good point that for bacterial chromosomes that are re-oriented this way, when pyrodigal is run (e.g. in annotation with bakta), it may fail to call the dnaA gene properly, as the RBS will be on the other end of the (linearised) circular contig.

I am personally not a fan of changing dnaapler so that it offsets starting with dnaA with e.g. 50 or 100bps like Oliver is suggesting (due to the diversity in RBSs and intergenic distances before dnaA across all bacteria), and therefore I propose that the best answer would be for prodigal to detect RBSs across edges (if you haven't already implemented such functionality) - is this possible in you view (and of course would you want to implement it)? And if not, do you have other suggestions?

George

@oschwengers
Copy link

Thanks @gbouras13 for opening this here, and hi @althonos !
Let me first comment that I haven't systematically tested this, so I cannot tell whether this is a systematic error in Pyrodigal. But anecdotally I can tell that I've seen many cases where chromosomes and even more often plasmids seem to have truncated dnaA/repA genes predicted at position 1. Looking closer at them, we often see that they actuallyhave intact aa sequences but Pyrodigal fails to call them properly due to a missing RBS that is located a couple of bases upstream - in theses cases of a linearised circular sequence located at the opposing end of the sequence.

Having this fixed or improved would be a huge plus since this affects all re-oriented replicons (which is part of many standard workflows), especially the very important dnaA and repA genes, and in general others as well.

@althonos althonos added the enhancement New feature or request label Dec 11, 2024
@althonos
Copy link
Owner

Hi @gbouras13, hi @oschwengers !

That sounds like a reasonable request, and something that should be able to be handled when running in closed mode, or probably with and additional flag so that it stays backwards compatible. The RBS site detection is executed for every start codon independently so I'm pretty positive (although I'd need to actually check the code more in details) that this can be done in a way that wraps over sequence starts and ends.

Would you have an example genome you can point me to that I can use for testing? I'm not sure when I can get that started, between the thesis writing (urgh) and the other projects I have running, but if it's straightforward enough I can probably one shot it a day I have time.

@gbouras13
Copy link
Author

I'll leave an example to @oschwengers as he has the experience on that front with bakta - and no stress at all, I hope the writeup is going well (I have no doubt it is even if you might think otherwise)!

George

@oschwengers
Copy link

Thanks @althonos for taking on this. I screened a couple of test genomes for Bakta and found this interesting Sinorhizobium meliloti SM11 genome from RefSeq: GCF_000218265.1.

It contains 1 chromosome and 2 plasmids, all complete:

$ grep '>' GCF_000218265.1.fna 
>NC_017325.1 Sinorhizobium meliloti SM11, complete sequence
>NC_017327.1 Sinorhizobium meliloti SM11 plasmid pSmeSM11c, complete sequence
>NC_017326.1 Sinorhizobium meliloti SM11 plasmid pSmeSM11d, complete sequence

Due to Bakta, all of them contain dnaA or repA genes at position 1 (the plasmids have further repAs in betwen):

$ grep -P "(dnaA|repA)" GCF_000218265.1.tsv 
NC_017325.1     cds     1       1524    +       KOLOGA_00001    dnaA    chromosomal replication initiator protein DnaA  COG:COG0593, COG:L, GO:0003677, GO:0003688, GO:0005524, GO:0005737, GO:0006270, GO:0006275, GO:0016887, SO:0001217, UniRef:UniRef50_Q8UIH1, UniRef:UniRef90_P35890
NC_017327.1     cds     1       1176    +       KOLOGA_03894    repA    plasmid partitioning protein RepA       COG:COG1192, COG:DN, SO:0001217, UniRef:UniRef50_A0A937CR46, UniRef:UniRef90_A0A2U8GGJ0
NC_017326.1     cds     1       1197    +       KOLOGA_05509    repA    plasmid partitioning protein RepA       COG:COG1192, COG:DN, SO:0001217, UniRef:UniRef50_A0A0H3GGI3, UniRef:UniRef90_A0A2U8GBJ3

And for all predicted genes at pos 1 Pyrodigal reports them as being truncated:

$ pyrodigal -i GCF_000218265.1.fna | grep partial=10
NC_017325.1     pyrodigal_v3.6.3        CDS     1       1524    194.2   +       0       ID=NC_017325.1_1;partial=10;start_type=Edge;rbs_motif=None;rbs_spacer=None;gc_cont=0.628;conf=99.99;score=194.16;cscore=190.94;sscore=3.22;rscore=0.00;uscore=3.22;tscore=0.00;
NC_017327.1     pyrodigal_v3.6.3        CDS     1       1176    127.3   +       0       ID=NC_017327.1_1;partial=10;start_type=Edge;rbs_motif=None;rbs_spacer=None;gc_cont=0.577;conf=100.00;score=127.25;cscore=124.03;sscore=3.22;rscore=0.00;uscore=3.22;tscore=0.00;
NC_017326.1     pyrodigal_v3.6.3        CDS     1       1197    169.1   +       0       ID=NC_017326.1_1;partial=10;start_type=Edge;rbs_motif=None;rbs_spacer=None;gc_cont=0.602;conf=100.00;score=169.13;cscore=165.91;sscore=3.22;rscore=0.00;uscore=3.22;tscore=0.00;

I hope this helps to pin and fix this. It would be huge plus for Pyrodigal and thus, for bacterial gene prediction in general...

@althonos
Copy link
Owner

Digging a bit and it seems that this is not just caused by the RBS not being found, but by Prodigal actually assigning start codons at positions 0, 1, and 2 to always be counted as "edge" codons on closed=False even if they are a real start codon...

@althonos
Copy link
Owner

althonos commented Jan 8, 2025

Okay, so tried my hand at both this and #67, I refactored the node extraction and the scoring code so that it can process circular sequences. I also think I will eventually include a fix for hyattpd/Prodigal#101 because it was much harder to reproduce the buggy behaviour than making a streamlined implementation from scratch.

On the test plasmid I used the repA1 gene is called properly with its RBS motif and not considered an edge gene. Shifting the origin of the circular sequence does not impact anything and the gene appears to be called properly still, but I have to do more testing, as Prodigal does a couple of suspicious penalization/bonus in some places based on the distance of nodes to the sequence edges, and I need to make sure it doesn't happen for circular sequences.

pSmeSM11d       pyrodigal_v3.6.3        CDS     1       1197    169.8   +       0       ID=pSmeSM11d_1;partial=00;start_type=ATG;rbs_motif=GGAG/GAGG;rbs_spacer=5-10bp;gc_cont=0.602;conf=100.00;score=169.76;cscore=159.81;sscore=9.94;rscore=7.63;uscore=-1.33;tscore=3.65;

@gbouras13
Copy link
Author

Legendary as always @althonos - would you like some more test data? Would think it would be good to try out some full chromosomes and phages too. @oschwengers and I have plenty if you want it.

George

@althonos
Copy link
Owner

althonos commented Jan 31, 2025

Ok so progress update: so far I manage to get all nodes to score properly. Then, the problem of the connection scoring algorithm is that you will end up with different scores depending on which codon you start with, which means ultimately you end up with different genes depending on the sequence breakpoint, which is not great. My idea to counter that is to always "rotate" the list of codons to make them always start with the highest scoring codon; this would make the search deterministic and invariant over rotation of the circular sequence.

So far this works, and on my test sequence I am always getting the same genes after this tweak. The problem I have with this approach however is with the overlapping codons bonus: to resolve overlapping codons, Prodigal has an extra pre-processing step that happens before the connection scoring is started, so that potential operons can be identified. The problem now is that because of the circular sequences I need the algorithm to be able to identify overlapping codons across the sequence breakpoint, otherwise this creates another set of issues. Once I manage to do that, I think I'll be able to release an alpha version for circular topologies.

That could maybe be a fun little application note, I'd be happy to write something together with y'all?

@oschwengers
Copy link

oschwengers commented Feb 2, 2025

Wow! Sounds very cool and promising! I'd like to volunteer as an alpha-tester ;-) Writing something up also sounds very reasonable, especially thinking about all the hard work, fixes, patches novel features you've added. Of course I'd be more than happy to contribute my $0.02 and help writing something, but it goes w/o saying that you deserve the honor ;-)

@gbouras13
Copy link
Author

Super awesome as per usual @althonos - happy to help out as well and agree with @oschwengers. I have a couple of ideas where we could try this out for some cool application to find novel CDS vanilla P(y)rodigal missed or truncates (e.g. gene-calling the whole PLSDB https://academic.oup.com/nar/article/50/D1/D273/6439675).

George

@gbouras13
Copy link
Author

Another example to test out on gbouras13/pharokka#379

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

3 participants