support detection of RBS across linear chromosome breakpoint ? #65
Comments
Thanks @gbouras13 for opening this here, and hi @althonos ! Having this fixed or improved would be a huge plus since this affects all re-oriented replicons (which is part of many standard workflows), especially the very important |
Hi @gbouras13, hi @oschwengers ! That sounds like a reasonable request, and something that should be possible to handle when running in closed mode, or probably with an additional flag so that it stays backwards compatible. The RBS site detection is executed for every start codon independently, so I'm pretty positive (although I'd need to check the code in more detail) that this can be done in a way that wraps over sequence starts and ends. Would you have an example genome you can point me to that I can use for testing? I'm not sure when I can get started, between the thesis writing (urgh) and the other projects I have running, but if it's straightforward enough I can probably one-shot it on a day when I have time.
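To illustrate the wrap-around idea mentioned above, here is a minimal sketch of how the region upstream of a start codon could be read on a circular replicon. It is not Pyrodigal's or Prodigal's actual code: the function name `upstream_window`, its parameters, and the toy sequence are all hypothetical, and only the forward strand is handled.

```python
def upstream_window(seq: str, start: int, length: int, circular: bool = False) -> str:
    """Return up to `length` bases immediately upstream of the 0-based
    position `start` on the forward strand.

    Hypothetical helper for illustration only; not part of any real API.
    """
    if not 0 <= start < len(seq):
        raise ValueError("start is outside the sequence")
    if circular:
        # On a circular replicon, indices before position 0 wrap around
        # to the end of the linearised sequence via the modulo.
        length = min(length, len(seq))
        n = len(seq)
        return "".join(seq[(start - length + i) % n] for i in range(length))
    # On a linear sequence the window is simply clipped at position 0.
    return seq[max(0, start - length):start]


# Toy replicon (made up): a start codon at position 0 and a
# Shine-Dalgarno-like motif near the 3' end of the linearised sequence.
toy = "ATGAAACGTTAACCCCAGGAGGTAACC"
linear_view = upstream_window(toy, 0, 10)                   # nothing to scan
circular_view = upstream_window(toy, 0, 10, circular=True)  # wraps to the end
```

With the start codon at coordinate 1, the linear view is empty, so no RBS motif can be scored, whereas the circular view recovers the bases that sit at the other end of the linearised contig.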
I'll leave an example to @oschwengers as he has the experience on that front with bakta - and no stress at all, I hope the writeup is going well (I have no doubt it is even if you might think otherwise)! George |
Thanks @althonos for taking on this. I screened a couple of test genomes for Bakta and found this interesting Sinorhizobium meliloti SM11 genome from RefSeq: It contains 1 chromosome and 2 plasmids, all complete:
Due to Bakta, all of them contain
And for all predicted genes at pos 1 Pyrodigal reports them as being truncated:
I hope this helps to pin down and fix this. It would be a huge plus for Pyrodigal and thus for bacterial gene prediction in general...
Digging a bit and it seems that this is not just caused by the RBS not being found, but by Prodigal actually assigning start codons at positions 0, 1, and 2 to always be counted as "edge" codons on |
Okay, so tried my hand at both this and #67, I refactored the node extraction and the scoring code so that it can process circular sequences. I also think I will eventually include a fix for hyattpd/Prodigal#101 because it was much harder to reproduce the buggy behaviour than making a streamlined implementation from scratch. On the test plasmid I used the
Legendary as always @althonos - would you like some more test data? Would think it would be good to try out some full chromosomes and phages too. @oschwengers and I have plenty if you want it. George |
Ok so progress update: so far I have managed to get all nodes to score properly. The remaining problem with the connection scoring algorithm is that you end up with different scores depending on which codon you start with, which ultimately means you get different genes depending on the sequence breakpoint, which is not great. My idea to counter that is to always "rotate" the list of codons so that it starts with the highest-scoring codon; this would make the search deterministic and invariant to rotation of the circular sequence. So far this works, and on my test sequence I am always getting the same genes after this tweak. The problem I have with this approach, however, is the overlapping codons bonus: to resolve overlapping codons, Prodigal has an extra pre-processing step that runs before the connection scoring starts, so that potential operons can be identified. Because the sequences are circular, I now need that step to identify overlapping codons across the sequence breakpoint as well, otherwise this creates another set of issues. Once I manage to do that, I think I'll be able to release an alpha version for circular topologies. That could maybe be a fun little application note, I'd be happy to write something together with y'all?
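The "rotate to the highest-scoring codon" idea can be sketched as follows. This is only an illustration of the canonical-rotation trick, not the implementation discussed in this thread: the `Node` tuple, the labels, and the scores are hypothetical, and real nodes would carry coordinates that also have to be shifted. Note that if two nodes tie for the best score, this naive version picks whichever comes first in the input, so it is no longer rotation-invariant without an extra tie-breaking rule.

```python
from typing import List, Tuple

# Hypothetical node representation: (label, score).
Node = Tuple[str, float]


def rotate_to_best(nodes: List[Node]) -> List[Node]:
    """Rotate a circular list of nodes so that it starts with the
    highest-scoring node, preserving the circular order."""
    if not nodes:
        return []
    best = max(range(len(nodes)), key=lambda i: nodes[i][1])
    return nodes[best:] + nodes[:best]


# Made-up nodes as they would appear for one choice of breakpoint.
nodes = [("a", 1.0), ("b", 5.0), ("c", 2.0), ("d", 3.0)]

# Simulate every possible breakpoint by rotating the input list, then
# canonicalise each rotation; all of them should collapse to one order.
canonical_forms = [
    rotate_to_best(nodes[k:] + nodes[:k]) for k in range(len(nodes))
]
```

Whatever breakpoint the input was linearised at, the downstream search would then see the nodes in the same order, which is what makes the result independent of the breakpoint in this toy setting.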
Wow! Sounds very cool and promising! I'd like to volunteer as an alpha-tester ;-) Writing something up also sounds very reasonable, especially thinking about all the hard work, fixes, patches, and novel features you've added. Of course I'd be more than happy to contribute my $0.02 and help with the writing, but it goes w/o saying that you deserve the honor ;-)
Super awesome as per usual @althonos - happy to help out as well and agree with @oschwengers. I have a couple of ideas where we could try this out for some cool application to find novel CDS vanilla P(y)rodigal missed or truncates (e.g. gene-calling the whole PLSDB https://academic.oup.com/nar/article/50/D1/D273/6439675). George |
Another example to test out on gbouras13/pharokka#379 |
Hi @althonos ,
I hope you've been well mate! I have an issue/discussion to raise based on some related discussions here on my circular contig reorientation tool Dnaapler gbouras13/dnaapler#90 . The default and desired behaviour of dnaapler is to reorient the chromosome to begin with the start codon of dnaA at coordinate 1 (there is some unrelated discussion in that issue about issues with reorientation that you can ignore :) ).
However, @oschwengers makes the very good point that for bacterial chromosomes that are re-oriented this way, when pyrodigal is run (e.g. in annotation with bakta), it may fail to call the dnaA gene properly, as the RBS will be on the other end of the (linearised) circular contig.
I am personally not a fan of changing dnaapler so that it offsets the start of dnaA by e.g. 50 or 100 bp like Oliver is suggesting (due to the diversity in RBSs and intergenic distances before dnaA across all bacteria), and therefore I propose that the best answer would be for prodigal to detect RBSs across edges (if you haven't already implemented such functionality) - is this possible in your view (and of course would you want to implement it)? And if not, do you have other suggestions?
George