improper rotation when dnaA hits cross the zero coordinate #90
Comments
I think this is an important and evidently frequent issue. The $0.02 I can add here is that, very often on rotated chromosomes, it is very hard to get the |
Hi @marade @oschwengers , Thanks for this insightful discussion! I am partial to the suggestion by @marade of Luckily, with the change to MMseqs2, running it twice is no longer a compute concern, since it is almost instant (whereas BLAST typically took a while), so twice is fine. Something like:
Seem reasonable? In terms of actually doing this - maybe next week, I will need to find some time :) George |
Hey @gbouras13 , sorry, I guess I wasn't specific enough. My suggestion for To get some data into this dark matter, I extracted the intergenic regions for all 661k ATB/BakRep genomes around
I'm just posting this here to add my thoughts. However, I am aware that shifting the position by a certain offset would break the "dnaA at pos 1" convention, and that ultimately this might be a bug best addressed within P(y)rodigal itself. |
@gbouras13 sounds like a good plan. I appreciate your work on this excellent tool. Regarding @oschwengers comments about the final gene position, I think it's more important to not have features crossing the zero coordinate and have dnaA / rep genes / etc be the first gene than to have the first gene start at position 1. If it starts at say, position 100 or 400, that would still be reasonable in my opinion, so long as the other conditions are met. But as mentioned, perhaps this isn't best addressed with your tool. |
@oschwengers - I see your point now. It will propagate the bakta annotations as partial. I actually think this might be a good issue to raise with Martin in the pyrodigal repository - let me raise an issue. George |
Yeah, I think that if Pyrodigal were able to detect RBS related to pos 1 dnaA / repA genes, leaving the pos 1 convention unchanged might be best. I totally agree that changing such a convention might be tricky, so I totally understand your reluctance on this point. Sorry for kind of hijacking this issue with this related topic. I'll move my discussion over to the Pyrodigal issue you've opened. Coming back to the initial issue, thanks a lot for addressing this. This will be a very nice improvement to an already super cool tool! Thanks. |
Hi @oschwengers @marade, I got around to implementing a fix to allow detection of the desired genes across contig ends. It will be available in v1.1.0 shortly. Please let me know if it gives the desired output - I have added some tests in CI and tried it out on a bunch locally, but of course that is no guarantee of anything :) George |
Testing this change, so far the results are indeed better. I imagine over time we might encounter cases where the new scheme doesn't work, but I haven't found any yet. Thanks for this improvement! |
Pseudomonas aeruginosa chromosomes are often improperly rotated. For example, many chromosome annotations hosted at NCBI have dnaA spanning the 0 coordinate, with a telltale join annotation like 'join(6478288..6478686,1..1146)'. I could cite dozens or maybe hundreds of these, but here are just a few:
https://www.ncbi.nlm.nih.gov/nuccore/NZ_CP147557.1?report=gbwithparts&log$=seqview
https://www.ncbi.nlm.nih.gov/nuccore/NZ_CP140615.1?report=gbwithparts&log$=seqview
https://www.ncbi.nlm.nih.gov/nuccore/NZ_CP100653.1?report=gbwithparts&log$=seqview
I think this happens because P. aeruginosa has multiple dnaA boxes, which can fool annotation software into guessing that the dnaA gene starts later in the sequence than it actually does. Some details here:
https://pmc.ncbi.nlm.nih.gov/articles/PMC4119464/#S2
Most of these improperly rotated chromosomes can be corrected by rotating clockwise by 399bp, and dnaA then starts after the zero coordinate, as is the convention for bacterial annotations.
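To make the 399 bp figure concrete: in 'join(6478288..6478686,1..1146)' the part before the origin covers 6478686 - 6478288 + 1 = 399 bases. The sketch below is only an illustration, not how dnaapler or NCBI handle this. It assumes a forward-strand feature with exactly two segments (no complement()), and the helper names `wrap_offset` and `rotate_to_feature_start` are hypothetical.

```python
import re


def wrap_offset(location: str, genome_length: int) -> int:
    """Return how many bases of an origin-spanning feature lie before position 1.

    Assumes a forward-strand location of the form 'join(a..N,1..b)', where N is
    the genome length. Returns 0 if the location does not have that shape.
    """
    parts = [(int(a), int(b)) for a, b in re.findall(r"(\d+)\.\.(\d+)", location)]
    if len(parts) != 2:
        return 0
    (start, end), (second_start, _) = parts
    if end != genome_length or second_start != 1:
        return 0
    return end - start + 1


def rotate_to_feature_start(seq: str, offset: int) -> str:
    """Move the last `offset` bases to the front so the feature starts at position 1."""
    if offset == 0:
        return seq
    return seq[-offset:] + seq[:-offset]


# Offset for the join annotation quoted above (genome length taken from the join itself).
offset = wrap_offset("join(6478288..6478686,1..1146)", 6478686)

# Toy 10 bp "chromosome" whose gene ATGG spans the origin: join(8..10,1..1).
toy = "GCCCCCCATG"
toy_offset = wrap_offset("join(8..10,1..1)", len(toy))
rotated = rotate_to_feature_start(toy, toy_offset)
```

Here `offset` comes out as 399, matching the rotation described above, and the toy sequence is rotated so that it begins with ATGG.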
Looking at the dnaapler MMseqs output for the first example (GN04821), unfortunately the first 300 hits are wrong, and we only start getting good hits with sp~B7V0N6~DNAA_PSEA8 matching GN04821 at start position 6961242. If I rotate the contig by 2 kb clockwise and run dnaapler again, sp~B7V0N6~DNAA_PSEA8 is the top hit. This highlights a problem with using linear references and alignment for circular problems like rotation: the rotation of the contig before MMseqs is run leads dnaapler to the wrong conclusion.
To prevent problems like this, I wonder if you might chain hits crossing the zero coordinate? Or perhaps do one MMseqs run, then rotate all contigs halfway, then another MMseqs run? Or is there a better solution?
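A rough sketch of the "search, rotate halfway, search again" idea, to show the coordinate bookkeeping involved. This is not dnaapler's implementation: plain `str.find` stands in for the MMseqs search, the function names are hypothetical, and it assumes the query is shorter than half the contig.

```python
def rotate_left(seq: str, shift: int) -> str:
    """Rotate a circular sequence so that original index `shift` becomes index 0."""
    shift %= len(seq)
    return seq[shift:] + seq[:shift]


def to_original(pos: int, shift: int, length: int) -> int:
    """Map a 0-based position on the rotated contig back to the original contig."""
    return (pos + shift) % length


def find_circular(seq: str, query: str):
    """Two-pass search: once as-is, once after rotating the contig halfway.

    A hit that spans the origin is invisible to the first (linear) pass but
    becomes contiguous after the half rotation. Returns the 0-based start on
    the original contig, or None if neither pass finds the query.
    """
    hit = seq.find(query)  # pass 1: contig as given
    if hit != -1:
        return hit
    shift = len(seq) // 2
    hit = rotate_left(seq, shift).find(query)  # pass 2: rotated halfway
    if hit == -1:
        return None
    return to_original(hit, shift, len(seq))


# Toy contig where the "gene" AAAATG starts at index 10 and wraps past the end.
contig = "TGCCCCCCCCAAAA"
spanning = find_circular(contig, "AAAATG")
internal = find_circular(contig, "CCCC")
missing = find_circular(contig, "GGGG")
```

In this toy case the first pass misses AAAATG, while the second pass finds it and maps the start back to index 10 on the original contig. With a real aligner, the hits from both passes would still need to be compared (for example by score) before choosing one.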