improper rotation when dnaA hits cross the zero coordinate #90

marade · 2024-12-10T00:06:28Z

Pseudomonas aeruginosa chromosomes are often improperly rotated. For example, many chromosome annotations hosted at NCBI have dnaA spanning the 0 coordinate, with a telltale join annotation like 'join(6478288..6478686,1..1146)'. I could cite dozens or maybe hundreds of these, but here are just a few:

https://www.ncbi.nlm.nih.gov/nuccore/NZ_CP147557.1?report=gbwithparts&log$=seqview

https://www.ncbi.nlm.nih.gov/nuccore/NZ_CP140615.1?report=gbwithparts&log$=seqview

https://www.ncbi.nlm.nih.gov/nuccore/NZ_CP100653.1?report=gbwithparts&log$=seqview

I think this happens because P. aeruginosa has multiple dnaA boxes, which can fool annotation software into guessing that the dnaA gene starts later in the sequence than it actually does. Some details here:

https://pmc.ncbi.nlm.nih.gov/articles/PMC4119464/#S2

Most of these improperly rotated chromosomes can be corrected by rotating clockwise by 399 bp, so that dnaA starts after the zero coordinate, as is the convention for bacterial annotations.
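As a rough illustration of that correction (a sketch only, not dnaapler's code), the hypothetical helpers below read a forward-strand location of the form join(a..L,1..b), work out how many trailing bases sit before the zero coordinate, and move them to the front. The function names and the regular expression are assumptions made for this example; complement(...) locations and joins with more than two parts are not handled.

```python
import re

# Matches only the simple two-part, forward-strand case: join(a..L,1..b)
_JOIN_RE = re.compile(r"join\((\d+)\.\.(\d+),(\d+)\.\.(\d+)\)")


def origin_spanning_shift(location: str, contig_length: int) -> int:
    """Return the number of trailing bases to move to the front so that a
    feature written as join(a..L,1..b) starts at position 1.

    Returns 0 when the location does not wrap around the contig end.
    """
    match = _JOIN_RE.fullmatch(location.strip())
    if match is None:
        return 0
    first_start, first_end, second_start, _second_end = map(int, match.groups())
    # The feature wraps only if the first part ends on the last base
    # and the second part begins on the first base.
    if first_end != contig_length or second_start != 1:
        return 0
    return first_end - first_start + 1


def rotate_tail_to_front(seq: str, shift: int) -> str:
    """Move the last `shift` bases of a circular sequence to the front."""
    if not seq:
        return seq
    shift %= len(seq)
    if shift == 0:
        return seq
    return seq[-shift:] + seq[:-shift]
```

For the join quoted at the top of this issue, `origin_spanning_shift("join(6478288..6478686,1..1146)", 6478686)` gives 399, taking the contig length to be the end coordinate of the first join segment.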

Looking at the dnaapler MMseqs output for the first example (GN04821), unfortunately the first 300 hits are wrong, and we finally start getting good hits with sp~B7V0N6~DNAA_PSEA8 matching to GN04821 with start position 6961242. If I rotate the contig by 2k clockwise and run dnaapler again, sp~B7V0N6~DNAA_PSEA8 is the top hit. This highlights a problem with linear references / alignment being used for circular problems like rotation, where the rotation of the contig before MMseqs is run leads dnaapler to the wrong conclusion.

To prevent problems like this, I wonder if you might chain hits crossing the zero coordinate? Or perhaps do one MMseqs run, then rotate all contigs halfway, then another MMseqs run? Or is there a better solution?

oschwengers · 2024-12-10T08:23:23Z

I think this is an important and evidently also common issue. My $0.02 here: very often, on rotated chromosomes, it is very hard to get the dnaA detection/annotation right. The issue is that P(y)rodigal often fails to correctly predict dnaA at position 1 because the corresponding RBS is not properly detected, and thus the gene (if detected at all) is often reported as partial, even though 100% of its aa seq is present and does not cross the sequence border. A simple solution might be to rotate the chromosome by an additional x base pairs. However, I'm not sure what a good value for x would be, maybe something between 50 bp and 100 bp?

gbouras13 · 2024-12-10T09:09:37Z

Hi @marade @oschwengers ,

Thanks for this insightful discussion! I am partial to the suggestion by @marade of x being half the length of the contig - making it a fixed number will be tricky and inconsistent just because the relevant gene isn't just dnaA, and the smallest replicons (e.g. plasmids) might only be a couple thousand bases long.

Luckily, with the change to MMseqs2, running it twice isn't really a compute limitation now, given it is almost instant (whereas BLAST typically took a while), so twice is fine.

Something like:

1. Rotate all replicons by half the length of the contig
2. Run MMseqs2 on both the original and the half-rotated contigs
3. Combine both outputs, and select the hit with the highest bitscore / lowest e-value per replicon
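
A minimal sketch of how these three steps could fit together, assuming the search hits have already been parsed into simple records. The `Hit` class, the function names and the coordinate convention (1-based, inclusive) are assumptions for illustration and not dnaapler's implementation; the MMseqs2 call itself is left out.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Tuple


@dataclass
class Hit:
    contig: str
    target: str
    start: int        # 1-based start on the sequence that was searched
    end: int          # 1-based end on the sequence that was searched
    bitscore: float


def half_rotate(seq: str) -> Tuple[str, int]:
    """Return the sequence rotated by half its length, plus the offset used."""
    half = len(seq) // 2
    return seq[half:] + seq[:half], half


def to_original(pos: int, half: int, length: int) -> int:
    """Map a 1-based position on the half-rotated sequence back to the original."""
    return (pos - 1 + half) % length + 1


def best_hit(original_hits: Iterable[Hit], rotated_hits: Iterable[Hit],
             half: int, length: int) -> Optional[Hit]:
    """Pool hits from both searches in original coordinates and keep the
    one with the highest bitscore (None if there are no hits at all)."""
    pooled = list(original_hits)
    for hit in rotated_hits:
        pooled.append(Hit(hit.contig, hit.target,
                          to_original(hit.start, half, length),
                          to_original(hit.end, half, length),
                          hit.bitscore))
    return max(pooled, key=lambda h: h.bitscore, default=None)
```

In this sketch a hit that crosses the zero coordinate comes back with `end < start` in original coordinates, which is one way to recognise a wrapping gene before re-orienting.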

Seem reasonable?

In terms of actually doing this - maybe next week, I will need to find some time :)

George

oschwengers · 2024-12-10T14:00:07Z

Hey @gbouras13, sorry, I guess I wasn't specific enough. My suggestion for x did not relate to @marade's suggestion for a 2nd round of "blasting" rep genes on a 1/2-rotated chromosome - this is absolutely straightforward and I would also vote for it. My suggestion addressed the question of which position a replicon should finally be re-oriented to, once we have detected a proper dnaA/rep/etc. gene. If we re-orient a chromosome so that dnaA is located at position 1, then P(y)rodigal often fails to properly detect this important gene, because its ribosomal binding site (RBS) is then located at the other end (from a linear perspective). This often results in wrong dnaA gene predictions (wrong coordinates), or in correct coordinates but the gene being predicted as partial because P(y)rodigal can't detect its RBS.
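
To make the offset idea concrete, here is a small sketch of re-orienting a circular sequence so that x bases upstream of the gene start are kept in front of it. The function name and the `upstream_pad` parameter are hypothetical, and only a forward-strand gene is considered (a reverse-strand gene would also need the sequence reverse-complemented).

```python
def reorient_with_padding(seq: str, gene_start: int, upstream_pad: int = 0) -> str:
    """Rotate a circular sequence so the gene begins at position upstream_pad + 1.

    gene_start is the 1-based start of a forward-strand gene on `seq`.
    With upstream_pad=0 the gene starts at position 1; a positive value
    keeps that many upstream bases (e.g. the RBS region) ahead of it.
    """
    if not seq:
        return seq
    # 0-based index of the base that becomes the new first base.
    cut = (gene_start - 1 - upstream_pad) % len(seq)
    return seq[cut:] + seq[:cut]
```

With `upstream_pad=0` this reproduces the "dnaA at pos 1" convention; any positive value trades that convention for leaving the upstream region contiguous with the gene.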

To get some data into this dark matter, I extracted the intergenic regions for all 661k ATB/BakRep genomes around dnaA genes cropped at +/- 100 bp, and plotted their length:

$ awk '$1 > -100' dnaa-intergenic.txt | awk '$1 < 100' | hist -x -r -b 100

 92329|                                                                                       o              
 87470|                                                                                       o              
 82610|                                                                                       o              
 77751|                                                                                       o              
 72892|                                                                                       o              
 68032|                                            o                               o          o              
 63173|                                            o                               o          o              
 58313|                                            o                               o          o              
 53454|                                            o                               o          o              
 48595|                                            o                               o          o              
 43735|                                            o                               o          o              
 38876|                                            o                               o          o              
 34016|                                            o                               o          o              
 29157|                                            o                               o          o              
 24298|                                            o                 o             o          o              
 19438|                                            o                 o             o          o              
 14579|                                            o                 o             o          o              
  9719|                                            o                 o             o          o              
  4860|                                            o   o             o             o          o              
     1| o o              o  oo o oooooooooo oooooooo ooo oooooooooooooooooooooooooooooooooooooooooooooooooooo
       -----------------------------------------------------------------------------------------------------
       - - - - - - - - - - - - - - - - - - - - - - - - - 3 7 1 1 1 2 2 3 3 3 4 4 4 5 5 6 6 6 7 7 7 8 8 9 9 9 
       9 8 8 8 7 7 6 6 6 5 5 4 4 4 3 3 3 2 2 1 1 1 7 4 0 . . 1 4 8 2 6 0 4 7 1 5 9 3 6 0 4 8 2 6 9 3 7 1 5 9 
       2 8 4 0 6 2 9 5 1 7 3 9 6 2 8 4 0 7 3 9 5 1 . . . 5 3 . . . . . . . . . . . . . . . . . . . . . . .   
         . . . . . . . . . . . . . . . . . . . . . 9 1 3   2 1 9 7 6 4 2 0 8 7 5 3 1 9 8 6 4 2 0 9 7 5 3 1   
         1 3 5 7 9 0 2 4 6 8 9 1 3 5 7 8 0 2 4 6 7 6 4 2     4 6 8   2 4 6 8   2 4 6 8   2 4 6 8   2 4 6 8   
         8 6 4 2   8 6 4 2   8 6 4 2   8 6 4 2   8                                                           
                                                                                                             
                                                                                                             

-----------------------------------
|             Summary             |
-----------------------------------
|       observations: 277874      |
|      min value: -92.000000      |
|         mean : 37.981862        |
|       max value: 99.000000      |
-----------------------------------

I'm just posting this here to add my thoughts on this. However, I am aware that changing the position by a certain offset would break with the "dnaA at pos 1" convention and that ultimately this might be a bug that should be addressed best within P(y)rodigal itself.

marade · 2024-12-10T19:06:58Z

@gbouras13 sounds like a good plan. I appreciate your work on this excellent tool.

Regarding @oschwengers comments about the final gene position, I think it's more important to not have features crossing the zero coordinate and have dnaA / rep genes / etc be the first gene than to have the first gene start at position 1. If it starts at say, position 100 or 400, that would still be reasonable in my opinion, so long as the other conditions are met. But as mentioned, perhaps this isn't best addressed with your tool.

gbouras13 · 2024-12-11T01:57:29Z

@oschwengers - I see your point now. It will propagate the bakta annotations as partial. I actually think this might be a good issue to raise with Martin in the pyrodigal repository - let me raise an issue.

George

oschwengers · 2024-12-11T08:36:30Z

Yeah, I think that if Pyrodigal were able to detect the RBS of pos 1 dnaA/repA genes, leaving the pos 1 convention unchanged might be best. I totally agree that changing such a convention might be tricky, and thus I totally understand your reluctance on this point. Sorry for kind of hijacking this issue with this related topic. I'll move my part of the discussion over to the Pyrodigal issue you've opened.

Coming back to the initial issue, thanks a lot for addressing this. This will be a very good and nice improvement to an already super cool tool! Thanks.

gbouras13 · 2025-01-13T02:44:11Z

Hi @oschwengers @marade,

I got around to implementing a fix to allow detection of the desired genes across contig ends. It will be available in v1.1.0 shortly. Please let me know if it gives the desired output - I have added some tests in CI and tried it out on a bunch locally, but of course that is no guarantee of anything :)

George

marade · 2025-01-21T21:38:13Z

Testing this change, so far the results are indeed better. I imagine over time we might encounter cases where the new scheme doesn't work, but I haven't found any yet. Thanks for this improvement!

gbouras13 added the enhancement and question labels Dec 10, 2024

gbouras13 mentioned this issue Dec 11, 2024

support detection of RBS across linear chromosome breakpoint ? althonos/pyrodigal#65

Open

gbouras13 closed this as completed in 2464239 Jan 13, 2025
