Genes (well) annotated in prokka end up all in different groups?? #355

4ureliek · 2017-09-21T17:50:05Z

Hi,
I used prokka on Staph aureus strains. I checked the MecA annotations and they look good, but since I could not find MecA in the roary output, I checked which group each 'locus tag' ended up in (grepping them from the gene_presence_absence.csv file).
I am getting back 81 different groups (about as many as the original genes, which means they mostly did not end up in the same clusters). These groups also carry lots of different gene names that have nothing to do with MecA... It looks like these MecA genes all cluster with the wrong set of genes, which is surprising?
I used this command line:
nohup time roary -e --mafft -o staph.roary.out -v ../prokka/*.gff -p 5 -r > staph.roary.aln.log &
Is there something I should be doing differently? I can send some files to reproduce the issue if needed, but first I was hoping there was a simple explanation, such as an option I overlooked!
Cheers,
Aurelie
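For anyone wanting to reproduce the check described above (which group each locus tag ended up in), here is a minimal Python sketch. It assumes that gene_presence_absence.csv has the group name in the first column, that the per-isolate columns start after 14 metadata columns, and that a cell can hold several whitespace-separated locus tags. The function name and the `n_meta_cols` parameter are hypothetical, so check them against your own file's header.

```python
import csv
import io
from collections import defaultdict


def groups_for_locus_tags(csv_text, locus_tags, n_meta_cols=14):
    """Map each requested locus tag to the group(s) whose row contains it.

    Assumption: column 0 is the group name and the per-isolate columns
    begin at index n_meta_cols (adjust if your header differs).
    """
    wanted = set(locus_tags)
    hits = defaultdict(list)
    reader = csv.reader(io.StringIO(csv_text))
    next(reader, None)  # skip the header row
    for row in reader:
        if not row:
            continue
        group = row[0]
        for cell in row[n_meta_cols:]:
            # A cell may list more than one locus tag for an isolate.
            for tag in cell.split():
                if tag in wanted:
                    hits[tag].append(group)
    return dict(hits)


# Example usage (paths and tags are placeholders):
# with open("gene_presence_absence.csv", newline="") as fh:
#     hits = groups_for_locus_tags(fh.read(), ["PROKKA_00123"])
# print(len({g for groups in hits.values() for g in groups}), "distinct groups")
```

Counting the distinct groups returned for the MecA locus tags gives the same kind of figure as the 81 groups mentioned above.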


tseemann · 2017-10-04T05:51:10Z

@4ureliek how sure are you that you have 81 conserved MecA sequences in your isolates? If you extract the protein sequence for the first MecA and BLASTP it against the other 80 isolates, do you get 1 good hit for every isolate?

You could also combine all the prokka .FFN files and run cd-hit-est -c 0.96 -i ALL.ffn -o out and look at the out.clstr file to see if it is indeed conserved (this is essentially what roary does)
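To inspect the out.clstr file from the cd-hit-est run suggested above, a small parser can help. The sketch below is only an illustration: it assumes the usual .clstr layout (a ">Cluster N" header followed by member lines containing ">name..."), and it keeps only the first whitespace-delimited token of each name in case part of the FASTA description was carried over. The function names are hypothetical.

```python
import re
from collections import defaultdict

# Member lines look like: "0<TAB>2007nt, >PROKKA_00001... *"
_MEMBER = re.compile(r">(.+?)\.\.\.")


def parse_clstr(text):
    """Return {cluster_id: [sequence names]} from the text of a .clstr file."""
    clusters = defaultdict(list)
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">Cluster"):
            current = line.split()[1]
        elif current is not None:
            match = _MEMBER.search(line)
            if match:
                # Keep only the identifier, dropping any description text.
                clusters[current].append(match.group(1).split()[0])
    return dict(clusters)


def clusters_containing(clusters, ids):
    """Keep only clusters holding at least one of the given ids."""
    wanted = set(ids)
    return {
        cid: [m for m in members if m in wanted]
        for cid, members in clusters.items()
        if wanted & set(members)
    }


# Example usage (file name taken from the command above):
# with open("out.clstr") as fh:
#     clusters = parse_clstr(fh.read())
# print(clusters_containing(clusters, ["PROKKA_00123", "PROKKA_04567"]))
```

If the MecA locus tags all fall into a single cluster here, the sequences are conserved at the chosen identity threshold, and any splitting seen in the roary output happens at a later step.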

4ureliek · 2017-10-04T23:03:02Z

MecA presence was confirmed through an independent method; the problem was mostly the inconsistency between the prokka annotations (which looked good) and the roary gene_presence_absence.csv file. Interestingly, when we added the -s option, all the MecA genes annotated by prokka ended up in one group: 1) no longer split across different groups and, more importantly, 2) under the expected name (description column). Since this option basically turns off paralog splitting (i.e. genes stay together when it is not certain they are orthologs), I think the quality of the assemblies (they are at the contig level) was the main factor behind the splitting into groups. I am still confused about the naming of these groups, though, and how roary decides it.

tseemann · 2017-10-05T04:13:47Z

When you annotate with Prokka I would strongly recommend you provide the option --proteins GENOME.gbk, where GENOME.gbk is a GenBank file of a Staph that you trust is well annotated, for example USA300 or TW20. This will use the Staph-specific annotations as a priority over everything else.

4ureliek · 2017-10-05T17:23:50Z

Yes, we did that, it helped a lot!
The prokka annotation was satisfactory; the issue was with roary.

andrewjpage · 2018-01-31T09:48:41Z

As you have found, using the -s option to turn off paralog splitting collapses them all into one group. It is likely that some of your genomes have multiple copies of this gene, and these get split into different groups based on synteny. As for naming, the group takes the most frequently annotated name from the prokka input. As @tseemann suggests, providing species-specific annotation will help.
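The naming rule described above (most frequently annotated name wins) can be pictured with a short Python sketch. This is an illustration of a majority vote only, not roary's actual code, and the function name is hypothetical.

```python
from collections import Counter


def most_frequent_name(names):
    """Return the most common non-empty name, or None if there is none.

    Ties are resolved in favour of the name seen first.
    """
    counts = Counter(name for name in names if name)
    if not counts:
        return None
    return counts.most_common(1)[0][0]


# A group where most members carry a different annotation takes that name,
# even though one member was annotated as mecA.
print(most_frequent_name(["pbpA", "pbpA", "mecA"]))
```

Under such a rule, a mecA gene that lands in a group dominated by differently named genes would appear under that other name, which is consistent with the confusing group names reported earlier in the thread.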

karchern mentioned this issue Oct 8, 2017

Paralog splitting in roary #357

Closed

andrewjpage closed this as completed Jan 31, 2018
