Genes (well) annotated in prokka end up all in different groups?? #355

Closed
4ureliek opened this issue Sep 21, 2017 · 5 comments

Comments

@4ureliek

Hi,
I used prokka on Staph aureus strains. I checked the MecA annotations and they look good, but since I could not find MecA in the roary output, I checked which group each 'locus tag' ended up in (by grepping them from the gene_presence_absence.csv file).
I am getting back 81 different groups (about as many as the original genes, which means they mostly did not end up in the same clusters). These groups also have lots of different gene names that have nothing to do with MecA... It looks like these MecA genes all cluster with the wrong sets of genes, which is surprising.
I used this command line:
nohup time roary -e --mafft -o staph.roary.out -v ../prokka/*.gff -p 5 -r > staph.roary.aln.log &
Is there something I should be doing differently? I can send some files to reproduce the issue if needed, but first I was hoping there was a simple explanation, such as an option I overlooked!
Cheers,
Aurelie
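
The lookup described in this post (which group each locus tag ended up in) can be sketched in Python rather than with grep. This is only an illustration: the function name `groups_for_locus_tags` is hypothetical, and it assumes the per-isolate columns of gene_presence_absence.csv start after 14 metadata columns and that a cell may hold several whitespace-separated locus tags. Both assumptions should be checked against the header of the actual file.

```python
import csv
import io
from collections import defaultdict

# Assumption: the first 14 columns are group metadata (Gene, Annotation, ...),
# and every column after that holds the locus tags of one isolate.
FIRST_ISOLATE_COL = 14


def groups_for_locus_tags(csv_text, locus_tags, first_isolate_col=FIRST_ISOLATE_COL):
    """Map (group name, annotation) -> list of the wanted locus tags found in that group."""
    wanted = set(locus_tags)
    hits = defaultdict(list)
    reader = csv.reader(io.StringIO(csv_text))
    next(reader)  # skip the header row
    for row in reader:
        group, annotation = row[0], row[2]
        for cell in row[first_isolate_col:]:
            # A cell can list more than one locus tag for the same isolate.
            for tag in cell.split():
                if tag in wanted:
                    hits[(group, annotation)].append(tag)
    return dict(hits)
```

Usage would look like `groups_for_locus_tags(open("gene_presence_absence.csv", newline="").read(), mecA_tags)`, where `mecA_tags` is the list of locus tags that prokka annotated as MecA. The number of keys in the result is the number of groups the genes were spread over.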

@tseemann
Contributor

tseemann commented Oct 4, 2017

@4ureliek how sure are you that you have 81 conserved MecA sequences in your isolates? If you extract the protein sequence for the first MecA, and BLASTP against the other 80 isolates, do you get 1 good hit for every species?

You could also combine all the prokka .ffn files, run cd-hit-est -c 0.96 -i ALL.ffn -o out, and look at the out.clstr file to see whether it is indeed conserved (this is essentially what roary does).
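
To inspect the out.clstr file from that suggestion, something like the following Python sketch could help. It assumes the usual CD-HIT cluster layout (a `>Cluster N` line followed by member lines containing `>SEQUENCE_ID...`); the function names `parse_clstr` and `clusters_containing` are hypothetical helpers, not part of cd-hit or roary.

```python
import re


def parse_clstr(text):
    """Parse CD-HIT .clstr text into {cluster number: [sequence IDs]}."""
    clusters = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">Cluster"):
            current = int(line.split()[1])
            clusters[current] = []
        else:
            # Member lines look like: "0\t2007nt, >A_00012... *"
            match = re.search(r">(.+?)\.\.\.", line)
            if match and current is not None:
                clusters[current].append(match.group(1))
    return clusters


def clusters_containing(clusters, ids):
    """Return only the clusters that hold at least one wanted ID, with the matching IDs."""
    wanted = set(ids)
    return {
        number: [m for m in members if m in wanted]
        for number, members in clusters.items()
        if wanted.intersection(members)
    }
```

If the MecA locus tags are truly conserved at that identity threshold, `clusters_containing(parse_clstr(text), mecA_tags)` should return a single cluster holding most or all of them. Many small clusters would point to real sequence divergence or fragmented genes.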

@4ureliek
Author

4ureliek commented Oct 4, 2017

MecA presence was confirmed through an independent method, but the problem was mostly the inconsistency between the prokka annotations (which looked good) and the roary gene_presence_absence.csv file. Interestingly, when we added the -s option, all the MecA genes annotated by prokka ended up in one group: they are 1) no longer split across different groups and, more importantly, 2) listed under the expected name (description column). Since this option basically turns off paralog splitting (i.e. genes are kept together when it is not certain they are orthologs), I think the quality of the assemblies (they are at the contig level) was the main factor in the splitting into groups. I am still confused about the naming of these groups, though, and how roary decides it.

@tseemann
Contributor

tseemann commented Oct 5, 2017

When you annotate with Prokka, I would strongly recommend you provide the option --proteins GENOME.gbk, where GENOME.gbk is a GenBank file of a Staph genome that you trust is well annotated, for example USA300 or TW20. This will use the Staph-specific annotations as a priority over everything else.

@4ureliek
Author

4ureliek commented Oct 5, 2017

Yes, we did that, it helped a lot!
The prokka annotation was satisfactory; the issue was with roary.

@andrewjpage
Member

As you have found, using the -s option to turn off paralog splitting collapses them all into one group. It is likely that some of your genomes have multiple copies of this gene, and these get split into different groups based on synteny. As for naming, a group takes the most frequently annotated name from the prokka input. As @tseemann suggests, providing species-specific annotation will help.
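
The naming rule described here (most frequently annotated name wins) can be illustrated with a small Python sketch. This is not roary's actual code: the function `most_frequent_name` is hypothetical, and skipping empty names and the tie-breaking behaviour are assumptions made for the example.

```python
from collections import Counter


def most_frequent_name(names):
    """Return the most common non-empty gene name in a group, or None if there is none."""
    counts = Counter(name for name in names if name)
    if not counts:
        return None
    # On a tie, Counter.most_common keeps the name that was seen first.
    return counts.most_common(1)[0][0]
```

Under this rule, a group whose members were mostly annotated with some other name would carry that name even if a few members are MecA. This is one way a well-annotated gene can end up in a group with an unexpected label.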
