Investigating the role of homologous recombination in driving sequence-discrete units at the species and intra-species levels.

Contains the code and workflow for the F100 recombination project.

This workflow uses 100% sequence similarity between reciprocal best match genes (RBMs) from a pair of genomes as a proxy for recent homologous recombination events. For each genome pair input by the user, it computes F100, the fraction of RBMs that are 100% similar out of the total RBMs, and compares that value to the one expected from the pair's aggregate-genome average nucleotide identity (ANI) under one of three models: 1) a model of 330+ species with genomes from NCBI that have at least 10 complete genomes per species, 2) a model of simulated random neutral evolution genomes along a 95%-100% ANI gradient with zero recombination, or 3) an easy-to-build custom model from the user's own collection of genomes. The workflow also evaluates the genomic positions of recent recombination events for genome pairs and for one genome against many genomes. It can also create clusters of genomes with high recombination activity, and it can perform hypothesis testing of broad-level functional gene annotations of recombinant vs. non-recombinant genes.
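As a rough illustration of the F100 idea described above, the sketch below computes the fraction of RBMs at 100% identity for one genome pair. It is not the workflow's own script: the function name `f100`, the `threshold` parameter, and the example identities are assumptions made for illustration, and the real scripts may filter or transform the values differently.

```python
def f100(rbm_identities, threshold=100.0):
    """Return the fraction of RBMs whose percent identity is >= threshold.

    Hypothetical helper for illustration only; not part of the workflow.
    """
    if not rbm_identities:
        raise ValueError("no RBMs found for this genome pair")
    # Count RBMs that are identical (100% similar by default).
    identical = sum(1 for pid in rbm_identities if pid >= threshold)
    return identical / len(rbm_identities)


# Made-up percent identities for the RBMs of a single genome pair.
example_identities = [100.0, 99.2, 100.0, 97.5, 100.0, 98.8, 96.1, 100.0]
score = f100(example_identities)
print(f"F100 = {score:.3f}")  # 4 of 8 RBMs are identical -> F100 = 0.500
```

A genome pair with no RBMs raises an error here rather than returning zero, since a score cannot be defined without any shared genes.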

A collection of genomes in fasta format is all that is required as input to begin. This workflow was designed for genome collections from the same species (≥95% ANI), but it will work at broader or finer genome similarity groupings as long as some 100% similar RBMs exist between the genomes.

The steps are kept separate so the user can more easily follow the workflow, and so individual steps can be parallelized more efficiently depending on the user's system.

Fasta: Predicted CDS gene sequences from Prodigal
Figures: Histograms of gene length distributions
DataTable: RBM sequence similarities for each genome pair
DataTable: F100 score for each genome pair
Figure: Histogram of RBM sequence similarity for each genome pair
Figure: F100 vs ANI with various GAM models
DataTable: F100 vs ANI data, confidence interval and p value for each genome pair
Figure: F100 distance hierarchical clustered heatmap

Part 03:

DataTable: Gene clusters from MMSeqs2
Fasta: Representative sequence fasta for each gene cluster
DataTable: Presence/Absence binary matrix of genomes and gene cluster
DataTable: tsv gene list for Coinfinder input
Figure: Pangenome curve model
Figure: Pangenome clustered heatmap
DataTable: Gene annotations
DataTable: Genes assigned to pangenome classes: Conserved, Core, Accessory, Specific
Figure: Histogram of average within gene cluster sequence distance for Core genes

Genome pairs

Figure: genome pairs: Recombinant gene position by pangenome class
Figure: genome pairs: Distance between recombination events distribution test
Figure: genome pairs: Recombinant vs. Non-recombinant gene annotation test
Figure: genome pairs: Sequence identity of RBMs vs. genome position
DataTable: Gene RBM info, position info, annotation info

One genome to many genomes

Figure: genome group: Recombinant gene position by pangenome class
Figure: genome group: Distance between recombination events distribution test
Figure: genome group: Recombinant vs. Non-recombinant gene annotation test
Figure: genome group: Sequence identity of RBMs vs. genome position
Figure: genome group: Core vs total recombinant positions rarefaction curve
DataTable: Gene RBM info, position info, annotation info
DataTable: RBM Matrix of gene/genome recombinant sites
Figure: Rarefaction curve of recombinant sites per genome

PART 01: Genome Preparation

This workflow is intended for a collection of genomes belonging to the same species (ANI ≥ 95%) or to closely related species (ANI ≥ 85-90%). Start with your genome files in fasta format in their own directory. We will refer to this directory as ${genomes_dir}.

Step 01: Rename fasta deflines

Rename the fasta deflines of your genome files. This is necessary to ensure all contigs (or chromosomes/plasmids) in your genome files follow the same naming format for downstream processing.

Because genomes downloaded from NCBI follow a typical naming convention such as "GCF_000007105.1_ASM710v1_genomic.fna", the default behavior of this script is to take the third underscore-delimited ("_") field of the filename and use it as a prefix for renaming the fasta deflines in consecutive numeric order.

So with default settings the script will cut "ASM710v1" from filename "GCF_000007105.1_ASM710v1_genomic.fna" and rename the fasta deflines (Contigs/Scaffolds/Chromosomes) as:

>ASM710v1_1
AATGGATCAGTCCGCCGACCGCGCCTGGAACGAATGTCTCGACATCATCCGGGACAATGT...
>ASM710v1_2
GAGCCGCCAGAGCTTCACGACCTGGTTTGAGCCGCTGGAGGCCCACTCCTTGGAGGACGA...
>ASM710v1_n
GGACGACCTGCGCAAGCTGACGATCCAACTTCCGAGCCGGTTTTACTACGAGTGGATTGA...
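The default renaming behavior described above can be sketched in a few lines of Python. This is only an illustration of the logic, not the contents of 01a_rename_fasta.py: the function names `default_prefix` and `rename_deflines` and the input deflines are hypothetical, and the sketch works on a list of lines in memory instead of overwriting a file.

```python
import os


def default_prefix(filename):
    """Return the third underscore-delimited field of an NCBI-style filename."""
    fields = os.path.basename(filename).split("_")
    if len(fields) < 3:
        raise ValueError(f"expected at least three '_'-separated fields: {filename}")
    return fields[2]


def rename_deflines(fasta_lines, prefix):
    """Yield fasta lines with deflines replaced by >prefix_1, >prefix_2, ..."""
    count = 0
    for line in fasta_lines:
        if line.startswith(">"):
            count += 1
            yield f">{prefix}_{count}"
        else:
            # Sequence lines pass through unchanged.
            yield line


# Hypothetical input: two records with made-up deflines and short sequences.
records = [">contig_A original description", "AATGGATC",
           ">contig_B original description", "GAGCCGCC"]
prefix = default_prefix("GCF_000007105.1_ASM710v1_genomic.fna")
renamed = list(rename_deflines(records, prefix))
print("\n".join(renamed))
```

Passing a different string as `prefix` produces deflines numbered the same way under that name, which is the effect of supplying a custom prefix.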

This step requires Python.

Input:

genome fasta files in the ${genomes_dir} directory

Output:

overwrites genome fasta files with new names

To use the renaming script on all files in a directory with default settings:

for f in "${genomes_dir}"/*; do python 00d/Workflow_Scripts/01a_rename_fasta.py -i "$f"; done

Alternatively, the user can supply their own prefix with the "-p" flag, in which case the input filename is ignored. The example below sets "${name}" to the third underscore-delimited field of each filename; replace "${name}" with anything you want:

for f in "${genomes_dir}"/*; do name=$(basename "$f" | cut -d_ -f3); python 00d/Workflow_Scripts/01a_rename_fasta.py -i "$f" -p "${name}"; done

So with -p my_genome the script will output:

>my_genome_1
AATGGATCAGTCCGCCGACCGCGCCTGGAACGAATGTCTCGACATCATCCGGGACAATGT...
>my_genome_2
GAGCCGCCAGAGCTTCACGACCTGGTTTGAGCCGCTGGAGGCCCACTCCTTGGAGGACGA...
>my_genome_n
GGACGACCTGCGCAAGCTGACGATCCAACTTCCGAGCCGGTTTTACTACGAGTGGATTGA...

Table of Contents

Data table and Figure Outputs

Part 01:

Part 02:

Part 03:

PART 01: Genome Preparation

Step 01: Rename fasta deflines

Step 02: All vs. all fastANI

Step 03: Inspect genome similarity

Step 04: Assign clades, phylogroups, and genomovars

PART 02: Genome Analysis

Step 01: Predict genes with Prodigal

Step 02: Compute Reciprocal Best Match Genes

Step 03: Compute F100 scores

Step 04: Compare User Genomes to GAM Models

Option 01: Complete genomes model

Option 02: Simulated NEZR model

Option 03: Custom models

Step 05: Identify Significant Outliers

Step 06: F100 score Clustered Heatmap

Step 07: Identical gene fractions by groupings

PART 03: Gene Analysis

Step 01: Generate gene clusters with MMSeqs2

Concatenate all gene CDS to single file

Create a directory that we'll use for MMSeqs2 intermediates

Create an mmseqs database using all_genes_CDS.fnn

Cluster at 90% nucleotide identity

Write mmseqs database to TSV format

Write out cluster representative fasta file

Clean up temporary files

Create a binary matrix of genomes and gene clusters

(OPTIONAL): create Coinfinder input file

(OPTIONAL): create pangenome model

(OPTIONAL): create clustermap

Step 02: Annotate representative genes with EggNog Mapper or COGclassifier

Concatenate all amino acid sequence predicted CDS

Retrieve amino acid sequence for representative genes

Annotate genes with EggNog Mapper

Annotate genes with COGclassifier

Step 03: Assign pangenome class to genes

Step 04: Reorder-align contigs for MAGs, SAGs, and draft genomes

Step 05: Explore genome pairs of interest

Run analysis for each genome pair of interest

Recombinant gene position by pangenome class

Distance between recombination events distribution test

Recombinant vs. Non-recombinant gene annotation test

Sequence identity of RBMs vs. genome position

Step 06: Explore one vs. many genome groups of interest

Recombinant gene position by pangenome class

Distance between recombination events distribution test

Recombinant vs. Non-recombinant gene annotation test

Sequence identity of RBMs vs. genome position

Recombinant RBM curve plot

this is an old plot adapted from my pangenome work. Replacing it with the rarefaction plot below.

Recombinant RBM gene clustermap

Recombinant RBM gene rarefaction plot

Software Dependencies

External dependencies

References

Required packages for Python

References

How to Cite

Future Improvements
