Skip to content

Latest commit

 

History

History
187 lines (137 loc) · 9.44 KB

README.md

File metadata and controls

187 lines (137 loc) · 9.44 KB

pathophy

Scripts for aiding in pathogen phylogenetics analysis

Phylogenetic Inclusion python script:

This script is used to take a transmission cluster (e.g. defined by SNP cut-offs) and extend it to include all other related strains that share the common ancestor in a phylogenetic tree.
####Inputs:

A cluster file: one cluster per line, separated by tabs
A tree file in newick format
NOTE: requires dendropy to be installed
usage
python phylogeneticInclusion.py -t treeFile -c clusterFile

See comments in script for full instructions

Mixed population detection by comparing VCF files

This script is used to try estimate mixed populations from whole genome sequencing VCF files.

Input

2 vcf files output from a SNP calling pipeline such as snippy but at different minumum proportion for variant evidence (fractions)
E.g. run snippy with --minfrac 0.3 and compare to default (0.9)

See comments in script for full instructions

Snippy output to Phylogenetic input

This pipeline converts outputs from snippy into a whole genome alignment, a SNP alignment and an invariant count of positions for use in ascertainment bias correction in phylogenetics

Input:

A folder in which there are snippy output folders. In particular the following files must be in each
snps.vcf This is used to get all snps and indels
snps.aligned.fa This is used for the Ns and large areas of either deletions or unaligned regions (e.g. insertion sites; multiple mapping sites). These two latter cannot be distinguished within this script.
These files can be named other things (e.g. not snps but another prefix, as done by the --prefix in snippy) but must be PREFIX.snps and PREFIX.aligned.fa with no other aligned.fa file
The fasta reference sequence used to create the snippy outputs (assumed same for all). E.g. this is the ref.fa file in the reference folder of a snippy output.
An interval masking file (optional; see below)
A set of characters not to count as variable (optional; see below)
A flag for whether to include the refrence sequence (optional)

See comments in script for full instructions

Cutting BEAST output to desired step

Script takes a MCMC log or trees file from BEAST and will cut them at a specific step E.g. the file may have more than 40,000 steps and you wish to cut at that step (step requested will be included)

Input

File to cut File type (log or trees) Last step to include

Output

New log or trees file. Will be named same as input with _cut e.g. if input it Dataset.log and the last step is 40000, the output is Dataset_cut40000.log

Usage

python cut_BEAST_output.py --file FileToCut --type <log/trees> --step

Creating core genome alignments from isolate lists of aligned genes

This consists of two scripts: isolateFilesToGeneFiles.py and genesToConcatAlign.py

isolateFilesToGeneFiles.py

This script is used to combine files of individual gene sequences, per isolate, into gene-specific alignment files

Input:

A folder of isolate files where the isolate name is everything before the first _. NOTE: assumes all end in .fasta Inside each file is a set of fasta formatted sequences where the gene name is xxx_xxx (i.e. based around 1st _ only) e.g. >STENO_1_6, the gene name is STENO_1

Output:

In a folder named geneAlignments: Per gene files where the file name is the gene name (e.g. STENO_1.fasta) Inside each file the sequences are named as isolate_geneName (e.g. 677_STENO_1)

NOTE: uses biopython NOTE: assumes only 1 entry per gene in each isolate. Will warn if double and only keep last one

Usage:

python isolateFilesToGeneFiles.py --folder folderName

genesToConcatAlign.py

This script is used to combine files of gene alignments into a concatenated alignment (assumes all species/isolates in all gene alignments)

Input:

A folder of fasta gene alignmentds where the sequence names is the species/isolate name (identical between files) NOTE: assumes all end in .fasta

Output:

A concatenated alignment of all genes, using the species name as the sequence header (pulled from the first input file) A file that outlines the gene name and the start/stop positions of the gene in the concatenated alignment e.g. STENO_1, 1-1345

NOTE: uses biopython NOTE: Assumes all genes have all species/isolates. If not, will mess up the alignment

Usage:

python genesToConcatAlign.py --folder folderName

Creating SNP cut off-based clusters from an alignment

This consists of two scripts: pairwiseDist.py and pairwiseCluster.py NOTE: inputs to these scripts are position specific so must be given in a specific order

pairwiseDist.py

Script to take an alignment and get pairwise distance between all sequences as a matrix, with allowed exclusions

Input:

A fasta alignment (fastaAlignment) Characters to not count as divergent if encountered (excludedCharacters) e.g. if the excluded characters are N- then if a comparison has a N or - this will not be counted as different NOTE: is not case sensitive so N and n are the same thing

Output:

A mnatrix of distances between the samples.

Usage:

python pairwiseDist.py fastaAlignment excludedCharacters

pairwiseCluster.py

This script is used to cluster sequences based on a pairwise distance file. This takes the output of pairwiseDist.py as input

Input:

The pairwise distance matrix output from pairwiseDist.py (pairwiseDistFile) A SNP cut-off number (cut-off)

Output:

A file listing all sequences that are within the cut-off from another sequence A file where each line is a list of sequences within cut-off distance of each other (tight cluster) A file where each line is a list of sequences where at each sequence is within the cut-off from at least 1 other (e.g. if cutoff is 5 then seq1 and seq2 could be 5 apart and seq3 could be 10 from seq2 and 3 from seq1 but all on same line) (loose cluster)

NOTE: cut-off is used as <= so if cutoff is 5 then those a distance of 5 will be counted as clustered NOTE2: NA can be an entry (e.g. for a tip distance that is not supported by bootstrap or unequal lengths of patterns) so if it is, count as not within cutoff

Usage:

python pairwiseCluster.py pairwiseDistFile cut-off

Detecting gene absences from MTBseq outputs

Script takes a folder of Position Tables created by MTBseq and searches for genes that have no coverage (suggesting they are absent).

Input

Folder containing Position Tables from MTBseq (default is Position_Tables) The genes file from the reference used for MTBseq (can be found in the var/ref/ folder of the MTBseq installation) The minimum proportion of positions absent to be counted as an absent gene (default is 0.95) The minimum number of reads to consider a position to be covered (default is 8, same as MTBseq)

Output

A presence/absence map of the genes in a matrix. Names are the position table names with everythign after the first _ removed (similar to MTBseq) The samples are listed in the first column; the genes are listed in the first row. There is a 1 for a present gene and a 0 for an absent gene

Usage

python MTBseq_to_genePresAbs.py --folder Position_TablesFolder --genes genesFile --cutoff proportionAbsent --minreads coverageMinumum

Converting MTBseq Amend alignment into suitable SNP alignment with ascertainment bias correction for use in RAxML-NG

This script takes the output of MTBseq and prepares the SNP alignment and invariant site count for use in RAxML-NG

Input

Amend folder with *_amended_u95_phylo_w12.plainIDs.fasta and *_amended_u95_phylo_w12.tab files inside (can be any starting text but must end with these suffixes) The reference genome used for the assembly in MTBseq Any characters to be ignored in terms of invariance. E.g. if the skipping characters are - and N a site such as --A-N will be removed as only 1 useable character is left

Output

A SNP file for use in phylogenetics (e.g. RAxML-NG). This file is named _SNPalignment.fasta A count of the invariant sites in RAxML-NG format (i.e. countA/countC/countG/countT). This file is named _invariantSiteCount.txt A list of the sites in the SNP file that were removed in the 2nd step of the script. This file is named _invariantSitesRemovedFromSNPfile.txt The suggested RAxML-NG command. This is placed in file _suggestedRAxML-NG.sh (Project name is derived from the name before the first _ in the tab file above)

Filtering BLAST hits by similarity and length of match

This script takes a BLAST output in tabular format and the input query fasta file and filters out all entries under a certain similarity percentage and percentage of length cut-off

Input

BLAST table (outfmt 6 from blast+) The original query fasta file A percentage similarity/identity above which to retain hits (is >=, not just >) A percentage length of the query sequence above which to retain a hit (is >=, not just >) NOTE: assumes proper formatting of query names, so there is an exact match between query name and what is in the BLAST table NOTE:fasta file should be a single line per sequence, not multi-line

Output

A subset of results based on the filtering above

Usage

python blastHitsFilterIdLength.py --blast --query --id <% id cut-off> --length <% query length cut-off>