Poliovirus Investigation Resource Automating Nanopore Haplotype Analysis
Piranha is a tool developed to help standardise and streamline sequencing of poliovirus. It's been developed by members of the Rambaut group at the University of Edinburgh as part of the Poliovirus Sequencing Consortium. Piranha runs an end-to-end read-to-report analysis that produces distributable, interactive reports alongside analysed consensus data. By default, piranha will attempt to generate consensus genomes for populations of the sam It produces an overall report, summarising the entire sequencing run, as well as a sample-specific report. Samples with virus of interest, such as VDPV are highlighted and certain quality-control flags can alert the user if there are issues with the run (such as a failed negative or positive control or identical sequences between samples that may be the result of contamination).
Any issues or feedback about the analysis or report please flag to this repository.
Note: piranha has been tested primarily on poliovirus VP1 sequencing data. There are alternative analysis modes in development (e.g. whole genome, panEV), but the authors recommend additional QC checks if using piranha beyond its established poliovirus VP1 pipeline.
Example report here
Example data here
Piranha does not require use of the command-line to run. Users can access piranha through the PiranhaGUI application that can be downloaded at the link below. This application enables teams without access to a bioinformatician to use piranha and run standard detection of poliovirus VP1 in stool through nanopore sequencing.
- Download the release package for your machine from the PiranhaGUI respository
- Piranha is also compatible with the EPI2ME workflow desktop application. This mode of running does not provide the same level of input checking and configurability as either the ARTIFICE GUI or the command line, but is provided as an alternative interactive interface. Once EPI2ME and docker have been downloaded, follow the instructions to download a custom workflow using the github path
https://github.com/polio-nanopore/piranha
.
You need to have Git, a version of conda (link to Miniconda here) and mamba installed to run the following commands.
Detailed installation instructions given below, but - in brief - to install with mamba run the following in a terminal:
git clone https://github.com/polio-nanopore/piranha.git
cd piranha
mamba env create -f environment.yml
conda activate piranha
pip install .
If this is your first time running piranha, you’ll need to clone the GitHub repository and install it. If you're running on a Mac machine (OS X) and have never used command line tools such as Git before, you may need to install them. They can be installed easily with the installer. Full instructions can be found here. Linux systems will have Git installed already.
Double check you have git installed by typing the following into a terminal window:
git --version
You should see something like the following printed below:
git version 2.21.1 (Apple Git-122.3)
If you don't see this, follow the link to the install instructions above to install git. If you're using a Windows machine, it's possible to use Windows Subsystem for Linux, which should have git pre-installed.
You will also need to have a version of conda installed. We recommend the latest version of Miniconda, which can be accessed for download and install here. Follow the installation instructions on that page. After successful conda installation and initialistaion you should have (base)
at the start of your command prompt before your username like below:
(base) aine$
You can double check your installation has completed successfully by checking the conda version:
(base) aine$ conda --version
This should return a version number. If it does not, repeat the installaion steps outlined here.
Once conda has been installed, we also recommend installing mamba
. This can be done using the following command:
(base) aine$ conda install mamba -c conda-forge
Press y
to confirm installation if prompted.
I like to keep my repositories in the same place so they’re easy to find, so below I’m making a directory in my home directory (~
) called repositories and moving into it with cd
, which is short for "change directory".
(base) aine$ cd ~
(base) aine$ mkdir repositories
(base) aine$ cd repositories
Now we’re ready to clone the piranha GitHub repository. This creates a local copy of the piranha repository on your computer. It also retains a link back to the original repository, so if any updates are made (such as addition of new features or bug fixes), it's very easy to pull those changes down to your local machine and update your version of piranha. More information on git cloning can be found here.
Clone the piranha repository with:
(base) aine$ git clone https://github.com/polio-nanopore/piranha.git
(base) aine$ cd piranha
This directory remains linked to the original GitHub copy, so if you need to update it or get any changes in the future, you can do that from this location with the following command:
(base) aine$ git pull
We now need to create the piranha environment. Hopefully you have mamba installed, check if you do with the following command:
(base) aine$ mamba --version
If you see the message mamba not found
then you should repeat the mamba installation instructions found at the end of step 2 above. Next we need to create the environment where piranha will run by using mamba. This following command needs to be run in the newly cloned piranha repository. To double check you're in the correct directory, you can type pwd
(print working directory):
(piranha) aine$ pwd
/localdisk/home/repositories/piranha
If you see a path printed like the one above, ending with piranha, you know you're in the correct directory. If you are not in the correct directory, please navigate to it using the cd
command. If you have followed the instructions in step 3 then your piranha repository will be located at ~/repositories/piranha/
.
Once you are in the piranha directory you can create the piranha environment with this command:
mamba env create -f environment.yml
Press y
to confirm installation if prompted. When completed activate the piranha environment:
(base) aine$ conda activate piranha
(piranha) aine$
Notice that the (base)
in your command prompt has changed to (piranha)
. This tells you that the piranha environment is activated. Now you’ll need run install the piranha python package while you’re in the environment:
(piranha) aine$ pip install .
The .
refers to the current working directory (cwd), which should be the piranha repository.
Congratulations! You should now have piranha installed.
(piranha) aine$ piranha --version
Should return piranha and the version number installed:
piranha v1.0
If no errors have come up (such as messages saying "command not found"), you should now be ready to run piranha!
Sometimes there can be issues unrelated to the commands you've run.
- Firstly, check you're in the piranha environment. Your prompt should start with
(piranha)
. If not, activate the piranha environment and check your install again.
Example:
(piranha) aine$
<- correct
(base) aine$
<- incorrect
- A common issue can be related to internet connectivity. If a download has failed because of a break in internet, I suggest running through the commands again and just try again.
mamba
andconda
can cache the files already downloaded, so often you can make progress even if internet connectivity is an issue. - Similarly, if you have a laptop with not enough storage space, installation and download can fail. To solve this, try clear some space on your machine and try again.
- If you're still having trouble installing via the command line, piranha can be installed using the ARTIFICE GUI (linked above).
- Please post an issue to the GitHub if there are further unresolved issues.
- Change directory to the piranha repository:
cd piranha
- Pull the latest changes from GitHub:
git pull
- Ensure you're in the piranha environment:
conda activate piranha
- Install the changes:
pip install .
If there has been a major change to piranha, it's possible the environment will need to be updated as follows:
- Change directory to the piranha repository:
cd piranha
- Pull the latest changes from GitHub:
git pull
- Ensure you're in the piranha environment:
conda activate piranha
- Update the environment:
mamba env update -f environment.yml
Piranha is also available now on bioconda and can be installed with
mamba install -c bioconda piranha-polio
Note: we recommend using piranha in the conda environment specified in the environment.yml file as per the instructions above. If you can't use conda for some reason, dependency details can be found in the environment.yml file.
piranha -i <demultiplexed read directory> -b <path/to/barcodes.csv>
At a minimum, two columns- one for barcode information and one with the name you'd like your sample to be called. This is also where you can flag which samples are negative or positive controls. Barcode and sample names should be unique and sample names shouldn't contain spaces or special characters. This is because the sample name will get incorporated into the output consensus fasta header and spaces in a fasta header disrupt the sequence ID.
barcode,sample
barcode01,EDI001
barcode02,EDI002
barcode03,EDI003
barcode04,negative
barcode05,positive
You can also include additional information in your barcode.csv. If you include a date
column or an EPID
column they'll automatically be included in the fasta header output too. Dates should always be in ISO format (YYYY-MM-DD) for metadata best practice. Piranha has a flag --all-metadata-to-header
that will take any metadata fields in your barcodes.csv and append them to the final output file (separated by a |
pipe symbol). Be aware any odd characters or spaces in these fields will also get added to the fasta header and can interfere with downstream phylogenetics you might want to run (e.g. :
,;
) can cause issues with some tree-building or reading software.
barcode,sample,EPID,date
barcode01,EDI001,EPI111,2022-10-10
barcode02,EDI002,EPI112,2022-09-20
barcode03,EDI003,EPI113,2022-09,21
barcode04,negative,,
barcode05,positive,,
Piranha is configured to make analysis as straightforward as possible for users running MinION sequencing. Piranha takes the output of guppy directly and looks for the demultiplexed read directory containing the barcodes you've specified in your barcodes.csv. Guppy outputs directories in the structure:
fastq_pass/
├── barcode01/
│ ├── FAT75518_pass_barcode01_6d172881_0.fastq.gz
│ ├── FAT75518_pass_barcode01_6d172881_1.fastq.gz
│ └── FAT75518_pass_barcode01_6d172881_2.fastq.gz
├── barcode02/
│ ├── FAT75518_pass_barcode02_6d172881_0.fastq.gz
│ ├── FAT75518_pass_barcode02_6d172881_1.fastq.gz
│ ├── FAT75518_pass_barcode02_6d172881_2.fastq.gz
│ ├── FAT75518_pass_barcode02_6d172881_3.fastq.gz
│ ├── FAT75518_pass_barcode02_6d172881_4.fastq.gz
│ └── FAT75518_pass_barcode02_6d172881_5.fastq.gz
├── barcode03/
│ ├── FAT75518_pass_barcode03_6d172881_0.fastq.gz
│ ├── FAT75518_pass_barcode03_6d172881_1.fastq.gz
│ └── FAT75518_pass_barcode03_6d172881_2.fastq.gz
└── barcode05/
├── FAT75518_pass_barcode05_6d172881_0.fastq.gz
├── FAT75518_pass_barcode05_6d172881_1.fastq.gz
└── FAT75518_pass_barcode05_6d172881_2.fastq.gz
This is what piranha will look for. Point the software to the directory containing the different barcodeXX sub-directories and it will iterate within these to find the files. Piranha can accept fastq (fq) or fastq.gz files. It will only attempt to analyse the barcodes present in the input csv file.
Piranha has been preconfigured with defaults specific to the VP1 protocol developed by the Polio Sequencing Consortium. All command line arguments (full list below) can be configured either as command line flags when running piranha, or as snakecase arguments in a yaml config file (which can then be supplied with the -c
flag). Snakecase formatting separates works with an underscore, i.e. my_argument
.
For example, you can supply a custom references file (piranha has a default one supplied, which you can access here) using the -r
flag, or by pointing to it within the config file.
Example Case 1: command line argument
(piranha) aine$ piranha -i /localdisk/home/data/minion_run_1/fastq_pass -b /localdisk/home/data/minion_run_1/barcodes.csv -r /localdisk/home/custom_reference_file.fasta
This command shows an example piranha run, where everything is in the default settings except the reference file. In this case you point piranha to the read directory, the barcodes.csv file and your custom reference file.
Example Case 2: config file
Example config file (called config.yaml in current working directory). Actual example file can be found here.
readdir: /localdisk/home/data/minion_run_1/fastq_pass
barcodes_csv: /localdisk/home/data/minion_run_1/barcodes.csv
reference_sequences: /localdisk/home/custom_reference_file.fasta
Then to run piranha you can simply run the command below, and all the information in the file will be included in your run.
(piranha) aine$ piranha -c config.yaml
Piranha allows you to specify which samples are controls (positive or negative). By default, if the sample name contains negative
or positive
within the barcode csv file, piranha will automatically detect that these are control samples. See minimal example above for format.
You can overwrite this if you would rather call your controls something else (like nc
, my_fave_control
etc) with the flags -pc,--positive-control
or -nc,--negative-control
.
You can have one or more positive and negative controls. If specifying more than one sample they can be supplied as a comma-separated string on the command line, or as a list or comma-separated string in the config file.
Example:
piranha -i path/to/fastq_pass -b barcodes.csv -pc "Pos1,P2" -nc "my negative control"
Alternatively you can supply this in a config file with the fields:
positive_control: "Pos1,P2"
negative_control: Negative1
It is important to make sure that the fields match within barcodes.csv. Also note that above because there are spaces in sample names for the command line example negative control, this command will need quotes around the full name or else the terminal won't interpret it as a single field. ALSO, it's in general better to avoid having spaces in sample names because if you get a consensus sequence out of piranha as a fasta file, record ids are defined as the field up to the first space, so you can lose information in downstream analysis software if you're not careful. Best to just avoid spaces (and also special characters like :
, ;
and |
) in general when dealing with this kind of data that might have phylogenetics run on it.
Samples flagged as controls will appear in the report at the end in a separate table as well and will be flagged as either passing (row in table coloured green and a tick appears) or not passing (row in rable coloured red and no tick appears). Piranha's behaviour treats negative controls as passing if there are fewer than the configured minimum number of reads in the sample (Default: 50 reads) and positive controls as passing if there is more than the minimum number of reads in the sample for the positive reference (Default: 50 reads).
The positive control sequence can be configured. By default it is set to a a CoxsackievirusA20 reference sequence present in the references FASTA file supplied with piranha (CoxsackievirusA20_AF499642
). The Poliovirus Sequencing Consortium has developed a positive CoxA20 control, which can be supplied to user labs upon request. If a different positive control is being used within the sequencing run, this can be configured with the--positive-references
flag. The user can specify one or more references to match as a positive control, however they must be present in the references file supplied to piranha. A custom references file can be supplied alongside the run if the positive control used doesn't have a close match in the piranha reference database. Note however, that if your positive control is supplied within the file and does have an existing close match within the supplied piranha reference file, the existence of two very close sequences within a FASTA file may lead to competition for minimap2, and result in a reduced mapping quality score.
If you have a reference positive control sequence that is not closely matched with anything already existing in the piranha database and can share the sequence, we would be happy to include it in the distributed file. In this way, piranha could match this sequence for the positive control without the need to supply a custom reference file.
Piranha takes nanopore sequencing reads (fastq
) and matches them against a reference set of enterovirus VP1 sequences. This matching is currently performed with minimap2 (Li 2018), run in the mode configured for noisynanopore data (-x map-ont
). For more information on the details of what this pre-configured setting is, visit this link. In addition to these pre-configured settings, within piranha we currently only consider primary alignments (i.e. the best hit) for each read against the reference file, rather than all potential matches.
This reference set includes representatives of wild-type poliovirus 1, 2 and 3, as well as Sabin-1, 2 and 3 as well as a selection of other enterovirus VP1 sequences too.
IMPORTANT: Nanopore reads can have up to ~10% error rate and sporadic mapping to some references can occur at the read level, particularly when sequencing at great depth. When looking at the output read counts in the piranha report, this must be interpreted with caution and awareness of the underlying noise in the data. For example, low numbers of reads may map to WPV2 or WPV3 as they are present in the default database. This DOES NOT mean that WPV2 or WPV3 is present in your sample. Hits with greater than the minimun read count are highlighted in the report in the sample composition table (Table 2). If the read count passes the minimum thresholds (number of reads and percentage of sample) the reads will be taken forward for consensus building. It is only possible at the consensus level to get an accurate picture of that read population in your sample.
There are a number of default thresholds applied, which can be overwritten depending on your purposes.
-n,--min-read-length
-x, --max-read-length
The default read length range filters accepts reads between 1000 and 1300 nucleotide bases in length.
-d, --min-read-depth
-p, --min-read-pcent
These parameters set the minimum number of reads hitting a particular reference in the reference file (and the minimum percentage of reads within the sample) that are necessary to create a binned read group and attempt to make a consensus sequence for that particular sample. By default a minimum of 50 reads are necessary to build a consensus sequence and a minimum of 2% of the sample is required to be represented by that particular reference before it will attempt to create a consensus for this.
Similarly, mapping quality and alignment block length are used as quality filters within piranha to minimise the risk of any false positive results. By default, the minimum mapping quality is set to 15, on a phred scale of 0 to 60. Note that if you are using a non-default references FASTA file, having identical or near-identical sequences in the references file this may superficially decrease the mapping quality as there is no clear individual hit in the file. It is for this reason we only include a single Sabin reference for individual poliovirus types and no additional VDPVs.
The length of the alignment block is also used to filter out sporadic, incorrect hits. The minimum block length is 0.6 (or 60%) of the minimum read length by default, so an alignment block length of 600 or more.
-q, --min-mapping-quality
-a, --min-aln-block
We have set the minimum read depth to be 50 reads in order to attempt to make a consensus. Within piranha, we run minimap2 to map reads against the background reference panel. The top hit within the background reference panel is reported, by default showing the "ddns_group" field. The categories displayed are:
- Sabin1-Related
- Sabin2-Related
- Sabin3-Related
- WPV1
- WPV2
- WPV3
- NonPolioEV
- Unmapped
- PositiveControl (if a positive control has been used)
When sequencing samples at high depth, using mapping software on the raw nanopore reads (which are error prone) can lead to a certain level of noise. Hits above the minimum read depth threshold (Default >50) are highlighted in red in the final report. If the population of reads mapping to a particular reference successfully makes a consensus sequence at the end of the piranha pipeline, this is an indication of a genuine population of reads rather than noise.
Importantly, the reference that is hit within the background references file does not fully indicate what consensus sequence will be generated from the read population. Further phylogenetic analysis should be performed to confirm the identity of the sequence that you get.
Users can specify their own reference file or by default piranha will access the reference file packaged with the software.
The reference file must be in fasta format and, within that file, the first field must be the reference ID (without spaces). There must also be ddns_group specified in the header field. This reference file can include additional information in the following format:
>Poliovirus3-wt_JN812657 ddns_group=WPV3 species=Poliovirus3-wt
GGGGTGGACGATCTGATAACAGAA...
>Poliovirus3-Sabin_AY184221 ddns_group=Sabin3-related species=Sabin3-related
GGTATTGAAGATTTGACTTCTGAA...
Notably, ddns_group
allows multiple references to be aggregated together/ anonymised within the final report. Current compatibility within piranha allows ddns_group
fields to include: WPV1
,WPV2
,WPV3
,Sabin1-related
,Sabin2-related
,Sabin3-related
and NonPolioEV
.
- Gathers all read files for a given barcode together
- Filters these reads by length (Default only reads between 1000 and 1300 bases are included for further analysis)
- The reads are mapped against a panel of references. By default there is a reference panel included as part of piranha. It includes VP1 sequences from various wild-type polio viruses, reference Sabin-1, Sabin-2 and Sabin-3 sequences for identification of Sabin-like or VDPV polioviruses, and also a number of non-polio enterovirus reference seqeuences. A custom VP1 fasta file can be supplied with the
-r
flag. - The read map files are parsed to assign each read to the closest virus sequence in the reference panel. This assigns each read a broad category of either Sabin-1 like, Sabin-2 like, Sabin-3 like, wild-type Poliovirus (1, 2, or 3), non-polio enterovirus, or unmapped.
- These broad category assignments are used to bin reads for further downstream analysis. Any bin with greater than the minimum read threshold (Default 50 reads, but can be customised) and minimum read percentage (default 10% of sample, but can be customised) is written out in a separate fastq file which will be used to generate the broad-category consensus sequence.
- For each bin, a consensus sequence is generated using medaka and variation information is calculated for each site in the alignment against the reference. This calculates the consensus variants within each sample.
- The variants that are flagged by medaka are assessed for read co-occurance to tease apart variant haplotypes within the sample.
- For the entire run, and for each individual barcode/ sample, an interactive html report is generated summarising the information.
Piranha is configured by default to work with the nested amplification VP1 protocol developed by the Polio Sequencing Consortium. If you're testing another protocol or want to modify the default analysis behaviour you can configure thresholds and models with various command line arguments (the snake case of the long-form arguments can be supplied in an input config file, see example here).
Piranha uses medaka to generate consensus sequences. Medaka is software developed by Oxford nanopore technologies that explicitly deals with the error profile of nanopore data. Unlike Illumina sequencing, the error profile in sequencing reads generated by nanopore technology is not randomly distributed. Areas of low complexity, repetitive regions and particularly homopolymeric runs are lower in accuracy. Software like medaka (instead of traditional variant calling and consensus generation software) uses machine learning methods to compensate for this non-random error profile and for a long time the leading variant calling software for nanopore sequencing has used machine learning methods (e.g. nanopolish and medaka). For detailed information about the internal workings of medaka, see this helpful tutorial.
By default the medaka model run is r941_min_high_g360
. You should ensure to run the appropriate model for your data. The format of medaka model names is of:
{pore}_{device}_{caller variant}_{caller version}
so the default model in use assumes you've run
- a R9.4.1 flow cell
- on a MinION (or also for GridION can leave it set to
min
- only need to changemin
toprom
if it was a PromethION run) - in high accuracy mode (if you've run in fast mode this should be changed)
- with Guppy version 3.6.0
To see the available list of medaka models installed you can use the piranha command
(piranha) aine$ piranha --medaka-list-models
and piranha will check which ones you have installed with your version of medaka, print them to screen and exit. Medaka may lag behind the latest models of guppy available, so use the closest model you can to what you're running but also be aware that the exact model for your version of Guppy may not yet exist if you're on the cutting edge of Guppy versions. Conversely, if you are running a very old version there may also not be an exact medaka model for your data. The developers of medaka suggest to run the correct model for best results. They also state:
Where a version of Guppy has been used without an exactly corresponding medaka model, the medaka model with the highest version equal to or less than the guppy version should be selected.
Piranha now has the flexibility to cluster sequencing reads by a specified reference group field (-rg / --reference-group-field
, or reference_group_field
in a config file). By default, this field is set to ddns_group
and any records in the reference file must contain this as an annotation in the header description. The reference file that is supplied with piranha contains this field in all records and the values fall into the following categories:
- Sabin1-related
- Sabin2-related
- Sabin3-related
- WPV1
- WPV2
- WPV3
- NonPolioEV
These fields focus on poliovirus detection and all non-polio enterovirus sequences are collaposed into a single category.
If your use of piranha extends to other enteroviruses, it is now possible to run a more tailored analysis. The reference file that is supplied with piranha has the following format:
>Poliovirus1-Sabin_AY184219 ddns_group=Sabin1-related species=Sabin1-related cluster=Sabin1-related
GGGTTAG...CCACATAT
>CoxsackievirusA21_EF015028 ddns_group=NonPolioEV species=CoxsackievirusA21 cluster=CoxsackievirusA21
AGATGCA...CAGTGAGA
With the -rg
flag, you can use the reference file supplied to cluster by ddns_group (default), species or cluster. Then, piranha will dynamically calculate which categories are present during the run. For example, running piranha with -rg species
will enable clusters of CoxsackievirusA21, or other enterovirus species to be identified and have a consensus sequence generated during piranha analysis.
It is also possible to supply a custom reference file with references and fields of interest to piranha in the following format:
>REF1 my_custom_field=group1
ACTGT....CTAGCTA
>REF2 my_custom_field=group1
ACTGT....CTAGCTA
>REF3 my_custom_field=group1
ACTGT....CTAGCTA
>REF4 my_custom_field=group2
ACTGT....CTAGCTA
>REF5 my_custom_field=group2
ACTGT....CTAGCTA
and in this example, if -rg my_custom_field
is supplied, any reads that map to REF1, REF2 or REF3 will be clustered into 'group1' and any reads that map to REF4 or REF5 will be clustered into 'group2'.
Any values will reads mapping will be shown in the final reports.
Pitfalls to note:
-
A caveat with this new approach is that our reference file has been constructed with poliovirus primarily in mind. There are quite a few enterovirus references in the file, covering 113
species
groups, but there is likely to be gaps in the broad enterovirus diversity covered in this reference file. If you have species of interest you want to be detected, it may be worth supplying a custom reference file. -
Enteroviruses are a very diverse group of viruses, it may well be that portions of a read match to something in our reference set, but the rest is too divergent to map. In this case, lowering the alignment block length minimum requirement may enable partial sequences to be recovered. Here, the limitation is the reference-based approach and supplementing the reference file with additional reference sequences may resolve this.
-
Mapping against a reference set that contains multiple sequences that are very similar may significantly decrease the alignment quality score from minimap2. This case arises when there are 2 equally likely matches in the database to your sequencing reaad and the software cannot confidently call the best hit. In the default reference set, we only include a single Sabin1, Sabin2 and Sabin3 reference respectively, as including VDPV sequences in the database may confuse the mapping software. By default, reads with low alignment mapping scores will be filtered out. By lowering the mapping quality threshold, these split cases may be included in the analysis for enteroviruses.
Suggested configuration settings can be found here and we would recommend to only use this mode if the user can sufficiently appreciate the limitations the reference file composition may have on detection.
With the supplied config file, piranha can be run as follows:
piranha -c config.ev.yaml
There is now an experimental haplotype calling pipeline that uses freebayes for initial variant calling and flopp for read phasing with the called variants. It has a number internal QC steps for merging identical haplotypes. This pipeline requires further validation, but can theoretically produce multiple consensus sequences for each poliovirus population present within a sample. The pipeline has mostly be tested on VDPVs of two mixtures, but will be further assessed.
To try out the novel haplotype calling pipeline use the --run-haplotyping
flag.
Piranha allows the user to optionally run a phylogenetics module in addition to variant calling and consensus builing. If you have previously installed piranha, are 3 additional dependencies needed if you wish to run this module:
- IQTREE2
- mafft
- jclusterfunk
The latest environment file contains this dependencies, so to install you can just update your environment (
conda env update -f environment.yml
) or run using the latest image for piranha GUI.
This module will cluster any consensus sequences generated during the run into reference_group
, so either Sabin1-related
, Sabin2-related
, Sabin3-related
or WPV1
and will ultimately build one maximum-likelihood phylogeny for each reference group with consensus sequences in a given sequencing run. To annotate the phylogeny with certain metadata from the barcodes.csv file, specify columns to include with -pcol/--phylo-metadata-columns
.
Piranha then extracts any relevant reference sequences from the installed reference file (identified by having ddns_group=Sabin1-related
in their sequence header, or whichever reference group the relevant phylogeny will be for).
An optional set of local sequences can be supplied to supplement the phylogenetic analysis. To supply them to piranha, point to the correct directory using -sd,--supplementary-datadir
. The sequence files should be in FASTA format, but do not need to be aligned. To allow piranha to assign the sequences to the relevant phylogeny, the sequence files should have the reference group annotated in the header in the format ddns_group=Sabin1-related
, for example.
This supplementary sequence files can be accompanied with csv metadata files (one row per supplementary sequence) and this metadata can be included in the final report and annotated onto the phylogenies (-smcol/--supplementary-metadata-columns
). By default, the metadata is matched to the FASTA sequence name with a column titled sequence_name
but this header name can be configured by specifying -smid/--supplementary-metadata-id-column
.
Piranha will iterate accross the directory supplied and amalgamate the FASTA files, retaining any sequences with ddns_group=X
in the header description, where X can be one of Sabin1-related
, Sabin2-related
, Sabin3-related
or WPV1
. It then will read in every csv file it detects in this directory and attempts to match any metadata to the gathered fasta records. These will be added to the relevant phylogenies.
The phylogenetic pipeline is activated by running with the flag -rp/--run-phylo
, which then triggers the following analysis steps:
- Amalgamate the newly generated consensus sequences for all barcodes into their respective reference groups.
- Combine with the relevant Sabin or wild-type poliovirus references.
- Combine in any relevant supplementary sequences provided.
- Add in an outgroup sequence for each reference group phylogeny.
- Determine what metadata is provided and which columns should be used for tree annotation (this will allow the user to colour the tips in the report by certain metadata).
- For each reference group, align the combined FASTA file using
mafft
. - Build a phylogeny using
IQTREE2
with a HKY substitution model and zero branches collapsed, with the specified outgroup sequence. - Prune off the outgroup sequences using
jclusterfunk
. - Annotate the tree newick files with the specified metadata (Default: just whether it's a new consensus sequence or not).
- Extract phylogenetic trees and embed in interactive report.
If you supply a path to the -sd,--supplementary-datadir
for the phylogenetics module, you have the option of updating this data directory with the new consesnsus sequences generated during the piranaha analysis. If you run with the -ud,--update-local-database
flag, piranha will write out the new sequences and any accompanying metadata supplied into the directory provided.
The files written out will be in the format runname.today.fasta
and runname.today.csv
. For example, if your runname supplied is MIN001
and today's date is 2023-11-05
, the files written will be:
MIN001.2023-11-05.fasta
MIN001.2023-11-05.csv
with the newly generated consensus sequences and accompanying metadata from that run.
Note: if supplying the supplementary directory to piranha on a subsequent run, your updated local database will be included in the phylogenetics. However, piranha will ignore any files with identical
runname.today
patterns to the active run. So, if your current run would produce files calledMIN001.2023-11-05.fasta
andMIN001.2023-11-05.csv
, if those files already exist in the supplementary data directory, they will be ignored. This is to avoid conflicts if piranha is run multiple times on the same data.
By default the output directory will be created in the current working directory and will be named analysis-YYYY-MM-DD
, where YYYY-MM-DD is today's date. This output can be configured in a number of ways. For example, the prefix analysis
can be overwritten by using the -pre/--output-prefix new_prefix
flag (or output_prefix: new_prefix
in a config file) and this will change the default behaviour to new_prefix_YYYY-MM-DD
. It's good practice not to include spaces or special characters in your directory names.
To completely overwrite the output directory name (rather than just the prefix), you can run the -o/--outdir
flag (or supply outdir
in the config file).
If you're running multiple analyses from the same directory and not supplying new directory names, piranha will append an incrementing number to the end of the directory name so that the contents within the previous run output don't cause conflicts within the new run. To change this default behaviour and overwrite the previous directory, use the --overwrite
flag. Note: this will wipe all the contents within the analysis-YYYY-MM-DD
directory and re-populate it with the output of the new piranha run.
By default, piranha removes the intermediate files it produces during the analysis run. If you want to access the intermediate files (for example, the bam files, the read bins etc.) run with --no-temp
and all intermediate files will be kept.
By default temporary files are stored in $TMPDIR
, which then gets wiped when the process completes. If you want to store temp files elsewhere this can be done with --tempdir
.
Piranha can handle archiving the read directory supplied for analysis. If the --archive-fastq
flag is supplied to piranha (or archive_fastq
in a config file), piranha will copy the directory tree (so all files and sub-directories) in the readdir path to the output directory under the name "fastq_pass".
If you want to customise where piranha copies the sequencing read directory to, supply the --archivedir
argument with the path you wish to write the reads to and it will overwrite the default option to write the reads to the output directory.
The output report can include some information about the sequencing run, the name of the individual running the report and the institute doing the sequencing. To give the report a specific run name (rather than the default title Nanopore sequencing report
) supply the new name with the command line flag --runname
or as runname:
in the config file. Similarly, to enable the report to display the name of the user and institute, provide --username
and --institute
(or within the config file).
A free-text --notes
field can be supplied that will get embedded into the top of the html report.
The orientation of barcodes in the vizualisation representing a 96-well plate can be configured. By default the barcodes increment from 1 to 96 in a vertical
orientation (down each column in turn). By using the --orientation
flag on the command line or orientation
key in a configuration file, either the default vertical
or horizontal
(which will increment the numbers from 1-96 across each row in turn) can be specified. If well
is supplied as a column in the barcode.csv file, this default orientation will be overwritten. Format for well specification is as a standard 96-well plate: A01 to H12.
The reports are available in English (default) and in French.
The header includes the title of the run (supply with --runname
) and the date of report generation.
Optionally the report can also display user and institute information beneath the header (--username
and --institute
flags).
This table gives summary information about which viruses were identified within each sample and, for Sabin-related polioviruses, the number of mutations away from Sabin. It also has a link that allows you to download a particular consensus sequence. The rows can be sorted by each column in either ascending or descending order by clicking on the column header. By default, the table is sorted by sample name. It's possible to search the table by typing in the text box on the right-hand side. Clicking on the sample name within the table will redirect the user to the individual sample reports. Be aware this link works in situ, when path to the sample report relative to the main output report has not changed, but if the report is moved or distributed without the barcode reports, this link will no longer function.
Under the table header, there's a dropdown menu that gives options to export the displayed table (either by copying to the clipboard, as a csv or directly printing it). By clicking on rows within the table, it's possible to select the subset of rows you're interested in exporting and this is what will be sent to the clipboard, file or printer.
To the right, there is a button to download the detailed table. This additional table has a compiled set of data from the sequencing run. By default the information provided in this this file are as follows.
First columns:
sample, barcode, EPID, institute,...
Then any additional fields supplied in the barcodes.csv for each sample (e.g. date of collection, date of sequencing)
Then information about each reference group in turn. For example, Sabin1-related:
Sabin1-related|closest_reference,Sabin1-related|num_reads,Sabin1-related|nt_diff_from_reference,Sabin1-related|pcent_match,Sabin1-related|classification,...
Finally a generic comments column that can be modified later (or prepopulated if provided in the barcodes.csv).
An example of this detailed file can be found here.
This table displays read counts for each sample that have mapped to each of the reference groups. Similar to Table 1, this table can be sorted, searched and exported.
This table will only appear if there are identical sequences in your sequencing run. It flags samples that contain consensus sequences that are completely identical. This may be a sign of contamination within the sequencing run, but doesn't necessarily mean this. This table serves as a prompt to investigate why these sequences may be identical.
If a negative and/or positive control are included in your sequencing run (piranha will automatically detect them if their sample name is negative
or positive
, or the user can specify the name of their controls with the command line flags or by providing them in a config file). If the control "passes" (i.e. has fewer than 50 reads for the negative control or has more than 100 reads in the NonPolioEV category for the positive control), the row will be coloured green and have a tick mark under the "Pass" column. If the controls fail, the row will be coloured red. If the controls fail, this may be an indication of a failed sequencing run.
This report gives additional information specific to the sequences generated for each sample.
This table summarises the reference groups found within the sample. Clicking on the reference group within the table will link the user down to the appropriate section of the report.
All consensus sequences generated for this sample are available within this box in fasta format. The header of the fasta file by default is of the format:
>sample|barcode|reference_group|reference|number_nt_diffs_from_reference|variants
AAGTCAGTCATCAGCTGACT...CAGTGATCGAGCTAGTAT
Additional information can be detected and appended to this fasta header. By default, if a date
column or EPID
column is provided in the barcodes.csv file, this will be appended to the header. Piranha has a flag --all-metadata-to-header
that will append all metadata fields provided in the input csv file to the header.
This fasta file can be highlighted and copied to the clipboard, or accessed as part of the "published_data" directory in the output directory of piranha.
This section of the report exists for each reference group identified within the sample.
Firstly, a table summarises the number of mutations and what mutations have been identified relative to the closest reference.
The variation figure shows the noise within the data in contrast to the variants that get called within the sample. Each point represents the percentage alternative allele present at that site in the VP1 sequence. The baseline reference for Sabin-related sequences is the relevant Sabin reference, whereas for the other reference groups (wild-type poliovirus and non-polio enterovirus) the baseline reference is the consensus sequence generated. This means that for Sabin-related sequences, the differences relative to Sabin are highlighted and for the other read populations within the sample, differences within sample relative to the consensus of that read population are highlighted. The SNPs called by Medaka relative to the chosen baseline reference (either Sabin or consensus sequence) are highlighted in dark green and insertion-deletion mutations (if any) are highlighted in yellow. Masked variants are highlighted in red. A variant will be masked if it will cause a frameshift within the VP1 region (like an indel whose length is not a multiple of 3) or if it is part of a string of variants in very close proximity (highlighting an issue around that area).
This snipit plot (https://github.com/aineniamh/snipit) highlights the consensus differences relative to the reference.
This plot calculates the percentage of reads that share the variants called against the reference, and can give a noisy estimate of haplotypes present within the sample. The calculation only takes bases of high-quality into account, so counts reflect the level of support rather than the level of coverage. For each variant, the percentage displayed is a function of the total high quality bases at that site, and for each variant the cooccurance is a percentage of high quality bases at each site in turn that share both variants. This figure can give an idea if mixed populations are present within the sample.
usage:
piranha -c <config.yaml> [options]
piranha -b <barcodes.csv> -i <demultiplexed fastq_dir> [options]
Input options:
-c CONFIG, --config CONFIG
Input config file in yaml format, all command line arguments can be passed via the config file.
-i READDIR, --readdir READDIR
Path to the directory containing fastq read files
-b BARCODES_CSV, --barcodes-csv BARCODES_CSV
CSV file describing which barcodes were used on which sample
-r REFERENCE_SEQUENCES, --reference-sequences REFERENCE_SEQUENCES
Custom reference sequences file.
-rg REFERENCE_GROUP_FIELD, --reference-group-field REFERENCE_GROUP_FIELD
Specify reference description field to group references by. Default: `ddns_group`
-nc NEGATIVE_CONTROL, --negative-control NEGATIVE_CONTROL
Sample name of negative control. If multiple samples, supply as comma-separated string of sample names. E.g.
`sample01,sample02` Default: `negative`
-pc POSITIVE_CONTROL, --positive-control POSITIVE_CONTROL
Sample name of positive control. If multiple samples, supply as comma-separated string of sample names. E.g.
`sample01,sample02`. Default: `positive`
-pr POSITIVE_REFERENCES, --positive-references POSITIVE_REFERENCES
Comma separated string of sequences in the reference file to class as positive control sequences.
Analysis options:
-s SAMPLE_TYPE, --sample-type SAMPLE_TYPE
Specify sample type. Options: `stool`, `environmental`. Default: `stool`
-m ANALYSIS_MODE, --analysis-mode ANALYSIS_MODE
Specify analysis mode to run. Options: `vp1`,`panev`, `wg`. Default: `vp1`
--medaka-model MEDAKA_MODEL
Medaka model to run analysis using. Default: r941_min_hac_variant_g507
--medaka-list-models List available medaka models and exit.
-q MIN_MAP_QUALITY, --min-map-quality MIN_MAP_QUALITY
Minimum mapping quality. Default: 15
-n MIN_READ_LENGTH, --min-read-length MIN_READ_LENGTH
Minimum read length. Default: 1000
-x MAX_READ_LENGTH, --max-read-length MAX_READ_LENGTH
Maximum read length. Default: 1300
-d MIN_READ_DEPTH, --min-read-depth MIN_READ_DEPTH
Minimum read depth required for consensus generation. Default: 50
-p MIN_READ_PCENT, --min-read-pcent MIN_READ_PCENT
Minimum percentage of sample required for consensus generation. Default: 2
-a MIN_ALN_BLOCK, --min-aln-block MIN_ALN_BLOCK
Minimum alignment block length. Default: 0.6*MIN_READ_LENGTH
--primer-length PRIMER_LENGTH
Length of primer sequences to trim off start and end of reads. Default: 30
Haplotyping options:
-rh, --run-haplotyping
Trigger the optional haplotyping module. Additional dependencies may need to be installed.
-hs HAPLOTYPE_SAMPLE_SIZE, --haplotype-sample-size HAPLOTYPE_SAMPLE_SIZE
Number of reads to downsample to for haplotype calling. Default: 3000
-hf MIN_ALLELE_FREQUENCY, --min-allele-frequency MIN_ALLELE_FREQUENCY
Minimum allele frequency to call. Note: setting this below 0.07 may significantly increase run time. Default: 0.07
-hx MAX_HAPLOTYPES, --max-haplotypes MAX_HAPLOTYPES
Maximum number of haplotypes callable within reference group. Default: 4
-hdist MIN_HAPLOTYPE_DISTANCE, --min-haplotype-distance MIN_HAPLOTYPE_DISTANCE
Minimum number of SNPs between haplotypes. Default: 2
-hd MIN_HAPLOTYPE_DEPTH, --min-haplotype-depth MIN_HAPLOTYPE_DEPTH
Minimum number of reads in a given haplotype. Default: 20
Phylogenetics options:
-rp, --run-phylo Trigger the optional phylogenetics module. Additional dependencies may need to be installed.
-sd SUPPLEMENTARY_DATADIR, --supplementary-datadir SUPPLEMENTARY_DATADIR
Path to directory containing supplementary sequence FASTA file and optional metadata to be incorporated into phylogenetic
analysis.
-pcol PHYLO_METADATA_COLUMNS, --phylo-metadata-columns PHYLO_METADATA_COLUMNS
Columns in the barcodes.csv file to annotate the phylogeny with. Default: ['call', 'sample_date', 'EPID']
-smcol SUPPLEMENTARY_METADATA_COLUMNS, --supplementary-metadata-columns SUPPLEMENTARY_METADATA_COLUMNS
Columns in the supplementary metadata to annotate the phylogeny with. Default: ['location', 'lineage']
-smid SUPPLEMENTARY_METADATA_ID_COLUMN, --supplementary-metadata-id-column SUPPLEMENTARY_METADATA_ID_COLUMN
Column in the supplementary metadata files to match with the supplementary sequences. Default: sequence_name
-ud, --update-local-database
Add newly generated consensus sequences (with a distance greater than a threshold (--local-database-threshold) away from
Sabin, if Sabin-related) and associated metadata to the supplementary data directory.
-dt, --local-database-threshold
The threshold beyond which Sabin-related sequences are added to the supplementary data directory if update local database
flag used. Default: 6
Output options:
-o OUTDIR, --outdir OUTDIR
Output directory. Default: `analysis-2022-XX-YY`
-pub PUBLISHDIR, --publishdir PUBLISHDIR
Output publish directory. Default: `analysis-2022-XX-YY`
-pre OUTPUT_PREFIX, --output-prefix OUTPUT_PREFIX
Prefix of output directory & report name: Default: `analysis`
--datestamp DATESTAMP
Append datestamp to directory name when using <-o/--outdir>. Default: <-o/--outdir> without a datestamp
--overwrite Overwrite output directory. Default: append an incrementing number if <-o/--outdir> already exists
-temp TEMPDIR, --tempdir TEMPDIR
Specify where you want the temp stuff to go. Default: `$TMPDIR`
--no-temp Output all intermediate files. For development/ debugging purposes
--all-metadata-to-header
Parse all fields from input barcode.csv file and include in the output fasta headers. Be aware spaces in metadata will
disrupt the record id, so avoid these.
--language LANGUAGE Output report language. Options: English, French. Default: English
--save-config Output the config file with all parameters used
--archive-fastq Write the supplied fastq_pass directory to the output directory.
--archivedir ARCHIVEDIR
Configure where to put the fastq_pass files, default in the output directory.
Misc options:
--runname RUNNAME Run name to appear in report. Default: polioDDNS
--username USERNAME Username to appear in report. Default: no user name
--institute INSTITUTE
Institute name to appear in report. Default: no institute name
--notes NOTES Miscellaneous notes to appear at top of report. Default: no notes
--orientation ORIENTATION
Orientation of barcodes in wells on a 96-well plate. If `well` is supplied as a column in the barcode.csv, this default
orientation will be overwritten. Default: `vertical`. Options: `vertical` or `horizontal`.
-t THREADS, --threads THREADS
Number of threads. Default: 1
--verbose Print lots of stuff to screen
-v, --version show program's version number and exit
-h, --help