This workflow was created to efficiently gather, process, and visualize data presented in Halfmann and Minor et al. 2022, Evolution of a globally unique SARS-CoV-2 Spike E484T monoclonal antibody escape mutation in a persistently infected, immunocompromised individual. Raw reads from the course of the prolonged infection are pulled automatically from SRA BioProject PRJNA836936. Consensus sequences are bundled together with the workflow documents and are included in the project GitHub repository. Also included is a table of neutralization assay results and a table of qPCR cycle threshold (Ct) values from the infection.
Note that this workflow does not include processing behind Halfmann and Minor et al. 2022 supplemental tables and figures. These workflows are bundled separately and can be accessed by following the links in these GitHub repos:
If you already have Docker and NextFlow installed on your system, simply run the following command in the directory of your choice:
nextflow run dholab/E484T-visualizations -latest
This command automatically pulls the workflow from GitHub and runs it. If you do not have Docker and NextFlow installed, or want to tweak any of the default configurations in the workflow, proceed to the following sections.
To run this workflow, start by running git clone
to bring the workflow files into your directory of choice, like so:
git clone https://github.com/dholab/E484T-visualizations .
Once the workflow bundle is in place, you may need to double-check that the workflow scripts are executable by running chmod +x bin/*
in the command line.
You will also need to install the Docker engine if you haven't already. The workflow pulls all the software it needs automatically from Docker Hub, which means you will never need to permanently install that software on your system. To install Docker, simply visit the Docker installation page at https://docs.docker.com/get-docker/.
This workflow uses the NextFlow workflow manager. We recommend you install NextFlow to your system in one of the two following ways:
- Install the miniconda python distribution, if you haven't already: https://docs.conda.io/en/latest/miniconda.html
- Install the
mamba
package installation tool in the command line:conda install -y -c conda-forge mamba
- Install Nextflow to your base environment:
mamba install -c bioconda nextflow
- Run the following line in a directory where you'd like to install NextFlow, and run the following line of code:
curl -fsSL https://get.nextflow.io | bash
- Add this directory to your $PATH. If on MacOS, a helpful guide can be viewed here.
To double check that the installation was successful, type nextflow -v
into the terminal. If it returns something like nextflow version 21.04.0.5552
, you are set and ready to proceed.
Now that you have cloned the workflow files, installed Docker, and installed NextFlow, you are ready to run the workflow. To do so, simply change into the workflow directory and run the following in the BASH terminal:
nextflow run prolonged_infection_workflow.nf
If the workflow runs partway, but a computer outage or other issue interrupts its progress, no need to start over! Instead, run:
nextflow run prolonged_infection_workflow.nf -resume
The workflow's configurations (see below) tell NextFlow to plot the workflow and record run statistics. However, the plot the workflow, note that NextFlow requires the package GraphViz, which is easiest to install via the intructions on GraphViz's website.
Finally, if you are running the workflow on macOS, you may run into memory issues if you have not allocated enough storage to the Docker Engine. If possible, we recommend you allocate 16GB of RAM to your Docker engine, though 6 GB, 8 GB, or 10 GB may also supply enough. At the time that we designed this workflow, this Stack Overflow answer is a good guide for increasing Docker memory.
The following runtime parameters have been set for the whole workflow:
primer_key
- path to an important CSV that lists which primer set was used for each samplerefdir
- path to the directory where the reference sequence and primer BED files are locatedrefseq
- path to the SARS-CoV-2 Wuhan-1 sequence from GenBank Accession MN9089473.3refgff
- path to the gene feature file for the SARS-CoV-2 Wuhan-1 sequenceCt_data
- path to the CSV file of Ct valuesneut_data
- path to the CSV file of neutralization assay dataresults
- path to the results directory, the workflow's default output locationresults_data_files
- path to a subdirectory ofresults/
where data files like VCFs are placed.visuals
- path to a subdirectory ofresults/
where graphics are stored. This is where the final figure PDFs are placed.
These parameters can be altered in the command line with a double-dash flag, like so:
nextflow run prolonged_infection_workflow.nf --results ~/Downloads
Bundled together with the workflow are:
- primer BED files for ARTIC v3, ARTIC v4.1, and MIDNIGHT v1 are in
ref/
- the SARS-CoV-2 Wuhan-1 sequence from GenBank Accession MN9089473.3 is in
ref/
and is called reference.fasta - the MN9089473.3 gene coordinate/codon GFF file is also in
ref/
and is called MN9089473.gff3 - a CSV file of qPCR cycle threshold values through the course of the prolonged infection is in
data/
- a CSV file of neutralization assay results is also in
data/
- a very important file called
fig2b_raw_read_guide.csv
is inresources/
. This file enables the workflow to use the correct primers for each sample insamtools ampliconclip
.
- the first process identifies the SARS-CoV-2 pango lineage for the consensus sequences from each timepoint
- qPCR cycle threshold values are then plotted
- each sequence from the FASTA of consensus sequences is isolated into its own FASTA file
- the consensus FASTAs are converted to SAM format (mapped to the Wuhan-1 SARS-CoV-2 sequence)
- a samtools
mpileup
is generated for each consensus SAM - the consensus pileups are piped into iVar, which variant calls each consensus sequence and also identifies protein effects
- SRA accession numbers are taken from
config/fig2b_raw_read_guide.csv
and used to pull FASTQs for each sequencing timepoint. Because NCBI'sfastq-dump
is very slow, this step takes the longest. NOTE that because the settings for mapping these reads depend on sequencing platform (e.g., Illumina vs. Oxford Nanopore), there are near-identical processes laid out for reads from each sequencing platform. - the reads are mapped to the Wuhan-1 SARS-CoV-2 sequence
- the mapped reads in SAM format are then clipped down to amplicons and converted to BAM format
- a tool called
covtobed
then identifies regions in any of these BAMs where depth of coverage is less than 20 reads, indicating an amplicon dropout - the amplicon dropout BED files, the consensus sequence FASTA, and the iVar variant tables are then plotted with an R script
- once Figure 2A has been plotted, all the FASTQ files, BAM files, BED files, and variant tables generated at intermediate steps of the workflow are stored and compressed in a TAR archive.
- the neutralization assay results are plotted with an R script
data/reads/[timepoint_day_sequencingplatform].fastq.gz
- Raw reads for each timepoint that we have deposited in Sequence Read Archive (SRA) BioProject PRJNA836936data/[timepoint_day_sequencingplatform].sam
- reads aligned to the SARS-CoV-2 Wuhan-1 sequence in human-readable SAM formatdata/[timepoint_day_sequencingplatform].bam
- reads aligned to the SARS-CoV-2 Wuhan-1 sequence, filtered down to only reads that are within the span of each amplicon in the timepoint's primer set, and converted to the smaller, non-human readable BAM format.results/data/[timepoint_day_sequencingplatform].bed
- BED files showing genome regions where fewer than 20 reads aligned. These low-depth regions are likely to be regions where amplification failed, i.e., amplicon dropouts.results/data/[GISAID_strain_name]_consensus_variant_table.tsv
- iVar tables of single-nucleotide variants identified in the consensus sequence of each timepoint, along with codon information and protein effects for each variant.results/data/patient_variant_counts.csv
- simple table recording the number of variants iVar identified at each timepoint. This is used as an input for the script that generates Supplementary Figure 2.results/prolonged_infection_lineage_report.csv
- pango lineages classified for each timepoint.results/visuals/fig1a_ct_values_thru_time.pdf
- PDF file of Figure 1A, which can be polished in vector graphics software like Adobe Illustrator.results/visuals/fig2a_consensus_mutations.pdf
- PDF file of Figure 2A, which can be polished in vector graphics software like Adobe Illustrator.results/visuals/fig2c_neutralization_assay.pdf
- PDF file of Figure 2C, which can be polished in vector graphics software like Adobe Illustrator.
Note that figures 1B and 2B are produced manually.
This workflow was created by Nicholas R. Minor. To report any issues, please visit the GitHub repository for this project.
- NBI Sweden Reproducibility Workshop GitHub - This one is more up to date than the version on canva, which is the next link
- NBI Sweden Reproducibility Workshop Canva
- NextFlow Official Documentation
- Exhaustive NextFlow Github.io Tutorial
- Seqera Labs NextFlow Tutorial
- Tronflow NextFlow tutorial
- Google Cloud NextFlow tutorial
- Data Carpentries Incubator Workshop on NextFlow - This one is especially thorough and helpful, and also introduces nf-core.
- NextFlow Cheatsheet PDF - Not a tutorial, but this is a useful PDF to have on hand.