This repository contains the Varpipe_wgs pipeline developed by the Division of TB Elimination. The pipeline cleans the data and performs analyses, including typing and variant detection. While originally built to analyze tuberculosis data, the pipeline accepts other references, allowing it to be used more broadly.
End users can run the pipeline using Docker, Singularity, or their local machine.
First, copy the gzipped fastq files you wish to analyze to the data/ directory in this repository. Fastq files should be named using the standard Illumina format.
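As a rough pre-flight check, the file names can be tested before a run. The sketch below assumes the common Illumina pattern <SampleName>_S<n>_L<lane>_R<1|2>_001.fastq.gz; the exact pattern the pipeline expects is not stated here, and is_illumina_name is a hypothetical helper, not part of the repository.

```shell
# Hypothetical helper (not part of the pipeline): succeeds if the file name
# matches the assumed pattern <SampleName>_S<n>_L<lane>_R<1|2>_001.fastq.gz.
is_illumina_name() (
    name=$(basename "$1")
    printf '%s\n' "$name" |
        grep -Eq '^[A-Za-z0-9_-]+_S[0-9]+_L[0-9]{3}_R[12]_001\.fastq\.gz$'
)

# Report any gzipped fastq file in data/ that does not match the pattern.
for f in data/*.fastq.gz; do
    [ -e "$f" ] || continue          # skip if the glob matched nothing
    is_illumina_name "$f" || echo "unexpected file name: $f" >&2
done
```

Files reported by the loop may still be valid if your sequencer uses a different naming variant, so treat the output as a hint rather than an error.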
To use Docker to run the pipeline, first choose whether you will use the container image with or without references included.
To use the image with references, run the following commands:
docker pull ghcr.io/cdcgov/varpipe_wgs_with_refs:latest
docker run -it -v <path to data>:/varpipe_wgs/data ghcr.io/cdcgov/varpipe_wgs_with_refs:latest
To use the image without references you will need to change the container name in the command to varpipe_wgs_without_refs:latest, and you will also need to specify a folder with the references to be used by clockwork. This folder must be mounted to the container as /varpipe_wgs/tools/clockwork-0.11.3/OUT.
docker pull ghcr.io/cdcgov/varpipe_wgs_without_refs:latest
docker run -it -v <path to data>:/varpipe_wgs/data -v <path to references>:/varpipe_wgs/tools/clockwork-0.11.3/OUT ghcr.io/cdcgov/varpipe_wgs_without_refs:latest
Those commands will download the most recent version of the pipeline and then start the container and connect to it.
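The two bind mounts for the image without references are easy to mistype. The following sketch prints the docker run command for given data and reference directories after checking that both exist. The function name docker_cmd_without_refs is hypothetical, and resolving to absolute paths is a precaution chosen here rather than something the pipeline requires.

```shell
# Hypothetical helper: print (do not execute) the docker run command for the
# image without references. Fails if either directory does not exist.
# Note: paths containing spaces are not quoted in the printed command.
docker_cmd_without_refs() (
    data_dir=$(cd -- "$1" >/dev/null && pwd) || exit 1
    refs_dir=$(cd -- "$2" >/dev/null && pwd) || exit 1
    printf 'docker run -it -v %s:/varpipe_wgs/data -v %s:/varpipe_wgs/tools/clockwork-0.11.3/OUT %s\n' \
        "$data_dir" "$refs_dir" "ghcr.io/cdcgov/varpipe_wgs_without_refs:latest"
)

# Example (the directories are placeholders):
#   docker_cmd_without_refs ./data /path/to/references
```

Printing the command instead of running it lets you inspect the mounts first and then paste the line into your terminal.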
When connected to the container you will be in the directory /varpipe_wgs/data. From there, start the pipeline with the following command, where <threads> is the number of threads to use (default: 4):
cd /varpipe_wgs/data
./runVarpipeline.sh <threads>
That will identify all gzipped fastq files in the directory and run the pipeline over them, creating a results folder named "Output_<MM><DD><YYYY>" with subfolders containing the results for each sample.
When the pipeline is finished you can disconnect from and close the container by pressing CTRL+D. You can then review the results.
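To locate the results from a script, the folder name can be reconstructed from the date. This is a minimal sketch that assumes the name uses the date on which the run started, with zero-padded month and day; both helper names are hypothetical.

```shell
# Hypothetical helper: print the results folder name expected for today,
# assuming the Output_<MM><DD><YYYY> pattern with zero-padded fields.
expected_output_dir() {
    printf 'Output_%s\n' "$(date +%m%d%Y)"
}

# Hypothetical helper: print the name of each sample subfolder in a
# results folder, one per line.
list_sample_dirs() {
    for d in "$1"/*/; do
        if [ -d "$d" ]; then
            basename "$d"
        fi
    done
}

# Example, run from the data/ directory after the pipeline finishes:
#   list_sample_dirs "$(expected_output_dir)"
```

If a run crosses midnight, the folder may carry the previous day's date, so check the data/ directory directly when the expected folder is missing.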
The Singularity images are too large to include in this repository, but the version with references is available on the Singularity Container Library. To access it, run the command
singularity pull library://reagank/varpipe_wgs/pipeline_with_refs
The Singularity images can also be built locally using the provided build_singularity.sh script.
./build_singularity.sh
The script builds the image without references by default. To build the image including references, provide the argument "with_references":
./build_singularity.sh with_references
Once you have downloaded or built the .sif file containing the Singularity image, the command to start and connect to the container that includes references is:
singularity shell --bind <path to data>:/varpipe_wgs/data pipeline_with_refs.sif
As with Docker, to run the pipeline without references you will need to supply clockwork-compatible references and bind them to the image as the /varpipe_wgs/tools/clockwork-0.11.3/OUT directory:
singularity shell --bind ./data:/varpipe_wgs/data --bind <path-to-references>:/varpipe_wgs/tools/clockwork-0.11.3/OUT pipeline_without_refs.sif
Please note that if you are running this pipeline using CDC SciComp resources, security settings will require specifying the SINGULARITY_TMPDIR environment variable like this:
SINGULARITY_TMPDIR=~/.tmp singularity shell --bind ./data:/varpipe_wgs/data --bind <path-to-references>:/varpipe_wgs/tools/clockwork-0.11.3/OUT pipeline_without_refs.sif
This command starts the container and connects to it.
When connected to the container you will be in your home directory. You must cd to the directory /varpipe_wgs/data. From there, start the pipeline with the command
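If you set SINGULARITY_TMPDIR regularly, it can help to create the directory and export the variable once per session instead of prefixing every command. The sketch below assumes the directory should exist before Singularity uses it; prepare_singularity_tmpdir is a hypothetical helper.

```shell
# Hypothetical helper: create the temporary directory if needed and export
# SINGULARITY_TMPDIR for the current shell session.
# The default of ~/.tmp mirrors the example above.
prepare_singularity_tmpdir() {
    tmpdir=${1:-"$HOME/.tmp"}
    mkdir -p "$tmpdir" && export SINGULARITY_TMPDIR="$tmpdir"
}

# Example:
#   prepare_singularity_tmpdir
#   singularity shell --bind ./data:/varpipe_wgs/data pipeline_with_refs.sif
```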
cd /varpipe_wgs/data
./runVarpipeline.sh <threads>
That will identify all gzipped fastq files in the directory and run the pipeline over them, creating a results folder named "Output_<MM><DD><YYYY>" with subfolders containing the results for each sample.
When the pipeline is finished you can disconnect from and close the container by pressing CTRL+D. You can then review the results.
To run the pipeline locally, you will need to have the following programs installed:
- Python 2.7
- Python 3
- Java 1.8
- Singularity >=3.5
The remaining programs used by the pipeline are included in this repository in the tools/ directory.
First, clone and download this repository with the command
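Before running a local install, you can check which of the required programs are missing from your PATH. This is a sketch only: the executable names passed in the example (python2.7, python3, java, singularity) are assumptions about how the programs are installed on your system, and the check does not verify version numbers.

```shell
# Hypothetical helper: print each named program that is not found on PATH,
# one per line. Prints nothing when all programs are available.
missing_tools() {
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            printf '%s\n' "$tool"
        fi
    done
}

# Example:
#   missing_tools python2.7 python3 java singularity
```

Versions still need to be confirmed by hand, for example with java -version and singularity --version.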
git clone https://github.com/CDCGov/NCHHSTP-DTBE-Varpipe-WGS.git
Then, simply run setup.sh to finish the installation. This script runs several steps:
- Downloads the clockwork singularity image
- Downloads GATK
- Builds a reference fasta and creates BWA indexes
Lastly, update the tools/clockwork-0.11.3/clockwork script so that it correctly points to the clockwork 0.11.3 image.
After the data has been added to the data/ directory, cd into data/ and run the pipeline with the following command, where <threads> is the number of threads to use (default: 4):
cd data/
./runVarpipeline.sh <threads>
That will identify all gzipped fastq files in the directory and run the pipeline over them, creating a results folder named "Output_<MM><DD><YYYY>" with subfolders containing the results for each sample.
This pipeline requires a large amount of memory: a minimum of 32GB of RAM is necessary for the clockwork decontamination step. If the log includes a line with the word "Killed" in the clockwork output and trimmomatic shows an error that the fastq.gz files are not found, the most likely cause is insufficient RAM.
If you have difficulty downloading the Docker image (specifically, a message containing "failed to register layer: ApplyLayer exit status 1" and ending with "no space left on device"), you may need to increase the base device size to something larger than 18GB. This problem should only occur when using the devicemapper storage driver and an older version of Docker Engine.
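The memory symptoms described above can be checked from the shell. The sketch below assumes a Linux host where /proc/meminfo is available and takes the log file path as an argument, since the log location is not specified here; both helper names are hypothetical.

```shell
# Hypothetical helper: print total RAM in whole GiB (rounded down), read from
# /proc/meminfo or from a file in the same format given as the first argument.
mem_total_gib() {
    awk '/^MemTotal:/ { printf "%d\n", $2 / 1048576 }' "${1:-/proc/meminfo}"
}

# Hypothetical helper: succeed if the given log file contains the word
# "Killed", which suggests a process was stopped for lack of memory.
log_mentions_killed() {
    grep -q 'Killed' "$1"
}

# Example (the log path is a placeholder):
#   echo "Total RAM: $(mem_total_gib) GiB"
#   log_mentions_killed path/to/sample.log && echo "possible out-of-memory kill"
```

A machine sold with 32GB typically reports slightly less than 32 GiB here because part of the memory is reserved, so a reading of 31 does not by itself indicate a problem.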
This repository constitutes a work of the United States Government and is not subject to domestic copyright protection under 17 USC § 105. This repository is in the public domain within the United States, and copyright and related rights in the work worldwide are waived through the CC0 1.0 Universal public domain dedication. All contributions to this repository will be released under the CC0 dedication. By submitting a pull request you are agreeing to comply with this waiver of copyright interest.
The repository utilizes code licensed under the terms of the Apache Software License and therefore is licensed under ASL v2 or later.
The source code in this repository is free: you can redistribute it and/or modify it under the terms of the Apache Software License version 2, or (at your option) any later version.
The source code in this repository is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the Apache Software License for more details.
You should have received a copy of the Apache Software License along with this program. If not, see http://www.apache.org/licenses/LICENSE-2.0.html
The source code forked from other open source projects will inherit its license.
This repository contains only non-sensitive, publicly available data and information. All material and community participation is covered by the Disclaimer and Code of Conduct. For more information about CDC's privacy policy, please visit http://www.cdc.gov/other/privacy.html.
Anyone is encouraged to contribute to the repository by forking and submitting a pull request. (If you are new to GitHub, you might start with a basic tutorial.) By contributing to this project, you grant a world-wide, royalty-free, perpetual, irrevocable, non-exclusive, transferable license to all users under the terms of the Apache Software License v2 or later.
All comments, messages, pull requests, and other submissions received through CDC including this GitHub page may be subject to applicable federal law, including but not limited to the Federal Records Act, and may be archived. Learn more at http://www.cdc.gov/other/privacy.html.
This repository is not a source of government records, but is a copy to increase collaboration and collaborative potential. All government records will be published through the CDC web site.
Please refer to CDC's Template Repository for more information about contributing to this repository, public domain notices and disclaimers, and code of conduct.
The Laboratory Branch (LB) of the Division of Tuberculosis Elimination developed this bioinformatic pipeline for analyzing whole genome sequencing data generated on Illumina platforms. This is not a controlled document. The performance characteristics as generated at Centers for Disease Control and Prevention (CDC) are specific to the version as written. These documents are provided by LB solely as an example for how this test performed within LB. The recipient testing laboratory is responsible for generating validation or verification data as applicable to establish performance characteristics as required by the testing laboratory’s policies, applicable regulations, and quality system standards. These data are only for the sample and specimen types and conditions described in this procedure. Tests or protocols may include hazardous reagents or biological agents. No indemnification for any loss, claim, damage, or liability is provided for the party receiving an assay or protocol. Use of trade names and commercial sources are for identification only and do not constitute endorsement by the Public Health Service, the United States Department of Health and Human Services, or the Centers for Disease Control and Prevention.