This repository contains an analysis pipeline developed to characterize WGS output
Installing and Using the varpipe_wgs pipeline

Overview

This repository contains the Varpipe_wgs pipeline developed by the Division of TB Elimination. The pipeline cleans the data and performs analyses, including typing and variant detection. While originally built to analyze tuberculosis data, the pipeline accepts other references, allowing it to be used more broadly.

End users can run the pipeline using Docker, Singularity, or their local machine.

Prepare the Data

First, copy the gzipped fastq files you wish to analyze to the data/ directory in this repository. Fastq files should be named using the standard Illumina format.
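Before starting a run, it can help to check the file names. The sketch below assumes that "Illumina standard format" means the common `<sample>_S<n>_L<lane>_R<read>_001.fastq.gz` pattern. The pipeline may accept other variants, so treat a reported file as a prompt to double-check rather than as an error. The function names `is_illumina_name` and `report_unexpected_names` are hypothetical and not part of this repository.

```shell
# Hypothetical helper (not part of the pipeline): succeeds if a file name
# follows the assumed pattern <sample>_S<n>_L<lane>_R<read>_001.fastq.gz.
is_illumina_name() {
    printf '%s\n' "$(basename "$1")" |
        grep -Eq '^.+_S[0-9]+_L[0-9]{3}_R[12]_001\.fastq\.gz$'
}

# Print every gzipped fastq file in a directory whose name does not match.
report_unexpected_names() {
    for f in "$1"/*.fastq.gz; do
        [ -e "$f" ] || continue   # the directory holds no .fastq.gz files
        is_illumina_name "$f" || printf 'unexpected name: %s\n' "$f"
    done
}
```

For example, `report_unexpected_names data` lists any files in data/ that do not match the assumed pattern and prints nothing when all of them match.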

Use Docker

Start the container

To use Docker to run the pipeline, first choose whether you will use the container image with or without references included.

To use the image with references, run the following commands:

docker pull ghcr.io/cdcgov/varpipe_wgs_with_refs:latest
docker run -it -v <path to data>:/varpipe_wgs/data ghcr.io/cdcgov/varpipe_wgs_with_refs:latest

To use the image without references you will need to change the container name in the command to varpipe_wgs_without_refs:latest, and you will also need to specify a folder with the references to be used by clockwork. This folder must be mounted to the container as /varpipe_wgs/tools/clockwork-0.11.3/OUT.

docker pull ghcr.io/cdcgov/varpipe_wgs_without_refs:latest
docker run -it -v <path to data>:/varpipe_wgs/data -v <path to references>:/varpipe_wgs/tools/clockwork-0.11.3/OUT ghcr.io/cdcgov/varpipe_wgs_without_refs:latest
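The two variants differ only in the image name and the extra reference mount, so the command can be assembled from the paths. The sketch below uses a hypothetical helper, `varpipe_docker_cmd`, that prints the command rather than running it, so you can inspect it first. It assumes the paths contain no spaces. Pass absolute host paths, because Docker interprets a non-absolute `-v` source as a named volume.

```shell
# Hypothetical helper (not part of the pipeline): print the docker run command
# for a data directory and, optionally, a clockwork reference directory.
varpipe_docker_cmd() {
    data_dir=$1
    refs_dir=${2:-}
    registry=ghcr.io/cdcgov
    refs_mount=/varpipe_wgs/tools/clockwork-0.11.3/OUT
    if [ -n "$refs_dir" ]; then
        # References supplied: use the image without bundled references.
        printf 'docker run -it -v %s:/varpipe_wgs/data -v %s:%s %s/varpipe_wgs_without_refs:latest\n' \
            "$data_dir" "$refs_dir" "$refs_mount" "$registry"
    else
        # No references supplied: use the image that bundles them.
        printf 'docker run -it -v %s:/varpipe_wgs/data %s/varpipe_wgs_with_refs:latest\n' \
            "$data_dir" "$registry"
    fi
}
```

For example, `varpipe_docker_cmd "$PWD/data"` prints the command for the image with references, and adding a second argument switches to the image without references.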

Run the pipeline

The commands above download the most recent version of the pipeline, start the container, and connect to it.

When connected to the container you will be in the directory /varpipe_wgs/data. From there, start the pipeline with the following command, where <threads> is the number of threads to use (default: 4):

cd /varpipe_wgs/data
./runVarpipeline.sh <threads>

That will identify all gzipped fastq files in the directory and run the pipeline over them, creating a results folder named "Output_<MM><DD><YYYY>" with subfolders containing the results for each sample.
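To locate the results from a script, the folder name can be reconstructed from the date. This is a minimal sketch that assumes the pattern means a zero-padded month and day followed by a four-digit year with no separators (for example, Output_03072024), and that the run started today according to the same clock and time zone the pipeline used. The helper name `todays_output_dir` is hypothetical.

```shell
# Hypothetical helper (not part of the pipeline): name of the results folder
# for a run started today, assuming the form Output_MMDDYYYY.
todays_output_dir() {
    printf 'Output_%s\n' "$(date +%m%d%Y)"
}
```

For example, `ls "$(todays_output_dir)"` run from the data directory would list the per-sample subfolders if the assumptions above hold.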

When the pipeline is finished you can disconnect from and close the container by pressing CTRL+D. You can then review the results.

Use Singularity

Obtain the Singularity image

The Singularity images are too large to include in this repository, but the version with references is available on the Singularity Container Library. To download it, run the command:

singularity pull library://reagank/varpipe_wgs/pipeline_with_refs

The Singularity images can also be built locally using the provided build_singularity.sh script.

./build_singularity.sh

The script builds the image without references by default. To build the image with references, provide the argument "with_references":

./build_singularity.sh with_references

Start the Singularity image

Once you have downloaded or built the .sif file containing the singularity image, the command to start and connect to the container that includes references is:

singularity shell --bind <path to data>:/varpipe_wgs/data pipeline_with_refs.sif

As with Docker, to run the pipeline without references you will need to supply clockwork-compatible references and bind them to the image as the /varpipe_wgs/tools/clockwork-0.11.3/OUT directory:

singularity shell --bind ./data:/varpipe_wgs/data --bind <path-to-references>:/varpipe_wgs/tools/clockwork-0.11.3/OUT pipeline_without_refs.sif

Please note: if you are running this pipeline using CDC SciComp resources, security settings require you to specify the SINGULARITY_TMPDIR environment variable like this:

SINGULARITY_TMPDIR=~/.tmp singularity shell --bind ./data:/varpipe_wgs/data --bind <path-to-references>:/varpipe_wgs/tools/clockwork-0.11.3/OUT pipeline_without_refs.sif
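The command above sets the variable for a single invocation. As an alternative sketch, the hypothetical helper below creates the directory and exports the variable for the rest of the shell session. It assumes the directory should exist before Singularity uses it, which the source does not state explicitly.

```shell
# Hypothetical helper (not part of the pipeline): create the temporary
# directory if it is missing and export SINGULARITY_TMPDIR for this session.
# Defaults to ~/.tmp, the location used in the command above.
prepare_singularity_tmpdir() {
    tmpdir=${1:-$HOME/.tmp}
    mkdir -p "$tmpdir" && export SINGULARITY_TMPDIR="$tmpdir"
}
```

After calling `prepare_singularity_tmpdir`, later `singularity shell` commands in the same session inherit the variable without the prefix.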

Run the pipeline

This command starts the container and connects to it.

When connected to the container you will be in your home directory. You must cd to the directory /varpipe_wgs/data. From there, start the pipeline with the following command, where <threads> is the number of threads to use (default: 4):

cd /varpipe_wgs/data
./runVarpipeline.sh <threads>

That will identify all gzipped fastq files in the directory and run the pipeline over them, creating a results folder named "Output_<MM><DD><YYYY>" with subfolders containing the results for each sample.

When the pipeline is finished you can disconnect from and close the container by pressing CTRL+D. You can then review the results.

Use Local

Prerequisites

To run the pipeline locally, you will need to have the following programs installed:

  • Python 2.7
  • Python 3
  • Java 1.8
  • Singularity >=3.5

The remaining programs used by the pipeline are included in this repository in the tools/ directory.
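A quick way to see which prerequisites are missing is to look each one up on the PATH. The sketch below uses a hypothetical helper, `missing_tools`. The executable names in the usage example (`python2.7`, `python3`, `java`, `singularity`) are assumptions about how these programs are installed on your system, and the check does not verify version numbers.

```shell
# Hypothetical helper (not part of the pipeline): print the name of each
# given command that cannot be found on the PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Example usage (executable names are assumptions):
#   missing_tools python2.7 python3 java singularity
```

An empty result means every listed command was found. Versions still need to be confirmed separately, for example with `java -version` and `singularity --version`.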

Install the Pipeline

First, clone this repository with the command:

git clone https://github.com/CDCGov/NCHHSTP-DTBE-Varpipe-WGS.git

Then, simply run setup.sh to finish the installation. This script runs several steps:

  • Downloads the clockwork singularity image
  • Downloads GATK
  • Builds a reference fasta and creates BWA indexes

Lastly, update the tools/clockwork-0.11.3/clockwork script so that it points to the clockwork 0.11.3 image.

Run the Pipeline

After the data has been added to the data/ directory, cd into data/ and run the pipeline with the following command, where <threads> is the number of threads to use (default: 4):

cd data/
./runVarpipeline.sh <threads>

That will identify all gzipped fastq files in the directory and run the pipeline over them, creating a results folder named "Output_<MM><DD><YYYY>" with subfolders containing the results for each sample.

Troubleshooting

This pipeline requires a large amount of memory: the clockwork decontamination step needs a minimum of 32GB of RAM. If the log includes a line with the word "Killed" in the clockwork output and trimmomatic reports an error that the fastq.gz files are not found, the most likely cause is insufficient RAM.
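The memory requirement can be checked before a run. This is a minimal sketch for Linux that reads MemTotal from /proc/meminfo and compares it with a threshold. The helper names `has_enough_ram` and `total_ram_kb` are hypothetical. MemTotal is usually somewhat lower than the installed RAM, so a machine sold as 32GB may fall just under the threshold. Treat a failure as a warning rather than proof that the run will be killed.

```shell
# Hypothetical helper (not part of the pipeline): succeed if a MemTotal value
# in kB is at least min_gb GiB (default 32, the minimum stated above).
has_enough_ram() {
    total_kb=$1
    min_gb=${2:-32}
    [ "$total_kb" -ge $((min_gb * 1024 * 1024)) ]
}

# Linux only: read the total memory in kB from /proc/meminfo.
total_ram_kb() {
    awk '/^MemTotal:/ {print $2}' /proc/meminfo
}

# Example usage:
#   has_enough_ram "$(total_ram_kb)" || echo "warning: under 32GB of RAM"
```

Inside a container, the value reported may reflect the host rather than any memory limit applied to the container, so the check is most meaningful on the host itself.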

If you have difficulty downloading the Docker image (specifically, an error message containing "failed to register layer: ApplyLayer exit status 1" and ending with "no space left on device"), you may need to increase the base device size to something larger than 18GB. This problem should only occur when using the devicemapper storage driver and an older version of Docker Engine.

Public Domain Standard Notice

This repository constitutes a work of the United States Government and is not subject to domestic copyright protection under 17 USC § 105. This repository is in the public domain within the United States, and copyright and related rights in the work worldwide are waived through the CC0 1.0 Universal public domain dedication. All contributions to this repository will be released under the CC0 dedication. By submitting a pull request you are agreeing to comply with this waiver of copyright interest.

License Standard Notice

The repository utilizes code licensed under the terms of the Apache Software License and therefore is licensed under ASL v2 or later.

The source code in this repository is free: you can redistribute it and/or modify it under the terms of the Apache Software License version 2, or (at your option) any later version.

The source code in this repository is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the Apache Software License for more details.

You should have received a copy of the Apache Software License along with this program. If not, see http://www.apache.org/licenses/LICENSE-2.0.html

The source code forked from other open source projects will inherit its license.

Privacy Standard Notice

This repository contains only non-sensitive, publicly available data and information. All material and community participation is covered by the Disclaimer and Code of Conduct. For more information about CDC's privacy policy, please visit http://www.cdc.gov/other/privacy.html.

Contributing Standard Notice

Anyone is encouraged to contribute to the repository by forking and submitting a pull request. (If you are new to GitHub, you might start with a basic tutorial.) By contributing to this project, you grant a world-wide, royalty-free, perpetual, irrevocable, non-exclusive, transferable license to all users under the terms of the Apache Software License v2 or later.

All comments, messages, pull requests, and other submissions received through CDC including this GitHub page may be subject to applicable federal law, including but not limited to the Federal Records Act, and may be archived. Learn more at http://www.cdc.gov/other/privacy.html.

Records Management Standard Notice

This repository is not a source of government records, but is a copy to increase collaboration and collaborative potential. All government records will be published through the CDC web site.

Additional Standard Notices

Please refer to CDC's Template Repository for more information about contributing to this repository, public domain notices and disclaimers, and code of conduct.

Disclaimer

The Laboratory Branch (LB) of the Division of Tuberculosis Elimination developed this bioinformatic pipeline for analyzing whole genome sequencing data generated on Illumina platforms. This is not a controlled document. The performance characteristics as generated at Centers for Disease Control and Prevention (CDC) are specific to the version as written. These documents are provided by LB solely as an example for how this test performed within LB. The recipient testing laboratory is responsible for generating validation or verification data as applicable to establish performance characteristics as required by the testing laboratory’s policies, applicable regulations, and quality system standards. These data are only for the sample and specimen types and conditions described in this procedure. Tests or protocols may include hazardous reagents or biological agents. No indemnification for any loss, claim, damage, or liability is provided for the party receiving an assay or protocol. Use of trade names and commercial sources are for identification only and do not constitute endorsement by the Public Health Service, the United States Department of Health and Human Services, or the Centers for Disease Control and Prevention.