CSB5/OPERA-MS

Preprint at bioRxiv

Introduction

OPERA-MS is a hybrid metagenomic assembler that combines the advantages of short- and long-read technologies to provide high-quality assemblies, addressing the low contiguity of short-read-only assemblies and the low base-pair quality of long-read-only assemblies. OPERA-MS has been extensively tested on mock and real communities sequenced using different long-read technologies, including Oxford Nanopore, PacBio and Illumina Synthetic Long Read, and is particularly robust to noise in the read data.

OPERA-MS employs a staged assembly strategy that is designed to exploit even low coverage long read data to improve genome assembly. It begins by constructing a short-read metagenomic assembly (default: MEGAHIT) that provides a good representation of the underlying sequence in the metagenome but may be fragmented. Long and short reads are then mapped to the assembly to identify connectivity between the contigs and to compute read coverage information. This serves as the basis for the core of the OPERA-MS algorithm which is to exploit coverage as well as connectivity information to accurately cluster contigs into genomes using a Bayesian model based approach. Another important advantage of OPERA-MS is that it can deconvolute strains in the metagenome, optionally using information from reference genomes to support this. This is fundamentally challenging for pipelines that begin with assembly of error-prone long reads. After clustering, individual genomes are further scaffolded and gap-filled using the lightweight and robust scaffolder OPERA-LG.

OPERA-MS can assemble near-complete genomes from a metagenomic dataset with as little as 9x long-read coverage. It is designed to be conservative and to avoid the aggressive assembly strategy favored by many modern assemblers that aim to report high contiguity statistics. Applied to human gut microbiome data, OPERA-MS provides hundreds of high-quality draft genomes, a majority of which have N50 >100kbp. We observed the assembly of complete plasmids, many of which were novel and contain previously unseen resistance gene combinations. In addition, OPERA-MS can very accurately assemble genomes even in the presence of multiple strains of a species in a complex metagenome, allowing us to associate plasmids and host genomes using longitudinal data. For further details about these and other results using nanopore sequencing on stool samples from clinical studies, see our manuscript or preprint.

Installation

To install OPERA-MS on a typical Linux/Unix system run the following commands:

git clone https://github.com/CSB5/OPERA-MS.git
cd OPERA-MS
make
perl OPERA-MS.pl check-dependency

If you encounter any problems during the installation, or if some third party software binaries are not functional on your system, please see the Dependencies section.

A set of test files is provided to test out the OPERA-MS pipeline. To run OPERA-MS on the test dataset, simply use the following commands (please note that the test run requires 2 cores which is also the minimum):

cd test_files
perl ../OPERA-MS.pl \
    --contig-file contigs.fasta \
    --short-read1 R1.fastq.gz \
    --short-read2 R2.fastq.gz \
    --long-read long_read.fastq \
    --no-ref-clustering \
    --out-dir RESULTS 2> log.err

This will assemble a low-diversity mock community in the folder RESULTS. Note that if an OPERA-MS run is interrupted, rerunning the same command line resumes execution from the last completed checkpoint.

Reference genome database

Reference-based clustering relies on a genome database. To generate a new database from the latest GTDB release, please refer to the dedicated wiki page; to generate a custom database, refer to the opera-ms-db utility command wiki.

Otherwise, if neither option works for you, you can download our older prepackaged database:

perl OPERA-MS.pl install-db

The database contains representative genomes for 23k bacterial species from GTDB and requires 35 GB of free disk space. Please be aware that this version was generated from a very old GTDB release; GTDB now includes more than 80k bacterial species.

Usage

Essential arguments

  • --short-read1 : path to file containing the first read for Illumina paired-end read data (fasta/fastq/fasta.gz/fastq.gz)

  • --short-read2 : path to file containing the second read for Illumina paired-end read data (fasta/fastq/fasta.gz/fastq.gz)

  • --long-read : path to the long-read file obtained from either Oxford Nanopore, PacBio or Illumina Synthetic Long Read sequencing (fasta/fastq)

  • --out-dir : directory where OPERA-MS results will be written

Optional arguments

  • --genome-db : path to a custom OPERA-MS genome database used during reference-based clustering (default=OPERA-MS-DB)

  • --no-ref-clustering : disable reference-based clustering

  • --no-strain-clustering : disable strain-level clustering

  • --no-polishing : disable short-read polishing (currently using Pilon). The polished contigs can be found in contigs.polished.fasta. For samples with high coverage and/or high complexity this step may take a significant amount of time.

  • --long-read-mapper : software used for long-read mapping, either blasr or minimap2 (default; tested with version 2.11-r797)

  • --short-read-assembler : software used for short-read assembly, either MEGAHIT (default) or SPAdes

  • --no-gap-filling : disable gap-filling stage

  • --kmer-size : kmer value (default=60) used to assemble contigs

  • --contig-len-thr : contig length threshold (default=500) for clustering; contigs smaller than contig-len-thr will be filtered out.

  • --contig-edge-len : during contig coverage calculation, number of bases filtered out from each contig end (default=80), to avoid biases due to lower mapping efficiency

  • --contig-window-len : window length (default=340) in which the coverage estimation is performed. We recommend setting it to contig-len-thr - 2 * contig-edge-len (with the defaults, 500 - 2 * 80 = 340).

  • --contig-file : path to the contig file, if the short-reads have been assembled previously

  • --num-processors : number of processors to use (default=2; note that 2 is the minimum)

Alternatively, OPERA-MS parameters can be set using a configuration file.
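
The recommended relationship between --contig-window-len, --contig-len-thr and --contig-edge-len can be computed before launching a run. The snippet below is a minimal sketch: the helper name contig_window_len is hypothetical and not part of OPERA-MS, and the value of 1000 in the second call is only an illustrative choice.

```shell
# Hypothetical helper (not shipped with OPERA-MS): derive a value for
# --contig-window-len from --contig-len-thr ($1) and --contig-edge-len ($2),
# following the recommendation contig-len-thr - 2 * contig-edge-len.
contig_window_len() {
    echo $(( $1 - 2 * $2 ))
}

# With the documented defaults (500 and 80) this reproduces the default window.
contig_window_len 500 80

# Illustrative only: a stricter length threshold with the default edge length.
window=$(contig_window_len 1000 80)
echo "--contig-len-thr 1000 --contig-edge-len 80 --contig-window-len $window"
```

The printed options can then be appended to an OPERA-MS command line so that the three values stay consistent with each other.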

Output

The following output files can be found in the specified output directory (e.g. RESULTS). The file contigs.polished.fasta (or contigs.fasta if the assembly has not been polished) contains the assembled contigs, assembly.stats provides overall assembly statistics (e.g. assembly size, N50, longest contig), and contig_info.txt provides a detailed overview of the assembled contigs.

Finally, OPERA-MS strain-level clusters (one fasta file per strain) can be found in the directory RESULTS/opera_ms_clusters/all and cluster_info.txt provides a detailed overview of assembly statistics for these clusters. Note that these clusters are constructed for producing high-quality assemblies and are therefore conservative. Contigs can be binned further using approaches such as MaxBin2 or MetaBAT2.
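
As a quick sanity check on the output, the contig count, total length and N50 can be recomputed directly from a FASTA file. This is a minimal sketch using only awk and sort; the function name fasta_n50 is hypothetical, and its N50 convention (the length of the contig at which the running sum, longest first, reaches half the total) may differ slightly from the one used to produce assembly.stats.

```shell
# Hypothetical helper (not part of OPERA-MS): summarise a FASTA file.
# Usage: fasta_n50 RESULTS/contigs.polished.fasta
fasta_n50() {
    # Pass 1: print one length per sequence (multi-line records are summed).
    awk '/^>/ { if (seen) print len; len = 0; seen = 1; next }
         { len += length($0) }
         END { if (seen) print len }' "$1" |
    sort -rn |
    # Pass 2: walk lengths from longest to shortest until half of the
    # total length is covered; that contig length is the N50.
    awk '{ l[NR] = $1; total += $1 }
         END {
             run = 0; n50 = 0
             for (i = 1; i <= NR; i++) {
                 run += l[i]
                 if (run * 2 >= total) { n50 = l[i]; break }
             }
             printf "sequences=%d total_bp=%d N50=%d\n", NR, total, n50
         }'
}
```

The same function can be applied to each per-strain FASTA file in RESULTS/opera_ms_clusters/all to compare individual clusters.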

OPERA-MS-UTILS

Scripts to post-process the assemblies are now available using the OPERA-MS-UTILS command. We now provide streamlined analysis tools to compute the concordance in represented taxa between short- and long-read sequencing, to bin contigs, to assess bin quality and to identify genomes for novel species from the metagenomic assembly. A complete description of these tools can be found in the OPERA-MS-UTILS wiki section.

This is work-in-progress and additional tools will be available as part of the next release. Please contact us if you would like to add your favorite metagenomic analysis tool.

Resource Requirements

OPERA-MS's runtime depends on the complexity of the metagenome and the amount of short/long-read data available. We typically run OPERA-MS with default parameters using 16 threads on an Intel Xeon Platinum server with an SSD. With this hardware specification, we obtain the following running time and memory usage characteristics.

| Dataset | Short-read data (Gbp) | Long-read data (Gbp) | Running time (hours) | Peak RAM usage (GB) |
|---|---|---|---|---|
| CAMI2 multi-strain mock community (low complexity) | 3.9 | 2 | 1.4 | 5.5 |
| Human gut microbiome (medium complexity) | 24.4 | 1.6 | 2.7 | 10.2 |
| CAMI2 environmental mock community (high complexity) | 9.9 | 4.8 | 4.5 | 12.8 |

Important note: Peak RAM usage now depends mostly on the size of the database used by OPERA-MS. Custom databases, including those built from updated GTDB releases, will very likely lead to very high RAM usage.

OPERA-MS is designed to work with deep short-read sequencing, but can work with lower coverage in terms of long-read sequencing. In practice, short-read coverage >15x is recommended, while OPERA-MS can use long-read coverage as low as 9x to boost assembly contiguity. Based on this, we recommend at least 9Gbp of short-read data and 3Gbp of long-read data to allow for assembly of bacterial genomes at 1% relative abundance in the metagenome.
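
To relate these data volumes to per-genome coverage, a rough estimate is (total bases x relative abundance) / genome size. The sketch below assumes that sequenced bases are distributed in proportion to relative abundance and uses a hypothetical 3 Mbp genome; both are illustrative assumptions rather than values stated above, and the helper name expected_coverage is not part of OPERA-MS.

```shell
# Hypothetical helper: expected_coverage DATA_GBP ABUNDANCE_PCT GENOME_MBP
# Prints the approximate fold coverage of one genome in the metagenome.
expected_coverage() {
    awk -v gbp="$1" -v pct="$2" -v mbp="$3" \
        'BEGIN { printf "%.1f\n", (gbp * 1000 * pct / 100) / mbp }'
}

# Assumed 3 Mbp genome at 1% relative abundance:
expected_coverage 9 1 3   # short reads, 9 Gbp of data -> 30.0
expected_coverage 3 1 3   # long reads,  3 Gbp of data -> 10.0
```

Under these assumptions the recommended data volumes sit above the >15x short-read and 9x long-read coverage levels; larger genomes or lower abundances need proportionally more data.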

Dependencies

The only true dependency is cpanm, which is used to automatically install Perl modules. All other required programs come either pre-compiled with OPERA-MS or are built during the installation process. Binaries are placed inside the tools_opera_ms folder:

  1. MEGAHIT - (tested with version 1.0.4-beta)
  2. SPAdes - (tested with version 3.13.0)
  3. samtools - (version 0.1.19 or below)
  4. bwa - (tested with version 0.7.10-r789)
  5. blasr - (version 5.1 and above which uses '-' options)
  6. minimap2 (tested with version 2.11-r797)
  7. Racon - (version 0.5.0)
  8. Mash - (tested with version 2.2)
  9. MUMmer - (tested with version 3.23)
  10. Pilon - (tested with version 1.22)

If a pre-built program does not work on the user's machine, OPERA-MS will check if the program is present in the user's PATH. However, the version of the program may be different than the one packaged. Alternatively, to specify a different directory for the dependency, a link to the program may be placed in the tools_opera_ms folder.
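
To see which programs the PATH fallback could pick up, a short loop over executable names is enough. This is only a sketch: the supported check remains perl OPERA-MS.pl check-dependency, the function name check_tools is hypothetical, and the executable names in the usage comment are assumptions about how each tool is typically installed.

```shell
# Hypothetical helper: report which executables are visible on PATH.
# Returns non-zero if at least one of them is missing.
check_tools() {
    missing=0
    for tool in "$@"; do
        if path=$(command -v "$tool" 2>/dev/null); then
            echo "found    $tool -> $path"
        else
            echo "missing  $tool"
            missing=$(($missing + 1))
        fi
    done
    [ "$missing" -eq 0 ]
}

# Example call (executable names are assumed, adjust to your installation):
# check_tools megahit spades.py samtools bwa blasr minimap2 racon mash nucmer pilon
```

A tool reported as found here may still be a different version from the one packaged with OPERA-MS, so the version notes in the list above still apply.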

OPERA-MS and its dependencies require:

Once cpanm is installed, simply run the following command to install all the perl modules:

perl tools_opera_ms/install_perl_module.pl

If the perl libraries cannot be installed under root, the following line should be added to .bashrc:

export PERL5LIB="/home/$USER/perl5/lib/perl5${PERL5LIB:+:${PERL5LIB}}";

Docker

A simple Dockerfile is provided in the root of the repository. To build the image:

docker build -t operams .

The generic command to run an OPERA-MS Docker container after building:

docker run \
    -v /host/path/to/indata/:/indata/ \
    -v /host/path/to/outdata/:/outdata/ \
    -v /host/path/to/OPERA-MS/OPERA-MS-DB/:/operams/OPERA-MS-DB/ \
    operams \
    --short-read1 /indata/R1.fastq.gz \
    --short-read2 /indata/R2.fastq.gz \
    --long-read /indata/long_read.fastq \
    --out-dir /outdata

To process data with the dockerized OPERA-MS, directories for input and output data should be mounted into the container. An example is shown below for running the test dataset. In this example, the repository was cloned to /home/myuser/git/OPERA-MS and the commands are run from that directory. The repository is needed only for the test files. OPERA-MS-DB should be downloaded using the install-db command. If Docker runs inside a VM, as it does on Windows or macOS and on cloud platforms such as AWS or Azure, the VM must have at least 2 cores available.

# Build the Docker image (run from the root of the cloned repository)
docker build -t operams .

# Download OPERA-MS-DB
mkdir OPERA-MS-DB/
docker run -v "$(pwd)/OPERA-MS-DB/":/operams/OPERA-MS-DB/ operams install-db

# Run the assembly
# Docker bind mounts need absolute host paths, hence the $(pwd) prefix
docker run \
    -v "$(pwd)/test_files/":/sample_files \
    -v "$(pwd)/test_files/RESULTS/":/sample_out \
    -v "$(pwd)/OPERA-MS-DB/":/operams/OPERA-MS-DB/ \
    operams --contig-file /sample_files/contigs.fasta \
    --short-read1 /sample_files/R1.fastq.gz \
    --short-read2 /sample_files/R2.fastq.gz \
    --long-read /sample_files/long_read.fastq \
    --out-dir /sample_out

Contact information

For additional information, help and bug reports please send an email to: