
A pipeline for taxonomic classification and functional annotation of metagenomic reads. Based on MEDUSA

dalmolingroup/euryale

Introduction

dalmolingroup/euryale is a pipeline for taxonomic classification and functional annotation of metagenomic reads, based on MEDUSA.

The pipeline is built using Nextflow, a workflow tool to run tasks across multiple compute infrastructures in a very portable manner. It uses Docker/Singularity containers making installation trivial and results highly reproducible. The Nextflow DSL2 implementation of this pipeline uses one container per process which makes it much easier to maintain and update software dependencies. Where possible, these processes have been submitted to and installed from nf-core/modules in order to make them available to all nf-core pipelines, and to everyone within the Nextflow community!

Pipeline summary

EURYALE diagram

Pre-processing

Assembly

  • Read assembly (MEGAHIT), optional

Taxonomic classification

  • Sequence classification (Kaiju)
  • Sequence classification (Kraken2)
  • Visualization (Krona)

Functional annotation

  • Sequence alignment (DIAMOND)
  • Map alignment matches to functional database (annotate)

Quick Start

  1. Install Nextflow (>=22.10.1)

  2. Install any of Docker, Singularity (you can follow this tutorial), Podman, Shifter or Charliecloud for full pipeline reproducibility. You can use Conda both to install Nextflow itself and to manage software within pipelines, but please use it within pipelines only as a last resort; see docs.

  3. Download the pipeline and test it on a minimal dataset with a single command:

nextflow run dalmolingroup/euryale -profile test,YOURPROFILE --outdir <OUTDIR>

Note that some form of configuration will be needed so that Nextflow knows how to fetch the required software. This is usually done in the form of a config profile (YOURPROFILE in the example command above). You can chain multiple config profiles in a comma-separated string.

  • The pipeline comes with config profiles called docker, singularity, podman, shifter, charliecloud and conda which instruct the pipeline to use the named tool for software management. For example, -profile test,docker.
  • Please check nf-core/configs to see if a custom config file to run nf-core pipelines already exists for your Institute. If so, you can simply use -profile <institute> in your command. This will enable either docker or singularity and set the appropriate execution settings for your local compute environment.
  • If you are using singularity, please use the nf-core download command to download images first, before running the pipeline. Setting the NXF_SINGULARITY_CACHEDIR or singularity.cacheDir Nextflow options enables you to store and re-use the images from a central location for future pipeline runs.
  • If you are using conda, it is highly recommended to use the NXF_CONDA_CACHEDIR or conda.cacheDir settings to store the environments in a central location for future pipeline runs.
  4. Start running your own analysis!

nextflow run dalmolingroup/euryale \
  --input samplesheet.csv \
  --outdir <OUTDIR> \
  --kaiju_db kaiju_reference \
  --reference_fasta diamond_fasta \
  --host_fasta host_reference_fasta \
  --id_mapping id_mapping_file \
  -profile <docker/singularity/podman/shifter/charliecloud/conda/institute>
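The cache settings mentioned in the Quick Start notes (NXF_SINGULARITY_CACHEDIR and NXF_CONDA_CACHEDIR) can be exported once per shell session before launching the pipeline. The snippet below is a minimal sketch, not part of EURYALE: the function name `setup_nxf_caches` and the default cache location are assumptions made for illustration.

```shell
# Sketch: point Nextflow at shared cache directories so container images and
# Conda environments are reused between runs.
# Assumption: the default location below is a placeholder, not an EURYALE default.
setup_nxf_caches() {
  cache_root="${1:-$HOME/.nextflow_cache}"
  # Create both cache directories up front so a bad path fails early.
  mkdir -p "$cache_root/singularity" "$cache_root/conda" || return 1
  export NXF_SINGULARITY_CACHEDIR="$cache_root/singularity"
  export NXF_CONDA_CACHEDIR="$cache_root/conda"
}

# Usage (run before `nextflow run ...`):
#   setup_nxf_caches /shared/nextflow_cache
```

Only the variable matching the profile you actually use has any effect; exporting both is harmless.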

Databases and references

A common question is: since EURYALE requires many reference parameters, where can I find these references?

One option is to execute EURYALE's download entry, which will download the necessary databases for you. This is the recommended way to get started with the pipeline. This uses the same sources as EURYALE's predecessor MEDUSA.

nextflow run dalmolingroup/euryale \
  --download_functional \
  --download_kaiju \
  --download_host \
  --outdir <output directory> \
  -entry download \
  -profile <docker/singularity/podman/shifter/charliecloud/conda/institute>

See the full documentation for the complete list of EURYALE's download parameters. If you download the Kraken2 database (--download_kraken), extract it with the following command before using it in the pipeline:

tar -xvf kraken2_db.tar.gz
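If you prefer to keep the extracted database in its own directory, the extraction step can be wrapped in a small helper. This is a sketch under assumptions: the function name `extract_db` and the directory names are hypothetical and not provided by EURYALE.

```shell
# Sketch: unpack a database archive into a dedicated directory and fail
# loudly if the archive is missing or nothing ends up in the destination.
# Assumption: archive and destination names are placeholders.
extract_db() {
  archive="$1"
  dest="$2"
  [ -f "$archive" ] || { echo "archive not found: $archive" >&2; return 1; }
  mkdir -p "$dest" || return 1
  # -C extracts into the destination instead of the current directory.
  tar -xf "$archive" -C "$dest" || return 1
  # Guard against an archive that unpacked to nothing.
  [ -n "$(ls -A "$dest")" ] || { echo "nothing extracted into $dest" >&2; return 1; }
}

# Usage:
#   extract_db kraken2_db.tar.gz kraken2_db
```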

Below is a short list of places where you can find these databases. You are not limited to these references: EURYALE should be able to process your own databases, should you want to build them yourself.

Alignment

For the alignment, you can provide either --diamond_db for a pre-built DIAMOND database or --reference_fasta for a reference FASTA. For the reference FASTA, EURYALE expects something like NCBI-nr by default, but similarly formatted reference databases should also suffice.
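The choice between the two alignment parameters can be scripted. The sketch below assumes that a pre-built DIAMOND database carries the usual `.dmnd` extension and treats anything else as a reference FASTA; the helper name `alignment_flag` is hypothetical and not part of the pipeline.

```shell
# Sketch: pick the alignment parameter from the file name.
# Assumption: pre-built DIAMOND databases end in .dmnd; any other file is
# passed as a reference FASTA.
alignment_flag() {
  case "$1" in
    *.dmnd) printf '%s %s\n' "--diamond_db" "$1" ;;
    *)      printf '%s %s\n' "--reference_fasta" "$1" ;;
  esac
}

# Usage (paths without spaces, since the result is word-split by the shell):
#   nextflow run dalmolingroup/euryale $(alignment_flag nr.dmnd) ...
```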

Taxonomic classification

In its current version, EURYALE does not build a reference taxonomic database, but pre-built ones are supported.

  • If you're using Kaiju (the default), provide a reference database to --kaiju_db as a .tar.gz file like the ones available on the official Kaiju website. We have extensively tested EURYALE with the 2021 version of the nr database and it should work as expected.
  • If you're using Kraken2 (by supplying --run_kraken2), provide something like the pre-built .tar.gz databases distributed by the Kraken2 developers to --kraken2_db.

Functional annotation

We expect an ID mapping reference to be used within annotate. Since NCBI-nr is the expected default alignment reference, the ID mapping data file provided by UniProt should work well when provided to --id_mapping.

Host reference

If you're using metagenomic reads that come from a known host's microbiome, you can also provide the host's genome FASTA to the --host_fasta parameter to enable our decontamination subworkflow. Ensembl provides easy-to-download genomes that can be used for this purpose. Alternatively, you can provide a pre-built Bowtie2 database directory to the --bowtie2_db parameter.
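Because the host reference is either a single FASTA file or a directory holding a pre-built index, the matching parameter can be picked by checking the path type. This is a minimal sketch; the helper name `host_flag` is an assumption introduced here, not something the pipeline ships.

```shell
# Sketch: choose the host decontamination parameter from the path type.
# A directory is treated as a pre-built Bowtie2 database; a regular file is
# treated as the host genome FASTA.
host_flag() {
  if [ -d "$1" ]; then
    printf '%s %s\n' "--bowtie2_db" "$1"
  elif [ -f "$1" ]; then
    printf '%s %s\n' "--host_fasta" "$1"
  else
    echo "host reference not found: $1" >&2
    return 1
  fi
}

# Usage (paths without spaces, since the result is word-split by the shell):
#   nextflow run dalmolingroup/euryale $(host_flag host_reference_fasta) ...
```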

Documentation

The dalmolingroup/euryale documentation is split into the following pages:

  • Usage

    - An overview of how the pipeline works, how to run it and a description of all of the different command-line flags.
    
  • Output

    - An overview of the different results produced by the pipeline and how to interpret them.
    

Credits

dalmolingroup/euryale was originally written by João Cavalcante.

We thank the following people for their extensive assistance in the development of this pipeline:

  • Diego Morais (for developing the original MEDUSA pipeline)

Citations

J. V. F. Cavalcante, I. Dantas de Souza, D. A. A. Morais and R. J. S. Dalmolin, "EURYALE: A versatile Nextflow pipeline for taxonomic classification and functional annotation of metagenomics data," 2024 IEEE Conference on Computational Intelligence in Bioinformatics and Computational Biology (CIBCB), Natal, Brazil, 2024, pp. 1-7, doi: 10.1109/CIBCB58642.2024.10702116.

This pipeline uses code and infrastructure developed and maintained by the nf-core community, reused here under the MIT license.

The nf-core framework for community-curated bioinformatics pipelines.

Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen.

Nat Biotechnol. 2020 Feb 13. doi: 10.1038/s41587-020-0439-x.