dalmolingroup/euryale is a pipeline for taxonomic classification and functional annotation of metagenomic reads, based on MEDUSA.
The pipeline is built using Nextflow, a workflow tool to run tasks across multiple compute infrastructures in a very portable manner. It uses Docker/Singularity containers making installation trivial and results highly reproducible. The Nextflow DSL2 implementation of this pipeline uses one container per process which makes it much easier to maintain and update software dependencies. Where possible, these processes have been submitted to and installed from nf-core/modules in order to make them available to all nf-core pipelines, and to everyone within the Nextflow community!
- Read QC (FastQC)
- Read trimming and merging (fastp)
- (optionally) Host read removal (Bowtie2)
- Duplicate sequence removal (fastx_collapser)
- Present QC and other data (MultiQC)
- (optionally) Read assembly (MEGAHIT)
- Install Nextflow (>=22.10.1)
- Install any of Docker, Singularity (you can follow this tutorial), Podman, Shifter or Charliecloud for full pipeline reproducibility. You can use Conda both to install Nextflow itself and to manage software within pipelines, but please only use it within pipelines as a last resort (see docs).
- Download the pipeline and test it on a minimal dataset with a single command:

```bash
nextflow run dalmolingroup/euryale -profile test,YOURPROFILE --outdir <OUTDIR>
```
Note that some form of configuration will be needed so that Nextflow knows how to fetch the required software. This is usually done in the form of a config profile (`YOURPROFILE` in the example command above). You can chain multiple config profiles in a comma-separated string.

- The pipeline comes with config profiles called `docker`, `singularity`, `podman`, `shifter`, `charliecloud` and `conda` which instruct the pipeline to use the named tool for software management. For example, `-profile test,docker`.
- Please check nf-core/configs to see if a custom config file to run nf-core pipelines already exists for your institute. If so, you can simply use `-profile <institute>` in your command. This will enable either `docker` or `singularity` and set the appropriate execution settings for your local compute environment.
- If you are using `singularity`, please use the `nf-core download` command to download images first, before running the pipeline. Setting the `NXF_SINGULARITY_CACHEDIR` or `singularity.cacheDir` Nextflow options enables you to store and re-use the images from a central location for future pipeline runs.
- If you are using `conda`, it is highly recommended to use the `NXF_CONDA_CACHEDIR` or `conda.cacheDir` settings to store the environments in a central location for future pipeline runs.
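If you opt for a central cache, it can be configured once in your shell profile; a minimal sketch, where the paths below are placeholders you would adapt to your system:

```shell
# Central cache for Singularity images re-used across pipeline runs
export NXF_SINGULARITY_CACHEDIR="$HOME/.nextflow/singularity-cache"

# Central cache for Conda environments re-used across pipeline runs
export NXF_CONDA_CACHEDIR="$HOME/.nextflow/conda-cache"

# Create the cache directories if they don't exist yet
mkdir -p "$NXF_SINGULARITY_CACHEDIR" "$NXF_CONDA_CACHEDIR"
```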
- Start running your own analysis!

```bash
nextflow run dalmolingroup/euryale \
   --input samplesheet.csv \
   --outdir <OUTDIR> \
   --kaiju_db kaiju_reference \
   --reference_fasta diamond_fasta \
   --host_fasta host_reference_fasta \
   --id_mapping id_mapping_file \
   -profile <docker/singularity/podman/shifter/charliecloud/conda/institute>
```
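The `--input` samplesheet follows the usual nf-core comma-separated layout; a minimal sketch, assuming paired-end reads and hypothetical file paths (check the usage documentation for the exact columns EURYALE expects):

```csv
sample,fastq_1,fastq_2
sample1,/data/sample1_R1.fastq.gz,/data/sample1_R2.fastq.gz
sample2,/data/sample2_R1.fastq.gz,/data/sample2_R2.fastq.gz
```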
A question that pops up a lot is: since EURYALE requires a lot of reference parameters, where can I find these references?
One option is to execute EURYALE's download entry, which will download the necessary databases for you. This is the recommended way to get started with the pipeline. This uses the same sources as EURYALE's predecessor MEDUSA.
```bash
nextflow run dalmolingroup/euryale \
   --download_functional \
   --download_kaiju \
   --download_host \
   --outdir <output directory> \
   -entry download \
   -profile <docker/singularity/podman/shifter/charliecloud/conda/institute>
```
Check out the documentation for a full list of EURYALE's download parameters.
In case you download the Kraken2 database (`--download_kraken`), make sure to extract it using the following command before using it in the pipeline:

```bash
tar -xvf kraken2_db.tar.gz
```
Below we provide a short list of places where you can find these databases. But, of course, we're not limited to these references: Euryale should be able to process your own databases, should you want to build them yourself.
For the alignment, you can either provide `--diamond_db` for a pre-built DIAMOND database, or provide `--reference_fasta`. For the reference FASTA, by default EURYALE expects something like NCBI-nr, but similarly formatted reference databases should also suffice.
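If you'd rather build the DIAMOND database up front instead of passing `--reference_fasta`, DIAMOND's `makedb` command can do it; a sketch, assuming DIAMOND is installed and `nr.fasta.gz` is a placeholder for your downloaded reference:

```bash
# Build a protein database from an NCBI-nr style FASTA
# (DIAMOND reads gzipped input directly)
diamond makedb --in nr.fasta.gz --db nr

# Produces nr.dmnd, which can then be passed to --diamond_db
```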
At its current version, Euryale doesn't build a reference taxonomic database, but pre-built ones are supported.
- If you're using Kaiju (the default), you can provide a reference database with `--kaiju_db`: a .tar.gz file like the ones provided on the official Kaiju website. We have extensively tested Euryale with the 2021 version of the nr database, and it should work as expected.
- If you're using Kraken2 (by supplying `--run_kraken2`), we expect something like the pre-built .tar.gz databases provided by the Kraken2 developers to be passed to `--kraken2_db`.
We expect an ID mapping reference to be used within the annotate subworkflow. Since we're already expecting by default the NCBI-nr to be used as the alignment reference, the ID mapping data file provided by UniProt should work well when provided to `--id_mapping`.
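For reference, UniProt publishes its ID mapping data on its FTP site; a sketch of fetching it, assuming the current-release layout (the file is very large, and you should check EURYALE's documentation for which of the idmapping files is expected):

```bash
# Download UniProt's ID mapping data file (large download)
wget https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/idmapping/idmapping.dat.gz

# The downloaded file can then be passed to --id_mapping
```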
If you're using metagenomic reads that come from a known host's microbiome, you can also provide the host's genome FASTA to the `--host_fasta` parameter in order to enable our decontamination subworkflow.
Ensembl provides easy-to-download genomes that can be used for this purpose.
Alternatively, you can provide a pre-built Bowtie2 database directory to the `--bowtie2_db` parameter.
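If you'd rather pre-build the decontamination index yourself, Bowtie2 ships the `bowtie2-build` indexer; a sketch, assuming Bowtie2 is installed and `host_genome.fa` is a placeholder for the downloaded host FASTA:

```bash
# Build a Bowtie2 index from the host genome into its own directory
mkdir -p host_index
bowtie2-build host_genome.fa host_index/host

# The host_index directory can then be passed to --bowtie2_db
```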
The dalmolingroup/euryale documentation is split into the following pages:
- Usage: an overview of how the pipeline works, how to run it and a description of all of the different command-line flags.
- Output: an overview of the different results produced by the pipeline and how to interpret them.
dalmolingroup/euryale was originally written by João Cavalcante.
We thank the following people for their extensive assistance in the development of this pipeline:
- Diego Morais (for developing the original MEDUSA pipeline)
J. V. F. Cavalcante, I. Dantas de Souza, D. A. A. Morais and R. J. S. Dalmolin, "EURYALE: A versatile Nextflow pipeline for taxonomic classification and functional annotation of metagenomics data," 2024 IEEE Conference on Computational Intelligence in Bioinformatics and Computational Biology (CIBCB), Natal, Brazil, 2024, pp. 1-7, doi: 10.1109/CIBCB58642.2024.10702116.
This pipeline uses code and infrastructure developed and maintained by the nf-core community, reused here under the MIT license.
The nf-core framework for community-curated bioinformatics pipelines.
Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen.
Nat Biotechnol. 2020 Feb 13. doi: 10.1038/s41587-020-0439-x.