
Floria pipelines

This repository contains 1) all workflows used to run and compare phasing solutions, including regular assembly software, split phasing, and real-read approaches, 2) a production workflow (Floria-PL) to directly run and process your metagenomic reads, and 3) all datasets used in the initial Floria paper.

Pipelines were initially built to assess our metagenomic strain phasing algorithm Floria.

Conda environments and software paths

System

Workflows have been written using Snakemake and Conda under a Linux environment:

  • Ubuntu (18.04.1 LTS)
  • Snakemake (7.18.1 and 7.32.4)

Conda

Your main conda environment should have snakemake, biopython, and pigz installed as well. On Ubuntu:

mamba install -c conda-forge -c bioconda biopython snakemake
sudo apt-get install pigz

Most of the other software used in the pipelines will be automatically downloaded and installed by snakemake with conda during the first launch, following the recipes found in workflow/envs. If you want to use different conda environments, you can replace the associated environment names in the condaenvs.json file.

Software

Some software is not available on conda and needs to be installed locally: strainberry, strainxpress, floria, and opera-ms. You can install just the ones you need, and add the executable paths in softpaths.json.

For Strainberry, I recommend installing my small fork of the software here, which is more suitable for a snakemake pipeline (see details in this commit).

If you plan to use the kraken classification approach to identify reference sequences, you'll need to specify both the kraken database (krakendb) and the gzipped kraken database library fasta file (krakendb_lib). If krakendb_lib is not specified, the pipeline assumes the library file is located at path/to/krakendb/library/library.fna.gz. For gut microbiomes, you can use the UHGG database. Don't forget to compress the fna file within the library directory.
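
As a minimal sketch, and assuming both entries sit next to the executable paths in softpaths.json (check the example files provided in the repository for the exact layout), the kraken database configuration could look like this:

{
    "krakendb": "path/to/krakendb",
    "krakendb_lib": "path/to/krakendb/library/library.fna.gz"
}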

Production pipeline

If you want to process your metagenomic samples, please use multiple_species_production.snk.

Here is an example of the configuration file:

{
    "outputs": [
        {
            "assembly": {
                "name": "kraken_ref",
                "mode": "nanopore"
            },
            "group": "all",
            "phasing": {
                "assembler_preset": "none",
                "assembler_rtype": "long_reads",
                "fmode": "none",
                "name": "floria",
                "post_assembler": "wtdbg2",
                "readtype": "nanopore",
                "assembler_preset": "nano"
            },
            "vcalling": "longshot"
        }
    ],
    "samples": {
        "Sample_name_1": {
            "nanopore": "path/to/your/sample/reads.fastq.gz",
        }
    }
}

With this configuration file, snakemake will (1) identify the reference genomes using the kraken mapping approach (kraken_ref) on all samples (all), (2) call SNPs using longshot, (3) run floria on the long reads, and (4) assemble the haplotypes using wtdbg2.

Please check the related section in the configuration files guide for more information and other example configuration files (including one for production with Illumina reads).
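
For example, a production run could be launched like this (a sketch assuming a configuration file named config/production.json and an output directory named production_out; adjust the paths and core count to your setup):

snakemake -s multiple_species_production.snk --configfile config/production.json -d production_out --use-conda --cores 24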

Simulation and assessment pipelines

This repository contains multiple main pipelines (or workflows), each of them being a snakemake file (.snk) in the root folder, composed of multiple subworkflows that you can find in the workflow folder.

Single species synthetic

single_species_synthetic.snk

Produce synthetic reads from multiple strains of the same species and phase them. For mapping-based approaches, reads are mapped against one user-designated reference genome. This is the simplest way of testing phasing algorithms. Note that you can have multiple species; they will be processed independently within the same run.

Single species subsampling

single_species_subsampling.snk

Subsample real reads from multiple strains of the same species and phase them. Apart from this initial subsampling step, the pipeline is similar to the single species synthetic one.

Multiple species / Metagenome synthetic

multiple_species_synthetic.snk

Produce synthetic reads from multiple strains and species to simulate a metagenome, and phase it without prior knowledge of the species present. For reference-based approaches such as Floria or Strainberry, kraken-based or merged-assembly references are available (see below).

Note on the reference-based approaches for multiple species (including production) pipelines

For multiple species, two approaches are available to generate reference sequence(s) that will be used by Floria or Strainberry.

Option A: All reads are assembled into one single assembly that is used as the reference. This works with Floria but leads to very poor results with Strainberry, which assumes a single species.

Option B: Reference species are defined using a Kraken-based approach: reads are classified against a Kraken database and only species with an estimated coverage higher than nX (default = 5X) are kept. Reads are then assigned to a single species based on a similarity threshold, and each species is processed individually with Floria or Strainberry. More details are available in our paper.
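
As an illustration of this coverage filter, here is a minimal sketch of the selection step, assuming the cumulative assigned read length and the Kraken DB reference size have already been extracted for each taxid (the data frame and its values are placeholders, not the pipeline's actual intermediate files):

import pandas as pd

# Placeholder inputs: cumulative length (bp) of the reads assigned to each
# taxid and the size (bp) of the corresponding Kraken DB reference.
species = pd.DataFrame({
    'taxid': ['taxid_A', 'taxid_B', 'taxid_C'],
    'reads_cum_length': [250_000_000, 40_000_000, 8_000_000],
    'ref_length': [5_000_000, 6_000_000, 4_500_000],
})

MIN_COVERAGE = 5  # default threshold (5X)

# Estimated coverage = cumulative assigned read length / reference size
species['est_cov'] = species['reads_cum_length'] / species['ref_length']

# Only species with an estimated coverage higher than the threshold are kept
selected = species[species['est_cov'] > MIN_COVERAGE]
print(selected)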

Configuration file

A configuration file is required to run the pipeline, and examples of configuration files can be found within the config folder. A description of the configuration file format is provided in a dedicated markdown file.

Running the pipeline

The pipeline can be run like this (dry-run mode):

snakemake -s {pipeline_name}.snk --configfile {config.json} -d {outdir} --use-conda --cores {cores} -n

I recommend some other options (a full command combining them is shown after this list):

  • --conda-prefix {absolute_path} to root your conda environment within a specific folder.
  • --retries 2 (--restart-times 2 on older snakemake versions) to be sure that random errors (for example memory issues) do not make your whole pipeline crash
  • --resources ncbi_load=1 to ensure that you limit requests to the NCBI server (important)
  • --rerun-incomplete to rerun rules that might have failed before
  • --rerun-triggers mtime to avoid unnecessarily rerunning some samples
  • --scheduler greedy to skip the usual DAG scheduler, which can make your pipeline execute faster
  • --keep-going so that the pipeline does not stop at the first error
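
Putting these together, a full command could look like this (a sketch; adapt the pipeline name, configuration file, output directory and core count to your setup):

snakemake -s multiple_species_synthetic.snk --configfile config.json -d outdir --use-conda --cores 50 --conda-prefix /absolute/path/to/conda_envs --retries 2 --resources ncbi_load=1 --rerun-incomplete --rerun-triggers mtime --scheduler greedy --keep-going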

Some options might be useful in specific conditions:

  • -n to dry-run
  • --notemp if you want to keep all temporary files (including large bam files)
  • -p to show used command lines

CPU usage

If benchmarking is not a concern for you, I also recommend specifying slightly more cores than your configuration offers, for example 50 CPUs instead of 46. The advantages are multiple: some of the software only partially uses its allocated CPUs, and some rules require very little CPU but can still take some time to complete, so this avoids unnecessary bottlenecks. Note that increasing this number too much can also lead to weird and hard-to-track errors (WTDBG2 for example), so please be cautious.

Outputs

Production pipelines

The main outputs are in the results folder, including, for Floria, a ploidy file (tsv.gz) and an assembly file containing the concatenation of all haplosets (fa.gz).

Ploidy files contain all information related to each phased contig: the first columns are copied from the Floria output (description here), and the extra columns are defined below:

Field Description
contig_size Contig size
taxid Taxonomic ID from the Kraken database
kraken_abundance Kraken abundance for this taxid
kraken_#covered Number of reads covering the taxid
kraken_#assigned Number of reads assigned to this taxid
kraken_rank Kraken taxid rank
kraken_name Kraken taxid name
kraken_assigned Should be similar to kraken_#assigned, for debug
kraken_readlength The cumulative reads length of assigned reads
kraken_assigned_abu Should be similar to kraken_abundance, for debug
kraken_reflen Kraken DB reference size
kraken_cov Coverage estimation based on reads cumulative length and ref size
kraken_reflen_mb Kraken ref size in Mb
bdepth_npos Number of positions from the extracted reference used for mapping, for debug
bdepth_sum_depth Total cumulative depth of all positions after mapping
bdepth_mean_cov Mean coverage of the reference
af_nba Number of SNPs after variant calling
af_con Genome wide average of allele frequencies with confident alleles = AC / (DP - AM)
af_amb Genome wide average of allele frequencies with ambiguous alleles = AC / DP

An estimation of the strain count at the species level can be obtained from this file:

import numpy as np
import pandas as pd

fname = 'path/to/your/file.tsv.gz'
df = pd.read_csv(fname, sep='\t')

# Minimum 15HAPQ used here!
use_col = 'average_straincount_min15hapq'

# Defining a minimum strain count of 1
df[use_col + '_oneplus'] = np.where(df[use_col] > 1, df[use_col], 1)

# Strain count normalized by contig size
df['alpw'] = df[use_col + '_oneplus'] * df['contig_size']
columns = ['alpw', 'contig_size', 'bdepth_npos', 'bdepth_sum_depth']
mdf = df.groupby('taxid')[columns].sum().reset_index()

# Adding coverage information
mdf['mean_cov'] = mdf['bdepth_sum_depth'] / mdf['bdepth_npos']
mdf['alp'] = mdf['alpw'] / mdf['contig_size']

# Adding Kraken information
kabu = df[['taxid', 'kraken_abundance']].drop_duplicates()
mdf = mdf.merge(kabu, on='taxid', how='left')

# Rounded value = Approximation of the number of strains you have here
mdf['alp_r'] = mdf['alp'].round()

mdf.head()

Please check the paper's methodology if you want more details regarding this part.

Assessment pipelines

Results for each main phase of the pipeline are saved into different folders. The main outputs are:

  • assembly contains the links to all prior assemblies (such as flye or megahit)
  • phasing contains all the phasing results, including the concatenated haplotypes
  • stats contains haplotype statistics (for assessment pipelines) and circos files
  • refcomp/*.fastani.txt & refcomp/*.mummer.txt for the reference comparisons when enabled
  • benchmarks contains the individual benchmark of the most CPU intensive rules

Notes on benchmarking

While most processes use their CPUs quite efficiently, some software underuses the available CPUs (e.g. strainberry during the longshot step, which uses only one CPU). This results in a high CPU wall time but a low CPU load, both of which are reported by snakemake (cpu_time and mean_load, respectively). While it is tempting to only report the CPU load, we find this metric unfair to CPU-efficient software. For Floria's paper, we decided to use the snakemake CPU wall time normalized by the number of CPUs used for the specific rule. While this cannot be done directly by snakemake and requires the user to keep track of the number of CPUs used for each rule, this value provides a more accurate representation of each rule's CPU efficiency.
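
As a minimal sketch of this normalization, assuming one of the tab-separated benchmark files produced by snakemake in the benchmarks folder (the file name below is hypothetical) and that you kept track of the number of CPUs allocated to that rule:

import pandas as pd

# Hypothetical inputs: a snakemake benchmark file and the number of CPUs
# that were allocated to the corresponding rule (snakemake does not store it).
bench_file = 'benchmarks/your_rule.txt'
rule_threads = 8

bench = pd.read_csv(bench_file, sep='\t')

# cpu_time is the CPU wall time reported by snakemake; dividing it by the
# number of allocated CPUs gives the normalized value used in the paper.
bench['cpu_time_per_cpu'] = bench['cpu_time'] / rule_threads

print(bench[['cpu_time', 'mean_load', 'cpu_time_per_cpu']])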

Citations

Please cite our paper published in Bioinformatics.
