Metagenomic Read Recruitment to SAGs

This metagenomic read recruitment workflow was developed by the Stepanauskas Group and used in Pachiadaki et al. 2017.

Extensive instructions can be found on the package's GitHub page: sag-mg-recruit

The package is available on C1 and C2 via SCGC's anaconda module.

You'll also need to load the dependencies flash and bwa.

To load everything into your environment:

module use /mod/scgc/
module load anaconda
module load flash
module load bwa
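
Once the modules are loaded, a quick check like the one below can confirm that the tools are visible on your PATH. This is a minimal sketch: the helper name missing_tools is hypothetical, and it assumes the flash and bwa modules provide executables with those same names.

```shell
# Print the name of every requested executable that is not on PATH.
# missing_tools is a hypothetical helper, not part of sag-mg-recruit.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Assumption: the executables are named after their modules.
missing=$(missing_tools sag-mg-recruit flash bwa)
if [ -n "$missing" ]; then
    echo "Not found on PATH:" $missing
else
    echo "All tools found."
fi
```

If anything is reported as missing, re-run the module commands above in the same shell session before continuing.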

To see instructions on how to run the workflow, type: sag-mg-recruit --help

This should return something like:

Usage: sag-mg-recruit [OPTIONS] INPUT_MG_TABLE INPUT_SAG_TABLE
Options:
  --outdir TEXT          directory location to place output files
  --cores INTEGER        number of cores to run on  [default: 8]
  --mmd FLOAT            for join step: mismatch density  [default: 0.05]
  --mino INTEGER         for join step: minimum overlap  [default: 35]
  --maxo INTEGER         for join step: maximum overlap  [default: 150]
  --minlen INTEGER       for alignment and mg read count: minimum alignment
                         length to include; minimum read size to include
                         [default: 150]
  --pctid INTEGER        for alignment: minimum percent identity to keep
                         within overlapping region  [default: 95]
  --overlap INTEGER      for alignment: percent read that must overlap with
                         reference sequence to keep  [default: 0]
  --log TEXT             name of log file, else, log sent to standard out
  --concatenate BOOLEAN  include concatenated SAG in analysis  [default: True]
  --checkm BOOLEAN       should checkm be run on the SAGs?  [default: True]
  --keep_coverage        if you want to keep the genome coverage table (large)
  -h, --help             Show this message and exit.

Each run requires a table listing input metagenomes and a table listing input SAGs. Example input tables can be found here. Make sure you also specify a new directory for output files using the --outdir parameter.
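
A typical call might be put together as sketched below. The file names metagenomes.txt and sags.txt, the output directory recruitment_out, and the wrapper run_recruit are all hypothetical placeholders; only the flags shown in the help text above are used. By default the wrapper just prints the command so you can inspect it; set DRY_RUN=0 to actually run it.

```shell
# run_recruit is a hypothetical wrapper, not part of sag-mg-recruit.
# Arguments: metagenome table, SAG table, output directory, core count.
run_recruit() {
    # Rebuild the argument list as the full command line.
    set -- sag-mg-recruit --outdir "$3" --cores "$4" "$1" "$2"
    if [ "${DRY_RUN:-1}" = "1" ]; then
        # Dry run: show the command without executing it.
        echo "$*"
    else
        "$@"
    fi
}

# Placeholder paths: substitute your own tables and a new output directory.
run_recruit metagenomes.txt sags.txt recruitment_out 12
```

The remaining options (for example --minlen or --pctid) keep their defaults unless you add them to the command in the same way.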

This workflow is not necessarily optimized for our current HPC environment, as it was written before the scheduler was installed. It runs metagenomic read recruitment to SAGs one metagenome/SAG pair at a time. Reasonable settings are 12 to 30 cores and a walltime of somewhere between 24 hours and a week, depending on how many metagenomes and SAGs you are comparing and on the size of your input metagenomes.

Note that this workflow was designed for recruiting metagenomic reads generated by Illumina sequencers.