Metagenomic Read Recruitment to SAGs
Metagenomic read recruitment workflow developed by the Stepanauskas Group, used in Pachiadaki et al. 2017.
Extensive instructions are on the package's GitHub page: sag-mg-recruit
The workflow is available on C1 and C2 through SCGC's anaconda module.
You'll also need to load its dependencies, flash and bwa.
To load everything into your environment:
module use /mod/scgc/
module load anaconda
module load flash
module load bwa
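After loading the modules, it can help to confirm that the tools are actually on your PATH before starting a long run. The following is a minimal sketch, not part of the package: the helper name missing_tools is hypothetical, and it assumes the executables are named sag-mg-recruit, flash and bwa.

```shell
# Hypothetical helper: print each given command that is not found on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Assumes the executables are named sag-mg-recruit, flash and bwa.
missing=$(missing_tools sag-mg-recruit flash bwa)
if [ -n "$missing" ]; then
    echo "not found on PATH:" $missing
fi
```

If nothing is printed, all three commands were found.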
For usage instructions, run:
sag-mg-recruit --help
This should return something like:
Usage: sag-mg-recruit [OPTIONS] INPUT_MG_TABLE INPUT_SAG_TABLE
Options:
--outdir TEXT directory location to place output files
--cores INTEGER number of cores to run on [default: 8]
--mmd FLOAT for join step: mismatch density [default: 0.05]
--mino INTEGER for join step: minimum overlap [default: 35]
--maxo INTEGER for join step: maximum overlap [default: 150]
--minlen INTEGER for alignment and mg read count: minimum alignment length to include; minimum read size to include [default: 150]
--pctid INTEGER for alignment: minimum percent identity to keep within overlapping region [default: 95]
--overlap INTEGER for alignment: percent read that must overlap with reference sequence to keep [default: 0]
--log TEXT name of log file, else, log sent to standard out
--concatenate BOOLEAN include concatenated SAG in analysis [default: True]
--checkm BOOLEAN should checkm be run on the SAGs? [default: True]
--keep_coverage if you want to keep the genome coverage table (large)
-h, --help Show this message and exit.
Each run requires a table listing input metagenomes and a table listing input SAGs. Example input tables can be found here. Make sure you also specify a new directory for output files using the --outdir parameter.
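As an illustration, a run with both tables and a new output directory might look like the sketch below. The file names metagenomes.tsv and sags.tsv, the output directory name and the core count are placeholders, not values from the package; only options shown in the help text above are used. The command is printed first and is run only if the tool and both tables are present.

```shell
# Hypothetical input tables and output directory; substitute your own paths.
mg_table="metagenomes.tsv"
sag_table="sags.tsv"
outdir="recruitment_out"

# Build the command from options listed in the --help output.
cmd="sag-mg-recruit --outdir $outdir --cores 12 --log $outdir.log $mg_table $sag_table"

# Print the command for review; run it only if the tool is on PATH
# and both input tables exist.
echo "$cmd"
if command -v sag-mg-recruit >/dev/null 2>&1 && [ -f "$mg_table" ] && [ -f "$sag_table" ]; then
    $cmd
fi
```

Writing the log to a file with --log keeps the run's messages out of standard output, which is convenient for long jobs.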
This workflow was written before the scheduler was installed, so it is not necessarily optimized for our current HPC environment. It recruits metagenomic reads to SAGs one metagenome-SAG pair at a time. A reasonable starting point is 12 to 30 cores, with a walltime between 24 hours and a week, depending on how many metagenomes and SAGs you are comparing and on the size of your input metagenomes.
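Because pairs are processed one at a time, the number of metagenome-SAG pairs gives a rough sense of how a run will scale when choosing a walltime. The sketch below counts them from the two input tables. It assumes each table has one header line followed by one sample per row, which may not match the real layout, so check the example input tables first. The file names and the helper count_rows are hypothetical.

```shell
# Count data rows in a table, assuming one header line and one sample per row
# (an assumption; check the example input tables for the real layout).
count_rows() {
    awk 'NR > 1 && NF > 0 { n++ } END { print n + 0 }' "$1"
}

# Hypothetical file names; estimate only if both tables exist.
mg_table="metagenomes.tsv"
sag_table="sags.tsv"
if [ -f "$mg_table" ] && [ -f "$sag_table" ]; then
    n_mg=$(count_rows "$mg_table")
    n_sag=$(count_rows "$sag_table")
    echo "pairs to process: $(( n_mg * n_sag ))"
fi
```

Treat the result as a lower bound: with --concatenate left at its default, the concatenated SAG is included in the analysis as well.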
Note that this workflow was designed for recruiting metagenomic reads generated by Illumina sequencers.