Running a Whole Genome Pedigree Dataset
Change into a directory with at least 2 TB of allocated disk space:
cd /data/$USER
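Before going further, it can help to confirm that the space is really available. The following is a minimal sketch and not part of vg_wdl: `has_free_space` is a hypothetical helper that compares the free space reported by `df` with a threshold in kilobytes. Note that `df` reports free space for the whole filesystem, which may differ from a per-user quota, so the cluster's own quota report remains the authoritative source.

```shell
# Hypothetical helper (not part of vg_wdl): succeed if the filesystem
# holding DIR reports at least MIN_KB kilobytes available.
has_free_space() {
    local dir="$1" min_kb="$2" avail_kb
    # -P forces one line per filesystem, -k reports sizes in 1 KiB blocks;
    # column 4 of the second line is the available space.
    avail_kb=$(df -Pk "$dir" | awk 'NR==2 {print $4}')
    [ "$avail_kb" -ge "$min_kb" ]
}

# 2 TB expressed in 1 KiB blocks (2 * 1024^3).
TWO_TB_KB=2147483648

if has_free_space . "$TWO_TB_KB"; then
    echo "enough space in $(pwd)"
else
    echo "less than 2 TB available in $(pwd)" >&2
fi
```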
Launch an interactive session on Biowulf and load requisite Biowulf modules:
sinteractive
module load cromwell/40 git python/3.6
Clone the GitHub repo and create a work directory for running the WDL workflow:
VG_WDL_DIR="/data/$USER/test_vg_wdl_run/wdl_tools"
mkdir -p ${VG_WDL_DIR} && cd ${VG_WDL_DIR}
git clone https://github.com/vgteam/vg_wdl.git
Download workflow inputs and set up miniwdl virtual environment to run vg_wdl workflows:
WORKFLOW_INPUT_DIR="/data/$USER/test_vg_wdl_run/workflow_inputs"
${VG_WDL_DIR}/vg_wdl/scripts/setup_vg_wdl.sh -g ${WORKFLOW_INPUT_DIR} -v ${VG_WDL_DIR}
exit
Set up the cohort working directory and collect the input reads for the cohort (this should take a few minutes). Only COHORT_NAME, COHORT_NAMES_LIST and COHORT_INPUT_DATA need to be changed from this template. COHORT_NAME should be the sample name of the proband in a UDP cohort. The COHORT_NAMES_LIST bash array must list the proband, sibling and parental IDs, separated by spaces. COHORT_INPUT_DATA should contain the full path to the directory that holds all raw read data for the cohort. For example, if the raw reads for PROBAND and SIBLING_1 are located at /data/Udpdata/Individuals/PROBAND/R1_fastq.gz and /data/Udpdata/Individuals/SIBLING_1/R1_fastq.gz respectively, then COHORT_INPUT_DATA should be /data/Udpdata/Individuals.
COHORT_NAME="UDP****"
COHORT_NAMES_LIST=("UDP_MATERNAL" "UDP_PATERNAL" "UDP_PROBAND" "UDP_SIB_1" "UDP_SIB_2")
VG_WDL_DIR="/data/$USER/test_vg_wdl_run/wdl_tools"
COHORT_WORKFLOW_DIR="/data/$USER/test_vg_wdl_run/${COHORT_NAME}_cohort_workdir"
COHORT_INPUT_DATA="/PATH/TO/DIRECTORY/CONTAINING/INPUT/READS"
${VG_WDL_DIR}/vg_wdl/scripts/setup_input_reads.sh -l "${COHORT_NAMES_LIST[*]}" -w ${COHORT_WORKFLOW_DIR} -c ${COHORT_INPUT_DATA}
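The example paths above imply one subdirectory per sample under COHORT_INPUT_DATA. Here is a minimal sketch of a pre-flight check that assumes exactly that layout. `missing_sample_dirs` is a hypothetical helper, not part of vg_wdl, and the demonstration runs against a throwaway directory tree rather than real data.

```shell
# Hypothetical helper: print every sample ID that has no directory of its own
# under BASE_DIR. Prints nothing when all samples are present.
missing_sample_dirs() {
    local base_dir="$1" sample
    shift
    for sample in "$@"; do
        [ -d "$base_dir/$sample" ] || echo "$sample"
    done
}

# Demonstration on a temporary directory tree with one sample missing.
demo_dir=$(mktemp -d)
mkdir -p "$demo_dir/UDP_PROBAND" "$demo_dir/UDP_MATERNAL"
missing=$(missing_sample_dirs "$demo_dir" UDP_MATERNAL UDP_PATERNAL UDP_PROBAND)
echo "missing: $missing"   # prints "missing: UDP_PATERNAL"
rm -rf "$demo_dir"
```

With the variables from this template, the equivalent call would be `missing_sample_dirs "${COHORT_INPUT_DATA}" "${COHORT_NAMES_LIST[@]}"`.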
Change into the cohort work directory and set up the input variables. Only MATERNAL_SAMPLE_NAME, PATERNAL_SAMPLE_NAME and PROBAND_SAMPLE_NAME need to be changed from this template.
MATERNAL_SAMPLE_NAME="UDP_MOM"
PATERNAL_SAMPLE_NAME="UDP_DAD"
PROBAND_SAMPLE_NAME="UDP_CHILD"
VG_WDL_DIR="/data/$USER/test_vg_wdl_run/wdl_tools"
WORKFLOW_INPUT_DIR="/data/$USER/test_vg_wdl_run/workflow_inputs"
COHORT_WORKFLOW_DIR="/data/$USER/test_vg_wdl_run/${PROBAND_SAMPLE_NAME}_cohort_workdir"
Set up the workflow bash script
${VG_WDL_DIR}/vg_wdl/scripts/setup_trio_mapping_script.part_1.sh -p ${PROBAND_SAMPLE_NAME} -m ${MATERNAL_SAMPLE_NAME} -f ${PATERNAL_SAMPLE_NAME} -w ${COHORT_WORKFLOW_DIR} -g ${WORKFLOW_INPUT_DIR} -v ${VG_WDL_DIR}
Run the trio mapping workflow
cd ${COHORT_WORKFLOW_DIR}
sbatch --cpus-per-task=6 --mem=100g --gres=lscratch:200 --time=72:00:00 ${PROBAND_SAMPLE_NAME}_cohort_trio_map.part_1.sh
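The `cd` and `sbatch` lines above suggest that the setup script writes the batch script into the cohort work directory. Under that assumption, a small guard can catch a failed setup step before anything is submitted. This is only a sketch: `submit_part` is a hypothetical wrapper, not part of vg_wdl. It relies on the standard `--parsable` option of `sbatch`, which prints the job ID in a machine-readable form.

```shell
# Hypothetical wrapper: refuse to submit when the generated batch script is
# missing, otherwise submit it and print the job ID reported by sbatch.
# Usage: submit_part SCRIPT [sbatch options...]
submit_part() {
    local script="$1"
    shift
    if [ ! -f "$script" ]; then
        echo "batch script not found: $script" >&2
        return 1
    fi
    sbatch --parsable "$@" "$script"
}

# Example call (commented out so the sketch can be sourced without Slurm):
# job_id=$(submit_part "${PROBAND_SAMPLE_NAME}_cohort_trio_map.part_1.sh" \
#     --cpus-per-task=6 --mem=100g --gres=lscratch:200 --time=72:00:00)
```

The same wrapper could be used for the later parts by swapping in the corresponding script name and resource options.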
Set up input variables
MATERNAL_SAMPLE_NAME="UDP_MOM"
PATERNAL_SAMPLE_NAME="UDP_DAD"
PROBAND_SAMPLE_NAME="UDP_CHILD"
VG_WDL_DIR="/data/$USER/test_vg_wdl_run/wdl_tools"
WORKFLOW_INPUT_DIR="/data/$USER/test_vg_wdl_run/workflow_inputs"
COHORT_WORKFLOW_DIR="/data/$USER/test_vg_wdl_run/${PROBAND_SAMPLE_NAME}_cohort_workdir"
Set up the workflow bash script
${VG_WDL_DIR}/vg_wdl/scripts/setup_trio_calling_script.part_2.sh -p ${PROBAND_SAMPLE_NAME} -m ${MATERNAL_SAMPLE_NAME} -f ${PATERNAL_SAMPLE_NAME} -w ${COHORT_WORKFLOW_DIR} -g ${WORKFLOW_INPUT_DIR} -v ${VG_WDL_DIR}
Run the trio genotyping workflow
cd ${COHORT_WORKFLOW_DIR}
sbatch --cpus-per-task=6 --mem=100g --gres=lscratch:200 --time=24:00:00 ${PROBAND_SAMPLE_NAME}_cohort_trio_call.part_2.sh
One of the input variables, PED_FILE, must point to a valid .ped file named either COHORT_ID.ped or PROBAND_SAMPLE_ID.ped, and the file must follow the tab-delimited PED file format. The .ped file should contain only the mother-father-proband trio of samples. For example, the HG002 trio file looks like the following, where the proband is HG002, the father is HG003 and the mother is HG004:
#Family ID Father Mother Sex[1=M] Affected[2=A]
HG002 HG002 HG003 HG004 1 2
HG002 HG003 0 0 1 1
HG002 HG004 0 0 2 1
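Because the file must be tab-delimited, it is easy to break by pasting it with spaces. The sketch below writes a trio file in the same layout as the HG002 example and then checks that every sample row has six tab-separated fields. Both `write_trio_ped` and `check_trio_ped` are hypothetical helpers, not part of vg_wdl. The generator assumes, as in the example, that the family ID equals the proband ID, that the proband is affected and that both parents are unaffected.

```shell
# Hypothetical helper: print a trio PED file in the layout of the HG002 example.
# Arguments: PROBAND FATHER MOTHER PROBAND_SEX (1 = male, 2 = female).
write_trio_ped() {
    local proband="$1" father="$2" mother="$3" proband_sex="$4"
    printf '#Family\tID\tFather\tMother\tSex[1=M]\tAffected[2=A]\n'
    printf '%s\t%s\t%s\t%s\t%s\t2\n' "$proband" "$proband" "$father" "$mother" "$proband_sex"
    printf '%s\t%s\t0\t0\t1\t1\n' "$proband" "$father"
    printf '%s\t%s\t0\t0\t2\t1\n' "$proband" "$mother"
}

# Hypothetical helper: fail unless the file has exactly three sample rows,
# each with six tab-delimited fields. Lines starting with "#" are skipped.
check_trio_ped() {
    awk -F '\t' '
        /^#/ { next }
        { rows++ }
        NF != 6 { printf "line %d: %d fields, expected 6\n", NR, NF; bad = 1 }
        END {
            if (rows != 3) { printf "%d sample rows, expected 3\n", rows; bad = 1 }
            exit (bad ? 1 : 0)
        }
    ' "$1"
}

# Demonstration with the HG002 trio (male proband).
ped_file=$(mktemp)
write_trio_ped HG002 HG003 HG004 1 > "$ped_file"
check_trio_ped "$ped_file" && echo "PED file looks well-formed"
rm -f "$ped_file"
```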
Set up input variables
MATERNAL_SAMPLE_NAME="UDP_MOM"
PATERNAL_SAMPLE_NAME="UDP_DAD"
PROBAND_SAMPLE_NAME="UDP_CHILD"
VG_WDL_DIR="/data/$USER/test_vg_wdl_run/wdl_tools"
WORKFLOW_INPUT_DIR="/data/$USER/test_vg_wdl_run/workflow_inputs"
COHORT_WORKFLOW_DIR="/data/$USER/test_vg_wdl_run/${PROBAND_SAMPLE_NAME}_cohort_workdir"
PED_FILE="${COHORT_WORKFLOW_DIR}/${PROBAND_SAMPLE_NAME}.ped"
Set up the workflow bash script
${VG_WDL_DIR}/vg_wdl/scripts/setup_parent_graph_construct_script.part_3.sh -p ${PROBAND_SAMPLE_NAME} -m ${MATERNAL_SAMPLE_NAME} -f ${PATERNAL_SAMPLE_NAME} -c ${PED_FILE} -w ${COHORT_WORKFLOW_DIR} -g ${WORKFLOW_INPUT_DIR} -v ${VG_WDL_DIR}
Run the parental graph construction workflow
cd ${COHORT_WORKFLOW_DIR}
sbatch --cpus-per-task=6 --mem=50g --gres=lscratch:100 --time=72:00:00 ${PROBAND_SAMPLE_NAME}_cohort_parental_graph_construction.part_3.sh
Change into the cohort work directory and set up the input variables. The SIBLING_ID_LIST bash array must list the proband and sibling IDs, separated by spaces, with the proband listed first. For example, if the pedigree has one proband UDP_PROBAND and 2 additional siblings UDP_SIB_1 and UDP_SIB_2, use SIBLING_ID_LIST=("UDP_PROBAND" "UDP_SIB_1" "UDP_SIB_2").
SIBLING_ID_LIST=("UDP_PROBAND" "UDP_SIB_1" "UDP_SIB_2")
VG_WDL_DIR="/data/$USER/test_vg_wdl_run/wdl_tools"
WORKFLOW_INPUT_DIR="/data/$USER/test_vg_wdl_run/workflow_inputs"
COHORT_WORKFLOW_DIR="/data/$USER/test_vg_wdl_run/${SIBLING_ID_LIST[0]}_cohort_workdir"
${VG_WDL_DIR}/vg_wdl/scripts/setup_sibling_mapping_script.part_4.sh -s "${SIBLING_ID_LIST[*]}" -w ${COHORT_WORKFLOW_DIR} -g ${WORKFLOW_INPUT_DIR} -v ${VG_WDL_DIR}
Run the sibling alignment workflow
cd ${COHORT_WORKFLOW_DIR}
sbatch --cpus-per-task=6 --mem=50g --gres=lscratch:200 --time=72:00:00 ${SIBLING_ID_LIST[0]}_cohort_2nd_iter_sibling_map.part_4.sh
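The commands in this step use two different expansions of the same bash array, which is why the order of SIBLING_ID_LIST matters. The following short sketch uses only standard bash behaviour and the example IDs from this page to show what each expansion produces.

```shell
# Requires bash (arrays are not available in plain POSIX sh).
SIBLING_ID_LIST=("UDP_PROBAND" "UDP_SIB_1" "UDP_SIB_2")

# "${SIBLING_ID_LIST[*]}" joins all elements into ONE word, separated by the
# first character of IFS (a space by default). This is the form passed to -s.
joined="${SIBLING_ID_LIST[*]}"
echo "$joined"                               # UDP_PROBAND UDP_SIB_1 UDP_SIB_2

# "${SIBLING_ID_LIST[0]}" is the first element. The work directory and batch
# script names are built from it, so the proband has to be listed first.
echo "${SIBLING_ID_LIST[0]}_cohort_workdir"  # UDP_PROBAND_cohort_workdir

# "${SIBLING_ID_LIST[@]}" keeps the elements as separate words, e.g. for loops.
for id in "${SIBLING_ID_LIST[@]}"; do
    echo "sample: $id"
done
```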
Change into the cohort work directory and set up the input variables. The SIBLING_ID_LIST bash array must list the proband and sibling IDs, separated by spaces, with the proband listed first. For example, if the pedigree has one proband UDP_PROBAND and 2 additional siblings UDP_SIB_1 and UDP_SIB_2, use SIBLING_ID_LIST=("UDP_PROBAND" "UDP_SIB_1" "UDP_SIB_2").
MATERNAL_SAMPLE_NAME="UDP_MOM"
PATERNAL_SAMPLE_NAME="UDP_DAD"
SIBLING_ID_LIST=("UDP_PROBAND" "UDP_SIB_1" "UDP_SIB_2")
VG_WDL_DIR="/data/$USER/test_vg_wdl_run/wdl_tools"
WORKFLOW_INPUT_DIR="/data/$USER/test_vg_wdl_run/workflow_inputs"
COHORT_WORKFLOW_DIR="/data/$USER/test_vg_wdl_run/${SIBLING_ID_LIST[0]}_cohort_workdir"
${VG_WDL_DIR}/vg_wdl/scripts/setup_pedigree_calling_script.part_5.sh -s "${SIBLING_ID_LIST[*]}" -m ${MATERNAL_SAMPLE_NAME} -f ${PATERNAL_SAMPLE_NAME} -w ${COHORT_WORKFLOW_DIR} -g ${WORKFLOW_INPUT_DIR} -v ${VG_WDL_DIR}
Run the cohort genotyping workflow
cd ${COHORT_WORKFLOW_DIR}
sbatch --cpus-per-task=6 --mem=50g --gres=lscratch:200 --time=24:00:00 ${SIBLING_ID_LIST[0]}_cohort_2nd_iter_pedigree_call.part_5.sh
Change into the cohort work directory and set up the input variables. The SIBLING_ID_LIST bash array must list the proband and sibling IDs, separated by spaces, with the proband listed first. For example, if the pedigree has one proband UDP_PROBAND and 2 additional siblings UDP_SIB_1 and UDP_SIB_2, use SIBLING_ID_LIST=("UDP_PROBAND" "UDP_SIB_1" "UDP_SIB_2").
MATERNAL_SAMPLE_NAME="UDP_MOM"
PATERNAL_SAMPLE_NAME="UDP_DAD"
SIBLING_ID_LIST=("UDP_PROBAND" "UDP_SIB_1" "UDP_SIB_2")
VG_WDL_DIR="/data/$USER/test_vg_wdl_run/wdl_tools"
WORKFLOW_INPUT_DIR="/data/$USER/test_vg_wdl_run/workflow_inputs"
COHORT_WORKFLOW_DIR="/data/$USER/test_vg_wdl_run/${SIBLING_ID_LIST[0]}_cohort_workdir"
${VG_WDL_DIR}/vg_wdl/scripts/setup_pedigree_indel_realignment_script.part_6.sh -s "${SIBLING_ID_LIST[*]}" -m ${MATERNAL_SAMPLE_NAME} -f ${PATERNAL_SAMPLE_NAME} -w ${COHORT_WORKFLOW_DIR} -g ${WORKFLOW_INPUT_DIR} -v ${VG_WDL_DIR}
Run the cohort indel realignment workflow
cd ${COHORT_WORKFLOW_DIR}
sbatch --cpus-per-task=6 --mem=50g --gres=lscratch:200 --time=24:00:00 ${SIBLING_ID_LIST[0]}_cohort_2nd_iter_pedigree_indel_realign.part_6.sh
MATERNAL_SAMPLE_NAME="UDP_MOM"
PATERNAL_SAMPLE_NAME="UDP_DAD"
SIBLING_ID_LIST=("UDP_PROBAND" "UDP_SIB_1" "UDP_SIB_2")
COHORT_WORKFLOW_DIR="/data/$USER/test_vg_wdl_run/${SIBLING_ID_LIST[0]}_cohort_workdir"
VG_WDL_DIR="/data/$USER/test_vg_wdl_run/wdl_tools"
OUTPUT_DIR="${SIBLING_ID_LIST[0]}_workflow_outputs"
${VG_WDL_DIR}/vg_wdl/scripts/collect_outputs.sh -s "${SIBLING_ID_LIST[*]}" -m ${MATERNAL_SAMPLE_NAME} -f ${PATERNAL_SAMPLE_NAME} -w ${COHORT_WORKFLOW_DIR} -o ${OUTPUT_DIR}
Delete intermediate workflow directories if they still exist
rm -fr ${COHORT_WORKFLOW_DIR}/${SIBLING_ID_LIST[0]}_cohort_trio_map.final_outputs
rm -fr ${COHORT_WORKFLOW_DIR}/${SIBLING_ID_LIST[0]}_cohort_trio_call.final_outputs
rm -fr ${COHORT_WORKFLOW_DIR}/${SIBLING_ID_LIST[0]}_cohort_parental_graph_construction.final_outputs
rm -fr ${COHORT_WORKFLOW_DIR}/${SIBLING_ID_LIST[0]}_cohort_2nd_iter_sibling_map.final_outputs
rm -fr ${COHORT_WORKFLOW_DIR}/${SIBLING_ID_LIST[0]}_cohort_2nd_iter_pedigree_call.final_outputs
rm -fr ${COHORT_WORKFLOW_DIR}/${SIBLING_ID_LIST[0]}_cohort_2nd_iter_pedigree_indel_realign.final_outputs
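The six `rm` commands above can also be written as a loop that refuses to run when a variable is empty. This is a minimal sketch: `remove_intermediate_outputs` is a hypothetical helper, not part of vg_wdl, and the directory names are copied from the commands above. The `${1:?}` form makes the shell stop with an error if an argument is unset or empty, so a missing variable cannot shorten the path handed to `rm`.

```shell
# Hypothetical helper: remove the six intermediate output directories of one
# cohort. Usage: remove_intermediate_outputs WORKDIR PROBAND_ID
remove_intermediate_outputs() {
    local workdir="${1:?work directory required}"
    local proband="${2:?proband ID required}"
    local stage
    for stage in trio_map trio_call parental_graph_construction \
                 2nd_iter_sibling_map 2nd_iter_pedigree_call \
                 2nd_iter_pedigree_indel_realign; do
        # "--" ends option parsing in case a path starts with a dash.
        rm -rf -- "${workdir}/${proband}_cohort_${stage}.final_outputs"
    done
}

# Example call (commented out; it deletes data):
# remove_intermediate_outputs "${COHORT_WORKFLOW_DIR}" "${SIBLING_ID_LIST[0]}"
```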