Climb (#45)
* explicitly allocate resources for k8s

* change paths of python, add bin to docker containers

* permissions

* shebangs

* update readme

* update gh actions and image tags

* change path of bin in Dockerfile to grab down a dir

* revert bin path in docker

* change wf to include bin

* change wf to include bin

* try copy

* try copy

* try copy

* try copy

* try copy

* lint yaml

* lint yaml

* lint yaml

* back to safe commit

* back to safe commit

* back to safe commit

* back to safe commit

* manifest to stop linting complaining

* docker push

* make resources dir a variable and accessible via docker

* make resources dir a variable and accessible via docker

* change wf to path resources too

* resources in config

* typo in config

* resources -> resource

* readme update

* docker paths for resource dir

* docker path

* try without docker paths

* split on first colon to deal with S3 stored data

* add bin to docker path

* use newer container

* add catalogue pre-packed

* try and deal with S3 buckets in parser scripts

* docker

* docker

* docker install aws

* turn off warning

* arbitrary push to trigger ci correctly

* find new folders

* docker push

* new docker

* delete extraneous import

* always pull k8s

* implement boto3 to interact with s3 in python scripts

* docker boto3

* delete files

* delete files

* permissions

* permissions

* permissions

* remove imports

* docker

* update script to add aws config

* path to aws

* try exporting variable from AWS

* split afanc to allow local execution to access AWS creds

* includes

* remove boto/path checking in python

* docker clockwork

* break up jq queries to own processes

* add missing unmix

* docker bump for pr

* rm old dockerfiles

* config missing refseq

* rm src

---------

Co-authored-by: annacprice <[email protected]>
WhalleyT and annacprice authored Dec 6, 2023
1 parent 0d28771 commit 0763b00
Showing 21 changed files with 1,463 additions and 64 deletions.
6 changes: 6 additions & 0 deletions .github/workflows/build-push-quay.yml
@@ -4,8 +4,10 @@ on:
branches:
- v0.9.6
- 0.9.7-dev
- climb
paths:
- '**/Dockerfile*'
- "bin/"

workflow_dispatch:

@@ -40,6 +42,10 @@ jobs:

steps:
- uses: actions/checkout@v3

- name: Copy folders to docker
run: |
cp -r bin docker/bin
- name: Get image name
id: image_name
2 changes: 1 addition & 1 deletion README.md
@@ -128,7 +128,7 @@ process clockwork:minos\
26. (Warn) If sample is not TB, then it is not passed to gnomonicus

## Running on CLIMB Jupyter Hub
There is a pre-configured climb profile to run Lodestone on a CLIMB Jupyter Notebook Server. Add ```profile climb,docker``` your command invocation. The input directory can point to an S3 bucket natively (e.g. ```--input_dir s3://my-team/bucket```).
There is a pre-configured climb profile to run Lodestone on a CLIMB Jupyter Notebook Server. Add ```profile climb``` to your command invocation. The input directory can point to an S3 bucket natively (e.g. ```--input_dir s3://my-team/bucket```). By default this will run the workflow in Docker containers and take advantage of kubernetes pods. The Kraken2, Bowtie2 and Afanc databases will by default point to the ```pluspf16```, ```hg19_1kgmaj_bt2``` and ```Mycobacteriaciae_DB_7.0``` respectively. These are mounted on a public shared volume.

## Acknowledgements ##
For a list of direct authors of this pipeline, please see the contributors list. All of the software dependencies of this pipeline are recorded in the version.json
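As a quick orientation for the README change above: a CLIMB run following the new instructions might look like the sketch below. Only `-profile climb` and `--input_dir` come from the README text; `--output_dir` and the placeholder bucket path are assumptions for illustration.

```bash
# Hypothetical Lodestone invocation on a CLIMB Jupyter Notebook Server.
# Per the README, the climb profile selects Docker containers and Kubernetes
# pods and points at the shared Kraken2/Bowtie2/Afanc databases by default.
nextflow run main.nf \
  -profile climb \
  --input_dir s3://my-team/bucket \
  --output_dir ./lodestone_results
```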
1 change: 1 addition & 0 deletions bin/create_final_json.py
100644 → 100755
@@ -1,3 +1,4 @@
#!/usr/bin/env python3
import json
import os
import sys
52 changes: 38 additions & 14 deletions bin/identify_tophit_and_contaminants2.py
100644 → 100755
@@ -1,9 +1,15 @@
#!/usr/bin/env python3

import json
import os
import sys
import argparse
import re
import copy
import boto3
import sys
import configparser
import pathlib

# define process requirements function
def process_requirements(args):
@@ -15,7 +21,10 @@ def process_requirements(args):
unmix_myco = args[5]
myco_dir = args[6]
prev_species_json = args[7]


credential_file = "~/.aws/config"

"""
# check if input files exist and not empty
if not os.path.exists(afanc_json):
sys.exit('ERROR: cannot find %s' %(afanc_json))
@@ -32,24 +41,45 @@ def process_requirements(args):
if os.stat(assembly_file).st_size == 0:
sys.exit('ERROR: %s is empty' %(assembly_file))
if not os.path.exists(myco_dir):
if not os.path.exists(myco_dir) and not bucket_exists(myco_dir):
sys.exit('ERROR: cannot find %s' %(myco_dir))
if (prev_species_json != 'null'):
if not os.path.exists(prev_species_json):
sys.exit('ERROR: cannot find %s' %(prev_species_json))
if os.stat(prev_species_json).st_size == 0:
sys.exit('ERROR: %s is empty' %(prev_species_json))

"""

species = ['abscessus', 'africanum', 'avium', 'bovis', 'chelonae', 'chimaera', 'fortuitum', 'intracellulare', 'kansasii', 'tuberculosis']
for spec in species:
spec_fasta_path = os.path.join(myco_dir, spec + '.fasta')
spec_mmi_path = os.path.join(myco_dir, spec + '.mmi')
if not os.path.exists(spec_fasta_path):
sys.exit('ERROR: cannot find %s' %(spec_fasta_path))
if not os.path.exists(spec_mmi_path):
sys.exit('ERROR: cannot find %s' %(spec_mmi_path))

"""
if myco_dir.startswith("s3://"):
s3_myco_dir = myco_dir.replace("s3://", "")
spec_fasta = s3_myco_dir.split("/", 1)[-1] + "/" + spec + ".fasta"
s3_myco_dir = s3_myco_dir.split("/", 1)[0]
if not is_file_in_s3(s3_myco_dir, spec_fasta):
sys.exit('ERROR: cannot find %s' %(spec_fasta_path))
else:
if not os.path.exists(spec_fasta_path):
sys.exit('ERROR: cannot find %s' %(spec_fasta_path))
if myco_dir.startswith("s3://"):
s3_myco_dir = myco_dir.replace("s3://", "")
spec_mmi = s3_myco_dir.split("/", 1)[-1] + "/" + spec + ".mmi"
s3_myco_dir = s3_myco_dir.split("/", 1)[0]
if not is_file_in_s3(s3_myco_dir, spec_mmi):
sys.exit('ERROR: cannot find %s' %(spec_mmi_path))
else:
if not os.path.exists(spec_fasta_path):
sys.exit('ERROR: cannot find %s' %(spec_mmi_path))
"""

if ((supposed_species != 'null') & (supposed_species not in species)):
sys.exit('ERROR: if you provide a species ID, it must be one of either: abscessus|africanum|avium|bovis|chelonae|chimaera|fortuitum|intracellulare|kansasii|tuberculosis')

@@ -393,12 +423,6 @@ def process_reports(afanc_json_path, kraken_json_path, supposed_species, unmix_m
ref_fa = os.path.join(myco_dir_path, identified_species + ".fasta")
ref_dir = os.path.join(myco_dir_path, identified_species)
ref_mmi = os.path.join(myco_dir_path, identified_species + ".mmi")
if not os.path.exists(ref_fa):
sys.exit('ERROR: cannot find %s' %(ref_fa))
if not os.path.exists(ref_dir):
sys.exit('ERROR: cannot find %s' %(ref_dir))
if not os.path.exists(ref_mmi):
sys.exit('ERROR: cannot find %s' %(ref_mmi))

if 'file_paths' not in out['top_hit']: out['top_hit']['file_paths'] = {}
out['top_hit']['file_paths']['ref_fa'] = ref_fa
Expand Down Expand Up @@ -453,7 +477,7 @@ def process_reports(afanc_json_path, kraken_json_path, supposed_species, unmix_m
description += "By defining [species] you will automatically select this to be the genome against which reads will be aligned using Clockwork\n"
description += "[unmix myco] is either 'yes' or 'no', given in response to the question: do you want to disambiguate mixed-mycobacterial samples by read alignment?\n"
description += "If 'no', any contaminating mycobacteria will be recorded but NOT acted upon\n"
usage = "python identify_tophit_and_contaminants2.py [path to afanc JSON] [path to Kraken JSON] [path to RefSeq assembly summary file] [species] [unmix myco] [directory containing mycobacterial reference genomes]\n"
usage = "python identify_tophit_and_contaminants2.py [path to afanc JSON] [path to Kraken JSON] [path to RefSeq assembly summary file] [species] [unmix myco] [directory containing mycobacterial reference genomes] [aws_config]\n"
usage += "E.G.:\tpython identify_tophit_and_contaminants2.py afanc_report.json afanc_report.json assembly_summary_refseq.txt 1 tuberculosis yes myco_dir\n\n\n"

parser = argparse.ArgumentParser(description=description, usage=usage, formatter_class=argparse.RawTextHelpFormatter)
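A note on the S3 handling shown (commented out) above: after stripping `s3://`, the script splits the URI on the first `/` to separate bucket from key and defers to an `is_file_in_s3` helper that this diff does not include. Purely as an illustration of that kind of check — the bucket, path, and use of `aws s3 ls` are assumptions, not what the pipeline ships — an equivalent test can be sketched with the AWS CLI that the Dockerfile changes below install:

```bash
#!/usr/bin/env bash
# Sketch only: check that a species reference exists locally or in S3.
myco_dir="s3://my-team/Mycobacteriaciae_DB_7.0"   # placeholder location
spec="tuberculosis"

if [[ "${myco_dir}" == s3://* ]]; then
  # `aws s3 ls` exits non-zero when the key is absent
  aws s3 ls "${myco_dir}/${spec}.fasta" > /dev/null \
    || { echo "ERROR: cannot find ${myco_dir}/${spec}.fasta" >&2; exit 1; }
else
  [[ -s "${myco_dir}/${spec}.fasta" ]] \
    || { echo "ERROR: cannot find ${myco_dir}/${spec}.fasta" >&2; exit 1; }
fi
```

In the committed version these checks stay disabled inside a docstring; the reference paths are instead read back out of the report JSON by the new `getRefFromJSON` and `getRefCortex` processes in `modules/clockworkModules.nf` further down.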
2 changes: 2 additions & 0 deletions bin/parse_kraken_report2.py
100644 → 100755
@@ -1,3 +1,5 @@
#!/usr/bin/env python3

import json
import os
import sys
2 changes: 2 additions & 0 deletions bin/reformat_afanc_json.py
100644 → 100755
@@ -1,3 +1,5 @@
#!/usr/bin/env python3

""" Reformats the Afanc report json for consumption by identify_tophit_and_contaminants
"""

Empty file modified bin/run-vcfmix.py
100644 → 100755
Empty file.
2 changes: 1 addition & 1 deletion bin/software-json.py
100644 → 100755
@@ -62,7 +62,7 @@ def go(singpath, configpath):
database.append(line)

database = [item.replace('=', ':') for item in database]
database_dict = dict(item.split(':') for item in database)
database_dict = dict(item.split(':', 1) for item in database)
database_dict = {"databases" : database_dict}

all_software_dict.update(database_dict)
@@ -1,6 +1,5 @@
FROM debian:buster


LABEL maintainer="[email protected]" \
about.summary="container for the clockwork workflow"

@@ -22,6 +21,10 @@ clockwork_version=2364dec4cbf25c844575e19e8fe0a319d10721b5
ENV PACKAGES="procps curl git build-essential wget zlib1g-dev pkg-config jq r-base-core rsync autoconf libncurses-dev libbz2-dev liblzma-dev libcurl4-openssl-dev cmake tabix libvcflib-tools libssl-dev software-properties-common perl locales locales-all" \
PYTHON="python2.7 python-dev"

COPY bin/ /opt/bin/
ENV PATH=/opt/bin:$PATH


RUN apt-get update \
&& apt-get install -y $PACKAGES $PYTHON \
&& curl -fsSL https://www.python.org/ftp/python/${python_version}/Python-${python_version}.tgz | tar -xz \
@@ -32,7 +35,7 @@ RUN apt-get update \
&& ln -s /usr/local/bin/python3.6 /usr/local/bin/python3 \
&& ln -s /usr/local/bin/pip3.6 /usr/local/bin/pip3 \
&& pip3 install --upgrade pip \
&& pip3 install 'cluster_vcf_records==0.13.1' pysam setuptools \
&& pip3 install 'cluster_vcf_records==0.13.1' pysam setuptools awscli \
&& wget -qO - https://adoptopenjdk.jfrog.io/adoptopenjdk/api/gpg/key/public | apt-key add - \
&& add-apt-repository --yes https://adoptopenjdk.jfrog.io/adoptopenjdk/deb/ \
&& apt-get update && apt-get install -y adoptopenjdk-8-hotspot
@@ -1,6 +1,5 @@
FROM ubuntu:focal


LABEL maintainer="[email protected]" \
about.summary="container for the preprocessing workflow"

@@ -21,13 +20,16 @@ fastani_version=1.33

ENV PACKAGES="procps curl git wget build-essential zlib1g-dev libncurses-dev libz-dev libbz2-dev liblzma-dev libcurl4-openssl-dev libgsl-dev rsync unzip ncbi-blast+ pigz jq libtbb-dev openjdk-11-jre-headless autoconf r-base-core locales locales-all" \
PYTHON="python3 python3-pip python3-dev" \
PYTHON_PACKAGES="biopython"
PYTHON_PACKAGES="biopython awscli boto3"

ENV PATH=${PATH}:/usr/local/bin/mccortex/bin:/usr/local/bin/bwa-${bwa_version}:/opt/edirect \
LD_LIBRARY_PATH=/usr/local/lib

RUN export DEBIAN_FRONTEND="noninteractive"

COPY bin/ /opt/bin/
ENV PATH=/opt/bin:$PATH

RUN apt-get update \
&& DEBIAN_FRONTEND="noninteractive" apt-get install -y $PACKAGES $PYTHON \
&& pip3 install --upgrade pip \
@@ -1,6 +1,5 @@
FROM ubuntu:20.04


LABEL maintainer="[email protected]" \
about.summary="container for the vcf predict workflow"

@@ -13,6 +12,10 @@ piezo_version=0.3 \
gnomonicus_version=1.1.2 \
tuberculosis_amr_catalogues=12d38733ad2e238729a3de9f725081e1d4872968

COPY bin/ /opt/bin/
ENV PATH=/opt/bin:$PATH


RUN apt-get update \
&& DEBIAN_FRONTEND=noninteractive TZ=Etc/UTC apt-get -y install tzdata \
&& apt-get install -y $PACKAGES $PYTHON \
@@ -21,6 +24,7 @@ RUN apt-get update \
&& cd VCFMIX \
&& git checkout ${vcfmix_version} \
&& pip3 install recursive_diff \
&& pip3 install awscli \
&& pip3 install . \
&& cp -r data /usr/local/lib/python3.8/dist-packages \
&& cd ..
5 changes: 3 additions & 2 deletions main.nf
@@ -183,7 +183,7 @@ workflow {

input_files_vjson = input_files.combine(getversion.out.getversion_json)

preprocessing(input_files_vjson, krakenDB, bowtie_dir, params.afanc_myco_db)
preprocessing(input_files_vjson, krakenDB, bowtie_dir, params.afanc_myco_db, params.resource_dir, params.refseq)

// CLOCKWORK SUB-WORKFLOW

@@ -198,8 +198,9 @@ workflow {

mpileup_vcf = clockwork.out.mpileup_vcf
minos_vcf = clockwork.out.minos_vcf
genbank = channel.fromPath(params.gnomonicus_genbank)

vcfpredict(mpileup_vcf, minos_vcf)
vcfpredict(mpileup_vcf, minos_vcf, genbank)

}

66 changes: 58 additions & 8 deletions modules/clockworkModules.nf
@@ -1,5 +1,31 @@
// modules for the clockwork workflow

process getRefFromJSON {
tag { sample_name }
label 'clockwork'
label 'low_memory'
label 'low_cpu'

input:
path(species_json)
val(do_we_align)
val(sample_name)

when:
do_we_align =~ /NOW\_ALIGN\_TO\_REF\_${sample_name}/

output:
stdout

script:
"""
ref_string=\$(jq -r '.top_hit.file_paths.ref_fa' ${species_json})
echo "\$ref_string"
"""


}

process alignToRef {
/**
* @QCcheckpoint fail if insufficient number and/or quality of read alignments to the reference genome
@@ -15,6 +41,7 @@ process alignToRef {

input:
tuple val(sample_name), path(fq1), path(fq2), path(software_json), path(species_json), val(doWeAlign)
path(reference_path)

when:
doWeAlign =~ /NOW\_ALIGN\_TO\_REF\_${sample_name}/
@@ -35,9 +62,8 @@ process alignToRef {
error_log = "${sample_name}_err.json"

"""
ref_fa=\$(jq -r '.top_hit.file_paths.ref_fa' ${species_json})
cp \${ref_fa} ${sample_name}.fa
echo $reference_path
cp ${reference_path} ${sample_name}.fa
minimap2 -ax sr ${sample_name}.fa -t ${task.cpus} $fq1 $fq2 | samtools fixmate -m - - | samtools sort -T tmp - | samtools markdup --reference ${sample_name}.fa - minimap.bam
@@ -46,8 +72,8 @@ process alignToRef {
samtools index ${bam} ${bai}
samtools stats ${bam} > ${stats}
python3 ${baseDir}/bin/parse_samtools_stats.py ${bam} ${stats} > ${stats_json}
python3 ${baseDir}/bin/create_final_json.py ${stats_json} ${species_json}
parse_samtools_stats.py ${bam} ${stats} > ${stats_json}
create_final_json.py ${stats_json} ${species_json}
cp ${sample_name}_report.json ${sample_name}_report_previous.json
@@ -114,6 +140,30 @@ process callVarsMpileup {
"""
}

process getRefCortex {
tag { sample_name }
label 'clockwork'
label 'low_memory'
label 'low_cpu'

input:
tuple val(sample_name), path(report_json), path(bam), path(ref), val(doWeVarCall)

when:
doWeVarCall =~ /NOW\_VARCALL\_${sample_name}/

output:
stdout

script:
"""
ref_dir=\$(jq -r '.top_hit.file_paths.clockwork_ref_dir' ${report_json})
echo "\$ref_dir"
"""


}

process callVarsCortex {
/**
* @QCcheckpoint none
@@ -128,6 +178,7 @@ process callVarsCortex {

input:
tuple val(sample_name), path(report_json), path(bam), path(ref), val(doWeVarCall)
path(ref_dir)

when:
doWeVarCall =~ /NOW\_VARCALL\_${sample_name}/
@@ -139,9 +190,7 @@ process callVarsCortex {
cortex_vcf = "${sample_name}.cortex.vcf"

"""
ref_dir=\$(jq -r '.top_hit.file_paths.clockwork_ref_dir' ${report_json})
cp -r \${ref_dir}/* .
cp -r ${ref_dir}/* .
clockwork cortex . ${bam} cortex ${sample_name}
cp cortex/cortex.out/vcfs/cortex_wk_flow_I_RefCC_FINALcombined_BC_calls_at_all_k.raw.vcf ${cortex_vcf}
@@ -163,6 +212,7 @@ process minos {
tag { sample_name }
label 'clockwork'
label 'medium_memory'
label 'normal_cpu'

publishDir "${params.output_dir}/$sample_name/output_vcfs", mode: 'copy', pattern: '*.vcf'
publishDir "${params.output_dir}/$sample_name", mode: 'copy', overwrite: 'true', pattern: '*{_err.json,_report.json}'
2 changes: 1 addition & 1 deletion modules/getversionModules.nf
@@ -12,7 +12,7 @@ process getversion {
script:

"""
python3 ${baseDir}/bin/software-json.py ${params.sing_dir} ${params.config_dir}
software-json.py ${params.sing_dir} ${params.config_dir}
"""

stub: