Climb (#45)
* explicitly allocate resources for k8s

* change paths of python, add bin to docker containers

* permissions

* shebangs

* update readme

* update gh actions and image tags

* change path of bin in Dockerfile to grab down a dir

* revert bin path in docker

* change wf to include bin

* change wf to include bin

* try copy

* try copy

* try copy

* try copy

* try copy

* lint yaml

* lint yaml

* lint yaml

* back to safe commit

* back to safe commit

* back to safe commit

* back to safe commit

* manifest to stop linting complaining

* docker push

* make resources dir a variable and accessible via docker

* make resources dir a variable and accessible via docker

* change wf to path resources too

* resources in config

* typo in config

* resources -> resource

* readme update

* docker paths for resource dir

* docker path

* try without docker paths

* split on first colon to deal with S3 stored data

* add bin to docker path

* use newer container

* add catalogue pre-packed

* try and deal with S3 buckets in parser scripts

* docker

* docker

* docker install aws

* turn off warning

* arbitrary push to trigger ci correctly

* find new folders

* docker push

* new docker

* delete extraneous import

* always pull k8s

* implement boto3 to interact with s3 in python scripts

* docker boto3

* delete files

* delete files

* permissions

* permissions

* permissions

* remove imports

* docker

* update script to add aws config

* path to aws

* try exporting variable from AWS

* split afanc to allow local execution to access AWS creds

* includes

* remove boto/path checking in python

* docker clockwork

* break up jq queries to own processes

* add missing unmix

* docker bump for pr

* rm old dockerfiles

* config missing refseq

* rm src

---------

Co-authored-by: annacprice <[email protected]>
WhalleyT and annacprice authored Dec 6, 2023
1 parent 0d28771 commit 0763b00
Showing 21 changed files with 1,463 additions and 64 deletions.
6 changes: 6 additions & 0 deletions .github/workflows/build-push-quay.yml
@@ -4,8 +4,10 @@ on:
branches:
- v0.9.6
- 0.9.7-dev
- climb
paths:
- '**/Dockerfile*'
- "bin/"

workflow_dispatch:

@@ -40,6 +42,10 @@ jobs:

steps:
- uses: actions/checkout@v3

- name: Copy folders to docker
run: |
cp -r bin docker/bin
- name: Get image name
id: image_name
2 changes: 1 addition & 1 deletion README.md
@@ -128,7 +128,7 @@ process clockwork:minos\
26. (Warn) If sample is not TB, then it is not passed to gnomonicus

## Running on CLIMB Jupyter Hub
There is a pre-configured climb profile to run Lodestone on a CLIMB Jupyter Notebook Server. Add ```profile climb,docker``` your command invocation. The input directory can point to an S3 bucket natively (e.g. ```--input_dir s3://my-team/bucket```).
There is a pre-configured climb profile to run Lodestone on a CLIMB Jupyter Notebook Server. Add ```profile climb``` to your command invocation. The input directory can point to an S3 bucket natively (e.g. ```--input_dir s3://my-team/bucket```). By default this will run the workflow in Docker containers and take advantage of kubernetes pods. The Kraken2, Bowtie2 and Afanc databases will by default point to the ```pluspf16```, ```hg19_1kgmaj_bt2``` and ```Mycobacteriaciae_DB_7.0``` respectively. These are mounted on a public shared volume.

## Acknowledgements ##
For a list of direct authors of this pipeline, please see the contributors list. All of the software dependencies of this pipeline are recorded in the version.json
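As a quick orientation for the README change above: a CLIMB run following the new instructions might look like the sketch below. Only `-profile climb` and `--input_dir` come from the README text; `--output_dir` and the placeholder bucket path are assumptions for illustration.

```bash
# Hypothetical Lodestone invocation on a CLIMB Jupyter Notebook Server.
# Per the README, the climb profile selects Docker containers and Kubernetes
# pods and points at the shared Kraken2/Bowtie2/Afanc databases by default.
nextflow run main.nf \
  -profile climb \
  --input_dir s3://my-team/bucket \
  --output_dir ./lodestone_results
```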
1 change: 1 addition & 0 deletions bin/create_final_json.py
100644 → 100755
@@ -1,3 +1,4 @@
#!/usr/bin/env python3
import json
import os
import sys
52 changes: 38 additions & 14 deletions bin/identify_tophit_and_contaminants2.py
100644 → 100755
@@ -1,9 +1,15 @@
#!/usr/bin/env python3

import json
import os
import sys
import argparse
import re
import copy
import boto3
import sys
import configparser
import pathlib

# define process requirements function
def process_requirements(args):
@@ -15,7 +21,10 @@ def process_requirements(args):
unmix_myco = args[5]
myco_dir = args[6]
prev_species_json = args[7]


credential_file = "~/.aws/config"

"""
# check if input files exist and not empty
if not os.path.exists(afanc_json):
sys.exit('ERROR: cannot find %s' %(afanc_json))
@@ -32,24 +41,45 @@ def process_requirements(args):
if os.stat(assembly_file).st_size == 0:
sys.exit('ERROR: %s is empty' %(assembly_file))
if not os.path.exists(myco_dir):
if not os.path.exists(myco_dir) and not bucket_exists(myco_dir):
sys.exit('ERROR: cannot find %s' %(myco_dir))
if (prev_species_json != 'null'):
if not os.path.exists(prev_species_json):
sys.exit('ERROR: cannot find %s' %(prev_species_json))
if os.stat(prev_species_json).st_size == 0:
sys.exit('ERROR: %s is empty' %(prev_species_json))

"""

species = ['abscessus', 'africanum', 'avium', 'bovis', 'chelonae', 'chimaera', 'fortuitum', 'intracellulare', 'kansasii', 'tuberculosis']
for spec in species:
spec_fasta_path = os.path.join(myco_dir, spec + '.fasta')
spec_mmi_path = os.path.join(myco_dir, spec + '.mmi')
if not os.path.exists(spec_fasta_path):
sys.exit('ERROR: cannot find %s' %(spec_fasta_path))
if not os.path.exists(spec_mmi_path):
sys.exit('ERROR: cannot find %s' %(spec_mmi_path))

"""
if myco_dir.startswith("s3://"):
s3_myco_dir = myco_dir.replace("s3://", "")
spec_fasta = s3_myco_dir.split("/", 1)[-1] + "/" + spec + ".fasta"
s3_myco_dir = s3_myco_dir.split("/", 1)[0]
if not is_file_in_s3(s3_myco_dir, spec_fasta):
sys.exit('ERROR: cannot find %s' %(spec_fasta_path))
else:
if not os.path.exists(spec_fasta_path):
sys.exit('ERROR: cannot find %s' %(spec_fasta_path))
if myco_dir.startswith("s3://"):
s3_myco_dir = myco_dir.replace("s3://", "")
spec_mmi = s3_myco_dir.split("/", 1)[-1] + "/" + spec + ".mmi"
s3_myco_dir = s3_myco_dir.split("/", 1)[0]
if not is_file_in_s3(s3_myco_dir, spec_mmi):
sys.exit('ERROR: cannot find %s' %(spec_mmi_path))
else:
if not os.path.exists(spec_fasta_path):
sys.exit('ERROR: cannot find %s' %(spec_mmi_path))
"""

if ((supposed_species != 'null') & (supposed_species not in species)):
sys.exit('ERROR: if you provide a species ID, it must be one of either: abscessus|africanum|avium|bovis|chelonae|chimaera|fortuitum|intracellulare|kansasii|tuberculosis')

@@ -393,12 +423,6 @@ def process_reports(afanc_json_path, kraken_json_path, supposed_species, unmix_m
ref_fa = os.path.join(myco_dir_path, identified_species + ".fasta")
ref_dir = os.path.join(myco_dir_path, identified_species)
ref_mmi = os.path.join(myco_dir_path, identified_species + ".mmi")
if not os.path.exists(ref_fa):
sys.exit('ERROR: cannot find %s' %(ref_fa))
if not os.path.exists(ref_dir):
sys.exit('ERROR: cannot find %s' %(ref_dir))
if not os.path.exists(ref_mmi):
sys.exit('ERROR: cannot find %s' %(ref_mmi))

if 'file_paths' not in out['top_hit']: out['top_hit']['file_paths'] = {}
out['top_hit']['file_paths']['ref_fa'] = ref_fa
Expand Down Expand Up @@ -453,7 +477,7 @@ def process_reports(afanc_json_path, kraken_json_path, supposed_species, unmix_m
description += "By defining [species] you will automatically select this to be the genome against which reads will be aligned using Clockwork\n"
description += "[unmix myco] is either 'yes' or 'no', given in response to the question: do you want to disambiguate mixed-mycobacterial samples by read alignment?\n"
description += "If 'no', any contaminating mycobacteria will be recorded but NOT acted upon\n"
usage = "python identify_tophit_and_contaminants2.py [path to afanc JSON] [path to Kraken JSON] [path to RefSeq assembly summary file] [species] [unmix myco] [directory containing mycobacterial reference genomes]\n"
usage = "python identify_tophit_and_contaminants2.py [path to afanc JSON] [path to Kraken JSON] [path to RefSeq assembly summary file] [species] [unmix myco] [directory containing mycobacterial reference genomes] [aws_config]\n"
usage += "E.G.:\tpython identify_tophit_and_contaminants2.py afanc_report.json afanc_report.json assembly_summary_refseq.txt 1 tuberculosis yes myco_dir\n\n\n"

parser = argparse.ArgumentParser(description=description, usage=usage, formatter_class=argparse.RawTextHelpFormatter)
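A note on the S3 handling shown (commented out) above: after stripping `s3://`, the script splits the URI on the first `/` to separate bucket from key and defers to an `is_file_in_s3` helper that this diff does not include. Purely as an illustration of that kind of check — the bucket, path, and use of `aws s3 ls` are assumptions, not what the pipeline ships — an equivalent test can be sketched with the AWS CLI that the Dockerfile changes below install:

```bash
#!/usr/bin/env bash
# Sketch only: check that a species reference exists locally or in S3.
myco_dir="s3://my-team/Mycobacteriaciae_DB_7.0"   # placeholder location
spec="tuberculosis"

if [[ "${myco_dir}" == s3://* ]]; then
  # `aws s3 ls` exits non-zero when the key is absent
  aws s3 ls "${myco_dir}/${spec}.fasta" > /dev/null \
    || { echo "ERROR: cannot find ${myco_dir}/${spec}.fasta" >&2; exit 1; }
else
  [[ -s "${myco_dir}/${spec}.fasta" ]] \
    || { echo "ERROR: cannot find ${myco_dir}/${spec}.fasta" >&2; exit 1; }
fi
```

In the committed version these checks stay disabled inside a docstring; the reference paths are instead read back out of the report JSON by the new `getRefFromJSON` and `getRefCortex` processes in `modules/clockworkModules.nf` further down.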
2 changes: 2 additions & 0 deletions bin/parse_kraken_report2.py
100644 → 100755
@@ -1,3 +1,5 @@
#!/usr/bin/env python3

import json
import os
import sys
2 changes: 2 additions & 0 deletions bin/reformat_afanc_json.py
100644 → 100755
@@ -1,3 +1,5 @@
#!/usr/bin/env python3

""" Reformats the Afanc report json for consumption by identify_tophit_and_contaminants
"""

Empty file modified bin/run-vcfmix.py
100644 → 100755
Empty file.
2 changes: 1 addition & 1 deletion bin/software-json.py
100644 → 100755
@@ -62,7 +62,7 @@ def go(singpath, configpath):
database.append(line)

database = [item.replace('=', ':') for item in database]
database_dict = dict(item.split(':') for item in database)
database_dict = dict(item.split(':', 1) for item in database)
database_dict = {"databases" : database_dict}

all_software_dict.update(database_dict)
@@ -1,6 +1,5 @@
FROM debian:buster


LABEL maintainer="[email protected]" \
about.summary="container for the clockwork workflow"

@@ -22,6 +21,10 @@ clockwork_version=2364dec4cbf25c844575e19e8fe0a319d10721b5
ENV PACKAGES="procps curl git build-essential wget zlib1g-dev pkg-config jq r-base-core rsync autoconf libncurses-dev libbz2-dev liblzma-dev libcurl4-openssl-dev cmake tabix libvcflib-tools libssl-dev software-properties-common perl locales locales-all" \
PYTHON="python2.7 python-dev"

COPY bin/ /opt/bin/
ENV PATH=/opt/bin:$PATH


RUN apt-get update \
&& apt-get install -y $PACKAGES $PYTHON \
&& curl -fsSL https://www.python.org/ftp/python/${python_version}/Python-${python_version}.tgz | tar -xz \
@@ -32,7 +35,7 @@ RUN apt-get update \
&& ln -s /usr/local/bin/python3.6 /usr/local/bin/python3 \
&& ln -s /usr/local/bin/pip3.6 /usr/local/bin/pip3 \
&& pip3 install --upgrade pip \
&& pip3 install 'cluster_vcf_records==0.13.1' pysam setuptools \
&& pip3 install 'cluster_vcf_records==0.13.1' pysam setuptools awscli \
&& wget -qO - https://adoptopenjdk.jfrog.io/adoptopenjdk/api/gpg/key/public | apt-key add - \
&& add-apt-repository --yes https://adoptopenjdk.jfrog.io/adoptopenjdk/deb/ \
&& apt-get update && apt-get install -y adoptopenjdk-8-hotspot
@@ -1,6 +1,5 @@
FROM ubuntu:focal


LABEL maintainer="[email protected]" \
about.summary="container for the preprocessing workflow"

@@ -21,13 +20,16 @@ fastani_version=1.33

ENV PACKAGES="procps curl git wget build-essential zlib1g-dev libncurses-dev libz-dev libbz2-dev liblzma-dev libcurl4-openssl-dev libgsl-dev rsync unzip ncbi-blast+ pigz jq libtbb-dev openjdk-11-jre-headless autoconf r-base-core locales locales-all" \
PYTHON="python3 python3-pip python3-dev" \
PYTHON_PACKAGES="biopython"
PYTHON_PACKAGES="biopython awscli boto3"

ENV PATH=${PATH}:/usr/local/bin/mccortex/bin:/usr/local/bin/bwa-${bwa_version}:/opt/edirect \
LD_LIBRARY_PATH=/usr/local/lib

RUN export DEBIAN_FRONTEND="noninteractive"

COPY bin/ /opt/bin/
ENV PATH=/opt/bin:$PATH

RUN apt-get update \
&& DEBIAN_FRONTEND="noninteractive" apt-get install -y $PACKAGES $PYTHON \
&& pip3 install --upgrade pip \
@@ -1,6 +1,5 @@
FROM ubuntu:20.04


LABEL maintainer="[email protected]" \
about.summary="container for the vcf predict workflow"

@@ -13,6 +12,10 @@ piezo_version=0.3 \
gnomonicus_version=1.1.2 \
tuberculosis_amr_catalogues=12d38733ad2e238729a3de9f725081e1d4872968

COPY bin/ /opt/bin/
ENV PATH=/opt/bin:$PATH


RUN apt-get update \
&& DEBIAN_FRONTEND=noninteractive TZ=Etc/UTC apt-get -y install tzdata \
&& apt-get install -y $PACKAGES $PYTHON \
@@ -21,6 +24,7 @@ RUN apt-get update \
&& cd VCFMIX \
&& git checkout ${vcfmix_version} \
&& pip3 install recursive_diff \
&& pip3 install awscli \
&& pip3 install . \
&& cp -r data /usr/local/lib/python3.8/dist-packages \
&& cd ..
5 changes: 3 additions & 2 deletions main.nf
@@ -183,7 +183,7 @@ workflow {

input_files_vjson = input_files.combine(getversion.out.getversion_json)

preprocessing(input_files_vjson, krakenDB, bowtie_dir, params.afanc_myco_db)
preprocessing(input_files_vjson, krakenDB, bowtie_dir, params.afanc_myco_db, params.resource_dir, params.refseq)

// CLOCKWORK SUB-WORKFLOW

@@ -198,8 +198,9 @@ workflow {

mpileup_vcf = clockwork.out.mpileup_vcf
minos_vcf = clockwork.out.minos_vcf
genbank = channel.fromPath(params.gnomonicus_genbank)

vcfpredict(mpileup_vcf, minos_vcf)
vcfpredict(mpileup_vcf, minos_vcf, genbank)

}

66 changes: 58 additions & 8 deletions modules/clockworkModules.nf
@@ -1,5 +1,31 @@
// modules for the clockwork workflow

process getRefFromJSON {
tag { sample_name }
label 'clockwork'
label 'low_memory'
label 'low_cpu'

input:
path(species_json)
val(do_we_align)
val(sample_name)

when:
do_we_align =~ /NOW\_ALIGN\_TO\_REF\_${sample_name}/

output:
stdout

script:
"""
ref_string=\$(jq -r '.top_hit.file_paths.ref_fa' ${species_json})
echo "\$ref_string"
"""


}

process alignToRef {
/**
* @QCcheckpoint fail if insufficient number and/or quality of read alignments to the reference genome
@@ -15,6 +41,7 @@ process alignToRef {

input:
tuple val(sample_name), path(fq1), path(fq2), path(software_json), path(species_json), val(doWeAlign)
path(reference_path)

when:
doWeAlign =~ /NOW\_ALIGN\_TO\_REF\_${sample_name}/
@@ -35,9 +62,8 @@ process alignToRef {
error_log = "${sample_name}_err.json"

"""
ref_fa=\$(jq -r '.top_hit.file_paths.ref_fa' ${species_json})
cp \${ref_fa} ${sample_name}.fa
echo $reference_path
cp ${reference_path} ${sample_name}.fa
minimap2 -ax sr ${sample_name}.fa -t ${task.cpus} $fq1 $fq2 | samtools fixmate -m - - | samtools sort -T tmp - | samtools markdup --reference ${sample_name}.fa - minimap.bam
@@ -46,8 +72,8 @@ process alignToRef {
samtools index ${bam} ${bai}
samtools stats ${bam} > ${stats}
python3 ${baseDir}/bin/parse_samtools_stats.py ${bam} ${stats} > ${stats_json}
python3 ${baseDir}/bin/create_final_json.py ${stats_json} ${species_json}
parse_samtools_stats.py ${bam} ${stats} > ${stats_json}
create_final_json.py ${stats_json} ${species_json}
cp ${sample_name}_report.json ${sample_name}_report_previous.json
@@ -114,6 +140,30 @@ process callVarsMpileup {
"""
}

process getRefCortex {
tag { sample_name }
label 'clockwork'
label 'low_memory'
label 'low_cpu'

input:
tuple val(sample_name), path(report_json), path(bam), path(ref), val(doWeVarCall)

when:
doWeVarCall =~ /NOW\_VARCALL\_${sample_name}/

output:
stdout

script:
"""
ref_dir=\$(jq -r '.top_hit.file_paths.clockwork_ref_dir' ${report_json})
echo "\$ref_dir"
"""


}

process callVarsCortex {
/**
* @QCcheckpoint none
@@ -128,6 +178,7 @@ process callVarsCortex {

input:
tuple val(sample_name), path(report_json), path(bam), path(ref), val(doWeVarCall)
path(ref_dir)

when:
doWeVarCall =~ /NOW\_VARCALL\_${sample_name}/
@@ -139,9 +190,7 @@ process callVarsCortex {
cortex_vcf = "${sample_name}.cortex.vcf"

"""
ref_dir=\$(jq -r '.top_hit.file_paths.clockwork_ref_dir' ${report_json})
cp -r \${ref_dir}/* .
cp -r ${ref_dir}/* .
clockwork cortex . ${bam} cortex ${sample_name}
cp cortex/cortex.out/vcfs/cortex_wk_flow_I_RefCC_FINALcombined_BC_calls_at_all_k.raw.vcf ${cortex_vcf}
@@ -163,6 +212,7 @@ process minos {
tag { sample_name }
label 'clockwork'
label 'medium_memory'
label 'normal_cpu'

publishDir "${params.output_dir}/$sample_name/output_vcfs", mode: 'copy', pattern: '*.vcf'
publishDir "${params.output_dir}/$sample_name", mode: 'copy', overwrite: 'true', pattern: '*{_err.json,_report.json}'
2 changes: 1 addition & 1 deletion modules/getversionModules.nf
@@ -12,7 +12,7 @@ process getversion {
script:

"""
python3 ${baseDir}/bin/software-json.py ${params.sing_dir} ${params.config_dir}
software-json.py ${params.sing_dir} ${params.config_dir}
"""

stub: