Merge pull request #134 from c-BIG/develop
Release/0.13.0
mhebrard authored Sep 25, 2024
2 parents 49940e6 + 12f7847 commit e962e15
Showing 80 changed files with 6,006 additions and 3,951 deletions.
4 changes: 4 additions & 0 deletions .gitignore
@@ -18,6 +18,10 @@ tests/*/output_certified/results/mosdepth
tests/*/output_certified/results/samtools
tests/*/output_certified/results/picard
tests/*/output_certified/results/verifybamid2
tests/*/output_certified/picard_collect_multiple_metrics
tests/*/output_certified/picard_collect_wgs_metrics
tests/*/output_certified/bcftools


# MacOS
**/.DS*
25 changes: 12 additions & 13 deletions README.rst
@@ -13,10 +13,10 @@ Requirements

* `Install Nextflow`_

v23.08.1-edge or higher ::
v23.10.4 or higher ::
# Download the executable package
wget -qO- https://github.com/nextflow-io/nextflow/releases/download/v23.08.1-edge/nextflow-23.08.1-edge-all | bash
wget -qO- https://github.com/nextflow-io/nextflow/releases/download/v23.10.4/nextflow-23.10.4-all | bash
# Make the binary executable on your system
chmod +x nextflow
# Optionally, move the nextflow file to a directory accessible by your $PATH variable
@@ -45,7 +45,7 @@ Run workflow on 45Mbp region around AKT1 gene, 30X, of sample NA12878 from the 1

This creates an `output` directory with results that can be compared to the content of `output_certified` ::

diff output_certified/results/metrics/NA12878-chr14-AKT1.metrics.json output/results/multiqc/NA12878-chr14-AKT1.metrics.json
diff output_certified/results/metrics/NA12878-chr14-AKT1.metrics.json output/results/metrics/NA12878-chr14-AKT1.metrics.json
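
For a comparison that ignores key order and whitespace, a small Python check can be used instead of ``diff`` (a sketch, not part of the workflow; paths follow the command above) ::

    import json

    with open("output_certified/results/metrics/NA12878-chr14-AKT1.metrics.json") as f:
        certified = json.load(f)
    with open("output/results/metrics/NA12878-chr14-AKT1.metrics.json") as f:
        produced = json.load(f)

    if certified == produced:
        print("metrics match")
    else:
        # report the top-level sections that differ
        for key in sorted(set(certified) | set(produced)):
            if certified.get(key) != produced.get(key):
                print("difference in:", key)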

Please refer to the workflow help for more information on its usage and access to additional options: ::

@@ -59,7 +59,7 @@ Resources

The workflow requires the following resources given in the ``conf/resources.config``

- *N-regions reference file*, used as an input for computing "non-gap regions autosome" coverages (mosdepth).
- *N-regions reference file*, used as an input for computing "non-gap regions autosome" coverages (picard, bcftools).

  - Gaps in the GRCh38 (hg38) genome assembly, defined in the AGP file delivered with the sequence, are being closed during the finishing process on the human genome. The GRCh38 (hg38) assembly still contains the following principal types of gaps:

@@ -75,6 +75,8 @@ The workflow requires the following resources given in the ``conf/resources.conf

- *FASTA file index*. This file can be downloaded from ``s3://1000genomes-dragen-3.7.6/references/fasta/hg38.fa.fai`` and does not need to be specified in the config. The workflow will look for the FASTA index file in the folder where the FASTA file is located.

- *VerifyBamID2 reference panel files*, 100K sites from the 1000 Genomes Project phase 3, build 38, downloaded from ``https://github.com/Griffan/VerifyBamID/tree/master/resource/``.
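
To illustrate how the resources above feed into a "non-gap regions autosome" denominator, a simplified Python sketch (file names are placeholders; it assumes non-overlapping gap intervals and the standard ``.fai``/BED column layouts) ::

    # total autosome length from the FASTA index (columns: name, length, ...)
    AUTOSOMES = {"chr%d" % i for i in range(1, 23)}

    total = 0
    with open("hg38.fa.fai") as fai:
        for line in fai:
            name, length = line.split("\t")[:2]
            if name in AUTOSOMES:
                total += int(length)

    # subtract N-gap lengths from the N-regions BED (columns: chrom, start, end, ...)
    gaps = 0
    with open("n_regions.bed") as bed:
        for line in bed:
            chrom, start, end = line.split("\t")[:3]
            if chrom in AUTOSOMES:
                gaps += int(end) - int(start)

    print("non-gap autosome length:", total - gaps)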

Inputs
------

@@ -100,14 +102,15 @@ Upon completion, the workflow will create the following files in the ``outdir``
timeline.html
trace.txt
results/ # final metrics.json and intermediate outputs
mosdepth/
multiqc/
bcftools/
metrics/
<sample_id>.metrics.json
picard_collect_multiple_metrics/
picard_collect_wgs_metrics/
samtools/
verifybamid2/

If ``keep_workdir`` has been specified, the contents of the Nextflow work directory (``work-dir``) will also be preserved.
If the ``cleanup = true`` line in ``nextflow.config`` is commented out, the contents of the Nextflow work directory (``work-dir``) will also be preserved.
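
Independently of cleanup, the final per-sample metrics file under ``results/metrics/`` can be inspected with a few lines of Python (the sample ID is an example; the key layout follows the compile scripts shown further down) ::

    import json

    # path follows the output tree above
    with open("results/metrics/NA12878-chr14-AKT1.metrics.json") as f:
        metrics = json.load(f)

    print(metrics["biosample"]["id"])
    # metric groups written by the compile scripts (aln_metrics, variant_metrics)
    for group, values in metrics["wgs_qc_metrics"].items():
        print(group, "->", len(values), "metrics")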

Docker image
------------
@@ -132,15 +135,11 @@ In a nutshell, this workflow generates QC metrics from single-sample WGS results

**Metrics calculation**

The current workflow combines widely-used third-party tools (samtools, picard, mosdepth) and custom scripts. Full details on which processes are run/when can be found in the actual workflow definition (``main.nf``). We also provide an example dag for a more visual representation (``tests/NA12878_1000genomes-dragen-3.7.6/dag.pdf``).
The current workflow combines widely-used third-party tools (samtools, picard, bcftools, verifybamid2) and custom scripts. Full details on which processes are run, and when, can be found in the workflow definition (``main.nf``). We also provide an example DAG for a more visual representation (``tests/NA12878_1000genomes-dragen-3.7.6/dag.pdf``).

**Metrics parsing**

Next, output files from each individual tool are parsed and combined into a single json file. This is done by calling ``bin/multiqc_plugins/multiqc_npm/``, a MultiQC plugin that extends the base tool to support additional files.

**Metrics reporting**

Finally, the contents of the MultiQC json are formatted into a final metrics report, also in json format. The reporting logic lives in the ``bin/compile_metrics.py`` script, and whilst its contents are simple, it enables automatic documentation of metric definitions from code comments (see the **Metric definitions** section).
Next, output files from each individual tool are parsed and combined into a single json file.

Metric definitions
==================
64 changes: 64 additions & 0 deletions bin/compile_aln_variants_metrics.py
@@ -0,0 +1,64 @@
#!/usr/bin/env python3

import argparse
import json
from pathlib import Path

def parse_args():
parser = argparse.ArgumentParser()
parser.add_argument("--sample_id", dest="sample_id", required=True,
default=None,
help="Sample ID")
parser.add_argument("--input_aln_metrics", dest="input_aln_metrics", required=True,
default=None,
help="Path to input aln metrics list")
parser.add_argument("--input_variants_metrics", dest="input_variants_metrics", required=True,
default=None,
help="Path to input variants metrics list")
parser.add_argument("--output_json", dest="output_json", required=False,
default="./metrics.json",
help="Path to output file for variant metrics. Default: ./variant_counts.json")
parser.add_argument("--scratch_dir", dest="scratch_dir", required=False,
default="./",
help="Path to scratch dir. Default: ./")
args = parser.parse_args()

# create scratch dir if it doesn't exist
Path(args.scratch_dir).mkdir(parents=True, exist_ok=True)

return args


def data1(input_aln_metrics):
    # load the alignment metrics JSON
    with open(input_aln_metrics, 'r') as f1:
        aln_input = json.load(f1)
    return aln_input


def data2(input_variants_metrics):
    # load the variant metrics JSON
    with open(input_variants_metrics, 'r') as f2:
        variants_input = json.load(f2)
    return variants_input


def save_output(aln_data, variants_data, outfile):
    # fold the variant metrics into the alignment metrics document and write it out
    aln_data['wgs_qc_metrics']['variant_metrics'].update(
        variants_data['wgs_qc_metrics']['variant_metrics'])
    with open(outfile, "w") as f:
        json.dump(aln_data, f, sort_keys=True, indent=4)
        f.write("\n")


if __name__ == "__main__":
    args = parse_args()

    aln = data1(args.input_aln_metrics)
    variants = data2(args.input_variants_metrics)
    save_output(aln, variants, args.output_json)
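
A minimal, self-contained illustration of what ``save_output`` above does with its two inputs (field names are invented for the example):

aln = {"biosample": {"id": "SAMPLE1"},
       "wgs_qc_metrics": {"aln_metrics": {"reads_mapped": 795000000},
                          "variant_metrics": {}}}
variants = {"biosample": {"id": "SAMPLE1"},
            "wgs_qc_metrics": {"variant_metrics": {"snv_count": 4000000}}}

# same merge as save_output(): variant metrics are folded into the alignment document
aln["wgs_qc_metrics"]["variant_metrics"].update(
    variants["wgs_qc_metrics"]["variant_metrics"])
print(aln["wgs_qc_metrics"]["variant_metrics"])   # {'snv_count': 4000000}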
86 changes: 0 additions & 86 deletions bin/compile_metrics.py

This file was deleted.

65 changes: 65 additions & 0 deletions bin/count_aln.py
@@ -0,0 +1,65 @@
#!/usr/bin/env python3

import argparse
import json
from pathlib import Path


def parse_args():
parser = argparse.ArgumentParser()
parser.add_argument("--sample_id", dest="sample_id", required=True,
default=None,
help="Sample ID")
parser.add_argument("--input_metrics", dest="input_metrics", required=True,
default=None,
help="Path to input aln metrics list")
parser.add_argument("--output_json", dest="output_json", required=False,
default="./variant_counts.json",
help="Path to output file for variant metrics. Default: ./variant_counts.json")
parser.add_argument("--scratch_dir", dest="scratch_dir", required=False,
default="./",
help="Path to scratch dir. Default: ./")
args = parser.parse_args()

# create scratch dir if it doesn't exist
Path(args.scratch_dir).mkdir(parents=True, exist_ok=True)

return args

def raw_data(input_metrics):
    # parse a two-column, tab-separated metrics file into {name: number},
    # preferring int values and falling back to float
    d = {}
    with open(input_metrics) as f:
        for line in f:
            if not line.strip():
                continue
            row = line.split('\t')
            key = row[0]
            value_str = row[1]
            try:
                value = int(value_str)
            except ValueError:
                value = float(value_str.strip())
            d[key] = value
    return d


def save_output(data_metrics, sample_id, outfile):
    # wrap the parsed metrics in the shared metrics.json layout
    data_metrics = {"biosample": {"id": sample_id},
                    "wgs_qc_metrics": {"aln_metrics": data_metrics, "variant_metrics": {}}}
    with open(outfile, "w") as f:
        json.dump(data_metrics, f, sort_keys=True, indent=4)
        f.write("\n")


if __name__ == "__main__":
    args = parse_args()

    data_metrics = raw_data(args.input_metrics)
    save_output(data_metrics, args.sample_id, args.output_json)
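
For reference, a quick sketch of the tab-separated input ``raw_data`` expects (metric names and values are invented):

example = "reads_mapped\t795000000\nerror_rate\t0.0031\n"

# mirrors raw_data(): two tab-separated columns, int preferred, float as fallback
parsed = {}
for line in example.splitlines():
    if not line.strip():
        continue
    key, value_str = line.split("\t")[:2]
    try:
        value = int(value_str)
    except ValueError:
        value = float(value_str.strip())
    parsed[key] = value

print(parsed)   # {'reads_mapped': 795000000, 'error_rate': 0.0031}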