Merge pull request #134 from c-BIG/develop
Release/0.13.0
mhebrard authored Sep 25, 2024
2 parents 49940e6 + 12f7847 commit e962e15
Showing 80 changed files with 6,006 additions and 3,951 deletions.
4 changes: 4 additions & 0 deletions .gitignore
@@ -18,6 +18,10 @@ tests/*/output_certified/results/mosdepth
tests/*/output_certified/results/samtools
tests/*/output_certified/results/picard
tests/*/output_certified/results/verifybamid2
tests/*/output_certified/picard_collect_multiple_metrics
tests/*/output_certified/picard_collect_wgs_metrics
tests/*/output_certified/bcftools


# MacOS
**/.DS*
25 changes: 12 additions & 13 deletions README.rst
@@ -13,10 +13,10 @@ Requirements

* `Install Nextflow`_

v23.08.1-edge or higher ::
v23.10.4 or higher ::
# Download the executable package
wget -qO- https://github.com/nextflow-io/nextflow/releases/download/v23.08.1-edge/nextflow-23.08.1-edge-all | bash
wget -qO- https://github.com/nextflow-io/nextflow/releases/download/v23.10.4/nextflow-23.10.4-all | bash
# Make the binary executable on your system
chmod +x nextflow
# Optionally, move the nextflow file to a directory accessible by your $PATH variable
@@ -45,7 +45,7 @@ Run workflow on 45Mbp region around AKT1 gene, 30X, of sample NA12878 from the 1

This creates an `output` directory with results that can be compared to the content of `output_certified` ::

diff output_certified/results/metrics/NA12878-chr14-AKT1.metrics.json output/results/multiqc/NA12878-chr14-AKT1.metrics.json
diff output_certified/results/metrics/NA12878-chr14-AKT1.metrics.json output/results/metrics/NA12878-chr14-AKT1.metrics.json
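
For a comparison that ignores key order and whitespace, a small Python check can be used instead of ``diff`` (a sketch, not part of the workflow; paths follow the command above) ::

    import json

    with open("output_certified/results/metrics/NA12878-chr14-AKT1.metrics.json") as f:
        certified = json.load(f)
    with open("output/results/metrics/NA12878-chr14-AKT1.metrics.json") as f:
        produced = json.load(f)

    if certified == produced:
        print("metrics match")
    else:
        # report the top-level sections that differ
        for key in sorted(set(certified) | set(produced)):
            if certified.get(key) != produced.get(key):
                print("difference in:", key)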

Please refer to the workflow help for more information on its usage and access to additional options: ::

@@ -59,7 +59,7 @@ Resources

The workflow requires the following resources given in the ``conf/resources.config``

- *N-regions reference file*, used as an input for computing "non-gap regions autosome" coverages (mosdepth).
- *N-regions reference file*, used as an input for computing "non-gap regions autosome" coverages (picard, bcftools).

  - Gaps in the GRCh38 (hg38) genome assembly, defined in the AGP file delivered with the sequence, are being closed during the finishing process on the human genome. The GRCh38 (hg38) assembly still contains the following principal types of gaps:

@@ -75,6 +75,8 @@ The workflow requires the following resources given in the ``conf/resources.conf

- *FASTA file index*. This file can be downloaded from ``s3://1000genomes-dragen-3.7.6/references/fasta/hg38.fa.fai`` and does not need to be specified in the config. The workflow will look for the FASTA index file in the folder where the FASTA file is located.

- *VerifyBamID2 reference panel files*, 100K sites from the 1000 Genomes Project phase 3, build 38, downloaded from ``https://github.com/Griffan/VerifyBamID/tree/master/resource/``.
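
To illustrate how the resources above feed into a "non-gap regions autosome" denominator, a simplified Python sketch (file names are placeholders; it assumes non-overlapping gap intervals and the standard ``.fai``/BED column layouts) ::

    # total autosome length from the FASTA index (columns: name, length, ...)
    AUTOSOMES = {"chr%d" % i for i in range(1, 23)}

    total = 0
    with open("hg38.fa.fai") as fai:
        for line in fai:
            name, length = line.split("\t")[:2]
            if name in AUTOSOMES:
                total += int(length)

    # subtract N-gap lengths from the N-regions BED (columns: chrom, start, end, ...)
    gaps = 0
    with open("n_regions.bed") as bed:
        for line in bed:
            chrom, start, end = line.split("\t")[:3]
            if chrom in AUTOSOMES:
                gaps += int(end) - int(start)

    print("non-gap autosome length:", total - gaps)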

Inputs
------

@@ -100,14 +102,15 @@ Upon completion, the workflow will create the following files in the ``outdir``
timeline.html
trace.txt
results/ # final metrics.json and intermediate outputs
mosdepth/
multiqc/
bcftools/
metrics/
<sample_id>.metrics.json
picard_collect_multiple_metrics/
picard_collect_wgs_metrics/
samtools/
verifybamid2/

If ``keep_workdir`` has been specified, the contents of the Nextflow work directory (``work-dir``) will also be preserved.
If the ``cleanup = true`` line in ``nextflow.config`` is commented out, the contents of the Nextflow work directory (``work-dir``) will also be preserved.
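
Independently of cleanup, the final per-sample metrics file under ``results/metrics/`` can be inspected with a few lines of Python (the sample ID is an example; the key layout follows the compile scripts shown further down) ::

    import json

    # path follows the output tree above
    with open("results/metrics/NA12878-chr14-AKT1.metrics.json") as f:
        metrics = json.load(f)

    print(metrics["biosample"]["id"])
    # metric groups written by the compile scripts (aln_metrics, variant_metrics)
    for group, values in metrics["wgs_qc_metrics"].items():
        print(group, "->", len(values), "metrics")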

Docker image
------------
@@ -132,15 +135,11 @@ In a nutshell, this workflow generates QC metrics from single-sample WGS results

**Metrics calculation**

The current workflow combines widely-used third-party tools (samtools, picard, mosdepth) and custom scripts. Full details on which processes are run/when can be found in the actual workflow definition (``main.nf``). We also provide an example dag for a more visual representation (``tests/NA12878_1000genomes-dragen-3.7.6/dag.pdf``).
The current workflow combines widely-used third-party tools (samtools, picard, bcftools, verifybamid2) and custom scripts. Full details on which processes are run, and when, can be found in the workflow definition (``main.nf``). We also provide an example DAG for a more visual representation (``tests/NA12878_1000genomes-dragen-3.7.6/dag.pdf``).

**Metrics parsing**

Next, output files from each individual tool are parsed and combined into a single json file. This is done by calling ``bin/multiqc_plugins/multiqc_npm/``, a MultiQC plugin that extends the base tool to support additional files.

**Metrics reporting**

Finally, the contents of the MultiQC json are formatted into a final metrics report, also in json format. The reporting logic lives in the ``bin/compile_metrics.py`` script, and whilst its contents are simple, it enables automatic documentation of metric definitions from code comments (see the **Metric definitions** section).
Next, output files from each individual tool are parsed and combined into a single json file.

Metric definitions
==================
64 changes: 64 additions & 0 deletions bin/compile_aln_variants_metrics.py
@@ -0,0 +1,64 @@
#!/usr/bin/env python3

import argparse
import json
from pathlib import Path

def parse_args():
parser = argparse.ArgumentParser()
parser.add_argument("--sample_id", dest="sample_id", required=True,
default=None,
help="Sample ID")
parser.add_argument("--input_aln_metrics", dest="input_aln_metrics", required=True,
default=None,
help="Path to input aln metrics list")
parser.add_argument("--input_variants_metrics", dest="input_variants_metrics", required=True,
default=None,
help="Path to input variants metrics list")
parser.add_argument("--output_json", dest="output_json", required=False,
default="./metrics.json",
help="Path to output file for variant metrics. Default: ./variant_counts.json")
parser.add_argument("--scratch_dir", dest="scratch_dir", required=False,
default="./",
help="Path to scratch dir. Default: ./")
args = parser.parse_args()

# create scratch dir if it doesn't exist
Path(args.scratch_dir).mkdir(parents=True, exist_ok=True)

return args


def data1(input_aln_metrics):
    # load the alignment metrics JSON
    with open(input_aln_metrics, 'r') as f1:
        aln_input = json.load(f1)
    return aln_input


def data2(input_variants_metrics):
    # load the variant metrics JSON
    with open(input_variants_metrics, 'r') as f2:
        variants_input = json.load(f2)
    return variants_input


def save_output(aln_data, variants_data, outfile):
    # fold the variant metrics into the alignment metrics document and write it out
    aln_data['wgs_qc_metrics']['variant_metrics'].update(
        variants_data['wgs_qc_metrics']['variant_metrics'])
    with open(outfile, "w") as f:
        json.dump(aln_data, f, sort_keys=True, indent=4)
        f.write("\n")


if __name__ == "__main__":
    args = parse_args()

    aln = data1(args.input_aln_metrics)
    variants = data2(args.input_variants_metrics)
    save_output(aln, variants, args.output_json)
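
A minimal, self-contained illustration of what ``save_output`` above does with its two inputs (field names are invented for the example):

aln = {"biosample": {"id": "SAMPLE1"},
       "wgs_qc_metrics": {"aln_metrics": {"reads_mapped": 795000000},
                          "variant_metrics": {}}}
variants = {"biosample": {"id": "SAMPLE1"},
            "wgs_qc_metrics": {"variant_metrics": {"snv_count": 4000000}}}

# same merge as save_output(): variant metrics are folded into the alignment document
aln["wgs_qc_metrics"]["variant_metrics"].update(
    variants["wgs_qc_metrics"]["variant_metrics"])
print(aln["wgs_qc_metrics"]["variant_metrics"])   # {'snv_count': 4000000}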
86 changes: 0 additions & 86 deletions bin/compile_metrics.py

This file was deleted.

65 changes: 65 additions & 0 deletions bin/count_aln.py
@@ -0,0 +1,65 @@
#!/usr/bin/env python3

import argparse
import json
from pathlib import Path


def parse_args():
parser = argparse.ArgumentParser()
parser.add_argument("--sample_id", dest="sample_id", required=True,
default=None,
help="Sample ID")
parser.add_argument("--input_metrics", dest="input_metrics", required=True,
default=None,
help="Path to input aln metrics list")
parser.add_argument("--output_json", dest="output_json", required=False,
default="./variant_counts.json",
help="Path to output file for variant metrics. Default: ./variant_counts.json")
parser.add_argument("--scratch_dir", dest="scratch_dir", required=False,
default="./",
help="Path to scratch dir. Default: ./")
args = parser.parse_args()

# create scratch dir if it doesn't exist
Path(args.scratch_dir).mkdir(parents=True, exist_ok=True)

return args

def raw_data(input_metrics):
    # parse a two-column, tab-separated metrics file into {name: number},
    # preferring int values and falling back to float
    d = {}
    with open(input_metrics) as f:
        for line in f:
            if not line.strip():
                continue
            row = line.split('\t')
            key = row[0]
            value_str = row[1]
            try:
                value = int(value_str)
            except ValueError:
                value = float(value_str.strip())
            d[key] = value
    return d


def save_output(data_metrics, sample_id, outfile):
    # wrap the parsed metrics in the shared metrics.json layout
    data_metrics = {"biosample": {"id": sample_id},
                    "wgs_qc_metrics": {"aln_metrics": data_metrics, "variant_metrics": {}}}
    with open(outfile, "w") as f:
        json.dump(data_metrics, f, sort_keys=True, indent=4)
        f.write("\n")


if __name__ == "__main__":
    args = parse_args()

    data_metrics = raw_data(args.input_metrics)
    save_output(data_metrics, args.sample_id, args.output_json)
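
For reference, a quick sketch of the tab-separated input ``raw_data`` expects (metric names and values are invented):

example = "reads_mapped\t795000000\nerror_rate\t0.0031\n"

# mirrors raw_data(): two tab-separated columns, int preferred, float as fallback
parsed = {}
for line in example.splitlines():
    if not line.strip():
        continue
    key, value_str = line.split("\t")[:2]
    try:
        value = int(value_str)
    except ValueError:
        value = float(value_str.strip())
    parsed[key] = value

print(parsed)   # {'reads_mapped': 795000000, 'error_rate': 0.0031}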