Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

How to run GIAB comparisons #7237

Merged
merged 6 commits into from
May 4, 2021
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
92 changes: 0 additions & 92 deletions scripts/variantstore/tieout/FULL_TIEOUT.md

This file was deleted.

175 changes: 175 additions & 0 deletions scripts/variantstore/tieout/GIAB_TIEOUT.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,175 @@
# NOTES
- Previously there was an attempt to tie out the feature extract input to VQSR between WARP and GVS. However, that was abandoned in favor of the truth-sample based comparison here due to non-determinism and bugs in GenomicsDB that were hard to compensate for. However, the docs and scripts for doing this are in the branch `gvs_vqsr_input_tieout` under `scripts/variantstore/tieout/feature_extract`

## Prerequisites

- rtg
- bcftools
- tabix
- python 3.7+

Copy link
Contributor

@RoriCremer RoriCremer May 5, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

my immediate reaction was that I should pip install them, but that's clearly wrong.
I know I'm a noob with this project, but a little more context on the prereqs would be immensely helfpul and would have saved me a lot of time googling

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add samtools?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

conda create --name gvs python=3.8
conda activate gvs
conda install -c bioconda samtools=1.9 --force-reinstall
conda install -c bioconda bcftools
conda install -c bioconda rtg-tools

Copy link
Contributor

@RoriCremer RoriCremer May 18, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also add GATK to prereqs -- should this be done through conda tho? isn't gatk technically part of this repository anyway?

## One time tasks
```

# create SDF for reference (used by rtg eval)
# use local path to GRC38 reference fasta
REFERENCE="/Users/kcibul/projects/references/hg38/v0/Homo_sapiens_assembly38.fasta"
rtg format --output human_REF_SDF $REFERENCE

# copy down truth data locally
mkdir -p truth && gsutil -m cp gs://broad-dsp-spec-ops/gvs/truth/* truth/
```

## Obtain Truth sample VCFs

First, create a full cohort extract (as described in README.md) using the desired filter_set_name. Assuming this is in a single gathered VCF of `gvs.vcf.gz`
Copy link
Contributor

@RoriCremer RoriCremer May 17, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would this be more like "once you have a full cohort extract you want to compare" ?


```
# subset to chr20
bcftools view -O z gvs.vcf.gz chr20 > gvs.chr20.vcf.gz
tabix gvs.chr20.vcf.gz
RoriCremer marked this conversation as resolved.
Show resolved Hide resolved

# extract each of the samples
# NOTE: excluding NO_VARIATION just means to drop sites that are reference in the selected sample
INPUT_VCF="gvs.chr20.vcf.gz"

gatk SelectVariants -V ${INPUT_VCF} --sample-name SM-G947Y --select-type-to-exclude NO_VARIATION -O NA12878.gvs.chr20.vcf.gz
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this may be minor, but can we document explicitly what --select-type-to-exclude NO_VARIATION does? i'm guessing it means only return sites at which this sample has a variant, i.e. exclude ref sites?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

gatk SelectVariants -V ${INPUT_VCF} --sample-name CHMI_CHMI3_WGS1 --select-type-to-exclude NO_VARIATION -O SYNDIP.gvs.chr20.vcf.gz
gatk SelectVariants -V ${INPUT_VCF} --sample-name 1202194294 --select-type-to-exclude NO_VARIATION -O BI_HG002.gvs.chr20.vcf.gz
gatk SelectVariants -V ${INPUT_VCF} --sample-name 1202243693 --select-type-to-exclude NO_VARIATION -O BI_HG003.gvs.chr20.vcf.gz
gatk SelectVariants -V ${INPUT_VCF} --sample-name 573673 --select-type-to-exclude NO_VARIATION -O UW_HG002.gvs.chr20.vcf.gz

```

## script to add "AS_MAX_VQSLOD" to VCFs
```
for sample in NA12878 SYNDIP BI_HG002 BI_HG003 UW_HG002
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wow 😍 for this block

do
python ../add_max_as_vqslod.py ${sample}.gvs.chr20.vcf.gz | bgzip > ${sample}.gvs.chr20.maxas.vcf.gz && tabix ${sample}.gvs.chr20.maxas.vcf.gz
done
```

## Evaluate

First create the ROC curves using only the PASSing records
```
SOURCE="gvs"
BASE_CMD="rtg vcfeval --region chr20 --roc-subset snp,indel --vcf-score-field=INFO.MAX_AS_VQSLOD -t human_REF_SDF"
SUFFIX="_roc_filtered"

${BASE_CMD} -b truth/HG001_GRCh38_GIAB_highconf_CG-IllFB-IllGATKHC-Ion-10X-SOLID_CHROM1-X_v.3.3.2_highconf_PGandRTGphasetransfer.vcf.gz \
-e truth/HG001.gvs.evaluation.bed \
-c NA12878.${SOURCE}.chr20.maxas.vcf.gz -o NA12878_${SOURCE}${SUFFIX}

${BASE_CMD} -b truth/HG002_GRCh38_1_22_v4.2.1_benchmark.vcf.gz -e truth/HG002.gvs.evaluation.bed -c BI_HG002.${SOURCE}.chr20.maxas.vcf.gz -o BI_HG002_${SOURCE}${SUFFIX}
${BASE_CMD} -b truth/HG002_GRCh38_1_22_v4.2.1_benchmark.vcf.gz -e truth/HG002.gvs.evaluation.bed -c UW_HG002.${SOURCE}.chr20.maxas.vcf.gz -o UW_HG002_${SOURCE}${SUFFIX}
${BASE_CMD} -b truth/HG003_GRCh38_1_22_v4.2.1_benchmark.vcf.gz -e truth/HG003.gvs.evaluation.bed -c BI_HG003.${SOURCE}.chr20.maxas.vcf.gz -o BI_HG003_${SOURCE}${SUFFIX}
${BASE_CMD} -b truth/CHM.full.38.vcf.gz -e truth/CHM.gvs.evaluation.bed -c SYNDIP.${SOURCE}.chr20.maxas.vcf.gz -o syndip_${SOURCE}${SUFFIX}
```

The do the same thing but use all records
Copy link
Contributor

@RoriCremer RoriCremer May 18, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"then" ?

```
SOURCE="gvs"
BASE_CMD="rtg vcfeval --region chr20 --all-records --roc-subset snp,indel --vcf-score-field=INFO.MAX_AS_VQSLOD -t human_REF_SDF"
SUFFIX="_roc_all"

${BASE_CMD} -b truth/HG001_GRCh38_GIAB_highconf_CG-IllFB-IllGATKHC-Ion-10X-SOLID_CHROM1-X_v.3.3.2_highconf_PGandRTGphasetransfer.vcf.gz \
-e truth/HG001.gvs.evaluation.bed \
-c NA12878.${SOURCE}.chr20.maxas.vcf.gz -o NA12878_${SOURCE}${SUFFIX}

${BASE_CMD} -b truth/HG002_GRCh38_1_22_v4.2.1_benchmark.vcf.gz -e truth/HG002.gvs.evaluation.bed -c BI_HG002.${SOURCE}.chr20.maxas.vcf.gz -o BI_HG002_${SOURCE}${SUFFIX}
${BASE_CMD} -b truth/HG002_GRCh38_1_22_v4.2.1_benchmark.vcf.gz -e truth/HG002.gvs.evaluation.bed -c UW_HG002.${SOURCE}.chr20.maxas.vcf.gz -o UW_HG002_${SOURCE}${SUFFIX}
${BASE_CMD} -b truth/HG003_GRCh38_1_22_v4.2.1_benchmark.vcf.gz -e truth/HG003.gvs.evaluation.bed -c BI_HG003.${SOURCE}.chr20.maxas.vcf.gz -o BI_HG003_${SOURCE}${SUFFIX}
${BASE_CMD} -b truth/CHM.full.38.vcf.gz -e truth/CHM.gvs.evaluation.bed -c SYNDIP.${SOURCE}.chr20.maxas.vcf.gz -o syndip_${SOURCE}${SUFFIX}
```

## Gather Precision and Sensitivity numbers

The columns here are: filename, type, FPs, FNs, precision, sensitivity

```
for type in "snp" "indel"
do
for suffix in "_all" "_filtered"
do
for f in $(ls -1 *${suffix}/${type}*.tsv.gz); do
d=$(cat $f | gunzip | tail -1 | cut -f3,5,6,7)
s=$(echo $f | cut -d"/" -f1 )
echo -e "$s\t$type\t$d"
done
done
done

```
## View ROCs

You can use wildcards to pull in multiple datasets into the graphical viewer. For example

```
rtg roc NA12878_*_roc*/snp_roc.tsv.gz
rtg roc NA12878_*_roc*/indel_roc.tsv.gz
```

## Appendix A: Processing WARP results against gvs_sci_tieout_acmg_cohort workspace

Starting with the output of FinalGatherVcf run from the WDLs and inputs in the `wdl/legacy` directory.

If this was run on the specops cromwell you can get that final VCF with:

```
WORKFLOW_ID=49f5c0a5-dbfb-4b05-b293-11ddadeae8e5
gsutil -m cp gs://broad-dsp-spec-ops-cromwell-execution/JointGenotyping/${WORKFLOW_ID}/call-FinalGatherVcf/warp_tieout_acmg_cohort_v1.vcf.gz* .
```

Next is to subset to chr20, and the select out the 5 truth samples into their own VCFs. Assuming a gathered initial VCF of "warp_tieout_acmg_cohort_v1.vcf.gz"

```
bcftools view -O z warp_tieout_acmg_cohort_v1.vcf.gz chr20 > warp_tieout_acmg_cohort_v1.chr20.vcf.gz
tabix warp_tieout_acmg_cohort_v1.chr20.vcf.gz

INPUT_VCF="warp_tieout_acmg_cohort_v1.chr20.vcf.gz"
SOURCE="warp"
~/gatk SelectVariants -V ${INPUT_VCF} --sample-name SM-G947Y --select-type-to-exclude NO_VARIATION -O NA12878.${SOURCE}.chr20.vcf.gz
~/gatk SelectVariants -V ${INPUT_VCF} --sample-name CHMI_CHMI3_WGS1 --select-type-to-exclude NO_VARIATION -O SYNDIP.${SOURCE}.chr20.vcf.gz
~/gatk SelectVariants -V ${INPUT_VCF} --sample-name 1202194294 --select-type-to-exclude NO_VARIATION -O BI_HG002.${SOURCE}.chr20.vcf.gz
~/gatk SelectVariants -V ${INPUT_VCF} --sample-name 1202243693 --select-type-to-exclude NO_VARIATION -O BI_HG003.${SOURCE}.chr20.vcf.gz
~/gatk SelectVariants -V ${INPUT_VCF} --sample-name 573673 --select-type-to-exclude NO_VARIATION -O UW_HG002.${SOURCE}.chr20.vcf.gz
```

From here it's a similar process to the evaluation of the GVS-based data

## Appendix B: AoU Tieout

The processing for extracting these samples from a large AoU joint callset is slightly different. First we inspect the shards to find the set of shards that encompass chr20 entirely.

E.g. looking at the following for different values of `shard-???`

```
gsutil cat gs://fc-secure-daa074a0-884f-4b67-a79b-cb5cd1556c9f/8331ab64-7bb5-4b1c-9c6f-06765bfd6b15/GvsExtractCallset/e045f320-4f58-440c-9680-6c557c6538ee/call-ExtractTask/shard-223/alpha1_1000_*.vcf.gz | gunzip | grep -v "#" | cut -f1-8 | head
```

Once the range of shards is known we can gather those into a single chr20 gVCF for all samples with something like:

```
gatk --java-options -Xms6g GatherVcfsCloud --ignore-safety-checks --gather-type BLOCK \
-I gs://fc-secure-daa074a0-884f-4b67-a79b-cb5cd1556c9f/e73809fa-9ef2-4f3b-965c-846eff4b905e/GvsExtractCallset/48c56780-2520-444f-b1b3-83091a74bb08/call-ExtractTask/shard-223/alpha1_1000_223.vcf.gz \
-I gs://fc-secure-daa074a0-884f-4b67-a79b-cb5cd1556c9f/e73809fa-9ef2-4f3b-965c-846eff4b905e/GvsExtractCallset/48c56780-2520-444f-b1b3-83091a74bb08/call-ExtractTask/shard-224/alpha1_1000_224.vcf.gz \
-I gs://fc-secure-daa074a0-884f-4b67-a79b-cb5cd1556c9f/e73809fa-9ef2-4f3b-965c-846eff4b905e/GvsExtractCallset/48c56780-2520-444f-b1b3-83091a74bb08/call-ExtractTask/shard-225/attempt-2/alpha1_1000_225.vcf.gz \
-I gs://fc-secure-daa074a0-884f-4b67-a79b-cb5cd1556c9f/e73809fa-9ef2-4f3b-965c-846eff4b905e/GvsExtractCallset/48c56780-2520-444f-b1b3-83091a74bb08/call-ExtractTask/shard-226/alpha1_1000_226.vcf.gz \
-I gs://fc-secure-daa074a0-884f-4b67-a79b-cb5cd1556c9f/e73809fa-9ef2-4f3b-965c-846eff4b905e/GvsExtractCallset/48c56780-2520-444f-b1b3-83091a74bb08/call-ExtractTask/shard-227/alpha1_1000_227.vcf.gz \
-I gs://fc-secure-daa074a0-884f-4b67-a79b-cb5cd1556c9f/e73809fa-9ef2-4f3b-965c-846eff4b905e/GvsExtractCallset/48c56780-2520-444f-b1b3-83091a74bb08/call-ExtractTask/shard-228/alpha1_1000_228.vcf.gz \
-I gs://fc-secure-daa074a0-884f-4b67-a79b-cb5cd1556c9f/e73809fa-9ef2-4f3b-965c-846eff4b905e/GvsExtractCallset/48c56780-2520-444f-b1b3-83091a74bb08/call-ExtractTask/shard-229/alpha1_1000_229.vcf.gz \
--output aou.alpha1v2.chr19-21.vcf.gz
```

Now we can create single sample gVCFs for the 3 GIAB samples in AoU

```
INPUT_VCF=aou.alpha1v2.chr19-21.vcf.gz
SOURCE=aou-bq
~/gatk SelectVariants -V ${INPUT_VCF} -L chr20 --sample-name BI_HG-002 --select-type-to-exclude NO_VARIATION -O BI_HG002.${SOURCE}.chr20.vcf.gz
~/gatk SelectVariants -V ${INPUT_VCF} -L chr20 --sample-name BI_HG-003 --select-type-to-exclude NO_VARIATION -O BI_HG003.${SOURCE}.chr20.vcf.gz
~/gatk SelectVariants -V ${INPUT_VCF} -L chr20 --sample-name UW_HG-002 --select-type-to-exclude NO_VARIATION -O UW_HG002.${SOURCE}.chr20.vcf.gz
```

These can be analyzed similar to the initial GVS tieout
45 changes: 35 additions & 10 deletions scripts/variantstore/tieout/add_max_as_vqslod.py
Original file line number Diff line number Diff line change
Expand Up @@ -19,23 +19,48 @@

parts = line.split("\t")

if (parts[6] == "ExcessHet"):
if ("ExcessHet" in parts[6]):
continue

# if the "loose" argument is specified, we remove the non Excess Het site-specific filers
# of EXCESS_ALLELES and NO_HQ_GENOTYPES. This is used to measure the effect of these filters
# when comparing to WARP (which doesn't employ these filters).
if (len(sys.argv) > 2 and sys.argv[2] == "loose"):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't see this option in the readme. When should this be used?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

documented

# undo
x = parts[6].replace("EXCESS_ALLELES","").replace("NO_HQ_GENOTYPES","").strip(";")
if x == "":
parts[6] = "PASS"
else:
parts[6] = x
else:
if ("EXCESS_ALLELES" in parts[6] or "NO_HQ_GENOTYPES" in parts[6]):
continue;

info = parts[7]
d = dict([ tuple(x.split("=")) for x in info.split(";") if "=" in x])

format_key = [x for x in parts[8].split(":")]
sample_data = dict(zip(format_key, parts[9].split(":")))

gt = sample_data['GT']

if gt == "0/0" or gt == "./.":
continue;

if "," in d['AS_VQSLOD']:
pieces = [x for x in d['AS_VQSLOD'].split(",") if (x != "." and x != "NaN") ]
if "AS_VQSLOD" not in d:
sys.exit(f"Cannot find AS_VQSLOD in {line}")
else:
if "," in d['AS_VQSLOD']:
pieces = [x for x in d['AS_VQSLOD'].split(",") if (x != "." and x != "NaN") ]

if (len(pieces) == 1):
m = pieces[0]
elif (len(pieces) == 0):
m = "."
if (len(pieces) == 1):
m = pieces[0]
elif (len(pieces) == 0):
m = "."
else:
m = max([float(x) for x in pieces])
else:
m = max([float(x) for x in pieces])
else:
m = d['AS_VQSLOD']
m = d['AS_VQSLOD']

parts[7] = f"{info};MAX_AS_VQSLOD={m}"
print("\t".join(parts))
Loading