Run VariantFiltering for Exomes [VS-1305]
mcovarr committed Jul 22, 2024
1 parent 8c0b36e commit 335bb6f
Showing 13 changed files with 97 additions and 39 deletions.
12 changes: 9 additions & 3 deletions .dockstore.yml
@@ -97,6 +97,7 @@ workflows:
branches:
- master
- ah_var_store
- vs_1375_off_target
tags:
- /.*/
- name: GvsBenchmarkExtractTask
@@ -126,6 +127,7 @@ workflows:
branches:
- master
- ah_var_store
- vs_1375_off_target
tags:
- /.*/
- name: GvsPopulateAltAllele
@@ -135,6 +137,7 @@ workflows:
branches:
- master
- ah_var_store
- vs_1375_off_target
tags:
- /.*/
- name: GvsCreateTables
@@ -164,7 +167,7 @@ workflows:
branches:
- master
- ah_var_store
- VS-1379-ploidy
- vs_1375_off_target
tags:
- /.*/
- name: GvsImportGenomes
@@ -174,6 +177,7 @@ workflows:
branches:
- master
- ah_var_store
- vs_1375_off_target
tags:
- /.*/
- name: GvsBulkIngestGenomes
@@ -183,6 +187,7 @@ workflows:
branches:
- master
- ah_var_store
- vs_1375_off_target
tags:
- /.*/
- name: GvsPrepareRangesCallset
@@ -194,6 +199,7 @@ workflows:
branches:
- master
- ah_var_store
- vs_1375_off_target
tags:
- /.*/
- name: GvsCreateVATfromVDS
@@ -280,7 +286,7 @@ workflows:
branches:
- master
- ah_var_store
- rsa_vs_1168_bge
- vs_1375_off_target
tags:
- /.*/
- name: GvsQuickstartVcfIntegration
@@ -308,7 +314,7 @@ workflows:
branches:
- master
- ah_var_store
- vs_1396_bq_query_audit
- vs_1375_off_target
tags:
- /.*/
- name: GvsIngestTieout
@@ -35,7 +35,9 @@ To see details about input requirements, see [Run Your Own Samples](https://gith

While the GVS has been tested with almost 190,000 single-sample exome GVCF files as input, only datasets of up to 100,000 exomes are currently supported by the beta workflow.

Note that, by default, the Exomes Beta workflow uses the Blended Genomes Interval List: `gs://gcp-public-data--broad-references/hg38/v0/bge_exome_calling_regions.v1.1.interval_list` for its calling regions. For information on how that interval list was generated, see `gs://gcp-public-data--broad-references/hg38/v0/bge_exome_calling_regions.v1.1.interval_list.README.md`.
Note that the Exomes Beta workflow uses two interval lists by default:
1. The calling region interval list defaults to the Blended Genomes Interval List: `gs://gcp-public-data--broad-references/hg38/v0/bge_exome_calling_regions.v1.1.interval_list`. For information on how that interval list was generated, see `gs://gcp-public-data--broad-references/hg38/v0/bge_exome_calling_regions.v1.1.interval_list.README.md`.
1. The target region interval list defaults to `gs://broad-gotc-test-storage/JointGenotyping/inputs/scientific/bge/TwistAllianceClinicalResearchExome_Covered_Targets_hg38.interval_list`. Any variants called outside these intervals will be *genotype* filtered with `OUTSIDE_OF_TARGETS` (see the spot-check sketch below).
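If you want to see how much of an extracted callset was flagged, the sketch below is one way to spot-check a single output VCF shard. It assumes `bcftools` is installed and that a shard has been copied locally; the shard name is hypothetical, and whether the flag appears in the site-level `FILTER` column, the per-sample `FT` field, or both depends on how the extract applies the filter.

```bash
# Hypothetical shard name; substitute a real shard copied from your output_gcs_dir.
vcf="000000001.vcf.gz"

# Count records whose site-level FILTER mentions OUTSIDE_OF_TARGETS (column 7).
bcftools view -H "$vcf" | awk -F'\t' '$7 ~ /OUTSIDE_OF_TARGETS/' | wc -l

# Count genotypes flagged OUTSIDE_OF_TARGETS in the per-sample FT field, if present.
bcftools query -f '[%FT\n]' "$vcf" 2>/dev/null | grep -c OUTSIDE_OF_TARGETS || true
```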

### What does it return as output?

25 changes: 14 additions & 11 deletions scripts/variantstore/docs/RUNNING_EXOMES_ON_GVS.md
@@ -31,31 +31,34 @@ This document describes the changes necessary to run exome gVCFs through the GVS
- This will import the re-blocked gVCF files into GVS. The workflow will check whether data for that sample has already been loaded into GVS. It is designed to be re-run (with the same inputs) if there is a failure during one of the workflow tasks (e.g. BigQuery write API interrupts).
- This workflow does not use the Terra Data Entity Model to run, so be sure to select the `Run workflow with inputs defined by file paths` workflow submission option.
- You will want to set the `external_sample_names`, `input_vcfs` and `input_vcf_indexes` inputs based on the GCP locations of files that contain lists of these values (in the same order). For NA12878/HG001, the files you will need for ingest are:
- `input_vcfs`: `"gs://broad-gotc-test-storage/germline_single_sample/exome/scientific/truth/master/D5327.NA12878/NA12878.rb.g.vcf.gz"`
- `input_vcf_indexes`: `"gs://broad-gotc-test-storage/germline_single_sample/exome/scientific/truth/master/D5327.NA12878/NA12878.rb.g.vcf.gz.tbi"`
- `input_vcfs`: `"gs://broad-gotc-test-storage/ExomeGermlineSingleSample/truth/scientific/master/D5327.NA12878/NA12878.rb.g.vcf.gz"`
- `input_vcf_indexes`: `"gs://broad-gotc-test-storage/ExomeGermlineSingleSample/truth/scientific/master/D5327.NA12878/NA12878.rb.g.vcf.gz.tbi"`
- Set the `interval_list` input to `"gs://gcp-public-data--broad-references/hg38/v0/bge_exome_calling_regions.v1.1.interval_list"`
1. `GvsPopulateAltAllele` workflow
- This step loads data into the `alt_allele` table from the `vet_*` tables in preparation for running the filtering step.
- This workflow does not use the Terra Data Entity Model to run, so be sure to select the `Run workflow with inputs defined by file paths` workflow submission option.
1. `GvsCreateFilterSet` workflow
- This step calculates features from the `alt_allele` table, and trains the VETS model along with site-level QC filters and loads them into BigQuery into a series of `filter_set_*` tables.
- This workflow does not use the Terra Data Entity Model to run, so be sure to select the `Run workflow with inputs defined by file paths` workflow submission option.
- Set the `interval_list` input to `"gs://gcp-public-data--broad-references/hg38/v0/bge_exome_calling_regions.v1.1.interval_list"`
- Set the `use_VQSR_lite` input to `true` to use VETS (instead of VQSR)
- Set the `GvsCreateFilterSet.interval_list` input to `"gs://gcp-public-data--broad-references/hg38/v0/bge_exome_calling_regions.v1.1.interval_list"`. *Make sure you're not setting `interval_list` on any of the called tasks by mistake!*
1. `GvsPrepareRangesCallset` workflow
- This workflow transforms the data in the vet tables into a schema optimized for VCF extraction.
- This workflow will need to be run once to extract the callset as a whole and an additional time to create the files used to calculate Precision and Sensitivity (with `control_samples` set to `true`).
- This workflow does not use the Terra Data Entity Model to run, so be sure to select the `Run workflow with inputs defined by file paths` workflow submission option.
1. `GvsExtractCallset` workflow
- This workflow is used to extract non-control samples only. See the `GvsCalculatePrecisionAndSensitivity` instructions below if running Precision and Sensitivity on control samples.
- This workflow takes the tables created in the `Prepare` step to output joint VCF shards.
- This workflow will need to be run once to extract the callset into VCF shards and an additional time to calculate Precision and Sensitivity (with `control_samples` set to `true`).
- This workflow is run only to extract the callset into VCF shards (`control_samples` set to `false` explicitly or left at its default `false` value).
- This workflow does not use the Terra Data Entity Model to run, so be sure to select the `Run workflow with inputs defined by file paths` workflow submission option.
- Set the parameter `use_interval_weights` to `false`. This forces the workflow to use the standard version of SplitIntervals, avoiding a bug seen in WeightedSplitIntervals when using exomes.
- Set the `output_gcs_dir` to a GCS location to collect all the VCF shards into one place. If you are running it twice (to calculate Precision and Sensitivity), be sure to provide distinct locations for each.
- Set the `output_gcs_dir` to a GCS location to collect all the VCF shards into one place.
1. `GvsCalculatePrecisionAndSensitivity` workflow
- This workflow needs to be run with the control-sample VCF shards from the `GvsExtractCallset` step, which were placed in the `output_gcs_dir` GCS location (see the listing sketch after this list).
- This workflow does not use the Terra Data Entity Model to run, so be sure to select the `Run workflow with inputs defined by file paths` workflow submission option.
- Truth inputs for NA12878/HG001:
- `truth_beds`: `"gs://gvs-internal/truth/HG001.exome_evaluation_regions.v1.1.bed"`
- `truth_vcfs`: `"gs://gvs-internal/truth/HG001_exome_filtered.recode.vcf.gz"`
- Input values, including truth inputs, for NA12878/HG001:
- `truth_beds`: `[ "gs://gvs-internal/truth/HG001.exome_evaluation_regions.v1.1.bed" ]`
- `truth_vcfs`: `[ "gs://gvs-internal/truth/HG001_exome_filtered.recode.vcf.gz" ]`
- `truth_vcf_indices`: `[ "gs://gvs-internal/truth/HG001_exome_filtered.recode.vcf.gz.tbi" ]`
- `vcf_eval_bed_file`: `"gs://gvs-internal/truth/HG001.exome_evaluation_regions.v1.1.bed"`
- `chromosomes`: `[ "chr1", "chr2", "chr3", "chr4", "chr5", "chr6", "chr7", "chr8", "chr9", "chr10", "chr11", "chr12", "chr13", "chr14", "chr15", "chr16", "chr17", "chr18", "chr19", "chr20", "chr21", "chr22" ]`
- `interval_list`: `"gs://gcp-public-data--broad-references/hg38/v0/bge_exome_calling_regions.v1.1.interval_list"`
- `target_interval_list`: `"gs://broad-gotc-test-storage/JointGenotyping/inputs/scientific/bge/TwistAllianceClinicalResearchExome_Covered_Targets_hg38.interval_list"`
- `is_wgs`: `false`
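Before launching `GvsCalculatePrecisionAndSensitivity`, it can help to confirm that the control-sample shards from the `GvsExtractCallset` run actually landed in the `output_gcs_dir` you plan to point it at. A minimal listing sketch, assuming `gsutil` is authenticated and using a hypothetical bucket path:

```bash
# Hypothetical control-sample extract location; substitute your own output_gcs_dir.
OUTPUT_GCS_DIR="gs://my-terra-bucket/gvs/exome-control-extract"

# List the extracted VCF shards and their indexes, and count each.
gsutil ls "${OUTPUT_GCS_DIR}/*.vcf.gz" | tee control_shards.txt | wc -l
gsutil ls "${OUTPUT_GCS_DIR}/*.vcf.gz.tbi" | wc -l
```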
@@ -10,6 +10,7 @@ workflow GvsCalculatePrecisionAndSensitivity {
String dataset_name
String filter_set_name
File interval_list = "gs://gcp-public-data--broad-references/hg38/v0/wgs_calling_regions.hg38.noCentromeres.noTelomeres.interval_list"
File? target_interval_list
File? vcf_eval_bed_file
Array[String] chromosomes = ["chr20"]
String project_id
@@ -37,6 +38,7 @@ workflow GvsCalculatePrecisionAndSensitivity {
chromosomes: "The chromosome(s) on which to analyze precision and sensitivity. The default value for this is `['chr20']`."
dataset_name: "The GVS BigQuery dataset name."
filter_set_name: "The filter_set_name used to generate the callset."
target_interval_list: "The intervals outside of which sites will be OUTSIDE_OF_TARGETS filtered."
interval_list: "The intervals over which to extract VCFs for calculating precision and sensitivity."
project_id: "The Google Project ID where the GVS lives."
sample_names: "A list of the sample names that are controls and that will be used for the analysis. For every element on the list of sample names there must be a corresponding element on the list of `truth_vcfs`, `truth_vcf_indices`, and `truth_beds`."
@@ -79,6 +81,7 @@ workflow GvsCalculatePrecisionAndSensitivity {
call_set_identifier = call_set_identifier,
filter_set_name = filter_set_name,
control_samples = true,
target_interval_list = target_interval_list,
interval_list = interval_list,
extract_scatter_count_override = extract_scatter_count_override,
cloud_sdk_docker = effective_cloud_sdk_docker,
22 changes: 22 additions & 0 deletions scripts/variantstore/wdl/GvsExtractCallset.wdl
@@ -29,6 +29,8 @@ workflow GvsExtractCallset {
File interval_list = "gs://gcp-public-data--broad-references/hg38/v0/wgs_calling_regions.hg38.noCentromeres.noTelomeres.interval_list"
File interval_weights_bed = "gs://gvs_quickstart_storage/weights/gvs_full_vet_weights_1kb_padded_orig.bed"

File? target_interval_list

String? variants_docker
String? cloud_sdk_docker
String? gatk_docker
@@ -252,6 +254,7 @@ workflow GvsExtractCallset {
convert_filtered_genotypes_to_nocalls = convert_filtered_genotypes_to_nocalls,
write_cost_to_db = write_cost_to_db,
maximum_alternate_alleles = maximum_alternate_alleles,
target_interval_list = target_interval_list,
}
}

@@ -348,6 +351,8 @@ task ExtractTask {
Int? local_sort_max_records_in_ram = 10000000
Int? maximum_alternate_alleles

File? target_interval_list

# for call-caching -- check if DB tables haven't been updated since the last run
String max_last_modified_timestamp
}
@@ -422,6 +427,23 @@ task ExtractTask {
--shard-identifier ~{intervals_name} \
~{cost_observability_line}

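# If a target interval list was supplied, post-process the freshly written shard:
# stash the original VCF and its index, index the interval list so it can be used
# as a --mask feature file, then run VariantFiltration to flag sites outside the
# targets as OUTSIDE_OF_TARGETS, writing back to the original output file name.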
if [[ -n "~{target_interval_list}" ]]
then
pre_off_target_vcf="pre_off_target_~{output_file}"
mv ~{output_file} ${pre_off_target_vcf}
mv ~{output_file}.tbi "${pre_off_target_vcf}.tbi"

gatk --java-options "-Xmx${memory_mb}m" \
IndexFeatureFile \
-I ~{target_interval_list}

gatk --java-options "-Xmx${memory_mb}m" \
VariantFiltration \
~{"--filter-not-in-mask --mask-name OUTSIDE_OF_TARGETS --mask-description 'Outside of sequencing target intervals' --mask " + target_interval_list} \
-O ~{output_file} \
-V ${pre_off_target_vcf}
fi

# Drop trailing slash if one exists
OUTPUT_GCS_DIR=$(echo ~{output_gcs_dir} | sed 's/\/$//')

34 changes: 21 additions & 13 deletions scripts/variantstore/wdl/GvsExtractCohortFromSampleNames.wdl
@@ -12,6 +12,7 @@ workflow GvsExtractCohortFromSampleNames {
# cohort_sample_names_array will take precedence over cohort_sample_names if both are set
Array[String]? cohort_sample_names_array
File? cohort_sample_names
Boolean is_wgs = true

String gvs_project
String gvs_dataset
@@ -42,6 +43,8 @@ workflow GvsExtractCohortFromSampleNames {
Int? split_intervals_disk_size_override
Int? split_intervals_mem_override

File? target_interval_list

String? git_branch_or_tag
String? gatk_docker
File? gatk_override
@@ -79,15 +82,6 @@ workflow GvsExtractCohortFromSampleNames {
cloud_sdk_docker = effective_cloud_sdk_docker
}

Int effective_scatter_count = if defined(extract_scatter_count_override) then select_first([extract_scatter_count_override])
else if GetNumSamplesLoaded.num_samples < 100 then 50 # Quickstart
else if GetNumSamplesLoaded.num_samples < 1000 then 250
else if GetNumSamplesLoaded.num_samples < 5000 then 500
else if GetNumSamplesLoaded.num_samples < 20000 then 1000 # Stroke Anderson
else if GetNumSamplesLoaded.num_samples < 50000 then 5000
else if GetNumSamplesLoaded.num_samples < 100000 then 10000 # Charlie
else 20000

# writing the array to a file has to be done in a task
# https://support.terra.bio/hc/en-us/community/posts/360071465631-write-lines-write-map-write-tsv-write-json-fail-when-run-in-a-workflow-rather-than-in-a-task
if (defined(cohort_sample_names_array)) {
@@ -99,9 +93,22 @@
}

File cohort_sample_names_file = select_first([write_array_task.output_file, cohort_sample_names])
Int effective_sample_count = length(read_lines(cohort_sample_names_file))

Int effective_scatter_count = if defined(extract_scatter_count_override) then select_first([extract_scatter_count_override])
else if is_wgs then
if effective_sample_count < 5000 then 1 # This results in 1 VCF per chromosome.
else if effective_sample_count < 20000 then 2000 # Stroke Anderson
else if effective_sample_count < 50000 then 10000
else 20000
else
if effective_sample_count < 5000 then 1 # This results in 1 VCF per chromosome.
else if effective_sample_count < 20000 then 1000
else if effective_sample_count < 50000 then 2500
else 7500
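# For example, a 10,000-sample exome cohort (is_wgs = false) scatters into 1,000 shards,
# while the same cohort treated as WGS (is_wgs = true) would scatter into 2,000.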

# allow an interval list to be passed in, but default it to our standard one if no args are here
File working_interval_list = select_first([interval_list, "gs://gcp-public-data--broad-references/hg38/v0/wgs_calling_regions.hg38.noCentromeres.noTelomeres.interval_list"])
File effective_interval_list = select_first([interval_list, "gs://gcp-public-data--broad-references/hg38/v0/wgs_calling_regions.hg38.noCentromeres.noTelomeres.interval_list"])


call GvsPrepareCallset.GvsPrepareCallset {
@@ -119,7 +126,7 @@
write_cost_to_db = write_cost_to_db,
cloud_sdk_docker = effective_cloud_sdk_docker,
enable_extract_table_ttl = true,
interval_list = working_interval_list,
interval_list = effective_interval_list,
control_samples = control_samples,
cloud_sdk_docker = effective_cloud_sdk_docker,
git_hash = effective_git_hash,
@@ -150,7 +157,7 @@
split_intervals_mem_override = split_intervals_mem_override,
extract_memory_override_gib = extract_memory_override,
disk_override = extract_disk_override,
interval_list = working_interval_list,
interval_list = effective_interval_list,
control_samples = control_samples,

cloud_sdk_docker = effective_cloud_sdk_docker,
@@ -159,7 +166,8 @@
gatk_docker = effective_gatk_docker,
git_hash = effective_git_hash,
variants_docker = effective_variants_docker,
write_cost_to_db = write_cost_to_db
write_cost_to_db = write_cost_to_db,
target_interval_list = target_interval_list,
}

output {
13 changes: 8 additions & 5 deletions scripts/variantstore/wdl/GvsJointVariantCalling.wdl
@@ -58,6 +58,8 @@ workflow GvsJointVariantCalling {
String? extract_table_prefix
String? filter_set_name

File? target_interval_list

# Overrides to be passed to GvsCreateFilterSet
Int? INDEL_VQSR_CLASSIC_max_gaussians_override = 4
Int? INDEL_VQSR_CLASSIC_mem_gb_override
@@ -70,15 +72,15 @@
Int? maximum_alternate_alleles
}

# If is_wgs is true, we'll use the WGS interval list else, we'll use the Exome interval list. We'll currently use
# the same weighted bed file for whole genomes and exomes.
# But, if interval_list is defined, we'll use that instead of choosing based on is_wgs
# If `is_wgs` is true we'll use the WGS interval list, otherwise we'll use the Exome interval list.
# We currently use the same weighted bed file for whole genomes and exomes. However if `interval_list` is defined,
# we'll use that instead of choosing based on `is_wgs`.
File default_interval_list = if (is_wgs) then "gs://gcp-public-data--broad-references/hg38/v0/wgs_calling_regions.hg38.noCentromeres.noTelomeres.interval_list"
else "gs://gcp-public-data--broad-references/hg38/v0/bge_exome_calling_regions.v1.1.interval_list"
File interval_list_to_use = select_first([interval_list, default_interval_list])

# the call_set_identifier string is used to name many different things throughout this workflow (BQ tables, vcfs etc),
# and so make sure nothing is broken by creative users, we replace spaces and underscores with hyphens
# The `call_set_identifier` string is used to name many different things throughout this workflow (BQ tables, vcfs etc).
# To make sure nothing is broken by creative users we replace spaces and underscores with hyphens.
String effective_extract_output_file_base_name = select_first([extract_output_file_base_name, sub(call_set_identifier, "\\s+|\_+", "-")])
String effective_extract_table_prefix = select_first([extract_table_prefix, sub(call_set_identifier, "\\s+|\_+", "-")])
String effective_filter_set_name = select_first([filter_set_name, sub(call_set_identifier, "\\s+|\_+", "-")])
@@ -228,6 +230,7 @@ workflow GvsJointVariantCalling {
bgzip_output_vcfs = bgzip_output_vcfs,
is_wgs = is_wgs,
maximum_alternate_alleles = maximum_alternate_alleles,
target_interval_list = target_interval_list,
}

output {