Run VariantFiltering for Exomes [VS-1305]
mcovarr committed Jul 22, 2024
1 parent 8c0b36e commit 335bb6f
Showing 13 changed files with 97 additions and 39 deletions.
12 changes: 9 additions & 3 deletions .dockstore.yml
@@ -97,6 +97,7 @@ workflows:
branches:
- master
- ah_var_store
- vs_1375_off_target
tags:
- /.*/
- name: GvsBenchmarkExtractTask
@@ -126,6 +127,7 @@ workflows:
branches:
- master
- ah_var_store
- vs_1375_off_target
tags:
- /.*/
- name: GvsPopulateAltAllele
@@ -135,6 +137,7 @@ workflows:
branches:
- master
- ah_var_store
- vs_1375_off_target
tags:
- /.*/
- name: GvsCreateTables
@@ -164,7 +167,7 @@ workflows:
branches:
- master
- ah_var_store
- VS-1379-ploidy
- vs_1375_off_target
tags:
- /.*/
- name: GvsImportGenomes
@@ -174,6 +177,7 @@ workflows:
branches:
- master
- ah_var_store
- vs_1375_off_target
tags:
- /.*/
- name: GvsBulkIngestGenomes
@@ -183,6 +187,7 @@ workflows:
branches:
- master
- ah_var_store
- vs_1375_off_target
tags:
- /.*/
- name: GvsPrepareRangesCallset
@@ -194,6 +199,7 @@ workflows:
branches:
- master
- ah_var_store
- vs_1375_off_target
tags:
- /.*/
- name: GvsCreateVATfromVDS
@@ -280,7 +286,7 @@ workflows:
branches:
- master
- ah_var_store
- rsa_vs_1168_bge
- vs_1375_off_target
tags:
- /.*/
- name: GvsQuickstartVcfIntegration
@@ -308,7 +314,7 @@ workflows:
branches:
- master
- ah_var_store
- vs_1396_bq_query_audit
- vs_1375_off_target
tags:
- /.*/
- name: GvsIngestTieout
@@ -35,7 +35,9 @@ To see details about input requirements, see [Run Your Own Samples](https://gith

While the GVS has been tested with almost 190,000 single-sample exome GVCF files as input, only datasets of up to 100,000 exomes are currently supported by the beta workflow.

Note that, by default, the Exomes Beta workflow uses the Blended Genomes Interval List: `gs://gcp-public-data--broad-references/hg38/v0/bge_exome_calling_regions.v1.1.interval_list` for its calling regions. For information on how that interval list was generated, see `gs://gcp-public-data--broad-references/hg38/v0/bge_exome_calling_regions.v1.1.interval_list.README.md`.
Note that the Exomes Beta workflow uses two interval lists by default:
1. The calling region interval list defaults to the Blended Genomes Interval List: `gs://gcp-public-data--broad-references/hg38/v0/bge_exome_calling_regions.v1.1.interval_list`. For information on how that interval list was generated, see `gs://gcp-public-data--broad-references/hg38/v0/bge_exome_calling_regions.v1.1.interval_list.README.md`.
1. The target region interval list defaults to `gs://broad-gotc-test-storage/JointGenotyping/inputs/scientific/bge/TwistAllianceClinicalResearchExome_Covered_Targets_hg38.interval_list`. Any variants called outside these intervals will be *genotype* filtered with `OUTSIDE_OF_TARGETS` (see the spot-check sketch below).
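If you want to see how much of an extracted callset was flagged, the sketch below is one way to spot-check a single output VCF shard. It assumes `bcftools` is installed and that a shard has been copied locally; the shard name is hypothetical, and whether the flag appears in the site-level `FILTER` column, the per-sample `FT` field, or both depends on how the extract applies the filter.

```bash
# Hypothetical shard name; substitute a real shard copied from your output_gcs_dir.
vcf="000000001.vcf.gz"

# Count records whose site-level FILTER mentions OUTSIDE_OF_TARGETS (column 7).
bcftools view -H "$vcf" | awk -F'\t' '$7 ~ /OUTSIDE_OF_TARGETS/' | wc -l

# Count genotypes flagged OUTSIDE_OF_TARGETS in the per-sample FT field, if present.
bcftools query -f '[%FT\n]' "$vcf" 2>/dev/null | grep -c OUTSIDE_OF_TARGETS || true
```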

### What does it return as output?

25 changes: 14 additions & 11 deletions scripts/variantstore/docs/RUNNING_EXOMES_ON_GVS.md
@@ -31,31 +31,34 @@ This document describes the changes necessary to run exome gVCFs through the GVS
- This will import the re-blocked gVCF files into GVS. The workflow will check whether data for that sample has already been loaded into GVS. It is designed to be re-run (with the same inputs) if there is a failure during one of the workflow tasks (e.g. BigQuery write API interrupts).
- This workflow does not use the Terra Data Entity Model to run, so be sure to select the `Run workflow with inputs defined by file paths` workflow submission option.
- You will want to set the `external_sample_names`, `input_vcfs` and `input_vcf_indexes` inputs based on the GCP locations of files that contain lists of these values (in the same order). For NA12878/HG001, the files you will need for ingest are:
- `input_vcfs`: `"gs://broad-gotc-test-storage/germline_single_sample/exome/scientific/truth/master/D5327.NA12878/NA12878.rb.g.vcf.gz"`
- `input_vcf_indexes`: `"gs://broad-gotc-test-storage/germline_single_sample/exome/scientific/truth/master/D5327.NA12878/NA12878.rb.g.vcf.gz.tbi"`
- `input_vcfs`: `"gs://broad-gotc-test-storage/ExomeGermlineSingleSample/truth/scientific/master/D5327.NA12878/NA12878.rb.g.vcf.gz"`
- `input_vcf_indexes`: `"gs://broad-gotc-test-storage/ExomeGermlineSingleSample/truth/scientific/master/D5327.NA12878/NA12878.rb.g.vcf.gz.tbi"`
- Set the `interval_list` input to `"gs://gcp-public-data--broad-references/hg38/v0/bge_exome_calling_regions.v1.1.interval_list"`
1. `GvsPopulateAltAllele` workflow
- This step loads data into the `alt_allele` table from the `vet_*` tables in preparation for running the filtering step.
- This workflow does not use the Terra Data Entity Model to run, so be sure to select the `Run workflow with inputs defined by file paths` workflow submission option.
1. `GvsCreateFilterSet` workflow
- This step calculates features from the `alt_allele` table, and trains the VETS model along with site-level QC filters and loads them into BigQuery into a series of `filter_set_*` tables.
- This workflow does not use the Terra Data Entity Model to run, so be sure to select the `Run workflow with inputs defined by file paths` workflow submission option.
- Set the `interval_list` input to `"gs://gcp-public-data--broad-references/hg38/v0/bge_exome_calling_regions.v1.1.interval_list"`
- Set the `use_VQSR_lite` input to `true` to use VETS (instead of VQSR)
- Set the `GvsCreateFilterSet.interval_list` input to `"gs://gcp-public-data--broad-references/hg38/v0/bge_exome_calling_regions.v1.1.interval_list"`. *Make sure you're not setting `interval_list` on any of the called tasks by mistake!*
1. `GvsPrepareRangesCallset` workflow
- This workflow transforms the data in the vet tables into a schema optimized for VCF extraction.
- This workflow will need to be run once to extract the callset as a whole and an additional time to create the files used to calculate Precision and Sensitivity (with `control_samples` set to `true`).
- This workflow does not use the Terra Data Entity Model to run, so be sure to select the `Run workflow with inputs defined by file paths` workflow submission option.
1. `GvsExtractCallset` workflow
- This workflow is used to extract non-control samples only. See the `GvsCalculatePrecisionAndSensitivity` instructions below if running Precision and Sensitivity on control samples.
- This workflow takes the tables created in the `Prepare` step to output joint VCF shards.
- This workflow will need to be run once to extract the callset into VCF shards and an additional time to calculate Precision and Sensitivity (with `control_samples` set to `true`).
- This workflow is run only to extract the callset into VCF shards (`control_samples` set to `false` explicitly or left at its default `false` value).
- This workflow does not use the Terra Data Entity Model to run, so be sure to select the `Run workflow with inputs defined by file paths` workflow submission option.
- Set the parameter `use_interval_weights` to `false`. This forces the workflow to use the standard version of SplitIntervals, avoiding a bug seen in WeightedSplitIntervals when using exomes.
- Set the `output_gcs_dir` to a GCS location to collect all the VCF shards into one place. If you are running it twice (to calculate Precision and Sensitivity), be sure to provide distinct locations for each.
- Set the `output_gcs_dir` to a GCS location to collect all the VCF shards into one place.
1. `GvsCalculatePrecisionAndSensitivity` workflow
- This workflow needs to be run with the control-sample VCF shards from the `GvsExtractCallset` step, which were placed in the `output_gcs_dir` GCS location (see the listing sketch after this list).
- This workflow does not use the Terra Data Entity Model to run, so be sure to select the `Run workflow with inputs defined by file paths` workflow submission option.
- Truth inputs for NA12878/HG001:
- `truth_beds`: `"gs://gvs-internal/truth/HG001.exome_evaluation_regions.v1.1.bed"`
- `truth_vcfs`: `"gs://gvs-internal/truth/HG001_exome_filtered.recode.vcf.gz"`
- Input values, including truth inputs, for NA12878/HG001:
- `truth_beds`: `[ "gs://gvs-internal/truth/HG001.exome_evaluation_regions.v1.1.bed" ]`
- `truth_vcfs`: `[ "gs://gvs-internal/truth/HG001_exome_filtered.recode.vcf.gz" ]`
- `truth_vcf_indices`: `[ "gs://gvs-internal/truth/HG001_exome_filtered.recode.vcf.gz.tbi" ]`
- `vcf_eval_bed_file`: `"gs://gvs-internal/truth/HG001.exome_evaluation_regions.v1.1.bed"`
- `chromosomes`: `[ "chr1", "chr2", "chr3", "chr4", "chr5", "chr6", "chr7", "chr8", "chr9", "chr10", "chr11", "chr12", "chr13", "chr14", "chr15", "chr16", "chr17", "chr18", "chr19", "chr20", "chr21", "chr22" ]`
- `interval_list`: `"gs://gcp-public-data--broad-references/hg38/v0/bge_exome_calling_regions.v1.1.interval_list"`
- `target_interval_list`: `"gs://broad-gotc-test-storage/JointGenotyping/inputs/scientific/bge/TwistAllianceClinicalResearchExome_Covered_Targets_hg38.interval_list"`
- `is_wgs`: `false`
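Before launching `GvsCalculatePrecisionAndSensitivity`, it can help to confirm that the control-sample shards from the `GvsExtractCallset` run actually landed in the `output_gcs_dir` you plan to point it at. A minimal listing sketch, assuming `gsutil` is authenticated and using a hypothetical bucket path:

```bash
# Hypothetical control-sample extract location; substitute your own output_gcs_dir.
OUTPUT_GCS_DIR="gs://my-terra-bucket/gvs/exome-control-extract"

# List the extracted VCF shards and their indexes, and count each.
gsutil ls "${OUTPUT_GCS_DIR}/*.vcf.gz" | tee control_shards.txt | wc -l
gsutil ls "${OUTPUT_GCS_DIR}/*.vcf.gz.tbi" | wc -l
```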
@@ -10,6 +10,7 @@ workflow GvsCalculatePrecisionAndSensitivity {
String dataset_name
String filter_set_name
File interval_list = "gs://gcp-public-data--broad-references/hg38/v0/wgs_calling_regions.hg38.noCentromeres.noTelomeres.interval_list"
File? target_interval_list
File? vcf_eval_bed_file
Array[String] chromosomes = ["chr20"]
String project_id
@@ -37,6 +38,7 @@ workflow GvsCalculatePrecisionAndSensitivity {
chromosomes: "The chromosome(s) on which to analyze precision and sensitivity. The default value for this is `['chr20']`."
dataset_name: "The GVS BigQuery dataset name."
filter_set_name: "The filter_set_name used to generate the callset."
target_interval_list: "The intervals outside of which sites will be OUTSIDE_OF_TARGETS filtered."
interval_list: "The intervals over which to extract VCFs for calculating precision and sensitivity."
project_id: "The Google Project ID where the GVS lives."
sample_names: "A list of the sample names that are controls and that will be used for the analysis. For every element on the list of sample names there must be a corresponding element on the list of `truth_vcfs`, `truth_vcf_indices`, and `truth_beds`."
@@ -79,6 +81,7 @@ workflow GvsCalculatePrecisionAndSensitivity {
call_set_identifier = call_set_identifier,
filter_set_name = filter_set_name,
control_samples = true,
target_interval_list = target_interval_list,
interval_list = interval_list,
extract_scatter_count_override = extract_scatter_count_override,
cloud_sdk_docker = effective_cloud_sdk_docker,
22 changes: 22 additions & 0 deletions scripts/variantstore/wdl/GvsExtractCallset.wdl
@@ -29,6 +29,8 @@ workflow GvsExtractCallset {
File interval_list = "gs://gcp-public-data--broad-references/hg38/v0/wgs_calling_regions.hg38.noCentromeres.noTelomeres.interval_list"
File interval_weights_bed = "gs://gvs_quickstart_storage/weights/gvs_full_vet_weights_1kb_padded_orig.bed"

File? target_interval_list

String? variants_docker
String? cloud_sdk_docker
String? gatk_docker
@@ -252,6 +254,7 @@ workflow GvsExtractCallset {
convert_filtered_genotypes_to_nocalls = convert_filtered_genotypes_to_nocalls,
write_cost_to_db = write_cost_to_db,
maximum_alternate_alleles = maximum_alternate_alleles,
target_interval_list = target_interval_list,
}
}

@@ -348,6 +351,8 @@ task ExtractTask {
Int? local_sort_max_records_in_ram = 10000000
Int? maximum_alternate_alleles

File? target_interval_list

# for call-caching -- check if DB tables haven't been updated since the last run
String max_last_modified_timestamp
}
@@ -422,6 +427,23 @@ task ExtractTask {
--shard-identifier ~{intervals_name} \
~{cost_observability_line}

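# If a target interval list was supplied, post-process the freshly written shard:
# stash the original VCF and its index, index the interval list so it can be used
# as a --mask feature file, then run VariantFiltration to flag sites outside the
# targets as OUTSIDE_OF_TARGETS, writing back to the original output file name.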
if [[ -n "~{target_interval_list}" ]]
then
pre_off_target_vcf="pre_off_target_~{output_file}"
mv ~{output_file} ${pre_off_target_vcf}
mv ~{output_file}.tbi "${pre_off_target_vcf}.tbi"

gatk --java-options "-Xmx${memory_mb}m" \
IndexFeatureFile \
-I ~{target_interval_list}

gatk --java-options "-Xmx${memory_mb}m" \
VariantFiltration \
~{"--filter-not-in-mask --mask-name OUTSIDE_OF_TARGETS --mask-description 'Outside of sequencing target intervals' --mask " + target_interval_list} \
-O ~{output_file} \
-V ${pre_off_target_vcf}
fi

# Drop trailing slash if one exists
OUTPUT_GCS_DIR=$(echo ~{output_gcs_dir} | sed 's/\/$//')

34 changes: 21 additions & 13 deletions scripts/variantstore/wdl/GvsExtractCohortFromSampleNames.wdl
@@ -12,6 +12,7 @@ workflow GvsExtractCohortFromSampleNames {
# cohort_sample_names_array will take precedence over cohort_sample_names if both are set
Array[String]? cohort_sample_names_array
File? cohort_sample_names
Boolean is_wgs = true

String gvs_project
String gvs_dataset
@@ -42,6 +43,8 @@ workflow GvsExtractCohortFromSampleNames {
Int? split_intervals_disk_size_override
Int? split_intervals_mem_override

File? target_interval_list

String? git_branch_or_tag
String? gatk_docker
File? gatk_override
@@ -79,15 +82,6 @@ workflow GvsExtractCohortFromSampleNames {
cloud_sdk_docker = effective_cloud_sdk_docker
}

Int effective_scatter_count = if defined(extract_scatter_count_override) then select_first([extract_scatter_count_override])
else if GetNumSamplesLoaded.num_samples < 100 then 50 # Quickstart
else if GetNumSamplesLoaded.num_samples < 1000 then 250
else if GetNumSamplesLoaded.num_samples < 5000 then 500
else if GetNumSamplesLoaded.num_samples < 20000 then 1000 # Stroke Anderson
else if GetNumSamplesLoaded.num_samples < 50000 then 5000
else if GetNumSamplesLoaded.num_samples < 100000 then 10000 # Charlie
else 20000

# writing the array to a file has to be done in a task
# https://support.terra.bio/hc/en-us/community/posts/360071465631-write-lines-write-map-write-tsv-write-json-fail-when-run-in-a-workflow-rather-than-in-a-task
if (defined(cohort_sample_names_array)) {
@@ -99,9 +93,22 @@
}

File cohort_sample_names_file = select_first([write_array_task.output_file, cohort_sample_names])
Int effective_sample_count = length(read_lines(cohort_sample_names_file))

Int effective_scatter_count = if defined(extract_scatter_count_override) then select_first([extract_scatter_count_override])
else if is_wgs then
if effective_sample_count < 5000 then 1 # This results in 1 VCF per chromosome.
else if effective_sample_count < 20000 then 2000 # Stroke Anderson
else if effective_sample_count < 50000 then 10000
else 20000
else
if effective_sample_count < 5000 then 1 # This results in 1 VCF per chromosome.
else if effective_sample_count < 20000 then 1000
else if effective_sample_count < 50000 then 2500
else 7500
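# For example, a 10,000-sample exome cohort (is_wgs = false) scatters into 1,000 shards,
# while the same cohort treated as WGS (is_wgs = true) would scatter into 2,000.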

# allow an interval list to be passed in, but default it to our standard one if no args are here
File working_interval_list = select_first([interval_list, "gs://gcp-public-data--broad-references/hg38/v0/wgs_calling_regions.hg38.noCentromeres.noTelomeres.interval_list"])
File effective_interval_list = select_first([interval_list, "gs://gcp-public-data--broad-references/hg38/v0/wgs_calling_regions.hg38.noCentromeres.noTelomeres.interval_list"])


call GvsPrepareCallset.GvsPrepareCallset {
@@ -119,7 +126,7 @@
write_cost_to_db = write_cost_to_db,
cloud_sdk_docker = effective_cloud_sdk_docker,
enable_extract_table_ttl = true,
interval_list = working_interval_list,
interval_list = effective_interval_list,
control_samples = control_samples,
cloud_sdk_docker = effective_cloud_sdk_docker,
git_hash = effective_git_hash,
@@ -150,7 +157,7 @@
split_intervals_mem_override = split_intervals_mem_override,
extract_memory_override_gib = extract_memory_override,
disk_override = extract_disk_override,
interval_list = working_interval_list,
interval_list = effective_interval_list,
control_samples = control_samples,

cloud_sdk_docker = effective_cloud_sdk_docker,
@@ -159,7 +166,8 @@
gatk_docker = effective_gatk_docker,
git_hash = effective_git_hash,
variants_docker = effective_variants_docker,
write_cost_to_db = write_cost_to_db
write_cost_to_db = write_cost_to_db,
target_interval_list = target_interval_list,
}

output {
13 changes: 8 additions & 5 deletions scripts/variantstore/wdl/GvsJointVariantCalling.wdl
@@ -58,6 +58,8 @@ workflow GvsJointVariantCalling {
String? extract_table_prefix
String? filter_set_name

File? target_interval_list

# Overrides to be passed to GvsCreateFilterSet
Int? INDEL_VQSR_CLASSIC_max_gaussians_override = 4
Int? INDEL_VQSR_CLASSIC_mem_gb_override
@@ -70,15 +72,15 @@
Int? maximum_alternate_alleles
}

# If is_wgs is true, we'll use the WGS interval list else, we'll use the Exome interval list. We'll currently use
# the same weighted bed file for whole genomes and exomes.
# But, if interval_list is defined, we'll use that instead of choosing based on is_wgs
# If `is_wgs` is true we'll use the WGS interval list, otherwise we'll use the Exome interval list.
# We currently use the same weighted bed file for whole genomes and exomes. However if `interval_list` is defined,
# we'll use that instead of choosing based on `is_wgs`.
File default_interval_list = if (is_wgs) then "gs://gcp-public-data--broad-references/hg38/v0/wgs_calling_regions.hg38.noCentromeres.noTelomeres.interval_list"
else "gs://gcp-public-data--broad-references/hg38/v0/bge_exome_calling_regions.v1.1.interval_list"
File interval_list_to_use = select_first([interval_list, default_interval_list])

# the call_set_identifier string is used to name many different things throughout this workflow (BQ tables, vcfs etc),
# and so make sure nothing is broken by creative users, we replace spaces and underscores with hyphens
# The `call_set_identifier` string is used to name many different things throughout this workflow (BQ tables, vcfs etc).
# To make sure nothing is broken by creative users we replace spaces and underscores with hyphens.
String effective_extract_output_file_base_name = select_first([extract_output_file_base_name, sub(call_set_identifier, "\\s+|\_+", "-")])
String effective_extract_table_prefix = select_first([extract_table_prefix, sub(call_set_identifier, "\\s+|\_+", "-")])
String effective_filter_set_name = select_first([filter_set_name, sub(call_set_identifier, "\\s+|\_+", "-")])
@@ -228,6 +230,7 @@ workflow GvsJointVariantCalling {
bgzip_output_vcfs = bgzip_output_vcfs,
is_wgs = is_wgs,
maximum_alternate_alleles = maximum_alternate_alleles,
target_interval_list = target_interval_list,
}

output {