Consolidate various docs for AoU callset generation into one to rule them all [VS-553] #7971

Merged: 12 commits, Aug 3, 2022
4 changes: 0 additions & 4 deletions .dockstore.yml
@@ -65,7 +65,6 @@ workflows:
branches:
- master
- ah_var_store
- rsa_metadata_from_python
- name: GvsAoUReblockGvcf
subclass: WDL
primaryDescriptorPath: /scripts/variantstore/wdl/GvsAoUReblockGvcf.wdl
@@ -98,7 +97,6 @@ workflows:
branches:
- master
- ah_var_store
- rsa_metadata_from_python
- name: GvsCreateTables
subclass: WDL
primaryDescriptorPath: /scripts/variantstore/wdl/GvsCreateTables.wdl
@@ -132,7 +130,6 @@ workflows:
branches:
- master
- ah_var_store
- rsa_metadata_from_python
- name: GvsCreateVAT
subclass: WDL
primaryDescriptorPath: /scripts/variantstore/wdl/GvsCreateVAT.wdl
@@ -191,7 +188,6 @@ workflows:
branches:
- master
- ah_var_store
- rsa_metadata_from_python
- gg_VS-492_BetaUserJarRelease
- name: GvsJointVariantCalling
subclass: WDL
107 changes: 77 additions & 30 deletions scripts/variantstore/AOU_DELIVERABLES.md
@@ -1,40 +1,87 @@
# Running the Genome Variant Store (GVS) Pipeline for AoU

## Prerequisites
- If this is the first time running the GVS pipeline in a particular Google billing project, use your GCP account team to create a support ticket for the BigQuery team that includes "enabling cluster metadata pruning support for the BQ Read API." This enables a pre-GA feature that dramatically reduces the amount of data scanned reducing both cost and runtime.
- This list assumes you are starting off with a Terra workspace that contains the re-blocked (version 2) gVCFs for your callset as "sample sets" in the "Data" tab. You will also want to increase the Google quotas for the workspace project to these levels (all in region `us-central1`):
1. Persistent Disk Standard (GB): 1,000,000 GB (1 PB)
2. CPUs: 64,000
3. In-use IP addresses: 5,000 (this is the most challenging one and will probably require contacting your GCP account team to facilitate)
4. VM instances: 64,000
- Make a note of the Google project id (`aou-genomics-curation-prod`), dataset name (`aou_wgs`), and up-to-date service account JSON path, as these will be inputs to all the GVS workflows. The [naming conventions for other aspects of GVS datasets are outlined here](https://docs.google.com/document/d/1pNtuv7uDoiOFPbwe4zx5sAGH7MyxwKqXkyrpNmBxeow).

## Workflows
1. `GvsAssignIds` at the `sample set` level ("Step 1" in workflow submission) with a sample set of all the new samples to be included in the callset (if new controls are being added, they need to be done in a separate run, with the `samples_are_controls` input set to "true")
2. `GvsImportGenomes` at the `sample set` level ("Step 1" in workflow submission) with sample sets of the new samples (currently we divide them up into sets of max 4000 each, but [we want to raise or remove this limit](https://broadworkbench.atlassian.net/browse/VS-344)).
3. `GvsWithdrawSamples` if there are any samples to withdraw from the last callset.
4. GvsCreateAltAllele
5. GvsCreateFilterSet (see [naming conventions doc](https://docs.google.com/document/d/1pNtuv7uDoiOFPbwe4zx5sAGH7MyxwKqXkyrpNmBxeow) for guidance on what to name the filter set, which you will need to keep track of for the `GvsExtractCallset` WDL).
6. GvsPrepareRangesCallset needs to be run twice, once with `control_samples` set to "true" (see [naming conventions doc](https://docs.google.com/document/d/1pNtuv7uDoiOFPbwe4zx5sAGH7MyxwKqXkyrpNmBxeow) for guidance on what to use for `extract_table_prefix` or cohort prefix, which you will need to keep track of for the `GvsExtractCallset` WDL).
7. GvsExtractCallset needs to be run twice, once with `control_samples` set to "true", and with the `filter_set_name` and `extract_table_prefix` from step 5 & 6. Include a valid (and secure) "output_gcs_dir" parameter, which is where the VCF and interval list files will go.
8. Run [process to create callset statistics](callset_QC/README.md), for which you need
1. permission to query table `spec-ops-aou:gvs_public_reference_data.gnomad_v3_sites`
2. the data_project you used for all the GVS WDLs
3. the default_dataset you used for all the GVS WDLs
4. the `extract_table_prefix` input from `GvsExtractCallset` step
5. the `filter_set_name` input from `GvsCreateFilterSet` step

## Deliverables (via email once the above steps are complete)
1. location of the VCFs and interval_list files (`output_gcs_dir` input from GvsExtractCallset)
2. fully qualified name of the BigQuery dataset (input `dataset_name` in the workflows)
3. [callset statistics](callset_QC/README.md) CSV
4. [precision and sensitivity results](tieout/AoU_PRECISION_SENSITIVITY.md) TSV
## Setup
- Create a Terra workspace
- using your @pmi-ops.org account
- with the `allofus-drc-wgs-dev` Terra Billing account
- in the `AOU_DRC_WGS_OPERATORS` Authorization Domain
- Share it with `[email protected]` so the rest of the team can see it (Reader) and with the individual users who will be running the workflows (Owner).
- Populate the workspace with the following:
- [Fetch WGS metadata for samples from list](http://app.terra.bio/#workspaces/allofus-drc-wgs-dev/GVS%20AoU%20WGS%20Charlie/notebooks/launch/Fetch%20WGS%20metadata%20for%20samples%20from%20list.ipynb) notebook
- [GvsAssignIds](https://dockstore.org/my-workflows/github.com/broadinstitute/gatk/GvsAssignIds) workflow
- [GvsImportGenomes](https://dockstore.org/my-workflows/github.com/broadinstitute/gatk/GvsImportGenomes) workflow
- [GvsWithdrawSamples](https://dockstore.org/my-workflows/github.com/broadinstitute/gatk/GvsWithdrawSamples) workflow
- [GvsCreateAltAllele](https://dockstore.org/my-workflows/github.com/broadinstitute/gatk/GvsCreateAltAllele) workflow
- [GvsCreateFilterSet](https://dockstore.org/my-workflows/github.com/broadinstitute/gatk/GvsCreateFilterSet) workflow
- [GvsPrepareRangesCallset](https://dockstore.org/my-workflows/github.com/broadinstitute/gatk/GvsPrepareRangesCallset) workflow (VCF output)
- [GvsExtractCallset](https://dockstore.org/my-workflows/github.com/broadinstitute/gatk/GvsExtractCallset) workflow (VCF output)
- [GvsCalculatePrecisionAndSensitivity](https://dockstore.org/workflows/github.com/broadinstitute/gatk/GvsCalculatePrecisionAndSensitivity) workflow
- [GvsCallsetCost](https://dockstore.org/workflows/github.com/broadinstitute/gatk/GvsCallsetCost) workflow
- **TBD VDS Prepare WDL/notebook/??**
- **TBD VDS Extract WDL/notebook/??**
- Run the "Fetch WGS metadata for samples from list" notebook after you have placed the file with the list of the new samples to ingest in a GCS location the notebook (running with your @pmi-ops account) will have access to. This will grab the samples from the workspace where they were reblocked and bring them into this callset workspace.
- Set the `sample_list_file_path` variable in that notebook to the path of the file
- Run the "now that the data have been copied, you can make sample sets if you wish" step if you want to automatically break up the new samples into smaller sample sets. Set the `SUBSET_SIZE` and `set_name` variables to customize.
> **Collaborator:** Under what circumstances would we want to do this?
>
> **Author:** If you don't want to throw all 100K (or whatever number) samples at GvsImportGenomes at once.

- **TBD This is based on VCF output; VDS output might need different increases.** For extracting VCFs as the final output for the callset, you will want to increase the Google quotas for the workspace project (you can find this in the workspace dashboard under Cloud Information > Google Project ID) to these levels (all in the workspace region):
- Persistent Disk Standard (GB): 1,000,000 GB (1 PB)
- CPUs: 64,000
- In-use IP addresses: 5,000 (this is the most challenging one and will probably require contacting the GCP account team to facilitate)
- VM instances: 64,000
- Make a note of the Google project ID (`aou-genomics-curation-prod`), dataset name (`aou_wgs`), and callset identifier (e.g. "Bravo"), as these will be inputs (`project_id`, `dataset_name` and `call_set_identifier`) to all or most of the GVS workflows. The [naming conventions for other aspects of GVS datasets are outlined here](https://docs.google.com/document/d/1pNtuv7uDoiOFPbwe4zx5sAGH7MyxwKqXkyrpNmBxeow).
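If you would rather script the sample-set chunking outside the notebook, a minimal sketch is below. It assumes a plain-text file with one sample ID per line and writes a Terra membership TSV; the file names and the `SUBSET_SIZE`/`set_name` values are illustrative, not the notebook's actual code.

```python
# Sketch only: split a list of sample IDs into Terra sample sets of SUBSET_SIZE.
# Assumes sample_ids.txt holds one research_id per line; names are illustrative.
SUBSET_SIZE = 4000
set_name = "aou_callset"  # illustrative prefix, not a required value

with open("sample_ids.txt") as f:
    sample_ids = [line.strip() for line in f if line.strip()]

# A Terra "membership" TSV pairs a sample_set_id with each member sample.
with open("sample_set_membership.tsv", "w") as out:
    out.write("membership:sample_set_id\tsample\n")
    for i in range(0, len(sample_ids), SUBSET_SIZE):
        set_id = f"{set_name}_{i // SUBSET_SIZE + 1}"
        for sample in sample_ids[i:i + SUBSET_SIZE]:
            out.write(f"{set_id}\t{sample}\n")
```

The resulting TSV can then be uploaded through the workspace Data tab to create the sample sets.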

## The Pipeline
1. `GvsAssignIds` workflow
- To optimize the GVS internal queries, each sample must be assigned a unique and consecutive integer ID. Running the `GvsAssignIds` workflow will create a unique GVS ID for each sample (`sample_id`) and update the BQ `sample_info` table (creating it if it doesn't exist). This workflow also takes care of creating the BQ `vet_*`, `ref_ranges_*` and `cost_observability` tables needed for the sample IDs generated.
- Run at the `sample set` level ("Step 1" in workflow submission) with a sample set of all the new samples to be included in the callset (created by the "Fetch WGS metadata for samples from list" notebook mentioned above).
- You will want to set the `external_sample_names` input based on the column in the workspace Data table, e.g. "this.samples.research_id".
- If new controls are being added, they need to be done in a separate run, with the `samples_are_controls` input set to "true" (the referenced Data columns may also be different, e.g. "this.control_samples.control_sample_id" instead of "this.samples.research_id").
> **Collaborator:** Slightly confused as what this means - the external_sample_name may be pulled from the control_sample_id?
>
> **Author:** The external_sample_name workflow input is an array of strings, usually grabbed from the Terra Data store in the workspace. The columns in that store can be called anything, and in the past, control samples and participant samples were in different tables with different column names.
>
> **Collaborator:** How will we know if new controls are being added?
>
> **Author (@rsasch, Aug 3, 2022):** In the past, Lee/AoU has let us know that we should add additional (or different) controls to a callset.
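After `GvsAssignIds` completes, a quick sanity check of the "unique and consecutive" invariant on `sample_info` can be done with a query along these lines. This is not part of the workflow itself, and the `sample_id` column name is taken from the description above; treat any gap as something to investigate rather than a hard failure.

```python
# Sketch only: verify that sample_info holds unique, roughly consecutive sample_ids.
# Assumes a sample_info table with a sample_id column, per the step description.
from google.cloud import bigquery

project_id = "aou-genomics-curation-prod"  # from the Setup notes
dataset_name = "aou_wgs"

client = bigquery.Client(project=project_id)
sql = f"""
SELECT COUNT(*)                             AS n_samples,
       COUNT(DISTINCT sample_id)            AS n_distinct,
       MAX(sample_id) - MIN(sample_id) + 1  AS id_span
FROM `{project_id}.{dataset_name}.sample_info`
"""
row = list(client.query(sql).result())[0]
# Unique: n_samples == n_distinct; consecutive: id_span == n_samples.
print(row.n_samples, row.n_distinct, row.id_span)
```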

2. `GvsImportGenomes` workflow
- This workflow imports the re-blocked gVCF files into GVS. It checks whether data for each sample has already been loaded into GVS, and it is designed to be re-run (with the same inputs) if there is a failure during one of the workflow tasks (e.g. a BigQuery write API interrupt).
- Run at the `sample set` level ("Step 1" in workflow submission). You can either run this on a sample_set of all the samples and rely on the workflow logic to break it up into batches (or manually set the `load_data_batch_size` input) or run it on smaller sample_sets created by the "Fetch WGS metadata for samples from list" notebook mentioned above.
- You will want to set the `external_sample_names`, `input_vcfs` and `input_vcf_indexes` inputs based on the columns in the workspace Data table, e.g. "this.samples.research_id", "this.samples.reblocked_gvcf_v2" and "this.samples.reblocked_gvcf_index_v2".
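Before submitting `GvsImportGenomes`, it can be worth confirming that the three array inputs line up one-to-one. A minimal sketch follows, assuming the three columns have been exported to local text files; the file names are illustrative.

```python
# Sketch only: check that external_sample_names, input_vcfs and input_vcf_indexes
# line up 1:1 before submission. File names are illustrative placeholders.
def read_list(path):
    with open(path) as f:
        return [line.strip() for line in f if line.strip()]

names = read_list("external_sample_names.txt")
vcfs = read_list("input_vcfs.txt")
indexes = read_list("input_vcf_indexes.txt")

assert len(names) == len(vcfs) == len(indexes), "array inputs must be the same length"
for vcf, idx in zip(vcfs, indexes):
    # Reblocked gVCFs are typically .vcf.gz files with a matching .tbi index.
    assert idx.startswith(vcf), f"{idx} does not appear to index {vcf}"
```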
3. `GvsWithdrawSamples` workflow
- Run if there are any samples to withdraw from the last callset.
> **Collaborator:** How do we know if there are samples to withdraw?
>
> **Author:** We compare the sample list for the callset we are creating (which we get from Lee/AoU) with the samples already in the database.
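That comparison can be scripted roughly as follows. The `sample_name` column on `sample_info` is an assumption for illustration, and the list file is whatever AoU provides for the new callset.

```python
# Sketch only: find samples already loaded into GVS but absent from the new callset
# list, i.e. candidates for GvsWithdrawSamples. Column name sample_name is assumed.
from google.cloud import bigquery

project_id = "aou-genomics-curation-prod"
dataset_name = "aou_wgs"

with open("new_callset_sample_list.txt") as f:  # list provided by AoU
    new_callset = {line.strip() for line in f if line.strip()}

client = bigquery.Client(project=project_id)
sql = f"SELECT sample_name FROM `{project_id}.{dataset_name}.sample_info`"
already_loaded = {row.sample_name for row in client.query(sql).result()}

to_withdraw = sorted(already_loaded - new_callset)
print(f"{len(to_withdraw)} samples to withdraw")
```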

4. **TBD Workflow to soft delete samples**
5. `GvsCreateAltAllele` workflow
- **TODO:** needs to be made cumulative so that it can add data to the existing table instead of creating it from scratch on each run (see [VS-52](https://broadworkbench.atlassian.net/browse/VS-52))
- This step loads data into the `alt_allele` table from the `vet_*` tables in preparation for running the filtering step.
- This workflow does not use the Terra Data Entity Model to run, so be sure to select the `Run workflow with inputs defined by file paths` workflow submission option.
6. `GvsCreateFilterSet` workflow
- This step calculates features from the `alt_allele` table, trains the VQSR filtering model, and loads the resulting filtering data, along with site-level QC filters, into a series of `filter_set_*` tables in BigQuery.
- See [naming conventions doc](https://docs.google.com/document/d/1pNtuv7uDoiOFPbwe4zx5sAGH7MyxwKqXkyrpNmBxeow) for guidance on what to use for `filter_set_name`, which you will need to keep track of for the `GvsExtractCallset` WDL. If, for some reason, this step needs to be run multiple times, be sure to use a different `filter_set_name` (the doc has guidance for this, as well).
- This workflow does not use the Terra Data Entity Model to run, so be sure to select the `Run workflow with inputs defined by file paths` workflow submission option.
7. `GvsPrepareRangesCallset` workflow
- This workflow transforms the data in the `vet_*` and `ref_ranges_*` tables into a schema optimized for VCF generation during the Extract step.
- It needs to be run twice: once with `control_samples` set to "true" and once with it left at its default value of "false" (see [naming conventions doc](https://docs.google.com/document/d/1pNtuv7uDoiOFPbwe4zx5sAGH7MyxwKqXkyrpNmBxeow) for guidance on what to use for `extract_table_prefix`, or cohort prefix, which you will need to keep track of for the `GvsExtractCallset` WDL).
- This workflow does not use the Terra Data Entity Model to run, so be sure to select the `Run workflow with inputs defined by file paths` workflow submission option.
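Since both this step and `GvsExtractCallset` are run once for the controls and once for the main cohort, it helps to write the paired inputs down before submitting. A sketch of that bookkeeping is below; the prefix and filter-set values are placeholders following the naming-conventions doc, not real callset names.

```python
# Sketch only: bookkeeping for the paired control / main-cohort runs of
# GvsPrepareRangesCallset and GvsExtractCallset. Values are placeholders.
shared = {
    "project_id": "aou-genomics-curation-prod",
    "dataset_name": "aou_wgs",
    "filter_set_name": "bravo_filter",  # from the GvsCreateFilterSet step
}
runs = [
    {"control_samples": "false", "extract_table_prefix": "bravo_main"},
    {"control_samples": "true",  "extract_table_prefix": "bravo_controls"},
]
for run in runs:
    print({**shared, **run})
```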
8. `GvsExtractCallset` workflow
- This workflow extracts the data in BigQuery and transforms it into a sharded joint-called VCF incorporating the VQSR filter set data.
- It also needs to be run twice, once with `control_samples` set to "true", and with the `filter_set_name` and `extract_table_prefix` from steps 6 and 7. Include a valid (and secure) `output_gcs_dir` parameter, which is where the VCF, interval list, manifest, and sample name list files will go.
- This workflow does not use the Terra Data Entity Model to run, so be sure to select the `Run workflow with inputs defined by file paths` workflow submission option.
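Once the extract runs finish, a quick way to confirm what landed in `output_gcs_dir` is to group its objects by file type. A sketch using the google-cloud-storage client is below; the bucket name and prefix are illustrative placeholders.

```python
# Sketch only: summarize the contents of output_gcs_dir after GvsExtractCallset.
# Bucket and prefix are placeholders for the real (secure) output location.
from collections import Counter
from google.cloud import storage

bucket_name = "my-secure-output-bucket"  # placeholder
prefix = "bravo/extract/"                # placeholder

client = storage.Client()
suffix_counts = Counter()
for blob in client.list_blobs(bucket_name, prefix=prefix):
    for suffix in (".vcf.gz.tbi", ".vcf.gz", ".interval_list"):
        if blob.name.endswith(suffix):
            suffix_counts[suffix] += 1
            break
    else:
        suffix_counts["other"] += 1  # manifest, sample name list, etc.

print(dict(suffix_counts))
```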
9. **TBD VDS Prepare WDL/notebook/??**
10. **TBD VDS Extract WDL/notebook/??**
11. Run the notebook to create callset stats (**TODO**, see [VS-388](https://broadworkbench.atlassian.net/browse/VS-388)), for which you need
- "BigQuery Data Viewer" role for your @pmi-ops proxy group on the `spec-ops-aou:gvs_public_reference_data.gnomad_v3_sites` table
- the Google project ID you used for all the GVS WDLs (`project_id` input)
- the name of the BigQuery dataset you used for all the GVS WDLs (`dataset_name` input)
- the `extract_table_prefix` input from `GvsExtractCallset` step
- the `filter_set_name` input from `GvsCreateFilterSet` step
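Before running the stats notebook, it can save time to confirm that your @pmi-ops proxy group can see the gnomAD sites table. A dry-run query (standard-SQL form of the table name) is a cheap first check, though a small real query is the definitive test of read access.

```python
# Sketch only: dry-run a query against the gnomAD v3 sites reference table.
# A dry run validates the query and table reference without billing any bytes.
from google.cloud import bigquery

client = bigquery.Client(project="aou-genomics-curation-prod")
job_config = bigquery.QueryJobConfig(dry_run=True)
sql = """
SELECT COUNT(*)
FROM `spec-ops-aou.gvs_public_reference_data.gnomad_v3_sites`
"""
job = client.query(sql, job_config=job_config)
print(f"Dry run OK; would scan {job.total_bytes_processed} bytes")
```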
12. [GvsCalculatePrecisionAndSensitivity](tieout/AoU_PRECISION_SENSITIVITY.md#generating-callset-precision-and-sensitivity-values) workflow
13. `GvsCallsetCost` workflow
- The cost reported by this workflow represents the total BigQuery cost of the GVS pipeline workflows for this callset (which is not captured in the Terra UI's total workflow cost); it is used to calculate the cost of the callset as a whole and per sample.
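The per-sample figure in that calculation is simple arithmetic over the workflow's output; a sketch with placeholder numbers:

```python
# Sketch only: derive cost per sample from the GvsCallsetCost output.
# The dollar figure and sample count are placeholders, not real callset values.
total_bq_cost_usd = 125_000.00  # reported by GvsCallsetCost (placeholder)
n_samples = 100_000             # samples in the callset (placeholder)

cost_per_sample = total_bq_cost_usd / n_samples
print(f"${cost_per_sample:.2f} per sample")
```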


## Deliverables (via email to stakeholders once the above steps are complete)
1. GCS locations of the VCFs, indexes and interval_list files (subpaths of the `output_gcs_dir` input from `GvsExtractCallset`)
2. fully qualified name of the BigQuery dataset (composed of the `project_id` and `dataset_name` inputs from the workflows)
3. callset statistics CSV file (see step #11 above)
4. TSV output from `GvsCalculatePrecisionAndSensitivity` workflow (see step #12 above)

## Running the VAT pipeline
To create a BigQuery table of variant annotations, you may follow the instructions here:
[process to create variant annotations table](variant_annotations_table/README.md)

The pipeline takes in a jointVCF and outputs a variant annotations table in BigQuery.
The pipeline takes in jointVCF shards and outputs a variant annotations table in BigQuery.


