Track BigQuery costs of GVS python VS-480 #7915

Merged · 25 commits · Jun 29, 2022

Commits
368c5c5
first stab at prepare ranges callset
rsasch Jun 15, 2022
b31a253
Merge branch 'ah_var_store' into rsa_metadata_from_python
rsasch Jun 15, 2022
e080975
clean up changes to create_ranges_cohort_extract_data_table
rsasch Jun 16, 2022
5003cef
add cost tracking to populate_alt_allele_table and GvsQuickstartInteg…
rsasch Jun 16, 2022
af32179
all together now
rsasch Jun 16, 2022
d04740c
once more, with feeling
rsasch Jun 16, 2022
56c7bfe
dependency
rsasch Jun 16, 2022
8fbbb81
Merge branch 'ah_var_store' into rsa_metadata_from_python
rsasch Jun 17, 2022
77ce3d9
use query_labels_map instead of query_labels
rsasch Jun 21, 2022
9448b54
updated image
rsasch Jun 22, 2022
fb5c851
Merge branch 'ah_var_store' into rsa_metadata_from_python
rsasch Jun 22, 2022
44d19da
use the right file name for diff
rsasch Jun 23, 2022
ac5f31f
cleanup
rsasch Jun 23, 2022
b32b6a0
update all docker images
rsasch Jun 23, 2022
7025cd0
more cleanup
rsasch Jun 23, 2022
2c1bbe8
oops
mcovarr Jun 23, 2022
617732c
PR feedback
rsasch Jun 23, 2022
4b92cdc
PR feedback
rsasch Jun 27, 2022
de02b59
add flag to diff
rsasch Jun 28, 2022
651384a
small updates, added output for cost check
rsasch Jun 28, 2022
b7f7b02
Merge branch 'ah_var_store' into rsa_metadata_from_python
rsasch Jun 29, 2022
7700f04
added call_set_identifier input to quickstart directions
rsasch Jun 29, 2022
a7fafed
thread cohort_table_prefix through other WDLs
rsasch Jun 29, 2022
c8b1fbe
Merge branch 'ah_var_store' into rsa_metadata_from_python
rsasch Jun 29, 2022
7364e9e
words
rsasch Jun 29, 2022
9 changes: 5 additions & 4 deletions .dockstore.yml
@@ -65,7 +65,7 @@ workflows:
branches:
- master
- ah_var_store
- vs_441_cost_observability
- rsa_metadata_from_python
- name: GvsAoUReblockGvcf
subclass: WDL
primaryDescriptorPath: /scripts/variantstore/wdl/GvsAoUReblockGvcf.wdl
@@ -98,7 +98,7 @@ workflows:
branches:
- master
- ah_var_store
- vs_447_fixup_non_fq_invocations
- rsa_metadata_from_python
- name: GvsCreateTables
subclass: WDL
primaryDescriptorPath: /scripts/variantstore/wdl/GvsCreateTables.wdl
@@ -132,7 +132,7 @@ workflows:
branches:
- master
- ah_var_store
- rsa_query_labels
- rsa_metadata_from_python
- name: GvsCreateVAT
subclass: WDL
primaryDescriptorPath: /scripts/variantstore/wdl/GvsCreateVAT.wdl
@@ -188,6 +188,7 @@ workflows:
branches:
- master
- ah_var_store
- rsa_metadata_from_python
- name: GvsJointVariantCalling
subclass: WDL
primaryDescriptorPath: /scripts/variantstore/wdl/GvsJointVariantCalling.wdl
@@ -210,7 +211,7 @@ workflows:
branches:
- master
- ah_var_store
- rsa_reblock_quickstart_v2
- rsa_metadata_from_python
- name: GvsIngestTieout
subclass: WDL
primaryDescriptorPath: /scripts/variantstore/wdl/GvsIngestTieout.wdl
1 change: 1 addition & 0 deletions scripts/variantstore/TERRA_QUICKSTART.md
@@ -80,6 +80,7 @@ This is done by running the `GvsPrepareRangesCallset` workflow with the following

| Parameter | Description |
|--------------------- | ----------- |
| call_set_identifier | A unique, descriptive name for the callset; this should be the same as the `call_set_identifier` from step 2 |
| dataset_name | the name of the dataset you created above |
| extract_table_prefix | A unique, descriptive name for the tables containing the callset (for simplicity, you can use the same name you used for `filter_set_name` in step 3); you will want to make note of this for use in the next step |
| project_id | the name of the google project containing the dataset |
18 changes: 9 additions & 9 deletions scripts/variantstore/wdl/GvsAssignIds.wdl
@@ -182,14 +182,14 @@ task CreateCostObservabilityTable {
}

String cost_observability_json =
'[ { "name": "call_set_identifier", "type": "STRING", "mode": "REQUIRED" }, ' + # The name by which we refer to the callset
' { "name": "step", "type": "STRING", "mode": "REQUIRED" }, ' + # The name of the core GVS workflow to which this belongs
' { "name": "call", "type": "STRING", "mode": "NULLABLE" }, ' + # The WDL call to which this belongs
' { "name": "shard_identifier", "type": "STRING", "mode": "NULLABLE" }, ' + # A unique identifier for this shard, may or may not be its index
' { "name": "call_start_timestamp", "type": "TIMESTAMP", "mode": "REQUIRED" }, ' + # When the call logging this event was started
' { "name": "event_timestamp", "type": "TIMESTAMP", "mode": "REQUIRED" }, ' + # When the observability event was logged
' { "name": "event_key", "type": "STRING", "mode": "REQUIRED" }, ' + # The type of observability event being logged
' { "name": "event_bytes", "type": "INTEGER", "mode": "REQUIRED" } ] ' # Number of bytes reported for this observability event
'[ { "name": "call_set_identifier", "type": "STRING", "mode": "REQUIRED", "description": "The name by which we refer to the callset" }, ' +
' { "name": "step", "type": "STRING", "mode": "REQUIRED", "description": "The name of the core GVS workflow to which this belongs" }, ' +
' { "name": "call", "type": "STRING", "mode": "NULLABLE", "description": "The WDL call to which this belongs" }, ' +
' { "name": "shard_identifier", "type": "STRING", "mode": "NULLABLE", "description": "A unique identifier for this shard, may or may not be its index" }, ' +
' { "name": "call_start_timestamp", "type": "TIMESTAMP", "mode": "REQUIRED", "description": "When the call logging this event was started" }, ' +
' { "name": "event_timestamp", "type": "TIMESTAMP", "mode": "REQUIRED", "description": "When the observability event was logged" }, ' +
' { "name": "event_key", "type": "STRING", "mode": "REQUIRED", "description": "The type of observability event being logged" }, ' +
' { "name": "event_bytes", "type": "INTEGER", "mode": "REQUIRED", "description": "Number of bytes reported for this observability event" } ] '

meta {
# not volatile: true, always run this when asked
@@ -214,7 +214,7 @@ task CreateCostObservabilityTable {
fi
>>>
runtime {
docker: "us.gcr.io/broad-dsde-methods/variantstore:rsa_query_labels_20220621"
docker: "us.gcr.io/broad-dsde-methods/variantstore:rsa_metadata_from_python_20220628"
}
output {
Boolean done = true
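The change above moves each field's trailing WDL comment into a BigQuery `description` attribute on the `cost_observability` schema. As a sanity check, a sketch like the following (hypothetical, not part of the PR) reproduces the same schema as a Python literal so it can be validated and serialized before being handed to a tool such as `bq mk --table`:

```python
import json

# Hypothetical re-creation of the cost_observability schema the WDL builds
# by string concatenation; keeping it as a Python literal makes it easy to
# validate before use.
COST_OBSERVABILITY_SCHEMA = [
    {"name": "call_set_identifier", "type": "STRING", "mode": "REQUIRED",
     "description": "The name by which we refer to the callset"},
    {"name": "step", "type": "STRING", "mode": "REQUIRED",
     "description": "The name of the core GVS workflow to which this belongs"},
    {"name": "call", "type": "STRING", "mode": "NULLABLE",
     "description": "The WDL call to which this belongs"},
    {"name": "shard_identifier", "type": "STRING", "mode": "NULLABLE",
     "description": "A unique identifier for this shard, may or may not be its index"},
    {"name": "call_start_timestamp", "type": "TIMESTAMP", "mode": "REQUIRED",
     "description": "When the call logging this event was started"},
    {"name": "event_timestamp", "type": "TIMESTAMP", "mode": "REQUIRED",
     "description": "When the observability event was logged"},
    {"name": "event_key", "type": "STRING", "mode": "REQUIRED",
     "description": "The type of observability event being logged"},
    {"name": "event_bytes", "type": "INTEGER", "mode": "REQUIRED",
     "description": "Number of bytes reported for this observability event"},
]

def schema_json() -> str:
    """Serialize the schema in the JSON shape a schema file would use."""
    return json.dumps(COST_OBSERVABILITY_SCHEMA, indent=2)

if __name__ == "__main__":
    parsed = json.loads(schema_json())
    assert all("description" in field for field in parsed)
    print(len(parsed))  # 8 fields
```

Round-tripping through `json.loads` catches the quoting mistakes that are easy to make when the schema is assembled by string concatenation, as in the WDL.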
@@ -264,7 +264,7 @@ task Add_AS_MAX_VQSLOD_ToVcf {
File input_vcf
String output_basename

String docker = "us.gcr.io/broad-dsde-methods/variantstore:rsa_query_labels_20220621"
String docker = "us.gcr.io/broad-dsde-methods/variantstore:rsa_metadata_from_python_20220628"
Int cpu = 1
Int memory_mb = 3500
Int disk_size_gb = ceil(2*size(input_vcf, "GiB")) + 50
2 changes: 1 addition & 1 deletion scripts/variantstore/wdl/GvsCallsetCost.wdl
@@ -69,7 +69,7 @@ task WorkflowComputeCosts {
>>>

runtime {
docker: "us.gcr.io/broad-dsde-methods/variantstore:vs_472_workflow_compute_costs-2022-06-21"
docker: "us.gcr.io/broad-dsde-methods/variantstore:rsa_metadata_from_python_20220628"
}

output {
6 changes: 5 additions & 1 deletion scripts/variantstore/wdl/GvsCreateAltAllele.wdl
@@ -7,6 +7,7 @@ workflow GvsCreateAltAllele {
Boolean go = true
String dataset_name
String project_id
String call_set_identifier

String? service_account_json_path
}
@@ -38,6 +39,7 @@
scatter (idx in range(length(GetVetTableNames.vet_tables))) {
call PopulateAltAlleleTable {
input:
call_set_identifier = call_set_identifier,
dataset_name = dataset_name,
project_id = project_id,
create_table_done = CreateAltAlleleTable.done,
@@ -173,6 +175,7 @@ task PopulateAltAlleleTable {

String create_table_done
String vet_table_name
String call_set_identifier

String? service_account_json_path

@@ -194,13 +197,14 @@
fi

python3 /app/populate_alt_allele_table.py \
--call_set_identifier ~{call_set_identifier} \
--query_project ~{project_id} \
--vet_table_name ~{vet_table_name} \
--fq_dataset ~{project_id}.~{dataset_name} \
$SERVICE_ACCOUNT_STANZA
>>>
runtime {
docker: "us.gcr.io/broad-dsde-methods/variantstore:rsa_query_labels_20220621"
docker: "us.gcr.io/broad-dsde-methods/variantstore:rsa_metadata_from_python_20220628"
memory: "3 GB"
disks: "local-disk 10 HDD"
cpu: 1
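Threading `call_set_identifier` into `populate_alt_allele_table.py` (and, per the earlier commit, using a `query_labels_map`) suggests the cost tracking works by attaching labels to the BigQuery jobs each step issues. BigQuery label values are restricted to lowercase letters, digits, underscores, and dashes, at most 63 characters. A hypothetical helper in that spirit (the key names and sanitization here are assumptions, not code from the PR):

```python
import re
from typing import Optional

def bq_label_value(raw: str) -> str:
    """Coerce a string into a legal BigQuery label value: lowercase letters,
    digits, underscores, and dashes only, truncated to 63 characters."""
    value = re.sub(r"[^a-z0-9_-]", "-", raw.lower())
    return value[:63]

def cost_labels(call_set_identifier: str, step: str,
                call: Optional[str] = None) -> dict:
    """Build a labels map to attach to a BigQuery job so its cost can later
    be attributed back to a callset/step/call. Key names are hypothetical."""
    labels = {
        "gvs_call_set_identifier": bq_label_value(call_set_identifier),
        "gvs_step": bq_label_value(step),
    }
    if call is not None:
        labels["gvs_call"] = bq_label_value(call)
    return labels

if __name__ == "__main__":
    print(cost_labels("Quickstart Callset v2",
                      "GvsCreateAltAllele",
                      "PopulateAltAlleleTable"))
```

A labels map like this could then be passed to a query job configuration so that billing exports and `INFORMATION_SCHEMA.JOBS` queries can aggregate bytes scanned per callset.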
2 changes: 1 addition & 1 deletion scripts/variantstore/wdl/GvsCreateVAT.wdl
@@ -155,7 +155,7 @@ task MakeSubpopulationFiles {
# ------------------------------------------------
# Runtime settings:
runtime {
docker: "us.gcr.io/broad-dsde-methods/variantstore:rsa_query_labels_20220621"
docker: "us.gcr.io/broad-dsde-methods/variantstore:rsa_metadata_from_python_20220628"
memory: "1 GB"
preemptible: 3
cpu: "1"
4 changes: 2 additions & 2 deletions scripts/variantstore/wdl/GvsCreateVATAnnotations.wdl
@@ -186,7 +186,7 @@ task ExtractAnAcAfFromVCF {
# ------------------------------------------------
# Runtime settings:
runtime {
docker: "us.gcr.io/broad-dsde-methods/variantstore:rsa_query_labels_20220621"
docker: "us.gcr.io/broad-dsde-methods/variantstore:rsa_metadata_from_python_20220628"
maxRetries: 3
memory: "16 GB"
preemptible: 3
@@ -317,7 +317,7 @@ task PrepAnnotationJson {
# ------------------------------------------------
# Runtime settings:
runtime {
docker: "us.gcr.io/broad-dsde-methods/variantstore:rsa_query_labels_20220621"
docker: "us.gcr.io/broad-dsde-methods/variantstore:rsa_metadata_from_python_20220628"
memory: "8 GB"
preemptible: 5
cpu: "1"
4 changes: 2 additions & 2 deletions scripts/variantstore/wdl/GvsCreateVATFromAnnotations.wdl
@@ -92,7 +92,7 @@ task GetAnnotations {
# ------------------------------------------------
# Runtime settings:
runtime {
docker: "us.gcr.io/broad-dsde-methods/variantstore:rsa_query_labels_20220621"
docker: "us.gcr.io/broad-dsde-methods/variantstore:rsa_metadata_from_python_20220628"
memory: "1 GB"
preemptible: 3
cpu: "1"
@@ -151,7 +151,7 @@ task PrepAnnotationJson {
# ------------------------------------------------
# Runtime settings:
runtime {
docker: "us.gcr.io/broad-dsde-methods/variantstore:rsa_query_labels_20220621"
docker: "us.gcr.io/broad-dsde-methods/variantstore:rsa_metadata_from_python_20220628"
memory: "8 GB"
preemptible: 5
cpu: "1"
@@ -36,6 +36,7 @@ workflow GvsExtractCohortFromSampleNames {

call GvsPrepareCallset.GvsPrepareCallset {
input:
call_set_identifier = cohort_table_prefix,
extract_table_prefix = cohort_table_prefix,
sample_names_to_extract = cohort_sample_names,
project_id = gvs_project,
2 changes: 1 addition & 1 deletion scripts/variantstore/wdl/GvsImportGenomes.wdl
@@ -399,7 +399,7 @@ task CurateInputLists {
--output_files True
>>>
runtime {
docker: "us.gcr.io/broad-dsde-methods/variantstore:rsa_query_labels_20220621"
docker: "us.gcr.io/broad-dsde-methods/variantstore:rsa_metadata_from_python_20220628"
memory: "3 GB"
disks: "local-disk 100 HDD"
bootDiskSizeGb: 15
1 change: 1 addition & 0 deletions scripts/variantstore/wdl/GvsJointVariantCalling.wdl
@@ -43,6 +43,7 @@ workflow GvsJointVariantCalling {

call GvsUnified.GvsUnified {
input:
call_set_identifier = call_set_identifier,
dataset_name = dataset_name,
project_id = project_id,
external_sample_names = external_sample_names,
Expand Down
16 changes: 10 additions & 6 deletions scripts/variantstore/wdl/GvsPrepareRangesCallset.wdl
@@ -9,8 +9,9 @@ workflow GvsPrepareCallset {
# true for control samples only, false for participant samples only
Boolean control_samples = false

String extract_table_prefix
String call_set_identifier

String extract_table_prefix = call_set_identifier
String query_project = project_id
String destination_project = project_id
String destination_dataset = dataset_name
@@ -22,17 +23,18 @@
}

String full_extract_prefix = if (control_samples) then "~{extract_table_prefix}_controls" else extract_table_prefix
String fq_petvet_dataset = "~{project_id}.~{dataset_name}"
String fq_refvet_dataset = "~{project_id}.~{dataset_name}"
String fq_sample_mapping_table = "~{project_id}.~{dataset_name}.sample_info"
String fq_destination_dataset = "~{destination_project}.~{destination_dataset}"

call PrepareRangesCallsetTask {
input:
call_set_identifier = call_set_identifier,
destination_cohort_table_prefix = full_extract_prefix,
sample_names_to_extract = sample_names_to_extract,
query_project = query_project,
query_labels = query_labels,
fq_petvet_dataset = fq_petvet_dataset,
fq_refvet_dataset = fq_refvet_dataset,
fq_sample_mapping_table = fq_sample_mapping_table,
fq_temp_table_dataset = fq_temp_table_dataset,
fq_destination_dataset = fq_destination_dataset,
@@ -49,12 +51,13 @@

task PrepareRangesCallsetTask {
input {
String call_set_identifier
String destination_cohort_table_prefix
File? sample_names_to_extract
String query_project

Boolean control_samples
String fq_petvet_dataset
String fq_refvet_dataset
String fq_sample_mapping_table
String fq_temp_table_dataset
String fq_destination_dataset
@@ -95,8 +98,9 @@ task PrepareRangesCallsetTask {
fi

python3 /app/create_ranges_cohort_extract_data_table.py \
--call_set_identifier ~{call_set_identifier} \
--control_samples ~{control_samples} \
--fq_ranges_dataset ~{fq_petvet_dataset} \
--fq_ranges_dataset ~{fq_refvet_dataset} \
--fq_temp_table_dataset ~{fq_temp_table_dataset} \
--fq_destination_dataset ~{fq_destination_dataset} \
--destination_cohort_table_prefix ~{destination_cohort_table_prefix} \
@@ -112,7 +116,7 @@
}

runtime {
docker: "us.gcr.io/broad-dsde-methods/variantstore:rsa_query_labels_20220621"
docker: "us.gcr.io/broad-dsde-methods/variantstore:rsa_metadata_from_python_20220628"
memory: "3 GB"
disks: "local-disk 100 HDD"
bootDiskSizeGb: 15
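GvsPrepareCallset now takes `call_set_identifier` as the primary input, defaults `extract_table_prefix` to it, and still appends `_controls` for control-sample runs. That naming logic can be mirrored in Python (a sketch of the WDL expressions above, not code from the PR):

```python
from typing import Optional

def full_extract_prefix(call_set_identifier: str,
                        extract_table_prefix: Optional[str] = None,
                        control_samples: bool = False) -> str:
    """Mirror the WDL: extract_table_prefix defaults to call_set_identifier,
    and control-sample runs get a `_controls` suffix on the table prefix."""
    prefix = extract_table_prefix or call_set_identifier
    return f"{prefix}_controls" if control_samples else prefix

if __name__ == "__main__":
    print(full_extract_prefix("quickstart_v2"))                        # quickstart_v2
    print(full_extract_prefix("quickstart_v2", control_samples=True))  # quickstart_v2_controls
```

Defaulting the prefix to the callset identifier keeps cost rows and extract tables keyed by the same name, which is what lets `cost_observability` attribute bytes back to a callset without extra bookkeeping.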
55 changes: 54 additions & 1 deletion scripts/variantstore/wdl/GvsQuickstartIntegration.wdl
@@ -52,6 +52,7 @@ workflow GvsQuickstartIntegration {

Int load_data_batch_size = 1
}
String project_id = "gvs-internal"

call Utils.BuildGATKJarAndCreateDataset {
input:
@@ -61,8 +62,9 @@

call Unified.GvsUnified {
input:
call_set_identifier = branch_name,
dataset_name = BuildGATKJarAndCreateDataset.dataset_name,
project_id = "gvs-internal",
project_id = project_id,
external_sample_names = external_sample_names,
gatk_override = BuildGATKJarAndCreateDataset.jar,
input_vcfs = input_vcfs,
@@ -82,6 +84,14 @@
actual_vcfs = GvsUnified.output_vcfs
}

call AssertCostIsTrackedAndExpected {
input:
go = GvsUnified.done,
dataset_name = BuildGATKJarAndCreateDataset.dataset_name,
project_id = project_id,
expected_output_csv = expected_output_prefix + "cost_observability_expected.csv"
}

output {
Array[File] output_vcfs = GvsUnified.output_vcfs
Array[File] output_vcf_indexes = GvsUnified.output_vcf_indexes
@@ -182,3 +192,46 @@ task AssertIdenticalOutputs {
Boolean done = true
}
}

task AssertCostIsTrackedAndExpected {
meta {
# we want to check the database each time this runs
volatile: true
}

input {
Boolean go = true
String dataset_name
String project_id
File expected_output_csv
}

command <<<
set -o errexit
set -o nounset
set -o pipefail
set -o xtrace

echo "project_id = ~{project_id}" > ~/.bigqueryrc
bq query --location=US --project_id=~{project_id} --format=csv --use_legacy_sql=false "SELECT step, call, event_key, event_bytes FROM ~{dataset_name}.cost_observability" > cost_observability_output.csv
set +o errexit
diff -w cost_observability_output.csv ~{expected_output_csv} > differences.txt
set -o errexit

if [[ -s differences.txt ]]; then
echo "Differences found:"
cat differences.txt
exit 1
fi
>>>

runtime {
docker: "gcr.io/google.com/cloudsdktool/cloud-sdk:latest"
disks: "local-disk 10 HDD"
}

output {
File cost_observability_output_csv = "cost_observability_output.csv"
File differences = "differences.txt"
}
}
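`AssertCostIsTrackedAndExpected` compares the `bq query` CSV output against the expected CSV with `diff -w`, i.e. ignoring whitespace differences. A rough Python equivalent of that comparison (a sketch under that assumption, not the task's actual implementation; the sample rows are invented):

```python
import csv
import io

def normalized_rows(csv_text: str):
    """Parse CSV and strip surrounding whitespace from every cell, roughly
    matching the differences that `diff -w` ignores."""
    return [[cell.strip() for cell in row]
            for row in csv.reader(io.StringIO(csv_text)) if row]

def costs_match(actual_csv: str, expected_csv: str) -> bool:
    """True when the two CSVs agree cell-for-cell, ignoring padding."""
    return normalized_rows(actual_csv) == normalized_rows(expected_csv)

if __name__ == "__main__":
    # Hypothetical sample rows in the shape of the task's SELECT.
    actual = ("step,call,event_key,event_bytes\n"
              "GvsCreateAltAllele,PopulateAltAlleleTable,scanned,12345\n")
    expected = ("step, call, event_key, event_bytes\n"
                "GvsCreateAltAllele, PopulateAltAlleleTable, scanned, 12345\n")
    print(costs_match(actual, expected))  # True
```

Comparing parsed rows rather than raw text also sidesteps the trailing-newline and CRLF mismatches that make byte-level diffs of `bq` output brittle.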