Rework Hail script generation [VS-616] #8034

Merged: 45 commits, Sep 30, 2022

Commits
bfff9f4
Hail script generation updates
RoriCremer Sep 1, 2022
6714e51
more
mcovarr Sep 20, 2022
3b4c0f0
fix
mcovarr Sep 20, 2022
56754c7
some cleanup, still much broken
mcovarr Sep 20, 2022
c14fd9f
wip
mcovarr Sep 20, 2022
558c52e
more
mcovarr Sep 20, 2022
4d557cd
wip
mcovarr Sep 21, 2022
305d822
more
mcovarr Sep 21, 2022
46e393d
fixes?
mcovarr Sep 22, 2022
758fb4c
fixes
mcovarr Sep 23, 2022
3cc4bb4
doh
mcovarr Sep 23, 2022
3e05c30
fixes, cleanup
mcovarr Sep 23, 2022
65d853b
cleanup
mcovarr Sep 23, 2022
8909149
fixes
mcovarr Sep 23, 2022
8dd0b44
oops
mcovarr Sep 23, 2022
a5f7809
cleanup
mcovarr Sep 23, 2022
2c26d63
cleanup
mcovarr Sep 23, 2022
7eae490
minor cleanup
mcovarr Sep 23, 2022
937f0bd
doh
mcovarr Sep 23, 2022
b327a28
gahhh
mcovarr Sep 23, 2022
96dff80
shrink Docker image a lot
mcovarr Sep 26, 2022
b4e6485
new Docker image
mcovarr Sep 26, 2022
71a7642
add integration
mcovarr Sep 27, 2022
33407f6
use non-custom image if possible, larger disk
mcovarr Sep 27, 2022
9829b51
cleanup
mcovarr Sep 27, 2022
736be17
cleaner cleanup
mcovarr Sep 27, 2022
da9517c
whoops
mcovarr Sep 27, 2022
e9a2da9
try alpine
mcovarr Sep 27, 2022
bb8759c
alpine friendly command
mcovarr Sep 27, 2022
7b4da0c
grr
mcovarr Sep 27, 2022
1be9ee5
omg
mcovarr Sep 27, 2022
b2a0fc0
missed one
mcovarr Sep 27, 2022
4b1cc35
needs to use variantstore image
mcovarr Sep 27, 2022
4998421
doh
mcovarr Sep 27, 2022
1ecad81
explicit init
mcovarr Sep 27, 2022
20ce2b6
cleanup
mcovarr Sep 27, 2022
5d70894
dockstore
mcovarr Sep 27, 2022
74f60b5
Alpine the variantstore image
mcovarr Sep 28, 2022
0e84170
alpine version
mcovarr Sep 28, 2022
fe6e940
oops
mcovarr Sep 28, 2022
032dbc8
pip fixup for alt allele
mcovarr Sep 28, 2022
9841c27
revert to slim
mcovarr Sep 29, 2022
e62fa5e
404: found!
mcovarr Sep 29, 2022
97e141a
Alpine / Debian portable pseudorandomness
mcovarr Sep 29, 2022
926955c
fix words
mcovarr Sep 29, 2022
3 changes: 3 additions & 0 deletions .dockstore.yml
@@ -208,6 +208,7 @@ workflows:
branches:
- master
- ah_var_store
- vs_616_split_hail
- name: GvsIngestTieout
subclass: WDL
primaryDescriptorPath: /scripts/variantstore/wdl/GvsIngestTieout.wdl
@@ -229,13 +230,15 @@ workflows:
branches:
- master
- ah_var_store
- vs_616_split_hail
- name: GvsCallsetStatistics
subclass: WDL
primaryDescriptorPath: /scripts/variantstore/wdl/GvsCallsetStatistics.wdl
filters:
branches:
- master
- ah_var_store
- vs_616_split_hail
- name: MitochondriaPipeline
subclass: WDL
primaryDescriptorPath: /scripts/mitochondria_m2_wdl/MitochondriaPipeline.wdl
4 changes: 2 additions & 2 deletions scripts/variantstore/wdl/GvsAssignIds.wdl
@@ -183,7 +183,7 @@ task CreateCostObservabilityTable {
# not volatile: true, always run this when asked
}
command <<<
set -o xtrace
set -o errexit -o nounset -o xtrace -o pipefail

echo "project_id = ~{project_id}" > ~/.bigqueryrc
TABLE="~{dataset_name}.cost_observability"
@@ -202,7 +202,7 @@ task CreateCostObservabilityTable {
fi
>>>
runtime {
docker: "us.gcr.io/broad-dsde-methods/variantstore:ah_var_store_2022_08_16"
docker: "gcr.io/google.com/cloudsdktool/cloud-sdk:404.0.0-alpine"
}
output {
Boolean done = true
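Throughout this PR, command blocks move from a bare `set -o xtrace` to full strict mode. A minimal standalone sketch (illustrative script, not taken from the PR) of what the stricter options buy:

```shell
#!/bin/bash
# Strict mode, as adopted across these WDL command blocks:
#   errexit:  abort on the first failing command
#   nounset:  treat expansion of an unset variable as an error
#   xtrace:   echo each command before executing it (debug logging)
#   pipefail: a pipeline's status is its first failing stage, not its last stage
set -o errexit -o nounset -o xtrace -o pipefail

# With only xtrace (the old behavior), a script plows on past failures;
# with errexit + pipefail, a failing pipeline stage is surfaced. Wrapping
# the pipeline in `if` lets us observe the failure without aborting here.
if false | cat; then
    echo "unreachable: pipefail propagates the failure of the first stage"
else
    echo "pipeline failure detected"
fi
```

Note that `pipefail` is a bash feature; these WDL command blocks run under bash, but the option is not guaranteed in minimal POSIX `/bin/sh` shells.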
@@ -264,7 +264,7 @@ task Add_AS_MAX_VQSLOD_ToVcf {
File input_vcf
String output_basename

String docker = "us.gcr.io/broad-dsde-methods/variantstore:ah_var_store_2022_08_22"
String docker = "us.gcr.io/broad-dsde-methods/variantstore:2022-09-28-slim"
Int cpu = 1
Int memory_mb = 3500
Int disk_size_gb = ceil(2*size(input_vcf, "GiB")) + 50
2 changes: 1 addition & 1 deletion scripts/variantstore/wdl/GvsCallsetCost.wdl
@@ -62,7 +62,7 @@ task WorkflowComputeCosts {
>>>

runtime {
docker: "us.gcr.io/broad-dsde-methods/variantstore:ah_var_store_2022_08_22"
docker: "us.gcr.io/broad-dsde-methods/variantstore:2022-09-28-slim"
}

output {
13 changes: 7 additions & 6 deletions scripts/variantstore/wdl/GvsCallsetStatistics.wdl
@@ -85,6 +85,8 @@ task CreateTables {
command <<<
set -o errexit -o nounset -o xtrace -o pipefail

apk add jq

set +o errexit
bq --project_id=~{project_id} show ~{dataset_name}.~{metrics_table}
BQ_SHOW_METRICS=$?
@@ -378,8 +380,7 @@ task CreateTables {
fi
>>>
runtime {
# Can't use plain Google cloud sdk as this requires jq.
docker: "us.gcr.io/broad-dsde-methods/variantstore:vs_560_callset_statistics"
docker: "gcr.io/google.com/cloudsdktool/cloud-sdk:404.0.0-alpine"
disks: "local-disk 500 HDD"
}
output {
@@ -515,7 +516,7 @@ task CollectMetricsForChromosome {
Boolean done = true
}
runtime {
docker: "gcr.io/google.com/cloudsdktool/cloud-sdk:402.0.0-alpine"
docker: "gcr.io/google.com/cloudsdktool/cloud-sdk:404.0.0-alpine"
disks: "local-disk 500 HDD"
}
}
@@ -588,7 +589,7 @@ task AggregateMetricsAcrossChromosomes {
Boolean done = true
}
runtime {
docker: "gcr.io/google.com/cloudsdktool/cloud-sdk:402.0.0-alpine"
docker: "gcr.io/google.com/cloudsdktool/cloud-sdk:404.0.0-alpine"
disks: "local-disk 500 HDD"
}
}
@@ -731,7 +732,7 @@ task CollectStatistics {
Boolean done = true
}
runtime {
docker: "gcr.io/google.com/cloudsdktool/cloud-sdk:402.0.0-alpine"
docker: "gcr.io/google.com/cloudsdktool/cloud-sdk:404.0.0-alpine"
disks: "local-disk 500 HDD"
}
}
@@ -760,7 +761,7 @@ task ExportToCSV {
File callset_statistics = "~{statistics_table}.csv"
}
runtime {
docker: "gcr.io/google.com/cloudsdktool/cloud-sdk:402.0.0-alpine"
docker: "gcr.io/google.com/cloudsdktool/cloud-sdk:404.0.0-alpine"
disks: "local-disk 500 HDD"
}
}
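The CreateTables change earlier in this file installs `jq` with `apk add` (the Alpine cloud-sdk image does not ship it) and briefly drops `errexit` so a failing `bq show` can be inspected rather than abort the task. A sketch of that toggle pattern, with a hypothetical `probe` function standing in for `bq show`:

```shell
#!/bin/bash
set -o errexit -o nounset -o pipefail

# Stand-in for `bq --project_id=... show dataset.table`, which exits
# non-zero when the table does not exist; here `test -e` plays that role.
probe() { test -e "$1"; }

# Temporarily drop errexit so a non-zero probe doesn't kill the script,
# capture the status, then restore errexit: the same pattern CreateTables
# uses around `bq show` to decide whether a table needs creating.
set +o errexit
probe "/definitely/not/a/real/path"
PROBE_STATUS=$?
set -o errexit

if [ "${PROBE_STATUS}" -ne 0 ]; then
    echo "table missing, would create it"
else
    echo "table already exists"
fi
```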
2 changes: 1 addition & 1 deletion scripts/variantstore/wdl/GvsCreateVAT.wdl
@@ -124,7 +124,7 @@ task MakeSubpopulationFilesAndReadSchemaFiles {
# ------------------------------------------------
# Runtime settings:
runtime {
docker: "us.gcr.io/broad-dsde-methods/variantstore:ah_var_store_2022_08_22"
docker: "us.gcr.io/broad-dsde-methods/variantstore:2022-09-28-slim"
memory: "1 GB"
preemptible: 3
cpu: "1"
4 changes: 2 additions & 2 deletions scripts/variantstore/wdl/GvsCreateVATAnnotations.wdl
@@ -169,7 +169,7 @@ task ExtractAnAcAfFromVCF {
# ------------------------------------------------
# Runtime settings:
runtime {
docker: "us.gcr.io/broad-dsde-methods/variantstore:ah_var_store_2022_08_22"
docker: "us.gcr.io/broad-dsde-methods/variantstore:2022-09-28-slim"
maxRetries: 3
memory: "16 GB"
preemptible: 3
@@ -291,7 +291,7 @@ task PrepAnnotationJson {
# ------------------------------------------------
# Runtime settings:
runtime {
docker: "us.gcr.io/broad-dsde-methods/variantstore:ah_var_store_2022_08_22"
docker: "us.gcr.io/broad-dsde-methods/variantstore:2022-09-28-slim"
memory: "8 GB"
preemptible: 5
cpu: "1"
120 changes: 56 additions & 64 deletions scripts/variantstore/wdl/GvsExtractAvroFilesForHail.wdl
@@ -1,12 +1,14 @@
version 1.0

import "GvsUtils.wdl" as Utils


workflow GvsExtractAvroFilesForHail {
input {
String project_id
String dataset
String filter_set_name
String gcs_temporary_path
String scatter_width = 10
Int scatter_width = 10
}

call OutputPath { input: go = true }
@@ -19,38 +21,37 @@ workflow GvsExtractAvroFilesForHail {
avro_sibling = OutputPath.out
}

call CountSamples {
call Utils.CountSuperpartitions {
input:
project_id = project_id,
dataset = dataset
dataset_name = dataset
}

Int num_samples = CountSamples.num_samples
# First superpartition contains samples 1 to 4000, second 4001 to 8000 etc; add one to quotient unless exactly 4000.
Int num_superpartitions = if (num_samples % 4000 == 0) then num_samples / 4000 else (num_samples / 4000 + 1)
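The removed lines above computed the superpartition count as a ceiling division by hand (this now lives in `Utils.CountSuperpartitions`). The same pure-integer trick, sketched standalone with a hypothetical `superpartitions` helper:

```shell
#!/bin/bash
set -o errexit -o nounset

# Each superpartition holds 4000 samples, so the count is ceil(n / 4000).
# Integer-only ceiling division: add (divisor - 1) before dividing.
superpartitions() {
    echo $(( ($1 + 3999) / 4000 ))
}

superpartitions 4000   # exactly one full superpartition: 1
superpartitions 4001   # spills into a second: 2
superpartitions 12000  # 3
```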

scatter (i in range(scatter_width)) {
call ExtractFromSuperpartitionedTables {
input:
project_id = project_id,
dataset = dataset,
filter_set_name = filter_set_name,
avro_sibling = OutputPath.out,
num_superpartitions = num_superpartitions,
num_superpartitions = CountSuperpartitions.num_superpartitions,
shard_index = i,
num_shards = scatter_width
}
}

call GenerateHailScript {
call GenerateHailScripts {
input:
go_non_superpartitioned = ExtractFromNonSuperpartitionedTables.done,
go_superpartitioned = ExtractFromSuperpartitionedTables.done,
avro_prefix = ExtractFromNonSuperpartitionedTables.output_prefix,
gcs_temporary_path = gcs_temporary_path
}
output {
File hail_script = GenerateHailScript.hail_script
File hail_gvs_import_script = GenerateHailScripts.hail_gvs_import_script
File hail_create_vat_inputs_script = GenerateHailScripts.hail_create_vat_inputs_script
String vds_output_path = GenerateHailScripts.vds_output_path
String sites_only_vcf_output_path = GenerateHailScripts.sites_only_vcf_output_path
String vat_inputs_output_path = GenerateHailScripts.vat_inputs_output_path
}
}

@@ -70,40 +71,8 @@ task OutputPath {
File out = stdout()
}
runtime {
docker: "ubuntu:latest"
}
}


task CountSamples {
meta {
description: "Counts the number of samples in the sample_info table efficiently."
# Not dealing with caching for now as that would introduce a lot of complexity.
volatile: true
}
input {
String project_id
String dataset
}
command <<<
python3 <<FIN

from google.cloud import bigquery

client = bigquery.Client(project="~{project_id}")
sample_info_table_id = f'~{project_id}.~{dataset}.sample_info'
sample_info_table = client.get_table(sample_info_table_id)

print(str(sample_info_table.num_rows))

FIN
>>>

output {
Int num_samples = read_int(stdout())
}
runtime {
docker: "us.gcr.io/broad-dsde-methods/variantstore:ah_var_store_2022_08_22"
docker: "gcr.io/google.com/cloudsdktool/cloud-sdk:404.0.0-alpine"
disks: "local-disk 500 HDD"
}
}
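The deleted CountSamples task read `num_rows` from BigQuery table metadata rather than running a billed `SELECT COUNT(*)`. The same figure is reachable from the CLI; below is a sketch using a canned JSON stand-in for what `bq show --format=json` is assumed to emit (the `numRows` field name and the sample values are assumptions for illustration):

```shell
#!/bin/bash
set -o errexit -o nounset -o pipefail

# Canned stand-in for `bq show --format=json project:dataset.sample_info`;
# the real command's JSON is assumed to carry a numRows metadata field,
# which costs nothing to read, unlike scanning the table with COUNT(*).
metadata='{"tableReference":{"tableId":"sample_info"},"numRows":"8123"}'

# Extract numRows with sed to stay dependency-free (jq reads more cleanly
# where it is installed).
num_samples=$(echo "${metadata}" | sed 's;.*"numRows":"\([0-9]*\)".*;\1;')
echo "num_samples=${num_samples}"

# The superpartition arithmetic the workflow performs downstream.
echo "num_superpartitions=$(( (num_samples + 3999) / 4000 ))"
```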

@@ -173,7 +142,8 @@ task ExtractFromNonSuperpartitionedTables {
}

runtime {
docker: "gcr.io/google.com/cloudsdktool/cloud-sdk:398.0.0"
docker: "gcr.io/google.com/cloudsdktool/cloud-sdk:404.0.0-alpine"
disks: "local-disk 500 HDD"
}
}

@@ -239,14 +209,14 @@ task ExtractFromSuperpartitionedTables {
}

runtime {
docker: "gcr.io/google.com/cloudsdktool/cloud-sdk:398.0.0"
docker: "gcr.io/google.com/cloudsdktool/cloud-sdk:404.0.0-alpine"
disks: "local-disk 500 HDD"
}
}

task GenerateHailScript {
task GenerateHailScripts {
input {
String avro_prefix
String gcs_temporary_path
Boolean go_non_superpartitioned
Array[Boolean] go_superpartitioned
}
@@ -255,37 +225,59 @@ task GenerateHailScript {
volatile: true
}
parameter_meta {
gcs_temporary_path: "Path to network-visible temporary directory/bucket for intermediate file storage"
go_non_superpartitioned: "Sync on completion of non-superpartitioned extract"
go_superpartitioned: "Sync on completion of all superpartitioned extract shards"
}

command <<<
set -o errexit -o nounset -o xtrace -o pipefail

vds_output_path="$(dirname ~{avro_prefix})/gvs_export.vds"
# 4 random hex bytes to not clobber outputs if this is run multiple times for the same avro_prefix.
# Unlike many implementations, at the time of this writing this works on both Debian and Alpine based images
# so it should continue to work even if the `variantstore` image switches to Alpine:
# https://stackoverflow.com/a/34329799
rand=$(od -vN 4 -An -tx1 /dev/urandom | tr -d " ")

# The write prefix will be a sibling to the Avro "directory" that embeds the current date and some randomness.
write_prefix="$(dirname ~{avro_prefix})/$(date -Idate)-${rand}"
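The `od` invocation above is chosen for portability: the same flags behave identically under GNU coreutils (Debian) and BusyBox (Alpine). A standalone sketch of the flags and a hypothetical `write_prefix` built the same way:

```shell
#!/bin/bash
set -o errexit -o nounset -o pipefail

# 4 random bytes rendered as 8 hex characters.
#   -v    do not abbreviate repeated lines
#   -N 4  read exactly four bytes
#   -An   suppress the address column
#   -tx1  print each byte as two hex digits
rand=$(od -vN 4 -An -tx1 /dev/urandom | tr -d ' ')
echo "suffix: ${rand}"

# Typical use: a date-stamped, collision-resistant output prefix
# (bucket name invented for illustration).
write_prefix="gs://example-bucket/exports/$(date -Idate)-${rand}"
echo "${write_prefix}"
```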

vds_output_path="${write_prefix}/gvs_export.vds"
echo $vds_output_path > vds_output_path.txt

vcf_output_path="$(dirname ~{avro_prefix})/gvs_export.vcf"
echo $vcf_output_path > vcf_output_path.txt
tmpfile=$(mktemp)
# `sed` can use delimiters other than `/`. This is required here since the replacement GCS paths will
# contain `/` characters.
cat /app/hail_gvs_import.py |
sed "s;@AVRO_PREFIX@;~{avro_prefix};" |
sed "s;@WRITE_PREFIX@;${write_prefix};" > ${tmpfile}
mv ${tmpfile} hail_gvs_import.py
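Because GCS paths contain `/`, the templating above switches sed's delimiter to `;` and renders through `mktemp` before moving the result into place. The same pattern on a hypothetical two-line template (file and placeholder names invented for illustration):

```shell
#!/bin/bash
set -o errexit -o nounset -o pipefail

# Hypothetical template standing in for /app/hail_gvs_import.py.
cat > template.py <<'EOF'
avro_prefix = "@AVRO_PREFIX@"
write_prefix = "@WRITE_PREFIX@"
EOF

avro_prefix="gs://bucket/submission/avro"
write_prefix="gs://bucket/submission/2022-09-28-cafef00d"

# A `/` in the replacement would terminate a `s/.../.../` expression early,
# so use `;` as the delimiter. Render to a temp file, then move it into
# place so readers never see a half-written script.
tmpfile=$(mktemp)
sed "s;@AVRO_PREFIX@;${avro_prefix};" template.py |
    sed "s;@WRITE_PREFIX@;${write_prefix};" > "${tmpfile}"
mv "${tmpfile}" rendered.py

cat rendered.py
```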

gsutil ls -r '~{avro_prefix}' > avro_listing.txt
vcf_output_path="${write_prefix}/gvs_export.vcf"
echo $vcf_output_path > vcf_output_path.txt
sites_only_vcf_output_path="${write_prefix}/gvs_sites_only.vcf"
echo $sites_only_vcf_output_path > sites_only_vcf_output_path.txt
vat_tsv_output_path="${write_prefix}/vat_inputs.tsv"
echo $vat_tsv_output_path > vat_inputs_output_path.txt

tmpfile=$(mktemp)
cat /app/hail_create_vat_inputs.py |
sed "s;@VDS_INPUT_PATH@;${vds_output_path};" |
sed "s;@SITES_ONLY_VCF_OUTPUT_PATH@;${sites_only_vcf_output_path};" |
sed "s;@VAT_CUSTOM_ANNOTATIONS_OUTPUT_PATH@;${vat_tsv_output_path};" > ${tmpfile}
mv ${tmpfile} hail_create_vat_inputs.py

python3 /app/generate_hail_gvs_import.py \
--avro_prefix '~{avro_prefix}' \
--avro_listing_file avro_listing.txt \
--vds_output_path "${vds_output_path}" \
--vcf_output_path "${vcf_output_path}" \
--gcs_temporary_path ~{gcs_temporary_path} > hail_script.py
>>>

output {
Boolean done = true
String vds_output_path = read_string('vds_output_path.txt')
String vcf_output_path = read_string('vcf_output_path.txt')
File hail_script = 'hail_script.py'
String sites_only_vcf_output_path = read_string('sites_only_vcf_output_path.txt')
String vat_inputs_output_path = read_string('vat_inputs_output_path.txt')
File hail_gvs_import_script = 'hail_gvs_import.py'
File hail_create_vat_inputs_script = 'hail_create_vat_inputs.py'
}
runtime {
docker: "us.gcr.io/broad-dsde-methods/variantstore:rc_616_var_store_2022_09_06"
docker: "us.gcr.io/broad-dsde-methods/variantstore:2022-09-28-slim"
disks: "local-disk 500 HDD"
}
}
2 changes: 1 addition & 1 deletion scripts/variantstore/wdl/GvsImportGenomes.wdl
@@ -404,7 +404,7 @@ task CurateInputLists {
--output_files True
>>>
runtime {
docker: "us.gcr.io/broad-dsde-methods/variantstore:ah_var_store_2022_08_22"
docker: "us.gcr.io/broad-dsde-methods/variantstore:2022-09-28-slim"
memory: "3 GB"
disks: "local-disk 100 HDD"
bootDiskSizeGb: 15
2 changes: 1 addition & 1 deletion scripts/variantstore/wdl/GvsPopulateAltAllele.wdl
@@ -243,7 +243,7 @@ task PopulateAltAlleleTable {
done
>>>
runtime {
docker: "us.gcr.io/broad-dsde-methods/variantstore:vs_581_fix_withdrawn"
docker: "us.gcr.io/broad-dsde-methods/variantstore:2022-09-28-slim"
memory: "3 GB"
disks: "local-disk 10 HDD"
cpu: 1
2 changes: 1 addition & 1 deletion scripts/variantstore/wdl/GvsPrepareRangesCallset.wdl
@@ -105,7 +105,7 @@ task PrepareRangesCallsetTask {
}

runtime {
docker: "us.gcr.io/broad-dsde-methods/variantstore:vs_581_fix_withdrawn"
docker: "us.gcr.io/broad-dsde-methods/variantstore:2022-09-28-slim"
memory: "3 GB"
disks: "local-disk 100 HDD"
bootDiskSizeGb: 15