Batched Avro export [VS-630] #8020
Codecov Report
Additional details and impacted files
@@ Coverage Diff @@
## ah_var_store #8020 +/- ##
================================================
Coverage ? 86.244%
Complexity ? 35197
================================================
Files ? 2173
Lines ? 165004
Branches ? 17792
================================================
Hits ? 142306
Misses ? 16372
Partials ? 6326
Looks great! Test run with AoU 194k at https://job-manager.dsde-prod.broadinstitute.org/jobs/da1e40cb-e10b-476b-9c5e-3673582795b7.
bq query --nouse_legacy_sql --project_id=~{project_id} "
EXPORT DATA OPTIONS(
uri='${avro_prefix}/vets/vet_${str_table_index}/vet_${str_table_index}_*.avro', format='AVRO', compression='SNAPPY') AS
SELECT location, sample_id, ref, REPLACE(alt,',<NON_REF>','') alt, call_GT as GT, call_AD as AD, call_GQ as GQ, cast(SPLIT(call_pl,',')[OFFSET(0)] as int64) as RGQ
I don't think we need to extract PLs? It doesn't look like Tim uses them. Let's ask on Wednesday. (Fine to include for now; I just happened to notice it in this PR.)
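For context on what the PL extraction does per row, here is a minimal Python sketch of the two string expressions in the SELECT: `REPLACE(alt,',<NON_REF>','')` and `cast(SPLIT(call_pl,',')[OFFSET(0)] as int64)`. The function name `project_vet_row` and the sample values are hypothetical and only illustrate the string handling; this is not the export code itself.

```python
from typing import Tuple


def project_vet_row(alt: str, call_pl: str) -> Tuple[str, int]:
    """Mimic two expressions from the export query on plain Python strings.

    alt     -> REPLACE(alt, ',<NON_REF>', '')
    call_pl -> CAST(SPLIT(call_pl, ',')[OFFSET(0)] AS INT64), exported as RGQ
    """
    # Drop the trailing ',<NON_REF>' symbolic allele from the alt string.
    cleaned_alt = alt.replace(",<NON_REF>", "")
    # Keep only the first comma-separated PL value and convert it to an int.
    rgq = int(call_pl.split(",")[0])
    return cleaned_alt, rgq


# Hypothetical row values, chosen only to show the transformation.
print(project_vet_row("A,<NON_REF>", "0,10,20"))
```

So only the first PL entry ends up in the export, under the name RGQ; the rest of the PL string is discarded.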
# appropriate partition, the outer '+ 1' is to iterate over the correct number of partitions.
scatter (i in range(((CountSamples.num_samples - 1) / 4000) + 1)) {
Int num_samples = CountSamples.num_samples
Int num_superpartitions = if (num_samples % 4000 == 0) then num_samples / 4000 else (num_samples / 4000 + 1)
A comment on this math would be helpful.
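As a possible starting point for that comment: both expressions in the snippet compute the ceiling of `num_samples / 4000`, that is, the number of 4000-sample superpartitions needed to hold every sample. The Python sketch below checks that the conditional form and the `((num_samples - 1) / 4000) + 1` form agree. The constant name `SUPERPARTITION_SIZE` and the test values are assumptions for illustration, and the check is limited to `num_samples >= 1` because integer division on negative numbers in Python may not match WDL.

```python
SUPERPARTITION_SIZE = 4000  # samples per superpartition, taken from the WDL snippet


def num_superpartitions_conditional(num_samples: int) -> int:
    """Conditional form: add one partition only when there is a remainder."""
    if num_samples % SUPERPARTITION_SIZE == 0:
        return num_samples // SUPERPARTITION_SIZE
    return num_samples // SUPERPARTITION_SIZE + 1


def num_superpartitions_closed_form(num_samples: int) -> int:
    """Closed form used in the scatter: ((n - 1) / size) + 1 with integer division."""
    return (num_samples - 1) // SUPERPARTITION_SIZE + 1


# Boundary cases around multiples of 4000, plus a large illustrative count.
for n in (1, 3999, 4000, 4001, 8000, 194000):
    assert num_superpartitions_conditional(n) == num_superpartitions_closed_form(n)
    print(n, num_superpartitions_conditional(n))
```

Exactly 4000 samples fit in one superpartition, while 4001 samples need two, which is the off-by-one case the `- 1` and `+ 1` guard against.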
eed5a98 to 5e8e597
Batches the Avro export to address the scalability failings of unbatched Avro exports.