WIP extract for ranges #7640

kcibul · 2022-01-18T18:22:33Z

Some notes from the 10k tieout:

Prepare Step

~20 min per full ref_ranges table to insert
~7 min per full vet table to insert
"bytes scanned" are same as data table size

Extract

Original Run - 293 min

103 minutes pulling down data, scanning 237 GB
- 43 min on 20m vet records (20:26 -> 21:09)
- 60 min on 291m vet records (21:09 -> 22:10)
190 minutes writing the VCF

Prepare Extract with minor tuning of sorting - 134 min

25 minutes pulling down data (faster), scanning 10 GB (50x reduction)
- 4 min on 20m vet records (02:43 -> 02:47) - NOTE: 103s of that was sorting (44s) and spilling to disk (59s)
- 21 min on 291m vet records (02:47 -> 03:08) - NOTE: 9 min of that was sorting (6 min) and spilling to disk (3 min)
109 minutes writing the VCF (this is the change to pre-sort the sample set merged to ah_var_store on 1/12/22)

Tieout is identical

kcibul@kc-specops-tiny:~/stroke_tieout$ md5sum gold.jointcallset_0.vcf.gz
496178eae4afe63c4391d8eba64a9947  gold.jointcallset_0.vcf.gz

kcibul@kc-specops-tiny:~/stroke_tieout$ md5sum trial.full.jointcallset_0.vcf.gz
496178eae4afe63c4391d8eba64a9947  trial.full.jointcallset_0.vcf.gz
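For anyone who wants to repeat this kind of tieout from Java rather than the shell, here is a minimal sketch of the same check using only the JDK. The class and method names (TieoutCheckSketch, md5Hex, sameContent) are made up for illustration and are not part of this PR or of GATK; the file names are passed in as arguments.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Illustrative only: hypothetical helper, not code from this PR.
public class TieoutCheckSketch {

    // Stream the input through MD5 and return the lowercase hex digest,
    // the same format md5sum prints.
    static String md5Hex(InputStream in) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                md.update(buf, 0, n);
            }
            StringBuilder sb = new StringBuilder();
            for (byte b : md.digest()) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (NoSuchAlgorithmException e) {
            // MD5 is a required algorithm on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    // Convenience overload for in-memory data.
    static String md5Hex(byte[] data) {
        return md5Hex(new ByteArrayInputStream(data));
    }

    // Digest a file without loading it fully into memory.
    static String md5Hex(Path file) {
        try (InputStream in = Files.newInputStream(file)) {
            return md5Hex(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Two files tie out when their digests match.
    static boolean sameContent(Path a, Path b) {
        return md5Hex(a).equals(md5Hex(b));
    }

    public static void main(String[] args) {
        if (args.length != 2) {
            System.err.println("usage: TieoutCheckSketch <gold.vcf.gz> <trial.vcf.gz>");
            return;
        }
        Path gold = Paths.get(args[0]);
        Path trial = Paths.get(args[1]);
        System.out.println(md5Hex(gold) + "  " + gold);
        System.out.println(md5Hex(trial) + "  " + trial);
        System.out.println(sameContent(gold, trial) ? "tieout: identical" : "tieout: DIFFERENT");
    }
}
```

This compares the compressed bytes, exactly as md5sum does above, so it only reports a match when the two .vcf.gz files are byte-for-byte identical.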

RoriCremer · 2022-01-19T13:44:10Z

src/main/java/org/broadinstitute/hellbender/tools/gvs/extract/ExtractCohortEngine.java

@@ -321,7 +330,7 @@ public int compare( GenericRecord o1, GenericRecord o2 ) {

        for (final GenericRecord queryRow : avroReader) {
            long location = (Long) queryRow.get(SchemaUtils.LOCATION_FIELD_NAME);
-            int length = Integer.parseInt(queryRow.get(SchemaUtils.LENGTH_FIELD_NAME).toString());
+            int length = ((Long) queryRow.get(SchemaUtils.LENGTH_FIELD_NAME)).intValue();


(for my edu only) Is it better to not convert to a string in the first place?

It's expensive to convert a number to a string and then parse the string to get back another number. The result from get() is already a Long, so we just have to cast it as such. BigQuery doesn't return an int, but we know the value is an int and want it as such, so we call intValue() on it.
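A small standalone sketch of the two conversions discussed above, for comparison. It assumes a plain Map<String, Object> as a stand-in for the Avro GenericRecord (whose get also returns Object), and the field name "length" is a placeholder for SchemaUtils.LENGTH_FIELD_NAME; none of these names come from the PR. The third variant, using Math.toIntExact, is not what the PR does; it is shown only as an option if an overflow should fail loudly.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative only: a Map stands in for the Avro GenericRecord.
public class LengthConversionSketch {

    // Placeholder for SchemaUtils.LENGTH_FIELD_NAME (assumed value).
    static final String LENGTH_FIELD_NAME = "length";

    // Old approach: Long -> String -> int. Allocates a String and
    // re-parses digits for every row.
    static int viaString(Map<String, Object> row) {
        return Integer.parseInt(row.get(LENGTH_FIELD_NAME).toString());
    }

    // New approach from the diff: the value is already a Long, so cast
    // it and narrow. intValue() silently truncates if the value does
    // not fit in an int.
    static int viaCast(Map<String, Object> row) {
        return ((Long) row.get(LENGTH_FIELD_NAME)).intValue();
    }

    // Optional stricter variant (not in the PR): throws
    // ArithmeticException instead of truncating on overflow.
    static int viaCheckedCast(Map<String, Object> row) {
        return Math.toIntExact((Long) row.get(LENGTH_FIELD_NAME));
    }

    public static void main(String[] args) {
        Map<String, Object> row = new HashMap<>();
        row.put(LENGTH_FIELD_NAME, 1000L);

        System.out.println(viaString(row));
        System.out.println(viaCast(row));
        System.out.println(viaCheckedCast(row));
    }
}
```

One behavioral difference worth knowing: for a value outside the int range, Integer.parseInt throws NumberFormatException, whereas intValue() wraps around without complaint. That is harmless as long as the lengths really do fit in an int, which is the assumption stated in the reply above.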

RoriCremer · 2022-01-19T13:49:23Z

src/main/java/org/broadinstitute/hellbender/tools/gvs/extract/ExtractCohortEngine.java

+    }
+
+    private SortingCollection<GenericRecord> createSortedReferenceRangeCollectionFromExtractTableBigQuery(final String projectID,
+                                                                                              final String fqRefTable,


nit: spacing looks odd here

rsasch

LGTM 👍🏻

kcibul added 8 commits January 18, 2022 13:22

WIP extract for ranges

b7a6bba

fixed signature in test

95280a4

updated WDLs

83a32af

updated name

4d475c8

type

ee0ea96

bash typos

7ba3a74

stop converting long -> string -> long

c0e684c

WDL bump

d0bd8dd

RoriCremer reviewed Jan 19, 2022

RoriCremer approved these changes Jan 19, 2022

rsasch approved these changes Jan 19, 2022

kcibul merged commit 3aa74a5 into ah_var_store Jan 19, 2022

kcibul deleted the kc_ranges_prepare branch January 19, 2022 15:25

This was referenced Mar 17, 2023

lb merge gvs branch #8248

Closed

testing something, please ignore #8251

Closed
