Extract Performance Improvements #7686

kcibul · 2022-02-18T16:33:38Z

Three main performance optimizations:

Avro Parsing: More efficient parsing and representation of primitive types in Avro-based records (ExtractCohortRecord, ReferenceRecord). We previously called toString() and then parseLong() on every field, even when the value was already the right datatype.
Inferred State: We keep track of which samples have been seen so that we can later determine which samples have not been seen at each site. The previous data structures were slow with 100k samples and lots of variants, so this moves to a TreeSet and a BitSet.
Reference Genotypes: Add reference genotypes in bulk (via ReferenceGenotypeInfo rather than a heavy VariantContext) instead of one at a time.
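
A minimal sketch of the Avro parsing change described above. To stay self-contained it uses a plain Map as a stand-in for an Avro GenericRecord, and the class and method names are hypothetical rather than taken from the PR:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for an Avro-backed record: field values arrive as boxed Objects.
class PrimitiveFieldSketch {

    // Old pattern: round-trip through a String even though the value is already a Long.
    static long readLongViaString(Map<String, Object> record, String field) {
        return Long.parseLong(record.get(field).toString());
    }

    // New pattern: cast the boxed value directly, skipping the String allocation and the parse.
    static long readLongDirect(Map<String, Object> record, String field) {
        return (Long) record.get(field);
    }

    public static void main(String[] args) {
        Map<String, Object> record = new HashMap<>();
        record.put("length", 1000L); // "length" is a field name that appears in the diff
        // Both reads return the same value; only the amount of work differs.
        System.out.println(readLongViaString(record, "length") == readLongDirect(record, "length"));
    }
}
```

The direct cast only works when the field really holds a Long, so a schema change would surface as a ClassCastException instead of being silently re-parsed.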

More Details from profiling

https://docs.google.com/spreadsheets/d/1aA7LKgPsaELiGurw95qVX1PwGt54I5rn1h_fAAhkPMo/edit#gid=0

gbggrant · 2022-04-06T19:42:24Z

scripts/variantstore/wdl/GvsExtractCallset.wdl

@@ -15,7 +15,7 @@ workflow GvsExtractCallset {

    File interval_list = "gs://gcp-public-data--broad-references/hg38/v0/wgs_calling_regions.hg38.noCentromeres.noTelomeres.interval_list"
    File interval_weights_bed = "gs://broad-public-datasets/gvs/weights/gvs_vet_weights_1kb.bed"
-    File gatk_override = "gs://broad-dsp-spec-ops/scratch/bigquery-jointcalling/jars/ah_var_store_20220406/gatk-package-4.2.0.0-480-gb62026a-SNAPSHOT-local.jar"
+    File gatk_override = "gs:////broad-dsp-spec-ops/scratch/bigquery-jointcalling/jars/kc_extract_perf_20220404/gatk-package-4.2.0.0-485-g86fd5ac-SNAPSHOT-local.jar"


Extra slashes?

Suggested change

File gatk_override = "gs:////broad-dsp-spec-ops/scratch/bigquery-jointcalling/jars/kc_extract_perf_20220404/gatk-package-4.2.0.0-485-g86fd5ac-SNAPSHOT-local.jar"

File gatk_override = "gs://broad-dsp-spec-ops/scratch/bigquery-jointcalling/jars/kc_extract_perf_20220404/gatk-package-4.2.0.0-485-g86fd5ac-SNAPSHOT-local.jar"

lol I love me some extra slashes!

gbggrant · 2022-04-06T19:44:24Z

src/main/java/org/broadinstitute/hellbender/tools/gvs/extract/ExtractCohortEngine.java

+            throw new GATKException("Sample Ids > " + Integer.MAX_VALUE + " are not supported");
+        }
+
+        this.sampleIdsToExtractBitSet = new BitSet(sampleIdsToExtract.last().intValue());


Since you already have it in a local?

Suggested change

this.sampleIdsToExtractBitSet = new BitSet(sampleIdsToExtract.last().intValue());

this.sampleIdsToExtractBitSet = new BitSet(maxSampleId.intValue());

mcovarr · 2022-04-06T22:06:59Z

src/main/java/org/broadinstitute/hellbender/tools/gvs/extract/ExtractCohortEngine.java

-                case "u":   // unknown GQ used for array data
-                    unmergedCalls.add(createRefSiteVariantContext(sampleName, contig, currentPosition, refAllele));
-                    break;


Are we sure we'll really never see a "u" anymore (especially given the explodey default)?

yeah 'u' isn't a state we encode anywhere… it was for arrays support which we removed ages ago

mcovarr · 2022-04-06T22:29:37Z

src/main/java/org/broadinstitute/hellbender/tools/gvs/extract/ReferenceGenotypeInfo.java

+    private String sampleName;
+    private int GQ;


my IntelliJ points out these could be final 🤷

kcibul · 2022-04-07T18:27:25Z

src/main/java/org/broadinstitute/hellbender/tools/gvs/extract/ExtractCohortEngine.java

+            throw new GATKException("Sample Ids > " + Integer.MAX_VALUE + " are not supported");
+        }
+
+        this.sampleIdsToExtractBitSet = new BitSet((int) maxSampleId);


Maybe add +1… this is zero-based
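
A small sketch of why the +1 matters, assuming sample ids are used directly as bit indices. The helper name is hypothetical. Note that with the +1 the guard also has to reject Integer.MAX_VALUE itself, otherwise the int addition overflows:

```java
import java.util.BitSet;

class BitSetSizingSketch {

    // Hypothetical helper: size the BitSet so that index maxSampleId itself fits
    // in the initial allocation (new BitSet(n) covers indices 0 .. n-1).
    static BitSet forMaxSampleId(long maxSampleId) {
        // With the +1, Integer.MAX_VALUE would overflow to a negative size, so reject it too.
        if (maxSampleId < 0 || maxSampleId >= Integer.MAX_VALUE) {
            throw new IllegalArgumentException("Sample Ids >= " + Integer.MAX_VALUE + " are not supported");
        }
        return new BitSet((int) maxSampleId + 1);
    }

    public static void main(String[] args) {
        BitSet exact = new BitSet(64);        // initial room for bits 0..63 only
        BitSet plusOne = forMaxSampleId(64);  // initial room for bits 0..64
        System.out.println(exact.size() + " vs " + plusOne.size());
        exact.set(64);                        // still legal: a BitSet grows on demand
        System.out.println(exact.get(64));
    }
}
```

Because java.util.BitSet grows as needed, leaving out the +1 costs a reallocation when the largest id is set rather than causing an error.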

gbggrant

Looks good - thanks for the walk through

gbggrant · 2022-04-07T19:03:52Z

src/main/java/org/broadinstitute/hellbender/tools/gvs/extract/ExtractCohortEngine.java

+        samplesNotEncountered.xor(sampleIdsToExtractBitSet);
+
+        // Iterate through the samples not encountered
+        for (int sampleId = samplesNotEncountered.nextSetBit(0); sampleId >= 0; sampleId = samplesNotEncountered.nextSetBit(sampleId+1)) {


I find this for loop kind of confusing - might be clearer to:
for (long sampleId : samplesNotEncountered.toLongArray()) {
(and then you wouldn't need the Long.valueOf(sampleId) on line 600).

But that might not end up scaling so well?

yeah -- it is a little confusing, but I think what you're proposing would give back the array of longs that back the bitset, and then you'd iterate through those values. I'm going to pretend a long is 8 bits for a minute. If you made a BitSet(8) and then set bits 0, 1, and 2, you would get back a single long with a value of 7 (bits 11100000, written lowest bit first).

Sorry, yeah, I completely misunderstood what that method does. By the name, I thought it returned [0L, 1L, 2L], which would be useful I think.
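
To make the difference concrete, here is a small sketch using only java.util.BitSet. The class and method names are hypothetical. It contrasts toLongArray(), which returns the backing words, with the nextSetBit loop from the diff and with BitSet.stream(), which does yield the set indices:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.BitSet;
import java.util.List;

class BitSetIterationSketch {

    // Collect the indices of all set bits using the nextSetBit idiom from the diff.
    static List<Integer> setBitIndices(BitSet bits) {
        List<Integer> indices = new ArrayList<>();
        for (int i = bits.nextSetBit(0); i >= 0; i = bits.nextSetBit(i + 1)) {
            indices.add(i);
        }
        return indices;
    }

    public static void main(String[] args) {
        BitSet bits = new BitSet(8);
        bits.set(0);
        bits.set(1);
        bits.set(2);

        // Backing words, not indices: one long whose low three bits are set.
        System.out.println(Arrays.toString(bits.toLongArray()));      // [7]
        // Indices of the set bits, via the explicit loop.
        System.out.println(setBitIndices(bits));                      // [0, 1, 2]
        // Same indices as an IntStream, if a more declarative form is preferred.
        System.out.println(Arrays.toString(bits.stream().toArray())); // [0, 1, 2]
    }
}
```

Whether stream() would be fast enough at 100k samples is not something this sketch measures; the explicit loop avoids any stream overhead.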

gbggrant · 2022-04-07T19:13:50Z

src/main/java/org/broadinstitute/hellbender/tools/gvs/extract/ReferenceRecord.java

+
+        int length = Math.toIntExact((Long) genericRecord.get("length"));
+        this.end = this.start + length - 1;
+        this.endLocation = this.location + + length - 1;


What is "+ +"??
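
For reference, Java parses `a + + b` as `a + (+b)`, so the second `+` is a unary plus and the line compiles with the intended value. A short sketch, assuming location is a long and length an int (the method names are hypothetical):

```java
class UnaryPlusSketch {

    // As written in the diff: the second '+' is a unary plus applied to length.
    static long endWithDoublePlus(long location, int length) {
        return location + + length - 1;
    }

    // Same value with the stray operator removed.
    static long endWithSinglePlus(long location, int length) {
        return location + length - 1;
    }

    public static void main(String[] args) {
        // Both forms agree, so the extra '+' is a harmless typo rather than a bug.
        System.out.println(endWithDoublePlus(100L, 10) == endWithSinglePlus(100L, 10));
    }
}
```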

kcibul added 5 commits April 4, 2022 17:10

presorted avro files, fix performance issue

9baea97

PR feedback

5b9fc64

remove unnecessary annotations, change seen samples data structure

7311429

performance improvements (see spreadsheet)

cfb64f1

fixed casting bug

86fd5ac

kcibul force-pushed the kc_extract_perf branch from f0a3049 to 86fd5ac Compare April 4, 2022 21:14

kcibul added 6 commits April 4, 2022 17:23

new jar; remove threading

0dd0aa5

removed annotation optimization

c180981

remove unused imports

420437f

simplify

bdda1cb

refactoring

a3f0bfc

Merge branch 'ah_var_store' into kc_extract_perf

92dd85e

kcibul marked this pull request as ready for review April 6, 2022 19:22

gbggrant reviewed Apr 6, 2022

mcovarr reviewed Apr 6, 2022

kcibul added 2 commits April 6, 2022 20:48

PR comments

ba6d91e

typo

94f4fd3

kcibul commented Apr 7, 2022

gbggrant approved these changes Apr 7, 2022

PR feedback

d33bd9f

kcibul merged commit 1f490f0 into ah_var_store Apr 7, 2022

kcibul deleted the kc_extract_perf branch April 7, 2022 20:37

This was referenced Mar 17, 2023

lb merge gvs branch #8248

Closed

testing something, please ignore #8251

Closed
