-
Notifications
You must be signed in to change notification settings - Fork 596
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Extract Performance Improvements #7686
Conversation
@@ -15,7 +15,7 @@ workflow GvsExtractCallset { | |||
|
|||
File interval_list = "gs://gcp-public-data--broad-references/hg38/v0/wgs_calling_regions.hg38.noCentromeres.noTelomeres.interval_list" | |||
File interval_weights_bed = "gs://broad-public-datasets/gvs/weights/gvs_vet_weights_1kb.bed" | |||
File gatk_override = "gs://broad-dsp-spec-ops/scratch/bigquery-jointcalling/jars/ah_var_store_20220406/gatk-package-4.2.0.0-480-gb62026a-SNAPSHOT-local.jar" | |||
File gatk_override = "gs:////broad-dsp-spec-ops/scratch/bigquery-jointcalling/jars/kc_extract_perf_20220404/gatk-package-4.2.0.0-485-g86fd5ac-SNAPSHOT-local.jar" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Extra slashes?
File gatk_override = "gs:////broad-dsp-spec-ops/scratch/bigquery-jointcalling/jars/kc_extract_perf_20220404/gatk-package-4.2.0.0-485-g86fd5ac-SNAPSHOT-local.jar" | |
File gatk_override = "gs://broad-dsp-spec-ops/scratch/bigquery-jointcalling/jars/kc_extract_perf_20220404/gatk-package-4.2.0.0-485-g86fd5ac-SNAPSHOT-local.jar" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lol I love me some extra slashes!
throw new GATKException("Sample Ids > " + Integer.MAX_VALUE + " are not supported"); | ||
} | ||
|
||
this.sampleIdsToExtractBitSet = new BitSet(sampleIdsToExtract.last().intValue()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since you already have it in a local?
this.sampleIdsToExtractBitSet = new BitSet(sampleIdsToExtract.last().intValue()); | |
this.sampleIdsToExtractBitSet = new BitSet(maxSampleId.intValue()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
case "u": // unknown GQ used for array data | ||
unmergedCalls.add(createRefSiteVariantContext(sampleName, contig, currentPosition, refAllele)); | ||
break; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we sure we'll really never see a "u" anymore (especially given the explodey default
)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah 'u' isn't a state we encode anywhere… it was for arrays support which we removed ages ago
private String sampleName; | ||
private int GQ; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
my IntelliJ points out these could be final
🤷
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
throw new GATKException("Sample Ids > " + Integer.MAX_VALUE + " are not supported"); | ||
} | ||
|
||
this.sampleIdsToExtractBitSet = new BitSet((int) maxSampleId); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe add +1… this is zero-based
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good - thanks for the walk through
samplesNotEncountered.xor(sampleIdsToExtractBitSet); | ||
|
||
// Iterate through the samples not encountered | ||
for (int sampleId = samplesNotEncountered.nextSetBit(0); sampleId >= 0; sampleId = samplesNotEncountered.nextSetBit(sampleId+1)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I find this for loop kind of confusing - might be clearer to:
for (long sampleId : samplesNotEncountered.toLongArray()) {
(and then you wouldn't need to Long.valueOf(sampleId) on line 600.
But that might not end up scaling so well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah -- it is a little confusing, but I think what you're proposing would give back an array of longs that back the bitset, and then you're iterate through those values. I'm going to pretend a long is 8-bits for a minute. If you made a BitSet(8) and then set bits 0,1,2 you would get back a single long with a value of "7" (bits 11100000).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, yeah, I completely misunderstood what that method does. By the name, I though it returned [0L,1L, 2L], which would be useful I think.
|
||
int length = Math.toIntExact((Long) genericRecord.get("length")); | ||
this.end = this.start + length - 1; | ||
this.endLocation = this.location + + length - 1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is "+ +"??
Three main performance optimizations:
Avro Parsing: More efficient parsing and representation of primitive types in Avro-based records (ExtractCohortRecord, ReferenceRecord). We previously called toString() and then parseLong() on everything, even though it was already the right datatype
Inferred State: we keep track of which samples have been seen, so that later we can determine which samples have not been seen for each site. The data structures here were slow with 100k samples and lots of variants. Moved to using a TreeSet and BitSet
Reference Genotypes: Add reference genotypes in bulk (via ReferenceGenotypeInfo, rather than a heavy Variant Context) rather than one at a time
More Details from profiling
https://docs.google.com/spreadsheets/d/1aA7LKgPsaELiGurw95qVX1PwGt54I5rn1h_fAAhkPMo/edit#gid=0