WIP extract for ranges #7640
@@ -321,7 +330,7 @@ public int compare( GenericRecord o1, GenericRecord o2 ) {

 for (final GenericRecord queryRow : avroReader) {
 long location = (Long) queryRow.get(SchemaUtils.LOCATION_FIELD_NAME);
- int length = Integer.parseInt(queryRow.get(SchemaUtils.LENGTH_FIELD_NAME).toString());
+ int length = ((Long) queryRow.get(SchemaUtils.LENGTH_FIELD_NAME)).intValue();
(For my education only) Is it better not to convert to a string in the first place?
It's expensive to convert a number to a string and then parse the string to get back another number. The result from get is already a Long, so we just have to cast it as such. BigQuery doesn't return an int, but we know the value is an int and want it as such, so we call intValue() on it.
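For reference, a minimal sketch of the two approaches side by side. The class name, the map standing in for a GenericRecord, and the "length" key are hypothetical and not taken from this PR; only the parseInt and intValue() expressions mirror the diff. The checked variant using Math.toIntExact is an optional alternative, not something the PR does.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical demo class; a Map stands in for the Avro GenericRecord.
class LengthFieldDemo {
    // Assumed placeholder for SchemaUtils.LENGTH_FIELD_NAME.
    static final String LENGTH_FIELD_NAME = "length";

    // Old approach: number -> String -> number (extra allocation and parsing).
    static int lengthViaString(Object value) {
        return Integer.parseInt(value.toString());
    }

    // New approach: the value is already a Long, so cast and narrow.
    // Note that intValue() silently truncates values outside the int range.
    static int lengthViaCast(Object value) {
        return ((Long) value).intValue();
    }

    // Optional stricter variant: throws ArithmeticException on overflow.
    static int lengthChecked(Object value) {
        return Math.toIntExact((Long) value);
    }

    public static void main(String[] args) {
        Map<String, Object> row = new HashMap<>();
        row.put(LENGTH_FIELD_NAME, 1000L); // stored boxed as a Long

        Object raw = row.get(LENGTH_FIELD_NAME);
        System.out.println(lengthViaString(raw));
        System.out.println(lengthViaCast(raw));
        System.out.println(lengthChecked(raw));
    }
}
```

All three methods return the same int for in-range values. They differ only in cost and in how an out-of-range Long is handled.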
}

private SortingCollection<GenericRecord> createSortedReferenceRangeCollectionFromExtractTableBigQuery(final String projectID,
final String fqRefTable,
nit: spacing looks odd here
LGTM 👍🏻
Some notes from the 10k tieout:
Prepare Step
Extract
  Original Run - 293 min
    103 minutes pulling down data, scanning 237 GB
    190 minutes writing the VCF
  Prepare Extract with minor tuning of sorting - 134 min
    25 minutes pulling down data (faster), scanning 10 GB (50x reduction)
    109 minutes writing the VCF (this is the change to pre-sort the sample set, merged to ah_var_store on 1/12/22)
Tieout is identical