UPDATE: @lbergelson and I started creating this as an extension to SplitIntervals, but it quickly became too complex to fit into that framework/abstraction, so we decided to create a specialized tool for GVS instead.
It would be valuable for SplitIntervals to be able to split intervals based not on number of genomic bases, but by using a set of weights.
Ideally this new mode would read in a BED file containing the weights in the score field and attempt to produce a series of intervals that have equal total weights.
Note: `--dont-mix-contigs` should continue to work.
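To make the request concrete, here is a minimal Python sketch of the weighted split described above. It is not the GATK implementation: `parse_bed`, `split_by_weight`, and the `dont_mix_contigs` parameter are hypothetical names, the input is assumed to be a 5-column BED sorted by contig and position, and whole intervals are assigned to shards rather than being cut at a weight boundary.

```python
from collections import namedtuple

WeightedInterval = namedtuple("WeightedInterval", "contig start end weight")


def parse_bed(lines):
    """Parse BED lines (contig, start, end, name, score); the score is the weight."""
    intervals = []
    for line in lines:
        line = line.strip()
        # Skip blank lines and the optional BED header lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split("\t")
        intervals.append(
            WeightedInterval(fields[0], int(fields[1]), int(fields[2]), float(fields[4]))
        )
    return intervals


def split_by_weight(intervals, num_shards, dont_mix_contigs=False):
    """Group sorted intervals into shards of roughly equal total weight.

    Each interval goes to the shard that contains the midpoint of its
    cumulative weight range. With dont_mix_contigs=True a shard is also
    closed at every contig change, so more than num_shards shards can result.
    """
    total = sum(iv.weight for iv in intervals)
    if num_shards < 1 or total <= 0:
        raise ValueError("need at least one shard and a positive total weight")
    target = total / num_shards

    shards = []
    cumulative = 0.0
    last_key = None
    for iv in intervals:
        midpoint = cumulative + iv.weight / 2
        bucket = min(int(midpoint / target), num_shards - 1)
        key = (bucket, iv.contig) if dont_mix_contigs else bucket
        if key != last_key:
            shards.append([])
            last_key = key
        shards[-1].append(iv)
        cumulative += iv.weight
    return shards
```

A real tool would probably also need to subdivide a single heavy interval, which this sketch does not attempt.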
**Why?**
In the Genomic Variant Store, we have found that scattering work by "# of genomic bases" does not lead to even runtimes for the shards.
Instead, we have found that the number of variants contained in a given interval is an excellent proxy for runtime.
Furthermore, this generalizes even when we use a subset of a different dataset.
The weights can be generated from a BigQuery dataset like:
CREATE OR REPLACE TABLE `example.mydataset.vet_weight_100k` AS
SELECT CAST(TRUNC(location / 100000) * 100000 AS INT64) bin, count(*) entries
FROM `example.mydataset.vet_001`
GROUP BY 1 ORDER BY 1;
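The query's `(bin, entries)` rows still need to become BED records before they can serve as weights. A hedged Python sketch of that conversion follows. It assumes, without confirmation from this thread, that `location` packs a contig index and a position as `contig_index * 10**12 + position`. `CONTIG_MULTIPLIER`, `bin_to_bed_line`, and the `contig_names` mapping are hypothetical names, and any 0-based vs. 1-based coordinate adjustment is left out.

```python
CONTIG_MULTIPLIER = 10**12  # assumption: how `location` encodes the contig
BIN_SIZE = 100_000          # matches the 100000 used in the query above


def bin_to_bed_line(bin_location, entries, contig_names):
    """Turn one (bin, entries) row into a 5-column BED line.

    contig_names maps the decoded contig index to a contig name;
    the entry count is written to the BED score field as the weight.
    """
    contig_index, start = divmod(bin_location, CONTIG_MULTIPLIER)
    contig = contig_names[contig_index]
    return f"{contig}\t{start}\t{start + BIN_SIZE}\t.\t{entries}"
```

If the real encoding differs, only the `divmod` step should need to change.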
FWIW -- the intervals we're trying to divide are gs://gcp-public-data--broad-references/hg38/v0/wgs_calling_regions.hg38.noCentromeres.noTelomeres.interval_list
kcibul changed the title from "Add new INTERVAL_SUBDIVISION_BY_WEIGHT IntervalListScatterMode for SplitIntervals" to "Create new tool to split interval list by weighted intervals" on Jan 20, 2022