Create new tool to split interval list by weighted intervals #7622

Open
kcibul opened this issue Jan 3, 2022 · 2 comments

kcibul commented Jan 3, 2022

UPDATE: @lbergelson and I started creating this as an extension to SplitIntervals, but it quickly became very complex to fit into that framework/abstraction, so we decided to create a specialized tool for GVS.

It would be valuable for SplitIntervals to be able to split intervals based not on number of genomic bases, but by using a set of weights.

Ideally this new mode would read in a BED file containing the weights in the score field and attempt to produce a series of intervals that have equal total weights.

Note: --dont-mix-contigs should still continue to work
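As a rough illustration of the requested behavior, here is a minimal sketch of a greedy split over weighted BED bins. It is not the GATK implementation; the names `WeightedBin`, `parse_bed`, and `split_by_weight` are hypothetical, and the sketch assumes a five-column, tab-separated BED with the weight in the score field. It never splits a bin and does not intersect the bins with an interval list, so it can return more shards than requested.

```python
from typing import List, NamedTuple


class WeightedBin(NamedTuple):
    contig: str
    start: int     # 0-based start, BED style
    end: int       # exclusive end, BED style
    weight: float  # taken from the BED score column


def parse_bed(lines) -> List[WeightedBin]:
    """Parse tab-separated BED lines: contig, start, end, name, score."""
    bins = []
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        contig, start, end, _name, score = line.rstrip("\n").split("\t")[:5]
        bins.append(WeightedBin(contig, int(start), int(end), float(score)))
    return bins


def split_by_weight(bins, n_shards, dont_mix_contigs=False):
    """Greedily pack consecutive bins into shards of roughly equal total weight.

    A shard is closed when adding the next bin would exceed the per-shard
    target, or (optionally) when the next bin is on a different contig.
    """
    target = sum(b.weight for b in bins) / n_shards
    shards, current, current_weight = [], [], 0.0
    for b in bins:
        crosses_contig = (
            dont_mix_contigs and bool(current) and current[-1].contig != b.contig
        )
        if current and (crosses_contig or current_weight + b.weight > target):
            shards.append(current)
            current, current_weight = [], 0.0
        current.append(b)
        current_weight += b.weight
    if current:
        shards.append(current)
    return shards
```

A real tool would presumably also need to subdivide heavy bins and restrict the output to the calling regions; both are left out here to keep the idea visible.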

**Why?**
In the Genomic Variant Store, we have found that scattering work by "# of genomic bases" does not lead to even runtimes for the shards.

image

Instead we have found that an excellent proxy for runtime is the number of variants contained in a given interval:

image

Furthermore, this generalizes even when we use a subset of a different dataset:

image

kcibul commented Jan 3, 2022

The weights can be generated from a BQ data set like:

CREATE OR REPLACE TABLE `example.mydataset.vet_weight_100k` AS
SELECT CAST(TRUNC(location / 100000) * 100000 AS INT64) bin, count(*) entries
FROM `example.mydataset.vet_001`
GROUP BY 1 ORDER BY 1;

and then converted to bed with this python:

import pandas as pd

# The arithmetic below implies location = contig_index * location_offset + position
location_offset = 1000000000000
binsize_kb = 100
infile = f"40K_vet_weight_{binsize_kb}k.csv"

w = pd.read_csv(infile)
bins = w['bin'].astype(int)

# Contig index is the high part of the encoded bin; 23 and 24 become X and Y
contig_index = bins // location_offset
w['contig'] = "chr" + contig_index.astype(str).replace({"23": "X", "24": "Y"})

# Position within the contig is the remainder; each bin spans binsize_kb kilobases
w['start_position'] = bins % location_offset
w['end_position'] = w['start_position'] + binsize_kb * 1000
w['name'] = "."
o = w[['contig', 'start_position', 'end_position', 'name', 'entries']]

# Headerless, tab-separated BED with the weight in the score column
o.to_csv(f"gvs_vet_weights_{binsize_kb}kb.bed", sep='\t', index=False, header=False)
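
For readers unfamiliar with the encoding, the script appears to treat each `bin` value as a contig index multiplied by `location_offset` plus a position, with 23 and 24 standing for X and Y. That reading is inferred from the arithmetic above rather than stated in this issue. A small sketch of the same decoding in plain Python, using a hypothetical helper `decode_bin`:

```python
LOCATION_OFFSET = 1_000_000_000_000  # same value as location_offset in the script


def decode_bin(bin_value: int, binsize: int):
    """Turn an encoded bin start into a (contig, start, end) BED triple.

    Assumes bin_value = contig_index * LOCATION_OFFSET + position,
    with contig indices 23 and 24 meaning X and Y.
    """
    contig_index, start = divmod(bin_value, LOCATION_OFFSET)
    name = {23: "X", 24: "Y"}.get(contig_index, str(contig_index))
    return f"chr{name}", start, start + binsize


print(decode_bin(1_000_000_100_000, 100_000))   # ('chr1', 100000, 200000)
print(decode_bin(23_000_000_000_000, 100_000))  # ('chrX', 0, 100000)
```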

I've generated three BED files, with 100 kb, 10 kb, and 1 kb bins, and placed them in:

929.42 KiB  2022-01-03T18:56:48Z  gs://broad-dsp-spec-ops/gvs/weights/gvs_vet_weights_100kb.bed
  8.78 MiB  2022-01-03T18:56:50Z  gs://broad-dsp-spec-ops/gvs/weights/gvs_vet_weights_10kb.bed
 84.65 MiB  2022-01-03T18:56:56Z  gs://broad-dsp-spec-ops/gvs/weights/gvs_vet_weights_1kb.bed

kcibul commented Jan 3, 2022

FWIW -- the intervals we're trying to divide are gs://gcp-public-data--broad-references/hg38/v0/wgs_calling_regions.hg38.noCentromeres.noTelomeres.interval_list

kcibul changed the title from "Add new INTERVAL_SUBDIVISION_BY_WEIGHT IntervalListScatterMode for SplitIntervals" to "Create new tool to split interval list by weighted intervals" on Jan 20, 2022