new workflow: subsample_by_metadata_with_focal #161

dpark01 · 2020-10-11T01:35:42Z

This PR adds a new WDL workflow, subsample_by_metadata_with_focal, which automates a procedure I had been doing manually/offline until recently. It subsets an externally provided dataset of sequences to a more manageable sample size with even representation across metadata-defined categories (e.g. geographic ones). It does so by splitting the external dataset into two categories, one termed focal and one termed global, applying different sample count limits (and bin resolutions) to each, and then merging the two subsamples. Current defaults are based on GISAID column headers and use North America as the focal region, but these defaults might be removed in the future so the workflow stays agnostic to data source and user base.
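To illustrate the split / subsample / merge idea, here is a minimal Python sketch. It is not the workflow's actual implementation (which is written in WDL); the function names, parameters, and the metadata column names used in the usage note are all hypothetical.

```python
import random
from collections import defaultdict


def subsample_per_group(records, group_by, max_per_group, rng):
    """Keep at most max_per_group records from each metadata-defined bin."""
    bins = defaultdict(list)
    for rec in records:
        # The bin key is the tuple of values in the grouping columns.
        bins[tuple(rec[col] for col in group_by)].append(rec)
    kept = []
    for key in sorted(bins):
        members = bins[key]
        kept.extend(rng.sample(members, min(max_per_group, len(members))))
    return kept


def subsample_with_focal(records, focal_column, focal_value,
                         focal_group_by, focal_max,
                         global_group_by, global_max, seed=0):
    """Split records into focal and global sets, subsample each with its
    own bin resolution and per-bin limit, then merge the results."""
    rng = random.Random(seed)
    focal = [r for r in records if r[focal_column] == focal_value]
    non_focal = [r for r in records if r[focal_column] != focal_value]
    return (subsample_per_group(focal, focal_group_by, focal_max, rng)
            + subsample_per_group(non_focal, global_group_by, global_max, rng))
```

For example, calling `subsample_with_focal(records, "region", "North America", ["division"], 3, ["country"], 2)` would bin the focal records finely (by a hypothetical `division` column) and the remaining records coarsely (by `country`), with a different cap per bin for each set.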

This workflow can optionally accept a "priorities" matrix so that the random sampling is biased towards the user's genomes of interest. However, a future version of this workflow should just accept the user's genomes of interest instead and compute the priorities matrix for them.
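One way to picture priority-biased selection within a single bin is sketched below. This is an assumption for illustration only: it treats the priorities input as a mapping from strain name to a numeric score, prefers higher scores, and breaks ties randomly. The workflow's real selection rule and file format may differ, and `pick_by_priority` is a hypothetical helper.

```python
import random


def pick_by_priority(strains, priorities, n, seed=0):
    """Pick up to n strains, preferring higher priority scores.

    Strains without a score rank last; ties are broken randomly.
    """
    rng = random.Random(seed)
    shuffled = list(strains)
    rng.shuffle(shuffled)  # random order decides ties
    # sorted() is stable, so equal-priority strains keep their shuffled order.
    ranked = sorted(shuffled,
                    key=lambda s: priorities.get(s, float("-inf")),
                    reverse=True)
    return ranked[:n]
```

With no priorities supplied (an empty mapping), this reduces to plain random sampling of `n` strains from the bin.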

This PR also adjusts the filter_subsample_sequences WDL task, converting the include_where and exclude_where input types from String? to Array[String]?. This better reflects how they are used and gives a cleaner way to specify input values containing spaces and special characters, which are now better tolerated.
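The benefit of an array input can be sketched in Python: when each condition is a separate list element, it can be quoted individually before being placed on a command line, so a value such as `region=North America` survives as one argument. The helper `where_args` and the flag name used in the comments are illustrative assumptions, not the task's actual command construction.

```python
import shlex


def where_args(flag, conditions):
    """Render an optional list of conditions as a shell-safe argument string.

    Returns an empty string when the optional input is absent or empty.
    """
    if not conditions:
        return ""
    # Quote each element on its own so spaces and special characters
    # stay inside a single argument.
    return " ".join([flag] + [shlex.quote(c) for c in conditions])


# Hypothetical usage with an exclude-style flag:
args = where_args("--exclude-where", ["region=North America", "host=human"])
# shlex.split(args) recovers the flag followed by the two original conditions.
```

With a single String? input, by contrast, the caller has to embed the quoting in the string itself, which is harder to get right.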

dpark01 added 4 commits October 9, 2020 10:43

first attempt at a longer workflow for focal subsetting

cdc8075

updates to first draft of subsample with focal

102ae8a

add two more Int outputs to workflow

18d66f1

Merge branch 'master' into dp-nextstrain

02217d2

dpark01 merged commit c784732 into master Oct 12, 2020

dpark01 deleted the dp-nextstrain branch November 3, 2020 14:26
