Allow weighted subsampling #1318
Comments
There has been a lot of internal discussion on this feature. Contrary to the proposal in the issue description, it does seem reasonable to encode multi-dimensional weights in a CSV/TSV format, though this type of file would likely have to be generated by a script.
Some more notes:
Thanks for spelling things out in such detail @victorlin. A couple thoughts:
I could easily write this YAML file for ncov, while for the fully specified TSV example, I'd need a script that generates a large number of combinations (that I don't actually care about). Note that you could still encode interaction terms in a YAML file, e.g.:
Again, I believe that independent columns will cover >90% of use cases and won't force people to write intermediate scripts if they have multiple columns they care about.
To avoid enforced verbosity I could also imagine assuming a weight of
After working on nextstrain/ncov@0fd6861 I've realized that in order to reduce the number of samples (i.e. calls to
Here's an idea: implement weighted subsampling as a part of
Using the currently proposed YAML as-is would look something like:

    samples:
      north_america_6m:
        size: 4000
        weights:
          # Region weighting: 4:1 for North America to rest of world
          region:
            North America: 4
            # Africa: 1
            # Asia: 1
            # Europe: 1
            # …
          # Time weighting: 4:1 for recent sequences to early sequences (2M threshold)
          month:
            # 2020-01: 1
            # 2020-02: 1
            # …
            2024-02: 4
            2024-03: 4

Issues:
Here's an alternative which addresses those issues:

    samples:
      north_america_6m:
        size: 4000
        partitions:
          # Region weighting: 4:1 for North America to rest of world
          region:
            - query: region == 'North America'
              weight: 4
              uniform_sampling: division
            - query: region != 'North America'
              weight: 1
              uniform_sampling: country
          # Time weighting: 4:1 for recent sequences to early sequences (2M threshold)
          month:
            - query: date >= 2M
              weight: 4
              uniform_sampling: week
            - query: date < 2M
              weight: 1
              uniform_sampling: month
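As a rough illustration of how one dimension of the partitions proposal might resolve into numbers, here is a minimal Python sketch for the region dimension only. It assumes the sample size is split across partitions in proportion to weight and then spread evenly over the groups named by uniform_sampling. The function name `partition_targets` and the group counts (64 divisions, 100 countries) are hypothetical, and how the region and month dimensions would be combined is not specified in this thread, so the sketch does not attempt it.

```python
def partition_targets(size, partitions, group_counts):
    """Split `size` across partitions by weight, then evenly within each.

    partitions: list of dicts with 'query', 'weight', 'uniform_sampling',
        mirroring one dimension of the proposed YAML.
    group_counts: hypothetical number of distinct uniform_sampling groups
        matched by each query (would come from the metadata in practice).
    """
    total_weight = sum(p["weight"] for p in partitions)
    targets = {}
    for p in partitions:
        # Share of the overall sample assigned to this partition.
        share = size * p["weight"] / total_weight
        n_groups = group_counts[p["query"]]
        targets[p["query"]] = {
            "sequences": share,
            "uniform_sampling": p["uniform_sampling"],
            # Even split over the groups inside the partition.
            "per_group": share / n_groups,
        }
    return targets


region_partitions = [
    {"query": "region == 'North America'", "weight": 4, "uniform_sampling": "division"},
    {"query": "region != 'North America'", "weight": 1, "uniform_sampling": "country"},
]
# Hypothetical counts of divisions / countries present in the metadata.
group_counts = {
    "region == 'North America'": 64,
    "region != 'North America'": 100,
}

targets = partition_targets(4000, region_partitions, group_counts)
for query, target in targets.items():
    print(query, target)
```

With a 4:1 weighting, 3200 of the 4000 sequences go to the North America partition and 800 to the rest of the world, before any rounding or handling of groups that have too few sequences.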
After thinking more along the lines of implementing this in
I think these can be implemented separately, where (1) can be YAML-based and (2) can be TSV-based. I've added more detail and examples in the subsampling doc.
Thanks for the thoughts @victorlin. I'll try to pull together a more cohesive thread for how I'd see this working for the
Lines 65 to 66 in d8faf01
It is different in that it only allows weighting on a single column (defined by I've considered the idea of swapping
I think that
Context
Currently, --subsample-max-sequences effectively calculates a value for --sequences-per-group which applies to all groups specified by --group-by.
This behavior does not work for all scenarios. Example: There are 5 countries with vastly different population sizes. 60 sequences are requested for subsampling. The command would be:
This means 12 sequences will be sampled from each country, regardless of population size. One may want the sample to be representative of country population sizes and not uniform. This limitation is what prompts higher-level subsampling workarounds such as nextstrain/ncov#1074.
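The mismatch can be made concrete with a small Python sketch. It is a simplification of the behavior described above, not the actual implementation, and the country names and population figures are hypothetical.

```python
def uniform_sequences_per_group(max_sequences, groups):
    """Evenly split the requested total across all groups."""
    return max_sequences / len(groups)


# Hypothetical population sizes for the 5 countries.
populations = {"A": 1000, "B": 1200, "C": 300, "D": 400, "E": 100}

max_sequences = 60
per_group = uniform_sequences_per_group(max_sequences, populations)
total_population = sum(populations.values())

for country, population in populations.items():
    sample_share = per_group / max_sequences
    population_share = population / total_population
    print(
        f"{country}: {per_group:.0f} sequences, "
        f"{sample_share:.0%} of sample vs {population_share:.0%} of population"
    )
```

Every country ends up with 20% of the sample regardless of its share of the total population.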
Tasks
Rollout
Original proposed solution
Implement an option --subsample-weights, which reads a file that specifies weights per --group-by column. A simple example:
weights.yaml:
With this information, a different number of sequences can be calculated per group.
A would have 60*1000/3000 = 20 sequences.
C would have 60*300/3000 = 6 sequences.
The absence of a column from the weights file can imply equal weighting. In other words, the example can be updated to use --group-by country month while keeping weights.yaml as-is to have weighted country sampling for each time bin.
Or, a more complex example where time is also weighted:
weights.yaml:
Notes:
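The per-group arithmetic in the original proposal (for example, 60*1000/3000 = 20 for A) can be sketched in Python as follows. This is an illustration under stated assumptions rather than the proposed implementation: the function name `weighted_targets` is hypothetical, weights for A (1000) and C (300) and the total of 3000 come from the text while the values for B, D and E are invented to reach that total, weights from different columns are combined by multiplication, and fractional targets are left unrounded.

```python
from itertools import product


def weighted_targets(total, groups_by_column, weights_by_column):
    """Return a target number of sequences for each group combination.

    groups_by_column: {column: [values]} for every --group-by column.
    weights_by_column: {column: {value: weight}}; a column absent from
        this mapping is treated as equally weighted.
    """
    columns = list(groups_by_column)
    joint_weights = {}
    for combo in product(*(groups_by_column[c] for c in columns)):
        weight = 1.0
        for column, value in zip(columns, combo):
            if column in weights_by_column:
                # Assumption: weights of different columns multiply.
                weight *= weights_by_column[column][value]
        joint_weights[combo] = weight
    norm = sum(joint_weights.values())
    return {combo: total * w / norm for combo, w in joint_weights.items()}


# A and C are from the proposal; B, D and E are hypothetical (sum is 3000).
weights = {"country": {"A": 1000, "B": 1200, "C": 300, "D": 400, "E": 100}}
countries = list(weights["country"])

# --group-by country
by_country = weighted_targets(60, {"country": countries}, weights)
print(by_country[("A",)], by_country[("C",)])

# --group-by country month, with the weights left as-is:
# months are equally weighted, countries keep their relative weights.
months = ["2024-01", "2024-02", "2024-03"]  # hypothetical time bins
by_country_month = weighted_targets(
    60, {"country": countries, "month": months}, weights
)
```

In the second call each month receives the same share of the 60 sequences, and within each month the countries keep the 1000:300 style ratios from the weights.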