Allow weighted subsampling #1318

victorlin · 2023-09-19T22:12:06Z

Context

Currently, --subsample-max-sequences effectively calculates a value for --sequences-per-group which applies to all groups specified by --group-by.

This behavior does not work for all scenarios. Example: There are 5 countries with vastly different population sizes. 60 sequences are requested for subsampling. The command would be:

augur filter \
  --group-by country \
  --subsample-max-sequences 60

This means 12 sequences will be sampled from each country, regardless of population size. One may want the sample to be representative of country population sizes and not uniform. This limitation is what prompts higher-level subsampling workarounds such as nextstrain/ncov#1074.

Tasks

Implement weighted sampling #1454
Release in a new version of Augur: 25.3.0
Add weighted sampling docs docs.nextstrain.org#223

Rollout

Use it in ncov workflow: Use weighted sampling ncov#1141
Use it in other workflows?

Original proposed solution

Implement an option --subsample-weights, which reads a file that specifies weights per --group-by column. A simple example:

augur filter \
  --group-by country \
  --subsample-max-sequences 60 \
  --subsample-weights weights.yaml

weights.yaml:

# Weight countries by population size.
country:
    A: 1000
    B: 1000
    C: 300
    D: 100
    E: 600

With this information, a different number of sequences can be calculated for each group. The weights sum to 3000, so:

A would have 60*1000/3000 = 20 sequences.
C would have 60*300/3000 = 6 sequences.
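
The arithmetic above can be sketched in a few lines of Python. This is only an illustration of the proportional split, not Augur's implementation; the function name `allocate` is hypothetical, and rounding of fractional targets is not addressed here.

```python
def allocate(total, weights):
    """Split `total` sequences across groups in proportion to `weights`."""
    denominator = sum(weights.values())
    return {group: total * weight / denominator for group, weight in weights.items()}


# Weights from the weights.yaml example above.
weights = {"A": 1000, "B": 1000, "C": 300, "D": 100, "E": 600}
targets = allocate(60, weights)

for group, target in targets.items():
    print(group, target)
```

With these weights every target happens to be a whole number. In general a real implementation would need a rule for fractional targets.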

The absence of a column from the weights file can imply equal weighting. In other words, the example can be updated to use --group-by country month while keeping weights.yaml as-is to have weighted country sampling for each time bin.

Or, a more complex example where time is also weighted:

augur filter \
  --group-by country month \
  --subsample-max-sequences 60 \
  --subsample-weights weights.yaml

weights.yaml:

# Weight countries by population size.
country:
    A: 1000
    B: 1000
    C: 300
    D: 100
    E: 600

# Get twice the amount of sequences from 2021 compared to 2020.
month:
    2020-01: 1
    2020-02: 1
    2020-03: 1
    # … all months in 2020 are weighted with 1
    2021-01: 2
    2021-02: 2
    2021-03: 2
    # … all months in 2021 are weighted with 2

Notes:

The file format is up for debate. At the least, it can be JSON or YAML, but not anything tabular (not enough dimensions to cover multiple group by columns).
This would necessitate all possible values of a column to have a weight so that the denominator can be calculated by summing up all the values.
Weights should be relative within each column.
(I think) as long as the weights are non-negative, the values can be multiplied across columns to get effective weighting for all combinations.
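
A rough sketch of the last note, assuming weights are multiplied across columns: the combined weight of each (country, month) pair is the product of its per-column weights, and the share each country receives overall stays the same as its share under the country weights alone. The helper name `combined_weights` and the two-month example are hypothetical.

```python
from itertools import product
from math import prod


def combined_weights(column_weights):
    """Multiply per-column weights to get one weight per combination of values."""
    per_column_items = [weights.items() for weights in column_weights.values()]
    combined = {}
    for combo in product(*per_column_items):
        key = tuple(value for value, _ in combo)
        combined[key] = prod(weight for _, weight in combo)
    return combined


column_weights = {
    "country": {"A": 1000, "B": 1000, "C": 300, "D": 100, "E": 600},
    "month": {"2020-01": 1, "2021-01": 2},  # shortened for illustration
}
joint = combined_weights(column_weights)
total = sum(joint.values())

# Share of the sample going to country A, summed over all months.
share_a = sum(w for (country, _), w in joint.items() if country == "A") / total
print(joint[("A", "2021-01")], share_a)
```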


victorlin · 2024-03-11T23:48:23Z

There have been many internal discussions on this feature. Contrary to the proposal in the issue description, it does seem reasonable to encode multi-dimensional weights in a CSV/TSV format, though it's likely that this type of file must be generated via a script.

country     month       weight
A           2020-01     N
A           2020-02     N
A           2020-03     N
…
B           2020-01     N
B           2020-02     N
B           2020-03     N
…
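
For a sense of what such a generating script might look like, here is a minimal sketch that writes every country/month combination to a TSV. The function name `write_weights_tsv` and the `weight_for` callback are hypothetical, and the constant weight of 1 is only a placeholder for the `N` values above, which would come from real data such as case counts.

```python
import csv
import io
from itertools import product


def write_weights_tsv(handle, countries, months, weight_for):
    """Write one row per (country, month) combination with its weight."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["country", "month", "weight"])
    for country, month in product(countries, months):
        writer.writerow([country, month, weight_for(country, month)])


# Write to an in-memory buffer here; a real script would open a file instead.
buffer = io.StringIO()
write_weights_tsv(
    buffer,
    countries=["A", "B"],
    months=["2020-01", "2020-02", "2020-03"],
    weight_for=lambda country, month: 1,  # placeholder weight
)
tsv_text = buffer.getvalue()
print(tsv_text)
```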

Some more notes:

The weights file should be mutually exclusive with --group-by (determined by weights file columns) and --sequences-per-group (calculated dynamically using weights).
The list of weights should partition the data completely. In the example above, all combinations of country and month present in the data must be provided a weight. Raise user error if it doesn't.

In the initial implementation, all cells of the weights file must have a value. In the future, this can be extended to allow partitioning of the data at different resolutions. Here's an example with geographically even sampling on two different resolutions:

country     division    weight
A                       <1/n_countries>
B                       <1/n_countries>
C                       <1/n_countries>
…
USA         WA          <1/n_countries * 1/n_divisions>
USA         WA          <1/n_countries * 1/n_divisions>
USA         WA          <1/n_countries * 1/n_divisions>
…
USA         OR          <1/n_countries * 1/n_divisions>
USA         OR          <1/n_countries * 1/n_divisions>
…

trvrb · 2024-03-21T20:19:48Z

Thanks for spelling things out in such detail @victorlin. A couple thoughts:

I really like the behavior in the original YAML version of being able to specify independent weights for column 1 (eg country) vs column 2 (eg month). The situations where we have an interaction effect between weights seem quite limited (I can't think of an immediate example in existing subsampling routines).

I could easily write this YAML file for ncov, while for the fully specified TSV example, I'd need a script that generates a large number of combinations (that I don't actually care about).

Note that you could still encode interaction terms in a YAML file, e.g.:

# Weight countries by population size.
country month:
    A 2020-01: 10
    B 2020-01: 10
    C 2020-01: 3
    D 2020-01: 1
    E 2020-01: 6
    A 2020-02: 10
    B 2020-02: 10
    C 2020-02: 3
    D 2020-02: 1
    E 2020-02: 6

Again, I believe that independent columns will cover >90% of use cases and then won't force people to write intermediate scripts if they have multiple columns they care about.

This would necessitate all possible values of a column to have a weight so that the denominator can be calculated by summing up all the values.

The list of weights should partition the data completely. In the example above, all combinations of country and month present in the data must be provided a weight. Raise user error if it doesn't.

To avoid enforced verbosity I could also imagine assuming a weight of 1 for any missing entries. But raise a warning saying that missing values have been assumed to be 1.
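
The two behaviors under discussion (raise an error on missing entries, or assume 1 and warn) could be sketched as below. This is an illustration only; `resolve_weights` and its `on_missing` parameter are hypothetical names, not part of Augur.

```python
import warnings


def resolve_weights(observed_groups, weights, on_missing="warn", default=1):
    """Return one weight per observed group, handling groups with no entry."""
    missing = [group for group in observed_groups if group not in weights]
    if missing and on_missing == "error":
        raise ValueError(f"No weight provided for: {missing}")
    if missing:
        warnings.warn(f"No weight provided for {missing}; assuming {default}.")
    return {group: weights.get(group, default) for group in observed_groups}


# Group F is present in the data but has no weight entry.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    resolved = resolve_weights(["A", "B", "F"], {"A": 10, "B": 10})

print(resolved, len(caught))
```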

victorlin · 2024-03-28T02:50:08Z

I could easily write this YAML file for ncov

My speculative hesitation with YAML is that it'll be hard to translate from a source file e.g. case counts which are typically in TSV format (but I haven't actually tried). YAML would definitely be easier to manually define simple weighting logic such as "2x sequences from region A compared to B".

I'd need a script that generates a large number of combinations (that I don't actually care about).

Good point. The combinations need to be programmatically generated somewhere along the line. If providing weights as YAML, the subsampling tool would internally generate weights per group analogous to the TSV.

I think it'd be manageable to first implement the underlying logic and allow configuration via both YAML and TSV to get a feel for what works better under different scenarios.

To avoid enforced verbosity I could also imagine assuming a weight of 1 for any missing entries.

My (again speculative) concern is that there may be few cases in which 1 is a useful default, especially if weights are based on case counts or population size.

This seems like a small behavioral detail in which we'll only know what to do once we have an implementation to test against real world usage. We could start with errors to notice if enforced verbosity is overkill.

victorlin · 2024-03-28T02:57:31Z

After working on nextstrain/ncov@0fd6861 I've realized that in order to reduce the number of samples (i.e. calls to augur filter) in the workflow, augur filter will need the extended implementation that allows partitioning of the data at different resolutions. I don't see how the initial implementation will simplify the ncov workflow.

victorlin · 2024-03-28T18:01:01Z

Here's an idea: implement weighted subsampling as a part of augur subsample and configure it in the new YAML.

Using the currently proposed YAML as-is would look something like:

samples:
  north_america_6m:
    size: 4000

    weights:
      # Region weighting: 4:1 for North America to rest of world
      region:
        North America: 4
        # Africa: 1
        # Asia: 1
        # Europe: 1
        # …

      # Time weighting: 4:1 for recent sequences to early sequences (2M threshold)
      month:
        # 2020-01: 1
        # 2020-02: 1
        # …
        2024-02: 4
        2024-03: 4

Issues:

For region weighting, assuming a weight of 1 for missing entries, the weighting will change from 4:1 North America to rest of the world to 4:1 North America to every other region (i.e. 4:6 North America to rest of the world). Time weighting is similarly affected.
Time weighting is verbose and lacks ability to use relative dates.
For both region and time weighting, the column to group by for uniform sampling within each group is no longer encoded. In the current ncov workflow, this is encoded as different --group-by columns for individual samples.

Here's an alternative which addresses those issues:

samples:
  north_america_6m:
    size: 4000

    partitions:
      # Region weighting: 4:1 for North America to rest of world
      region:
      - query: region == 'North America'
        weight: 4
        uniform_sampling: division
      - query: region != 'North America'
        weight: 1
        uniform_sampling: country

      # Time weighting: 4:1 for recent sequences to early sequences (2M threshold)
      month:
      - query: date >= 2M
        weight: 4
        uniform_sampling: week
      - query: date < 2M
        weight: 1
        uniform_sampling: month

victorlin · 2024-03-29T00:02:13Z

After thinking more along the lines of implementing this in augur subsample, I've realized there are two types of weighted sampling:

1. Weighted sampling between intermediate samples (e.g. 4:1 between North America vs. rest of the world).
2. Weighted sampling within an intermediate sample (e.g. dynamic sequences per group based on geo-temporal case counts).

I think these can be implemented separately, where (1) can be YAML-based and (2) can be TSV-based. I've added more detail and examples in the subsampling doc.

trvrb · 2024-04-12T19:41:46Z

Thanks for the thoughts @victorlin. I'll try to pull together a more cohesive thread for how I'd see this working for the ncov example. But broadly, I like the general idea of encoding weights independently between categories (country vs month, for example) and assuming no interaction between categories. I.e., if you have a weight of 4 for North America and a weight of 1 for global context, and a weight of 4 for recent samples and a weight of 1 for older samples, then I'd assume a sampling weight of 4x4 = 16 for recent North America, 1x4 = 4 for recent global, 4x1 = 4 for older North America and 1x1 = 1 for older global.
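
As a quick check of that arithmetic, here is a small sketch that multiplies the two independent weightings and converts the products to shares of the total. The labels ("global", "recent", "older") are shorthand for this example only.

```python
region_weights = {"North America": 4, "global": 1}
age_weights = {"recent": 4, "older": 1}

# With no interaction between categories, each combined weight is a product.
joint = {
    (region, age): region_weight * age_weight
    for region, region_weight in region_weights.items()
    for age, age_weight in age_weights.items()
}
total = sum(joint.values())
shares = {key: weight / total for key, weight in joint.items()}

for key, weight in joint.items():
    print(key, weight, f"{shares[key]:.0%}")
```

Under these weights, recent North America sequences would make up 16 of every 25 sampled sequences.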

victorlin · 2024-08-21T18:00:39Z

augur frequencies has a weights interface which was never discussed here:

augur/augur/frequencies.py, lines 65 to 66 in d8faf01:

    parser.add_argument("--weights", help="a dictionary of key/value mappings in JSON format used to weight KDE tip frequencies")
    parser.add_argument("--weights-attribute", help="name of the attribute on each tip whose values map to the given weights dictionary")

It is different in that it only allows weighting on a single column (defined by --weights-attribute) and the file format is JSON instead of TSV.

I've considered the idea of swapping --group-by-weights with --weights + --weights-attribute for the sake of consistency across Augur. It's definitely possible, but I've decided against it for the following reasons, ordered from most to least important:

The swap is compatible with the only confirmed use case which uses a single column. However, we've considered use cases of multiple columns which would require some careful changes to the --weights + --weights-attribute interface. Those are already supported by the TSV format and I would rather leave the support built-in than pending a redesign of the --weights + --weights-attribute interface.
The name --group-by-weights pairs nicely with --group-by. This is useful because all weighted columns must be passed to --group-by, i.e. --group-by-weights is an extension of --group-by. It wouldn't be so obvious with --weights-attribute.
The --group-by-weights in Implement weighted sampling #1454 has already implemented various checks for the TSV format.

trvrb · 2024-08-21T18:25:10Z

I think that --group-by-weights is helpfully clear when paired with the familiar --group-by. I support your decision to keep the interface as currently implemented.

victorlin · 2024-08-22T18:37:21Z

augur filter --group-by-weights was released in Augur 25.3.0.

victorlin added the enhancement New feature or request label Sep 19, 2023

nextstrain-bot added this to Nextstrain planning (archived) Sep 20, 2023

github-project-automation bot moved this to New in Nextstrain planning (archived) Sep 20, 2023

trvrb mentioned this issue Mar 21, 2024

Generate subsampling config with a script nextstrain/ncov#1102

Closed

5 tasks

This was referenced Apr 12, 2024

filter: Split filtering and subsampling #1432

Draft

augur subsample command #635

Open

victorlin changed the title ~~filter: Allow weighted subsampling~~ Allow weighted subsampling Apr 18, 2024

victorlin mentioned this issue Apr 25, 2024

Implement weighted sampling #1454

Merged

12 tasks

victorlin mentioned this issue Jun 20, 2024

Improved subsampling support #1481

Open

6 tasks

This was referenced Aug 14, 2024

Use weighted sampling nextstrain/ncov#1141

Closed

Add weighted sampling docs nextstrain/docs.nextstrain.org#223

Merged

genehack mentioned this issue Oct 16, 2024

Potential blog posts meta-issue nextstrain/nextstrain.org#1050

Open

victorlin closed this as completed Aug 22, 2024
