Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement weighted sampling #1454

Merged
merged 4 commits into from
Aug 21, 2024
Merged

Implement weighted sampling #1454

merged 4 commits into from
Aug 21, 2024

Conversation

victorlin
Copy link
Member

@victorlin victorlin commented Apr 25, 2024

Description of proposed changes

This new option takes a weights file to determine group sizes. Summarized by the help text:

subsample_group.add_argument('--group-by-weights', type=str, metavar="FILE", help="""
TSV file defining weights for grouping. Requirements:
(1) Lines starting with '#' are treated as comment lines.
(2) The first non-comment line must be a header row.
(3) There must be a numeric ``weight`` column (weights can take on any
non-negative values).
(4) Other columns must be a subset of columns used in ``--group-by``,
with combinations of values covering all combinations present in the
metadata.
(5) This option only applies when ``--group-by`` and
``--subsample-max-sequences`` are provided.
(6) This option cannot be used with ``--no-probabilistic-sampling``.
Notes:
(1) Any ``--group-by`` columns absent from this file will be given equal
weighting across all values *within* groups defined by the other
weighted columns.
(2) An entry with the value ``default`` under all columns will be
treated as the default weight for specific groups present in the
metadata but missing from the weights file. If there is no default
weight and the metadata contains rows that are not covered by the
given weights, augur filter will exit with an error.
""")

Related issue(s)

Notable comment threads

Checklist

@victorlin victorlin self-assigned this Apr 25, 2024
@victorlin
Copy link
Member Author

victorlin commented Apr 30, 2024

This should be in a working state, at least based on limited testing in subsample-weighted.t. Planning to test using #1444 + nextstrain/ncov#1102 to get a feel for how add an additional weights.tsv file fits in a pathogen repo and make sure it works as expected on real data.


UPDATE: first testing in ncov independently of augur subsample

@victorlin victorlin force-pushed the victorlin/weighted-sampling branch 2 times, most recently from 335325d to dc1d1ab Compare May 3, 2024 22:23
@victorlin victorlin mentioned this pull request May 3, 2024
3 tasks
@victorlin victorlin changed the base branch from master to victorlin/prep-weighted-sampling May 3, 2024 22:25
@victorlin victorlin force-pushed the victorlin/prep-weighted-sampling branch from c18acb5 to d5004bf Compare May 3, 2024 22:27
@victorlin victorlin force-pushed the victorlin/weighted-sampling branch from dc1d1ab to 0a321a6 Compare May 3, 2024 22:27
@victorlin victorlin force-pushed the victorlin/prep-weighted-sampling branch from d5004bf to 80eb23b Compare May 6, 2024 22:44
@victorlin victorlin force-pushed the victorlin/weighted-sampling branch from 0a321a6 to 9be1e84 Compare May 6, 2024 22:44
augur/filter/subsample.py Outdated Show resolved Hide resolved
Copy link
Member

@jameshadfield jameshadfield left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is going to be really powerful @victorlin, nice work. I tested this out a little bit, but haven't gone through every part of it. There's a few aspects that this brought up for which I'll think through a little more, but I think the overall direction is sound.

augur/filter/__init__.py Outdated Show resolved Hide resolved
augur/filter/subsample.py Outdated Show resolved Hide resolved
augur/filter/subsample.py Outdated Show resolved Hide resolved
augur/filter/subsample.py Outdated Show resolved Hide resolved
augur/filter/subsample.py Show resolved Hide resolved
augur/filter/subsample.py Outdated Show resolved Hide resolved
victorlin added a commit that referenced this pull request May 10, 2024
@genehack
Copy link
Contributor

Having just read this from scratch, and hoping to help it get moving again, it seems like the remaining outstanding discussion items are:

#1 adding an (optional?) output with requested and calculated group sizes – which seems like a pretty valuable debugging tool to me, but maybe non-blocking for this particular PR
#2 validating that the current behavior when a group-by group is empty is desired convo continued in second commentseems like this is the case but maybe a bit more explicit validation/commentary would be good?

Did I miss anything @victorlin ?

@victorlin
Copy link
Member Author

Thanks for bumping this @genehack. I added a comment on the thread for (1). Testing and discussion in nextstrain/ncov#1106 should address (2) and impact the overall direction of this PR. I want to make sure we have at least one solid use case for this feature.

genehack
genehack previously approved these changes Jun 7, 2024
@victorlin victorlin force-pushed the victorlin/weighted-sampling branch from 19c685f to b88a8c4 Compare July 5, 2024 18:56
victorlin added a commit that referenced this pull request Jul 5, 2024
@victorlin victorlin force-pushed the victorlin/weighted-sampling branch from b88a8c4 to 7d041b4 Compare July 17, 2024 18:28
@victorlin victorlin force-pushed the victorlin/prep-weighted-sampling branch from 80eb23b to a8f8827 Compare July 17, 2024 18:38
@victorlin victorlin force-pushed the victorlin/weighted-sampling branch 2 times, most recently from e81cdc5 to 32627ad Compare July 17, 2024 18:45
Copy link

codecov bot commented Jul 17, 2024

Codecov Report

Attention: Patch coverage is 98.72611% with 2 lines in your changes missing coverage. Please review.

Project coverage is 70.98%. Comparing base (d8faf01) to head (9a2090b).
Report is 120 commits behind head on master.

Files with missing lines Patch % Lines
augur/filter/weights_file.py 94.59% 1 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1454      +/-   ##
==========================================
+ Coverage   70.51%   70.98%   +0.46%     
==========================================
  Files          78       79       +1     
  Lines        8107     8244     +137     
  Branches     1966     2002      +36     
==========================================
+ Hits         5717     5852     +135     
- Misses       2100     2101       +1     
- Partials      290      291       +1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.


🚨 Try these New Features:

@victorlin victorlin force-pushed the victorlin/weighted-sampling branch from 9e0a4b8 to 5c318c0 Compare July 17, 2024 19:25
@victorlin victorlin marked this pull request as ready for review July 17, 2024 19:38
Copy link
Contributor

@joverlee521 joverlee521 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't fully grokked what's going on in get_weighted_group_sizes, but the behaviors described in the functional tests make sense to me.

Once this feature is released, it definitely deserves it's own section in the Filtering and Subsampling docs.

augur/filter/weights_file.py Outdated Show resolved Hide resolved
augur/filter/weights_file.py Outdated Show resolved Hide resolved
Base automatically changed from victorlin/prep-weighted-sampling to master August 5, 2024 22:03
@victorlin victorlin force-pushed the victorlin/weighted-sampling branch 2 times, most recently from 648e53b to 87df694 Compare August 5, 2024 22:04

Using 1:1 weights is similarly straightforward, with 50 sequences from each location.

$ cat >weights.tsv <<~~
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[general question about test best practices]

I commonly use cram --keep-tmpdir and then inspect the files themselves to understand the tests & thus behaviour of augur. This test overwrites files (e.g. weights.tsv) so that approach doesn't work. It also seems like it'd hide bugs where a command fails, but exits code 0, and thus we're not using the version of the file we think we are. Thoughts on using weights1.tsv, weights2.tsv etc?

Copy link
Member Author

@victorlin victorlin Aug 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added more specific file names: b7fc345.

The other approach which I've found better for debugging tests is to split the test file into multiple files. However, in this case it's better to keep all in the same file to share the temporary metadata.tsv.

It also seems like it'd hide bugs where a command fails, but exits code 0, and thus we're not using the version of the file we think we are.

I assume "command" = "Augur command", but I'm having trouble understanding your scenario.

  • Weights files are created separately from the Augur commands.
  • Augur command failures will exit with a nonzero exit code and be shown as a test failure.

@victorlin victorlin force-pushed the victorlin/weighted-sampling branch 3 times, most recently from 18dbff0 to 6b3c63b Compare August 6, 2024 22:34
@victorlin
Copy link
Member Author

victorlin commented Aug 7, 2024

[@joverlee521] I haven't fully grokked what's going on in get_weighted_group_sizes

I've split this function up into smaller parts and added types to make it more readable: c305fe8

@victorlin victorlin force-pushed the victorlin/weighted-sampling branch 4 times, most recently from c860492 to 66d0df4 Compare August 14, 2024 17:19
@victorlin victorlin mentioned this pull request Aug 14, 2024
5 tasks
augur/filter/__init__.py Outdated Show resolved Hide resolved
@trvrb trvrb self-requested a review August 20, 2024 20:25
Copy link
Member

@trvrb trvrb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@victorlin: This is great! I appreciate all the documentation and tests. I'm good for this to be merged as it stands. However, I do think we should follow up shortly with a separate PR that implements support for --probabilistic-sampling + --group-by-weights per #1454 (comment). I was confused. Nothing more to be done with probabilistic sampling.

This reverts "Rename probabilistic sampling group" (a8f8827). That
change was made prematurely and it is no longer necessary.
This makes it easier to weight on month or week values from a TSV file.
No reason to use a Python list object when all that's needed is a unique
value per year-month or year-week.
@victorlin
Copy link
Member Author

Finished addressing all conversations and cleaned up commits. I'm going to re-run trial builds once more on nextstrain/ncov#1106 with latest changes then merge this tomorrow if all looks good.

Add a new option --group-by-weights and --output-group-by-sizes to
support the new feature.
@victorlin victorlin merged commit 1e9d131 into master Aug 21, 2024
28 checks passed
@victorlin victorlin deleted the victorlin/weighted-sampling branch August 21, 2024 20:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants