-
Notifications
You must be signed in to change notification settings - Fork 128
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement weighted sampling #1454
Conversation
96a7a0a
to
adc1d0b
Compare
This should be in a working state, at least based on limited testing in subsample-weighted.t. Planning to test using #1444 + nextstrain/ncov#1102 to get a feel for how add an additional weights.tsv file fits in a pathogen repo and make sure it works as expected on real data. UPDATE: first testing in ncov independently of |
335325d
to
dc1d1ab
Compare
c18acb5
to
d5004bf
Compare
dc1d1ab
to
0a321a6
Compare
d5004bf
to
80eb23b
Compare
0a321a6
to
9be1e84
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is going to be really powerful @victorlin, nice work. I tested this out a little bit, but haven't gone through every part of it. There's a few aspects that this brought up for which I'll think through a little more, but I think the overall direction is sound.
Having just read this from scratch, and hoping to help it get moving again, it seems like the remaining outstanding discussion items are: #1 adding an (optional?) output with requested and calculated group sizes – which seems like a pretty valuable debugging tool to me, but maybe non-blocking for this particular PR Did I miss anything @victorlin ? |
Thanks for bumping this @genehack. I added a comment on the thread for (1). Testing and discussion in nextstrain/ncov#1106 should address (2) and impact the overall direction of this PR. I want to make sure we have at least one solid use case for this feature. |
19c685f
to
b88a8c4
Compare
b88a8c4
to
7d041b4
Compare
80eb23b
to
a8f8827
Compare
e81cdc5
to
32627ad
Compare
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #1454 +/- ##
==========================================
+ Coverage 70.51% 70.98% +0.46%
==========================================
Files 78 79 +1
Lines 8107 8244 +137
Branches 1966 2002 +36
==========================================
+ Hits 5717 5852 +135
- Misses 2100 2101 +1
- Partials 290 291 +1 ☔ View full report in Codecov by Sentry. 🚨 Try these New Features:
|
9e0a4b8
to
5c318c0
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I haven't fully grokked what's going on in get_weighted_group_sizes
, but the behaviors described in the functional tests make sense to me.
Once this feature is released, it definitely deserves it's own section in the Filtering and Subsampling docs.
648e53b
to
87df694
Compare
|
||
Using 1:1 weights is similarly straightforward, with 50 sequences from each location. | ||
|
||
$ cat >weights.tsv <<~~ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[general question about test best practices]
I commonly use cram --keep-tmpdir
and then inspect the files themselves to understand the tests & thus behaviour of augur. This test overwrites files (e.g. weights.tsv
) so that approach doesn't work. It also seems like it'd hide bugs where a command fails, but exits code 0, and thus we're not using the version of the file we think we are. Thoughts on using weights1.tsv
, weights2.tsv
etc?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've added more specific file names: b7fc345.
The other approach which I've found better for debugging tests is to split the test file into multiple files. However, in this case it's better to keep all in the same file to share the temporary metadata.tsv
.
It also seems like it'd hide bugs where a command fails, but exits code 0, and thus we're not using the version of the file we think we are.
I assume "command" = "Augur command", but I'm having trouble understanding your scenario.
- Weights files are created separately from the Augur commands.
- Augur command failures will exit with a nonzero exit code and be shown as a test failure.
tests/functional/filter/cram/subsample-weighted-and-uniform-mix.t
Outdated
Show resolved
Hide resolved
18dbff0
to
6b3c63b
Compare
I've split this function up into smaller parts and added types to make it more readable: c305fe8 |
c860492
to
66d0df4
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@victorlin: This is great! I appreciate all the documentation and tests. I'm good for this to be merged as it stands. However, I do think we should follow up shortly with a separate PR that implements support for I was confused. Nothing more to be done with probabilistic sampling.--probabilistic-sampling
+ --group-by-weights
per #1454 (comment).
This reverts "Rename probabilistic sampling group" (a8f8827). That change was made prematurely and it is no longer necessary.
This makes it easier to weight on month or week values from a TSV file. No reason to use a Python list object when all that's needed is a unique value per year-month or year-week.
66d0df4
to
8e3d6be
Compare
Finished addressing all conversations and cleaned up commits. I'm going to re-run trial builds once more on nextstrain/ncov#1106 with latest changes then merge this tomorrow if all looks good. |
Add a new option --group-by-weights and --output-group-by-sizes to support the new feature.
8e3d6be
to
9a2090b
Compare
Description of proposed changes
This new option takes a weights file to determine group sizes. Summarized by the help text:
augur/augur/filter/__init__.py
Lines 73 to 97 in 9a2090b
Related issue(s)
Notable comment threads
Checklist