Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Proof-of-concept: warn that only last --exclude-where used when user passes multiple #1229

Closed
wants to merge 1 commit into from

Conversation

corneliusroemer
Copy link
Member

@corneliusroemer corneliusroemer commented May 17, 2023

See https://bedfordlab.slack.com/archives/C01LCTT7JNN/p1684360772513889
for discussion leading up to this

This change is entirely backwards compatible. All it does is add a helpful warning

Description of proposed changes

Warn users that only last --exclude-where is used when user passes the same optional arg multiple times.

This is a proof of concept. It would make sense to roll this out more systematically to all arguments where we have nargs="+"

The warning looks as follows:

$ augur filter             --sequences data/sequences.fasta.zst             --metadata results/metadata.tsv             --exclude config/exclude_accessions_mpxv.txt             --output-sequences results/hmpxv1/good_sequences.fasta.zst             --output-metadata results/hmpxv1/good_metadata.tsv             --min-date 2017             --min-length 180000             --exclude-where outbreak!='hMPXV-1' clade!='IIb' --exclude-where 'coverage<0.95' --exclude-where 'divergence<550' --exclude-where 'missing_data>10000' --exclude-where 'divergence>650' --exclude-where 'QC_rare_mutations!=good'             --output-log results/hmpxv1/good_filter.log

WARNING: You passed `--exclude-where` 6 times. Only the last argument ['QC_rare_mutations!=good'] will be used.
The following arguments {'divergence>650', 'divergence<550', 'coverage<0.95', 'outbreak!=hMPXV-1', 'clade!=IIb', 'missing_data>10000'} will be ignored.
Note: You did not provide a sequence index, so Augur will generate one. You can generate your own index ahead of time with `augur index` and pass it with `augur filter --sequence-index`.
519 strains were dropped during filtering
        42 of these were dropped because of 'QC_rare_mutations!=good'
        113 of these were dropped because they were earlier than 2017.0 or missing a date
        364 of these were dropped because they were shorter than minimum length of 180000bp
6553 strains passed all filters

Testing

What steps should be taken to test the changes you've proposed?
If you added or changed behavior in the codebase, did you update the tests, or do you need help with this?

Checklist

  • Add a message in CHANGES.md summarizing the changes in this PR that are end user focused. Keep headers and formatting consistent with the rest of the file.

…passes multiple

See https://bedfordlab.slack.com/archives/C01LCTT7JNN/p1684360772513889
for discussion leading up to this

This change is entirely backwards compatible. All it does is add a helpful warning
@codecov
Copy link

codecov bot commented May 17, 2023

Codecov Report

Patch coverage: 60.00% and project coverage change: -0.02 ⚠️

Comparison is base (5c1fc41) 68.87% compared to head (ea609d1) 68.86%.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1229      +/-   ##
==========================================
- Coverage   68.87%   68.86%   -0.02%     
==========================================
  Files          64       64              
  Lines        6937     6940       +3     
  Branches     1693     1694       +1     
==========================================
+ Hits         4778     4779       +1     
- Misses       1854     1855       +1     
- Partials      305      306       +1     
Impacted Files Coverage Δ
augur/filter/validate_arguments.py 80.00% <33.33%> (-8.24%) ⬇️
augur/filter/__init__.py 100.00% <100.00%> (ø)
augur/filter/include_exclude_rules.py 97.72% <100.00%> (ø)

☔ View full report in Codecov by Sentry.
📢 Do you have feedback about the report comment? Let us know in this issue.

@joverlee521
Copy link
Contributor

Since --exclude-where already accepts multiple arguments, I would expect multiple uses of it to build up the list of conditions instead of just using the last provided.

I think this would mirror how we handle multiple --node-data args for export.

@corneliusroemer
Copy link
Member Author

@joverlee521 Interesting, so you're saying we should change the behaviour? I'm open for that, but maybe it would be best to first start warning about that change then. This PR is more about the ability to warn than a case for changing it. I'm fine with the current behaviour as long as users are warned.

@corneliusroemer corneliusroemer requested a review from a team May 18, 2023 20:21
Comment on lines +539 to +540
# Use only last exclude_where argument for backward compatibility.
# User warned in validate_arguments.py
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't like how the validation logic is separate from the actual logic. It's not stated explicitly anywhere, but the way I see it is that validate_arguments.py contains logic done ahead of time to avoid unnecessary computation at the cost of logical duplication. For example, there's no point in going through filtering only to realize subsampling is not possible.

Since this change is done during filter construction which doesn't come after any computationally expensive tasks, I suggest doing the validation here and not validate_arguments.py.

Comment on lines +49 to +51
if args.exclude_where and len(args.exclude_where) > 1:
print_err(f"WARNING: You passed `--exclude-where` {len(args.exclude_where)} times. Only the last argument {args.exclude_where[-1]} will be used.\n"
f"The following arguments {set(sum(args.exclude_where,[])).difference(set(args.exclude_where[-1]))} will be ignored.")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume the current behavior also applies to --include-where and other filter options. Should those get the same treatment?

@victorlin
Copy link
Member

+1 for directly changing the behavior. I don't see it as a breaking change - if a user is specifying --exclude-where multiple times, I doubt they are relying on the "feature" that only the last one is used. I created a PR to change it everywhere: #1445

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
No open projects
Development

Successfully merging this pull request may close these issues.

3 participants