Move hardcode filter query into configuration files #273

huddlej · 2024-08-20T17:35:52Z

Description

The following hardcoded filter parameter appears at the start of the phylogenetic workflow:

mpox/phylogenetic/rules/prepare_sequences.smk

Line 85 in 2ce0d92

--query "(QC_rare_mutations == 'good' | QC_rare_mutations == 'mediocre')" \

When the user's metadata does not have the column referenced in that query (as happens when analyzing data from GISAID, for example), augur filter produces the following output:

WARNING: Column 'QC_rare_mutations' does not exist in the metadata file. Ignoring it.
ERROR: Query contains a column that does not exist in metadata.

Although that output looks like an augur bug (the same problem is reported as both a warning and an error), the proximal issue is that the workflow hardcodes parameters that the user cannot override without changing the workflow itself.

Proposed solution

I suggest moving the query string into the config files for the various workflows, specifically moving the hardcoded query into the top-level filter section of each config file (e.g., defaults/mpxv/config.yaml). Then users who want to analyze data without the fields referenced in that query can create their own config file.
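As a rough sketch of the idea (not the workflow's actual code), the rule could build its `--query` argument from the config instead of hardcoding it. The `query` key under `filter` and the helper name `query_argument` are assumptions made for illustration; only the `--query` flag and the query string itself come from the existing rule.

```python
import shlex

# Hypothetical config layout: a "query" key under the top-level "filter"
# section. The key name is an assumption, not the workflow's actual schema.
DEFAULT_CONFIG = {
    "filter": {
        "query": "(QC_rare_mutations == 'good' | QC_rare_mutations == 'mediocre')",
    }
}


def query_argument(config):
    """Return the `--query ...` shell fragment for augur filter.

    Returns an empty string when no query is configured, so a user's
    config can drop the QC filter without editing the workflow.
    """
    query = config.get("filter", {}).get("query")
    if not query:
        return ""
    # Quote the query so the shell passes it to augur as one argument.
    return f"--query {shlex.quote(query)}"
```

With something like this, a user whose metadata lacks the QC column could supply a config with an empty or different `query` value, and the rule would omit or replace the filter accordingly.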


corneliusroemer · 2024-08-22T19:20:12Z

Even simpler: check for presence of that column and make the filter dependent on whether it's present or not.

I'd prefer automatic stuff over configure.
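A minimal sketch of what such a check might look like, assuming the metadata is a tab-separated file with a header row. The function names are hypothetical and this is not code from the workflow.

```python
import csv
import io


def metadata_columns(handle, delimiter="\t"):
    """Return the column names from the header row of a metadata table."""
    reader = csv.reader(handle, delimiter=delimiter)
    # An empty file has no header, so fall back to an empty list.
    return next(reader, [])


def query_if_column_present(handle, column, query):
    """Return the query only when the column it needs is in the header."""
    if column in metadata_columns(handle):
        return query
    return None
```

For example, `query_if_column_present(io.StringIO("strain\tdate\n"), "QC_rare_mutations", "...")` returns `None`, and the rule would then skip the `--query` argument.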

Also, people should in general just fork things and make the changes they want themselves.

joverlee521 · 2025-02-13T18:23:37Z

A workaround is to use the private_sequences and private_metadata config options added for the INRB builds. (Thanks @tsibley for pointing this out in Zoho.)

The workflow merges in the private data after the filter rule, so the private data bypasses the QC filters and only goes through subsampling.

tsibley · 2025-02-13T18:27:55Z

> Even simpler: check for presence of that column and make the filter dependent on whether it's present or not.
>
> I'd prefer automatic stuff over configure.

Sure, automatic stuff can be nice. It can also be the cause of subtle problems. For example, if the column name changes in the TSV then the filter suddenly and silently fails to apply instead of raising an error. Explicit vs. implicit is a delicate tradeoff and is highly context dependent.
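To make the tradeoff concrete, here is a small sketch with a hypothetical helper, `select_query`, that supports both behaviors. With `strict=False` a renamed column quietly disables the filter; with `strict=True` the same rename raises an error. The renamed column in the usage note is invented for illustration.

```python
def select_query(columns, required_column, query, strict=True):
    """Pick the filter query based on whether its column is available.

    strict=True  (explicit): a missing column is an error.
    strict=False (implicit): a missing column silently drops the filter.
    """
    if required_column in columns:
        return query
    if strict:
        raise ValueError(
            f"Column {required_column!r} is required by the filter query "
            "but is missing from the metadata."
        )
    return None
```

Given columns `["strain", "qc_rare_mutations"]` (a hypothetical rename that only changes case), the implicit mode returns `None` and no QC filtering happens, while the explicit mode fails loudly.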

> Also, people should in general just fork things and make the changes they want themselves.

What's easy for you is not always easy for others. Copying a repo and making a small change is certainly within the grasp of most folks, but maintaining that fork over time while taking updates from us is not a trivial process for most folks.

huddlej mentioned this issue Aug 20, 2024

Support for GISAID data #63

Open

victorlin mentioned this issue Aug 21, 2024

filter: Confusing warning and error combination when query contains a missing column nextstrain/augur#1592

Closed
