ingest: Updates for augur curate and augur merge #65

Closed
joverlee521 opened this issue Oct 2, 2024 · 1 comment · Fixed by #67

@joverlee521
Contributor

Port changes from nextstrain/measles#52

@tsibley tsibley self-assigned this Oct 3, 2024
@tsibley
Member

tsibley commented Oct 3, 2024

Since this repo is expected to be used to stamp out new repos, and we don't have any particular process/plan for how to update previously stamped out repos, I'm going to drop this bit of backwards compatibility that was in my measles implementation:

diff --git a/ingest/rules/nextclade.smk b/ingest/rules/nextclade.smk
index a70d994..41a2f75 100644
--- a/ingest/rules/nextclade.smk
+++ b/ingest/rules/nextclade.smk
@@ -18,6 +18,8 @@ See Nextclade docs for more details on usage, inputs, and outputs if you would
 like to customize the rules:
 https://docs.nextstrain.org/projects/nextclade/page/user/nextclade-cli.html
 """
+import sys
+
 DATASET_NAME = config["nextclade"]["dataset_name"]
 
 
@@ -61,6 +63,14 @@ rule run_nextclade:
         """
 
 
+if isinstance(config["nextclade"]["field_map"], str):
+    print(f"Converting config['nextclade']['field_map'] from TSV file ({config['nextclade']['field_map']}) to dictionary; "
+          f"consider putting the field map directly in the config file.", file=sys.stderr)
+
+    with open(config["nextclade"]["field_map"], "r") as f:
+        config["nextclade"]["field_map"] = dict(line.rstrip("\n").split("\t", 1) for line in f if not line.startswith("#"))
+
+
 rule join_metadata_and_nextclade:
     input:
         nextclade="results/nextclade.tsv",
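
For reference, the compatibility shim shown in the diff above amounts to the following standalone sketch. The function name `load_field_map` and the sample column names are hypothetical and only for illustration; the parsing mirrors the one-liner in the diff (tab-separated pairs, lines starting with `#` skipped).

```python
import io


def load_field_map(lines):
    """Parse tab-separated old/new column pairs into a dict, skipping '#' comment lines."""
    return dict(
        line.rstrip("\n").split("\t", 1)
        for line in lines
        if not line.startswith("#")
    )


# Hypothetical field map contents, standing in for the TSV file named in the config.
example_tsv = io.StringIO(
    "# nextclade column\tmetadata column\n"
    "seqName\taccession\n"
    "clade\tclade_nextclade\n"
)

field_map = load_field_map(example_tsv)
```

Like the original one-liner, this sketch does not tolerate blank lines or lines without a tab; `dict()` would raise on them.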

tsibley added a commit that referenced this issue Oct 3, 2024
Preserving the line-breaks makes the command much more readable in
Snakemake output¹, which is important since I'm changing this rule right
now.

The \n previously interpreted by Python is now interpreted by `tr`,
which is preferable.

¹ <https://docs.nextstrain.org/en/latest/reference/snakemake-style-guide.html#use-triple-quoted-command-definitions>

Ported-from: <nextstrain/measles@762acdb>
Related-to: <nextstrain/measles#52>
Related-to: <#65>
tsibley added a commit that referenced this issue Oct 3, 2024
This construction reads much clearer and cleaner.

Moves the Nextclade field map directly and more conveniently into the
YAML config instead of referencing a separate TSV file.  Putting the
field map into a separate file seemed to be only for the sake of the
--kv-file (-k) interface provided by `csvtk rename2`, which we're no
longer using here.  Backwards compatibility with configs that name a TSV
file is not preserved since this pathogen-repo-guide is expected to be
used to stamp out new repos, and we don't have any particular
process/plan for how to update previously stamped out repos.

Note that `augur curate` commands currently emit CSV-like TSVs that are
restricted enough to be IANA-like¹, so parsing them with tsv-utils is most
appropriate; hence the switch from `csvtk cut` to `tsv-select`.

¹ See <nextstrain/augur#1566>.

Ported-from: <nextstrain/measles@faebd64>
Related-to: <nextstrain/measles#52>
Related-to: <#65>
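
As a rough illustration of the field map living directly in the config, the sketch below starts from a dictionary shaped like `config["nextclade"]["field_map"]` and renders it as `old=new` strings, the kind of argument list a rename step might take. The column names and the helper `field_map_args` are assumptions for illustration, not the repo's actual rule code.

```python
# Hypothetical field map, as it might look once loaded from the YAML config.
config = {
    "nextclade": {
        "field_map": {
            "seqName": "accession",
            "clade": "clade_nextclade",
        }
    }
}


def field_map_args(field_map):
    """Render an {old: new} mapping as a list of 'old=new' strings, in config order."""
    return [f"{old}={new}" for old, new in field_map.items()]


rename_args = field_map_args(config["nextclade"]["field_map"])
```

Keeping the mapping in the config means no extra file has to be opened or parsed before the workflow can use it.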
tsibley added a commit that referenced this issue Oct 3, 2024
This construction reads a bit clearer and cleaner.  It's also a good
example of how to use `augur merge`.

The limitation on non-seekable streams means the workflow now uses
additional transient disk space, but it typically shouldn't be an issue.
The way Augur's slow start up time impacts `augur merge` also
contributes to a longer rule execution time, but it should be negligible
in the context of the larger workflow and presumably we'll fix the slow
start up eventually.¹

The output is semantically identical but has some syntactic changes re:
quoting.  It's worth noting that the pre-existing TSV format was _not_
IANA TSV, despite it (still) being treated as such in a few places, but
was (and remains) a CSV-like TSV with some quoted fields.  We really
need to sort out our TSV formats³, but that's for a larger project.

¹ <nextstrain/augur#1628>
² <nextstrain/augur#1565>
³ <nextstrain/augur#1566>

Ported-from: <nextstrain/measles@4d73b7f>
Related-to: <nextstrain/measles#52>
Related-to: <#65>
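
The difference between a CSV-like TSV and an IANA-style TSV can be sketched with Python's `csv` module. The sample row is made up, and this only illustrates the quoting behaviour described above; it is not the code Augur uses.

```python
import csv
import io

# Made-up metadata row; the second field contains double quotes.
row = ["strain_1", 'isolate "A"', "USA"]

# CSV-like TSV: the csv module quotes any field containing the quote character
# and doubles the embedded quotes.
buf = io.StringIO()
csv.writer(buf, delimiter="\t", lineterminator="\n").writerow(row)
csv_like_line = buf.getvalue()

# IANA-style TSV: no quoting at all; fields are simply joined with tabs.
iana_line = "\t".join(row) + "\n"

# A CSV-aware reader recovers the original values from the CSV-like line...
parsed_csv_like = next(csv.reader(io.StringIO(csv_like_line), delimiter="\t"))

# ...while naively splitting on tabs leaves the quoting in the value.
parsed_naive = csv_like_line.rstrip("\n").split("\t")
```

Both lines carry the same values, which is the sense in which output can be semantically identical while differing in quoting; a tool that assumes the wrong dialect will see different field contents.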