ingest: Merge Nextclade metadata with augur merge
This construction reads a bit clearer and cleaner.  It's also a good
example of how to use `augur merge`.

The limitation on non-seekable streams means the workflow now uses
additional transient disk space, but it typically shouldn't be an issue.
The way Augur's slow startup time impacts `augur merge` also
contributes to a longer rule execution time, but it should be negligible
in the context of the larger workflow and presumably we'll fix the slow
startup eventually.¹
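
To illustrate the seekability distinction behind the extra temporary file, here is a minimal Python sketch. It only shows that a pipe (like the one the old shell pipeline streamed through) cannot be rewound while a regular file on disk can. It makes no claim about why `augur merge` needs seekable inputs, and the helper name `is_seekable` is made up for this example.

```python
import os
import tempfile


def is_seekable(fileobj) -> bool:
    """Report whether a file object supports random access (seek/rewind)."""
    return fileobj.seekable()


# A pipe, comparable to the stdout of an upstream command in a shell pipeline.
read_fd, write_fd = os.pipe()
with os.fdopen(read_fd, "rb") as pipe_reader, os.fdopen(write_fd, "wb"):
    pipe_is_seekable = is_seekable(pipe_reader)

# A regular file on disk, comparable to a Snakemake temp() output.
with tempfile.TemporaryFile() as disk_file:
    file_is_seekable = is_seekable(disk_file)

print("pipe seekable:", pipe_is_seekable)
print("file seekable:", file_is_seekable)
```

Writing the intermediate table to a `temp()` output trades a little transient disk space for an input that can be read more than once.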

The output is semantically identical but has some syntactic changes re:
quoting.  It's worth noting that the pre-existing TSV format was _not_
IANA TSV, despite it (still) being treated as such in a few places, but
was (and remains) a CSV-like TSV with some quoted fields (and some
mangled quotes², e.g. the "institution" column for accession KJ556895).
We really need to sort out our TSV formats³, but that's for a larger
project.
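
As a rough illustration of why the two TSV dialects are not interchangeable, the sketch below parses one row both ways. The row itself is invented for the example (it is not the actual record for any accession), and the function names are hypothetical.

```python
import csv
import io

# Invented example row: the second field uses CSV-style quoting with
# doubled quotes inside it.
line = 'ACC0001\t"Institute of ""Example"" Research"\tXX\n'


def parse_iana_tsv(text):
    """IANA TSV: fields are split on tabs; quote characters are plain data."""
    return [row.split("\t") for row in text.splitlines()]


def parse_csv_like_tsv(text):
    """CSV-like TSV: tab-delimited, but with CSV quoting rules applied."""
    return list(csv.reader(io.StringIO(text), delimiter="\t", quotechar='"'))


iana_rows = parse_iana_tsv(line)
csv_rows = parse_csv_like_tsv(line)

print(iana_rows[0][1])  # quotes are kept verbatim
print(csv_rows[0][1])   # quotes are interpreted and unescaped
```

A reader that assumes IANA TSV keeps the quote characters as part of the value, while a CSV-like reader strips and unescapes them, so the same bytes yield different field values depending on which convention a tool assumes.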

¹ <nextstrain/augur#1628>
² <nextstrain/augur#1565>
³ <nextstrain/augur#1566>
tsibley committed Sep 10, 2024
1 parent faebd64 commit 4d73b7f
Showing 1 changed file with 27 additions and 14 deletions.
41 changes: 27 additions & 14 deletions ingest/rules/nextclade.smk
@@ -56,14 +56,12 @@ if isinstance(config["nextclade"]["field_map"], str):
         config["nextclade"]["field_map"] = dict(line.rstrip("\n").split("\t", 1) for line in f if not line.startswith("#"))
 
 
-rule join_metadata_and_nextclade:
+rule nextclade_metadata:
     input:
         nextclade="results/nextclade.tsv",
-        metadata="data/subset_metadata.tsv",
     output:
-        metadata="results/metadata.tsv",
+        nextclade_metadata=temp("results/nextclade_metadata.tsv"),
     params:
-        metadata_id_field=config["curate"]["output_id_field"],
         nextclade_id_field=config["nextclade"]["id_field"],
         nextclade_field_map=[f"{old}={new}" for old, new in config["nextclade"]["field_map"].items()],
         nextclade_fields=",".join(config["nextclade"]["field_map"].values()),
@@ -75,13 +73,28 @@ rule join_metadata_and_nextclade:
             --field-map {params.nextclade_field_map:q} \
             --output-metadata - \
         | tsv-select --header --fields {params.nextclade_fields:q} \
-        | tsv-join -H \
-            --filter-file - \
-            --key-fields {params.nextclade_id_field} \
-            --data-fields {params.metadata_id_field} \
-            --append-fields '*' \
-            --write-all ? \
-            {input.metadata} \
-        | tsv-select -H --exclude {params.nextclade_id_field} \
-            > {output.metadata}
-        """
+            > {output.nextclade_metadata:q}
+        """
+
+
+rule join_metadata_and_nextclade:
+    input:
+        metadata="data/subset_metadata.tsv",
+        nextclade_metadata="results/nextclade_metadata.tsv",
+    output:
+        metadata="results/metadata.tsv",
+    params:
+        metadata_id_field=config["curate"]["output_id_field"],
+        nextclade_id_field=config["nextclade"]["id_field"],
+    shell:
+        r"""
+        augur merge \
+            --metadata \
+                metadata={input.metadata:q} \
+                nextclade={input.nextclade_metadata:q} \
+            --metadata-id-columns \
+                metadata={params.metadata_id_field:q} \
+                nextclade={params.nextclade_id_field:q} \
+            --output-metadata {output.metadata:q} \
+            --no-source-columns
+        """
