Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ingest: Nextclade merge updates #67

Merged
merged 3 commits into from
Oct 3, 2024
Merged

Conversation

tsibley
Copy link
Member

@tsibley tsibley commented Oct 3, 2024

See commits.

Resolves #65 and #55.

Checklist

  • Checks pass

Preserving the line-breaks makes the command much more readable in
Snakemake output¹, which is important since I'm changing this rule right
now.

The \n previously interpreted by Python is now interpreted by `tr`,
which is preferable.

¹ <https://docs.nextstrain.org/en/latest/reference/snakemake-style-guide.html#use-triple-quoted-command-definitions>

Ported-from: <nextstrain/measles@762acdb>
Related-to: <nextstrain/measles#52>
Related-to: <#65>
This construction reads much clearer and cleaner.

Moves the Nextclade field map directly and more conveniently into the
YAML config instead of referencing a separate TSV file.  Putting the
field map into a separate file seemed to be only for the sake of the
--kv-file (-k) interface provided by `cvstk rename2`, which we're no
longer using here.  Backwards compatibility with configs that name a TSV
file is not preserved since this pathogen-repo-guide is expected to be
used to stamp out new repos, and we don't have any particular
process/plan for how to update previously stamped out repos.

Note that `augur curate` commands currently emit CSV-like TSVs that are
limited to be IANA-like¹ such that parsing them with tsv-utils is most
appropriate, hence the switch from `csvtk cut` to `tsv-select`.

¹ See <nextstrain/augur#1566>.

Ported-from: <nextstrain/measles@faebd64>
Related-to: <nextstrain/measles#52>
Related-to: <#65>
This construction reads a bit clearer and cleaner.  It's also a good
example of how to use `augur merge`.

The limitation on non-seekable streams means the workflow now uses
additional transient disk space, but it typically shouldn't be an issue.
The way Augur's slow start up time impacts `augur merge` also
contributes to a longer rule execution time, but it should be negligible
in the context of the larger workflow and presumably we'll fix the slow
start up eventually.¹

The output is semantically identical but has some syntactic changes re:
quoting.  It's worth noting that the pre-existing TSV format was _not_
IANA TSV, despite it (still) being treated as such in a few places, but
was (and remains) a CSV-like TSV with some quoted fields.  We really
need to sort out our TSV formats³, but that's for a larger project.

¹ <nextstrain/augur#1628>
² <nextstrain/augur#1565>
³ <nextstrain/augur#1566>

Ported-from: <nextstrain/measles@4d73b7f>
Related-to: <nextstrain/measles#52>
Related-to: <#65>
@tsibley tsibley linked an issue Oct 3, 2024 that may be closed by this pull request
Copy link
Contributor

@joverlee521 joverlee521 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you! So much simpler!

I was thinking I might need to update the related tutorial, but looks like section on Nextclade as part of ingest did not dive into the specifics of the config. So no need for updates there!

@tsibley tsibley merged commit abdea39 into main Oct 3, 2024
1 check passed
@tsibley tsibley deleted the trs/ingest/nextclade-merge branch October 3, 2024 21:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ingest: Updates for augur curate and augur merge ingest/nextclade: Replace csvtk
3 participants