Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

augur merge #1563

Merged
merged 2 commits into from
Aug 15, 2024
Merged

augur merge #1563

merged 2 commits into from
Aug 15, 2024

Conversation

tsibley
Copy link
Member

@tsibley tsibley commented Jul 30, 2024

Support generalized merging of two or more metadata tables. A long desired command. Behaviour is based on much discussion with the team and bespoke implementations like ncov's combine_metadata.py. Implementation requirements include handling inputs of arbitrary size (i.e. without needing to read any dataset fully into memory) and handling more than two inputs.

Checklist

@tsibley tsibley requested a review from a team July 30, 2024 23:51
@tsibley
Copy link
Member Author

tsibley commented Jul 30, 2024

One thing that's notable with this implementation is that it's stupidly slow for tiny datasets, e.g. a couple seconds. That's due to Augur's own slow startup time and having to wait for that 2+n times, where n is the number of metadata tables being joined. On large datasets, this fixed startup time shouldn't matter, but on small datasets it feels really dumb. Cutting out the additional startup times by cutting out the use of augur read-file and augur write-file makes it quite quick, as it should be. I think we can live with this for now, but if we can't, we can improve startup times or take a different approach to handling inputs.

Copy link
Contributor

@genehack genehack left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was extremely pleasant to read.

augur/merge.py Outdated Show resolved Hide resolved
@tsibley
Copy link
Member Author

tsibley commented Jul 31, 2024

Repushed with a couple of docstring/help output tweaks and a couple more tests for the SQLITE3 env var.

@tsibley
Copy link
Member Author

tsibley commented Jul 31, 2024

As I expected, CI is running into failures due to #1557: too old SQLite and no tsv-pretty.

augur/merge.py Outdated Show resolved Hide resolved
Base automatically changed from trs/read-write-file to master August 1, 2024 00:54
CHANGES.md Show resolved Hide resolved
joverlee521 added a commit that referenced this pull request Aug 1, 2024
This reverts commit 915672e.

Per discussion in <#1563 (comment)>,
keep CSV-like TSV where quotes may be added or removed, but parsed
values should be equivalent.
Copy link
Member

@victorlin victorlin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great stuff! I know it's a big ask so consider it non-blocking, but it would be nice to see a draft PR on an existing pathogen repo that would benefit from this feature. Such a PR would help me better understand (a) how it fits into existing workflows and (b) if there are any performance issues with actual data.

augur/merge.py Show resolved Hide resolved
augur/merge.py Show resolved Hide resolved
@tsibley
Copy link
Member Author

tsibley commented Aug 2, 2024

As I expected, CI is running into failures due to #1557: too old SQLite and no tsv-pretty.

I've fixed this 🤞 in repush by addding sqlite and tsv-utils to the Conda environment for CI.

And that also caused me to realize I had to update the installation from source docs.

Copy link
Member

@corneliusroemer corneliusroemer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, I'll happily try to test drive this in a few repos once available!

@tsibley
Copy link
Member Author

tsibley commented Aug 13, 2024

I've fixed this 🤞 in repush by addding sqlite and tsv-utils to the Conda environment for CI.

CI failed for a different reason after that. I don't immediately understand why, so will have to troubleshoot.

@tsibley
Copy link
Member Author

tsibley commented Aug 14, 2024

Using the new debug mode in #1577 and the GitHub Actions debugger I wrote a while back, I was able to identify the failure in CI (but not locally) as a SQLite version incompatibility. Locally I'd been testing with SQLite 3.39 (the minimum supported version). CI was testing with SQLite 3.46.

The augur merge code was calling pragma_table_info("table_name") which was relying on the quoted identifier "table_name" being treated as a string literal. That behaviour changed for SQLite's CLI in 3.41, and it was no longer treated as a string literal. My mistake: I thought that the table-valued function took an identifier not a string argument, but it takes a string. The solution that's compatible across versions is to give it the string it really wants. I squashed the fix in to this PR's commit.

Base automatically changed from trs/debug-mode to master August 14, 2024 21:54
@tsibley
Copy link
Member Author

tsibley commented Aug 15, 2024

Ok, remaining test failures seem to be a discrepancy in stdout/err buffering/ordering in Python 3.8 vs. ≥3.9. Compare

$ python --version
Python 3.8.19

$ cram tests/functional/merge/cram/merge.t
!
--- tests/functional/merge/cram/merge.t
+++ tests/functional/merge/cram/merge.t.err
@@ -191,8 +191,8 @@
   $ ${AUGUR} merge \
   >   --metadata dups=dups.tsv Y=y.tsv \
   >   --output-metadata /dev/null
-  Reading 'dups' metadata from 'dups.tsv'…
   Error: stepping, UNIQUE constraint failed: metadata_dups.strain (19)
+  Reading 'dups' metadata from 'dups.tsv'\xe2\x80\xa6 (esc)
   WARNING: Skipped deletion of */augur-merge-*.sqlite due to error, but you may want to clean it up yourself (e.g. if it's large). (glob)
   ERROR: sqlite3 invocation failed
   [2]

# Ran 1 tests, 0 skipped, 1 failed.

vs.

$ python --version
Python 3.9.19

$ cram tests/functional/merge/cram/merge.t
.
# Ran 1 tests, 0 skipped, 0 failed.

Have we dealt with this ordering issue before? Do we have a standard solution?
I can think of a few, but I'll use our convention if one exists.

@tsibley
Copy link
Member Author

tsibley commented Aug 15, 2024

This is probably stderr changing default buffering modes beginning in 3.9.

Previously, sys.stderr was block-buffered when non-interactive. Now stderr defaults to always being line-buffered.

Copy link

codecov bot commented Aug 15, 2024

Codecov Report

Attention: Patch coverage is 96.22642% with 4 lines in your changes missing coverage. Please review.

Project coverage is 70.51%. Comparing base (28b8705) to head (b649f11).
Report is 130 commits behind head on master.

Files with missing lines Patch % Lines
augur/io/print.py 50.00% 1 Missing and 1 partial ⚠️
augur/merge.py 98.03% 1 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1563      +/-   ##
==========================================
+ Coverage   70.17%   70.51%   +0.34%     
==========================================
  Files          77       78       +1     
  Lines        8001     8107     +106     
  Branches     1950     1966      +16     
==========================================
+ Hits         5615     5717     +102     
- Misses       2098     2100       +2     
- Partials      288      290       +2     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Support generalized merging of two or more metadata tables.  A long
desired command.  Behaviour is based on much discussion with the team
and bespoke implementations like ncov's combine_metadata.py.
Implementation requirements include handling inputs of arbitrary size
(i.e. without needing to read any dataset fully into memory) and
handling more than two inputs.  SQLite is used in the implementation but
could be replaced by another implementation in the future.

One thing that's notable with this implementation is that it's stupidly
slow for tiny datasets, e.g. a couple seconds.  That's due to Augur's
own slow startup time and having to wait for that 2+n times, where n is
the number of metadata tables being joined, plus once for the initial
startup of `augur merge` and once more for writing the output.  On large
datasets, this fixed startup time shouldn't matter, but on small
datasets it feels really dumb.  Cutting out the additional startup times
by cutting out the use of `augur read-file` and `augur write-file` makes
it quite quick, as it should be.  I think we can live with this slowness
for now, but if it turns out we can't, we can improve startup times or
take a different approach to handling inputs.
The change in default from block-buffering in ≤3.8 to line-buffering in
≥3.9¹ made a Cram test output vary between Python versions (and thus
fail).  I could fix the Cram test various ways, but always
line-buffering stderr makes sense because we're exclusively using it for
messaging/logging.

¹ <https://docs.python.org/3/whatsnew/3.9.html#sys>
@tsibley
Copy link
Member Author

tsibley commented Aug 15, 2024

I looked into the few new lines that weren't covered by tests, expecting to dismiss them, and was puzzled that the "no sqlite3 found" error wasn't covered. I definitely had a test for it.

No `sqlite3`.
$ SQLITE3= \
> augur merge \
> --metadata X=x.tsv Y=y.tsv \
> --output-metadata /dev/null --quiet
ERROR: Unable to find the program `sqlite3`. Is it installed?
In order to use `augur merge`, the SQLite 3 CLI (version ≥3.39)
must be installed separately. It is typically provided by a
Nextstrain runtime.
[2]

And then I realized: that test (and the one above it) were calling augur instead of ${AUGUR}, and coverage is recorded by specially setting AUGUR.

Copy link
Member

@victorlin victorlin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the only substantial change since my previous review is b649f11 which looks good.

Re: Have we dealt with this ordering issue before?

Something similar prompted the changes to tests/functional/filter/cram/subsample-ambiguous-dates-error.t in 851a141, but I think that was more due to mixing stdout/stderr? The scenario that you encountered must be new, otherwise we would have failing tests on either Python 3.8 or ≥3.9.

@tsibley
Copy link
Member Author

tsibley commented Aug 15, 2024

There's still some polish and minor features I want to add to augur merge, some of them based on testing it out for the Nextclade metadata merge, and of course there's the major feature of sequence merging, but I'm going to do those in subsequent PRs.

I'm going to merge this.

@tsibley tsibley merged commit f5275d0 into master Aug 15, 2024
28 checks passed
@tsibley tsibley deleted the trs/merge branch August 15, 2024 17:28
joverlee521 added a commit that referenced this pull request Nov 12, 2024
This reverts commit 915672e.

Per discussion in <#1563 (comment)>,
keep CSV-like TSV where quotes may be added or removed, but parsed
values should be equivalent.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants