augur merge #1563

tsibley · 2024-07-30T23:51:32Z

Support generalized merging of two or more metadata tables. A long-desired command. Behaviour is based on much discussion with the team and bespoke implementations like ncov's combine_metadata.py. Implementation requirements include handling inputs of arbitrary size (i.e. without needing to read any dataset fully into memory) and handling more than two inputs.

Checklist

Fix missing deps in CI
Switch TSV output to be CSV-like? (see also Pick a meaning of TSV and use it consistently #1566)
Add sequencing merging?
Automated checks pass
Check if you need to add a changelog message
Check if you need to add tests
Check if you need to update docs

tsibley · 2024-07-30T23:55:45Z

One thing that's notable with this implementation is that it's stupidly slow for tiny datasets, e.g. a couple seconds. That's due to Augur's own slow startup time and having to wait for that 2+n times, where n is the number of metadata tables being joined. On large datasets, this fixed startup time shouldn't matter, but on small datasets it feels really dumb. Cutting out the additional startup times by cutting out the use of augur read-file and augur write-file makes it quite quick, as it should be. I think we can live with this for now, but if we can't, we can improve startup times or take a different approach to handling inputs.

genehack

This was extremely pleasant to read.

tsibley · 2024-07-31T23:45:38Z

Repushed with a couple of docstring/help output tweaks and a couple more tests for the SQLITE3 env var.

tsibley · 2024-07-31T23:47:30Z

As I expected, CI is running into failures due to #1557: too old SQLite and no tsv-pretty.

CHANGES.md

This reverts commit 915672e. Per discussion in <#1563 (comment)>, keep CSV-like TSV where quotes may be added or removed, but parsed values should be equivalent.

victorlin

Great stuff! I know it's a big ask so consider it non-blocking, but it would be nice to see a draft PR on an existing pathogen repo that would benefit from this feature. Such a PR would help me better understand (a) how it fits into existing workflows and (b) if there are any performance issues with actual data.

tsibley · 2024-08-02T23:45:15Z

> As I expected, CI is running into failures due to #1557: too old SQLite and no tsv-pretty.

I've fixed this 🤞 in repush by adding sqlite and tsv-utils to the Conda environment for CI.

And that also caused me to realize I had to update the installation from source docs.

corneliusroemer

Nice, I'll happily try to test drive this in a few repos once available!

tsibley · 2024-08-13T21:33:18Z

> I've fixed this 🤞 in repush by adding sqlite and tsv-utils to the Conda environment for CI.

CI failed for a different reason after that. I don't immediately understand why, so will have to troubleshoot.

tsibley · 2024-08-14T21:46:45Z

Using the new debug mode in #1577 and the GitHub Actions debugger I wrote a while back, I was able to identify the failure in CI (but not locally) as a SQLite version incompatibility. Locally I'd been testing with SQLite 3.39 (the minimum supported version). CI was testing with SQLite 3.46.

The augur merge code was calling pragma_table_info("table_name"), which relied on the double-quoted "table_name" being treated as a string literal. That behaviour changed for SQLite's CLI in 3.41, where it is no longer treated as a string literal. My mistake: I thought that the table-valued function took an identifier, not a string argument, but it takes a string. The solution that's compatible across versions is to give it the string it really wants. I squashed the fix into this PR's commit.
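For illustration, here is a minimal sketch of the version-independent form, using Python's built-in sqlite3 module rather than the SQLite CLI that augur merge invokes. The table name `metadata_x` and its columns are made up for the example. The point is only that the argument is written as a single-quoted string literal, so it does not depend on whether double-quoted strings are accepted.

```python
import sqlite3

# In-memory database with a hypothetical table; names are for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute('create table "metadata_x" (strain text, date text)')

# pragma_table_info() is a table-valued function that takes the table name as
# a string, so pass a single-quoted string literal rather than a
# double-quoted identifier.
columns = [
    row[0]
    for row in conn.execute("select name from pragma_table_info('metadata_x')")
]

print(columns)
```

Single quotes always denote a string literal in SQLite, whereas the double-quoted form only works where the legacy "double-quoted string literal" fallback is enabled.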

tsibley · 2024-08-15T00:15:13Z

Ok, remaining test failures seem to be a discrepancy in stdout/err buffering/ordering in Python 3.8 vs. ≥3.9. Compare

$ python --version
Python 3.8.19

$ cram tests/functional/merge/cram/merge.t
!
--- tests/functional/merge/cram/merge.t
+++ tests/functional/merge/cram/merge.t.err
@@ -191,8 +191,8 @@
   $ ${AUGUR} merge \
   >   --metadata dups=dups.tsv Y=y.tsv \
   >   --output-metadata /dev/null
-  Reading 'dups' metadata from 'dups.tsv'…
   Error: stepping, UNIQUE constraint failed: metadata_dups.strain (19)
+  Reading 'dups' metadata from 'dups.tsv'\xe2\x80\xa6 (esc)
   WARNING: Skipped deletion of */augur-merge-*.sqlite due to error, but you may want to clean it up yourself (e.g. if it's large). (glob)
   ERROR: sqlite3 invocation failed
   [2]

# Ran 1 tests, 0 skipped, 1 failed.

vs.

$ python --version
Python 3.9.19

$ cram tests/functional/merge/cram/merge.t
.
# Ran 1 tests, 0 skipped, 0 failed.

Have we dealt with this ordering issue before? Do we have a standard solution?
I can think of a few, but I'll use our convention if one exists.

tsibley · 2024-08-15T00:17:01Z

This is probably stderr changing default buffering modes beginning in 3.9.

> Previously, sys.stderr was block-buffered when non-interactive. Now stderr defaults to always being line-buffered.

codecov · 2024-08-15T00:43:24Z

Codecov Report

Attention: Patch coverage is 96.22642% with 4 lines in your changes missing coverage. Please review.

Project coverage is 70.51%. Comparing base (28b8705) to head (b649f11).
Report is 130 commits behind head on master.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| augur/io/print.py | 50.00% | 1 Missing and 1 partial ⚠️ |
| augur/merge.py | 98.03% | 1 Missing and 1 partial ⚠️ |

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #1563      +/-   ##
==========================================
+ Coverage   70.17%   70.51%   +0.34%     
==========================================
  Files          77       78       +1     
  Lines        8001     8107     +106     
  Branches     1950     1966      +16     
==========================================
+ Hits         5615     5717     +102     
- Misses       2098     2100       +2     
- Partials      288      290       +2

Support generalized merging of two or more metadata tables. A long-desired command. Behaviour is based on much discussion with the team and bespoke implementations like ncov's combine_metadata.py. Implementation requirements include handling inputs of arbitrary size (i.e. without needing to read any dataset fully into memory) and handling more than two inputs. SQLite is used in the implementation but could be replaced by another implementation in the future.

One thing that's notable with this implementation is that it's stupidly slow for tiny datasets, e.g. a couple seconds. That's due to Augur's own slow startup time and having to wait for that 2+n times: once for the initial startup of `augur merge`, once for writing the output, and once for each of the n metadata tables being joined. On large datasets, this fixed startup time shouldn't matter, but on small datasets it feels really dumb. Cutting out the additional startup times by cutting out the use of `augur read-file` and `augur write-file` makes it quite quick, as it should be. I think we can live with this slowness for now, but if it turns out we can't, we can improve startup times or take a different approach to handling inputs.
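As a rough illustration of how a SQLite-backed join over an id column can work, here is a minimal sketch using Python's built-in sqlite3 module. It is not the code from this PR. The `metadata_<name>` table naming and the unique `strain` column mirror the error message in the Cram test output, but the column names, the sample rows, and the precedence rule (a later table's non-empty value wins) are assumptions made for the example.

```python
import sqlite3

# An in-memory database keeps the example self-contained; a file-backed
# database would let SQLite spill to disk for inputs larger than memory.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table metadata_x (strain text primary key, a text, b text);
    create table metadata_y (strain text primary key, b text, c text);
    insert into metadata_x values ('one', 'X1a', 'X1b'), ('two', 'X2a', 'X2b');
    insert into metadata_y values ('two', '', 'Y2c'), ('three', 'Y3b', 'Y3c');
""")

# Collect every id seen in any table, then left join each table onto that
# list. For the shared column b, prefer the later table (y) unless its value
# is empty or missing (assumed rule, for illustration only).
rows = conn.execute("""
    with ids as (
        select strain from metadata_x
        union
        select strain from metadata_y
    )
    select
        ids.strain,
        x.a,
        coalesce(nullif(y.b, ''), x.b) as b,
        y.c
    from ids
    left join metadata_x as x on x.strain = ids.strain
    left join metadata_y as y on y.strain = ids.strain
    order by ids.strain
""").fetchall()

for row in rows:
    print(row)
```

Declaring the id column as a primary key is what makes duplicate ids fail at load time with a UNIQUE constraint error, similar to the `dups.tsv` case in the test output.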

The change in stderr's default from block-buffering in ≤3.8 to line-buffering in ≥3.9¹ made a Cram test's output vary between Python versions (and thus fail). I could fix the Cram test in various ways, but always line-buffering stderr makes sense because we're exclusively using it for messaging/logging.

¹ <https://docs.python.org/3/whatsnew/3.9.html#sys>
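One way to get the same stderr behaviour on every supported Python version is to request line buffering explicitly at startup. The following is a minimal sketch under that assumption. The helper name `ensure_line_buffered` is hypothetical and is not taken from this PR's code. It relies only on `TextIOWrapper.reconfigure()`, which is available since Python 3.7.

```python
import io
import sys


def ensure_line_buffered(stream):
    """Switch a text stream to line buffering if it supports reconfigure().

    Hypothetical helper for illustration. Streams without reconfigure()
    (e.g. replaced or missing stderr) are left untouched.
    """
    reconfigure = getattr(stream, "reconfigure", None)
    if reconfigure is not None:
        reconfigure(line_buffering=True)
    return stream


# Apply to the real stderr so messages are flushed at each newline,
# regardless of whether output is going to a terminal or a pipe.
ensure_line_buffered(sys.stderr)

# Demonstrate the effect on a standalone text stream.
demo = io.TextIOWrapper(io.BytesIO(), encoding="utf-8")
ensure_line_buffered(demo)
print(demo.line_buffering)
```

With line buffering on, each complete line written to stderr is flushed immediately, so its position relative to output from child processes no longer depends on the interpreter's default.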

tsibley · 2024-08-15T04:39:46Z

I looked into the few new lines that weren't covered by tests, expecting to dismiss them, and was puzzled that the "no sqlite3 found" error wasn't covered. I definitely had a test for it.

augur/tests/functional/merge/cram/merge.t

Lines 217 to 229 in 8db6617

    No `sqlite3`.

      $ SQLITE3= \
      > augur merge \
      >   --metadata X=x.tsv Y=y.tsv \
      >   --output-metadata /dev/null --quiet
      ERROR: Unable to find the program `sqlite3`.  Is it installed?
      In order to use `augur merge`, the SQLite 3 CLI (version ≥3.39)
      must be installed separately.  It is typically provided by a
      Nextstrain runtime.
      [2]

And then I realized: that test (and the one above it) were calling augur instead of ${AUGUR}, and coverage is recorded by specially setting AUGUR.

victorlin

I think the only substantial change since my previous review is b649f11 which looks good.

Re: Have we dealt with this ordering issue before?

Something similar prompted the changes to tests/functional/filter/cram/subsample-ambiguous-dates-error.t in 851a141, but I think that was more due to mixing stdout/stderr? The scenario that you encountered must be new, otherwise we would have failing tests on either Python 3.8 or ≥3.9.

tsibley · 2024-08-15T17:27:38Z

There's still some polish and minor features I want to add to augur merge, some of them based on testing it out for the Nextclade metadata merge, and of course there's the major feature of sequence merging, but I'm going to do those in subsequent PRs.

I'm going to merge this.

tsibley requested a review from a team July 30, 2024 23:51

tsibley force-pushed the trs/merge branch from 868818d to cd6c195 Compare July 30, 2024 23:58

genehack approved these changes Jul 31, 2024

tsibley force-pushed the trs/merge branch from cd6c195 to afeb986 Compare July 31, 2024 23:49

victorlin reviewed Aug 1, 2024

Base automatically changed from trs/read-write-file to master August 1, 2024 00:54

tsibley force-pushed the trs/merge branch from afeb986 to eca2579 Compare August 1, 2024 18:07

tsibley mentioned this pull request Aug 1, 2024

augur curate passthru can add double quotations #1312

Closed

victorlin reviewed Aug 1, 2024

tsibley force-pushed the trs/merge branch from eca2579 to 85e6eba Compare August 1, 2024 18:57

joverlee521 mentioned this pull request Aug 1, 2024

Fix curate internal quotes take 2 #1565

Merged

4 tasks

victorlin approved these changes Aug 2, 2024

tsibley mentioned this pull request Aug 2, 2024

Pick a meaning of TSV and use it consistently #1566

Open

6 tasks

tsibley force-pushed the trs/merge branch from 85e6eba to 92c2926 Compare August 2, 2024 23:44

corneliusroemer reviewed Aug 3, 2024

corneliusroemer mentioned this pull request Aug 3, 2024

Add sqlite3 cli to augur's bioconda recipe once augur merge PR is merged #1567

Open

corneliusroemer reviewed Aug 3, 2024

trvrb mentioned this pull request Aug 13, 2024

read_metadata: Allow graceful handling of duplicate strain names #810

Closed

tsibley linked an issue Aug 14, 2024 that may be closed by this pull request

Add a debug mode to print stack trace for exceptions caught by top-level handler #1308

Closed

tsibley force-pushed the trs/merge branch 2 times, most recently from 0f5cc11 to 99231db Compare August 14, 2024 21:29

tsibley removed a link to an issue Aug 14, 2024

Add a debug mode to print stack trace for exceptions caught by top-level handler #1308

Closed

tsibley changed the base branch from master to trs/debug-mode August 14, 2024 21:31

Base automatically changed from trs/debug-mode to master August 14, 2024 21:54

tsibley force-pushed the trs/merge branch from e1e3a21 to 8db6617 Compare August 15, 2024 00:28

tsibley added 2 commits August 14, 2024 21:21

tsibley force-pushed the trs/merge branch from 8db6617 to b649f11 Compare August 15, 2024 04:39

victorlin approved these changes Aug 15, 2024

tsibley merged commit f5275d0 into master Aug 15, 2024
28 checks passed

tsibley deleted the trs/merge branch August 15, 2024 17:28

tsibley mentioned this pull request Aug 15, 2024

merge: Support sequences #1579

Open

victorlin mentioned this pull request Aug 27, 2024

Subcommand for data de-duplication #919

Open

tsibley mentioned this pull request Sep 10, 2024

augur merge is slow to read in metadata #1628

Open

2 tasks

victorlin mentioned this pull request Sep 13, 2024

WIP: Support multiple inputs during filter #697

Closed
