-
Notifications
You must be signed in to change notification settings - Fork 128
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
augur merge #1563
augur merge #1563
Conversation
One thing that's notable with this implementation is that it's stupidly slow for tiny datasets, e.g. a couple seconds. That's due to Augur's own slow startup time and having to wait for that 2+n times, where n is the number of metadata tables being joined. On large datasets, this fixed startup time shouldn't matter, but on small datasets it feels really dumb. Cutting out the additional startup times by cutting out the use of |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was extremely pleasant to read.
Repushed with a couple of docstring/help output tweaks and a couple more tests for the |
As I expected, CI is running into failures due to #1557: too old SQLite and no |
This reverts commit 915672e. Per discussion in <#1563 (comment)>, keep CSV-like TSV where quotes may be added or removed, but parsed values should be equivalent.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great stuff! I know it's a big ask so consider it non-blocking, but it would be nice to see a draft PR on an existing pathogen repo that would benefit from this feature. Such a PR would help me better understand (a) how it fits into existing workflows and (b) if there are any performance issues with actual data.
I've fixed this 🤞 in repush by addding And that also caused me to realize I had to update the installation from source docs. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice, I'll happily try to test drive this in a few repos once available!
CI failed for a different reason after that. I don't immediately understand why, so will have to troubleshoot. |
0f5cc11
to
99231db
Compare
Using the new debug mode in #1577 and the GitHub Actions debugger I wrote a while back, I was able to identify the failure in CI (but not locally) as a SQLite version incompatibility. Locally I'd been testing with SQLite 3.39 (the minimum supported version). CI was testing with SQLite 3.46. The |
Ok, remaining test failures seem to be a discrepancy in stdout/err buffering/ordering in Python 3.8 vs. ≥3.9. Compare $ python --version
Python 3.8.19
$ cram tests/functional/merge/cram/merge.t
!
--- tests/functional/merge/cram/merge.t
+++ tests/functional/merge/cram/merge.t.err
@@ -191,8 +191,8 @@
$ ${AUGUR} merge \
> --metadata dups=dups.tsv Y=y.tsv \
> --output-metadata /dev/null
- Reading 'dups' metadata from 'dups.tsv'…
Error: stepping, UNIQUE constraint failed: metadata_dups.strain (19)
+ Reading 'dups' metadata from 'dups.tsv'\xe2\x80\xa6 (esc)
WARNING: Skipped deletion of */augur-merge-*.sqlite due to error, but you may want to clean it up yourself (e.g. if it's large). (glob)
ERROR: sqlite3 invocation failed
[2]
# Ran 1 tests, 0 skipped, 1 failed. vs. $ python --version
Python 3.9.19
$ cram tests/functional/merge/cram/merge.t
.
# Ran 1 tests, 0 skipped, 0 failed. Have we dealt with this ordering issue before? Do we have a standard solution? |
This is probably stderr changing default buffering modes beginning in 3.9.
|
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #1563 +/- ##
==========================================
+ Coverage 70.17% 70.51% +0.34%
==========================================
Files 77 78 +1
Lines 8001 8107 +106
Branches 1950 1966 +16
==========================================
+ Hits 5615 5717 +102
- Misses 2098 2100 +2
- Partials 288 290 +2 ☔ View full report in Codecov by Sentry. |
Support generalized merging of two or more metadata tables. A long desired command. Behaviour is based on much discussion with the team and bespoke implementations like ncov's combine_metadata.py. Implementation requirements include handling inputs of arbitrary size (i.e. without needing to read any dataset fully into memory) and handling more than two inputs. SQLite is used in the implementation but could be replaced by another implementation in the future. One thing that's notable with this implementation is that it's stupidly slow for tiny datasets, e.g. a couple seconds. That's due to Augur's own slow startup time and having to wait for that 2+n times, where n is the number of metadata tables being joined, plus once for the initial startup of `augur merge` and once more for writing the output. On large datasets, this fixed startup time shouldn't matter, but on small datasets it feels really dumb. Cutting out the additional startup times by cutting out the use of `augur read-file` and `augur write-file` makes it quite quick, as it should be. I think we can live with this slowness for now, but if it turns out we can't, we can improve startup times or take a different approach to handling inputs.
The change in default from block-buffering in ≤3.8 to line-buffering in ≥3.9¹ made a Cram test output vary between Python versions (and thus fail). I could fix the Cram test various ways, but always line-buffering stderr makes sense because we're exclusively using it for messaging/logging. ¹ <https://docs.python.org/3/whatsnew/3.9.html#sys>
I looked into the few new lines that weren't covered by tests, expecting to dismiss them, and was puzzled that the "no sqlite3 found" error wasn't covered. I definitely had a test for it. augur/tests/functional/merge/cram/merge.t Lines 217 to 229 in 8db6617
And then I realized: that test (and the one above it) were calling |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the only substantial change since my previous review is b649f11 which looks good.
Re: Have we dealt with this ordering issue before?
Something similar prompted the changes to tests/functional/filter/cram/subsample-ambiguous-dates-error.t
in 851a141, but I think that was more due to mixing stdout/stderr? The scenario that you encountered must be new, otherwise we would have failing tests on either Python 3.8 or ≥3.9.
There's still some polish and minor features I want to add to I'm going to merge this. |
This reverts commit 915672e. Per discussion in <#1563 (comment)>, keep CSV-like TSV where quotes may be added or removed, but parsed values should be equivalent.
Support generalized merging of two or more metadata tables. A long desired command. Behaviour is based on much discussion with the team and bespoke implementations like ncov's combine_metadata.py. Implementation requirements include handling inputs of arbitrary size (i.e. without needing to read any dataset fully into memory) and handling more than two inputs.
Checklist