-
Notifications
You must be signed in to change notification settings - Fork 129
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
merge: Support sequences with cross-checking #1601
Conversation
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #1601 +/- ##
==========================================
+ Coverage 72.24% 73.00% +0.76%
==========================================
Files 79 79
Lines 8268 8475 +207
Branches 1691 1736 +45
==========================================
+ Hits 5973 6187 +214
+ Misses 2009 1989 -20
- Partials 286 299 +13 ☔ View full report in Codecov by Sentry. |
d143872
to
f86323a
Compare
--sequences
+ --output-sequences
--sequences
+ --output-sequences
with cross-checking
--sequences
+ --output-sequences
with cross-checkingac2dd27
to
275a87e
Compare
275a87e
to
146084c
Compare
128d62b
to
32199b8
Compare
@tsibley this is ready for initial review whenever you get the chance! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Kinda skimmed, so not marking approved at this point.
Would be nice if methods had return type annotations too; seemed like they generally didn't.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Gave this a thorough review-by-inspection.¹ The big takeaway for me is that I think we need to revisit the implementation to avoid reading gobs of data into memory. Happy to talk thru what this looks like in comments here or in a 1:1 or two.
¹ I didn't read thru the new test files nor actually run the new code but will do both in a second round of review after other changes are made.
def run(self, *args, **kwargs): | ||
error = False | ||
|
||
try: | ||
sqlite3(self.path, *args, **kwargs) | ||
|
||
except SQLiteError as err: | ||
error = True | ||
raise AugurError(str(err)) from err | ||
|
||
finally: | ||
if error: | ||
print_info(f"WARNING: Skipped deletion of {self.path} due to error, but you may want to clean it up yourself (e.g. if it's large).") | ||
|
||
def cleanup(self): | ||
os.unlink(self.path) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As it's been refactored out, this error handling doesn't make sense to me. The caller may catch the raised exception and still end up calling db.cleanup()
. Or nothing may ever call db.cleanup()
and the file is always left. Concerns are mixed up here instead of clearly separated. (Refactoring often changes the boundaries of concerns, so they need to be reconsidered.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. File cleanup should only happen after all usage of the database, which the current implementation only satisfies because merge_metadata
is the last usage.
Instead of exposing a cleanup
function, how about making class Database
implement context manager functions where __exit__
removes the db file?
I think the error handling in run
could then stay the same. The caller could still catch the raised exception and run os.unlink(db.path)
, but I don't think there's any way around that except maybe marking db.path
as "private" to discourage such commands.
augur/merge.py
Outdated
def parse_named_inputs(inputs: Sequence[str]): | ||
if unnamed := [x for x in inputs if "=" not in x or x.startswith("=")]: | ||
raise UnnamedInputError(unnamed) | ||
|
||
named_inputs = pairs(inputs) | ||
|
||
if duplicate_names := [name for name, count | ||
in count_unique(name for name, _ in named_inputs) | ||
if count > 1]: | ||
raise DuplicateInputNameError(duplicate_names) | ||
|
||
return named_inputs |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In keeping with the organization of this file, which is roughly "pyramidal" (e.g. callers above callees), this function would go above def pairs(…)
since it calls both pairs()
and count_unique()
(which is below pairs()
).
if x.startswith("="): | ||
invalid_named_inputs.append(x) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not invalid, at least by my intention. It an explicitly unnamed input, which can be necessary if the filename itself contains an equals sign. I don't think we need to introduce the idea of "invalidly named" separate from "unnamed". What purpose does it serve?
|
||
The following sequence {_n("input does", "inputs do", len(sequences_missing_metadata))} not have a corresponding metadata input: | ||
|
||
{indented_list([repr(x.path) for x in named_sequences if x.name in sequences_missing_metadata], ' ' + ' ')} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd include the bad inputs as name=path
in the error message (i.e. not just as path
), which gives the person staring at the error message two ways to figure out what they're supposed to change.
{indented_list([repr(x.path) for x in named_sequences if x.name in sequences_missing_metadata], ' ' + ' ')} | ||
""")) | ||
|
||
if metadata_missing_sequences := sorted(set(metadata_names) - set(sequences_names)): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ditto re: sorted()
.
raise AugurError(dedent(f"""\ | ||
Named inputs must be paired with the same ordering. | ||
|
||
Order of inputs differs between named metadata {metadata_names!r} and named sequences {sequences_names!r}. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Order of inputs differs between named metadata {metadata_names!r} and named sequences {sequences_names!r}. | |
Order of inputs differs between named metadata | |
{metadata_names!r} | |
and named sequences | |
{sequences_names!r} |
makes it easier to spot the difference.
if sequences_missing_metadata := sorted(set(sequences_names) - set(metadata_names)): | ||
raise AugurError(dedent(f"""\ | ||
Named inputs must be paired. | ||
|
||
The following sequence {_n("input does", "inputs do", len(sequences_missing_metadata))} not have a corresponding metadata input: | ||
|
||
{indented_list([repr(x.path) for x in named_sequences if x.name in sequences_missing_metadata], ' ' + ' ')} | ||
""")) | ||
|
||
if metadata_missing_sequences := sorted(set(metadata_names) - set(sequences_names)): | ||
raise AugurError(dedent(f"""\ | ||
Named inputs must be paired. | ||
|
||
The following metadata {_n("input does", "inputs do", len(metadata_missing_sequences))} not have a corresponding sequence input: | ||
|
||
{indented_list([repr(x.path) for x in metadata if x.name in metadata_missing_sequences], ' ' + ' ')} | ||
""")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it'd be friendlier if mismatched names were reported all at once/together instead of potentially hitting an error with one set, fixing it, and then hitting the same error but in the other set.
augur/merge.py
Outdated
metadata_ids = {x[m.id_column] for x in | ||
connection.execute(f"""select {sqlite_quote_id(m.id_column)} | ||
from {sqlite_quote_id(m.table_name)} | ||
""")} | ||
|
||
sequence_ids = {x[SEQUENCE_ID_COLUMN] for x in | ||
connection.execute(f"""select {sqlite_quote_id(SEQUENCE_ID_COLUMN)} | ||
from {sqlite_quote_id(s.table_name)} | ||
""")} | ||
|
||
for x in sorted(metadata_ids - sequence_ids): | ||
print_info(f"WARNING: Sequence {x!r} in {m.path!r} is missing from {s.path!r}. Outputs may continue to be mismatched.") | ||
for x in sorted(sequence_ids - metadata_ids): | ||
print_info(f"WARNING: Sequence {x!r} in {s.path!r} is missing from {m.path!r}. Outputs may continue to be mismatched.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think that reading all of these ids into memory (twice) is gonna fly… these set operations can be done directly in the database, returning just the ones needed for the warning (if any).
Preparing for use across functions.
Preparing for use across functions.
Preparing for use across functions.
Preparing for NamedSequences
Preparing for sequence support.
Sequence support will require the ability to load metadata into the database without actually merging (if --output-metadata is not specified).
Preparing for sequence support, which allows unnamed inputs.
Add sequence support in addition to the existing metadata support. SeqKit is used to deduplicate across sequence files. Duplicates within an individual sequence file are not supported. Those are checked by reading IDs using read_sequences.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tsibley thanks for the thorough review. I've glanced at most comments and they seem to make sense. I'll plan to address more comments tomorrow.
def run(self, *args, **kwargs): | ||
error = False | ||
|
||
try: | ||
sqlite3(self.path, *args, **kwargs) | ||
|
||
except SQLiteError as err: | ||
error = True | ||
raise AugurError(str(err)) from err | ||
|
||
finally: | ||
if error: | ||
print_info(f"WARNING: Skipped deletion of {self.path} due to error, but you may want to clean it up yourself (e.g. if it's large).") | ||
|
||
def cleanup(self): | ||
os.unlink(self.path) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. File cleanup should only happen after all usage of the database, which the current implementation only satisfies because merge_metadata
is the last usage.
Instead of exposing a cleanup
function, how about making class Database
implement context manager functions where __exit__
removes the db file?
I think the error handling in run
could then stay the same. The caller could still catch the raised exception and run os.unlink(db.path)
, but I don't think there's any way around that except maybe marking db.path
as "private" to discourage such commands.
augur/merge.py
Outdated
def __repr__(self): | ||
return f"<NamedSequenceFile {self.name}={self.path}>" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed:
Lines 157 to 158 in 2be60e9
def __repr__(self): | |
return f"<UnnamedSequenceFile {self.path}>" |
augur/merge.py
Outdated
class UnnamedSequenceFile(UnnamedFile): | ||
def __init__(self, path: str): | ||
self.path = path | ||
self.table_name = f"sequences_{re.sub(r'[^a-zA-Z0-9]', '_', os.path.basename(self.path))}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had thought that basenames would be the easiest to recognize for debugging and didn't think about collisions or generic names. It seems common to have something like data-source-1/sequences.fasta
and data-source-2/sequences.fasta
, so makes sense to add the path.
I forgot that sqlite_quote_id
will take care of bad characters.
Updated:
Line 155 in 2be60e9
self.table_name = f"sequences_{self.path}" |
3d55c60
to
2be60e9
Compare
SeqKit is used behind the scenes to implement the merge, but, at least for now, | ||
this should be considered an implementation detail that may change in the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suggestion: don't hedge, just state.
SeqKit is used behind the scenes to implement the merge, but, at least for now, | |
this should be considered an implementation detail that may change in the | |
SeqKit is used behind the scenes to implement the merge, but | |
this should be considered an implementation detail that may change in the |
future. The CLI program seqkit must be available. If it's not on PATH (or | ||
you want to use a version different from what's on PATH), set the SEQKIT | ||
environment variable to path of the desired seqkit executable. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nitnitnit
future. The CLI program seqkit must be available. If it's not on PATH (or | |
you want to use a version different from what's on PATH), set the SEQKIT | |
environment variable to path of the desired seqkit executable. | |
future. The CLI program seqkit must be available. If it's not in $PATH (or | |
you want to use a version different from what's in $PATH), set the SEQKIT | |
environment variable to the full path of the desired seqkit executable. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Victor's just following language I introduced; see the description of SQLITE
in the same --help
output.
The phrasing "on PATH" is fairly conventional. First, the environment variable name is just PATH
not $PATH
and second, the program isn't in PATH
itself, it's in one of the directories in PATH
. This is conventionally described as being "on PATH".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok. (n.b., i have literally never heard "on PATH" 🤷)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Public code on GitHub is not definitive, but "on PATH" is widely used to describe a binary's location. The "in PATH" phrasing is also widely used, but much of that usage is referring to directories appearing in PATH not just binaries found in those directories.
We can totally use "in PATH" if you want, we should just be consistent.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm fine with "on PATH" if that's the house style, no worries.
@genehack @jameshadfield and @tsibley thanks for taking the time to review this PR. Closing per #1579 (comment), but doing the work here and reading through comments gave me a better understanding of how to work with SQLite in this codebase, which should be useful for future work even if this is not merged. I've pushed more work in progress as efc4881 in case we ever want to revisit this approach. |
From preview review: <#1601 (comment)> Co-authored-by: John SJ Anderson <[email protected]>
Description of proposed changes
Initial prototype for sequence support in
augur merge
where metadata and sequence merge happens with additional cross-checking.Related issue(s)
Closes #1579
Checklist