Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

merge: Support sequences with cross-checking #1601

Closed
wants to merge 9 commits into from

Conversation

victorlin
Copy link
Member

@victorlin victorlin commented Aug 27, 2024

Description of proposed changes

Initial prototype for sequence support in augur merge where metadata and sequence merge happens with additional cross-checking.

Related issue(s)

Closes #1579

Checklist

@victorlin victorlin self-assigned this Aug 27, 2024
Copy link

codecov bot commented Aug 27, 2024

Codecov Report

Attention: Patch coverage is 87.96680% with 29 lines in your changes missing coverage. Please review.

Project coverage is 73.00%. Comparing base (862aa37) to head (3d55c60).
Report is 91 commits behind head on master.

Files with missing lines Patch % Lines
augur/merge.py 87.96% 17 Missing and 12 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1601      +/-   ##
==========================================
+ Coverage   72.24%   73.00%   +0.76%     
==========================================
  Files          79       79              
  Lines        8268     8475     +207     
  Branches     1691     1736      +45     
==========================================
+ Hits         5973     6187     +214     
+ Misses       2009     1989      -20     
- Partials      286      299      +13     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@victorlin victorlin force-pushed the victorlin/merge-sequences branch from d143872 to f86323a Compare August 30, 2024 22:55
@victorlin victorlin changed the title merge: Add --sequences + --output-sequences merge: Add --sequences + --output-sequences with cross-checking Sep 16, 2024
@victorlin victorlin changed the title merge: Add --sequences + --output-sequences with cross-checking merge: Support sequences with cross-checking Sep 16, 2024
@victorlin victorlin force-pushed the victorlin/merge-sequences branch from ac2dd27 to 275a87e Compare October 2, 2024 23:51
@victorlin victorlin force-pushed the victorlin/merge-sequences branch from 275a87e to 146084c Compare October 15, 2024 22:06
@victorlin victorlin force-pushed the victorlin/merge-sequences branch 3 times, most recently from 128d62b to 32199b8 Compare October 23, 2024 20:59
@victorlin victorlin marked this pull request as ready for review October 23, 2024 21:50
@victorlin
Copy link
Member Author

@tsibley this is ready for initial review whenever you get the chance!

@victorlin victorlin requested a review from tsibley October 23, 2024 21:50
Copy link
Contributor

@genehack genehack left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Kinda skimmed, so not marking approved at this point.

Would be nice if methods had return type annotations too; seemed like they generally didn't.

augur/merge.py Outdated Show resolved Hide resolved
augur/merge.py Show resolved Hide resolved
augur/merge.py Outdated Show resolved Hide resolved
Copy link
Member

@tsibley tsibley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gave this a thorough review-by-inspection.¹ The big takeaway for me is that I think we need to revisit the implementation to avoid reading gobs of data into memory. Happy to talk thru what this looks like in comments here or in a 1:1 or two.

¹ I didn't read thru the new test files nor actually run the new code but will do both in a second round of review after other changes are made.

augur/merge.py Outdated Show resolved Hide resolved
Comment on lines +76 to +122
def run(self, *args, **kwargs):
error = False

try:
sqlite3(self.path, *args, **kwargs)

except SQLiteError as err:
error = True
raise AugurError(str(err)) from err

finally:
if error:
print_info(f"WARNING: Skipped deletion of {self.path} due to error, but you may want to clean it up yourself (e.g. if it's large).")

def cleanup(self):
os.unlink(self.path)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As it's been refactored out, this error handling doesn't make sense to me. The caller may catch the raised exception and still end up calling db.cleanup(). Or nothing may ever call db.cleanup() and the file is always left. Concerns are mixed up here instead of clearly separated. (Refactoring often changes the boundaries of concerns, so they need to be reconsidered.)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. File cleanup should only happen after all usage of the database, which the current implementation only satisfies because merge_metadata is the last usage.

Instead of exposing a cleanup function, how about making class Database implement context manager functions where __exit__ removes the db file?

I think the error handling in run could then stay the same. The caller could still catch the raised exception and run os.unlink(db.path), but I don't think there's any way around that except maybe marking db.path as "private" to discourage such commands.

augur/merge.py Outdated Show resolved Hide resolved
augur/merge.py Outdated
Comment on lines 561 to 572
def parse_named_inputs(inputs: Sequence[str]):
if unnamed := [x for x in inputs if "=" not in x or x.startswith("=")]:
raise UnnamedInputError(unnamed)

named_inputs = pairs(inputs)

if duplicate_names := [name for name, count
in count_unique(name for name, _ in named_inputs)
if count > 1]:
raise DuplicateInputNameError(duplicate_names)

return named_inputs
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In keeping with the organization of this file, which is roughly "pyramidal" (e.g. callers above callees), this function would go above def pairs(…) since it calls both pairs() and count_unique() (which is below pairs()).

Comment on lines +575 to +833
if x.startswith("="):
invalid_named_inputs.append(x)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not invalid, at least by my intention. It an explicitly unnamed input, which can be necessary if the filename itself contains an equals sign. I don't think we need to introduce the idea of "invalidly named" separate from "unnamed". What purpose does it serve?


The following sequence {_n("input does", "inputs do", len(sequences_missing_metadata))} not have a corresponding metadata input:

{indented_list([repr(x.path) for x in named_sequences if x.name in sequences_missing_metadata], ' ' + ' ')}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd include the bad inputs as name=path in the error message (i.e. not just as path), which gives the person staring at the error message two ways to figure out what they're supposed to change.

{indented_list([repr(x.path) for x in named_sequences if x.name in sequences_missing_metadata], ' ' + ' ')}
"""))

if metadata_missing_sequences := sorted(set(metadata_names) - set(sequences_names)):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto re: sorted().

raise AugurError(dedent(f"""\
Named inputs must be paired with the same ordering.

Order of inputs differs between named metadata {metadata_names!r} and named sequences {sequences_names!r}.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Order of inputs differs between named metadata {metadata_names!r} and named sequences {sequences_names!r}.
Order of inputs differs between named metadata
{metadata_names!r}
and named sequences
{sequences_names!r}

makes it easier to spot the difference.

Comment on lines +245 to +256
if sequences_missing_metadata := sorted(set(sequences_names) - set(metadata_names)):
raise AugurError(dedent(f"""\
Named inputs must be paired.

The following sequence {_n("input does", "inputs do", len(sequences_missing_metadata))} not have a corresponding metadata input:

{indented_list([repr(x.path) for x in named_sequences if x.name in sequences_missing_metadata], ' ' + ' ')}
"""))

if metadata_missing_sequences := sorted(set(metadata_names) - set(sequences_names)):
raise AugurError(dedent(f"""\
Named inputs must be paired.

The following metadata {_n("input does", "inputs do", len(metadata_missing_sequences))} not have a corresponding sequence input:

{indented_list([repr(x.path) for x in metadata if x.name in metadata_missing_sequences], ' ' + ' ')}
"""))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it'd be friendlier if mismatched names were reported all at once/together instead of potentially hitting an error with one set, fixing it, and then hitting the same error but in the other set.

augur/merge.py Outdated
Comment on lines 294 to 302
metadata_ids = {x[m.id_column] for x in
connection.execute(f"""select {sqlite_quote_id(m.id_column)}
from {sqlite_quote_id(m.table_name)}
""")}

sequence_ids = {x[SEQUENCE_ID_COLUMN] for x in
connection.execute(f"""select {sqlite_quote_id(SEQUENCE_ID_COLUMN)}
from {sqlite_quote_id(s.table_name)}
""")}

for x in sorted(metadata_ids - sequence_ids):
print_info(f"WARNING: Sequence {x!r} in {m.path!r} is missing from {s.path!r}. Outputs may continue to be mismatched.")
for x in sorted(sequence_ids - metadata_ids):
print_info(f"WARNING: Sequence {x!r} in {s.path!r} is missing from {m.path!r}. Outputs may continue to be mismatched.")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think that reading all of these ids into memory (twice) is gonna fly… these set operations can be done directly in the database, returning just the ones needed for the warning (if any).

Preparing for use across functions.
Preparing for use across functions.
Preparing for use across functions.
Preparing for NamedSequences
Preparing for sequence support.
Sequence support will require the ability to load metadata into the
database without actually merging (if --output-metadata is not
specified).
Preparing for sequence support, which allows unnamed inputs.
Add sequence support in addition to the existing metadata support.
SeqKit is used to deduplicate across sequence files. Duplicates within
an individual sequence file are not supported. Those are checked by
reading IDs using read_sequences.
Copy link
Member Author

@victorlin victorlin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tsibley thanks for the thorough review. I've glanced at most comments and they seem to make sense. I'll plan to address more comments tomorrow.

Comment on lines +76 to +122
def run(self, *args, **kwargs):
error = False

try:
sqlite3(self.path, *args, **kwargs)

except SQLiteError as err:
error = True
raise AugurError(str(err)) from err

finally:
if error:
print_info(f"WARNING: Skipped deletion of {self.path} due to error, but you may want to clean it up yourself (e.g. if it's large).")

def cleanup(self):
os.unlink(self.path)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. File cleanup should only happen after all usage of the database, which the current implementation only satisfies because merge_metadata is the last usage.

Instead of exposing a cleanup function, how about making class Database implement context manager functions where __exit__ removes the db file?

I think the error handling in run could then stay the same. The caller could still catch the raised exception and run os.unlink(db.path), but I don't think there's any way around that except maybe marking db.path as "private" to discourage such commands.

augur/merge.py Outdated
Comment on lines 161 to 162
def __repr__(self):
return f"<NamedSequenceFile {self.name}={self.path}>"
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed:

augur/augur/merge.py

Lines 157 to 158 in 2be60e9

def __repr__(self):
return f"<UnnamedSequenceFile {self.path}>"

augur/merge.py Outdated Show resolved Hide resolved
augur/merge.py Outdated Show resolved Hide resolved
augur/merge.py Outdated
class UnnamedSequenceFile(UnnamedFile):
def __init__(self, path: str):
self.path = path
self.table_name = f"sequences_{re.sub(r'[^a-zA-Z0-9]', '_', os.path.basename(self.path))}"
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had thought that basenames would be the easiest to recognize for debugging and didn't think about collisions or generic names. It seems common to have something like data-source-1/sequences.fasta and data-source-2/sequences.fasta, so makes sense to add the path.

I forgot that sqlite_quote_id will take care of bad characters.

Updated:

self.table_name = f"sequences_{self.path}"

@victorlin victorlin force-pushed the victorlin/merge-sequences branch from 3d55c60 to 2be60e9 Compare January 14, 2025 01:33
Comment on lines +48 to +49
SeqKit is used behind the scenes to implement the merge, but, at least for now,
this should be considered an implementation detail that may change in the
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion: don't hedge, just state.

Suggested change
SeqKit is used behind the scenes to implement the merge, but, at least for now,
this should be considered an implementation detail that may change in the
SeqKit is used behind the scenes to implement the merge, but
this should be considered an implementation detail that may change in the

Comment on lines +50 to +52
future. The CLI program seqkit must be available. If it's not on PATH (or
you want to use a version different from what's on PATH), set the SEQKIT
environment variable to path of the desired seqkit executable.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nitnitnit

Suggested change
future. The CLI program seqkit must be available. If it's not on PATH (or
you want to use a version different from what's on PATH), set the SEQKIT
environment variable to path of the desired seqkit executable.
future. The CLI program seqkit must be available. If it's not in $PATH (or
you want to use a version different from what's in $PATH), set the SEQKIT
environment variable to the full path of the desired seqkit executable.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Victor's just following language I introduced; see the description of SQLITE in the same --help output.

The phrasing "on PATH" is fairly conventional. First, the environment variable name is just PATH not $PATH and second, the program isn't in PATH itself, it's in one of the directories in PATH. This is conventionally described as being "on PATH".

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok. (n.b., i have literally never heard "on PATH" 🤷)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Public code on GitHub is not definitive, but "on PATH" is widely used to describe a binary's location. The "in PATH" phrasing is also widely used, but much of that usage is referring to directories appearing in PATH not just binaries found in those directories.

We can totally use "in PATH" if you want, we should just be consistent.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm fine with "on PATH" if that's the house style, no worries.

@victorlin
Copy link
Member Author

victorlin commented Jan 23, 2025

@genehack @jameshadfield and @tsibley thanks for taking the time to review this PR. Closing per #1579 (comment), but doing the work here and reading through comments gave me a better understanding of how to work with SQLite in this codebase, which should be useful for future work even if this is not merged.

I've pushed more work in progress as efc4881 in case we ever want to revisit this approach.

@victorlin victorlin closed this Jan 23, 2025
@victorlin victorlin deleted the victorlin/merge-sequences branch January 23, 2025 22:38
victorlin added a commit that referenced this pull request Jan 31, 2025
From preview review: <#1601 (comment)>

Co-authored-by: John SJ Anderson <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

merge: Support sequences
4 participants