merge: Support sequences with cross-checking #1601

victorlin · 2024-08-27T17:48:11Z

Description of proposed changes

Initial prototype for sequence support in augur merge, where the metadata and sequence merges happen with additional cross-checking.

Related issue(s)

Closes #1579

Checklist

- Automated checks pass
- Add a changelog message
- Add tests
- Update docs
- Test on a pathogen repo (zika, avian-flu or mpox?)
  - Use augur merge for sequences zika#76

codecov · 2024-08-27T18:12:23Z

Codecov Report

Attention: Patch coverage is 87.96680% with 29 lines in your changes missing coverage. Please review.

Project coverage is 73.00%. Comparing base (862aa37) to head (3d55c60).
Report is 91 commits behind head on master.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| augur/merge.py | 87.96% | 17 Missing and 12 partials ⚠️ |

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #1601      +/-   ##
==========================================
+ Coverage   72.24%   73.00%   +0.76%     
==========================================
  Files          79       79              
  Lines        8268     8475     +207     
  Branches     1691     1736      +45     
==========================================
+ Hits         5973     6187     +214     
+ Misses       2009     1989      -20     
- Partials      286      299      +13

victorlin · 2024-10-23T21:50:47Z

@tsibley this is ready for initial review whenever you get the chance!

genehack

Kinda skimmed, so not marking approved at this point.

Would be nice if methods had return type annotations too; seemed like they generally didn't.

augur/merge.py

tests/functional/merge/cram/merge-sequences.t

tsibley

Gave this a thorough review-by-inspection.¹ The big takeaway for me is that I think we need to revisit the implementation to avoid reading gobs of data into memory. Happy to talk thru what this looks like in comments here or in a 1:1 or two.

¹ I didn't read thru the new test files nor actually run the new code but will do both in a second round of review after other changes are made.

augur/merge.py

tsibley · 2025-01-07T20:54:53Z

augur/merge.py

+    def run(self, *args, **kwargs):
+        error = False
+
+        try:
+            sqlite3(self.path, *args, **kwargs)
+
+        except SQLiteError as err:
+            error = True
+            raise AugurError(str(err)) from err
+
+        finally:
+            if error:
+                print_info(f"WARNING: Skipped deletion of {self.path} due to error, but you may want to clean it up yourself (e.g. if it's large).")
+
+    def cleanup(self):
+        os.unlink(self.path)


As it's been refactored out, this error handling doesn't make sense to me. The caller may catch the raised exception and still end up calling db.cleanup(). Or nothing may ever call db.cleanup() and the file is always left. Concerns are mixed up here instead of clearly separated. (Refactoring often changes the boundaries of concerns, so they need to be reconsidered.)

augur/merge.py

tsibley · 2025-01-08T07:21:16Z

augur/merge.py

+def parse_named_inputs(inputs: Sequence[str]):
+    if unnamed := [x for x in inputs if "=" not in x or x.startswith("=")]:
+        raise UnnamedInputError(unnamed)
+
+    named_inputs = pairs(inputs)
+
+    if duplicate_names := [name for name, count
+                                 in count_unique(name for name, _ in named_inputs)
+                                 if count > 1]:
+        raise DuplicateInputNameError(duplicate_names)
+
+    return named_inputs


In keeping with the organization of this file, which is roughly "pyramidal" (e.g. callers above callees), this function would go above def pairs(…) since it calls both pairs() and count_unique() (which is below pairs()).

tsibley · 2025-01-08T07:26:33Z

augur/merge.py

+        if x.startswith("="):
+            invalid_named_inputs.append(x)


This is not invalid, at least by my intention. It's an explicitly unnamed input, which can be necessary if the filename itself contains an equals sign. I don't think we need to introduce the idea of "invalidly named" separate from "unnamed". What purpose does it serve?
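
To illustrate the distinction, here is a minimal sketch of how a leading equals sign could mark an input as explicitly unnamed. The function name `split_input` and its return shape are hypothetical and are not taken from augur/merge.py.

```python
from typing import Optional, Tuple


def split_input(arg: str) -> Tuple[Optional[str], str]:
    """Split a command-line input into (name, path).

    Hypothetical helper: name is None when the input is unnamed.
    """
    name, separator, path = arg.partition("=")

    if not separator:
        # No equals sign at all: implicitly unnamed.
        return None, arg

    if name == "":
        # Leading "=": explicitly unnamed. Everything after it is the path,
        # even if the path itself contains more equals signs.
        return None, path

    return name, path
```

Under these assumptions, `=run=1.fasta` is read as the unnamed path `run=1.fasta`, whereas `run=1.fasta` would be read as an input named `run`.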

tsibley · 2025-01-08T08:39:50Z

augur/merge.py

+
+                The following sequence {_n("input does", "inputs do", len(sequences_missing_metadata))} not have a corresponding metadata input:
+
+                  {indented_list([repr(x.path) for x in named_sequences if x.name in sequences_missing_metadata], '                ' + '  ')}


I'd include the bad inputs as name=path in the error message (i.e. not just as path), which gives the person staring at the error message two ways to figure out what they're supposed to change.

tsibley · 2025-01-08T08:40:24Z

augur/merge.py

+                  {indented_list([repr(x.path) for x in named_sequences if x.name in sequences_missing_metadata], '                ' + '  ')}
+                """))
+
+        if metadata_missing_sequences := sorted(set(metadata_names) - set(sequences_names)):


Ditto re: sorted().

tsibley · 2025-01-08T08:41:59Z

augur/merge.py

+            raise AugurError(dedent(f"""\
+                Named inputs must be paired with the same ordering.
+
+                Order of inputs differs between named metadata {metadata_names!r} and named sequences {sequences_names!r}.


Suggested change

-                Order of inputs differs between named metadata {metadata_names!r} and named sequences {sequences_names!r}.
+                Order of inputs differs between named metadata
+
+                {metadata_names!r}
+
+                and named sequences
+
+                {sequences_names!r}

makes it easier to spot the difference.

tsibley · 2025-01-08T08:42:59Z

augur/merge.py

+        if sequences_missing_metadata := sorted(set(sequences_names) - set(metadata_names)):
+            raise AugurError(dedent(f"""\
+                Named inputs must be paired.
+
+                The following sequence {_n("input does", "inputs do", len(sequences_missing_metadata))} not have a corresponding metadata input:
+
+                  {indented_list([repr(x.path) for x in named_sequences if x.name in sequences_missing_metadata], '                ' + '  ')}
+                """))
+
+        if metadata_missing_sequences := sorted(set(metadata_names) - set(sequences_names)):
+            raise AugurError(dedent(f"""\
+                Named inputs must be paired.
+
+                The following metadata {_n("input does", "inputs do", len(metadata_missing_sequences))} not have a corresponding sequence input:
+
+                  {indented_list([repr(x.path) for x in metadata if x.name in metadata_missing_sequences], '                ' + '  ')}
+                """))


I think it'd be friendlier if mismatched names were reported all at once/together instead of potentially hitting an error with one set, fixing it, and then hitting the same error but in the other set.
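
A rough sketch of what a combined report could look like, also listing each bad input as name=path. The function `mismatched_inputs_message` and the plain name-to-path dictionaries are hypothetical simplifications, not the PR's actual classes or error text.

```python
from typing import Dict


def mismatched_inputs_message(metadata: Dict[str, str], sequences: Dict[str, str]) -> str:
    """Build one message covering both directions of mismatch.

    Both arguments map input name -> path (a hypothetical simplification).
    Returns an empty string when every name is paired.
    """
    sequences_only = sorted(set(sequences) - set(metadata))
    metadata_only = sorted(set(metadata) - set(sequences))

    if not sequences_only and not metadata_only:
        return ""

    lines = ["Named inputs must be paired.", ""]

    if sequences_only:
        lines.append("Sequence inputs without a corresponding metadata input:")
        lines += [f"  {name}={sequences[name]}" for name in sequences_only]

    if metadata_only:
        lines.append("Metadata inputs without a corresponding sequence input:")
        lines += [f"  {name}={metadata[name]}" for name in metadata_only]

    return "\n".join(lines)
```

The caller would raise a single error with this message if it is non-empty, so both sets of unpaired inputs are visible in one run.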

tsibley · 2025-01-08T08:48:14Z

augur/merge.py

+                metadata_ids = {x[m.id_column] for x in
+                                connection.execute(f"""select {sqlite_quote_id(m.id_column)}
+                                                         from {sqlite_quote_id(m.table_name)}
+                                                    """)}
+
+                sequence_ids = {x[SEQUENCE_ID_COLUMN] for x in
+                                connection.execute(f"""select {sqlite_quote_id(SEQUENCE_ID_COLUMN)}
+                                                         from {sqlite_quote_id(s.table_name)}
+                                                    """)}
+
+            for x in sorted(metadata_ids - sequence_ids):
+                print_info(f"WARNING: Sequence {x!r} in {m.path!r} is missing from {s.path!r}. Outputs may continue to be mismatched.")
+            for x in sorted(sequence_ids - metadata_ids):
+                print_info(f"WARNING: Sequence {x!r} in {s.path!r} is missing from {m.path!r}. Outputs may continue to be mismatched.")


I don't think that reading all of these ids into memory (twice) is gonna fly… these set operations can be done directly in the database, returning just the ones needed for the warning (if any).
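
For illustration, a minimal sketch of doing the set difference inside SQLite with `except`, so that only the mismatched IDs ever reach Python. It uses Python's standard `sqlite3` module rather than the wrapper used in augur/merge.py, and `quote_id` and `ids_missing_from` are hypothetical helpers (the former standing in for `sqlite_quote_id`).

```python
import sqlite3
from typing import Iterator


def quote_id(name: str) -> str:
    """Quote an SQLite identifier (hypothetical stand-in for sqlite_quote_id)."""
    return '"' + name.replace('"', '""') + '"'


def ids_missing_from(connection: sqlite3.Connection,
                     table_a: str, column_a: str,
                     table_b: str, column_b: str) -> Iterator[str]:
    """Yield IDs present in table_a but absent from table_b.

    The set difference is computed by SQLite; rows are streamed one at a time.
    """
    query = f"""
        select {quote_id(column_a)} from {quote_id(table_a)}
        except
        select {quote_id(column_b)} from {quote_id(table_b)}
        order by 1
    """
    for (value,) in connection.execute(query):
        yield value
```

Calling it once in each direction gives the two lists needed for the warnings, already sorted by the database.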

victorlin

@tsibley thanks for the thorough review. I've glanced at most comments and they seem to make sense. I'll plan to address more comments tomorrow.

victorlin · 2025-01-14T00:46:16Z

augur/merge.py

+    def run(self, *args, **kwargs):
+        error = False
+
+        try:
+            sqlite3(self.path, *args, **kwargs)
+
+        except SQLiteError as err:
+            error = True
+            raise AugurError(str(err)) from err
+
+        finally:
+            if error:
+                print_info(f"WARNING: Skipped deletion of {self.path} due to error, but you may want to clean it up yourself (e.g. if it's large).")
+
+    def cleanup(self):
+        os.unlink(self.path)


Good point. File cleanup should only happen after all usage of the database, which the current implementation only satisfies because merge_metadata is the last usage.

Instead of exposing a cleanup function, how about making class Database implement context manager functions where __exit__ removes the db file?

I think the error handling in run could then stay the same. The caller could still catch the raised exception and run os.unlink(db.path), but I don't think there's any way around that except maybe marking db.path as "private" to discourage such commands.
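
A minimal sketch of the context manager idea, as one possible arrangement rather than what the PR ended up with. `TemporaryDatabase` is a hypothetical stand-in for the PR's Database class, it uses Python's standard `sqlite3` module instead of the `sqlite3(...)` wrapper shown above, and it moves the "skipped deletion" warning into `__exit__`, which is an assumption of this sketch.

```python
import os
import sqlite3


class TemporaryDatabase:
    """Hypothetical database wrapper that removes its file on clean exit."""

    def __init__(self, path: str):
        self.path = path

    def __enter__(self) -> "TemporaryDatabase":
        return self

    def __exit__(self, exc_type, exc_value, traceback) -> bool:
        if exc_type is None:
            # Normal exit: all usage of the database is finished, so remove it.
            if os.path.exists(self.path):
                os.unlink(self.path)
        else:
            # Error: keep the file for inspection and tell the user about it.
            print(f"WARNING: Skipped deletion of {self.path} due to error, "
                  "but you may want to clean it up yourself.")
        return False  # never swallow the exception

    def run(self, sql: str) -> list:
        """Run one statement and return all result rows."""
        connection = sqlite3.connect(self.path)
        try:
            with connection:  # commits on success, rolls back on error
                return connection.execute(sql).fetchall()
        finally:
            connection.close()
```

With this shape, cleanup is tied to the end of the `with` block instead of to whichever function happens to use the database last.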

victorlin · 2025-01-14T00:50:30Z

augur/merge.py

+    def __repr__(self):
+        return f"<NamedSequenceFile {self.name}={self.path}>"


Fixed:

augur/augur/merge.py, lines 157 to 158 in 2be60e9:

    def __repr__(self):
        return f"<UnnamedSequenceFile {self.path}>"

augur/merge.py

victorlin · 2025-01-14T01:30:42Z

augur/merge.py

+class UnnamedSequenceFile(UnnamedFile):
+    def __init__(self, path: str):
+        self.path = path
+        self.table_name = f"sequences_{re.sub(r'[^a-zA-Z0-9]', '_', os.path.basename(self.path))}"


I had thought that basenames would be the easiest to recognize for debugging and didn't think about collisions or generic names. It seems common to have something like data-source-1/sequences.fasta and data-source-2/sequences.fasta, so makes sense to add the path.

I forgot that sqlite_quote_id will take care of bad characters.

Updated:

augur/augur/merge.py, line 155 in 2be60e9:

    self.table_name = f"sequences_{self.path}"
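
As a small illustration of why the full path works as a table name once it is quoted, here is a sketch using Python's standard `sqlite3` module. `quote_identifier` is a hypothetical stand-in for `sqlite_quote_id`, and `table_name_for` simply mirrors the line above.

```python
import sqlite3


def quote_identifier(name: str) -> str:
    """Quote an SQLite identifier by doubling any embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def table_name_for(path: str) -> str:
    """Derive a table name from the full path, not just the basename."""
    return f"sequences_{path}"


connection = sqlite3.connect(":memory:")

# Two inputs that share a basename still get distinct tables.
for path in ["data-source-1/sequences.fasta", "data-source-2/sequences.fasta"]:
    connection.execute(f"create table {quote_identifier(table_name_for(path))} (id text)")

tables = sorted(name for (name,) in
                connection.execute("select name from sqlite_master where type = 'table'"))
```

Here `tables` ends up holding two distinct names, slashes, dots, and hyphens included.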

genehack · 2025-01-14T17:32:38Z

augur/merge.py

+SeqKit is used behind the scenes to implement the merge, but, at least for now,
+this should be considered an implementation detail that may change in the


Suggestion: don't hedge, just state.

Suggested change

-SeqKit is used behind the scenes to implement the merge, but, at least for now,
-this should be considered an implementation detail that may change in the
+SeqKit is used behind the scenes to implement the merge, but
+this should be considered an implementation detail that may change in the

genehack · 2025-01-14T17:33:29Z

augur/merge.py

+future.  The CLI program seqkit must be available.  If it's not on PATH (or
+you want to use a version different from what's on PATH), set the SEQKIT
+environment variable to path of the desired seqkit executable.


nitnitnit

Suggested change

-future.  The CLI program seqkit must be available.  If it's not on PATH (or
-you want to use a version different from what's on PATH), set the SEQKIT
-environment variable to path of the desired seqkit executable.
+future. The CLI program seqkit must be available. If it's not in $PATH (or
+you want to use a version different from what's in $PATH), set the SEQKIT
+environment variable to the full path of the desired seqkit executable.

tsibley · 2025-01-14

Victor's just following language I introduced; see the description of SQLITE in the same --help output.

The phrasing "on PATH" is fairly conventional. First, the environment variable name is just PATH, not $PATH, and second, the program isn't in PATH itself, it's in one of the directories in PATH. This is conventionally described as being "on PATH".

genehack · 2025-01-14

ok. (n.b., i have literally never heard "on PATH" 🤷)

tsibley · 2025-01-15

Public code on GitHub is not definitive, but "on PATH" is widely used to describe a binary's location. The "in PATH" phrasing is also widely used, but much of that usage refers to directories appearing in PATH, not just binaries found in those directories.

We can totally use "in PATH" if you want, we should just be consistent.

genehack · 2025-01-15

I'm fine with "on PATH" if that's the house style, no worries.
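
For readers unfamiliar with the lookup described in the help text, a minimal sketch of the behavior: prefer the SEQKIT environment variable, otherwise search the directories on PATH. The function `find_seqkit` is hypothetical, and the actual lookup code in augur is not shown in this thread.

```python
import os
import shutil
from typing import Mapping, Optional


def find_seqkit(environ: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return the seqkit executable to run, or None if none can be found.

    Hypothetical helper: SEQKIT, when set, wins over anything on PATH.
    """
    override = environ.get("SEQKIT")
    if override:
        return override

    # Search the directories listed in PATH for an executable named "seqkit".
    return shutil.which("seqkit", path=environ.get("PATH"))
```

Passing the environment as a parameter is only a convenience for testing; by default the process environment is used.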

victorlin · 2025-01-23T22:37:44Z

@genehack @jameshadfield and @tsibley thanks for taking the time to review this PR. Closing per #1579 (comment), but doing the work here and reading through comments gave me a better understanding of how to work with SQLite in this codebase, which should be useful for future work even if this is not merged.

I've pushed more work in progress as efc4881 in case we ever want to revisit this approach.

victorlin self-assigned this Aug 27, 2024

victorlin force-pushed the victorlin/merge-sequences branch from d143872 to f86323a Compare August 30, 2024 22:55

victorlin changed the title ~~merge: Add --sequences + --output-sequences~~ merge: Add --sequences + --output-sequences with cross-checking Sep 16, 2024

victorlin changed the title ~~merge: Add --sequences + --output-sequences with cross-checking~~ merge: Support sequences with cross-checking Sep 16, 2024

victorlin mentioned this pull request Sep 16, 2024

merge: Support sequences #1579

Open

victorlin force-pushed the victorlin/merge-sequences branch from ac2dd27 to 275a87e Compare October 2, 2024 23:51

victorlin force-pushed the victorlin/merge-sequences branch from 275a87e to 146084c Compare October 15, 2024 22:06

victorlin force-pushed the victorlin/merge-sequences branch 3 times, most recently from 128d62b to 32199b8 Compare October 23, 2024 20:59

victorlin marked this pull request as ready for review October 23, 2024 21:50

victorlin requested a review from tsibley October 23, 2024 21:50

genehack reviewed Oct 24, 2024

This was referenced Dec 4, 2024

Allow different (multiple) inputs nextstrain/avian-flu#106

Closed

Use augur merge to spike in USVI records nextstrain/zika#73

Closed

Use augur merge for sequences nextstrain/zika#76

Draft

jameshadfield reviewed Dec 16, 2024

tsibley requested changes Jan 8, 2025

victorlin added 8 commits January 13, 2025 17:23

Use global print_info

3d3c686

Preparing for use across functions.

Use global database

74f4014

Preparing for use across functions.

Use global augur variable

de206ca

Preparing for use across functions.

Use NamedFile class

4f7f4da

Preparing for NamedSequences

Make metadata options optional

bbc833b

Preparing for sequence support.

Split run() into separate functions, add types

813bd9c

Sequence support will require the ability to load metadata into the database without actually merging (if --output-metadata is not specified).

Parse unnamed inputs

d5ab7af

Preparing for sequence support, which allows unnamed inputs.

merge: Support sequences

2be60e9

Add sequence support in addition to the existing metadata support. SeqKit is used to deduplicate across sequence files. Duplicates within an individual sequence file are not supported. Those are checked by reading IDs using read_sequences.

victorlin commented Jan 14, 2025

victorlin force-pushed the victorlin/merge-sequences branch from 3d55c60 to 2be60e9 Compare January 14, 2025 01:33

genehack reviewed Jan 14, 2025

🚧 don't read all IDs into database

efc4881

victorlin closed this Jan 23, 2025

victorlin deleted the victorlin/merge-sequences branch January 23, 2025 22:38

victorlin mentioned this pull request Jan 23, 2025

merge: Support sequences without cross-checking #1631

Draft

5 tasks

victorlin added a commit that referenced this pull request Jan 31, 2025

Don't hedge, just state SQLite implementation detail

4030371

From preview review: <#1601 (comment)> Co-authored-by: John SJ Anderson <[email protected]>
