This PR introduces two changes.
First, this changes all skani invocations to use `-m 15 -c 10` as the new default parameters for the marker k-mer compression factor (`-m`) and the k-mer subsampling rate (`-c`). This is down from our previous default values of `-m 30 -c 20`, which were in turn lower than the skani-recommended settings for viruses of `-m 200 -c 30`, which are themselves lower than the skani bacterial-inspired default values of `-m 1000 -c 125`. The new default values were determined through empirical testing of diverse viral taxa: `-m 15 -c 10` succeeds in clustering (finding non-zero ANI between) all of rhinovirus and enterovirus together, as well as all of Lassa virus (our old default values only partially succeeded on these taxa).

Second, this PR reformats skani TSV output to be written out in descending order of `ANI * Total_bases_covered` (instead of the default sort order of descending `ANI`). This prevents our reference selection code from favoring reference genomes that have higher ANI over only short stretches of sequence (which happened with some rhino/enteros); it now favors longer matches of high identity. This specific metric (the product of ANI and match length) is inspired by ReferenceSeeker (publication, GitHub). Unit tests were added to confirm proper sort order.
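To illustrate the new sort order, here is a minimal sketch, not the code in this PR. It assumes the TSV has a header row containing `ANI` and `Total_bases_covered` columns; the function name `sort_skani_rows`, the other column names, and the sample values are hypothetical.

```python
import csv
import io


def sort_skani_rows(tsv_text):
    """Return (fieldnames, rows) with rows sorted by ANI * Total_bases_covered, descending."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    rows = list(reader)
    # The product rewards long, high-identity matches over short ones.
    rows.sort(
        key=lambda r: float(r["ANI"]) * float(r["Total_bases_covered"]),
        reverse=True,
    )
    return reader.fieldnames, rows


# Hypothetical input: refA has the highest ANI but covers very few bases.
SAMPLE_TSV = (
    "Ref_file\tQuery_file\tANI\tTotal_bases_covered\n"
    "refA.fa\tq.fa\t99.0\t500\n"
    "refB.fa\tq.fa\t95.0\t7000\n"
    "refC.fa\tq.fa\t97.0\t3000\n"
)

fieldnames, sorted_rows = sort_skani_rows(SAMPLE_TSV)
for row in sorted_rows:
    print(row["Ref_file"], float(row["ANI"]) * float(row["Total_bases_covered"]))
```

In this made-up example, sorting by `ANI` alone would rank `refA.fa` first, whereas the product ranks `refB.fa` first because its match is much longer.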