Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reference-less contig stitching #1221

Draft
wants to merge 151 commits into
base: master
Choose a base branch
from
Draft
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
151 commits
Select commit Hold shift + click to select a range
0beecb6
Start implementation of merge_contigs
Donaim Dec 24, 2024
9ee3136
Add tests for find_maximum_overlap
Donaim Dec 24, 2024
6beb4d9
Add more test cases for find_maximum_overlap
Donaim Dec 24, 2024
2b77b1a
Rename merge_contigs into find_maximum_overlap
Donaim Dec 24, 2024
a063ac6
Implement show_maximum_overlap
Donaim Dec 24, 2024
4708aab
Add main entry to find_maximum_overlap script
Donaim Dec 24, 2024
c1af31d
Fix no overlap case in find_maximum_overlap
Donaim Dec 24, 2024
f104e84
Fix formatting error in tests
Donaim Dec 24, 2024
aaccdc2
Change tie behaviour of find_maximum_overlap
Donaim Dec 24, 2024
73c7ed5
Add TODO comment for find_maximum_overlap
Donaim Dec 25, 2024
eb22286
Improve code in tests for find_maximum_overlap
Donaim Dec 31, 2024
ec4647b
Initial implementation of referenceless contig stitcher
Donaim Dec 31, 2024
3943d72
Prepare to integrate referenceless contig stitcher with CLI
Donaim Dec 31, 2024
f4a6179
Split contig stitcher reading based on reference avail
Donaim Dec 31, 2024
d286cc6
Factor out referencefull_contig_stitcher
Donaim Dec 31, 2024
931388e
Fix warning in test_contig_stitcher
Donaim Dec 31, 2024
ca8248b
Improve typing in referencefull_contig_stitcher
Donaim Dec 31, 2024
6ec43ca
Finish initial implementation of referenceless_contig_stitcher
Donaim Dec 31, 2024
a240d7d
Add referenceless stitcher to contig_stitcher.py
Donaim Dec 31, 2024
47065c9
Change operating format of referenceless stitcher to FASTA
Donaim Jan 2, 2025
1db40d9
Improve contig_stitcher.py CLI parsing
Donaim Jan 2, 2025
2b95042
Correct handling of FASTA extension in sample.py
Donaim Jan 2, 2025
01c6c17
Add more test cases for referenceless stitcher
Donaim Jan 2, 2025
eeccb44
Merge remote-tracking branch 'origin/master' into contig-merger
Donaim Jan 2, 2025
9ec9792
Fix writing in referenceless_contig_stitcher
Donaim Jan 2, 2025
e25717a
Perform referenceless contig stitching in sample.py
Donaim Jan 2, 2025
09972c0
Add Config.empty static method
Donaim Jan 2, 2025
6b4c16a
Fix style error in stitcher_contigs.py
Donaim Jan 2, 2025
acba19b
Make sure that empty contings are not causing a crash
Donaim Jan 2, 2025
f6d9c2a
Improve combination in referenceless contig stitcher
Donaim Jan 2, 2025
6589510
Fix type error in referenceless contig_stitcher
Donaim Jan 3, 2025
42fcc9e
Improve probability calculation in referenceless contig_stitcher
Donaim Jan 3, 2025
0fd9979
Run referenceless contig stitcher in sample.py
Donaim Jan 3, 2025
e23898d
Tweak ACCEPTABLE_STITCHING_PROB in referenceless contig_stitcher
Donaim Jan 3, 2025
96281b8
Improve code readability in referencless contig stitcher
Donaim Jan 3, 2025
ed4cc84
Improve overlap combination in referenceless contig stitcher
Donaim Jan 3, 2025
b988304
Factor out overlap stitcher
Donaim Jan 3, 2025
88dc5ba
Fix typing in contig_stitcher
Donaim Jan 3, 2025
60a45b7
Factor out sort_concordance_indexes into overlap_stitcher
Donaim Jan 3, 2025
ae8a6b7
Merge remote-tracking branch 'origin/master' into HEAD
Donaim Jan 3, 2025
695e5b8
Improve overlap combination in referenceless_contig_stitcher
Donaim Jan 3, 2025
cb797f7
Implement a more accurate pvalue calculation for overlaps
Donaim Jan 3, 2025
4952a4d
Remove a finished FIXME comment
Donaim Jan 3, 2025
ecb81c3
Fix type error in contig_stitcher.py
Donaim Jan 3, 2025
0f95ed7
Memoize calls to calc_overlap_pvalue
Donaim Jan 3, 2025
9020979
Do not use lists in refereceless_contig_stitcher
Donaim Jan 3, 2025
d38ca14
Fix logging in contig stitchers
Donaim Jan 3, 2025
31a8d3b
Add csv_to_fasta utility
Donaim Jan 4, 2025
9411fa2
Add tests for csv_to_fasta
Donaim Jan 4, 2025
8243579
Add some logging to referenceless contig stitcher
Donaim Jan 4, 2025
a041fc7
Improve CLI entry of contig_stitcher.py
Donaim Jan 4, 2025
582fc82
Fix covering behaviour of referenceless contig stitcher
Donaim Jan 4, 2025
d50a36c
Add one more test to referenceless contig_stitcher
Donaim Jan 4, 2025
fe6fbc9
Fix probability calculations for covered contigs in referenceless sti…
Donaim Jan 4, 2025
059b4a3
Fix formatting error
Donaim Jan 4, 2025
b36afe4
Small optimize alignments
Donaim Jan 4, 2025
8ed3d95
Factor out score methods in ContigsPaths
Donaim Jan 4, 2025
c290422
Simplify control flow in extend_by_1
Donaim Jan 5, 2025
961bd53
Further optimize referenceless version
Donaim Jan 5, 2025
c0e4da1
Introduce pessimisstic probability
Donaim Jan 5, 2025
b6dd4a9
Increase threshold for stitching in referenceless
Donaim Jan 5, 2025
27a392f
Further increase threshold for stitching in referenceless
Donaim Jan 5, 2025
1938719
Adjust MAX_ALTERNATIVES value in referenceless
Donaim Jan 5, 2025
041a5be
Better error handling in csv_to_fasta
Donaim Jan 7, 2025
60330e6
Stitch merged_contigs with referenceless stitcher
Donaim Jan 7, 2025
06644a0
Do not accept merged_contigs in fasta_to_csv
Donaim Jan 7, 2025
cfcaad5
Remove pessimisstic_probability from referenceless
Donaim Jan 8, 2025
d7082c3
Revert "Remove pessimisstic_probability from referenceless"
Donaim Jan 8, 2025
f656023
Fix pessimisstic probability calculation in referenceless
Donaim Jan 8, 2025
3f0f93b
Fix fasta_to_csv CLI interface
Donaim Jan 8, 2025
b10b826
Simplify calc_overlap_pvalue calculation
Donaim Jan 9, 2025
3c713d2
Small improvements to overlap finder
Donaim Jan 9, 2025
08e33ec
Optimize find_maximum_overlap
Donaim Jan 9, 2025
131acc7
Use optimized find_maximum_overlap in referenceless
Donaim Jan 9, 2025
35c7907
Simplify code in find_maximum_overlap
Donaim Jan 9, 2025
f591a65
Optimize overlap extension in referenceless
Donaim Jan 9, 2025
af7312b
Optimize referenceless by applying an heuristics
Donaim Jan 10, 2025
fc24bf5
Optimize referenceless by improving max overlap heuristics
Donaim Jan 10, 2025
2d0c4fa
Improve caching in referenceless
Donaim Jan 10, 2025
f13bdb4
Fix typing in referenceless stitcher
Donaim Jan 10, 2025
617bf27
Enable find_maximum_overlap caching
Donaim Jan 10, 2025
eb87412
Tweak minimap parameters in referenceless
Donaim Jan 10, 2025
4fb6167
Cache cutoff determination in referenceless
Donaim Jan 10, 2025
23d30c6
Reduce amount of logging in referenceless
Donaim Jan 10, 2025
4b4778f
Reduce amout of logging in referenceless further
Donaim Jan 10, 2025
ba5daf1
Sort contigs by size in referenceless
Donaim Jan 10, 2025
c97a211
Add size field to Overlap struct of referenceless
Donaim Jan 10, 2025
c2b3ae3
Return number of matches from find_maximum_overlap
Donaim Jan 10, 2025
a4db363
Fix probability heuristic in referenceless
Donaim Jan 10, 2025
33288ca
Remove redundant calculation in referenceless
Donaim Jan 10, 2025
4ea1fd5
Improve scoring in referenceless
Donaim Jan 10, 2025
a159da1
Fix scoring in referenceless
Donaim Jan 10, 2025
40f3c88
Revert to non-adaptable scoring in referenceless
Donaim Jan 10, 2025
26ef042
Revert to before Score class in referenceless
Donaim Jan 10, 2025
8aea4be
Remove redundant comments in referenceless
Donaim Jan 10, 2025
393180c
Wrap max_acceptable_prob in referenceless
Donaim Jan 10, 2025
865bc32
Limit number of contigs in referenceless
Donaim Jan 11, 2025
e63d9e4
Improve MaximumAcceptableProbability calculation in referenceless
Donaim Jan 11, 2025
cdf16f2
Do not clear MaximumAcceptableProbability pool in referenceless
Donaim Jan 11, 2025
c93be63
Revert "Do not clear MaximumAcceptableProbability pool in referenceless"
Donaim Jan 11, 2025
d60e474
Add a final loop in referenceless to collect covered contigs
Donaim Jan 13, 2025
e040cb6
Add another heuristic to referenceless stitcher
Donaim Jan 13, 2025
60e1cb7
Rename classes in referenceless
Donaim Jan 13, 2025
45e50dc
Small improvements to pool structure in referenceless
Donaim Jan 13, 2025
a1a4eca
Reduce logging in referenceless
Donaim Jan 13, 2025
d846b08
Further improvements to the pool structure of referenceless
Donaim Jan 13, 2025
24825e0
Replace Fractions by ints in referenceless
Donaim Jan 13, 2025
bbf87b0
Use correlate instead of convolve in find_maximum_overlap
Donaim Jan 13, 2025
1225d33
Use oaconvolve in find_maximum_overlap
Donaim Jan 13, 2025
1639dad
Use np.convolve in find_maxmimum_overlap
Donaim Jan 13, 2025
2e03d9d
Do not use scipy in find_maximum_overlap
Donaim Jan 13, 2025
ce5971e
Use faster method for conversion of strings into np.arrays
Donaim Jan 13, 2025
94017ad
Improve concordance calculations in contig stitcher
Donaim Jan 14, 2025
4de1569
Make probability the only score in referenceless
Donaim Jan 14, 2025
d058a17
Remove redundant method in referenceless
Donaim Jan 14, 2025
8483bbe
Do not use len(SortedList)
Donaim Jan 14, 2025
56ff67e
Fix initial filtering in referenceless
Donaim Jan 14, 2025
95d85b1
Improve logging in referenceless
Donaim Jan 14, 2025
bc49847
Shortcut referenceless final laps
Donaim Jan 14, 2025
be0090e
Fix punctuation in referenceless logging
Donaim Jan 14, 2025
0260a02
Do not print average concordance in contig stitcher
Donaim Jan 14, 2025
983cd8e
Optimize smallest score access in referenceless
Donaim Jan 14, 2025
9a8c6e5
Fix caching in referenceless
Donaim Jan 14, 2025
569e3c5
Fix some logging in referenceless
Donaim Jan 14, 2025
1e3ccbf
Optimize conversion into numpy arrays in find_maximum_overlap
Donaim Jan 14, 2025
f1204b9
Reduce amount of caching in referenceless
Donaim Jan 14, 2025
abe0be2
Remove CombineCache from referenceless completely
Donaim Jan 14, 2025
e6bd7e8
Implement a seed idea for referenceless
Donaim Jan 14, 2025
1dee289
Add singleton method to ContigsPath in referenceless
Donaim Jan 14, 2025
85c5d78
Bring back combine_cache to referenceless
Donaim Jan 14, 2025
262ba71
Factor out cutoffs handling in referenceless
Donaim Jan 14, 2025
6262410
Make cutoffs cache global again in referenceless
Donaim Jan 14, 2025
fe6c8d7
Make OverlapFinder global in referenceless
Donaim Jan 14, 2025
5b53ccc
Introduce ALIGN_CACHE into referenceless
Donaim Jan 14, 2025
06d37f4
Improve heuristics that skips overlap search in referenceless
Donaim Jan 14, 2025
6d071f6
Remove useless checks in referenceless
Donaim Jan 14, 2025
3e7f20c
Reorder heuristics calculations in referenceless
Donaim Jan 14, 2025
0b40c4b
Make sure to start with longer seeds first in referenceless
Donaim Jan 16, 2025
98a2807
Add cat utility
Donaim Jan 16, 2025
8fb4f7d
Use cat in sample.py
Donaim Jan 16, 2025
4951ac4
Remove all options from cat utility
Donaim Jan 16, 2025
66acb77
Factor out code related to concordance calculations
Donaim Jan 20, 2025
83463ca
Merge remote-tracking branch 'origin/master' into contig-merger
Donaim Jan 21, 2025
4b0352c
Simplify exp_accumulate_array code
Donaim Jan 21, 2025
237bf45
Make sure that exp_accumulate_array receives a binary array
Donaim Jan 21, 2025
92a4d28
Improve typing in exp_accumulate_array
Donaim Jan 21, 2025
466e473
Rewrite concordance calculations in numpy
Donaim Jan 21, 2025
6c74dd0
Factor out exp_accumulate_array in overlap_stitcher
Donaim Jan 21, 2025
751d20f
Optimize numpy.fromiter calls
Donaim Jan 21, 2025
cce2d95
Fix cat.py exceptional behaviour
Donaim Jan 22, 2025
b8f1d9e
Allow to accept 0 files in cat.py
Donaim Jan 22, 2025
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
712 changes: 49 additions & 663 deletions micall/core/contig_stitcher.py

Large diffs are not rendered by default.

6 changes: 3 additions & 3 deletions micall/core/plot_contigs.py
Original file line number Diff line number Diff line change
@@ -1,5 +1,5 @@
import typing
from typing import Dict, Tuple, List, Set, Iterable, NoReturn
from typing import Dict, Tuple, List, Set, Iterable, NoReturn, Sequence
from argparse import ArgumentParser, ArgumentDefaultsHelpFormatter, FileType
from collections import Counter, defaultdict
from csv import DictReader
Expand Down Expand Up @@ -555,11 +555,11 @@ def hit_to_insertions(contig: GenotypedContig, hit: CigarHit):
yield CigarHit.from_default_alignment(q_st=hit.q_ei + 1, q_ei=len(contig.seq) - 1,
r_st=hit.r_ei + 1, r_ei=hit.r_ei)

def hits_to_insertions(contig: GenotypedContig, hits: List[CigarHit]):
def hits_to_insertions(contig: GenotypedContig, hits: Iterable[CigarHit]):
for hit in hits:
yield from hit_to_insertions(contig, hit)

def record_initial_hit(contig: GenotypedContig, hits: List[CigarHit]):
def record_initial_hit(contig: GenotypedContig, hits: Sequence[CigarHit]):
insertions = [gap for gap in hits_to_insertions(contig, hits)]
unaligned_map[contig.id] = insertions

Expand Down
27 changes: 21 additions & 6 deletions micall/drivers/sample.py
Original file line number Diff line number Diff line change
Expand Up @@ -9,7 +9,6 @@
from micall.core.aln2counts import aln2counts
from micall.core.amplicon_finder import write_merge_lengths_plot, merge_for_entropy
from micall.core.cascade_report import CascadeReport
from micall.core.contig_stitcher import contig_stitcher
from micall.core.coverage_plots import coverage_plot, concordance_plot
from micall.core.plot_contigs import plot_genome_coverage
from micall.core.prelim_map import prelim_map
Expand All @@ -21,6 +20,10 @@
from micall.g2p.fastq_g2p import fastq_g2p, DEFAULT_MIN_COUNT, MIN_VALID, MIN_VALID_PERCENT
from micall.utils.driver_utils import makedirs
from micall.utils.fasta_to_csv import fasta_to_csv
from micall.utils.csv_to_fasta import csv_to_fasta, NoContigsInCSV
from micall.utils.referencefull_contig_stitcher import referencefull_contig_stitcher
from micall.utils.referenceless_contig_stitcher import referenceless_contig_stitcher
from micall.utils.cat import cat as concatenate_files
from contextlib import contextmanager

logger = logging.getLogger(__name__)
Expand Down Expand Up @@ -150,7 +153,7 @@ def get_default_path(self, output_name):
if self.scratch_path is None:
raise AttributeError(
'Unknown output {} and no scratch path.'.format(output_name))
for extension in ('csv', 'fastq', 'pdf', 'svg', 'png'):
for extension in ('csv', 'fastq', 'pdf', 'svg', 'png', 'fasta'):
if output_name.endswith('_'+extension):
file_name = output_name[:-(len(extension)+1)] + '.' + extension
break
Expand Down Expand Up @@ -426,18 +429,30 @@ def run_denovo(self, excluded_seeds):
merged_contigs_csv,
)

with open(self.merged_contigs_csv, 'r'):
try:
csv_to_fasta(self.merged_contigs_csv, Path(self.merged_contigs_fasta))
except NoContigsInCSV:
Path(self.merged_contigs_fasta).touch()

concatenate_files(inputs=[self.unstitched_contigs_fasta,
self.merged_contigs_fasta],
output=self.combined_contigs_fasta)

with open(self.combined_contigs_fasta, 'r') as combined_contigs_fasta, \
open(self.stitched_contigs_fasta, 'w') as stitched_contigs_fasta:
referenceless_contig_stitcher(combined_contigs_fasta, stitched_contigs_fasta)

with open(self.unstitched_contigs_csv, 'w') as unstitched_contigs_csv, \
open(self.merged_contigs_csv, 'r') as merged_contigs_csv, \
open(self.blast_csv, 'w') as blast_csv:
fasta_to_csv(Path(self.unstitched_contigs_fasta),
fasta_to_csv(Path(self.stitched_contigs_fasta),
unstitched_contigs_csv,
merged_contigs_csv,
blast_csv=blast_csv,
)

with open(self.unstitched_contigs_csv, 'r') as unstitched_contigs_csv, \
open(self.contigs_csv, 'w') as contigs_csv:
contig_stitcher(unstitched_contigs_csv, contigs_csv, self.stitcher_plot_svg)
referencefull_contig_stitcher(unstitched_contigs_csv, contigs_csv, self.stitcher_plot_svg)

logger.info('Running remap on %s.', self)
if self.debug_remap:
Expand Down
3 changes: 3 additions & 0 deletions micall/main.py
Original file line number Diff line number Diff line change
Expand Up @@ -88,6 +88,9 @@
"micall/monitor/update_qai.py",
"micall/monitor/micall_watcher.py",
"micall/tcr/igblast.py",
"micall/utils/find_maximum_overlap.py",
"micall/utils/csv_to_fasta.py",
"micall/utils/cat.py",
]


Expand Down
75 changes: 39 additions & 36 deletions micall/tests/test_contig_stitcher.py
Original file line number Diff line number Diff line change
Expand Up @@ -7,27 +7,27 @@

from aligntools import CigarActions, CigarHit, Cigar

import micall.core.contig_stitcher as stitcher
from micall.core.contig_stitcher import (
import micall.utils.referencefull_contig_stitcher as stitcher
from micall.utils.referencefull_contig_stitcher import (
split_contigs_with_gaps,
stitch_contigs,
GenotypedContig,
merge_intervals,
find_covered_contig,
stitch_consensus,
calculate_concordance,
calculate_concordance_norm,
align_all_to_reference,
disambiguate_concordance,
lstrip,
rstrip,
)
from micall.utils.overlap_stitcher import disambiguate_concordance
from micall.core.plot_contigs import plot_stitcher_coverage
from micall.tests.utils import mock_align_consensus, MockAlignment, fixed_random_seed
from micall.tests.test_fasta_to_csv import check_hcv_db, DEFAULT_DATABASE # activates the fixture
from micall.tests.test_remap import load_projects # activates the "projects" fixture


logging.getLogger("micall.core.contig_stitcher").setLevel(logging.DEBUG)
logging.getLogger("micall.utils.referencefull_contig_stitcher").setLevel(logging.DEBUG)
logging.getLogger("micall.core.plot_contigs").setLevel(logging.DEBUG)


Expand All @@ -39,7 +39,7 @@

@pytest.fixture()
def exact_aligner(monkeypatch):
monkeypatch.setattr("micall.core.contig_stitcher.align_consensus", mock_align_consensus)
monkeypatch.setattr("micall.utils.referencefull_contig_stitcher.align_consensus", mock_align_consensus)


@pytest.fixture
Expand Down Expand Up @@ -875,7 +875,7 @@ def test_stitching_contig_with_small_covered_gap(exact_aligner, visualizer):
assert len(visualizer().elements) > len(contigs)

assert all(x.seq == lstrip(rstrip(x)).seq for x in results)
assert {contig.seq for contig in contigs} == {contig.seq for contig in results}
assert {contig.seq for contig in contigs} != {contig.seq for contig in results}


def test_stitching_partial_align(exact_aligner, visualizer):
Expand Down Expand Up @@ -1395,7 +1395,7 @@ def mock_align(reference_seq: str, consensus: str) -> Tuple[List[MockAlignment],
algorithm = 'mock'
return (alignments, algorithm)

monkeypatch.setattr("micall.core.contig_stitcher.align_consensus", mock_align)
monkeypatch.setattr("micall.utils.referencefull_contig_stitcher.align_consensus", mock_align)

ref = 'A' * 700
seq = 'C' * 600
Expand Down Expand Up @@ -1457,10 +1457,11 @@ def test_correct_stitching_of_one_normal_and_one_unknown(exact_aligner, visualiz


def test_main_invocation(exact_aligner, tmp_path, hcv_db):
from micall.core.contig_stitcher import main
pwd = os.path.dirname(__file__)
contigs = os.path.join(pwd, "data", "exact_parts_contigs.csv")
stitched_contigs = os.path.join(tmp_path, "stitched.csv")
stitcher.main([contigs, stitched_contigs])
main(['with-references', contigs, stitched_contigs])

assert os.path.exists(contigs)
assert os.path.exists(stitched_contigs)
Expand All @@ -1479,11 +1480,12 @@ def test_main_invocation(exact_aligner, tmp_path, hcv_db):


def test_visualizer_simple(exact_aligner, tmp_path, hcv_db):
from micall.core.contig_stitcher import main
pwd = os.path.dirname(__file__)
contigs = os.path.join(pwd, "data", "exact_parts_contigs.csv")
stitched_contigs = os.path.join(tmp_path, "stitched.csv")
plot = os.path.join(tmp_path, "exact_parts_contigs.plot.svg")
stitcher.main([contigs, stitched_contigs, "--debug", "--plot", plot])
main(['with-references', contigs, stitched_contigs, "--debug", "--plot", plot])

assert os.path.exists(contigs)
assert os.path.exists(stitched_contigs)
Expand Down Expand Up @@ -1626,17 +1628,16 @@ def test_merge_intervals(intervals, expected):
assert merge_intervals(intervals) == expected


@dataclass
class TestMockAlignment:
r_st: int
r_ei: int


class MockAlignedContig:
@dataclass
class TestMockAlignment:
r_st: int
r_ei: int

def __init__(self, ref_name, group_ref, r_st, r_ei, name="contig"):
self.ref_name = ref_name
self.group_ref = group_ref
self.alignment = TestMockAlignment(r_st, r_ei)
self.alignment = MockAlignedContig.TestMockAlignment(r_st, r_ei)
self.name = name
self.id = id(self)

Expand Down Expand Up @@ -1744,12 +1745,12 @@ def test_find_covered(contigs, expected_covered_name):

def test_concordance_same_length_inputs():
with pytest.raises(ValueError):
calculate_concordance("abc", "ab")
calculate_concordance_norm("abc", "ab")


def test_concordance_completely_different_strings():
result = calculate_concordance("a" * 30, "b" * 30)
assert all(n == 0 for n in result)
result = calculate_concordance_norm("a" * 30, "b" * 30)
assert any(n > 0 for n in result)


def generate_random_string_pair(length):
Expand All @@ -1761,19 +1762,21 @@ def generate_random_string_pair(length):
@pytest.mark.parametrize(
"left, right, expected",
[
("aaaaa", "aaaaa", [0.6, 0.68, 0.7, 0.68, 0.6]),
("abcdd", "abcdd", [0.6, 0.68, 0.7, 0.68, 0.6]),
("aaaaaaaa", "baaaaaab", [0.3, 0.62, 0.71, 0.75, 0.75, 0.71, 0.62, 0.3]),
("aaaaaaaa", "aaaaaaab", [0.64, 0.73, 0.79, 0.8, 0.79, 0.73, 0.64, 0.31]),
("aaaaaaaa", "aaaaaaab", [0.64, 0.73, 0.79, 0.8, 0.79, 0.73, 0.64, 0.31]),
("aaaaaaaa", "aaaaabbb", [0.6, 0.68, 0.7, 0.68, 0.6, 0.29, 0.19, 0.13]),
("aaaaaaaa", "aaabbaaa", [0.56, 0.63, 0.62, 0.39, 0.39, 0.62, 0.63, 0.56]),
("aaaaa", "bbbbb", [0] * 5),
# ("aaaaa", "aaaaa", [0.0, 0.78, 1.0, 0.78, 0.0]),
# ("abcdd", "abcdd", [0.0, 0.78, 1.0, 0.78, 0.0]),
("aaaaaaaa", "baaaaaab", [0.0, 0.95, 0.99, 1.0, 1.0, 0.99, 0.95, 0.0]),
("aaaaaaaa", "aaaaaaab", [0.94, 0.98, 0.99, 1.0, 0.99, 0.98, 0.94, 0.0]),
("aaaaaaaa", "aaaaabbb", [0.96, 0.99, 1.0, 0.99, 0.96, 0.02, 0.0, 0.02]),
("aaaaaaaaaaa", "aaaaabaaaaa", [0.96, 0.99, 1.0, 0.99, 0.96, 0.0, 0.96, 0.99, 1.0, 0.99, 0.96]),
("aaaaaaaaaaa", "aaaabbbaaaa", [0.98, 1.0, 1.0, 0.98, 0.02, 0.0, 0.02, 0.98, 1.0, 1.0, 0.98]),
("aaaaaaaaaaa", "aaabbbbbaaa", [0.98, 1.0, 0.98, 0.04, 0.01, 0.0, 0.01, 0.04, 0.98, 1.0, 0.98]),
("aaaaaaaa", "aaabbaaa", [0.98, 1.0, 0.98, 0.0, 0.0, 0.98, 1.0, 0.98]),
("aaaaa", "bbbbb", [1.0, 0.22, 0.0, 0.22, 1.0]),
("", "", []),
],
)
def test_concordance_simple(left, right, expected):
result = [round(float(x), 2) for x in calculate_concordance(left, right)]
result = [round(float(x), 2) for x in calculate_concordance_norm(left, right)]
assert result == expected


Expand All @@ -1790,8 +1793,8 @@ def test_concordance_simple(left, right, expected):
(
"a" * 128,
"b" * 48 + "a" * 15 + "ab" + "a" * 15 + "b" * 48,
48 + 16 // 2,
), # two peaks - choosing 1nd
48 + 15 // 2,
), # two peaks - choosing 1st
(
"a" * 128,
"b" * 48 + "a" * 15 + "ba" + "a" * 15 + "b" * 48,
Expand All @@ -1805,7 +1808,7 @@ def test_concordance_simple(left, right, expected):
],
)
def test_concordance_simple_index(left, right, expected):
concordance = calculate_concordance(left, right)
concordance = calculate_concordance_norm(left, right)
concordance_d = list(disambiguate_concordance(concordance))
index = max(range(len(concordance)), key=lambda i: concordance_d[i])
if abs(index - expected) > 1:
Expand All @@ -1823,7 +1826,7 @@ def generate_test_cases(num_cases):

@pytest.mark.parametrize("left, right", concordance_cases)
def test_concordance_output_range(left, right):
result = calculate_concordance(left, right)
result = calculate_concordance_norm(left, right)
assert all(
0 <= n <= 1 for n in result
), "All values in result should be between 0 and 1"
Expand All @@ -1845,8 +1848,8 @@ def test_concordance_higher_if_more_matches_added(left, right):
+ right[insert_position + len(matching_sequence):]
)

old_conc = calculate_concordance(left, right)
new_conc = calculate_concordance(new_left, new_right)
old_conc = calculate_concordance_norm(left, right)
new_conc = calculate_concordance_norm(new_left, new_right)
old_average = sum(old_conc) / len(old_conc)
new_average = sum(new_conc) / len(new_conc)
assert old_average <= new_average
Expand All @@ -1868,7 +1871,7 @@ def test_concordance_higher_in_matching_areas(left, right):
+ right[insert_position + len(matching_sequence):]
)

concordance_scores = calculate_concordance(new_left, new_right)
concordance_scores = calculate_concordance_norm(new_left, new_right)

# Check concordance in the matching area
matching_area_concordance = concordance_scores[
Expand Down
6 changes: 3 additions & 3 deletions micall/tests/test_contig_stitcher_fuzz.py
Original file line number Diff line number Diff line change
@@ -1,15 +1,15 @@
import pytest
import json
import os
from micall.core.contig_stitcher import (
from micall.utils.referencefull_contig_stitcher import (
GenotypedContig,
AlignedContig,
stitch_consensus,
stitch_contigs,
drop_completely_covered,
StitcherContext,
)
import micall.core.contig_stitcher as stitcher
import micall.utils.referencefull_contig_stitcher as stitcher
from micall.core.plot_contigs import build_stitcher_figure
from aligntools import CigarHit, Cigar, CigarActions
from typing import Dict, List
Expand All @@ -18,7 +18,7 @@

@pytest.fixture
def no_aligner(monkeypatch):
monkeypatch.setattr("micall.core.contig_stitcher.align_to_reference", lambda x: [x])
monkeypatch.setattr("micall.utils.referencefull_contig_stitcher.align_to_reference", lambda x: [x])


@pytest.fixture(autouse=True)
Expand Down
64 changes: 64 additions & 0 deletions micall/tests/test_csv_to_fasta.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,64 @@
import sys
import pytest
from pathlib import Path
from micall.utils.csv_to_fasta import cli, main


def test_normal_input(tmpdir):
contigs_fasta = Path(tmpdir) / "contigs.fasta"
contigs_csv = Path(tmpdir) / "contigs.csv"

contigs_csv.write_text("""\
ref,match,group_ref,contig
HCV-1a,1.0,HCV-1a,TCACCAGGACAGCGGGTTGAATTCCTCGTGCAAGCGTGGAA
HCV-1a,0.75,HCV-1a,CATCACATAGGAGACAGGGCTCCAGGACTGCACCATGCTCGTGTGTGGCGACGAC
""")

expected_contigs_fasta = """\
>1
TCACCAGGACAGCGGGTTGAATTCCTCGTGCAAGCGTGGAA
>2
CATCACATAGGAGACAGGGCTCCAGGACTGCACCATGCTCGTGTGTGGCGACGAC
"""

assert main([str(contigs_csv), str(contigs_fasta)]) == 0
assert expected_contigs_fasta == contigs_fasta.read_text()


def test_empty(tmpdir):
contigs_fasta = Path(tmpdir) / "contigs.fasta"
contigs_csv = Path(tmpdir) / "contigs.csv"

contigs_csv.write_text("""\
ref,match,group_ref,contig
""")

expected_contigs_fasta = ""
assert main([str(contigs_csv), str(contigs_fasta)]) == 0
assert expected_contigs_fasta == contigs_fasta.read_text()


def test_main_invocation(tmpdir, monkeypatch):
contigs_fasta = Path(tmpdir) / "contigs.fasta"
contigs_csv = Path(tmpdir) / "contigs.csv"

contigs_csv.write_text("""\
ref,match,group_ref,contig
HCV-1a,1.0,HCV-1a,TCACCAGGACAGCGGGTTGAATTCCTCGTGCAAGCGTGGAA
HCV-1a,0.75,HCV-1a,CATCACATAGGAGACAGGGCTCCAGGACTGCACCATGCTCGTGTGTGGCGACGAC
""")

expected_contigs_fasta = """\
>1
TCACCAGGACAGCGGGTTGAATTCCTCGTGCAAGCGTGGAA
>2
CATCACATAGGAGACAGGGCTCCAGGACTGCACCATGCTCGTGTGTGGCGACGAC
"""

argv = ['program', str(contigs_csv), str(contigs_fasta)]
monkeypatch.setattr(sys, 'argv', argv)

with pytest.raises(SystemExit):
cli()

assert expected_contigs_fasta == contigs_fasta.read_text()
Loading