-
Notifications
You must be signed in to change notification settings - Fork 5
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat!: support reference alignment computation for negative stranded intervals #138
base: main
Are you sure you want to change the base?
feat!: support reference alignment computation for negative stranded intervals #138
Conversation
Breaks backward compat, so need to edit PR title to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In addition to comments, I'd like to get @AliceDG 's approval on correctness of the tests.
docs-src/develop.rst
Outdated
@@ -47,7 +47,7 @@ by linking your source tree from python's ``site-packages``. | |||
|
|||
Finally, run the all the tests:: | |||
|
|||
python -m unittest discover | |||
CI=1 python -m unittest discover |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I gave you bad advice. This is to skip running some tests in test_data_manager.py and test_gk_data.py that require cloud credentials, but unfortunately this will also skip some tests it's better to run. I'll think of a better solution, in the meantime can you revert the 2 changes to this file?
genome_kit/_apply_variants.py
Outdated
if isinstance(i, tuple): # If the item is a tuple, add it to the temp list | ||
tmp.append(i) | ||
else: | ||
if tmp: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I prefer not to rely on falsey values.
if tmp: | |
if len(tmp) > 0: |
genome_kit/_apply_variants.py
Outdated
This encompasses two steps: firstly, it flips the provided alignment, since | ||
negative strand intervals are defined from the 3' to the 5' end. Secondly, | ||
it translates the alignment so that it is reported in ascending order | ||
instead of descending order. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
replace with a small example
genome_kit/_apply_variants.py
Outdated
@@ -195,7 +233,8 @@ def apply_variants(sequence, variants, interval, reference_alignment=False): | |||
var_sequence = reverse_complement(var_sequence) | |||
|
|||
if reference_alignment: | |||
raise ValueError("Reference alignment only work on forward strand.") | |||
# raise ValueError("Reference alignment only work on forward strand.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
# raise ValueError("Reference alignment only work on forward strand.") |
else: | ||
flipped_alignment.append(highest_index-i) | ||
return flipped_alignment | ||
|
||
|
||
def apply_variants(sequence, variants, interval, reference_alignment=False): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you document the format of the returned reference_alignment?
interval = Interval('chr1', '-', 5, 10, 'hg19', 8) | ||
self.assertEqual(interval.anchor_offset, 0) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why add this test but not mirror the assert on the anchor value?
This encompasses two steps: firstly, it flips the provided alignment, since | ||
negative strand intervals are defined from the 3' to the 5' end. Secondly, | ||
it translates the alignment so that it is reported in ascending order |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why don't we do this in a single pass?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the implementation, this is done in a single pass. I thought separating out the steps in the docstring would make it easier to understand!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's iterating over the sequence twice as far as I can tell
self.assertEqual(reference_alignment, [-1, 0, 1, 2, 3, 4, 6, 7, 8, 9]) | ||
|
||
reference_alignment = apply_variants(genome37.dna, variants, negative_strand_interval, reference_alignment=True)[1] | ||
self.assertEqual(reference_alignment, [0, 1, 2, 3, 5, 6, 7, 8, 9, 10]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I might be getting confused but I'm not sure the original case is correct even: I see a deletion of 2, where the anchor is in the middle?
see https://deepgenomics.github.io/GenomeKit/anchors.html#deletions-at-the-anchor
5p-->3p +'ve
5 6 7 8 9 10 11 12 13 14 15
0 1 2 3 4 5 6 7 8 9
C G T A C G T A C G
G C A T G C A T G C
9 8 7 6 5 4 3 2 1 0
deleting pos [10,12) should delete 5,6
4 5 6 7 8 9 10 11 12 13 14 15 16
-1 0 1 2 3 4 5 6 7 8 9 10
A C G T A C . |. A C G T
T G C A T G . |. T G C A
10 9 8 7 6 5 4 3 2 1 0 -1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i'll take a look later but likely it's not splitting the var over apply left + right,
(if ``reference_alignment=True``). The alignment is comprised of integers | ||
and (int, int) tuples, with the tuples signifying an insertion of | ||
nucleotides. The first index of these tuples denotes the index of the | ||
nucleotide immediately proceeding the insertion sequence, while the | ||
second index denotes the index of that nucleotide relative to all the | ||
other nucleotides included in that insertion operation. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a bit wordy and the second index can be misinterpreted, this might be more concise
(if ``reference_alignment=True``). The alignment is comprised of integers | |
and (int, int) tuples, with the tuples signifying an insertion of | |
nucleotides. The first index of these tuples denotes the index of the | |
nucleotide immediately proceeding the insertion sequence, while the | |
second index denotes the index of that nucleotide relative to all the | |
other nucleotides included in that insertion operation. | |
(if ``reference_alignment=True``). The alignment is comprised of | |
int/tuple[int, int] offsets to `interval.5p`. | |
The tuples denote insertions, where tuple[0] is the offset after the | |
insertion and tuple[1] is the indexes the inserted sequence. |
Fixes #130.
Previously, if a user wanted to get the reference alignment for a negative strand through
apply_variants
, they'd get a ValueError saying that it was not supported. This PR adds support for this feature.The alignment generation is done by first flipping the positive strand alignment and then translating it so that it is returned in an ascending order.