Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat!: support reference alignment computation for negative stranded intervals #138

Open
wants to merge 15 commits into
base: main
Choose a base branch
from

Conversation

Veyron2121
Copy link

Fixes #130.

Previously, if a user wanted to get the reference alignment for a negative strand through apply_variants, they'd get a ValueError saying that it was not supported. This PR adds support for this feature.

The alignment generation is done by first flipping the positive strand alignment and then translating it so that it is returned in an ascending order.

@ovesh ovesh requested review from s22chan and ovesh March 4, 2025 21:25
@ovesh
Copy link
Contributor

ovesh commented Mar 4, 2025

Breaks backward compat, so need to edit PR title to feat!

Copy link
Contributor

@ovesh ovesh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In addition to comments, I'd like to get @AliceDG 's approval on correctness of the tests.

@@ -47,7 +47,7 @@ by linking your source tree from python's ``site-packages``.

Finally, run the all the tests::

python -m unittest discover
CI=1 python -m unittest discover
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I gave you bad advice. This is to skip running some tests in test_data_manager.py and test_gk_data.py that require cloud credentials, but unfortunately this will also skip some tests it's better to run. I'll think of a better solution, in the meantime can you revert the 2 changes to this file?

if isinstance(i, tuple): # If the item is a tuple, add it to the temp list
tmp.append(i)
else:
if tmp:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer not to rely on falsey values.

Suggested change
if tmp:
if len(tmp) > 0:

Comment on lines 79 to 82
This encompasses two steps: firstly, it flips the provided alignment, since
negative strand intervals are defined from the 3' to the 5' end. Secondly,
it translates the alignment so that it is reported in ascending order
instead of descending order.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

replace with a small example

@@ -195,7 +233,8 @@ def apply_variants(sequence, variants, interval, reference_alignment=False):
var_sequence = reverse_complement(var_sequence)

if reference_alignment:
raise ValueError("Reference alignment only work on forward strand.")
# raise ValueError("Reference alignment only work on forward strand.")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# raise ValueError("Reference alignment only work on forward strand.")

else:
flipped_alignment.append(highest_index-i)
return flipped_alignment


def apply_variants(sequence, variants, interval, reference_alignment=False):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you document the format of the returned reference_alignment?

@Veyron2121 Veyron2121 changed the title feat: support reference alignment computation for negative stranded intervals feat!: support reference alignment computation for negative stranded intervals Mar 5, 2025
Comment on lines +571 to +573
interval = Interval('chr1', '-', 5, 10, 'hg19', 8)
self.assertEqual(interval.anchor_offset, 0)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why add this test but not mirror the assert on the anchor value?

Comment on lines +79 to +81
This encompasses two steps: firstly, it flips the provided alignment, since
negative strand intervals are defined from the 3' to the 5' end. Secondly,
it translates the alignment so that it is reported in ascending order
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why don't we do this in a single pass?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the implementation, this is done in a single pass. I thought separating out the steps in the docstring would make it easier to understand!

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's iterating over the sequence twice as far as I can tell

self.assertEqual(reference_alignment, [-1, 0, 1, 2, 3, 4, 6, 7, 8, 9])

reference_alignment = apply_variants(genome37.dna, variants, negative_strand_interval, reference_alignment=True)[1]
self.assertEqual(reference_alignment, [0, 1, 2, 3, 5, 6, 7, 8, 9, 10])
Copy link
Collaborator

@s22chan s22chan Mar 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I might be getting confused but I'm not sure the original case is correct even: I see a deletion of 2, where the anchor is in the middle?

see https://deepgenomics.github.io/GenomeKit/anchors.html#deletions-at-the-anchor

5p-->3p +'ve

    5  6  7  8  9 10 11 12 13 14 15

     0  1  2  3  4  5  6  7  8  9
     C  G  T  A  C  G  T  A  C  G

     G  C  A  T  G  C  A  T  G  C
     9  8  7  6  5  4  3  2  1  0

deleting pos [10,12) should delete 5,6 

 4  5  6  7  8  9 10 11 12 13 14 15 16

 -1  0  1  2  3  4  5  6  7  8  9 10
  A  C  G  T  A  C  . |.  A  C  G  T
 
  T  G  C  A  T  G  . |.  T  G  C  A
 10  9  8  7  6  5  4  3  2  1  0 -1

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i'll take a look later but likely it's not splitting the var over apply left + right,

Comment on lines +138 to +143
(if ``reference_alignment=True``). The alignment is comprised of integers
and (int, int) tuples, with the tuples signifying an insertion of
nucleotides. The first index of these tuples denotes the index of the
nucleotide immediately proceeding the insertion sequence, while the
second index denotes the index of that nucleotide relative to all the
other nucleotides included in that insertion operation.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a bit wordy and the second index can be misinterpreted, this might be more concise

Suggested change
(if ``reference_alignment=True``). The alignment is comprised of integers
and (int, int) tuples, with the tuples signifying an insertion of
nucleotides. The first index of these tuples denotes the index of the
nucleotide immediately proceeding the insertion sequence, while the
second index denotes the index of that nucleotide relative to all the
other nucleotides included in that insertion operation.
(if ``reference_alignment=True``). The alignment is comprised of
int/tuple[int, int] offsets to `interval.5p`.
The tuples denote insertions, where tuple[0] is the offset after the
insertion and tuple[1] is the indexes the inserted sequence.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

apply_variants with reference_alignment=True raises a ValueError
3 participants