feat: allele rle normalization + pin pydantic version #234

korikuzma · 2023-08-24T02:39:51Z

@theferrit32 @toneillbroad I don't think RLE normalization was fully added, so this PR will close #204. The normalize VRS test now passes, and the translator tests that were failing now fail with different error messages. The translator tests that previously passed still pass with these changes.

@ahwagner @larrybabb The 2.0-alpha docs describe Allele with LSE and RSE (RSE is not in 2.0-alpha), so I assumed that the LSE section applies to both LSE and RLE. In addition to a review from @theferrit32 or @toneillbroad, I'd like one of you to review (my brain is on low battery).

Notes:

- Pinned pydantic to a working version; v2.2+ raised errors when importing models
- Updated _normalize_allele to handle LSE and RLE
  - In previous VRS-Python versions, we did not handle normalization for definite ranges, so I assumed the same still applies
  - There were checks for different state types, but the only Allele states in 2.0-alpha right now are LSE and RLE, so I removed the unnecessary checks
  - Removed the normalize test that had RSE
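For readers less familiar with the two state types named in the notes above, here is a minimal sketch of how an RLE-style state relates to a literal sequence. It is not the VRS-Python implementation: `expand_rle` and its parameter names are hypothetical, and the reference string in the usage note is made up for illustration.

```python
from math import ceil


def expand_rle(ref_seq: str, start: int, repeat_subunit_length: int, length: int) -> str:
    """Expand an RLE-style state into the literal sequence it stands for.

    Hypothetical helper: the repeat subunit is read from the reference at
    `start` (interbase coordinates) and repeated until `length` residues
    have been produced.
    """
    if length == 0:
        # A zero-length state describes a deletion of the located sequence.
        return ""
    if repeat_subunit_length <= 0:
        raise ValueError("repeat_subunit_length must be positive when length > 0")
    subunit = ref_seq[start:start + repeat_subunit_length]
    copies = ceil(length / repeat_subunit_length)
    # Truncate so partial trailing copies of the subunit are allowed.
    return (subunit * copies)[:length]
```

With the made-up reference `"TTCAGCAGCAGTT"`, calling `expand_rle("TTCAGCAGCAGTT", 2, 3, 12)` repeats the `CAG` subunit four times. The same information could be carried by an LSE holding the expanded string, which is why both state types have to be handled by the normalizer.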

korikuzma · 2023-08-24T10:40:01Z

> @ahwagner @larrybabb The 2.0-alpha docs has Allele with LSE and RSE (which is not in 2.0-alpha). So I assumed that the LSE section is used for both LSE and RLE.

@ahwagner @larrybabb I think this will cause issues with the hgvs dup del mode in the variation-normalizer if someone chooses the literal_seq_expr mode when they want to express the variation using an LSE. If the current changes to the _normalize_allele function are correct, I think we should rethink the hgvs dup del mode in variation-normalizer. One proposal would be to always represent these using an RLE with the sequence provided.

jsstevenson · 2023-08-25T00:44:41Z

@korikuzma could you make a new issue about the pydantic v2.2+ errors?

ahwagner

Outstanding work @korikuzma. I have a few general questions for @larrybabb to address and a few minor suggestions for clarity or streamlining. RLE implementation looks sound overall.

For @larrybabb, one major theme raised in this PR is the question of "how do we handle Allele normalization when the Allele Location is specified by Ranges"? To me, these have always seemed to be a shorthand for "I did a targeted region assay and want to craft general statements about copy number in those regions and the potential broader impact they have". I know we allow people to create Alleles with Range-based Locations anyway, but... why? Kori's work here supports those cases and raises interesting questions, e.g. what do we do with definite range intervals? We don't want that discussion to gate this PR but I have created #237 to discuss.

ahwagner · 2023-08-25T20:21:10Z

tests/test_vrs_normalize.py

@@ -35,27 +35,65 @@
    }
 }

-seq_loc = {
+
+allele_dict2_normalized = {


@larrybabb note this example of ambiguous endpoint deletion as state.length=0. This is replacing what might otherwise be a reported deletion with state.sequence="".

ahwagner · 2023-08-25T20:22:44Z

tests/test_vrs_normalize.py

-            "type": "Number",
-            "value": 2
-        }
+        "sequence": "",


@larrybabb in this example, we have a defined range on either side, but the state.sequence = "" is a deletion. We will want to pick one of these two paths for deletion representation across all int and range representations. I assume you prefer the RLE path (prior comment) but please confirm so @korikuzma can update accordingly.

This is acknowledging that the entire concept of an Allele (and not a CopyNumber) with ambiguous endpoints is a little absurd. I wonder if we should even be supporting that now that we do CNC / CNX.

It's absurd because I added these tests over 2 years ago (when I knew almost nothing). If @ahwagner and @larrybabb could provide some good test examples, that would be great 😄

Sorry @korikuzma; I didn't mean to indicate that the test was absurd, the notion that @larrybabb and I have not disallowed this use case is absurd. The tests are a good reflection of what we have asked for.
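The two deletion representations discussed in this thread can be sketched with plain dicts. This is an illustration only: the `is_deletion` helper is hypothetical, and the dicts carry just the fields mentioned in the comments above (`state.length` and `state.sequence`), not complete VRS objects.

```python
def is_deletion(state: dict) -> bool:
    """Return True if an Allele state describes a deletion.

    Hypothetical helper covering both paths raised in the review:
    an RLE with length 0 and an LSE with an empty sequence.
    """
    state_type = state.get("type")
    if state_type == "ReferenceLengthExpression":
        return state.get("length") == 0
    if state_type == "LiteralSequenceExpression":
        return state.get("sequence") == ""
    return False


# Path 1: deletion expressed as an RLE with state.length = 0
rle_deletion = {"type": "ReferenceLengthExpression", "length": 0}

# Path 2: deletion expressed as an LSE with state.sequence = ""
lse_deletion = {"type": "LiteralSequenceExpression", "sequence": ""}
```

Picking one path, as requested above, would let downstream code compare normalized deletions directly instead of needing a helper like this.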

src/ga4gh/vrs/normalize.py

ahwagner · 2023-08-25T21:31:39Z

src/ga4gh/vrs/normalize.py

+    # Temporarily convert SequenceReference to IRI because it makes the code simpler.
+    # This will be changed back to SequenceReference at the end of the method
    sequence_reference = None
    if isinstance(allele.location.sequence, models.SequenceReference):
        sequence_reference = allele.location.sequence
        allele.location.sequence = models.IRI(sequence_reference.refgetAccession)


Not following this. An IRI isn't guaranteed to have the digest in it, so we should always assume a SequenceReference, or have a mechanism for resolving the IRI to get a SequenceReference. I might be missing something here, but why not simply remove this block and then revise line 107 to pull from the digest field expected in every object?

Suggested change

    # Temporarily convert SequenceReference to IRI because it makes the code simpler.
    # This will be changed back to SequenceReference at the end of the method
    sequence_reference = None
    if isinstance(allele.location.sequence, models.SequenceReference):
        sequence_reference = allele.location.sequence
        allele.location.sequence = models.IRI(sequence_reference.refgetAccession)

This was initially done by @theferrit32. @ahwagner is the digest expected in every VRS object? In models.py, the digest is set as an optional field for all VRS objects.

I'm trying to figure out why this was done, but I'm now realizing that this is an issue if allele.location.sequence is not defined. A SequenceLocation.sequence is listed as an optional field. @ahwagner can you explain why this is not a required field?

> @ahwagner is the digest expected in every VRS object?

Yes, this is something that I think @theferrit32 was working on. It is available to every object, and our digest strategy in VRS 2 will compute this for every object, whether or not it is an identifiable object. I believe we were going to be using this for all objects in VRS-Python.

> A SequenceLocation.sequence is listed as an optional field. @ahwagner can you explain why this is not a required field?

Yes, it is optional to use this attribute in JSON Schema, because when used in an Allele that is part of a Haplotype, the SequenceLocation.sequence can be omitted from VRS messages, as they (by definition) will match the sequence of the parent Haplotype object. However, it is required from the ga4ghDigest.keys for creating computed digests, because it is still a critical component of the value of a SequenceLocation. In those cases, it is expected that the system loading the VRS Haplotype object would refer to / copy over the Haplotype.sequence for the Haplotype.member[*].location.sequence properties.
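As a rough illustration of the loading step described above, a system could copy the parent sequence into each member location before computing digests. This is a sketch over plain dicts, not VRS-Python code: `fill_member_sequences` is a hypothetical name, and the `sequence` and `member` keys simply follow the property names used in the comment.

```python
import copy


def fill_member_sequences(haplotype: dict) -> dict:
    """Return a copy of a Haplotype dict with member locations filled in.

    Hypothetical helper: any member whose location omits `sequence`
    inherits the parent Haplotype's `sequence`. The input is not modified.
    """
    result = copy.deepcopy(haplotype)
    parent_sequence = result.get("sequence")
    if parent_sequence is None:
        return result
    for member in result.get("member", []):
        location = member.get("location")
        if isinstance(location, dict) and "sequence" not in location:
            location["sequence"] = copy.deepcopy(parent_sequence)
    return result
```

Members that already carry their own `sequence` are left untouched, so the helper only fills gaps.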

@ahwagner I think there might be some confusion regarding the digest field. At the moment, digest is optional and does not get computed when a VRS Object is created. Should this field be populated each time a user creates a VRS Object?

@ahwagner is suggesting to switch this condition to IRI. Some test cases: #seqrefs/myseq123 and HTTPS://w3id.org/NM_012345

We will need to dereference #seqrefs/myseq123 outside this function, because this function only receives the allele to be normalized. It won't have access to the full original document the allele came from, which is where #seqrefs/myseq123 can be resolved.

We can create another function like ga4gh_inline (or something) that takes a JSON document, finds the GA4GH objects in it, and for any field whose value is an IRI that is a relative pointer, inline the object it points to in that field, if the document contains it.

If we do that, the type fields would then need to be required on the input JSON/dict. They are not currently required because the type fields are defined as literals, and when you construct a particular class with some input, it assumes the input is that type and fills in the type field.

If we don't want this constraint, we could just traverse the input document and replace all field values that look like JSON pointers (not just those in fields that are defined as IRIs in the VRS models) with the objects they refer to. This would also let people use the JSON pointer thing in non-model fields. Like if someone has a statement and a custom field they added that isn't in the model, but they want to refer to the variant in the same document from there using a pointer.
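A minimal sketch of the second option (traverse the whole document and inline anything that looks like a relative pointer) might look like the following. It is an assumption-laden illustration, not a proposed implementation: `resolve_pointer` and `inline_pointers` are hypothetical names, only dict keys are followed (no list indices or JSON Pointer escapes), and pointers inside an inlined object are not resolved again.

```python
import copy
from typing import Any


def resolve_pointer(document: dict, pointer: str) -> Any:
    """Look up a relative pointer such as "#seqrefs/myseq123" in `document`.

    Returns None when any path segment is missing.
    """
    path = pointer.lstrip("#").strip("/")
    node: Any = document
    for part in path.split("/"):
        if isinstance(node, dict) and part in node:
            node = node[part]
        else:
            return None
    return node


def inline_pointers(document: dict) -> dict:
    """Return a copy of `document` with resolvable "#..." strings inlined."""

    def walk(node: Any) -> Any:
        if isinstance(node, dict):
            return {key: walk(value) for key, value in node.items()}
        if isinstance(node, list):
            return [walk(item) for item in node]
        if isinstance(node, str) and node.startswith("#"):
            target = resolve_pointer(document, node)
            if target is not None:
                return copy.deepcopy(target)
        # Absolute IRIs and unresolvable pointers are left as-is.
        return node

    return walk(document)
```

With the test cases mentioned earlier, `#seqrefs/myseq123` would be replaced by the object stored under `seqrefs.myseq123`, while `HTTPS://w3id.org/NM_012345` would pass through unchanged because it cannot be resolved within the document.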

@theferrit32 updated comments + tests with dereferenced IRIs. Let me know if I need to make any more changes!

src/ga4gh/vrs/normalize.py

Co-authored-by: Alex H. Wagner, PhD <[email protected]>

src/ga4gh/vrs/normalize.py

theferrit32 · 2023-08-30T17:02:20Z

tests/test_vrs_normalize.py

@@ -66,10 +123,21 @@ def test_normalize_allele(rest_dataproxy):
    allele2 = normalize(allele1, rest_dataproxy)
    assert allele1 == allele2

+    allele1_seq_ref = models.Allele(**allele_dict_sequence_reference)
+    allele2_seq_ref = normalize(allele1_seq_ref, rest_dataproxy)


ahwagner

I think we need to go the other way on this–full SequenceReference object representation. Sorry for the extra work!

ahwagner · 2023-08-30T17:58:56Z

tests/test_vrs_normalize.py

@@ -75,7 +75,7 @@
    "type": "Allele",
    "location": {
        "type": "SequenceLocation",
-        "sequence": "refseq:NC_000023.11",
+        "sequence": "ga4gh:SQ.w0WZEvgJF0zf_P4yyTzjjv9oW1z61HHP",


An IRI is a reference to another object. It can be of any form under the IETF specification. When we say the sequence slot is dereferenced, it means that instead of an IRI, we have a SequenceReference object. This is true for every property in VRS where we allow for an IRI or object.

I think it is fair for us to assume this property (and every property) is dereferenced / has full object representation for normalization. We SHOULD NOT assume that an IRI takes a specific form (e.g. a refseq or ga4gh identifier) as we do here. I also believe that IRIs that contain a colon before an IRI fragment identifier (#; again, as seen here) are not valid IRIs.

We should add a regex pattern on the IRI class. I can make a new issue for this.

> An IRI is a reference to another object. It can be of any form. When we say the sequence slot is dereferenced, it means that instead of an IRI, we have a SequenceReference object. This is true for every property in VRS where we allow for an IRI or object.

Okay, I will update the code + tests to always assume a SequenceReference

> We SHOULD NOT assume that an IRI takes a specific form (e.g. a refseq or ga4gh identifier) as we do here.

These were just examples for tests. The SequenceProxy class will take the input (regardless of refseq/ga4gh/ensembl etc.) to get the corresponding sequence.

> When we say the sequence slot is dereferenced, it means that instead of an IRI, we have a SequenceReference object.

@ahwagner thanks for this clarification. The Translator class will need to be updated to work like this (doesn't need to be in this PR). Currently it sets the sequence id (ga4gh:SQ, not ga4gh:SQR) as the location.sequence value.
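To make the agreed behavior concrete (normalization expects a dereferenced SequenceReference and only warns when given an IRI), here is a small stand-alone sketch. The dataclasses are stand-ins for the real models, `sequence_alias_for_lookup` is a hypothetical name, and building the lookup alias as `ga4gh:` plus `refgetAccession` is an assumption made for illustration.

```python
import logging
from dataclasses import dataclass
from typing import Optional, Union

_logger = logging.getLogger(__name__)


@dataclass
class SequenceReference:
    """Stand-in for the VRS model; only the field used here."""
    refgetAccession: str


@dataclass
class IRI:
    """Stand-in for the VRS IRI model."""
    root: str


def sequence_alias_for_lookup(sequence: Union[SequenceReference, IRI]) -> Optional[str]:
    """Return an alias usable for sequence lookup, or None if not dereferenced.

    Hypothetical helper: an IRI is never parsed or assumed to have a
    particular form; the caller is warned and should skip normalization.
    """
    if isinstance(sequence, SequenceReference):
        return f"ga4gh:{sequence.refgetAccession}"
    _logger.warning(
        "location.sequence is expected to be a SequenceReference; got %s",
        type(sequence).__name__,
    )
    return None
```

A caller would then return the allele unchanged whenever this helper yields `None`, which keeps IRI handling (and any dereferencing) outside the normalizer.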

ahwagner

Nice!

korikuzma added 2 commits August 23, 2023 22:20

fix: allele normalization + pin pydantic version

c8f5419

re-run translator tests

1ed754c

korikuzma added enhancement New feature or request priority:high High priority labels Aug 24, 2023

korikuzma requested review from ahwagner, larrybabb, theferrit32 and toneillbroad August 24, 2023 02:39

korikuzma self-assigned this Aug 24, 2023

korikuzma requested review from a team as code owners August 24, 2023 02:39

didnt use new allele for state

8ac1649

korikuzma added 6 commits August 24, 2023 10:15

use one line

2113235

include sequence in rle + paramaterize limit

b579734

maybe fixed normalize function??

59424bc

revert

14e5bc9

fix setting trimmed alleles

6736891

allow for no rle_seq_limit + update example

9938462

korikuzma mentioned this pull request Aug 25, 2023

Unpin pydantic v2.1.1 #235

Closed

korikuzma added the 2.0-alpha Issues related to VRS 2.0-alpha branch label Aug 25, 2023

korikuzma mentioned this pull request Aug 25, 2023

Update metaschema to use VRS 2.0-alpha changes cancervariants/variation-normalization#476

Closed

ahwagner mentioned this pull request Aug 25, 2023

How to handle Allele normalization for Range Locations #237

Open

ahwagner requested changes Aug 25, 2023

korikuzma and others added 5 commits August 25, 2023 19:22

Update src/ga4gh/vrs/normalize.py

1b7c2ff

Co-authored-by: Alex H. Wagner, PhD <[email protected]>

Update src/ga4gh/vrs/normalize.py

daa2195

Co-authored-by: Alex H. Wagner, PhD <[email protected]>

Update src/ga4gh/vrs/normalize.py

812f480

Co-authored-by: Alex H. Wagner, PhD <[email protected]>

refactor handling getting alias for SequenceProxy

af267f0

iri must be dereferenced + update tests

05deaca

theferrit32 reviewed Aug 30, 2023

korikuzma requested a review from ahwagner August 30, 2023 16:11

toneillbroad approved these changes Aug 30, 2023

theferrit32 reviewed Aug 30, 2023

theferrit32 approved these changes Aug 30, 2023

ahwagner requested changes Aug 30, 2023

_normalize_allele always expects a SequenceReference

b64aa8b

korikuzma requested review from ahwagner, theferrit32 and toneillbroad August 30, 2023 18:18

korikuzma mentioned this pull request Aug 30, 2023

Add regex pattern to IRI class #240

Open

ahwagner approved these changes Aug 30, 2023

add log warning when IRI is passed

2e1923a

theferrit32 approved these changes Aug 30, 2023

korikuzma merged commit 26a0b1d into 2-alpha Aug 30, 2023
0 of 8 checks passed

korikuzma deleted the update-allele-normalization branch August 30, 2023 19:19
