Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: allele rle normalization + pin pydantic version #234

Merged
merged 16 commits into from
Aug 30, 2023

Conversation

korikuzma
Copy link
Contributor

@theferrit32 @toneillbroad I don't think RLE normalization was fully added, so this PR will close #204 . The normalize vrs test now passes and have different error messages for the translator ones that failed. The translator tests that previously passed, still pass with these changes.

@ahwagner @larrybabb The 2.0-alpha docs has Allele with LSE and RSE (which is not in 2.0-alpha). So I assumed that the LSE section is used for both LSE and RLE. In addition to @theferrit32 or @toneillbroad review, I'd like one of you to review (my brain is on low battery).

Notes:

  • Pin pydantic version to working version. v2.2+ had errors when importing models
  • Updated _normalize_allele to handle LSE and RLE
    • In previous VRS-Python versions, we did not handle normalization for definite ranges, so I assumed the same still applied
    • There were checks for different state types, but the only Allele states in 2.0-alpha right now are LSE and RLE so I removed the unnecessary checks
    • Removed normalize test that had RSE

@korikuzma korikuzma added enhancement New feature or request priority:high High priority labels Aug 24, 2023
@korikuzma korikuzma self-assigned this Aug 24, 2023
@korikuzma korikuzma requested review from a team as code owners August 24, 2023 02:39
@korikuzma
Copy link
Contributor Author

@ahwagner @larrybabb The 2.0-alpha docs has Allele with LSE and RSE (which is not in 2.0-alpha). So I assumed that the LSE section is used for both LSE and RLE.

@ahwagner @larrybabb I think this will cause issues with hgvs dup del mode in the variation-normalizer if someone chooses the literal_seq_expr mode when they want to express using a LSE. If the current changes to the _normalize_allele function is correct, I think we should rethink the hgvs dup del mode in variation-normalizer. A proposal would be to just represent using RLE with sequence provided.

@jsstevenson
Copy link
Contributor

@korikuzma could you make a new issue about the pydantic v2.2+ errors?

Copy link
Member

@ahwagner ahwagner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Outstanding work @korikuzma. I have a few general questions for @larrybabb to address and a few minor suggestions for clarity or streamlining. RLE implementation looks sound overall.

For @larrybabb, one major theme raised in this PR is the question of "how do we handle Allele normalization when the Allele Location is specified by Ranges"? To me, these have always seemed to be a shorthand for "I did a targeted region assay and want to craft general statements about copy number in those regions and the potential broader impact they have". I know we allow people to create Alleles with Range-based Locations anyway, but... why? Kori's work here supports those cases and raises interesting questions, e.g. what do we do with definite range intervals? We don't want that discussion to gate this PR but I have created #237 to discuss.

@@ -35,27 +35,65 @@
}
}

seq_loc = {

allele_dict2_normalized = {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@larrybabb note this example of ambiguous endpoint deletion as state.length=0. This is replacing what might otherwise be a reported deletion with state.sequence="".

"type": "Number",
"value": 2
}
"sequence": "",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@larrybabb in this example, we have a defined range on either side, but the state.sequence = "" is a deletion. We will want to pick one of these two paths for deletion representation across all int and range representations. I assume you prefer the RLE path (prior comment) but please confirm so @korikuzma can update accordingly.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is acknowledging that the entire concept of an Allele (and not a CopyNumber) with ambiguous endpoints is a little absurd. I wonder if we should even be supporting that now that we do CNC / CNX.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's absurd because I added these tests over 2 years ago (when I knew almost nothing). If @ahwagner and @larrybabb could provide some good test examples, that would be great 😄

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry @korikuzma; I didn't mean to indicate that the test was absurd, the notion that @larrybabb and I have not disallowed this use case is absurd. The tests are a good reflection of what we have asked for.

src/ga4gh/vrs/normalize.py Outdated Show resolved Hide resolved
src/ga4gh/vrs/normalize.py Show resolved Hide resolved
src/ga4gh/vrs/normalize.py Outdated Show resolved Hide resolved
Comment on lines 99 to 104
# Temporarily convert SequenceReference to IRI because it makes the code simpler.
# This will be changed back to SequenceReference at the end of the method
sequence_reference = None
if isinstance(allele.location.sequence, models.SequenceReference):
sequence_reference = allele.location.sequence
allele.location.sequence = models.IRI(sequence_reference.refgetAccession)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not following this. An IRI isn't guaranteed to have the digest in it, so we should always assume a SequenceReference, or have a mechanism for resolving the IRI to get a SequenceReference. I might be missing something here, but why not simply remove this block and then revise line 107 to pull from the digest field expected in every object?

Suggested change
# Temporarily convert SequenceReference to IRI because it makes the code simpler.
# This will be changed back to SequenceReference at the end of the method
sequence_reference = None
if isinstance(allele.location.sequence, models.SequenceReference):
sequence_reference = allele.location.sequence
allele.location.sequence = models.IRI(sequence_reference.refgetAccession)

Copy link
Contributor Author

@korikuzma korikuzma Aug 25, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was initially done by @theferrit32 . @ahwagner is the digest expected in every VRS object? In the models.py, the digest is set as an optional field for all VRS objects.

I'm trying to figure out why this was done, but I'm now realizing that this is an issue if allele.location.sequence is not defined. A SequenceLocation.sequence is listed as an optional field. @ahwagner can you explain why this is not a required field?

Copy link
Member

@ahwagner ahwagner Aug 26, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ahwagner is the digest expected in every VRS object?

Yes, this is something that I think @theferrit32 was working on. It is available to every object, and our digest strategy in VRS 2 will compute this for every object, whether or not it is an identifiable object. I believe we were going to be using this for all objects in VRS-Python.

A SequenceLocation.sequence is listed as an optional field. @ahwagner can you explain why this is not a required field?

Yes, it is optional to use this attribute in JSON Schema, because when used in an Allele that is part of a Haplotype, the SequenceLocation.sequence can be omitted from VRS messages, as they (by definition) will match the sequence of the parent Haplotype object. However, it is required from the ga4ghDigest.keys for creating computed digests, because it is still a critical component of the value of a SequenceLocation. In those cases, it is expected that the system loading the VRS Haplotype object would refer to / copy over the Haplotype.sequence for the Haplotype.member[*].location.sequence properties.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ahwagner I think there might be some confusion regarding the digest field. At the moment, digest is optional and does not get computed when a VRS Object is created. Should this field be populated each time a user creates a VRS Object?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ahwagner is suggesting to switch this condition to IRI. Some test cases: #seqrefs/myseq123 and HTTPS://w3id.org/NM_012345

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We will need to dereference the #seqrefs/myseq123 outside this function because this function only receives the allele to be normalized, it won't have access to the full original document the allele came from, where #seqrefs/myseq123 can be resolved from.

We can create another function like ga4gh_inline (or something) that takes a JSON document, finds the GA4GH objects in it, and for any field whose value is an IRI that is a relative pointer, inline the object it points to in that field, if the document contains it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we do that the type fields would then need to be required on the input JSON/dict. They are not currently required because the type fields are defined as literals and when you construct a particular class with some input it assumes it is that type and fills in the type field.

If we don't want this constraint, we could just traverse the input document and replace all field values that look like JSON pointers (not just those in fields that are defined as IRIs in the VRS models) with the objects they refer to. This would also let people use the JSON pointer thing in non-model fields. Like if someone has a statement and a custom field they added that isn't in the model, but they want to refer to the variant in the same document from there using a pointer.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@theferrit32 updated comments + tests with deferenced IRIs. Let me know if I need to make any more changes!

src/ga4gh/vrs/normalize.py Outdated Show resolved Hide resolved
src/ga4gh/vrs/normalize.py Outdated Show resolved Hide resolved
@@ -66,10 +123,21 @@ def test_normalize_allele(rest_dataproxy):
allele2 = normalize(allele1, rest_dataproxy)
assert allele1 == allele2

allele1_seq_ref = models.Allele(**allele_dict_sequence_reference)
allele2_seq_ref = normalize(allele1_seq_ref, rest_dataproxy)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Copy link
Member

@ahwagner ahwagner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need to go the other way on this–full SequenceReference object representation. Sorry for the extra work!

@@ -75,7 +75,7 @@
"type": "Allele",
"location": {
"type": "SequenceLocation",
"sequence": "refseq:NC_000023.11",
"sequence": "ga4gh:SQ.w0WZEvgJF0zf_P4yyTzjjv9oW1z61HHP",
Copy link
Member

@ahwagner ahwagner Aug 30, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An IRI is a reference to another object. It can be of any form under the IETF specification. When we say the sequence slot is dereferenced, it means that instead of an IRI, we have a SequenceReference object. This is true for every property in VRS where we allow for an IRI or object.

I think it is fair for us to assume this property (and every property) is dereferenced / has full object representation for normalization. We SHOULD NOT assume that an IRI takes a specific form (e.g. a refseq or ga4gh identifier) as we do here. I also believe that IRIs that contain a colon before an IRI fragment identifier (#; again, as seen here) are not valid IRIs.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should add a regex pattern on the IRI class. I can make a new issue for this

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An IRI is a reference to another object. It can be of any form. When we say the sequence slot is dereferenced, it means that instead of an IRI, we have a SequenceReference object. This is true for every property in VRS where we allow for an IRI or object.

Okay, I will update the code + tests to always assume a SequenceReference

We SHOULD NOT assume that an IRI takes a specific form (e.g. a refseq or ga4gh identifier) as we do here.

This was just examples for tests. The SequenceProxy class will take the input (regardless of refseq/ga4gh/ensembl etc) to get the corresponding sequence.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When we say the sequence slot is dereferenced, it means that instead of an IRI, we have a SequenceReference object.

@ahwagner thanks for this clarification. The Translator class will need to be updated to work like this (doesn't need to be in this PR). Currently it sets the sequence id (ga4gh:SQ, not ga4gh:SQR ) as the location.sequence value

Copy link
Member

@ahwagner ahwagner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice!

@korikuzma korikuzma merged commit 26a0b1d into 2-alpha Aug 30, 2023
0 of 8 checks passed
@korikuzma korikuzma deleted the update-allele-normalization branch August 30, 2023 19:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
2.0-alpha Issues related to VRS 2.0-alpha branch enhancement New feature or request priority:high High priority
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants