-
Notifications
You must be signed in to change notification settings - Fork 27
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: allele rle normalization + pin pydantic version #234
Conversation
@ahwagner @larrybabb I think this will cause issues with hgvs dup del mode in the variation-normalizer if someone chooses the |
@korikuzma could you make a new issue about the pydantic v2.2+ errors? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Outstanding work @korikuzma. I have a few general questions for @larrybabb to address and a few minor suggestions for clarity or streamlining. RLE implementation looks sound overall.
For @larrybabb, one major theme raised in this PR is the question of "how do we handle Allele normalization when the Allele Location is specified by Ranges"? To me, these have always seemed to be a shorthand for "I did a targeted region assay and want to craft general statements about copy number in those regions and the potential broader impact they have". I know we allow people to create Alleles with Range-based Locations anyway, but... why? Kori's work here supports those cases and raises interesting questions, e.g. what do we do with definite range intervals? We don't want that discussion to gate this PR but I have created #237 to discuss.
@@ -35,27 +35,65 @@ | |||
} | |||
} | |||
|
|||
seq_loc = { | |||
|
|||
allele_dict2_normalized = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@larrybabb note this example of ambiguous endpoint deletion as state.length=0
. This is replacing what might otherwise be a reported deletion with state.sequence=""
.
"type": "Number", | ||
"value": 2 | ||
} | ||
"sequence": "", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@larrybabb in this example, we have a defined range on either side, but the state.sequence = ""
is a deletion. We will want to pick one of these two paths for deletion representation across all int and range representations. I assume you prefer the RLE path (prior comment) but please confirm so @korikuzma can update accordingly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is acknowledging that the entire concept of an Allele (and not a CopyNumber) with ambiguous endpoints is a little absurd. I wonder if we should even be supporting that now that we do CNC / CNX.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's absurd because I added these tests over 2 years ago (when I knew almost nothing). If @ahwagner and @larrybabb could provide some good test examples, that would be great 😄
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry @korikuzma; I didn't mean to indicate that the test was absurd, the notion that @larrybabb and I have not disallowed this use case is absurd. The tests are a good reflection of what we have asked for.
src/ga4gh/vrs/normalize.py
Outdated
# Temporarily convert SequenceReference to IRI because it makes the code simpler. | ||
# This will be changed back to SequenceReference at the end of the method | ||
sequence_reference = None | ||
if isinstance(allele.location.sequence, models.SequenceReference): | ||
sequence_reference = allele.location.sequence | ||
allele.location.sequence = models.IRI(sequence_reference.refgetAccession) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not following this. An IRI isn't guaranteed to have the digest in it, so we should always assume a SequenceReference
, or have a mechanism for resolving the IRI to get a SequenceReference
. I might be missing something here, but why not simply remove this block and then revise line 107 to pull from the digest
field expected in every object?
# Temporarily convert SequenceReference to IRI because it makes the code simpler. | |
# This will be changed back to SequenceReference at the end of the method | |
sequence_reference = None | |
if isinstance(allele.location.sequence, models.SequenceReference): | |
sequence_reference = allele.location.sequence | |
allele.location.sequence = models.IRI(sequence_reference.refgetAccession) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was initially done by @theferrit32 . @ahwagner is the digest expected in every VRS object? In the models.py, the digest
is set as an optional field for all VRS objects.
I'm trying to figure out why this was done, but I'm now realizing that this is an issue if allele.location.sequence
is not defined. A SequenceLocation.sequence
is listed as an optional field. @ahwagner can you explain why this is not a required field?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ahwagner is the digest expected in every VRS object?
Yes, this is something that I think @theferrit32 was working on. It is available to every object, and our digest strategy in VRS 2 will compute this for every object, whether or not it is an identifiable object. I believe we were going to be using this for all objects in VRS-Python.
A SequenceLocation.sequence is listed as an optional field. @ahwagner can you explain why this is not a required field?
Yes, it is optional to use this attribute in JSON Schema, because when used in an Allele
that is part of a Haplotype
, the SequenceLocation.sequence
can be omitted from VRS messages, as they (by definition) will match the sequence
of the parent Haplotype
object. However, it is required from the ga4ghDigest.keys
for creating computed digests, because it is still a critical component of the value of a SequenceLocation
. In those cases, it is expected that the system loading the VRS Haplotype
object would refer to / copy over the Haplotype.sequence
for the Haplotype.member[*].location.sequence
properties.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ahwagner I think there might be some confusion regarding the digest
field. At the moment, digest
is optional and does not get computed when a VRS Object is created. Should this field be populated each time a user creates a VRS Object?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ahwagner is suggesting to switch this condition to IRI. Some test cases: #seqrefs/myseq123
and HTTPS://w3id.org/NM_012345
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We will need to dereference the #seqrefs/myseq123
outside this function because this function only receives the allele to be normalized, it won't have access to the full original document the allele came from, where #seqrefs/myseq123
can be resolved from.
We can create another function like ga4gh_inline
(or something) that takes a JSON document, finds the GA4GH objects in it, and for any field whose value is an IRI that is a relative pointer, inline the object it points to in that field, if the document contains it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we do that the type
fields would then need to be required on the input JSON/dict. They are not currently required because the type fields are defined as literals and when you construct a particular class with some input it assumes it is that type and fills in the type field.
If we don't want this constraint, we could just traverse the input document and replace all field values that look like JSON pointers (not just those in fields that are defined as IRIs in the VRS models) with the objects they refer to. This would also let people use the JSON pointer thing in non-model fields. Like if someone has a statement and a custom field they added that isn't in the model, but they want to refer to the variant in the same document from there using a pointer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@theferrit32 updated comments + tests with deferenced IRIs. Let me know if I need to make any more changes!
Co-authored-by: Alex H. Wagner, PhD <[email protected]>
Co-authored-by: Alex H. Wagner, PhD <[email protected]>
Co-authored-by: Alex H. Wagner, PhD <[email protected]>
tests/test_vrs_normalize.py
Outdated
@@ -66,10 +123,21 @@ def test_normalize_allele(rest_dataproxy): | |||
allele2 = normalize(allele1, rest_dataproxy) | |||
assert allele1 == allele2 | |||
|
|||
allele1_seq_ref = models.Allele(**allele_dict_sequence_reference) | |||
allele2_seq_ref = normalize(allele1_seq_ref, rest_dataproxy) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we need to go the other way on this–full SequenceReference object representation. Sorry for the extra work!
tests/test_vrs_normalize.py
Outdated
@@ -75,7 +75,7 @@ | |||
"type": "Allele", | |||
"location": { | |||
"type": "SequenceLocation", | |||
"sequence": "refseq:NC_000023.11", | |||
"sequence": "ga4gh:SQ.w0WZEvgJF0zf_P4yyTzjjv9oW1z61HHP", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
An IRI is a reference to another object. It can be of any form under the IETF specification. When we say the sequence
slot is dereferenced, it means that instead of an IRI, we have a SequenceReference
object. This is true for every property in VRS where we allow for an IRI or object.
I think it is fair for us to assume this property (and every property) is dereferenced / has full object representation for normalization. We SHOULD NOT assume that an IRI takes a specific form (e.g. a refseq or ga4gh identifier) as we do here. I also believe that IRIs that contain a colon before an IRI fragment identifier (#
; again, as seen here) are not valid IRIs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should add a regex pattern on the IRI class. I can make a new issue for this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
An IRI is a reference to another object. It can be of any form. When we say the sequence slot is dereferenced, it means that instead of an IRI, we have a SequenceReference object. This is true for every property in VRS where we allow for an IRI or object.
Okay, I will update the code + tests to always assume a SequenceReference
We SHOULD NOT assume that an IRI takes a specific form (e.g. a refseq or ga4gh identifier) as we do here.
This was just examples for tests. The SequenceProxy
class will take the input (regardless of refseq/ga4gh/ensembl etc) to get the corresponding sequence.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When we say the sequence slot is dereferenced, it means that instead of an IRI, we have a SequenceReference object.
@ahwagner thanks for this clarification. The Translator class will need to be updated to work like this (doesn't need to be in this PR). Currently it sets the sequence id (ga4gh:SQ
, not ga4gh:SQR
) as the location.sequence
value
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice!
@theferrit32 @toneillbroad I don't think RLE normalization was fully added, so this PR will close #204 . The normalize vrs test now passes and have different error messages for the translator ones that failed. The translator tests that previously passed, still pass with these changes.
@ahwagner @larrybabb The 2.0-alpha docs has Allele with LSE and RSE (which is not in 2.0-alpha). So I assumed that the LSE section is used for both LSE and RLE. In addition to @theferrit32 or @toneillbroad review, I'd like one of you to review (my brain is on low battery).
Notes:
_normalize_allele
to handle LSE and RLE