Initial SV-VRS schema update #428
@d-cameron VRS start/end positions represent the intervals between residues; they do not represent the positions of the residues themselves. This eliminates the confusion related to inclusivity and exclusivity. Here's one of several discussions that goes over the rationale for the "interval" method.
Using interbase coordinates doesn't eliminate the ambiguity regarding inclusive or exclusive endpoints. The link you provided does though:
Issue: Haplotype is defined on a single contig, but SVs can traverse multiple contigs.
schema/vrs.yaml
# TODO: how should the Breakend.sequence be interpreted?
# Option 1: be1 <-> seq1 <-> seq2 <-> be2
# That is, concatenate the sequences of both of them.
# Pros: no possibility of data inconsistency
# Cons: need to 'allocate' the sequence to one of the breakends
#
# Option 2: be1 <-> seq <-> be2
# Pros: more intuitive interpretation
# Cons: possibility of data inconsistency (when seq1 != seq2)
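The two interpretations in the TODO above can be sketched as follows. This is only an illustration: the `Breakend` class, its `location` string, and both helper functions are hypothetical names, not part of the VRS schema, and strand orientation of the inserted sequence is deliberately ignored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Breakend:
    """Hypothetical stand-in for the proposed Breakend; not the real schema."""
    location: str          # placeholder for a VRS Location
    sequence: str = ""     # inserted sequence attached to this breakend


def inserted_sequence_option1(be1: Breakend, be2: Breakend) -> str:
    """Option 1: be1 <-> seq1 <-> seq2 <-> be2.

    The inserted sequence is the concatenation of both breakend sequences,
    so the two fields can never contradict each other.
    """
    return be1.sequence + be2.sequence


def inserted_sequence_option2(be1: Breakend, be2: Breakend) -> str:
    """Option 2: be1 <-> seq <-> be2.

    Both breakends describe the same inserted sequence, so they must agree.
    """
    if be1.sequence != be2.sequence:
        raise ValueError("inconsistent breakend sequences: seq1 != seq2")
    return be1.sequence
```

Under option 1 the producer has to decide how to split the inserted bases between the two breakends; under option 2 a consumer has to validate that the two copies match.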
How would seq be defined in option 2?
Is this basically saying to add a requirement that be1.sequence == be2.sequence? If so, would it make more sense to move this to the Breakpoint level?
From a discussion with Daniel, the reason for putting sequence at the breakend level is that we also need to support inserted sequence on terminal breakends.
Another possibility is representing terminal breakends as a breakpoint with a single terminal breakend, and moving sequence up to the breakpoint.
This is not correct. VRS intervals use inter-residue coordinates, or interval counting, to avoid ambiguity in inclusive/exclusive behavior. If you want to convert this to a residue-counting system, it is equivalent to |
Lots of interesting work here. I think we need to align on the notion of Sequence, Location, and Variation as used in VRS 1.x.
schema/vrs.yaml
uniqueItems: true
ordered: false
# By allowing repeats and defining an order on the
# members within a haplotype, we get a
trailing comment
Would it be clearer, or overly specific, to use the term "linker sequence" instead of just "sequence" to refer to the literal sequence expression occurring between the breakends? https://fusions.cancervariants.org/en/latest/information_model.html#linker-sequence
Co-authored-by: Daniel Cameron <[email protected]>
Initial SV-VRS Schema
High-level design summary
VCF equivalences:
^ single breakend sequence must be at least one base, otherwise it will be interpreted as the missing . allele.
Clarifications needed
Is Location.end inclusive or exclusive?
That is, is Location encoded as [start, end] or [start, end)?
Location is start-end inclusive
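Reviewers in this thread describe VRS start/end as inter-residue positions rather than residue positions. A minimal sketch of how such an interval maps onto residue numbering, assuming that reading; `InterResidueInterval` and the helper functions are illustrative names, not VRS API:

```python
from typing import NamedTuple, Tuple


class InterResidueInterval(NamedTuple):
    """Positions count the gaps between residues; 0 is before the first residue."""
    start: int
    end: int


def to_zero_based_half_open(iv: InterResidueInterval) -> Tuple[int, int]:
    # Inter-residue (start, end) is numerically the same as 0-based [start, end).
    return (iv.start, iv.end)


def to_one_based_inclusive(iv: InterResidueInterval) -> Tuple[int, int]:
    # The residues spanned are numbered start+1 .. end in 1-based counting.
    if iv.start == iv.end:
        raise ValueError("zero-width interval spans no residues")
    return (iv.start + 1, iv.end)


def residues(sequence: str, iv: InterResidueInterval) -> str:
    # Python slicing already uses the same convention.
    return sequence[iv.start:iv.end]
```

A zero-width interval such as (2, 2) names the gap between the second and third residues, which is why it has no 1-based inclusive equivalent.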
What is "non-overlapping Allele" in Haplotype defined with respect to?
Allowing duplication means that two copies of an Allele can exist on the same molecule. For SV-VRS to work, the definition of 'non-overlapping' has to be with respect to the molecule, not w.r.t. the reference. I don't think there are any problems with this subtle redefinition.
What is the relationship between the reference and a haplotype?
The presence of an Allele/Haplotype defines a deviation w.r.t a reference. What (if anything) can be said about the reference outside of the Allele positions? Take the following example where there are 3 SNPs in a gene:
A) A SNP chip identifies and imputes phasing for SNPs at locations A, B and C. It reports a Haplotype containing three Alleles.
B) A WGS sequencing run finds the entire gene matches the reference except for these SNPs. It reports a Haplotype containing three Alleles.
C) A WES sequencing run finds the exons of the gene match the reference except for these SNPs. It reports a Haplotype containing three Alleles.
How does VRS differentiate these three scenarios? Does the CN of the gene impact this?
How should Breakend.sequence be encoded?
reverse complement when DerivedSequenceExpression.reverse_complement=true? (Not relevant: reverse_complement is not part of Location)
Do we need a CILEN equivalent?
The breakpoint representation chosen is lossy w.r.t common variant calling types.
VCF has CILEN which encodes the range of expected lengths for simple SVs but has no equivalent for breakpoints.
This is actually a lossy representation as many variant callers
can constrain the actual location much more than anywhere in the
[(start1, end1), (start2, end2)] range.
For example, when the intervals are due to homology, the
interval widths must be the same and, for any given position
in the first breakend interval, there is only one possible position
in the second breakend interval.
Just specifying the two intervals independently as is done in this
model does not intrinsically encode this information (although for homology we can indeed define it as such since it's an unambiguous representation).
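The homology case can be sketched as a one-to-one mapping between the two intervals. This is an illustration under stated assumptions: intervals are plain (start, end) tuples, and the `same_direction` flag is a hypothetical parameter standing in for breakend orientation, which the model would have to supply.

```python
from typing import Tuple

Interval = Tuple[int, int]


def paired_position(interval1: Interval, interval2: Interval,
                    pos1: int, same_direction: bool = True) -> int:
    """Return the only position in interval2 compatible with pos1 in interval1."""
    start1, end1 = interval1
    start2, end2 = interval2
    if end1 - start1 != end2 - start2:
        raise ValueError("homology intervals must have equal width")
    if not start1 <= pos1 <= end1:
        raise ValueError("pos1 lies outside the first interval")
    offset = pos1 - start1
    # The offset walks forward through interval2, or backward when the
    # two breakends run in opposite directions (assumption of this sketch).
    return start2 + offset if same_direction else end2 - offset
```

Two independent intervals of width w admit (w + 1) squared position pairs, whereas the homology constraint leaves only w + 1 of them.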
Similarly, even imprecise calls have possibilities that are less
likely than others. For example, if deletion break1 was at start1, then
break2 might be constrained to something like [start2, start2 + (end2 - start2) / 3],
because positions beyond that would imply a deletion longer than is plausible
(hence the VCF CILEN field).
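A rough sketch of what a length constraint in the spirit of CILEN would add for a deletion. The function name, the tuple encoding, and the use of absolute length bounds are all assumptions made for illustration; nothing here is part of the proposed schema.

```python
from typing import Optional, Tuple

Interval = Tuple[int, int]


def constrain_break2(break1: int, interval2: Interval,
                     length_range: Interval) -> Optional[Interval]:
    """Narrow the second breakend interval given a plausible deletion length range.

    Assumes inter-residue positions, so deletion length = break2 - break1.
    Returns None when no position in interval2 is compatible.
    """
    start2, end2 = interval2
    min_len, max_len = length_range
    lo = max(start2, break1 + min_len)
    hi = min(end2, break1 + max_len)
    if lo > hi:
        return None
    return (lo, hi)
```

For example, with break1 at 1000, a second interval of (1900, 2200), and a plausible length of 900 to 1000 bases, only (1900, 2000) remains, which is the narrowing that the two independent intervals cannot express.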
Is there value in an equivalent field? Would anyone actually make any decisions differently if there was a narrower band of possible positions for an imprecise SV call?