Initial SV-VRS schema update #428

d-cameron · 2023-07-02T15:43:56Z

Initial SV-VRS Schema

High-level design summary

A Breakend represents a change from reference to non-reference sequence at a given position
Location encodes any uncertainty in the position
A Breakpoint is composed of two (unordered) Breakends
Breakends can have literal sequence after the break
SV-aware phasing is done by making Haplotype ordered and encoding the phasing in the allele ordering
Allele remains untouched to minimise impact on the existing schema
Zero-width Locations are allowed (to encode exact breakpoint positions)
End-of-chromosome breakends are encoded in Breakend.terminal instead of VCF's approach of using placeholder breakends past the end of the chromosome
CN untouched
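
The summary above can be sketched as a small data model. This is an illustrative sketch only, not the schema in this PR: the field names `sequence_id` and `diverges` and the `canonical()` helper are assumptions made for the example, and coordinates are assumed to be inter-residue as discussed later in this thread.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True, order=True)
class Location:
    sequence_id: str  # hypothetical field name for the reference sequence
    start: int        # assumed inter-residue coordinate
    end: int          # start == end gives a zero-width (exact) location

    def __post_init__(self):
        if self.start < 0 or self.end < self.start:
            raise ValueError("expected 0 <= start <= end")

    @property
    def is_exact(self) -> bool:
        # A zero-width Location encodes an exact breakpoint position.
        return self.start == self.end


@dataclass(frozen=True, order=True)
class Breakend:
    location: Location
    diverges: str           # hypothetical: "before" or "after" the position
    sequence: str = ""      # literal sequence after the break, if any
    terminal: bool = False  # end-of-chromosome breakend


@dataclass(frozen=True)
class Breakpoint:
    breakends: Tuple[Breakend, Breakend]

    def canonical(self) -> Tuple[Breakend, ...]:
        # Breakends are unordered, so compare breakpoints via a sorted form.
        return tuple(sorted(self.breakends))
```

With this sketch, `Breakpoint((a, b)).canonical()` and `Breakpoint((b, a)).canonical()` compare equal, which is one way to honour the "unordered" requirement while still storing a pair.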

VCF equivalences:

| SV-VRS | VCF |
| --- | --- |
| Breakpoint | Two (redundant) ALT breakpoint records |
| Isolated Breakend | Single Breakend ALT allele notation |
| Terminal Breakend | Breakpoint past end of chromosome |
| Haplotype ordering | PSL/PSO fields |
| Breakend.sequence | Encoded in ALT^ for single breakends/(redundantly in) breakpoints |
| Breakpoint.insertion | Not supported |

^ Single breakend sequence must be at least one base; otherwise it will be interpreted as the missing . allele.

Clarifications needed

Is Location.end inclusive or exclusive?

~~That is, is Location encoded as [start, end] or [start, end)?~~

Location is start-end inclusive

What is "non-overlapping Allele" in Haplotype defined with respect to?

Allowing duplication means that two copies of an Allele can exist on the same molecule. For SV-VRS to work, the definition of 'non-overlapping' has to be with respect to the molecule, not w.r.t. the reference. I don't think there are any problems with this subtle redefinition.

What is the relationship between the reference and a haplotype?

The presence of an Allele/Haplotype defines a deviation w.r.t a reference. What (if anything) can be said about the reference outside of the Allele positions? Take the following example where there are 3 SNPs in a gene:

A) A SNP chip identifies and imputes phasing for SNPs at locations A, B and C. It reports a Haplotype containing three Alleles.

B) A WGS sequencing run finds the entire gene matches the reference except for these SNPs. It reports a Haplotype containing three Alleles.

C) A WES sequencing run finds the exons of the gene match the reference except for these SNPs. It reports a Haplotype containing three Alleles.

How does VRS differentiate these three scenarios? Does the CN of the gene impact this?

How should Breakend.sequence be encoded?

reverse complement on DivergesBefore?
~~reverse complement when DerivedSequenceExpression.reverse_complement=true?~~ (Not relevant: reverse_complement is not part of Location)

Do we need a CILEN equivalent?

The breakpoint representation chosen is lossy w.r.t. common variant calling types.
VCF has CILEN, which encodes the range of expected lengths for simple SVs, but has no equivalent for breakpoints.
This is a lossy representation because many variant callers can constrain the actual location much more tightly than anywhere in the [(start1, end1), (start2, end2)] range. For example, when the intervals are due to homology, the interval widths must be the same and, for any given position in the first breakend interval, there is only one possible position in the second breakend interval. Specifying the two intervals independently, as is done in this model, does not intrinsically encode this information (although for homology we can define it as such, since it's an unambiguous representation).
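
To make the homology case concrete, here is a minimal sketch comparing the position pairs permitted by two independent intervals with those permitted when the intervals are coupled. It assumes both breakends shift in the same direction along the reference (as for a simple deletion with microhomology); the function names are hypothetical and not part of the schema.

```python
def independent_pairs(interval1, interval2):
    """All (pos1, pos2) pairs allowed when the intervals are read independently."""
    return [
        (p1, p2)
        for p1 in range(interval1[0], interval1[1] + 1)
        for p2 in range(interval2[0], interval2[1] + 1)
    ]


def homology_pairs(interval1, interval2):
    """Pairs allowed when the uncertainty is due to homology.

    Assumes both breakends move in the same direction, so an offset k in
    the first interval implies the same offset k in the second.
    """
    width1 = interval1[1] - interval1[0]
    width2 = interval2[1] - interval2[0]
    if width1 != width2:
        raise ValueError("homology intervals must have equal width")
    return [(interval1[0] + k, interval2[0] + k) for k in range(width1 + 1)]
```

For intervals (100, 103) and (200, 203), the independent reading admits 16 pairs while the homology reading admits only 4, which is the information the two-interval model does not carry on its own.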

Similarly, even imprecise calls have possibilities that are less likely than others. For example, if a deletion's break1 was at start1, then break2 might be constrained to something like [start2, start2 + (end2 - start2) / 3], because anything beyond that would imply a longer deletion length than is plausible (hence the VCF CILEN field).
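
As a rough illustration of what a CILEN-like constraint would buy, the sketch below narrows the second breakend interval given a chosen first position and a maximum plausible deletion length. The function name and the `max_length` parameter are hypothetical; deletion length is assumed to be `break2 - break1`.

```python
def narrow_second_interval(break1, interval2, max_length):
    """Clip interval2 so that break2 - break1 never exceeds max_length."""
    start2, end2 = interval2
    upper = min(end2, break1 + max_length)
    if upper < start2:
        raise ValueError("no position in interval2 satisfies the length bound")
    return (start2, upper)
```

For example, with break1 at 1000, interval2 of (1400, 1700), and a maximum plausible length of 500, the second interval narrows to (1400, 1500).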

Is there value in an equivalent field? Would anyone actually make any decisions differently if there was a narrower band of possible positions for an imprecise SV call?

d-cameron · 2023-07-02T15:46:04Z

Do not merge to main

Related issues: #365 #425

larrybabb · 2023-07-05T02:08:36Z

@d-cameron
re: Is Location.end inclusive or exclusive?
That is, is Location encoded as [start, end] or [start, end)?

VRS start/end positions represent the intervals between residues they do not represent the positions of the residues themselves. This eliminates the confusion related to inclusivity and exclusivity. Here's one of several discussions that goes over the rationale for the "interval" method.

d-cameron · 2023-07-06T10:32:08Z

> VRS start/end positions represent the intervals between residues they do not represent the positions of the residues themselves. This eliminates the confusion related to inclusivity and exclusivity.

Using interbase coordinates doesn't eliminate the ambiguity regarding inclusive or exclusive endpoints. The link you provided does though:

> defined as the subsequence that begins at the base numbered "start" and goes to the base numbered "end", both inclusive.

d-cameron · 2023-07-06T11:26:24Z

Issue: Haplotype is defined on a single contig and SVs can traverse multiple contigs

Mrinal-Thomas-Epic · 2023-07-17T19:31:39Z

schema/vrs.yaml

+        # TODO: how should the Breakend.sequence be interpreted?
+        # Option 1: be1 <-> seq1 <-> seq2 <-> be2
+        # That is, concatenate the sequences of both of them.
+        # Pros: no possibility of data inconsistency
+        # Cons: need to 'allocate' the sequence to one of the breakends
+        #
+        # Option2: be1 <-> seq <-> be2
+        # Pros: more intuitive interpretation
+        # Cons: possibility of data inconsistency (when seq1 != seq2)


How would seq be defined in option 2?

Is this basically saying add a requirement that be1.sequence == be2.sequence? If so, would it make more sense to move this to the Breakpoint level?

From a discussion with Daniel, the reason for putting sequence at the breakend-level is that we also need to support inserted sequence on terminal breakends.

Another possibility is representing terminal breakends as a breakpoint with a single terminal breakend, and moving sequence up to the breakpoint.
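
The two options from the diff comment can be contrasted with a short sketch. This is only an illustration under simplifying assumptions: both sequences are taken to be on the same strand (the reverse-complement question raised in the PR description is left open), and the function names are hypothetical.

```python
def inserted_sequence_option1(seq1: str, seq2: str) -> str:
    """Option 1: be1 <-> seq1 <-> seq2 <-> be2 (concatenate both sequences)."""
    return seq1 + seq2


def inserted_sequence_option2(seq1: str, seq2: str) -> str:
    """Option 2: be1 <-> seq <-> be2 (both breakends carry the same sequence)."""
    if seq1 != seq2:
        # This is the data inconsistency that option 2 makes possible.
        raise ValueError("breakend sequences disagree")
    return seq1
```

Under option 1 the producer must decide how to split the inserted bases between the two breakends; under option 2 the consumer must handle the mismatch case, which is the argument for holding a single sequence at the Breakpoint level.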

ahwagner · 2023-08-03T10:46:28Z

> > VRS start/end positions represent the intervals between residues they do not represent the positions of the residues themselves. This eliminates the confusion related to inclusivity and exclusivity.
>
> Using interbase coordinates doesn't eliminate the ambiguity regarding inclusive or exclusive endpoints. The link you provided does though:
>
> > defined as the subsequence that begins at the base numbered "start" and goes to the base numbered "end", both inclusive.

This is not correct. VRS intervals use inter-residue coordinates, or interval counting, to avoid ambiguity in inclusive/exclusive behavior. If you want to convert this to a residue-counting system, it is equivalent to .bed files, which are 0-based, inclusive beginning, exclusive end. The link Larry provided describes "both inclusive" with respect to the Alignment method, but VRS uses the Interval method as described in that post. Using an inter-residue system simplifies how we describe indel behavior. @bmilius-nmdp has a pretty good slide deck on the topic here.
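
A small sketch of the equivalence described above, assuming inter-residue coordinates behave like 0-based, half-open (BED-style) coordinates. The helper names are hypothetical and only for illustration.

```python
def subsequence(sequence: str, start: int, end: int) -> str:
    """Residues between inter-residue positions start and end."""
    # Inter-residue coordinates map directly onto Python slicing.
    return sequence[start:end]


def interresidue_to_one_based(start: int, end: int):
    """Convert to 1-based, fully inclusive residue numbering.

    Returns None for a zero-width interval, which spans no residues and
    so has no residue-counting equivalent.
    """
    if start == end:
        return None
    return (start + 1, end)
```

For the sequence `ACGTAC`, the inter-residue interval (1, 3) selects `CG`, which corresponds to residues 2 through 3 in 1-based inclusive numbering. A zero-width interval such as (2, 2) selects nothing, which is what makes it suitable for describing an insertion point.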

ahwagner

Lots of interesting work here. I think we need to align on the notion of Sequence, Location, and Variation as used in VRS 1.x.

ahwagner · 2023-08-03T10:52:34Z

schema/vrs.yaml

-        uniqueItems: true
-        ordered: false
+        # By allowing repeats and defining an order on the
+        # members within a haplotype, we get a


trailing comment

greg-sharpe · 2023-09-07T13:45:45Z

Would it be clearer, or overly specific to use the term "linker sequence" instead of just "sequence" to refer to the literal sequence expression occurring between the breakends?

https://fusions.cancervariants.org/en/latest/information_model.html#linker-sequence

Co-authored-by: Daniel Cameron <[email protected]>

Initial SV-VRS schema update

b43ab40

d-cameron requested review from andreasprlic, ahwagner and larrybabb as code owners July 2, 2023 15:43

Mrinal-Thomas-Epic reviewed Jul 17, 2023


ahwagner reviewed Aug 3, 2023


Moved sequence to breakpoint; added event

28de88d

ahwagner changed the base branch from main to 2.0-alpha November 6, 2023 22:50

ahwagner changed the base branch from 2.0-alpha to main November 6, 2023 22:50

ahwagner added a commit that referenced this pull request Nov 6, 2023

Merge changes from #428 into 2.0 feature branch

00cf336

Co-authored-by: Daniel Cameron <[email protected]>

ahwagner mentioned this pull request Nov 27, 2023

2.0 alpha SV issues/clarifications #449

Closed

ahwagner closed this Nov 27, 2023
