Initial SV-VRS schema update #428

Closed
wants to merge 2 commits into from

@d-cameron d-cameron commented Jul 2, 2023

Initial SV-VRS Schema

High-level design summary

  • A Breakend represents a change from reference to non-reference sequence at a given position
  • A Location encodes any uncertainty in the position
  • A Breakpoint is composed of two (unordered) Breakends
  • Breakends can have literal sequence after the break
  • SV-aware phasing is done by making Haplotype ordered and encoding the phasing in the allele ordering
  • Allele remains untouched to minimise impact on the existing schema
  • Zero-width Locations are allowed (to encode exact breakpoint positions)
  • End-of-chromosome breakends are encoded in Breakend.terminal instead of VCF's approach of using placeholder breakends past the end of the chromosome
  • CN untouched
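
To make the summary above concrete, here is a minimal sketch of how these types could fit together. It is not the schema in this PR: the field names `sequence_id` and `orientation`, the `DivergesAfter` value, and the boolean type of `terminal` are assumptions made for illustration; only `Breakend.sequence`, `Breakend.terminal`, and `DivergesBefore` appear in the discussion.

```python
from dataclasses import dataclass
from enum import Enum
from typing import FrozenSet


class Orientation(Enum):
    DIVERGES_BEFORE = "DivergesBefore"
    DIVERGES_AFTER = "DivergesAfter"  # assumed counterpart, not named in the PR text


@dataclass(frozen=True)
class Location:
    sequence_id: str  # hypothetical field name
    start: int
    end: int

    def __post_init__(self) -> None:
        if not 0 <= self.start <= self.end:
            raise ValueError("expected 0 <= start <= end")

    @property
    def is_exact(self) -> bool:
        # A zero-width Location pins the break to a single position.
        return self.start == self.end


@dataclass(frozen=True)
class Breakend:
    location: Location
    orientation: Orientation
    sequence: str = ""      # literal sequence after the break
    terminal: bool = False  # end-of-chromosome breakend (type assumed)


@dataclass(frozen=True)
class Breakpoint:
    # A frozenset makes the two breakends unordered for equality and hashing.
    breakends: FrozenSet[Breakend]

    def __post_init__(self) -> None:
        if len(self.breakends) != 2:
            raise ValueError("a Breakpoint joins exactly two distinct breakends")


left = Breakend(Location("chr1", 1000, 1000), Orientation.DIVERGES_AFTER)
right = Breakend(Location("chr1", 5000, 5010), Orientation.DIVERGES_BEFORE, sequence="ACG")
bp = Breakpoint(frozenset({left, right}))
```

One consequence of the `frozenset` choice in this sketch is that a breakpoint joining two identical breakends cannot be expressed; a real schema would need to decide whether that case matters.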

VCF equivalences:

| SV-VRS | VCF |
|---|---|
| Breakpoint | Two (redundant) ALT breakpoint records |
| Isolated Breakend | Single Breakend ALT allele notation |
| Terminal Breakend | Breakpoint past end of chromosome |
| Haplotype ordering | PSL/PSO fields |
| Breakend.sequence | Encoded in ALT^ for single breakends/(redundantly in) breakpoints |
| Breakpoint.insertion | Not supported |

^ A single breakend sequence must be at least one base long; otherwise it will be interpreted as the missing allele (.).
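
For readers less familiar with the VCF side of the table, the following sketch classifies an ALT string as a paired breakpoint record, a single breakend, or the missing allele, following the breakend notation in the VCF specification. The function name `classify_alt` and the returned dictionary layout are hypothetical, and the sketch does not attempt any conversion to SV-VRS.

```python
import re

# Paired breakend ALT: bases and a bracketed mate position, e.g. "G]17:198982]".
_PAIRED = re.compile(r"^([ACGTNacgtn]*)([\[\]])(.+):(\d+)([\[\]])([ACGTNacgtn]*)$")
# Single breakend ALT: bases with a leading or trailing dot, e.g. "G." or ".ACGT".
_SINGLE = re.compile(r"^(\.[ACGTNacgtn]+|[ACGTNacgtn]+\.)$")


def classify_alt(alt: str) -> dict:
    """Classify a VCF ALT value (hypothetical helper, illustration only)."""
    if alt == ".":
        # With zero bases, a single breakend is indistinguishable from the missing allele.
        return {"kind": "missing"}
    paired = _PAIRED.match(alt)
    if paired:
        pre, open_bracket, chrom, pos, close_bracket, post = paired.groups()
        if open_bracket != close_bracket or bool(pre) == bool(post):
            raise ValueError(f"malformed breakend ALT: {alt!r}")
        return {"kind": "breakpoint", "bases": pre or post, "mate": (chrom, int(pos))}
    if _SINGLE.match(alt):
        return {"kind": "single_breakend", "bases": alt.strip(".")}
    return {"kind": "other"}
```

For example, `classify_alt("G]17:198982]")` reports a breakpoint with mate `("17", 198982)`, while `classify_alt("G.")` reports a single breakend carrying the base `G`.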

Clarifications needed

Is Location.end inclusive or exclusive?

That is, is Location encoded as [start, end] or [start, end)?

Location is start-end inclusive

What is "non-overlapping Allele" in Haplotype defined with respect to?

Allowing duplication means that two copies of an Allele can exist on the same molecule. For SV-VRS to work, the definition of 'non-overlapping' has to be with respect to the molecule, not w.r.t. the reference. I don't think there are any problems with this subtle redefinition.

What is the relationship between the reference and a haplotype?

The presence of an Allele/Haplotype defines a deviation w.r.t a reference. What (if anything) can be said about the reference outside of the Allele positions? Take the following example where there are 3 SNPs in a gene:

A) A SNP chip identifies and imputes phasing for SNPs at locations A, B and C. It reports a Haplotype containing three Alleles.

B) A WGS sequencing run finds the entire gene matches the reference except for these SNPs. It reports a Haplotype containing three Alleles.

C) A WES sequencing run finds the exons of the gene match the reference except for these SNPs. It reports a Haplotype containing three Alleles.

How does VRS differentiate these three scenarios? Does the CN of the gene impact this?

How should Breakend.sequence be encoded?

  • reverse complement on DivergesBefore?
  • reverse complement when DerivedSequenceExpression.reverse_complement=true? (Not relevant: reverse_complement is not part of Location)

Do we need a CILEN equivalent?

The chosen breakpoint representation is lossy w.r.t. common variant calling types. VCF has CILEN, which encodes the range of expected lengths for simple SVs, but has no equivalent for breakpoints. Many variant callers can constrain the actual location much more tightly than anywhere in the [(start1, end1), (start2, end2)] range. For example, when the intervals are due to homology, the interval widths must be the same and, for any given position in the first breakend interval, there is only one possible position in the second breakend interval. Specifying the two intervals independently, as is done in this model, does not intrinsically encode this information (although for homology we can define it as such, since it's an unambiguous representation).
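
A small sketch of the homology case may help. It assumes each interval lists candidate break positions from `start` to `end` inclusive, and that the paired positions shift in the same direction in both intervals (an inverted junction would pair them in opposite directions, hence the hypothetical `same_direction` flag). The function names are illustrative only.

```python
from itertools import product
from typing import List, Tuple

Interval = Tuple[int, int]  # (start, end), both candidate positions


def independent_positions(iv1: Interval, iv2: Interval) -> List[Tuple[int, int]]:
    """Every pair allowed when the two intervals are read independently."""
    return list(product(range(iv1[0], iv1[1] + 1), range(iv2[0], iv2[1] + 1)))


def homology_positions(iv1: Interval, iv2: Interval,
                       same_direction: bool = True) -> List[Tuple[int, int]]:
    """Pairs allowed when the interval width comes from microhomology."""
    width = iv1[1] - iv1[0]
    if iv2[1] - iv2[0] != width:
        raise ValueError("homology intervals must have equal widths")
    if same_direction:
        return [(iv1[0] + k, iv2[0] + k) for k in range(width + 1)]
    return [(iv1[0] + k, iv2[1] - k) for k in range(width + 1)]


pairs = homology_positions((100, 103), (500, 503))
```

With a width of 3 the independent reading admits 16 position pairs, while the homology reading admits only 4, which is the information the two-interval encoding does not state on its own.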

Similarly, even imprecise calls have possibilities that are less likely than others. For example, if a deletion's break1 was at start1, then break2 might be constrained to something like [start2, start2 + (end2 - start2) / 3], because anything further would imply a longer deletion length than is plausible (hence the VCF CILEN field).
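
As a rough illustration of the deletion example, the sketch below narrows the second interval given a fixed first break and an upper bound on the deletion length. The parameter `max_length` is a hypothetical stand-in for the upper end of a CILEN-style range, and the deletion length is taken to be simply `break2 - break1`.

```python
from typing import Optional, Tuple


def constrain_break2(break1: int, interval2: Tuple[int, int],
                     max_length: int) -> Optional[Tuple[int, int]]:
    """Clip interval2 so that break2 - break1 never exceeds max_length."""
    start2, end2 = interval2
    clipped_end = min(end2, break1 + max_length)
    if clipped_end < start2:
        return None  # no position in interval2 gives a plausible length
    return (start2, clipped_end)


narrowed = constrain_break2(100, (400, 700), max_length=400)
```

Here the second interval shrinks from (400, 700) to (400, 500), which matches the "first third of the interval" shape described above.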

Is there value in an equivalent field? Would anyone actually make any decisions differently if there was a narrower band of possible positions for an imprecise SV call?

@d-cameron
Author

Do not merge to main

Related issues: #365 #425

@larrybabb
Contributor

@d-cameron
Re: Is Location.end inclusive or exclusive?
That is, is Location encoded as [start, end] or [start, end)?

VRS start/end positions represent the intervals between residues; they do not represent the positions of the residues themselves. This eliminates the confusion related to inclusivity and exclusivity. Here's one of several discussions that goes over the rationale for the "interval" method.

@d-cameron
Author

VRS start/end positions represent the intervals between residues they do not represent the positions of the residues themselves. This eliminates the confusion related to inclusivity and exclusivity.

Using interbase coordinates doesn't eliminate the ambiguity regarding inclusive or exclusive endpoints. The link you provided does though:

defined as the subsequence that begins at the base numbered "start" and goes to the base numbered "end", both inclusive.

@d-cameron
Author

Issue: Haplotype is defined on a single contig and SVs can traverse multiple contigs

schema/vrs.yaml Outdated
Comment on lines 795 to 803
# TODO: how should the Breakend.sequence be interpreted?
# Option 1: be1 <-> seq1 <-> seq2 <-> be2
# That is, concatenate the sequences of both of them.
# Pros: no possibility of data inconsistency
# Cons: need to 'allocate' the sequence to one of the breakends
#
# Option2: be1 <-> seq <-> be2
# Pros: more intuitive interpretation
# Cons: possibility of data inconsistency (when seq1 != seq2)
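
The two options in the TODO can be sketched as follows. This is only an illustration of the trade-off: the function names are hypothetical, and the open question of whether either sequence should be reverse complemented is ignored.

```python
def sequence_option1(seq1: str, seq2: str) -> str:
    """Option 1: be1 <-> seq1 <-> seq2 <-> be2, the two sequences are concatenated."""
    return seq1 + seq2


def sequence_option2(seq1: str, seq2: str) -> str:
    """Option 2: be1 <-> seq <-> be2, both breakends must carry the same sequence."""
    if seq1 != seq2:
        raise ValueError("inconsistent breakend sequences: seq1 != seq2")
    return seq1
```

Option 1 never fails but requires the producer to decide how to split the inserted bases between the breakends; Option 2 needs the explicit consistency check shown here.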
Contributor

@Mrinal-Thomas-Epic Mrinal-Thomas-Epic Jul 17, 2023

How would seq be defined in option 2?

Is this basically saying add a requirement that be1.sequence == be2.sequence? If so, would it make more sense to move this to the Breakpoint level?

Contributor

From a discussion with Daniel, the reason for putting sequence at the breakend-level is that we also need to support inserted sequence on terminal breakends.

Another possibility is representing terminal breakends as a breakpoint with a single terminal breakend, and moving sequence up to the breakpoint.

@ahwagner
Member

ahwagner commented Aug 3, 2023

VRS start/end positions represent the intervals between residues they do not represent the positions of the residues themselves. This eliminates the confusion related to inclusivity and exclusivity.

Using interbase coordinates doesn't eliminate the ambiguity regarding inclusive or exclusive endpoints. The link you provided does though:

defined as the subsequence that begins at the base numbered "start" and goes to the base numbered "end", both inclusive.

This is not correct. VRS intervals use inter-residue coordinates, or interval counting, to avoid ambiguity in inclusive/exclusive behavior. If you want to convert this to a residue-counting system, it is equivalent to .bed files, which are 0-based, inclusive beginning, exclusive end. The link Larry provided describes "both inclusive" with respect to the Alignment method, but VRS uses the Interval method as described in that post. Using an inter-residue system simplifies how we describe indel behavior. @bmilius-nmdp has a pretty good slide deck on the topic here.
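
A short sketch of the equivalence described here, using a Python string as the sequence: inter-residue coordinates behave like Python slice indices (and like 0-based, half-open BED intervals). The helper names are hypothetical.

```python
from typing import Tuple


def residue_to_interresidue(first: int, last: int) -> Tuple[int, int]:
    """Convert 1-based, both-inclusive residue numbers to inter-residue (start, end)."""
    return (first - 1, last)


def insert_at(sequence: str, position: int, inserted: str) -> str:
    """Insert bases at a zero-width inter-residue location (start == end == position)."""
    return sequence[:position] + inserted + sequence[position:]


seq = "ACGT"
start, end = residue_to_interresidue(2, 3)  # residues 2..3 are "CG"
subsequence = seq[start:end]
length = end - start
with_insertion = insert_at(seq, 2, "TT")    # insertion between residues 2 and 3
```

Because positions name the gaps between residues, the length of a span is simply `end - start`, and an insertion needs no special casing: it is a zero-width interval.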

Member

@ahwagner ahwagner left a comment

Lots of interesting work here. I think we need to align on the notion of Sequence, Location, and Variation as used in VRS 1.x.

schema/vrs.yaml Outdated
uniqueItems: true
ordered: false
# By allowing repeats and defining an order on the
# members within a haplotype, we get a
Member

trailing comment

@greg-sharpe

Would it be clearer, or overly specific to use the term "linker sequence" instead of just "sequence" to refer to the literal sequence expression occurring between the breakends?

https://fusions.cancervariants.org/en/latest/information_model.html#linker-sequence

@ahwagner ahwagner changed the base branch from main to 2.0-alpha November 6, 2023 22:50
@ahwagner ahwagner changed the base branch from 2.0-alpha to main November 6, 2023 22:50
ahwagner added a commit that referenced this pull request Nov 6, 2023
@ahwagner ahwagner closed this Nov 27, 2023
5 participants