This repository has been archived by the owner on Jan 25, 2023. It is now read-only.

Add support to ask for more types of variants (more complex InDels and duplications) #20

Closed
mfiume opened this issue Apr 8, 2016 · 23 comments

Comments

@mfiume
Contributor

mfiume commented Apr 8, 2016

Proposal by Michael Baudis, please elaborate if insufficient.

For example:
INS[ATGC]+
DEL[0-9]*
DUP

Discussion on interpretation and use cases was already started in this document: https://docs.google.com/document/d/1PfSt0o0m59BRs92PtyDcP31fUl8QgMYSTiclHAXCG0s/edit?usp=sharing

@mfiume mfiume added the proposal label Apr 8, 2016
@mfiume mfiume added this to the beacon-api-1.0 milestone Apr 8, 2016
@mcupak
Contributor

mcupak commented Apr 14, 2016

Comments from the Google Doc for reference:

Miro Cupak (4:03 PM Mar 28): The description [of alternateBases] refers to the VCF spec. Is there ambiguity?
Michael Baudis (4:11 PM Mar 28): DEL (or <DEL>); DUP (or <DUP>) ...
Heinz Stockinger (8:02 AM Mar 29): ALT field looks ok. We might even consider to add the INFO field such as "ALT;INFO" (i.e use ";" to separate ALT and INFO). Then we can have examples such as: "<DUP>;SVTYPE=DUP;END=12686200;SVLEN=21100;CIPOS=-500,500;CIEND=-500,500"
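
A minimal sketch of how the "ALT;INFO" string suggested above could be split on the receiving side. The function names `parse_alt_info` and `_coerce` are hypothetical and not part of any Beacon specification; the only thing taken from the thread is the ";" separator and the example string.

```python
def _coerce(raw):
    """Return raw as an int when possible, otherwise keep it as a string."""
    try:
        return int(raw)
    except ValueError:
        return raw


def parse_alt_info(value):
    """Split an 'ALT;INFO' string into (alt, info_dict).

    Assumption: the first ';'-separated token is ALT and the remaining
    tokens are VCF-style INFO entries (KEY=VALUE or a bare flag).
    """
    alt, _, info_part = value.partition(";")
    info = {}
    for item in info_part.split(";"):
        if not item:
            continue  # no INFO part, or a stray ';'
        key, sep, raw = item.partition("=")
        if not sep:
            info[key] = True  # flag-style entry without a value
            continue
        parts = [_coerce(p) for p in raw.split(",")]
        info[key] = parts[0] if len(parts) == 1 else parts
    return alt, info


alt, info = parse_alt_info(
    "<DUP>;SVTYPE=DUP;END=12686200;SVLEN=21100;CIPOS=-500,500;CIEND=-500,500"
)
print(alt, info["END"], info["CIPOS"])
```

A plain ALT value without any INFO part (e.g. "<DEL>") yields an empty dictionary, so existing queries would remain valid under this reading.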

@heinzstockinger

+1 to include that in version 0.4

@antbro

antbro commented Jun 14, 2016

+1


@jrambla
Collaborator

jrambla commented Jun 14, 2016

+1


@ddtxra

ddtxra commented Jun 14, 2016

+1

@sdelatorrep
Contributor

+1

@mbaudis
Member

mbaudis commented Jun 15, 2016

Obviously +1 on this. Additional elaboration:

The CNV/CNA space (basic description of regional copy number imbalances vs. standard reference genomes) is a supremely suitable first extension of the current variant representation schema:

  • high biologic relevance, both in genetics and cancer
  • simple structure for basic query/representation
  • represented in VCF
  • (presumed) low impact for re-identification

I am not overly concerned regarding specific privacy issues. Obviously, any additional datapoint can in principle provide a point of attack for re-identification attempts. However, the number of rare CNVs per sample is comparatively low; it is not trivial to query base-specific CNV boundaries (and those may frequently be approximate); and somatic CNV/CNA (e.g., cancer) are currently not considered critical (see e.g. ICGC, where computed copy number is fully open).

Anyway, the evaluation of possible re-identification issues is deferred to the implementer of the Beacon resource.

The only open issues right now are IMHO specifics, e.g. how overlap queries & imprecise boundaries are defined & implemented, as well as how to query/return CN levels (i.e. granular, integer options beyond DUP/DEL).

@heinzstockinger

Hello Michael,
Could you please provide an example of what you mean by "... how to query/return CN levels (i.e. granular, integer options beyond DUP/DEL)." Thanks, Heinz

@mbaudis
Member

mbaudis commented Jun 17, 2016

@heinzstockinger Copy number variations can have different quantitative levels. Based on a 2n allele count, deletions can lead to 1n or 0n (homozygous). For duplications there is no upper limit; in cancer genomes, amplicons with hundreds of repeats of the same sequence can be found (sometimes including one or more complete CDRs; an example here is MYCN).

There are reasons to query specific copy number levels, e.g. to find only homozygous deletions.

The VCF file format allows this information to be provided through FORMAT => CN ("Copy number genotype for imprecise events"); see pp. 13/14 of VCF 4.3.

Calling numerically correct copy numbers is difficult (especially in cancer w/ mixed cellularity etc.), and frequently data contains just DUP/DEL information instead of integer count values, with the possible addition of HOMODEL (i.e. 0n) and AMP (i.e. passing an arbitrary threshold, e.g. ≥ 4).

While there are clearly use cases for this kind of granularity, implementation adds some complexity, which only makes sense when there are repositories actually providing this type of data, and not merely the theoretical urge to do so (e.g. while we work on this for arraymap.org, integer CN calls are not implemented yet).

Conclusion:

  • At least for 0.4 implement qualitative DUP/DEL calls.
  • Keep in mind a future extensibility towards integer CN thresholding.
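
To make the distinction between qualitative calls and integer CN levels concrete, here is a rough sketch assuming a 2n baseline and the example amplification threshold of 4 mentioned above. The function names and the "NORMAL" label are hypothetical; only DEL, DUP, HOMODEL and AMP come from the discussion.

```python
def classify_copy_number(cn, baseline=2, amp_threshold=4):
    """Map an integer copy number to a granular label.

    Assumptions: `baseline` is the expected allele count (2n), and
    `amp_threshold` is an arbitrary cut-off for calling an amplification.
    """
    if cn < 0:
        raise ValueError("copy number cannot be negative")
    if cn == 0:
        return "HOMODEL"  # 0n, homozygous deletion
    if cn < baseline:
        return "DEL"
    if cn == baseline:
        return "NORMAL"  # hypothetical label for "no imbalance"
    if cn >= amp_threshold:
        return "AMP"
    return "DUP"


def qualitative_call(cn, baseline=2):
    """Collapse an integer copy number to a plain DUP/DEL call, or None."""
    if cn < baseline:
        return "DEL"
    if cn > baseline:
        return "DUP"
    return None


for cn in (0, 1, 2, 3, 4, 100):
    print(cn, classify_copy_number(cn), qualitative_call(cn))
```

A resource that only stores DUP/DEL can answer the qualitative query directly, while the granular labels would need integer CN values or an agreed threshold.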

@mcupak mcupak removed the proposal label Jun 28, 2016
@mcupak mcupak modified the milestones: 0.4, 1.x Jun 28, 2016
@sduvaud

sduvaud commented Jul 5, 2016

In order to be consistent, we should have:

INS[ATGC]+
DEL[0-9]*
DUP[0-9]*

@heinzstockinger

+1

@antbro

antbro commented Jul 5, 2016

I don't understand the issue.
If I'm not the only one, perhaps chat it through in a Beacon TC?
Tony


@heinzstockinger

It's just a small change with respect to the proposal at the top of the page, i.e., we had:

INS[ATGC]+
DEL[0-9]*
DUP

we now propose to update it to:

INS[ATGC]+
DEL[0-9]*
DUP[0-9]*

i.e. only adding [0-9]* to DUP, so that both DUP and DEL have the possibility for integer values
(as is already the case in the current v0.3 and earlier specifications).
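
As an illustration, the three updated patterns can be checked with ordinary regular expressions. This is only a sketch: `parse_variant` is a hypothetical helper, and the thread does not state what the optional integer stands for, so it is returned uninterpreted.

```python
import re

# The three patterns from the updated proposal, each anchored via fullmatch().
PATTERNS = {
    "INS": re.compile(r"INS(?P<arg>[ATGC]+)"),
    "DEL": re.compile(r"DEL(?P<arg>[0-9]*)"),
    "DUP": re.compile(r"DUP(?P<arg>[0-9]*)"),
}


def parse_variant(value):
    """Return (kind, argument) for a valid value, or None if it does not match.

    For INS the argument is the inserted sequence; for DEL and DUP it is the
    optional integer (None when absent).
    """
    for kind, pattern in PATTERNS.items():
        match = pattern.fullmatch(value)
        if match is None:
            continue
        arg = match.group("arg")
        if kind == "INS":
            return kind, arg
        return kind, int(arg) if arg else None
    return None


for value in ("INSATG", "DEL", "DEL12", "DUP3", "INS", "DUPA"):
    print(value, parse_variant(value))
```

Note that a bare "INS" is rejected because the pattern requires at least one base, whereas bare "DEL" and "DUP" are accepted.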

@antbro

antbro commented Jul 5, 2016

Aha - thanks!

+1

T


@antbro

antbro commented Jul 5, 2016

Hi All

I'd like to reflect on some basics about 'what is a Beacon', and hence
thereafter decide on what to turn it into...

Currently Beacon asks about 'a single base allele'. Originally this meant
(1) "any subject-specific record where the query allele is present"
(whether heterozygous or homozygous)

But we now allow people to use Beacon to ask
(2) "any database record referring to the query allele" (could be
population frequency data, protein structure or pathogenicity
consequences, animal model correlates, etc)

We decided in Hinxton last week that both (1) and (2) are acceptable

Regarding queries that focus on records about the properties of human
subjects (as opposed to the properties of variants) we have never yet
tried to enable queries to distinguish between standard genotypes
(homozygous or heterozygous presence of the query allele), but if we did
this could quickly expand into asking about zygosity generally (e.g.,
hemizygous, Y chm markers or X markers in females, polyploidy, etc)
...and that would open a way into a general solution for genome counts
of an allele (which could be fractional, ranges, >, < etc.) ...which
then, if opened up to variants other than single base changes, provides
a way to handle copy number variation.

In parallel we'd also want a way to specify a query on a local haplotype
(i.e., one chm rather than one genome level)

SO I PROPOSE WE ENABLE QUERIES THAT SPECIFY CLEANLY AND SIMPLY

  • TYPE OF ALLELE
  • COUNT_IN_GENOME
  • COUNT_IN_HAPLOTYPE

Beyond this, I'd like to see the Beacon query language able to ask
whether data/records exist that relate to a genome region defined by a
start and stop base (which could be one and the same), and how those
data/annotations match the target region
(exact|exceed|begin_between|end_between|begin_and_end_between|only_begin_between|only_end_between|begin_at_start|end_at_stop)
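
One possible reading of these match modes, sketched in code. The semantics of each mode are not defined in this thread, so every condition below is an assumption made for illustration, as is the function name `region_matches`; positions are treated as inclusive.

```python
def region_matches(mode, q_start, q_stop, r_start, r_stop):
    """Test how a record [r_start, r_stop] relates to a query region.

    All interpretations are assumptions: "between" means inside the
    inclusive query region, and "exceed" means the record extends past
    the region on both sides.
    """
    begin_in = q_start <= r_start <= q_stop
    end_in = q_start <= r_stop <= q_stop
    checks = {
        "exact": r_start == q_start and r_stop == q_stop,
        "exceed": r_start < q_start and r_stop > q_stop,
        "begin_between": begin_in,
        "end_between": end_in,
        "begin_and_end_between": begin_in and end_in,
        "only_begin_between": begin_in and not end_in,
        "only_end_between": end_in and not begin_in,
        "begin_at_start": r_start == q_start,
        "end_at_stop": r_stop == q_stop,
    }
    if mode not in checks:
        raise ValueError("unknown match mode: " + mode)
    return checks[mode]


# A single-base record matched exactly, and an 8-base record that begins
# inside the region 6543..6553.
print(region_matches("exact", 6543, 6543, 6543, 6543))
print(region_matches("begin_between", 6543, 6553, 6545, 6552))
```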

Plus a search option for specific sequence strings

Then one could easily imagine combinations of the above, eg:

'SNP allele' located 'exact' at 'Chm:2/start_base:6543' and
'Chm:2/stop_base:6543'

'SNP allele' located 'exact' at 'Chm:2/start_base:6543' and
'Chm:2/stop_base:6543' with 'Count_In_Genome > 0.5'

'TTAGGAGG' located 'begin_between' 'Chm:2/start_base:6543' and
'Chm:2/stop_base:6553'

'Copy number variant allele X' with 'Count_In_Genome > 4'

'Copy number variant allele X' with 'Count_In_haplotype > 2' where
haplotype at 'Chm:2/start_base:5,000' and 'Chm:2/stop_base:100,000'

Thoughts...?
Tony


@mbaudis
Member

mbaudis commented Aug 23, 2016

@antbro Maybe you could move this to a separate doc which can be edited/commented on? I think it would be best to have the specific use cases listed & discussed, which is tricky with this format here on GitHub.

(overall your examples are in my line of thinking)

@mbaudis
Member

mbaudis commented Nov 29, 2016

For info linked here, a write-up of options for range queries, using a VCF:INFO approach (but pointing to alternative use of other attributes):

https://docs.google.com/document/d/1uePLlLMl0FzxZxDrsF9IxsC2nYvZ84029fzUD1ULNWI/edit#

@jrambla
Collaborator

jrambla commented Nov 29, 2016

I was under the impression that in version 0.4 we would implement complex variants like the ones in the document https://docs.google.com/document/d/1uePLlLMl0FzxZxDrsF9IxsC2nYvZ84029fzUD1ULNWI
Am I wrong?

@mcupak
Contributor

mcupak commented Nov 29, 2016

I believe the decision made on today's call was to implement the first step as described in #20 (comment) and put off the changes proposed in the document above to a later time.

@heinzstockinger

There would just be the additional field to add: alternateBasesInfo. It's an optional parameter so it is completely backwards compatible.

Details are in the following pull request: #65

@jrambla jrambla modified the milestones: future, 0.4 Jan 10, 2017
@mcupak mcupak removed this from the future milestone Feb 7, 2017
@mcupak
Contributor

mcupak commented May 24, 2017

To summarize related decisions made during the workshop yesterday:

  • Instead of using arrays for start and end, we use individual explicitly named fields for both sides of the intervals.
  • CNV positions should be separate from SNP positions. I.e. we'll use start, start_min, start_max, end_min, end_max. The min/max nomenclature received +1s in a follow-up email thread.
  • We'll extend existing allele request instead of implementing a separate endpoint.

We're going with #94 over #95 as the base for implementation.
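
A rough sketch of what the separate, explicitly named fields could look like on the request side. Only the field names start, start_min, start_max, end_min and end_max come from the decisions above; the `AlleleRequest` class, its methods, and the inclusive matching rule are assumptions for illustration and do not reflect the code in #94.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AlleleRequest:
    """Hypothetical request object with separate SNP and CNV positions."""
    start: Optional[int] = None      # precise position (SNP-style query)
    start_min: Optional[int] = None  # lower bound of the CNV start
    start_max: Optional[int] = None  # upper bound of the CNV start
    end_min: Optional[int] = None    # lower bound of the CNV end
    end_max: Optional[int] = None    # upper bound of the CNV end

    def is_range_query(self) -> bool:
        """True when all four interval bounds are supplied."""
        bounds = (self.start_min, self.start_max, self.end_min, self.end_max)
        return all(b is not None for b in bounds)


def range_matches(request: AlleleRequest, variant_start: int, variant_end: int) -> bool:
    """Assumed rule: both variant ends must fall inside their inclusive intervals."""
    if not request.is_range_query():
        raise ValueError("start_min, start_max, end_min and end_max are all required")
    return (request.start_min <= variant_start <= request.start_max
            and request.end_min <= variant_end <= request.end_max)


request = AlleleRequest(start_min=1000, start_max=1500, end_min=20000, end_max=22000)
print(range_matches(request, 1200, 21000))
print(range_matches(request, 900, 21000))
```

Keeping start separate from the min/max fields means a precise SNP query never has to populate the interval bounds.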

@mcupak mcupak added this to the 0.4 milestone May 24, 2017
@mcupak mcupak mentioned this issue May 24, 2017
@mbaudis
Member

mbaudis commented Jun 20, 2017

@mcupak I've added a comment to the DUP,DEL... PR https://github.com/ga4gh/beacon-team/blob/develop-proto-structural_and_ranges/src/main/proto/ga4gh/beacon.proto#L50

Actually, on re-reading VCF the reference value can stay "required", since values of A,C,G,T,N are permitted (this is conceptually slightly different from a . as recommended for a missing value, but practically the same).

Is this sufficiently verbose?

  // Reference bases for this variant (starting from `start`).
  //
  // Accepted values: see the REF field in VCF 4.2 specification
  // (https://samtools.github.io/hts-specs/VCFv4.2.pdf).
  // When querying for variants without specific base alterations (e.g.
  // imprecise structural variants with separate variant_type as well as
  // start_min & end_min ... parameters), the use of a single "N" value is
  // recommended.
  string reference_bases = 8;
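
For illustration, a client could apply the "N" recommendation roughly like this. The helper `normalize_reference_bases` and its `imprecise` flag are hypothetical; the upper-casing step assumes case-insensitive bases and is not something the comment above prescribes.

```python
import re

# Bases permitted for the reference value, per the comment above.
REFERENCE_BASES = re.compile(r"[ACGTN]+")


def normalize_reference_bases(value=None, imprecise=False):
    """Return a valid reference_bases string for a query.

    Assumption: when no bases are given for an imprecise structural-variant
    query, a single "N" is sent so the field can stay required.
    """
    if not value:
        if imprecise:
            return "N"
        raise ValueError("reference_bases is required for precise queries")
    value = value.upper()  # assumption: input is treated case-insensitively
    if REFERENCE_BASES.fullmatch(value) is None:
        raise ValueError("reference_bases may only contain A, C, G, T or N")
    return value


print(normalize_reference_bases("acgt"))
print(normalize_reference_bases(imprecise=True))
```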

@mbaudis
Member

mbaudis commented Sep 12, 2017

Closing since implemented in develop-proto branch.

@mbaudis mbaudis closed this as completed Sep 12, 2017