Add support to ask for more types of variants (more complex InDels and duplications) #20
Comments
Comments from the Google Doc for reference:
|
+1 to include that in version 0.4 |
+1 Heinz Stockinger wrote:
|
+1 On Tue, 14 Jun 2016 at 16:57 antbro [email protected] wrote:
|
+1 |
+1 |
Obviously +1 on this. Additional elaboration: The CNV/CNA space (basic description of regional copy number imbalances vs. standard reference genomes) is a supremely suitable first extension of the current variant representation schema:
I am not overly concerned about specific privacy issues. Any additional data point can, in principle, provide a point of attack for re-identification attempts. However, the number of rare CNVs per sample is comparatively low; it is not trivial to query base-specific CNV boundaries (and those may frequently be approximate); and somatic CNVs/CNAs (e.g., in cancer) are currently not considered critical (see e.g. ICGC, where computed copy number is fully open). In any case, the evaluation of possible re-identification issues is deferred to the implementer of the Beacon resource. The only open issues right now are IMHO specifics, e.g. how overlap queries and imprecise boundaries are defined and implemented, as well as how to query/return CN levels (i.e. granular, integer options beyond DUP/DEL). |
Hello Michael, |
@heinzstockinger Copy number variations can have different quantitative levels. Based on a 2n allele count, deletions can lead to 1n or 0n (homozygous). For duplications there is no upper limit; in cancer genomes, amplicons with hundreds of repeats of the same sequence can be found (sometimes including one or more complete CDRs; an example here is MYCN). There are reasons to query specific copy number levels, e.g. to find only homozygous deletions. The VCF file format allows providing this information through Calling numerically correct copy numbers is difficult (especially in cancer with mixed cellularity etc.), and data frequently contains just DUP/DEL information instead of integer count values, with the possible addition of HOMODEL (i.e. 0n) and AMP (i.e. passing an arbitrary threshold, e.g. ≥ 4). While there are clearly use cases for this kind of granularity, implementation adds some complexity, which only makes sense when there are repositories actually providing this type of data and not only the theoretical urge to do so (e.g. while we work on this for arraymap.org, integer CN calls are not implemented yet). Conclusion:
|
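To make the copy number levels discussed above concrete, here is a minimal Python sketch that maps an integer copy number onto the coarse labels mentioned in the comment (HOMODEL, DEL, DUP, AMP). The function name, the `NEUTRAL` label for the baseline, and the default threshold of 4 are assumptions for illustration only; the thread itself calls the AMP threshold arbitrary and does not define a canonical mapping.

```python
def classify_copy_number(cn: int, baseline: int = 2, amp_threshold: int = 4) -> str:
    """Map an integer copy number to a coarse CNV label.

    Hypothetical helper: `baseline` is the expected allele count (2n),
    and `amp_threshold` is an arbitrary cut-off for calling AMP.
    """
    if cn < 0:
        raise ValueError("copy number cannot be negative")
    if cn == 0:
        return "HOMODEL"   # 0n, homozygous deletion
    if cn < baseline:
        return "DEL"       # e.g. 1n on a 2n baseline
    if cn == baseline:
        return "NEUTRAL"   # assumed label for no imbalance
    if cn >= amp_threshold:
        return "AMP"       # at or above the arbitrary threshold
    return "DUP"           # gain below the AMP threshold


for cn in (0, 1, 2, 3, 4, 100):
    print(cn, classify_copy_number(cn))
```

A resource that only stores DUP/DEL calls could not answer queries at this granularity, which is the implementation cost the comment points out.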
In order to be consistent, we should have: INS[ATGC]+, DEL[0-9]*, DUP[0-9]* |
+1 |
I don't understand the issue Heinz Stockinger wrote:
|
It's just a small change with respect to the proposal at the top of the page, i.e., we had: INS[ATGC]+, DEL[0-9]*, DUP; we now propose to update it to: INS[ATGC]+, DEL[0-9]*, DUP[0-9]*, i.e. only adding [0-9]* to DUP, so that both DUP and DEL have the possibility for integer values |
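As a rough illustration of the updated patterns (adding `[0-9]*` to DUP so that both DEL and DUP accept an optional integer), the following Python sketch validates and splits such values. The function name `parse_variant` and the return shape are hypothetical, and the meaning of the optional integer is not specified in this thread, so it is returned uninterpreted.

```python
import re
from typing import Optional, Tuple

# Patterns taken from the proposal: INS[ATGC]+, DEL[0-9]*, DUP[0-9]*
_INS = re.compile(r"INS([ATGC]+)")
_DEL_DUP = re.compile(r"(DEL|DUP)([0-9]*)")


def parse_variant(value: str) -> Optional[Tuple[str, Optional[str]]]:
    """Return (kind, payload) if `value` matches the proposal, else None.

    Hypothetical helper: payload is the inserted bases for INS, or the
    optional integer (as a string, uninterpreted) for DEL/DUP.
    """
    m = _INS.fullmatch(value)
    if m:
        return ("INS", m.group(1))
    m = _DEL_DUP.fullmatch(value)
    if m:
        return (m.group(1), m.group(2) or None)
    return None


for example in ("INSATG", "DEL", "DEL12", "DUP", "DUP4", "INS", "DUPx"):
    print(example, parse_variant(example))
```

Using `fullmatch` rather than `match` rejects values with trailing characters such as `DUPx`, and a bare `INS` is rejected because the pattern requires at least one base.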
Aha - thanks! +1 T Heinz Stockinger wrote:
|
Hi All I'd like to reflect on some basics about 'what is a Beacon', and hence Currently Beacon asks about 'a single base allele'. Originally this meant But we now allow people to use Beacon to ask We decided in Hinxton last week that both (1) and (2) are acceptable Regarding queries that focus on records about the properties of human In parallel we'd also want a way to specify a query on a local haplotype SO I PROPOSE WE ENABLE QUERIES THAT SPECIFY CLEANLY AND SIMPLY
Beyond this, I'd like to see the Beacon query language able to ask but Plus a search option for specific sequence strings Then one could easily imagine combinations of the above, eg: 'SNP allele' located 'exact' at 'Chm:2/start_base:6543' and 'SNP allele' located 'exact' at 'Chm:2/start_base:6543' and 'TTAGGAGG' located 'begin_between' 'Chm:2/start_base:6543' and 'Copy number variant allele X' with 'Count_In_Genome > 4' 'Copy number variant allele X' with 'Count_In_haplotype > 2' where Thoughts...? Michael Baudis wrote:
|
@antbro Maybe you move this to a separate doc which can be edited/commented on? I think it would be best to have the specific use cases listed & discussed, which is tricky with this format here on Github. (overall your examples are in my line of thinking) |
For info linked here, a write-up of options for range queries, using a VCF:INFO approach (but pointing to alternative use of other attributes): https://docs.google.com/document/d/1uePLlLMl0FzxZxDrsF9IxsC2nYvZ84029fzUD1ULNWI/edit# |
I was under the impression that in version 0.4 we would implement complex variants like the ones in the document https://docs.google.com/document/d/1uePLlLMl0FzxZxDrsF9IxsC2nYvZ84029fzUD1ULNWI |
I believe the decision made on today's call was to implement the first step as described in #20 (comment) and put off the changes proposed in the document above to a later time. |
There would just be the additional field to add: alternateBasesInfo. It's an optional parameter so it is completely backwards compatible. Details are in the following pull request: #65 |
To summarize related decisions made during the workshop yesterday:
We're going with #94 over #95 as the base for implementation. |
@mcupak I've added a comment to the DUP,DEL... PR https://github.com/ga4gh/beacon-team/blob/develop-proto-structural_and_ranges/src/main/proto/ga4gh/beacon.proto#L50 Actually, on re-reading VCF the reference value can stay "required", since values of Is this sufficiently verbose?
|
Closing since implemented in develop-proto branch. |
Proposal by Michael Baudis, please elaborate if insufficient.
For example:
INS[ATGC]+
DEL[0-9]*
DUP
Discussion on interpretation and use cases was already started in this document: https://docs.google.com/document/d/1PfSt0o0m59BRs92PtyDcP31fUl8QgMYSTiclHAXCG0s/edit?usp=sharing