Genomic Subgroup Meeting Notes
a. Denys and team are ingesting the various variant databases along with the HGNC notations.
b. JAX and Cosmic have licensing requirements and have approved OHDSI to receive variant database for ingesting in the OHDSI vocabulary.
c. OncoKB? Waiting for Timur to obtain the files
b. Checking with Alex about relaxing the filtering to provide a less restrictive extract. The tiers/levels of evidence Alex applies are as below:
• Tier I evidence (large clinical trials and standard of care). The number of biomarkers meeting this standard is quite a bit smaller.
• Level C: Case Studies
• Level D: Pre-clinical
• Level E: Inferential
a. The OHDSI approach is to flag canonical forms as identical if the HGVS representations agree in c., g. or p. (the sequence type), the coordinates, and the gene. The 3 fields Gene, SeqType and Variant are used to find the canonical variant (intersection); see the sketch after this list.
b. For the 1st version of the Genomic vocabulary, the plan is to include from ClinVar only the variants that intersect with other variant databases (full ClinVar has too many variants).
c. Denys has provided the load script to Violetta for loading the vocabulary. Violetta will look and come back with questions.
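A minimal sketch of the intersection described in item a above: split each HGVS expression into its parts and intersect sources on the (gene, seqtype, variant) triple, ignoring the RefSeq and version. The regular expression and sample inputs are simplified assumptions, not the actual load script.

```python
import re
from collections import defaultdict

# Simplified pattern for HGVS expressions such as "NM_004333.4(BRAF):c.1799T>A".
# Groups: refseq, version, gene, seqtype (c/g/p) and the variant description.
HGVS_RE = re.compile(
    r"^(?P<refseq>[A-Z]{2}_\d+)\.(?P<version>\d+)"
    r"\((?P<gene>[^)]+)\):(?P<seqtype>[cgp])\.(?P<variant>.+)$"
)

def split_hgvs(expression):
    """Split an HGVS expression into gene, refseq, version, seqtype and variant."""
    match = HGVS_RE.match(expression)
    return match.groupdict() if match else None

def canonical_key(parsed):
    """Canonical identity ignores refseq and version: only gene, seqtype and variant."""
    return (parsed["gene"], parsed["seqtype"], parsed["variant"])

def intersect(sources):
    """Return the canonical keys that appear in more than one source vocabulary."""
    seen = defaultdict(set)
    for source_name, expressions in sources.items():
        for expression in expressions:
            parsed = split_hgvs(expression)
            if parsed:
                seen[canonical_key(parsed)].add(source_name)
    return {key for key, names in seen.items() if len(names) > 1}

# Toy example: the same BRAF change reported against two transcript versions
# still collapses to one canonical key, so the two sources intersect on it.
sources = {
    "clinvar": ["NM_004333.4(BRAF):c.1799T>A"],
    "civic":   ["NM_004333.6(BRAF):c.1799T>A"],
}
print(intersect(sources))  # {('BRAF', 'c', '1799T>A')}
```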
a. Run query to validate the variant set from institutions (AJOU provided their variants).
b. NWU will take the vocabulary and do an assessment of how much of the variants are covered.
c. Tufts will have the data for assessment soon.
d. Other entities to validate against are MMRF (BAM files), TCGA and other publicly available data.
e. For v0.1, the criteria for managing variants that don't overlap are as follows (see the sketch after this list): if a variant is in one of the public cancer variant collections, it's in. If it comes from one of our data partners, what we get is already filtered for relevance to cancer, and we find reasonable overlap, then we add the ones that are missing. If we get a whole lot of variants and no overlap, there was probably no filtering for clinical relevance; in such cases we talk to the partner and make conclusions and decisions.
f. One of the studies we can do is we could create a large network of databases who have genomic data for their cancer patients and see where we stand against them in terms of variant overlap.
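A minimal sketch of the v0.1 inclusion rule from item e above; the function and argument names are hypothetical and simply restate the decision logic.

```python
def include_variant(in_public_collection, from_data_partner, partner_overlap_is_reasonable):
    """v0.1 rule of thumb for variants that don't overlap across sources (hypothetical names)."""
    if in_public_collection:
        return "include"   # public cancer variant collections are in
    if from_data_partner and partner_overlap_is_reasonable:
        return "include"   # partner data already filtered for cancer relevance; add what's missing
    return "discuss"       # little or no overlap: talk to the partner before deciding
```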
a. Peter will take the VCF file and create the output based on the vocabulary. We want to turn VCF into HGVS, and then decide whether to throw a variant away or add it in based on the overlap.
b. We want HGVS but we can take both. We want to turn VCF files into OMOP concepts; for validating the overlap, HGVS is easier.
c. Once Peter has 10 example lines of the vocabulary he can start writing the app, and a prototype can be ready in a week or two. The full HGVS will differ according to the transcripts we use, so the most robust way of doing this is to use VCF as the input. VCF gets around this problem by giving chromosomal positions: you can always turn those into HGVS, whereas going backwards from HGVS is a little harder.
d. Peter recommended doing the lift-over accurately as the references change. The issue the entire clinical world faces is the transition from hg19/GRCh37 to GRCh38, which is coming soon and works easily for about 95% of the variants. Peter stores a snippet of 10 nucleotides at each site that allows you to check whether the lift-over was successful; if OMOP wants to be as thorough, we could do the same. CR says OMOP wants to stay away from this. This is a challenge for the genomic variant database.
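As a rough illustration of the VCF-to-HGVS direction discussed in items c and d, the sketch below turns a single-nucleotide VCF record into an HGVS-style g. expression. It is a simplification (SNVs only, no normalization), and the chromosome-to-RefSeq mapping shown is only an assumed lookup, not part of any agreed tooling.

```python
# Assumed lookup from chromosome name to a GRCh38 RefSeq accession (illustrative only).
CHROM_TO_REFSEQ = {"7": "NC_000007.14", "17": "NC_000017.11"}

def vcf_snv_to_hgvs_g(chrom, pos, ref, alt):
    """Turn a simple VCF SNV (CHROM, POS, REF, ALT) into an HGVS-style g. expression."""
    if len(ref) != 1 or len(alt) != 1:
        raise ValueError("only single-nucleotide variants are handled in this sketch")
    accession = CHROM_TO_REFSEQ[chrom]
    return f"{accession}:g.{pos}{ref}>{alt}"

# Example VCF record: CHROM=7, POS=140753336, REF=A, ALT=T
print(vcf_snv_to_hgvs_g("7", 140753336, "A", "T"))  # NC_000007.14:g.140753336A>T
```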
a. Denys and team are ingesting the various variant databases along with the HGNC notations.
b. JAX, OncoKB and Cosmic have licensing requirements and have approved OHDSI to receive variant database for ingesting in the OHDSI vocabulary
a. Policy proposal : https://docs.google.com/document/d/1rsQx5yjArKWof50Vr6NMc3iWZc3vwvV6BYJH1aj6nN4/edit#
b. Proposal for Value Object Descriptors (petitioning to make part of the GA4GH VA standard) https://docs.google.com/document/d/1pV06Geh-Of3EMV_FWfVc_ZQjv9P_9SojnaPNsd-aP-U/edit
a. The idea is to use the 3 fields Gene, SeqType and Variant to find the canonical variant (intersection). We don't use the RefSeq in this step to identify the canonical variant.
b. In our approach to finding the canonical variants, we decided to split the HGVS expression into several parts: gene, refseq, version, seqtype and variant. In doing so we noticed that one source system gives us 2 records for the same chromosome position with the same variant but a different refseq. We also found that a source such as CIViC can give us transcript, gene and protein types, but not for all variants.
c. Another problem we noticed: we have different refseqs even for the same variant position. We took the gene as the identification of position, combined it with seqtype and variant, and then looked for the intersection of variants.
d. For the 1st version of the Genomic vocabulary, the plan is to include from ClinVar only the variants that intersect with other variant databases. We won’t use the full ClinVar because it will have too many variants.
e. For the Concept table (Genes), we will use HGNC as the vocabulary for genes. The field concept_name will include the gene symbol in brackets. As concept_code we will use the HGNC ID to distinguish the gene, rather than the Entrez ID, which is less useful because the same Entrez ID can correspond to 2 different variants.
f. For the field concept_name, Christian suggested having only the variant and the coding DNA sequence, for example, ‘Homo sapiens epidermal growth factor receptor (EGFR), Deletion in position from 2127 to 2129’. (See the examples below and the sketch after this list.)
ClinVar name: NM_004870.4(MPDU1):c.537C>T (p.Asn179=)
Concept_name: Homo sapiens mannose-P-dolichol utilization defect 1, Substitution in position 537, Cytosine replaced by Thymine, Asparagine in position 179 replaced by Self

ClinVar name: NM_001134398.2(VAV2):c.2136-10_2136-9insGTGACCGCCGGGGCCGTGTGGCCCTCACGCA
Concept_name: Homo sapiens vav guanine nucleotide exchange factor 2, Insertion in position from 2136-10 to 2136-9 and insertion of GTGACCGCCGGGGCCGTGTGGCCCTCACGCA

ClinVar name: NM_003664.4(AP3B1):c.1040+9T>A
Concept_name: Homo sapiens adaptor related protein complex 3 subunit beta 1, Substitution in position 1040+9, Thymine replaced by Adenine
g. How to give better names to the field concept_class_id as they have a 20-character constraint. In the current design, Denys uses the values ‘Genomic var’, ‘Protein var’, ‘Transcript var’.
h. For the concept_relationship table, we are going to build a relationship from the ‘Gene Variant’ to the ‘Protein Variant’, from the ‘Protein Variant’ to the ‘Transcription Variant’, and from the ‘Transcription Variant’ to the ‘Genomic Name’. We may not have some of the links since the ‘Transcription Variant’ and ‘Genomic Name’ are missing in some cases.
i. For the representation of mitochondrial variants, at what stage do we need to put the mitochondrial variant? Answer: it's just like genomic; we would have a link from Gene to Mitochondria to cDNA to Protein. This is going to be rare (there are only about 10 such cases). We have a small gap: if a variant in the source vocabulary doesn't have a protein-type form and only has transcript and genomic forms, then by missing the protein variant we also miss the relation from the transcript or genomic variant to the gene.
j. Once we do the canonicalization and de-duplication, we need to see how many variants don't have a protein attached to them and then make a decision. There are services on the internet that do this translation, or we could write a translator.
k. Denys will create a version 1 of the vocabulary and the relationships, and we'll take a look at what we get. Denys will modify the script based on the changes below:
1. Change concept names
2. For the relationships, use the canonical variant as the OMOP vocabulary entry and link all other vocabularies to it, so that we have ‘maps_to’ relationships from the source vocabularies into the canonical OMOP ones. We don't have any other relation between the sources (CIViC, ClinVar) other than “maps_to”; even though we use them to build the OMOP vocabulary, we don't actually put them in OMOP in version 1.
l. For Validation of the vocabulary
1. Collect what variants people have and see what’s going on.
2. For MSK, it would be easier to have something from the vocabulary team, go back to their team, and ask how much of it is covered.
3. We can run it against TCGA and other publicly available data; they have VCFs.
4. Shilpa will check with AJOU
5. Tufts is setting up with Foundation and getting PDFs back, so getting data from them will take more than one step. They will have data soon.
6. MMRF data
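The naming convention from item f above (and its ClinVar examples) could be generated mechanically from the parsed HGVS parts. The sketch below handles only simple c. substitutions and is an illustration of the convention, not the load script Denys is writing.

```python
# Full nucleotide names used in the generated concept names.
NUCLEOTIDES = {"A": "Adenine", "C": "Cytosine", "G": "Guanine", "T": "Thymine"}

def substitution_concept_name(gene_description, position, ref, alt):
    """Build a human-readable concept name for a simple c. substitution,
    following the convention shown in the ClinVar examples above."""
    return (f"{gene_description}, Substitution in position {position}, "
            f"{NUCLEOTIDES[ref]} replaced by {NUCLEOTIDES[alt]}")

# Mirrors the AP3B1 example above: NM_003664.4(AP3B1):c.1040+9T>A
print(substitution_concept_name(
    "Homo sapiens adaptor related protein complex 3 subunit beta 1",
    "1040+9", "T", "A"))
```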
a. The version of metaKB from Alex (received 7/7) is limited to those variants with tier I evidence (large clinical trials and standard of care) so we have only 145 records.
b. Alex can relax this restriction and include the evidence levels (1) Level C: Case Studies (2) Level D: Preclinical (3) Level E: Inferential. This will yield 891 variants. If only Level C: Case Studies is included, excluding Level D: Preclinical and Level E: Inferential variants, we go from 144 to 720 variants. We also need HGVS notations in the metaKB extract.
a. Alex presented the canonicalization policy to this group. Key discussion points (Alex’s approach):
i. Intent of the presentation is to clearly identify the scope of the problem we’re trying to address, challenges and provide the goals and timeline for the work.
ii. Presentation of Alex’s canonicalization policy scope and timeline proposal is as below: https://docs.google.com/presentation/d/1_axHdZdDMUuKEeL9Gkwu3i2YjimO-cdu-v24YX2Itpc/edit#slide=id.p
iii. The policy is intended to be a versioned policy for constraining representations of variants to a single facet.
iv. It is a procedure on how to select the appropriate variant form and semantics of the Genomic/Transcript/Choice
v. It is a locked version of the annotation resources to support variant representation
vi. It relies on established practices and resources for generating variant representations and focuses on interoperability
vii. The policy tries to cover as many variations as possible, with the intent of extending it as we run across cases that warrant additional context.
viii. Plan is for a draft to be sent out to OHDSI for review by August 7. Based on feedback, v0 of the policy will be released on August 14th to the OHDSI group for our testing. Based on feedback from testing, revision etc. V1 will be released in September.
x. The draft that will be sent out for review in the first week in August, gives OHDSI a chance to read through the policy, think about the implication of how we’re organizing and selecting canonical variants and provide feedback on whether it captures the use cases we have in mind. This will help Alex create the V0 policy which OHDSI can use to play with our datasets and execute use cases to see if it meets the use case.
xiii. One of the things the policy does is specify the reference genome we intend to use (GRCh38): if you can represent your variant in GRCh38, that's the canonical form; if it cannot be represented in GRCh38, then use GRCh37, else GRCh36, or if it's in another assembly then provide it in that other form.
b. Key discussion points (Christian’s approach):
i. Christian reviewed the OHDSI approach that is under development that uses the IDs to see whether we can build something like what Alex has in the absence of a proper policy.
ii. There are 2 options being explored by OHDSI, (1) do the combo or (2) keep them separate and connect them.
iii. If we keep them separate and connect them, per Alex, internal linkages are fine, but passing these linked and grouped variation concepts around as "the variation" could be challenging.
iv. This is because if there is another variation that is not a precise match, then what does that mean? Per Christian, why not keep the canonical forms separate at 3 different levels, make a policy for all 3, and connect the 3 to each other. The connections are called ‘transcribed’ and ‘translated’.
v. Per Alex, the intent here is to be able to coordinate, between groups, a representative form that means a certain thing. But the notion of separating genomic, transcript and protein forms as separate entities is reasonable even within the policies that the VICC has started defining. If we have a specific transcript that we want to make an annotation against, then we would want that to be the canonical form we attach the annotation to; you can then link it to the underlying genomic variant, and when people come to research that genomic variant, they can have the transcript variant and the annotation associated with it.
vi. OHDSI needs to have concepts that represent a certain variant. We can't say "there is GRCh38 and there is GRCh37 and that's your problem to deal with"; we must solve this problem for them. We must make a declaration that this GRCh38 variant and this GRCh37 variant are the same thing.
vii. One extreme (cobble nothing together) is to say that everything is different: a different reference sequence makes it different unless it matches 100%.
viii. We create a long list of links between them to bring them back together.
ix. The other extreme (cobble everything together) is to create a g/c/p combo, a Frankenstein instance, and say everything in it means the same thing.
x. A 3rd way is Alex’s policy where the declaration follows a canonical prescription and if you follow that policy, identical things will look identical.
xi. Christian's approach is to mark canonical forms as identical if the HGVS representation you got somehow agrees in c., g. or p. (the sequence type), the coordinates, and the gene. Alex suggested using the gene symbol as the anchor point on which you put a coordinate and a variant; having a versioned identifier as the anchor point would be a good starting point. Aggregating up to the level of gene symbols, while ignoring the complexity of gene symbols, is an issue.
xii. Peter recommended using HGNC identifiers. HGNC and the College of American Pathologists have started to use HGNC IDs as a way of anchoring gene symbols, which are now slowly starting to change less often. Just having the gene symbol is risky because any gene has at least 7 synonyms (gene symbols are re-used) and sometimes they are identical between genes, so it's better to use both the approved symbol and the HGNC ID.
xiii. Alex suggested coordinate + gene symbol + HGNC ID, and also including the identifier version in the RefSeq columns.
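A minimal sketch, under assumed field names, of the anchor suggested in points xii and xiii: approved gene symbol plus HGNC ID plus coordinate as the identity, with the versioned RefSeq kept as an attribute of each source representation rather than as part of the identity. The EGFR values are illustrative only.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CanonicalVariantKey:
    """Anchor for grouping representations: approved gene symbol + HGNC ID + coordinate."""
    gene_symbol: str   # approved symbol, e.g. "EGFR"
    hgnc_id: str       # stable identifier, e.g. "HGNC:3236"
    coordinate: str    # position within the chosen sequence type, e.g. "c.2369"
    change: str        # the alteration itself, e.g. "C>T"

@dataclass
class VariantRepresentation:
    """One source representation attached to the canonical key."""
    key: CanonicalVariantKey
    refseq: str        # versioned reference sequence, e.g. "NM_005228.5"
    source: str        # e.g. "civic", "clinvar"

representation = VariantRepresentation(
    key=CanonicalVariantKey("EGFR", "HGNC:3236", "c.2369", "C>T"),
    refseq="NM_005228.5",
    source="civic",
)
print(representation.key)
```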
a. The version of metaKB from Alex (received 7/7) is limited to those variants with tier I evidence (large clinical trials and standard of care) so we have only 145 records.
b. Alex will relax this restriction and include the evidence levels (1) Level C: Case Studies (2) Level D: Preclinical (3) Level E: Inferential. This will yield 891 variants. If he includes up to Level C: Case Studies but excludes Level D: Preclinical and Level E: Inferential variants, we go from 144 to 720 variants.
c. Need Alex's extract to include hgvs notations so we can develop a preliminary canonical structure.
d. Alex will present the version of the canonization policy to this group on 7/21. He has received some great feedback from the VICC community and will present the same to OHDSI.
2. OHDSI V1 approach: While we wait for Alex to come up with a canonization approach, the plan is to adopt our own version 1 of the same.
a. Download the variant vocabularies ourselves [NCIt, CAP, CIViC, CGI, MolecularMatch and LOINC].
b. Put them into a database
c. Join everything using HGVS notations. We want to develop a simple consolidation approach of insertions, deletions and duplication based on gene symbol, sequence type (g, c, p), reference sequences, versions and locations.
3. With the exception of JAX, Cosmic and OncoKB, all the ones listed above can be openly downloaded. Will reach out to the ones that need a license to initiate a discussion to download their vocabulary.
-
For version 1, Alex needs to give us an extract of metakb in such a way that there is an entry for each variant and the different HGVS notations. It should also include the code from the sources. We can create the version 1 which will be a simple SNF repository. This is all we need at this point.
-
We can only combine those HGVS notations that they give us. Since the notation requires a reference sequence, and there is more than one reference sequence for the same gene, if we don't get all the reference sequences they will look like different entries (redundant). This is okay for version 1.
-
For the first version, an incomplete, perfect-identity mapper for reference files is fine. We may get a few things split into 2, but that won't be too many. All we need now is to get, from the different sources, the HGVS notation with the coordinates, what changed, and the reference ID for the reference sequence.
-
Also, we want to get data from different institutions and see if they can find the concepts in their data. In what form is the genomic data stored at different institutions? Chan's has an input format which is a list of the variants. For evaluation of the variants, we will create the list of concepts, put them in a spreadsheet, and request institutions to find the variants in their data, or look at their data and verify whether the variants exist in our first version of the vocabulary. Since we don't have the reference files, the question we need to answer is what constitutes a gene for which the coordinates are provided.
-
LOINC and NCIt -> If they have HGVS notations we are good. There are LOINC codes that say ‘BRAF gene variation/mutation’; we don't have these kinds of things and need to map them manually. Focus on things that don't have HGVS notations, which are fusions, translocations and copy numbers.
-
Andrew suggested that we start thinking about the separate issues around genomics as we're starting to collect data; there are different issues around the use of the data. Are we going to share a strategy with the sites related to patient protection? As soon as we start bringing in real data, it might be worthwhile to create a strategy to conceptualize and deal with the data in an appropriate way so it can be handled at scale. We are better off for now because (1) we artificially limit the variants we can see, and (2) we are talking about somatic variants, which are different from what your identity is. As soon as we roll this out beyond cancer, we need to think about a strategy around privacy. Children's Hospital of Colorado has a genomic data warehouse (de-identified OMOP) and came up with a strategy for dealing with this data; we can have Tiffany come and present.
Key discussion points about metaKB extract received from Alex:
-
What we have from Alex is a CIViC extract.
-
What we still need is to arrive at a definitive canonical metaKB that disambiguates the identity of a concept.
-
The urgent problem is the canonical identification of the variants we are trying to get to. These variants have to be uniquely defined in the OMOP vocabulary. We need to define what a variant is.
-
What we need is a structure, which metaKB will provide, with a concept in the middle that represents a variant, so that everything from CIViC, COSMIC etc. is mapped to this uniquely identifiable ID. Our idea for phase 1 is not to go too deep; there might be variations that in reality have no effect on clinical research. If we go too broad, we might miss important details. What are the features that make a variant the same, etc.
-
Per Alex, metaKB provides a notion of clinical relevance that he can filter on, keeping records that have evidence of tier 1 significance (strong clinical significance). Alex can start by filtering the list down to only those variants that have clinical significance; this will dramatically reduce the list to about 50% of the total variant count.
-
Alex talked to ClinGen, who are building out their own representation for the ClinVar resource. ClinVar has its own mechanism for aggregating variant concepts under its own variant identifier. Alex is working with them to build a specification for how a canonical form of a variant can be selected; once you're able to make a selection, the representation is simple and there are a number of ways to represent it. This process with ClinGen has started, with weekly meetings between VICC and ClinGen to draft a first version of the rules.
-
For disambiguation -> Alex will put forth a canonical strategy (aligned with ClinGen's strategy) and come up with one policy. This can be applied to metaKB. Alex will bring it back to us and get our feedback.
Next Steps for metaKB:
-
Reduced set (clinically relevant) of variants from CIVIC. Additional variants coming from clinical resources as part of the canonical set of omop variant concepts
-
Come up with a canonical strategy, with rules that will allow us to connect concepts and make them cross-searchable. CIViC talks at the protein level because the studies they are doing group people with the same protein alteration irrespective of the genomic changes that cause it, while clinical data sets have observed genomic changes; we want to link those up so that one can find the other. Alex has some tooling already that will make this straightforward to test for the simple variants (small SNVs and indels), and it will require fewer concepts to support.
-
For validation/feedback of the canonical strategy, we will apply the canonicalization strategies to multiple data sets (MMRF) and see how well things parse out and where they fall apart. Timeframe for the first draft of the basic canonical strategy -> the goal is to get this done asap, working through the weekly meeting over the next 2-3 meetings. Before the end of the summer there will be a draft for review by the various teams.
Next steps for OHDSI team:
-
We need to work on an Intermediate file that will act as a translation between metaKB and OMOP (staging).
-
With this, we need to ensure that the mappings between metaKB and OMOP IDs are stable across different versions. It would be helpful to sketch out what the software would do to get from metaKB to the OMOP table. As soon as we have the file format, Peter can consume it in his application.
Next steps for Peter:
- Take Alex's file as input, look in the VCF, and find variants from Alex's file, outputting the concept ID for the variants and other data from the VCF and metaKB.
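Peter's program is written in Java; the sketch below is only a Python illustration of the flow described above — look up each VCF variant in a staging table keyed on chromosome, position, ref and alt, and emit person ID, concept ID and date. The staging-table layout and file names are assumptions.

```python
import csv

def load_staging_table(path):
    """Staging table (assumed CSV layout): chrom,pos,ref,alt,concept_id."""
    lookup = {}
    with open(path, newline="") as handle:
        for row in csv.DictReader(handle):
            lookup[(row["chrom"], int(row["pos"]), row["ref"], row["alt"])] = row["concept_id"]
    return lookup

def vcf_to_concepts(vcf_path, lookup, person_id, sample_date):
    """Yield (person_id, concept_id, date) for every VCF variant found in the staging table."""
    with open(vcf_path) as handle:
        for line in handle:
            if line.startswith("#"):      # skip VCF header lines
                continue
            chrom, pos, _vcf_id, ref, alt = line.rstrip("\n").split("\t")[:5]
            concept_id = lookup.get((chrom, int(pos), ref, alt))
            if concept_id is not None:
                yield person_id, concept_id, sample_date

# Usage with hypothetical file names:
# lookup = load_staging_table("staging_variants.csv")
# for row in vcf_to_concepts("patient1.vcf", lookup, person_id=42, sample_date="2020-07-01"):
#     print(row)
```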
-
Status on open tasks
a. OHDSI cloud S3 bucket in the US-EAST (N. Virginia) region - Complete
b. AWS resources needed to process the data – Waiting on access to data
c. dbGaP access – Waiting on approval
d. Proposal for Steve (MMRF) and possibly funding request based on the proposal – Christian and Steve have worked this out and Steve has the information he needs to review the proposal for funding
e. metaKB file extract from Alex – Alex sent us the extract
g. App to convert VCF file to csv using CIVIC
Peter has written a Java program that parses through a metaKB file, takes a VCF file as input, looks for variants that match against the staging tables (vocabulary), and prints out the data. The input is a VCF file; the software generates a transcript interpretation file.
The staging table sitting outside the concept tables will be the input for Peter’s program to get to the concepts. The staging information needs to be centrally located so other tools can utilize it as well.
We would want a special version for OHDSI to distribute to people. The output is the concept ID from OHDSI, the person, and a date stamp.
The MVP can be scoped to include variants with fewer than 25 nucleotides. The software is able to cover around 95% of scenarios from the VCF files. Edge cases are not addressed; some interesting mutations are difficult to express in VCF. Making this robust for a real study would require looking into the edge cases. Focus on both coded/known and non-coded variants.
Test out the application against data, make sure we’re capturing sufficient % of variants https://github.com/pnrobinson/omopulator
Peter can provide a list of those metaKB entries that are easy to obtain from the VCF. Discuss with Alex and the group and make sure we are on the same page.
Either we produce, or Alex gives us, an extract of the necessary pieces; Peter will help us determine what the necessary pieces are.
Denys will get the JSON file. What should we pick from the JSON file? Once we have picked this, we can build an extractor and hand it over to Peter to plug into his application. We need to sit with Peter to identify what data needs to be pulled out of the JSON. If Alex has an API, then we can pull the data down using the API.
What we need to discuss with Alex is what we need. We did some homework to see what that is. Are the 20 or so fields enough to parse the VCF in a reliable way and generate enough information to perform clinical studies?
Google spreadsheet for a suggested format of the staging tables: Peter will put one up on docs/drive as a starting point to iterate on.
Andrew's question -> The biggest challenge will be for people to get their data into the format. Can Galaxy be used to template a workflow from a few widely used sources? The vast majority of data is pulled from Epic or Cerner, and the vast majority of genomic data does not make it into Epic.
-
Progress/Update on set up a secure OHDSI AWS S3 bucket for MMRF CoMMPass data download from the public dbGAP domain.
(a) Lee has created the OHDSI cloud S3 bucket and called it 'ohdsi-commpass' in the US-EAST (N. Virginia) region.
(b) Lee from OHDSI can provide the AWS account needed to process the data
(c) Shilpa is working through the dbGaP approval process to get access to the MMRF vcf files
(d) Meanwhile, Christian is working on a proposal for Steve (MMRF) and possibly funding request based on the proposal.
-
Status on metaKB file extract from Alex
(a) While we wait for Alex to give us the extract of the metaKB, we are waiting for Timur to download the XML files.
-
App to convert VCF file to csv using CIVIC
(a) Peter has written a Java program, similar to SnpEff, that parses through a CIViC file listing the variants with their metadata (it can also do the annotations, which is simple to do), takes a VCF file as input, looks for variants that are identical, and prints out the data. Peter will make it easy for other people to run without setting anything up. The idea behind the parser application is for it to become a universal ETL for sites that have a VCF, or to be slightly modified to work at different institutions.
(b) Edge cases are not addressed; some interesting mutations are difficult to express in VCF, for example the variable translocation from CML. The way a translocation is represented in a VCF file varies because the community has not standardized it. Making this robust for a real study would require looking into edge cases, but it does not make sense to worry about the 5% of edge cases. For the 1st pass, if we're talking about variants smaller than 25 nucleotides, it's easy; structural variants might not be coming from the VCF because that's not easy, and those scenarios will be difficult. It's best to say we'll just get what's in the VCF files, spike in new mutations, and have that act as an integration test.
(c) Peter suggested that if this is interesting, we try to get someone to test out the system; once we're sure we can capture a sufficient % of variants, we can extend it and make it a more official part of OMOP.
(d) What proportion of the difficult edge cases is important for the first pass? If we are talking about single nucleotide variants it's easy; if we are talking about things like structural variants that we may want to have, it might take 1-2 months.
(e) What input formats are we expecting? Vcf is going to be more or less standard at each center.
(f) For the 1st pass, we will focus on known variants: given an OMOP identifier, how much underlying key information about each of the variants is represented in the OMOP tables?
(g) Test the application on various vcf files. Below is a link to the app https://github.com/pnrobinson/omopulator
(h) Some groups like the ‘phenopackets group’ might be interested in collaboration and possibly want to use OMOP.
(i) Once we have the metaKB extract from Alex, Peter will work on developing the app. If we don't know what the target is, then it's not worth spending time on this.
(j) For non-coded variants, we should consider those as well besides the coded variants.
-
Progress/Update on set up a secure OHDSI AWS S3 bucket for MMRF CoMMPass data download from the public dbGAP domain.
a. Lee has created the OHDSI cloud S3 bucket and called it 'ohdsi-commpass' in the US-EAST (N. Virginia) region.
b. Lee from OHDSI can provide the AWS account needed to process the data -> Lee asked if we have any technical requirements for the AWS computing resources we are requesting.
c. Meanwhile, Christian is working on a proposal for Steve (MMRF) and possibly a funding request based on the proposal.
-
Status on metaKB file extract from Alex
a. Pushed out by another 2 weeks.
b. In the meantime, Timur (Odysseus) will extract the XML files to load in our environment. We can use the 2 weeks to go through the data and do an assessment based on the questions brought up by Ron and Sarah. We can start assessing the size / frequency / impact of the issues, and also evaluate the edge cases, e.g. where it's not a simple SNP. Denys will report back next week.
-
Discussed Denys's presentation of the mapping of the Genes, Variants, Gene-Variant relationship, and Synonyms for Genes and Variants: https://github.com/OHDSI/OncologyWG/blob/master/documentation/GenomicVocabulariesMVP1.pptx A few questions that came up in the meeting were as below:
i. In slide 8, what is the difference between the columns ‘synonyms’, ‘hgvs’ and ‘hgvs_p_suffix’? And why is hgvs_p_suffix not mapped to anything?
1. Per Christian, the hgvs_p_suffix field (or any other representation we might get in the data) is only the suffix and is not a synonym of the whole aberration, which is why we may not need to map it.
2. It is not a full HGVS expression. The p. values are protein changes, and different mutations can often have the same protein change, so it won't uniquely identify a variant.
3. ‘hgvs_p_suffix’ only represents the protein change without the exact location, and some of this information is already in 'hgvs', so mapping it won't be useful.
4. Some questions to consider in the future: what do we consider identical — an aberration at the DNA level, at the RNA level, at the protein level or higher? We may have a slightly hierarchical world with DNA, proteins, genes etc. in the hierarchy. We want to see the XML and see what's happening.
ii. Also, slide 8 has more than 20 or so values in the columns ‘synonyms’ and ‘hgvs’, while slide 10 (concept_synonym table) has only 1 value from the ‘synonym’ column mapped and 1 ‘hgvs’ record mapped. Is this for a reason, or only to demonstrate with 1 example? -> Yes, it was done to show an example.
iii. There may be sequence data for both a matched normal and tumor sample. How will this be handled? -> We are mapping only the variant currently. No sequence information. We need for each patient a collection of found variants. The fact that they have normal is irrelevant.
iv. Is the plan to expand to novel variants? -> Coding DNA sequences are not in Phase 1. The only thing represented in the vocabulary is the name of the variant. The goal is to have a measurement record on the day the sample was taken; everything else gets thrown away. We have reference information, and we have to think about what we want to put in the reference information, because in order to parse VCF files we need more than just the names. There are already annotation pipelines that annotate VCF files and assign known variants their unique identifiers. As long as the unique identifiers, common names and synonyms are ingested, it should be straightforward, and we would not need all the chromosome, underlying nucleotide change and genome build details. For novel variants, you would need all this information to do the mapping and decide what to call the novel variants.
v. Ron Miller's concerns: if we extract all the values from the HGVS database and rely only on the gene name and the c. value (NUCLEOTIDE_CHANGE), we could run into issues. Below is an example we came across in our review of the database:
FNAME               SYMBOL  GENEID  VARIATIONID  ALLELEID  NUCLEOTIDE_EXPRESSION     NUCLEOTIDE_CHANGE  PROTEIN_EXPRESSION
hgvs4variation.txt  SUN1    23353   461641       457968    NM_001367676.1:c.1796C>T  c.1796C>T          NP_001354605.1:p.Ser599Leu
hgvs4variation.txt  SUN1    23353   461642       57631     NM_001367675.1:c.1796C>T  c.1796C>T          NP_001354604.1:p.Thr599Met
hgvs4variation.txt  SUN1    23353   461643       457970    NM_001367647.1:c.1796C>T  c.1796C>T          NP_001354576.1:p.Ala599Val
hgvs4variation.txt  SUN1    23353   461643       457970    NM_001367668.1:c.1796C>T  c.1796C>T          NP_001354597.1:p.Ala599Val
All the variants are in the same gene (SUN1) and all of them can have the same c. value (c.1796C>T) depending on the specific transcript that is referenced: an identical variant at the base-pair level, but different protein translations. If you rely only on that, you could have issues for these variants. -> In many cases this is a different genomic change as well, and from the g. or p. perspective it may be a different amino acid. There is an approach that Ron has implemented that might be useful to look into. It happens infrequently now, but as more variants are identified and cataloged in HGVS it may increase. Another question about using HGVS is that it is very gene-, RNA- and expression-focused, compared to a VCF file, which is just a catalog of all variation as it sits on a chromosome. Many times a VCF file does not have a corresponding HGVS value, because the associated analysis was not done to map it back to how it impacts the RNA molecule (to give you your c. value) or the protein (to give a p. value). Not everything in dbSNP, which catalogs variation, is in HGVS; there are many more variants in a dbSNP database than in the HGVS file. HGVS uses gene landmarks to identify things, but genes cover only about 1% of the genome, and there are other mutations within intergenic regions that you cannot identify using HGVS nomenclature because there is no protein, RNA molecule, etc.
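Ron's point can be reproduced directly from the four example rows above: keying only on gene symbol plus c. value collapses records that translate to different proteins. The sketch below just groups those rows.

```python
from collections import defaultdict

# The four ClinVar rows from the example above: (symbol, c. change, protein expression).
rows = [
    ("SUN1", "c.1796C>T", "NP_001354605.1:p.Ser599Leu"),
    ("SUN1", "c.1796C>T", "NP_001354604.1:p.Thr599Met"),
    ("SUN1", "c.1796C>T", "NP_001354576.1:p.Ala599Val"),
    ("SUN1", "c.1796C>T", "NP_001354597.1:p.Ala599Val"),
]

proteins_by_key = defaultdict(set)
for symbol, c_change, protein in rows:
    proteins_by_key[(symbol, c_change)].add(protein)

for key, proteins in proteins_by_key.items():
    if len(proteins) > 1:
        # One (gene, c. value) key collapses several distinct protein changes.
        print(key, "->", sorted(proteins))
```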
-
Progress/Update on set up a secure OHDSI AWS S3 bucket for MMRF CoMMPass data download from the public dbGAP domain.
a. Lee will create the OHDSI cloud S3 bucket and call it 'ohdsi-commpass' and create it in the US-EAST (N. Virginia) region.
b. We need to provide the AWS account we will use to process the data so Lee can grant it read/write access to that S3 bucket. The AWS resources you use to process the data will need to be deployed in the same AWS region as the S3 bucket to avoid AWS intra-region transfer costs and to avoid AWS network data export costs. Uploading data to the S3 bucket from anywhere is free.
c. Who will provide the processing AWS resources? -> Christian working on a proposal for Steve which will include a request to fund for the AWS resources to perform the ETL job.
d. Link to download the data https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs000748.v1.p1
e. How big are the files (Gigabytes/Terabytes)? Waiting on Shaadi to provide this information.
f. How long would you need the files to be stored? Waiting on Shaadi to provide this information.
g. How many files are involved? Waiting on Shaadi to provide this information.
h. Is there any way to automate the download if there are a large number of files? Don’t know yet but for the time being we don’t expect to download more than the 1 time for the POC.
-
Status on metaKB file inclusion from Alex -> in progress
a. Alex plans to have the metaKB extract to us by Friday of this week
-
A detailed presentation of the mapping can be found in the link below: (The data being mapped is limited to Genes, Variants, Gene-Variant relationship, Synonyms for Genes and Variants) https://github.com/OHDSI/OncologyWG/blob/master/documentation/GenomicVocabulariesMVP1.pptx A few questions that came up in the meeting were as below:
i. In slide 8, what is the difference between the columns ‘synonyms’, ‘hgvs’ and ‘hgvs_p_suffix’? And why is hgvs_p_suffix not mapped to anything? Per Christian, the hgvs_p_suffix field (any other representation we might get in the data) is only the suffix and is not a synonym of the whole aberration which is why we may not need to map it.
ii. Also, slide 8 has around more than 20 or so values in the columns ‘synonyms’ and ‘hgvs’. Slide 10 (concept_synonym table) only has 1 value from the ‘synonym’ column mapped and 1 ‘hgvs’ record is mapped. Is this for a reason or only because you were trying to demonstrate by showing 1 example?
-
Genomic project next steps:
a. Setup the AWS S3 bucket and processing environment
b. Download the CoMMPass data for processing
c. Odysseus team has the database capacity to store the metaKB extract from Alex in their environment
d. Ingest metaKB extract into the CDM vocabulary
e. Denys will start working on the preliminary code to map HGNC and metaKB to the OMOP Vocabulary
f. Peter will set up a demo to show the mapping from the VCF file to the vocabulary. He has a software tool called Jannovar (a Java library for exome annotation) that is a fast way of going from VCF to transcript annotations, hooked up to different databases. The tool adds an annotation field to a VCF file.
i. Peter's application will query the OMOP Vocabulary to obtain the concept codes and low-level coordinates needed to inspect the VCF file.
ii. The application will process the VCF file and the codes and produce an output in CSV format.
iii. Input to Peter's application will be a VCF and a feature list with identifiers; output is Person ID and Concept ID for the variant. This can then be provided to the public internet to help them with their input format.
-
Andrew asked about the edge cases that Peter mentioned, where the strategy was going to be challenged. We had determined in the last meeting that this was okay because we wanted to get the MVP out and would consider development in phases. Are we clear on where we are drawing the boundaries of what we want to accomplish with this run? Christian commented that we can let Peter tell us, through his development and testing process, where the application breaks down. We can then determine the needs of the MVP and the various phases in the process.
-
Below is what we have requested from Alex
i. An ID (we create)
ii. A source code (probably HGVS notation or a Variant Lexicon ID or a ClinGen ID, we don’t know yet)
iii. A human readable and computer searchable description
iv. Some category of what it is, both in terms of type of mutation as well as granularity
v. Links to the source of the information
-
Alex suggested we start with CIViC variants, as CIViC is one of the larger resources of the metaKB by variant count and they are fastest/easiest to work with. Alex is going to put together a table that he thinks makes sense given the above requirements, and we discuss on a call. Once we're happy with the table and strategy, he will expand it out to all the applicable metakb variants.
-
Possibly setup another follow-up meeting with Alex to discuss after our initial review of what we get from him
- Ingest HGNC as the gene ontology (this is based on the original plan). It provides human-readable names, which will be very useful. metaKB gives us only abbreviations of genes, and we need the HGNC vocabulary as it will be useful for searching the genes. We need to create relationships between Gene and Variant as they are present in metaKB and HGNC.
- If HGNC source gives us other names, then we can put it in the Concept_Synonym table
- Concept_ID in the Concept table will be created by us
- Concept_name is from the definition field in metaKB
- Concept_class_id is from the biomarker_type field in metaKB
- The concept_synonym_name field in concept_synonym table will have the value from the fields hgvs or synonyms from the metaKB files
- The field entrez_id (present in both HGNC and metaKB) will be used to link the HGNC record in the concept table to the metaKB record in the concept table
- The entrez_id could be mapped to concept_code in the concept table, but since HGNC has its own hgnc_id, the hgnc_id is being mapped to concept_code instead. This needs further discussion (see the sketch after this list).
- A detailed presentation of the mapping can be found in the link below: https://github.com/OHDSI/OncologyWG/blob/master/documentation/GenomicVocabulariesMVP1.pptx
- Denys will start working on the preliminary code to map HGNC and metaKB to the OMOP Vocabulary.
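A minimal sketch of the mapping described in this list, using made-up example values: HGNC supplies the gene concept (hgnc_id as concept_code), the metaKB fields definition/biomarker_type/hgvs populate the variant concept and its synonym, and the shared entrez_id is used only to link variant to gene. Concept IDs and the relationship name are placeholders, not agreed values.

```python
# Hypothetical HGNC record and metaKB variant record that share an entrez_id.
hgnc_record = {"hgnc_id": "HGNC:1097", "symbol": "BRAF",
               "name": "B-Raf proto-oncogene, serine/threonine kinase",
               "entrez_id": "673", "synonyms": ["BRAF1"]}
metakb_variant = {"definition": "BRAF V600E", "biomarker_type": "mutant",
                  "entrez_id": "673", "hgvs": "NM_004333.4:c.1799T>A"}

gene_concept = {
    "concept_id": 2100000001,                               # assigned by us
    "concept_name": f"{hgnc_record['name']} ({hgnc_record['symbol']})",
    "concept_code": hgnc_record["hgnc_id"],                 # HGNC ID, not the Entrez ID
    "vocabulary_id": "HGNC",
}
variant_concept = {
    "concept_id": 2100000002,                               # assigned by us
    "concept_name": metakb_variant["definition"],
    "concept_class_id": metakb_variant["biomarker_type"],
    "vocabulary_id": "metaKB",
}
concept_synonym = {"concept_id": variant_concept["concept_id"],
                   "concept_synonym_name": metakb_variant["hgvs"]}

# The gene and the variant are linked because they carry the same entrez_id.
if hgnc_record["entrez_id"] == metakb_variant["entrez_id"]:
    concept_relationship = {"concept_id_1": variant_concept["concept_id"],
                            "concept_id_2": gene_concept["concept_id"],
                            "relationship_id": "Has gene"}  # placeholder relationship name
    print(concept_relationship)
```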
-
Peter has a software tool called Jannovar (a Java library for exome annotation) that is a fast way of going from VCF to transcript annotations, hooked up to different databases. It adds an annotation field to a VCF file. He will set up a demo going from the VCF file to the vocabulary once he receives the metaKB extract that Alex is going to provide.
-
The high-level next steps are as follows:
i. Alex to produce the extract of metaKB
ii. metaKB extract needs to be ingested into the CDM vocabulary
iii. Peter’s application will query the OMOP Vocabulary to obtain the concept codes and low-level coordinates needed to inspect the VCF file
iv. The application will process the VCF file and the codes and produce an output in the csv format.
v. Input to Peter’s application will be a VCF and a feature list with identifiers, output is Person ID, Concept ID for the variant. This can then be provided to the public internet to help them with their input format.
Review attributes Denys provided for our conceptual starting point before sending this information to Alex. Alex will pull out the relevant data from metaKB, standardize it to the specification that is required by OHDSI and create a recurring flat file that we can grab.
1) There are additional fields that we may not need. For example, we probably don't need the drugs. The group will review and provide feedback on Google Docs.
2) How can we identify the unique ID? Is Alex going to give this to us?
3) We may consider providing Alex with a high-level description of what we want; that way he can assist with which fields will be useful to us.
4) Request Alex to provide a CSV that has Allele ID, Common Name, Synonyms. The allele is at the DNA level, and then we have the gene level.
5) Questions for Alex ->
i. Is column AC (Links) the same value for similar variants, as in the case of rows 7 and 8?
ii. Is there one item/variable in this link that we can choose to anchor similar variants even though they are from 2 different knowledge bases? Where is the common identifier? Is it the Allele ID?
iii. If 2 variants from 2 different knowledge bases are the same, is it safe to assume that they have the same Allele ID?
6) Need more examples from Denys especially for similar variants.
-
Answered use case -> What is the impact of translocation of chromosomes 4 and 14 on prognosis of Multiple Myeloma (disease progression)
-
Unanswered use case -> Does the presence or absence of the Q121 translocation or insertion of 1P (Chromosome 14) impact the prognosis of high-risk patients?
-
Time to progression can be obtained from the clinical data; the public domain has genomic and clinical data; 2 tables in the clinical data give time to progression and the extracted translocations. Clinical and genomic files are maintained in 2 different tables. We don't create cuts of data that we maintain; we achieve this using cohorts. There is a table called cohort that preserves the data needed for the study.
-
Ask Lee Evans what the security level is on the OHDSI AWS cloud. Maybe we can do ETL with a subset of the data.
a. Plan is to download this from the public domain. Need to work on storage of the data.
b. Christian is working on resource assignment for OMOP conversion of clinical data
a. Answered use case -> What is the impact of translocation of chromosomes 4 and 14 on prognosis of Multiple Myeloma (disease progression)
b. Unanswered use case -> Does the presence or absence of the Q121 translocation or insertion of 1P (Chromosome 14) impact the prognosis of high-risk patients?
a. Since the goal of the OHDSI Genomic efforts is to have a list of concepts, it was decided that metaKB is a good place to start. OHDSI Genomic WG to figure out what attributes are needed for our conceptual starting point. Based on the attributes, Alex will pull out the relevant data from metaKB, standardize it to the specification that is required by OHDSI and create a recurring flat file that we can grab. Coordinate on a repo to place this information.
b. If OHDSI gets new concepts, assuming they are in the cancer domain, OHDSI can provide these to CIVIC that has a curation interface. metaKB updates from CIVIC so it goes through the natural flow without manual intervention.
a. Dima walked through the attributes we will need for our conceptual starting point and provide this information to Alex. Alex will pull out the relevant data from metaKB, standardize it to the specification that is required by OHDSI and create a recurring flat file that we can grab.
b. Genomic Subgroup team will review the attributes for completeness for the MVP before sending it to Alex.
1) Update on meeting with MMRF to discuss next steps with their contribution to the OHDSI Genomic efforts
a. CoMMpass is in the public domain (NCBI dbGaP). URL is as below:
https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs000748.v1.p1
The Multiple Myeloma Research Foundation (MMRF) CoMMpass (Relating Clinical Outcomes in MM to Personal Assessment of Genetic Profile) trial (NCT01454297) is a longitudinal observational study of 1000 newly diagnosed myeloma patients receiving various standard approved treatments that aims to collect tissue samples, genetic information, Quality of Life (QoL) and various disease and clinical outcomes over 10 years.
b. Arizona University's local server is contracted as the CRO for MMRF. A phased project will move this data from the CRO to the public domain; every 6 months the CRO releases a new version of the data to the public domain. Data can be downloaded from dbGaP. Overall survival data points are present in the public domain and have all the variables needed for genomic data adoption. The older version of the dataset has the same variables and format we need for our POC.
c. Shaadi has 2 use cases that she will be sharing with us today, and will provide background and details in the next meeting (slides to communicate the use cases).
i. Answered use case -> Impact of translocation of chromosomes 4 and 14 on prognosis (disease progression) of MM.
ii. Unanswered use case -> Does the presence or absence of the Q121 translocation or insertion of 1P (Chromosome 14) impact the prognosis of high-risk patients?
d. Christian to discuss funding of OMOP conversion with Steve (MMRF)
e. JAX has a database (Clinical knowledge base -> ckb). There is a public version of this (link below)
https://ckbhome.jax.org
1) Meeting scheduled with MMRF on Wed, 4/30 to introduce Steve to the Genomic project and discuss next steps with their contribution to the OHDSI Genomic efforts.
a. What attributes are used to represent variant information?
i. Interval, Sequence location, Sequence state, Allele, Variation
b. How can we include these attributes into our vocabulary?
i. Discussed both MetaKB and ClinGen which include data from various sources in a harmonized manner.
ii. CAR has 910 million alleles and counting from the following sources: dbSNP, ExAC, gnomAD, myvariant.info (hg19 and hg38), COSMIC, ClinVar
iii. A small number come from ClinVar, as it is already largely present in dbSNP
iv. CAR can be searched for an Allele using -> HGVS, CAid (Allele Registry ID), ClinVar RCV ID (accession number describing a specific ClinVar submission), dbSNP ID, ExAC ID (hg19), gnomAD ID (hg19), MyVariant ID (hg19), HGNC Gene Symbol, Reference Sequence and position (reference sequence, start, stop), PMID, Any identifier.
v. MetaKB (clinical relevance) is much smaller than ClinGen (Germline and somatic)
vi. MetaKB has 12K aggregate interpretations covering 3,400 unique variants in 450 genes. This is a good amount of variation we can work with. It is a sub-portion of ClinGen.
c. How did they harmonize their knowledge base (meta-knowledgebase)?
i. For ClinGen (CAR - ClinGen Allele Registry) Variants harvested from each knowledgebase were first evaluated for attributes specifying a precise genomic location, such as chromosome, start and end coordinates, variant allele, and an identifiable reference sequence.
ii. Variant names were queried against the Catalog of Somatic Mutations in Cancer (COSMIC) v81 to infer these attributes in knowledgebases that did not provide them.
iii. Custom rules were written to transform some types of variants without clear coordinates (e.g. amplifications) into gene coordinates.
iv. All variants were then assembled into HGVS strings and submitted to CAR (http://reg.clinicalgenome.org) to obtain distinct, cross-assembly allele identifiers (see the sketch after this section).
v. MetaKB contains variant interpretation and the various attributes. MetaKB provides different identifier for different sources.
vi. CAR - > ClinGen Allele Registry (http://reg.clinicalgenome.org) is used to match variants across knowledgebases like ClinVar and CIVIC
vii. Below link has details on their process to aggregate data.
https://www.biorxiv.org/content/10.1101/366856v2.full.pdf
https://github.com/ohsu-comp-bio/g2p-aggregator
i. Since MetaKB is a subset of ClinGen, one approach could be that we can import the MetaKB instead of ClinGen (Germline and Somatic) into our vocabulary. Allele can be made our standard and the others are the source.
ii. We must look across the 2 knowledge bases and make sure MetaKB has a reasonable pool of variants and uses ClinGen for stable identifiers; then we can use MetaKB.
iii. Denys will put MetaKB into our database, check whether COSMIC and CIViC are covered and what the overall coverage is, find out how often their data is updated, and verify that the variants we are interested in are in MetaKB.
iv. MetaKB is open source so we can get the information to ingest; all we need is the name and identifier. Their data is aggregated from different sources. Per Denys, we might face a challenge with MetaKB as we need 1 identifier that links all sources, but MetaKB provides a different identifier for each source. If we take their model, we may not be able to map it to our vocabulary.
https://github.com/OHDSI/OncologyWG/wiki/Mapping-of-Genomic-Data
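The CAR submission step described in item c.iv above could look roughly like the sketch below. It assumes the ClinGen Allele Registry's public query endpoint (reg.clinicalgenome.org/allele?hgvs=...) and the Python requests library; this is an illustration, not how metaKB itself is implemented.

```python
import requests

def car_id_for_hgvs(hgvs_expression):
    """Query the ClinGen Allele Registry for the canonical allele (CA) identifier of an
    HGVS expression. The endpoint form and response field are assumptions based on the
    registry's public documentation."""
    response = requests.get("https://reg.clinicalgenome.org/allele",
                            params={"hgvs": hgvs_expression}, timeout=30)
    response.raise_for_status()
    record = response.json()
    # The registry response is JSON-LD whose "@id" URI ends in the CA identifier.
    return record["@id"].rsplit("/", 1)[-1]

# Illustrative call (network access required):
# print(car_id_for_hgvs("NM_004333.4:c.1799T>A"))
```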
-
For the next steps, the plan is to execute an existing use case from MMRF (Shaadi) that has already been run on source data. This will allow us to validate that the study can be accomplished using the new Genomic Model and Vocabulary.
-
Shaadi has 2 sets of data (1) Whole genomic (2) Gene Panel data sets which have EHR data.
-
Meeting has been setup with Steve from MMRF for 4/29 to discuss the scope of the OHDSI Genomic Model POC effort.
-
We need to understand how many data points we need to test the model. What are the requirement of the modeling team with regards to the format and mechanism to share the data?
-
One use case Shaadi brought up was adding a marker – insertion/deletion of IGLs (IGL translocation is a high-risk marker), i.e. marker detection for high-risk patients. At a future point, Shaadi will find out from the research team what questions they are asking, but we can use an already published study to cross-validate using OMOP.
-
Denys found that VICC's approach is similar to the OHDSI team's approach with regards to handling genomic data. They take their attributes and translate them to their own model; they are trying to build a data model.
-
With our approach we are not adding tables for location or allele ID.
-
We need to figure out whether there is a way to take information they have in their tables and vocabularize it.
-
We need to decide what pieces of information we want. For example, do we need a concept for the Ref ID, Allele ID, etc.?
-
Denys will provide a summary of the attributes they put in their tables. Make a recommendation if we need all of them and a proposal of how to incorporate it in the vocabulary.
-
As a 1st version, we may end up with our approach that includes non-human readable Concept names which are longer than we would like.
-
We also want to look at their meta knowledgebase and see how they harmonize the different knowledgebases.
Violetta shared statistics of ClinVar:
-
A pie chart was plotted representing the breakdown of variant types by percentage. SNPs (single nucleotide polymorphisms) were found to have the highest prevalence in the data, and it's important to focus on presenting this information accurately.
-
Concepts from ClinVar with an empty f_name (HGVS expression) – concepts related to the variant type ‘Deletion’ have the highest percentage of empty f_name values, followed by the variant type ‘Duplication’. Linking these concepts to other vocabularies is challenging without the HGVS expression. For the 1st version, the recommendation is that we can ignore these.
-
Other problems
a) There are instances where records with the same chromosome location have different HGVS expressions, highlighting the fact that we need to identify variants using DNA and protein sequences where possible.
b) 603 distinct HGVS expressions were duplicated (having both type ‘copy number gain’ and ‘copy number loss’). This could be because they were added at different times by different submitters. For the 1st version, the recommendation is that we can ignore these.
c) 40 distinct HGVS expressions were duplicated (different ‘alleleid’ but the same HGVS expression). For the 1st version, the recommendation is that we can ignore these.
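The duplicate check in item c could be expressed as a small query. The sketch below uses pandas over a hypothetical extract with hgvs and alleleid columns; it is not Violetta's actual analysis.

```python
import pandas as pd

# Hypothetical ClinVar extract with one row per (alleleid, hgvs) pair.
clinvar = pd.DataFrame({
    "alleleid": [1001, 1002, 2001, 2001],
    "hgvs":     ["NC_000001.11:g.100A>G", "NC_000001.11:g.100A>G",
                 "NC_000002.12:g.200del", "NC_000002.12:g.200del"],
})

# HGVS expressions that appear under more than one alleleid (the 40-expression case above).
counts = clinvar.groupby("hgvs")["alleleid"].nunique()
print(counts[counts > 1])
```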
Kee Yuan Ngiam shared use cases as below:
-
A discovery/translation tool building bi-direction GWAS (Genome-Wide Association Studies) to pheWAS (phenome-wide association study) to query associated genomes and associated phenotype outcomes using a single model
-
Clinical pharmacogenomics as a discovery device. There is a need to harmonize the variants that are identified, so the idea is to use what the OHDSI community is building.
Additional information on GA4GH:
https://vr-spec.readthedocs.io/en/1.0/index.html
https://github.com/ga4gh/vr-spec
https://github.com/cancervariants/varlex
https://github.com/acoffman
https://www.biorxiv.org/content/10.1101/366856v2.full.pdf
https://github.com/ohsu-comp-bio/g2p-aggregator
https://vicc-metakb.readthedocs.io/en/latest/index.html
We could download the harmonized variables in a machine-readable format. They have taken variant databases and identified commonalities and normalization.
One suggestion is to just use our approach with the HGVS comparisons instead of adopting the GA4GH approach. It seems like they have mappings for only a small number of variants.
Next steps with the GA4GH:
-
Need to figure out how GA4GH overlaps with what we have done in our proposal. Do a compare and contrast between adopting their approach and sticking with our approach.
-
Invite Allen from VICC if needed to discuss the GA4GH approach and questions.
Below is a link to the presentation that Ron Miller from IQVIA gave:
Seojeong presented the changes to the OMOP Model based on previous discussions. The 2 topics she covered are as below:
- How to distinguish the row of the result of the cancer panel test? -> Prior to answering this question, the design recommendation was to use an existing concept 44818720 which is ‘Lab Results’ to represent Genomic data. However, in order to represent cancer panel tests, we will need to create a new concept for expressing Genomic Sequencing. This new concept ID for Genomic Sequencing will go in the column Measurement_Type_Concept_ID in the Measurement table. Under Genomic Sequencing, there are 4 types of sequencing (‘Genomic Sequencing’ is the parent and the 4 types of sequencing are a part of the generic ‘Genomic Sequencing’) (1) Sanger Sequencing (2) NGS Gene Panel (3) Exome Sequencing (4) Genome Sequencing. Field Measurement_Concept_ID can be used for the concept ID for the different sequencing. By creating concepts for the different sequencing, we can distinguish the results of the cancer panels.
- How to indicate the absence of a variant? For the 3 variant types (1) SNP (2) Fusion (3) CNV, the columns Value_As_Number, Value_As_Concept_ID (a random number) and Value_Source_Value can be used. The column Value_Source_Value will hold the Gene and Variant combination, and Value_As_Concept_ID the corresponding Concept_ID. If the variant type is Fusion, then 2 genes will appear in the Value_Source_Value column (Gene1&2 + Variation). If the variant type is CNV, we can use the column Value_As_Number to indicate how many times the variant is copied. To indicate the absence of a variant, we can use 0 in the Value_As_Number column and NULL in the column Value_As_Concept_ID. A value of 1 is used to represent the presence of variant types SNP and Fusion; 0 means the variant is absent from the result but was targeted by the panel.
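A few example Measurement rows following the convention just described might look like the sketch below. The concept IDs and gene/variant strings are placeholders, not real vocabulary entries.

```python
# Placeholder concept IDs; real ones would come from the Genomic vocabulary.
GENOMIC_SEQUENCING = 2000000100   # assumed new Measurement_Type concept (e.g. "NGS Gene Panel")

measurement_rows = [
    # SNP present: Value_As_Number = 1, variant concept in Value_As_Concept_ID.
    {"measurement_type_concept_id": GENOMIC_SEQUENCING,
     "value_source_value": "EGFR T790M", "value_as_concept_id": 2000000201, "value_as_number": 1},
    # Fusion present: two genes appear in Value_Source_Value.
    {"measurement_type_concept_id": GENOMIC_SEQUENCING,
     "value_source_value": "BCR-ABL1 fusion", "value_as_concept_id": 2000000202, "value_as_number": 1},
    # CNV: Value_As_Number carries the copy number.
    {"measurement_type_concept_id": GENOMIC_SEQUENCING,
     "value_source_value": "ERBB2 amplification", "value_as_concept_id": 2000000203, "value_as_number": 6},
    # Variant targeted by the panel but absent: 0 and a NULL (None) concept.
    {"measurement_type_concept_id": GENOMIC_SEQUENCING,
     "value_source_value": "KRAS G12C", "value_as_concept_id": None, "value_as_number": 0},
]
for row in measurement_rows:
    print(row)
```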
Key discussion points related to reporting panel information in the Measurement table:
- Need to understand how the concepts are stored in the vocabulary and what their relationships are to the panel.
- The challenge with representing panels is that we would then have to start vocabularizing panels and make sure the panels are uniquely identifiable and manageable; panels can also change over time. The open question is whether we can build a comprehensive list of panels and keep it up to date.
- Identify the use cases with panels. Another use case to add to the backlog -> the effectiveness of a panel. The current use cases are about identifying which variations we observe in a patient.
- Representing panel information is important because confidence in the outcomes is also crucial.
- The new recommendation for the Measurement table: Measurement_Concept_ID should hold what we currently have in Value_As_Concept_ID. Value_As_Concept_ID should hold the result (‘yes’/’no’ etc.). Measurement_Type_Concept_ID would hold the panel, either generic or a specific panel.
- The PROS of the above recommendation are that the feature extraction package divides covariates by Measurement_Concept_ID. With this recommendation that becomes possible, whereas the previous recommendation used identical Measurement_Concept_IDs.
- The CONS of the above recommendation are that it may not work with the existing OHDSI feature extraction package. It is possible to support this in the feature extraction package, but the work is constrained by available resources.
- In OHDSI, negative and/or missing values are not addressed, and the Measurement table should follow the OHDSI way. There needs to be alignment between the OHDSI pipeline and the data model. There is already an existing issue that the CDM architecture and the packages are not aligned with each other; this is also an existing problem with CDM v6, which is not compliant with the feature extraction package. The OHDSI library is not prepared for CDM v6.
Next Steps:
- Based on the discussion above, it was decided to cover only variants in Phase 1 of the Genomic Model development. Phase 2 will include panels and an indicator for the absence or presence of variants.
- The new recommendation for the Measurement table: Measurement_Concept_ID should hold what we have in Value_As_Concept_ID today (Phase 1). Value_As_Concept_ID should hold the result ‘yes’/’no’ etc. (Phase 2) ?? Measurement_Type_Concept_ID is the panel or specific panel. (Phase 2) ??
Denys presented the work he has been doing based on feedback he received in previous meetings:
Key discussion points:
- Denys compared CIViC and ClinVar and also studied the reliability of the HGVS linkages.
- CIViC covers not only structural but also functional variants; ClinVar is more specific to structural ones.
- CIViC has around 2,500 variants. Of these, 600 can be mapped to ClinVar using the linkage given by CIViC, and around ~400 more can be mapped in a semi-automated manner. This means ~1,000 CIViC variants can have a strong relationship to ClinVar. Some ClinVar records are deprecated, so the linkage via the CIViC relationship is not fully reliable.
- There are a number of concepts, such as fusions and translocations, that are not well represented in ClinVar. This missing information is clinically important.
- Findings about the ClinVar structure (ambiguous/duplicate HGVS etc.) -> In some cases the same chromosome location in ClinVar can have different HGVS expressions, so to identify variants we need to use a combination of attributes such as DNA sequence and protein sequence; several attributes need to be present.
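A hedged sketch of the identification idea in the last point: treat two records as the same variant when a composite key of attributes agrees, rather than relying on the full HGVS string (the field names and toy records are illustrative):

```python
# Composite identity for a variant: gene symbol + coding-DNA change + protein change.
def variant_key(record):
    return (record.get("gene"), record.get("c_change"), record.get("p_change"))

a = {"gene": "MPDU1", "c_change": "c.537C>T", "p_change": "p.Asn179=",
     "hgvs": "NM_004870.4(MPDU1):c.537C>T (p.Asn179=)"}
b = {"gene": "MPDU1", "c_change": "c.537C>T", "p_change": "p.Asn179=",
     "hgvs": "NM_004870.3:c.537C>T"}  # different transcript version, different HGVS string

print(variant_key(a) == variant_key(b))  # True: same variant despite differing HGVS
```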
** Genomic WG Meeting – 3/10/2020 **
-
Scott Campbell -> a SNOMED CT author, gave us a detailed overview of their approach using HGVS and SNOMED CT to represent genomic data.
-
In conjunction with CHC, Scott has created SNOMED CT observable entity concepts (Gene Sequence ID / Nucleotide Sequence ID) only for the genes that are in the 50-gene target panel used in Nebraska. Results are delivered via HL7 messages. This work is in production in their environment.
-
With regard to how they created the SNOMED CT observable entity -> they are measuring a property of a nucleotide sequence that exists in a named gene; they call out the gene, and the gene is located either in malignant tissue or in normal tissue. This is how they support distinguishing between somatic and germline mutations. The fully specified name is the name identified by HUGO, and they maintain a simple map between the gene Concept ID and the HGNC identifier.
-
The intention is that it would be a hand-off between a clinical term and the world of genomics maintained by HUGO etc., a simple map between a gene and its reference ID in these commonly used gene databases. The intent was to make a complex set of data ingestible by the EHR; it makes the gene sequence report message no different from a lab result message.
-
They have SNOMED CT mapped to named genes and this information is repeated in the HGVS syntax. The reason for doing this is managing display and query times: they have created SNOMED CT concepts for the genes they know have a clinical reference called out and representation in cancer protocols, while for the more experimental content (Foundation Medicine etc.) they preferred not to create SNOMED CT concepts; instead they use more general SNOMED concepts and let the HGVS expression carry the gene name itself.
-
In terms of SNOMED, it is not a big deal to add concepts. They have created SNOMED concepts for about 1,000 genes (chosen for clinical specificity), roughly the largest panel that Foundation Medicine offers. How useful and manageable this would be is still being discussed, as is what the clinical utility would be; many of these Foundation Medicine genes are for research purposes only, and the FDA has not given Foundation Medicine any clearance to use them for clinical purposes.
-
To summarize, they created SNOMED CT concepts for the genes and not for the variants. They store the variants as nominal expressions using HGVS syntax. There is a lot of variation in how the same variant gets expressed in HGVS; how do we make it computable and easy to identify across datasets? Scott identified this as one of the weaknesses. Still, this is the direction in which people are moving internationally: CAP and AMP have favored it, and the Netherlands is doing the same.
-
Mike asked how they get from a variant to an HGVS expression -> Scott explained that they developed the terminology on their own; Genomonc is the sign-out system for their molecular pathologists and they have incorporated the terminology into their work processes. Their parser creates the HGVS syntax from the VCF file; it is an open-source parser. If HGVS can continue to be specified and standardized, it becomes easy to ingest this and run analyses.
-
Question -> What do you do if there are different aberrations in a gene? -> They store the HGVS expression in the database so the variant expression does not get thrown away. The variant does not get conceptualized in SNOMED with its own ID; they keep the HGVS variant as a discrete string data element and store the whole variant call file as an artifact elsewhere, accessible through a link.
-
Question -> How many genes have you put into SNOMED? -> 900 are in SNOMED. For the remaining genes, all they need to do is run a script to create the observable entities that go with each gene. For the full ~20K genes, SNOMED International would choke if they all needed to be uploaded. Scott can generate them and give them to us by running a script; the question then becomes whether they are clinically actionable.
-
Foundation Medicine data is not public -> the Nebraska Lexicon’s vendor has parsing software that does API work for Foundation Medicine; as results are produced, Nebraska can ingest them using the API. Scott stated that if we use their services, they will give us the API at no cost; if MSK uses Foundation Medicine, they can get the API at no charge.
-
Scott mentioned that the SNOMED CT observable concept IDs are included in the regular Lexicon that was shared with Mike.
-
VICC -> Scott would love to have insight into what VICC is doing.
** Genomic WG Meeting Notes – 3/3/2020 **
VICC meeting is on Thur 3/12. Plan is for Christian, Cong and Michael to attend the meeting, introduce OHDSI and begin the conversation around the Genomic database and linkages.
Denys walked us through ClinVar, Civic and HGVS representation on the Vocabulary. Link to the document can be found under
Some key discussion points are as below:
For a readable Concept Name -> we can use CIViC, but it will not be the full Concept Name. CIViC contains genomic locations like ClinVar. -> To link CIViC and ClinVar, fields like hgvs_expression can be used.
To-do -> Work out the deltas between CIViC and ClinVar. CIViC contains information about translocations: one table has Gene1 and another table has Gene2. ClinVar contains somatic and germline mutations, but mostly germline. There are only around 1,000 variants that can be linked using the HGVS expression linkage. CIViC contains functional types of variants and ClinVar contains structural types, so we can link the CIViC functional type to the ClinVar structural type. The short name is not present in ClinVar.
Denys has only looked at substitutions so far; it is not yet known how this will work for additions, insertions, etc. CR -> For the substitutions, are there CIViC variants that are not in ClinVar, and is that a problem of the notation or of ClinVar? Denys -> The problem is that CIViC does not give full location information or a good HGVS expression; it does so only for a small set of concepts, and other variants have empty HGVS expression fields.
To-do -> (1) Get the other databases (OncoKB etc) in the Concept table. Denys to check links provided by Asieh and work with CR to obtain licenses where needed. (2) Check out the links using HGVS expressions and see whether the links are reliable and robust. For Substitutions, we know they are linked directly from Civic to ClinVar, take that link and see whether we can create the same link using the HGVS expression. If we decide to create a standard using the linkages, we want to make sure the HGVS exp can be relied on. Take the links and see where you cannot reproduce them.
If ClinVar becomes the standard, the problem is that it is weak on somatic mutations and hence on oncology variations. The alternative to using ClinVar is taking a larger set of database inputs and creating standard concepts based on a generic description of a variant. We want to know whether the notation is robust enough; to test that, take a set of these links and check whether we can reproduce them based on the HGVS expression alone.
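A minimal sketch of that reproducibility check with pandas; the column names and toy rows are assumptions, not the real CIViC/ClinVar extracts:

```python
import pandas as pd

# CIViC rows carry both an explicit ClinVar link and an HGVS expression (toy data).
civic = pd.DataFrame({
    "civic_id": [1, 2],
    "clinvar_id": ["111111", "222222"],
    "hgvs_expression": ["NM_0001.1:c.100A>T", "NM_0002.1:c.50G>C"],
})
clinvar = pd.DataFrame({
    "clinvar_id": ["111111", "333333"],
    "hgvs_expression": ["NM_0001.1:c.100A>T", "NM_0003.1:c.7C>G"],
})

# Recreate the link purely from the HGVS expression ...
by_hgvs = civic.merge(clinvar, on="hgvs_expression", suffixes=("_civic", "_clinvar"))

# ... and compare it with the link CIViC provides explicitly.
reproduced = (by_hgvs["clinvar_id_civic"] == by_hgvs["clinvar_id_clinvar"]).sum()
print(f"{reproduced} of {len(civic)} CIViC-provided links reproduced via HGVS")
```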
To-do -> Can we create a list of all these explorations that Denys is working on, so we know what needs to be done and the labor can be divvied up?
Seojeong walked us through the Measurement table for storing genomic data. Key points captured below:
-
For the columns ‘Measurement_Concept_ID’ and ‘Measurement_Source_value’, there are 2 concepts for sequencing. For targeted genomic sequencing, additional information is added to the name. The domain is Procedure, which is probably wrong; we need more concepts in the Measurement domain.
-
For the measurement_type_concept_id column, we can use the ‘Lab result’ concept_id, which is the most appropriate for expressing genomic sequencing data.
-
For storing variants with SNP, fusion and CNV, we can use the 3 columns ‘value_as_number’, ‘value_as_concept_id’ and ‘value_source_value’.
-
If variant type is fusion, 2 gene names will appear in the value_source_value. If variant is CNV, we use the ‘value_as_number’ column to express how many times the variant is copied. We can represent the variants if we have the variant concept_id.
-
Chan raised a concern about using the Measurement table: information about the quality of the sequence is lost, information that may be important for registrars. Chan’s preference is a dedicated genomic CDM. Cong suggested that quality is the responsibility of the institution that is ETLing its source data and not of the model.
To-do -> Put variants in the Measurement_Concept_ID, use Measurement_Type_Concept_ID to distinguish them. Use Value_As_Number to indicate absence or presence.
- Rimma raised the requirement to establish linkages between condition and mutation and between tissue and mutation. Are we planning to do any explicit linkages to the tumor, or is it just going to be built into the definition of the concept? The original model has references to procedure, specimen, etc.; if we wanted to track this, we could keep those references. Relevant information includes what specimen the mutation came from, the linkages between specimens, the state of disease, the methodology used to derive the mutation, and the mutation itself. This is important for specific types of use cases, such as people who have colon cancer with specific variants derived from a tumor from a particular pathology, particular location, etc. The plan is to add this requirement to the backlog.
Backlog -> The group decided, for the first phase, to develop the vocabulary for variants; then we can extend to the sequencing pipeline, relations to specimen information, etc. The most urgent problem is to standardize the representation of what gets measured.
** Genomic WG meeting notes 2/25 **
- Discussed the recommendation proposed by Chan, Michael and Christian. Here are some key points:
a. The problem with using ClinVar is that it’s mostly germline with very little somatic mutation information.
b. We could use the somatic part of ClinVar and other databases which specialize in cancer, importing a number of these knowledgebases and creating anonymous variants based on the HGVS notations.
c. Adopt ClinVar, CIViC, COSMIC etc. and create a union of all HGVS expressions, like VICC does. Or steal it from them. Let’s see what that looks like.
d. VICC is already creating a summarized solution, so we’ll steal it from them. Based on our assessment of VICC we can decide whether we need to build the union ourselves or take it from somewhere else.
e. Meeting with VICC on the 12th of March. We need from them a machine-readable version of their specification, an aggregation of the variant definitions from other DBs, a stable identifier, and links to HGVS notations.
f. NU has an effort to map genomic data to HL7 (Luke Rothmen). Maybe something similar could be done between FHIR and OMOP. They are working with GA4GH; this could be an opportunity to collaborate. Michael will try to connect with Luke for more information.
g. Next Steps:
i. Understand the various databases (CIViC, COSMIC, etc.).
ii. Figure out how reliable the interoperability of the HGVS notations is.
iii. Figure out how our representation is going to be based on the use cases.
iv. Meet with VICC
-
Some of the use cases to explore
a. Give me all patients with aberration in a pathway. Essentially comes down to listing genes in the pathway. Uniquely identify the aberration.
b. Give me all patients with aberrations that have been observed in XYZ cancer.
c. Asieh has some use cases that are geared towards outcome and predictive markers.
d. The goal is to identify an aberration uniquely and unambiguously, so that if a patient has the aberration we can find those patients. What are these aberrations in the vocabulary? Which ones will show up in the cancer data? Which DB has them?
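A hedged sketch of use case (a) against the OMOP CDM, assuming variants are stored in MEASUREMENT and the vocabulary links each variant concept to its gene concept; the relationship_id 'Variant of gene', the gene codes, and the pathway gene list are placeholders, not real vocabulary content:

```python
# Build an illustrative cohort query: persons with any aberration in a pathway gene.
pathway_genes = ["KRAS", "NRAS", "BRAF", "MAP2K1"]  # illustrative pathway gene symbols

sql = """
SELECT DISTINCT m.person_id
FROM measurement m
JOIN concept_relationship cr
  ON cr.concept_id_1 = m.measurement_concept_id
 AND cr.relationship_id = 'Variant of gene'   -- placeholder relationship name
JOIN concept gene
  ON gene.concept_id = cr.concept_id_2
WHERE gene.vocabulary_id = 'HGNC'              -- placeholder vocabulary_id
  AND gene.concept_code IN ({genes})           -- or match on symbol/synonym instead
""".format(genes=", ".join(f"'{g}'" for g in pathway_genes))

print(sql)  # run against the CDM with any SQL driver
```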
-
Genomic-OMOP CDM
a. Seojeong walked us through the Minimized Genomic-CDM model. The existing model has 3 tables -> Genomic_Test, Target_Gene, Genomic (Combination of 2 tables Variant_Occurrence, Variant_Annotation that were previously created)
b. These genomic tables are connected to other tables of omop-cdm for storing clinical information.
c. Columns ‘gene1’ and ‘gene2’ will be replaced by ‘Gene1_Concept_ID’ and ‘Gene2_Concept_ID’
d. The model will be impacted/altered based on how the Vocabulary is modeled.
e. In keeping with the fundamentals of database normalization, tables will only have data that is relevant to the primary key; everything else goes to the reference tables.
f. One of the ideas discussed is to keep HGVS nomenclature in the model. If there is no standard vocab for local variants, then we keep it with Concept ID 0.
g. Negative variants have been accounted for in the model. Both the presence and the absence of a variant need to be recorded. Value_as_Concept_ID will be ‘Yes’ or ‘No’.
h. Some important questions to investigate:
i. The idea of using the existing Measurement table as an alternative to the Minimized Genomic CDM needs to be explored, provided everything else is in the vocabulary.
ii. Do we think the vocabulary can take every variant and every possible gene combination?
iii. Copy_Number_Variation may be tricky to store in Measurement; Copy_Number_Variation is not unique for a patient and is an attribute of the variant.
iv. The idea behind the Genomic Vocabulary is to create a very rich ontology with all the attributes expressed as relationships. Some variants already exist as LOINC codes, and LOINC codes can have negative results and aberrations.
v. Should we use fact_relationship for the relationships between attributes? For the designated gene-variant-to-procedure/specimen relationship type, can we estimate whether it is many-to-many?
** Genomic WG meeting notes 2/18 **
Denys shared the vocabulary mapping documentation from HGNC and ClinVar to the OMOP standardized vocabulary
- The documentation on the vocabulary approach can be found in the link https://github.com/OHDSI/OncologyWG/wiki/ClinVar
- From the HGNC source fields: a. ‘location’ (cytogenetic location of the gene) gets mapped to the Concept_class_id field in the Concept table. b. The ‘alias_symbol’ field from the source is mapped to concept_synonym_name fields in the ‘Concept_synonym’ table, so if there are multiple alias names we can create multiple entries in the ‘Concept_synonym’ table. This will allow researchers to search based on their needs (see the sketch after this list) -> Denys will submit a forum post to solicit guidance on whether to map all the synonyms, and will check whether Athena shows the multiple synonyms.
- From the ClinVar source fields: a. For all fields marked in blue in the proposal (above link), we can figure out how to map them to the OMOP standard vocabulary if the Model needs them. b. If the blue fields are not needed, then there are examples where all other information (including ‘alleleid’) is identical except for the blue fields, and the variants can be collapsed into 1 standard concept. c. If the blue fields are needed, we could still collapse the variants into 1 standard concept and record the genome-reference information in the concept_relationship table or in other tables separate from the vocabulary tables -> The modeling question is: these are the same variants with 2 different genome references, so do we really want 2 separate concepts, or should we collapse them into one and store the reference somewhere? Do we have a use case that will help us decide? Asieh pointed out that the newer build (‘assembly’ -> GRCh38) might carry more information than the older build. Denys will try to model this for the next meeting to see how we can keep it in the vocabulary without any additional tables.
- Denys presented the logic behind creating a readable and searchable Concept Name as a combination of the Gene, Coding DNA Sequence and Protein Sequence. Including the Protein Sequence is important because it can link to different Genes and the Protein Sequence and Gene combination is unique. Denys will take a deeper look to refine the Concept Name and make it more usable/readable/searchable.
- Per Asieh’s suggestion (slide #19 of the Genomic proposal document), a relationship needs to be established between the Clinical Significance and the Site (Location).
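A hypothetical sketch of the HGNC mapping described above, turning one HGNC record into a Concept row plus Concept_synonym rows; the concept_id, hgnc_id, aliases and other field values are placeholders, not the actual load script:

```python
hgnc_record = {
    "symbol": "MPDU1",
    "name": "mannose-P-dolichol utilization defect 1",
    "location": "17p13.1",              # cytogenetic location -> concept_class_id
    "alias_symbol": "ALIAS1|ALIAS2",    # pipe-separated aliases (placeholder values)
    "hgnc_id": "HGNC:0000",             # placeholder identifier
}

concept = {
    "concept_id": 36000001,             # placeholder OMOP concept_id
    "concept_name": hgnc_record["name"],
    "domain_id": "Measurement",
    "vocabulary_id": "HGNC",
    "concept_class_id": hgnc_record["location"],
    "concept_code": hgnc_record["hgnc_id"],
}

# One Concept_synonym row per alias, so researchers can search by any alias.
concept_synonyms = [
    {"concept_id": concept["concept_id"], "concept_synonym_name": alias}
    for alias in hgnc_record["alias_symbol"].split("|")
]

print(concept)
print(concept_synonyms)
```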
** Genomic WG Meeting Minutes 2/11 **
-
Denys raised the question about classification of variants like Pathogenic, Likely pathogenic, Uncertain significance, Likely benign, Benign. How will we create a relationship between variants and their classifications? Which domains etc.? Plan is to create the necessary vocab records, the necessary class concept records, the necessary relationship records -> Several questions need to be answered before we can create the records -> Do we have the links? Can we infer the links? are the links reliable?
-
Extension of CDM -> Review the Genomic extension model created by Chan. We need to go through and decide which of the vocabulary records will fit into which of these tables, and which tables we don’t need anymore because the vocabulary takes care of them. For example, for the Gene field, the vocabulary tells us for each variant what the gene is. Reference information needs to be moved into the vocabulary tables and taken out of the patient data -> Next steps: (1) Have Chan go through the model and decide which fields need to be dropped. (2) Figure out whether the reference table is strong enough to replace the fields in the tables; if we form the vocabulary correctly, then a lot of the fields in the Genomic CDM model will be redundant. (3) Like we did for NAACCR, write a proposal (identify the deltas, new fields, new tables, etc.) on why and how we’re going to put this into the vocabulary and CDM. This is required for the larger CDM group; for a CDM extension, there is a format everyone adheres to.
-
Proposal for the creation of readable Concept Names -> HGVS to human-readable format. For all the types of variants that exist (substitutions, deletions, insertions, repeats, etc.), Denys will create a proposal and share it with the team. The suggestion is that since most researchers will query using the gene abbreviation, the abbreviation should still be in the Concept Names even though a relationship between Gene and Variant exists; details can go in the synonyms. Denys will create a complete proposal for the Concept Names, taking the different pieces and translating them into a human-readable and keyword-searchable form, highlighting the reasoning behind his approach. There is also a possibility that we don’t need an algorithm to generate the Concept Names if the abbreviations are already present in the source vocabulary.
-
Publish a mini prototype version of the entire ClinVar in the OMOP Vocabulary along with the readable Concept Names. -> Start with a use case -> Michael will be able to bring basic use cases to the discussion. A very basic use case would be to show the patients that have a certain variant, along with their co-occurring variants.
-
Mapping of the shorthand variant signatures from the CIViC page -> relations between CIViC (non-standard) and ClinVar (standard). Denys will list all the various choices (minimal); if the choices are not clear, he will list the alternatives and show them to the group before the meeting so people can discuss. On the ClinVar website it seems like the column ‘Protein Change’ is the variant shorthand; this, in combination with the gene name, can be the Concept Name.
-
Concept Code should map to Alleleid from ClinVar. HGVS goes in Concept Synonym. This needs to be documented as well. How did we arrive at this decision? What are the consequences of the choices we are making? https://www.ncbi.nlm.nih.gov/clinvar/?term=G12C
-
The Clinical Trials WG is at the point where they have reviewed bringing clinical trial data into OMOP -> One of the needs is for the vocabulary to represent biomarkers used in trials that don’t have a current representation in the OHDSI vocabulary. Understanding the scope of what needs to be done to represent these biomarkers in the OHDSI vocabulary is important. What should coordination with the Clinical Trials WG look like as this work progresses? Shilpa will make sure we coordinate with the Clinical Trials WG to set up a call -> the agenda is to horizon-scan candidate vocabularies that might be relevant, come up with a process to evaluate their suitability, and define the criteria that will be used in that evaluation.
Below is the knowledge base that Cong shared. https://sbmi.uth.edu/dotAsset/27d8d4c5-01bb-44d1-af3d-6bebe7515480.txt (stores information of genetic variants as the inclusion/exclusion criteria in clinical trial)
** Meeting Notes – Genomic Subgroup 2/4 **
- Denys shared with the group an idea of what the concept table will look like with the Gene and variants mapped to Concepts from ClinVar and Civic.
- The Concept Relationship table has relations between Genes and Variants as well as between ClinVar and CIViC. ClinVar is used as the standard as it has more variants, while CIViC has a more human-readable format. ClinVar also gives us codes that are related to SNOMED, so if a researcher wants to find which variant can cause which disease, we can give them the SNOMED conditions.
- ClinVar provides a link between ClinVar and HGVS nomenclature so you can search ClinVar using the HGVS and vice versa.
- The group discussed what should go in the Concept Code. HGVS expression ?? or Alleleid from ClinVar. Alleleid from ClinVar can be used as Concept Code in the future. Having the HGVS expression will be helpful.
- Denys has a proposal for the creation of readable Concept Names as below:
Clinvar Name -> NM_004870.4(MPDU1):c.537C>T (p.Asn179=) Concept_name -> Homo sapiens mannose-P-dolichol utilization defect 1, Substitution in position 537, Cytosine replaced by Thymine, Asparagine in position 179 replaced by Self
Clinvar Name -> NM_001134398.2(VAV2):c.2136-10_2136-9insGTGACCGCCGGGGCCGTGTGGCCCTCACGCA Concept_name -> Homo sapiens vav guanine nucleotide exchange factor 2, Insertion in position from 2136-10 to 2136-9 and insertion of GTGACCGCCGGGGCCGTGTGGCCCTCACGCA
Clinvar Name -> NM_003664.4(AP3B1):c.1040+9T>A Concept_name -> Homo sapiens adaptor related protein complex 3 subunit beta 1, Substitution in position 1040+9, Thymine replaced by Adenine
In NM_004870.4(MPDU1):c.537C>T (p.Asn179=): MPDU1 is the gene where the mutation occurs; 537 is the position of the mutation; C>T is the nucleotide substitution; and in p.Asn179=, ‘Asn’ is the amino acid affected by the mutation, ‘179’ is its position, and ‘=’ indicates what it is replaced by (here, itself, i.e. no protein change). A small parsing sketch follows these examples.
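A minimal sketch (not Denys's actual algorithm) of generating a readable name from a simple ClinVar substitution expression, mirroring the examples above; it only handles the c.&lt;pos&gt;&lt;ref&gt;>&lt;alt&gt; (p.Xxx&lt;pos&gt;Yyy) pattern, and the real proposal would substitute the full HGNC gene name for the symbol:

```python
import re

NUCLEOTIDES = {"A": "Adenine", "C": "Cytosine", "G": "Guanine", "T": "Thymine"}
AMINO_ACIDS = {"Asn": "Asparagine", "=": "Self"}  # extend with the remaining amino acids

PATTERN = re.compile(
    r"NM_\d+\.\d+\((?P<gene>[^)]+)\):c\.(?P<pos>[\d+_-]+)"
    r"(?P<ref>[ACGT])>(?P<alt>[ACGT])"
    r"(?: \(p\.(?P<aa>[A-Za-z]{3})(?P<aapos>\d+)(?P<aanew>[A-Za-z]{3}|=)\))?"
)

def readable_name(clinvar_name: str) -> str:
    m = PATTERN.match(clinvar_name)
    if not m:
        raise ValueError(f"unsupported expression: {clinvar_name}")
    parts = [
        m["gene"],  # placeholder: the full HGNC gene name would go here
        f"Substitution in position {m['pos']}, "
        f"{NUCLEOTIDES[m['ref']]} replaced by {NUCLEOTIDES[m['alt']]}",
    ]
    if m["aa"]:
        parts.append(
            f"{AMINO_ACIDS.get(m['aa'], m['aa'])} in position {m['aapos']} "
            f"replaced by {AMINO_ACIDS.get(m['aanew'], m['aanew'])}"
        )
    return ", ".join(parts)

print(readable_name("NM_004870.4(MPDU1):c.537C>T (p.Asn179=)"))
# -> MPDU1, Substitution in position 537, Cytosine replaced by Thymine,
#    Asparagine in position 179 replaced by Self
```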
- Chan recommended that we extend our domains to Genomics. Currently, Domain ID is Measurement. This group will proceed with creating the Domain ID -> ‘Genomics’. At some point in the future, we need to bring this up with the CDM/Vocabulary Team.
- Within ClinVar, there are also identifiers for sets of variants that might represent a genomic panel that could have a single HGVS expression. Andrew recommended that we understand the relevance of these sets. The question raised is, do we need sets to be represented instead of individual variants.
- The ClinVar data includes the HGVS expression, the Alleleid, and a database name/identifier (dbSNP and dbVar). This database identifier tells us in which of the 2 NCBI variation databases the variant is known to be represented (rsID). Currently, the database identifier is not being captured in the OMOP Vocabulary; we need to better understand its role and whether there is a need to distinguish between these in our data. The publication below, shared by Asieh, shows the coverage between ClinVar and the NCBI databases. https://ascopubs.org/doi/pdf/10.1200/PO.18.00371
- Michael recommended that before the data holders start mapping their source data, it would be helpful to publish a prototype version of ClinVar within the OMOP Vocabulary (not an official release to Athena), just like we did for NAACCR. That way we don’t each have to download ClinVar and try to figure out how to structure it.
- In the next meeting, Denys will provide the group information on the effort to publish the entire ClinVar in the OMOP Vocabulary along with the readable Concept Names.
- The format of the data source will depend on the site. The NAACCR strategy may apply to the most widely used panels and may speed things up by creating a common ETL (Foundation results etc.) so that each site can use the generic ETL.
- Last week, Andrew had mentioned a tool called Galaxy that does data munging. Participants on the call shared their experience with Galaxy as being mostly used as an analytical tool rather than in a mapping capacity.
- The CIViC page has shorthand variant signatures which are more human readable than the HGVS nomenclature. Where are these in the Vocabulary so that they can be used? Michael suggested that these mappings should be present in the OMOP Vocabulary itself. Jhaney confirmed that these are very valuable for research and are missing in ClinVar. Per Jhaney, there are several resources that produce this information; a paper called ‘Comparative analysis of public knowledgebases for precision oncology’ describes how they compare with each other, their coverage, how many variants they have, etc. We can use these as Concept Names for our variants and link them with ClinVar. Denys to look further into this.
** Meeting Notes – Genomic Subgroup 1/28 **
Denys reviewed his findings with ClinVar.
The ClinVar values that are important to us are Gene, Disease, Type of Variant, Variant Location (not possible to handle well in a relational database), Pathogenic significance, and Gene linked to a separate repository with the gene location. What is in the Concept Name should be converted into something human readable and searchable.
What is the rule we will use to create the Concept Name? The Gene will always be included in the Concept Name, but in some cases we don’t have the Gene for certain mutations; what do we do in those cases? Some of the ambiguity for the simple ones comes from the fact that they are on different genome assemblies. We could drop one and use the latest genome assembly, which will reduce the number of notations. We are looking for participants that have any experience with this. The Vocabulary team will investigate and make a proposal next time.
There is Variant Type in the ClinVar and there is a crosslink between ClinVar and Civic. ClinVar can be used as Standard and Civic can be used as a Classification system.
What’s important in Civic that people may want to use? What are the use cases that participants wish to bring to the discussion? Find out whether a patient will have a mutation detected in their blood sample. The mutation will be from publications/trial protocols. Only recruit patients with certain types of mutations. Go to clinicaltrials.gov and see how they refer to genomic information so that we can define it, so it becomes unambiguous and clean.
Are we interested in linking with lab values and imaging results? For example, using phenotypes via HPO (Human Phenotype Ontology, which goes beyond disease) terms to describe a patient: HPO terms are already annotated with certain kinds of mutations, so we can use algorithms to lead us to the mutations and order certain tests on those patients using certain gene panels. We can try linking genotypes to phenotypes using the concept_relationship table; a lot of genetics providers are using HPO terms to write notes for their patients. Adding this to the backlog -> incorporation of phenotype ontologies into the OMOP vocabulary and linking to diseases and other domains. Andrew will invite Peter to present in a future genomic meeting to talk about phenotype and genotype mapping. We will start recruiting data to perform a feasibility analysis to figure out how to map the data to OMOP genomic concepts: Cong is going to check with Karthik if he can get the data, maybe we set up a working/hands-on session to just play around with the data and do some analysis, and NW has data (a JSON results file) that can be used. With the data, we can find the variants for genes, the variants for a disease, the types of variants, etc. The challenge will be in seeing whether the results-file format contains enough information to be mapped easily to the OMOP standard vocabulary. We want to make sure the problem has not been solved already: there is an open-source tool called Galaxy which has a data-munging component to facilitate ingesting genetic data sets so they are suitable for analysis.
Questions/Next Steps:
1.Continue investigating ClinVar.
2.What is the rule we will use to create the Concept Name?? Gene will always be included in the Concept Name. In some cases, we don’t have the Gene in certain mutations. What do we do in those cases?
3.What’s important in Civic that people may want to use? What are the use cases that participants wish to bring to the discussion? An example would be to find out whether a patient will have a mutation detected in their blood sample. The mutation will be from publications/trial protocols. Only recruit patients with certain types of mutations.
4.Are we interested in linking with lab values and imaging results? Adding this to the backlog -> incorporation of phenotype ontologies into the OMOP vocab and linking to diseases and other domains.
5.Andrew will invite Peter to present in a future genomic meeting to talk about description of the Phenotype and Genotype mapping.
6.Discuss and review the Genomic OMOP Model that was built by Chan’s team.
7.Start recruiting data to perform feasibility analysis to figure out how to map the data to OMOP Genomic concepts.
a.Cong is going to check with Karthik if he can get the data.
b.Maybe we setup a working/hands-on session to just play around with the data and do some analysis.
c.NW has data (JSON results file) that can be used.
8.With the data, we want to find the variants for genes, the variants for a disease, the types of variants, etc. The challenge will be in seeing whether the results-file format contains enough information to be mapped easily to the OMOP standard vocabulary.
9.Explore Galaxy
** Meeting notes : 1/15/2020
(1) Seojeong reviewed with the group her findings on the impact/information loss if we were to adopt dbSNP (more of a research application) versus ClinVar. The analysis was performed on their lung cancer data.
(2) The dbSNP data contains variant types such as SNP, INDEL and Others. Copy Number Variation (CNV) and TRA (Fusion) are not used in dbSNP. The dbSNP database is comprehensive; it has a lot of other information.
(3) The ClinVar database (classified by clinical significance) contains pathogenicity levels like Benign, Likely Benign, etc. It does not classify by the type of variant (deletion, insertion, etc.); the reason is that this information is carried within the chromosome position, so once you have the chromosome position you can derive the variant type directly using bioinformatics tools. This information is therefore redundant for ClinVar.
(4) dbSNP is a repository to host SNPs and ClinVar is a repository to host variants submitted by physicians. There is a possibility to communicate/report the 5% loss back to ClinVar, and they would consider including it.
(5) Of the total of 46,297 variants studied by Seojeong, there is a 16% loss of data with dbSNP (16.33% did not have an rsID) and a 5% loss of data with ClinVar (annotation). One possible reason for the greater loss with dbSNP versus ClinVar is that the data used in the analysis comes from a targeted sequencing gene panel.
(6) The idea of linking dbSNP and ClinVar is based on the fact that if we have the positions of the variants (chromosome start and end position), we will be able to map them back to any vocabulary system using position annotation methods; we will not need to use the cross-link of the IDs.
(7) As a foundation vocabulary we can pick either ClinVar or dbSNP. We can then link ClinVar and dbSNP so we can easily run classification queries.
(8) Ensembl is another genome browser that includes dbSNP, ClinVar, cosmic, protein variants, gene information.
(9) Among the data sources included in Ensembl, there are 655M variants from dbSNP; 2 variants, 615K synonyms, 648K phenotypes. The complete list of data sources can be found in the link below: https://asia.ensembl.org/info/genome/variation/species/sources_documentation.html
(10) As next steps, the Vocabulary team needs to further explore these vocabulary systems to eventually create concept IDs. The vocabulary systems can be accessed using the link below.
(11) Another idea that needs to be explored further on the next call is creating concept IDs only for diagnostic variants (does this variant provide an explanation for disease in an affected individual?) or predictive variants (is an individual who inherits this variant likely to develop disease or be responsive to certain drugs?).
** Meeting with Cong : 12/18/2019
What is the purpose of a genomic CDM, given that genomics is already very standardized compared to clinical data? Putting it in a relational database might be challenging: relational databases are not designed for sequence query and sequence alignment. VCF files are used because they are better designed to index the genome and query it efficiently. If you want to find patients with a variant at genome position/chromosome 121023 and query all the patients in the database, a relational database does not work well. If we want to use a Genomic CDM, we could limit it to certain variants; this kind of information might be the scope for which we could use a Genomic CDM. Otherwise, too much information would need to be stored in the CDM and queries would become very difficult.
**Meeting Notes : 11/22/2019
Next Steps:
- Meera will investigate the possibility of converting data into HGVS rules to see what comes out of it.
- Python package that Chan shared is a good starting point to explore for the parsing to see what the output looks like
- Shaadi offered to take a look at the raw file with the AA change codes against the HGVS rules
- Chan is working on converting breast cancer TCGA data that he will review with this team some time in the future.
- Once we have figured out what the output from parsing the variant data looks like, we can see how it maps to a standard, if it maps to one at all.
- We will need guidance from Christian and others in the modelling group on whether the HGVS nomenclature can be used instead of a vocabulary and whether that would be an issue.
Key Points:
- The idea behind the working session was to figure out a way to parse the AA change code using the HGVS rules as guidelines.
- The group discovered that the HGVS rules can be applied consistently despite room for interpretation. If we adopt the nomenclature as the variant vocabulary, we need to make sure everyone in the community adheres to the same rules.
- Various HGVS parsers are available and we could converge on one parser that consistently applies the rules and that we can use as a recommendation.
- The idea is to have a generic parser that can parse foundation medicine results from one place to another.
- The https://github.com/biocommons/hgvs parser can be explored for our purposes (see the sketch after this list).
- Chan and his team wrote their own code to parse the TCGA data manually.
- Useful links shared to aid with parsing AA change patterns using the HGVS rules: https://github.com/biocommons/hgvs https://drive.google.com/drive/folders/1TyJ4vWXAPbFt6gLudQDfzbU3HtvSw2Qt https://drive.google.com/file/d/1bv7Z6gaMLrlU3aKf30AIhVn-XCaRYhvq/view?usp=sharing https://hgvs.readthedocs.io/en/stable/
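A minimal usage sketch of the biocommons parser mentioned above (requires `pip install hgvs`); note that it parses the accession form without the gene symbol in parentheses, using an example expression that appears earlier in these notes:

```python
import hgvs.parser

hp = hgvs.parser.Parser()
var = hp.parse_hgvs_variant("NM_004870.4:c.537C>T")  # example expression from the notes above

print(var.ac)            # transcript accession, e.g. NM_004870.4
print(var.type)          # coordinate type, 'c' for coding DNA
print(var.posedit.pos)   # position of the change
print(var.posedit.edit)  # the edit itself, e.g. C>T
```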
**Meeting Notes : 11/15/2019
Attendees: Meera Patel, Seojeong Shin, Michael Gurley, Seng Chan You, Asieh Golozar [email protected], [email protected], Gaentzsch, Ricarda [email protected]
Breast Cancer Use Case Kick-Off
The ultimate goal is to standardize genomic data to enhance clinical decision-making in cancer. This requires getting the molecular signature of the somatic variance of the solid tumor.
• What genomic data elements comprise the molecular signature of a tumor?
• How can this molecular signature be represented in the CDM/G-CDM?
Example issues the Genomics WG was formed to tackle:
- Distinguishing between germline mutations and somatic mutations is one of the challenges in oncology genomics
- Tumor receptor status (ie HER2/ER/PR) has historically been reported as either negative or positive for each using immunohistochemistry, but now they are being reported with granularity at the variant level. How do we store and represent this information now that it isn't boolean? How can we make use of this granularity in clinical decision-making?
Components of "Molecular Signature" available for the breast cancer use case were reviewed today:
• Gene: our standardization task is creating naming/labeling conventions.
• Locus: some loci are reported in exon ranges and it is not clear what this means. Ricarda and Adam suggested that this is due to splicing. Further investigation is required to aid standardization.
• Nucleotide: upon review today, it was concluded that this data is reported using a standard.
• AA change: loosely follows HGVS rules; one of the challenges is creating a set of rules that must be followed when parsing this data as part of a standard. The challenges here need to be further investigated and demonstrated.
• Fold change: unclear what this variable means.
• Aberrations: our standardization task is to identify or develop standards for the ways aberrations are grouped and named.
Action Items
• For the next meeting, we will assess the coverage of the data elements we looked at today in the TCGA vocabulary
• Follow up on the locus exon range, what it could mean, and how we can represent it in the data
• Follow up on the challenges in parsing the AA change information into something clinically meaningful
• Figure out what "fold change" represents and whether there is clinical value in including it as part of the data analysis
• Follow up on genomic aberration nomenclature
**Meeting Notes : 11/01/2019
Attendees: Meera Patel, Shaadi Mehr, Andrew Williams, Seojeong Shin, Christian Reich, Michael Gurley, Seng Chan You
Key Points:
- Shaadi shared the CoMMpass open-source Multiple Myeloma dataset (https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs000748.v1.p1) as an example of the type of genomics data that is available for use case development
- Shaadi will be sharing Multiple Myeloma use cases once it is approved by her organization's leadership
- Chan will reach out to TCGA to get approval to share the TCGA data for the group to explore its use in standardizing each institution's vocabularies. Michael mentioned that since most institutions use the same subset of vendors (Archer, Guardant, Tempus, Foundation Medicine, etc.), there will be significant overlap in the formats and other conventions that each of our institutional data can be found in.
- The following resources were shared for exploration: Biothings: https://biothings.io/explorer/, http://www.cbioportal.org/, GA4GH: https://www.ga4gh.org/community/
Actionable Steps for next meeting:
- Meera to prepare a brief overview of mCode and how it compares to our G-CDM model
- Chan to share the TCGA data, but this may need to be tabled for subsequent meetings if the group doesn't get approval to share it by next week.
- Use cases: Shaadi will share MMRF use cases once she receives approval; Meera will share breast cancer use cases once she gets a greater understanding of what type of data is available at her institution and the level of granularity that is possible based on this data.
- Update Data Source wiki accordingly
**Meeting Notes : 10/25/2019
Attendees: Meera Patel [email protected]; Seng Chan You [email protected]; [email protected]; Michael J Gurley [email protected]; Christian Reich [email protected]
Key Points:
- The team decided to start with the 2 tasks in parallel in order to establish the scope of the next phase of the Genomic extension efforts (a) Issue 172 (b) Issue 174
- For issue 172 -> we have a use case from Meera's organization as well as Shaadi's organization. Issue 171 will be used to capture additional use cases from the OHDSI community.
- The use cases will help us determine what type of studies are of interest to the research community, what we need to focus our attention on and what kind of data we will need to execute the research study.
- As a starting point, a basic use case we could start with is writing a query for sites that have genomic data: give us the frequency of cancer types for which genomic tests are ordered.
- Meera also suggested a typical use case from MSKCC: for breast cancer, capture the genomic markers that are routinely collected.
- For issue 174 - > the goal is to do a descriptive analysis of the data we have so far (Northwestern, MSKCC etc) and the additional data we will obtain from the OHDSI community in order to determine what kind of data (survey the data) we have to run the studies.
- A repo is being created to capture some of the key data description attributes; Data/Population characterization, registry, site name, EHR, trial, etc. have been identified as fields of interest. The repo will be published on the OHDSI website along with a link to Chan's paper to provide information to participants interested in lending their data or use cases to the Genomic project.
- Issue 171 will be published out to the OHDSI community to gather additional use cases besides the ones mentioned in #4 and #5 above.
**Meeting Notes : 10/18/2019
Attendees: Meera Patel [email protected]; Seng Chan You [email protected]; Shaadi Mehr [email protected]; Asieh Golozar [email protected]; [email protected]; [email protected]; Michael J Gurley [email protected]; [email protected]; [email protected]; Denys [email protected]
Key Points:
- Chan walked through what has been accomplished so far with the Genomic extension of the CDM model (G-CDM).
- Goal of the meeting was to solicit participation to run use cases/research questions against the Genomic Common Data Model (G-CDM).
- Team has decided to explore the below tasks as next steps: • Create a list of genomic terms to come up with a mapping strategy and centralize existing standard genomic vocabularies • Articulate a minimum of 3 genomics use cases framed as high-level research questions • Fine tune the model based on the output of running additional use cases
- Shaadi has some research questions for which she needs approval from her organization's Chief Medical Officer before she can bring them to the subgroup.