Generalizability
OHDSI Study Protocol: OHDSI_Study_Protocol_v1.0
Collaborators: Anna Ostropolets and Patrick Ryan
This study aimed to evaluate and characterize the generalizability or coverage of the Observational Medical Outcomes Partnership (OMOP) vocabulary terms included in the OMOP2OBO mapping set to OMOP vocabulary terms utilized in the Observational Health Data Sciences and Informatics (OHDSI) Concept Prevalence study sites.
As described here, the Concept Prevalence study was designed to provide researchers with additional context regarding the frequency at which different clinical codes occur across the OHDSI research network:
We want to study the usage patterns of Concepts across different OMOP CDM instances. This in itself could be useful information to answer many questions, but we have a concrete reason: For any one medical entity, the granularity of codes captured in a data source can vary greatly. For example, Chronic Kidney Disorder stage II can be coded as ICD9 code 585.2 Chronic kidney disease, Stage II (mild); 585.9 Chronic kidney disease, unspecified or even as 586 Renal failure, unspecified. However, this information is key for any cohort definition. Currently, researchers have no way of knowing whether a certain concept with high granularity is even available for selection, or whether they have to use a generic concept in combination with some auxiliary information to define the cohort correctly. Each data source instance is a black box and knowledge about the distribution of the concepts is limited to the very instance researchers have access to. But OHDSI Network Studies are dependent on cohort definitions that work across the network.
The main research question for this portion of the evaluation was: how does the coverage of the OMOP vocabulary terms present in the OMOP2OBO mappings differ across the OHDSI Concept Prevalence study sites?
The specific aims of this study were as follows:
- Examine OMOP2OBO coverage across the Concept Prevalence sites by identifying:
  - OMOP vocabulary terms that exist in OMOP2OBO and one or more sites
  - OMOP vocabulary terms only present in OMOP2OBO and none of the Concept Prevalence sites
  - OMOP vocabulary terms only present in one or more of the sites
- Demonstrate the potential for [molecular] biological inference of OMOP2OBO by characterizing differences in ontology term enrichment across the Concept Prevalence sites when varying different aspects of data provenance (e.g., site type, clinical specialty, and site location).
In addition to the Concept Prevalence study sites (n=22), data were obtained from two independent academic medical centers. High-level descriptions of each site, including the total number of records and concepts, are provided below.
Database | Type | Location | Record Count | Concept Count |
---|---|---|---|---|
Ajou University Database (Ajou) | EHR | Non-US | 30,238,709 | 6,055 |
Australian Electronic practice based research network (AU-ePBRN) | EHR | Non-US | 11,658,378 | 5,027 |
Columbia University Medical Center Database (CUMC) | EHR | US | 938,078,465 | 21,502 |
IBM MarketScan Commercial Database (CCAE) | CLAIMS | US | 12,649,562,658 | 31,570 |
IBM MarketScan Medicare Supplemental Database (MDCR) | CLAIMS | US | 2,770,787,154 | 25,121 |
IBM MarketScan Multi-State Medicaid Database (MDCD) | CLAIMS | US | 4,283,172,117 | 19,133 |
IQVIA Disease Analyzer (DA) France | EHR | Non-US | 39,632,134 | 3,423 |
IQVIA Disease Analyzer (DA) Germany | EHR | Non-US | 851,853,377 | 9,276 |
IQVIA Longitudinal Patient Data (LPD) Australia | EHR | Non-US | 56,940,803 | 5,833 |
IQVIA US Ambulatory EMR (AmbEMR) | EHR | US | 10,634,058,375 | 62,161 |
IQVIA US Hospital Charge Data Master (CDM) | EHR | US | 4,857,228,360 | 19,352 |
IQVIA US LRxDx Open Claims (Open Claims) | CLAIMS | US | 71,678,847,042 | 20,083 |
Japan Medical Data Center database (JMDC) | EHR | Non-US | 1,184,325,523 | 6,833 |
Korea National Health Insurance Service / National Sample Cohort (NHIS/NSC Korea) | CLAIMS | Non-US | 323,096,899 | 6,667 |
Medical Information Mart for Intensive Care III (MIMIC3) | EHR | US | 124,127,038 | 3,781 |
Optum De-Identified Clinformatics Data-Mart-Database— Socio-Economic Status (SES) | CLAIMS | US | 13,369,194,028 | 36,943 |
Optum De-Identified Clinformatics Data-Mart-Database—Date of Death (DOD) | CLAIMS | US | 9,716,879,363 | 34,853 |
Optum De-identified Electronic Health Record Dataset (PANTHER) | EHR | US | 27,894,204,112 | 59,777 |
Premier Healthcare Database (PREMIER) | CLAIMS | US | 16,794,698,039 | 18,903 |
Stanford Medicine Research Data Repository (STaRR) | EHR | US | 416,175,821 | 11,161 |
The Healthcare Cost and Utilization Project Nationwide Inpatient Sample (HCUP) | EHR | US | 744,807,853 | 9,391 |
Tufts Medical Center Database (Tufts) | EHR | US | 66,863,985 | 21,118 |
UCHealth | EHR | US | 1,215,613,326 | 19,073 |
USC PScanner | EHR | US | 29,703,213 | 11,476 |
For each data site, standard concepts used at least once in practice were obtained from the Condition Occurrence (i.e., SNOMED-CT), Drug Exposure (i.e., ingredient-level; RxNorm), and Measurement (i.e., LOINC) tables. For all concepts, the total frequency was obtained. Consistent with the Concept Prevalence study, all concepts occurring fewer than 10 times were ignored and all remaining concepts occurring fewer than 100 times were assigned a count of 100.
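The frequency rule described above can be sketched in a few lines. This is a minimal illustration rather than the study's code: the function name, the constants, and the example counts are all hypothetical.

```python
from typing import Dict

MIN_COUNT = 10      # concepts seen fewer than 10 times are ignored
FLOOR_COUNT = 100   # remaining counts below 100 are reported as 100


def apply_prevalence_thresholds(counts: Dict[int, int]) -> Dict[int, int]:
    """Drop rare concepts, then raise small counts to the reporting floor."""
    return {
        concept_id: max(n, FLOOR_COUNT)
        for concept_id, n in counts.items()
        if n >= MIN_COUNT
    }


# Hypothetical concept_id -> raw record count for a single site.
raw_counts = {1001: 4, 1002: 10, 1003: 57, 1004: 100, 1005: 25000}
masked = apply_prevalence_thresholds(raw_counts)
print(masked)
```

One consequence of this rule is that 100 is the smallest frequency that can appear in any of the results, which is why the reported ranges all start at 100.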
SQL Query: OMOP2OBO_ConceptPrevalence_ErrorAnalysis.sql
An error analysis was performed to provide insight into the Concept Prevalence study concepts that were not covered by the OMOP2OBO mapping sets. The OMOP2OBO mapping set was built on the OMOP common data model (CDM) v5.0, which contained vocabulary concepts with a timestamp of June 26, 2018. Given how quickly the vocabulary changes, we hypothesized that some of the concepts that we were unable to cover could be brand new concepts and/or concepts that have been updated or replaced by pre-existing concepts.
To perform this analysis, the following SQL query was run against a current version of the OMOP CDM:
-- Relationships added after the OMOP2OBO vocabulary snapshot (2018-06-26)
-- whose source concept is one of the concepts in the OMOP2OBO mapping set.
SELECT DISTINCT
  r.relationship_id,
  c1.concept_id AS SOURCE_CONCEPT_ID,
  c1.concept_name AS SOURCE_CONCEPT_LABEL,
  c2.concept_id AS TARGET_CONCEPT_ID,
  c2.concept_name AS TARGET_CONCEPT_LABEL
FROM
  `sandbox-omop.oct_2020.concept_relationship` r
  JOIN `sandbox-omop.oct_2020.concept` c1 ON c1.concept_id = r.concept_id_1
  JOIN `sandbox-omop.oct_2020.concept` c2 ON c2.concept_id = r.concept_id_2
WHERE
  -- source concepts: condition, medication ingredient, and measurement mappings
  r.concept_id_1 IN (
    SELECT concept_id FROM `sandbox-tc.CHCO_DeID_Oct2018.OMOP2OBO_Conditions_Concepts_Merged`
    UNION DISTINCT
    SELECT ingredient_concept_id FROM `sandbox-tc.CHCO_DeID_Oct2018.OMOP2OBO_Medications_Concepts_Merged`
    UNION DISTINCT
    SELECT concept_id FROM `sandbox-tc.CHCO_DeID_Oct2018.OMOP2OBO_Measurements_Concepts_Merged`
  )
  AND r.relationship_id IN ("Concept replaced by", "Maps to", "Concept same_as from", "Concept poss_eq from", "Concept was_a from", "Is a")
  -- keep only relationships that became valid between the two vocabulary versions
  AND (r.valid_start_date > '2018-06-26' AND r.valid_start_date < '2020-10-17')
ORDER BY r.relationship_id;
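To see what the query above selects, the same filtering logic can be tried on a toy in-memory database. The sketch below uses SQLite from the Python standard library as a stand-in for the actual warehouse; the `mapped_concepts` table, the three example concepts, and their relationships are hypothetical and only mimic the shape of the OMOP `concept` and `concept_relationship` tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE concept (concept_id INTEGER PRIMARY KEY, concept_name TEXT);
CREATE TABLE concept_relationship (
    concept_id_1 INTEGER, concept_id_2 INTEGER,
    relationship_id TEXT, valid_start_date TEXT);
CREATE TABLE mapped_concepts (concept_id INTEGER);  -- stand-in for the OMOP2OBO concept lists
""")
conn.executemany("INSERT INTO concept VALUES (?, ?)", [
    (1, "old concept"), (2, "replacement concept"), (3, "unrelated concept"),
])
conn.executemany("INSERT INTO concept_relationship VALUES (?, ?, ?, ?)", [
    (1, 2, "Concept replaced by", "2019-03-01"),  # inside the date window: kept
    (1, 3, "Is a", "2017-01-01"),                 # before the window: dropped
    (3, 2, "Maps to", "2019-05-01"),              # source not in the mapping set: dropped
])
conn.execute("INSERT INTO mapped_concepts VALUES (1)")

# ISO-formatted date strings compare correctly as text in SQLite.
rows = conn.execute("""
    SELECT DISTINCT r.relationship_id,
           c1.concept_id, c1.concept_name,
           c2.concept_id, c2.concept_name
    FROM concept_relationship r
    JOIN concept c1 ON c1.concept_id = r.concept_id_1
    JOIN concept c2 ON c2.concept_id = r.concept_id_2
    WHERE r.concept_id_1 IN (SELECT concept_id FROM mapped_concepts)
      AND r.relationship_id IN ('Concept replaced by', 'Maps to', 'Concept same_as from',
                                'Concept poss_eq from', 'Concept was_a from', 'Is a')
      AND r.valid_start_date > '2018-06-26'
      AND r.valid_start_date < '2020-10-17'
    ORDER BY r.relationship_id
""").fetchall()
print(rows)
```

Only the first relationship survives all three filters: its source concept is in the mapping set, its relationship type is one of the six of interest, and it became valid between the two vocabulary versions.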
The relationship_id column contains different relationships that can be utilized to explain the relationship between OMOP concept_ids. The relationship_ids included in the query above are organized such that they allow us to identify two types of scenarios:
- Newly Added Concepts: Concepts that did not exist in the version of the OMOP CDM used to create the OMOP2OBO mappings, but that do exist in the current CDM
- Updated Concepts: Concepts that existed in the version of the OMOP CDM used to create the OMOP2OBO mappings, but which have been updated and now exist under a new concept_id.
The table below organizes the OMOP CDM relationship_ids by scenario.
Scenario Type | Relationship_ID |
---|---|
Newly Added Concepts | Maps to |
Newly Added Concepts | Concept poss_eq from (synonyms) |
Newly Added Concepts | Concept same_as from (synonyms) |
Newly Added Concepts | Concept was_a from (concept type) |
Newly Added Concepts | Is a (concept type) |
Replaced Concept | Concept replaced by |
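The table can be applied to query results as a simple lookup. The sketch below is illustrative only: the dictionary mirrors the table, while the function name, the fallback label, and the example rows are hypothetical.

```python
from collections import Counter

# relationship_id -> scenario type, mirroring the table above.
SCENARIO_BY_RELATIONSHIP = {
    "Maps to": "Newly Added Concepts",
    "Concept poss_eq from": "Newly Added Concepts",
    "Concept same_as from": "Newly Added Concepts",
    "Concept was_a from": "Newly Added Concepts",
    "Is a": "Newly Added Concepts",
    "Concept replaced by": "Replaced Concept",
}


def scenario_for(relationship_id: str) -> str:
    """Return the scenario type for a relationship_id, or a fallback label."""
    return SCENARIO_BY_RELATIONSHIP.get(relationship_id, "Not in error analysis")


# Hypothetical (relationship_id, source_concept_id, target_concept_id) rows.
rows = [
    ("Is a", 1, 10),
    ("Maps to", 2, 11),
    ("Concept replaced by", 3, 12),
    ("Is a", 4, 13),
]
tally = Counter(scenario_for(relationship_id) for relationship_id, _, _ in rows)
print(tally)
```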
We used this information to categorize uncovered concepts (i.e., concepts included in the Concept Prevalence data sets, but missing from the OMOP2OBO mapping set). Specifically, for each clinical domain we obtained three lists:
1. Uncovered concepts in the error analysis data
2. Uncovered concepts in the OMOP2OBO mapping data, but ineligible for mapping
3. Uncovered concepts that were truly unable to be accounted for by existing data sources
For lists 1 and 2, we aimed to explain the uncovered concepts by categorizing them according to an explanation for their missingness (i.e., concept present in newer OMOP vocabulary or replaced concept). For all the lists, we also obtained prevalence information for each concept as the frequency of use within and across the Concept Prevalence data sites, which was used as a metric to measure the importance of each uncovered concept.
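The three lists can be expressed as set operations. The sketch below is an assumption-laden illustration: the function and variable names are hypothetical, and it assumes that a concept explained by the error analysis is assigned to that list before the excluded list is checked, so that every uncovered concept lands in exactly one list.

```python
from typing import Set, Tuple


def categorize_uncovered(
    uncovered: Set[int],
    error_analysis_ids: Set[int],
    excluded_ids: Set[int],
) -> Tuple[Set[int], Set[int], Set[int]]:
    """Split uncovered concepts into three disjoint lists."""
    error_set = uncovered & error_analysis_ids               # list 1
    excluded_set = (uncovered - error_set) & excluded_ids    # list 2
    truly_missing = uncovered - error_set - excluded_set     # list 3
    return error_set, excluded_set, truly_missing


# Hypothetical concept ids.
uncovered = {1, 2, 3, 4, 5}
error_set, excluded_set, truly_missing = categorize_uncovered(
    uncovered, error_analysis_ids={1, 2, 9}, excluded_ids={2, 3}
)
print(error_set, excluded_set, truly_missing)
```

Because the lists are disjoint, their sizes add up to the number of uncovered concepts, which matches how the per-domain counts are reported.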
Results are presented below by clinical domain. As shown in Figure 1, the OMOP vocabulary terms included in the OMOP2OBO mapping set covered most condition (92.51%) and drug ingredient (87.99%) concepts but far fewer measurement concepts (11.14%), and coverage differed by both Concept Prevalence study site and clinical domain.
Figure 1 presents the coverage of the OMOP2OBO mappings using Concept Prevalence study data. On the left, the distributions of the Overlap (i.e., OMOP concepts that exist in OMOP2OBO and one or more Concept Prevalence sites), Concept Prevalence Only, and OMOP2OBO Only sets are shown. On the right, the Error Analysis Concepts (i.e., concepts that can be accounted for in a newer OMOP CDM version), Excluded Set (i.e., purposefully unmapped or not yet mapped concepts), and Truly Missing (i.e., the concept's missingness cannot easily be accounted for) sets are shown. These distributions were created for condition concepts (A and D), drug ingredients (B and E), and measurements (C and F).
The OHDSI Concept Prevalence data contained 62,335 unique OMOP condition vocabulary concepts from 24 sites. After filtering the OMOP2OBO mappings to remove all non-standard concepts and all entries where all ontologies were "NONE" or "NOT YET MAPPED", 92,367 concepts remained eligible for use in the coverage study. This means that all purposefully unmapped concepts (i.e., findings, injuries, complications, and carrier status) were kept within the data set as long as at least one of the other mapped ontologies for the given concept was not an unmapped concept of type "NOT YET MAPPED". These data were utilized for all condition coverage experiments.
The OMOP2OBO condition set contained 92,367 OMOP condition concept ids, which covered 92.51% (weighted coverage: 99.46%) of the 62,335 Concept Prevalence condition concepts. There were 34,704 OMOP2OBO concepts that were not included in the Concept Prevalence set and 4,672 Concept Prevalence concepts that were not covered by the OMOP2OBO mappings. These findings are organized into three sets and visualized in Figure 1 (A):
- Overlap: 57,663 OMOP2OBO concepts (26,807 Concepts Used in Practice, 30,856 Standard Concepts Not Used in Practice) existed in OMOP2OBO and Concept Prevalence. On average, these concepts occurred 526.96 times (100.0-87,285,164.39).
- OMOP2OBO Only: 34,704 OMOP2OBO concepts (2,272 Concepts Used in Practice, 32,432 Standard Concepts Not Used in Practice) existed only in the OMOP2OBO set. On average, these concepts occurred 131.65 times (100.0-39,975.0).
- Concept Prevalence Only: 4,672 OMOP concepts existed only in the Concept Prevalence set. On average, these concepts occurred 173.57 times (100.0-8,254,186.5).
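The unweighted coverage figure follows directly from the set sizes above. The weighted figure is not defined on this page; the sketch below assumes it is the share of total concept frequency accounted for by covered concepts, which is only one plausible reading. The function name and the toy counts are hypothetical.

```python
from typing import Dict, Set, Tuple


def coverage(site_counts: Dict[int, int], mapped_ids: Set[int]) -> Tuple[float, float]:
    """Return (unweighted %, frequency-weighted %) coverage of site concepts."""
    covered = [c for c in site_counts if c in mapped_ids]
    unweighted = 100 * len(covered) / len(site_counts)
    # Assumed definition: covered frequency divided by total frequency.
    weighted = 100 * sum(site_counts[c] for c in covered) / sum(site_counts.values())
    return unweighted, weighted


# Toy data: three of four concepts are mapped, and they carry most of the usage.
toy_counts = {1: 700, 2: 100, 3: 100, 4: 100}
unweighted, weighted = coverage(toy_counts, mapped_ids={1, 2, 3})
print(unweighted, weighted)

# Reported condition set sizes: overlap / Concept Prevalence concepts.
reported = round(100 * 57_663 / 62_335, 2)
print(reported)
```

Under this reading, weighted coverage exceeds unweighted coverage whenever the uncovered concepts are rarely used, which is the pattern reported for all three domains.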
Coverage by Site
This phase of the experiment aimed to demonstrate the coverage of the OMOP2OBO condition occurrence concepts for each Concept Prevalence study site (Figure 2). Across the Concept Prevalence study sites, coverage ranged from 93.04-99.69%. A Chi-Square test of independence with Yates' correction was run and revealed a significant association between the database and coverage (χ²(23) = 7,559.11, p<0.0001). To better understand these findings, post-hoc tests were run using a Bonferroni adjustment and confirmed that 107 of the 276 pairwise database comparisons had significantly different OMOP2OBO coverage (ps<0.001).
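The pairwise post-hoc step can be sketched with the standard library alone. This is not the study's analysis code: the site names and counts are invented, the family-wise alpha of 0.05 is an assumption, and only the 2x2 pairwise comparisons are shown (not the omnibus test across all sites). With 24 sites there are 276 pairs, which is where the number of comparisons comes from.

```python
import math
from itertools import combinations


def yates_chi2_2x2(a: int, b: int, c: int, d: int):
    """Chi-square test with Yates' continuity correction for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    numerator = n * max(abs(a * d - b * c) - n / 2, 0) ** 2
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    stat = numerator / denominator
    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


# Number of pairwise comparisons for 24 sites (conditions) and 18 sites.
print(math.comb(24, 2), math.comb(18, 2))

# Hypothetical sites: (covered concepts, uncovered concepts).
sites = {"A": (950, 50), "B": (990, 10), "C": (948, 52)}
pairs = list(combinations(sorted(sites), 2))
alpha = 0.05 / len(pairs)  # Bonferroni-adjusted threshold (alpha = 0.05 is assumed)

significant = []
for s1, s2 in pairs:
    stat, p_value = yates_chi2_2x2(*sites[s1], *sites[s2])
    if p_value < alpha:
        significant.append((s1, s2))
print(significant)
```

In the toy data, site B differs from both A and C, while A and C are nearly identical, so only two of the three comparisons pass the adjusted threshold.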
Error Analysis
The results are visualized in Figure 1 (D). Of the 4,672 concepts not covered by OMOP2OBO, 367 could be accounted for by a newer version of the OMOP CDM (i.e., Error Analysis Concepts), 4,231 could be accounted for in the set of excluded mappings from the original mapping set (i.e., Excluded Concepts), and 74 concepts were missing and unable to be explained by existing data sources (i.e., Truly Missing Concepts). Additional details on each of these concept sets are provided below.
- Error Analysis Concepts: A total of 367 (7.86%) missing concepts were accounted for in the current version of the OMOP CDM using the OMOP concept_relationship table. These concepts occurred in an average of 2.64 Concept Prevalence study sites with a mean frequency of 27,412.262 (100-3,539,698.5). The 367 missing concepts could be traced to 1,423 source_concept_ids in the original OMOP2OBO map set using the following relationship_ids: Is a (n=1,225), Maps to (n=167), and Concept replaced by (n=31).
- Excluded Concepts: A total of 4,231 (90.56%) OMOP concepts could be found in the set of data which were initially filtered from the original OMOP2OBO mapping set. These concepts occurred in an average of 1.65 Concept Prevalence study sites and had a mean frequency of 6,139.32 (100-8,254,186.5). These concepts were initially excluded for one of the following reasons:
  - 3,400 OMOP concepts were of type "Standard Concepts Not Yet Used in Practice" with HP type "NOT YET MAPPED" and MONDO type "FINDING"
  - 796 OMOP concepts were of type "Standard Concepts Not Yet Used in Practice" with HP and MONDO type "NOT YET MAPPED"
  - 35 OMOP concepts were of type "Standard Concepts Not Yet Used in Practice" with HP and MONDO type "NONE"
- Truly Missing Concepts: A total of 74 (1.58%) OMOP concepts were truly missing. These concepts occurred in an average of 2.74 Concept Prevalence study sites and had a mean frequency of 5,320.06 (100-100,483). The top five most frequently occurring missing concepts were (with average frequency across the 24 sites and number of sites with concept):
  - increased fluid intake (n=100,483; 1 site)
  - disease caused by 2019-nCoV (n=93,585; 1 site)
  - polycystic ovary syndrome (n=62,900.33; 3 sites)
  - saddle embolus of pulmonary artery with acute cor pulmonale (n=22,324.40; 10 sites)
  - adjustment disorder with mixed anxiety and depressed mood (n=18,453; 1 site)
Domain expert review of these concepts found that they were likely missing as a result of being infrequently diagnosed in pediatric populations.
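The percentages reported for the three uncovered condition sets can be reproduced from the counts. A small sketch, with a hypothetical helper name:

```python
from typing import Dict


def breakdown(counts: Dict[str, int]) -> Dict[str, float]:
    """Return each category's share of the total, as a percentage rounded to 2 places."""
    total = sum(counts.values())
    return {name: round(100 * n / total, 2) for name, n in counts.items()}


# Counts of uncovered condition concepts reported above.
condition_uncovered = {
    "Error Analysis Concepts": 367,
    "Excluded Concepts": 4_231,
    "Truly Missing Concepts": 74,
}
shares = breakdown(condition_uncovered)
print(sum(condition_uncovered.values()), shares)
```

The same helper applies to the drug ingredient and measurement counts, since each domain reports the same three categories.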
Database Indices - 1: Ajou University Database; 2: IQVIA US Ambulatory Electronic Medical Record; 3: IQVIA Longitudinal Patient Data Australia; 4: IQVIA Disease Analyzer France; 5: IQVIA Disease Analyzer Germany; 6: The Healthcare Cost and Utilization Project Nationwide Inpatient Sample; 7: IQVIA US Hospital Charge Data Master; 8: IBM MarketScan Commercial Database; 9: IBM MarketScan Multi-State Medicaid Database; 10: IBM MarketScan Medicare Supplemental Database; 11: Japan Medical Data Center database; 12: Medical Information Mart for Intensive Care III; 13: Korea National Health Insurance Service/National Sample Cohort; 14: Optum De-Identified Clinformatics Data-Mart-Database—Date of Death; 15: Optum De-Identified Clinformatics Data-Mart-Database— Socio-Economic Status; 16: Optum De-identified Electronic Health Record Dataset; 17: IQVIA US LRxDx Open Claims; 18: Premier Healthcare Database; 19: University of Southern California PScanner; 20: Stanford Medicine Research Data Repository; 21: Tufts Medical Center Database; 22: University of Colorado Anschutz Medical Campus Health Group; 23: Australian Electronic Practice-based Research Network; 24: Columbia University Medical Center Database
(A) Across the Concept Prevalence study sites, coverage ranged from 93.04-99.69%. A Chi-Square test of independence with Yates' correction revealed a significant association between the site and coverage (p<0.0001). (B) Post-hoc tests with Bonferroni adjustment to correct for multiple comparisons confirmed that 107 of the 276 database comparisons had significantly different coverage (ps<0.001). (C) Frequency of covered OMOP2OBO concepts at each Concept Prevalence site. (D) Frequency of Concept Prevalence site concepts not covered by OMOP2OBO.
The OHDSI Concept Prevalence data contained 4,588 unique OMOP vocabulary concepts from 18 sites. The OMOP vocabulary concepts from each of these sites were compared to the list of concepts from the OMOP2OBO mappings. After filtering the OMOP2OBO mappings to remove all non-standard concepts and all entries where all ontologies were "NONE" or "NOT YET MAPPED", 8,615 concepts remained eligible for use in the coverage study. These data were utilized for all drug ingredient coverage experiments.
The OMOP2OBO drug ingredient set contained 8,615 OMOP drug ingredient concept ids, which covered 87.99% (weighted coverage: 99.92%) of the 4,588 Concept Prevalence drug ingredient concepts. There were 4,578 OMOP2OBO concepts that were not included in the Concept Prevalence set and 551 Concept Prevalence concepts that were not covered by the OMOP2OBO mappings. These findings are organized into three sets and visualized in Figure 1 (B):
- Overlap: 4,037 OMOP2OBO concepts (1,639 Concepts Used in Practice, 2,398 Standard Concepts Not Used in Practice) existed in OMOP2OBO and Concept Prevalence. On average, these concepts occurred 8,071.59 times (100.0-125,634,570.39).
- OMOP2OBO Only: 4,578 OMOP2OBO concepts (58 Concepts Used in Practice, 5,520 Standard Concepts Not Used in Practice) existed only in the OMOP2OBO set. On average, these concepts occurred 468.89 times (100.0-69,311.0).
- Concept Prevalence Only: 551 OMOP concepts existed only in the Concept Prevalence set. On average, these concepts occurred 801.2 times (100.0-1,795,364.83).
Coverage by Site
This phase of the experiment aimed to demonstrate the coverage of the OMOP2OBO drug ingredient concepts for each Concept Prevalence study site (Figure 3). Across the Concept Prevalence study sites, coverage ranged from 91.23-98.35%. A Chi-Square test of independence with Yates' correction revealed a significant association between the database and coverage (χ²(17) = 195.640, p<0.0001). To better understand these findings, post-hoc tests were run using a Bonferroni adjustment and confirmed that 53 of the 153 pairwise database comparisons had significantly different OMOP2OBO coverage (ps<0.001).
Error Analysis
Results are visualized in Figure 1 (E). Of the 551 concepts not covered by OMOP2OBO, five could be accounted for by a newer version of the OMOP CDM (i.e., Error Analysis Concepts), 456 could be accounted for in the set of excluded mappings from the original mapping set (i.e., Excluded Concepts), and 90 concepts were missing and unable to be explained by existing data sources (i.e., Truly Missing Concepts). Additional details on each of these concept sets are provided below.
- Error Analysis Concepts: A total of five (0.91%) missing concepts were accounted for in the current version of the OMOP CDM using the OMOP concept_relationship table. These concepts occurred in an average of 8.4 Concept Prevalence study sites and had a mean frequency of 51,732.04 (100-221,229.71). The five missing concepts could be traced to six source_concept_ids in the original OMOP2OBO map set using the Maps to (n=6) relationship.
- Excluded Concepts: A total of 456 (82.76%) OMOP concepts could be found in the set of data which were initially filtered from the original OMOP2OBO mapping set. These concepts occurred in an average of 3.88 Concept Prevalence study sites and had a mean frequency of 18,847.28 (100-1,077,258.9). These concepts were initially excluded for the following reason:
  - 456 OMOP concepts were of type "Standard Concepts Not Yet Used in Practice" with CHEBI, PRO, NCBITaxon, and VO type "NOT YET MAPPED"
- Truly Missing Concepts: A total of 90 (16.33%) OMOP concepts were truly missing. These concepts occurred in an average of 2.66 Concept Prevalence study sites and had a mean frequency of 3,361.15 (100-175,551.29). The top five most frequently occurring missing concepts were (with average frequency across the 24 sites and number of sites with concept):
  - hepatitis A virus strain CR 326F antigen, inactivated (n=175,551.29; 14 sites)
  - erenumab (n=60,618; 10 sites)
  - fremanezumab (n=15,579.60; 5 sites)
  - galcanezumab (n=11,594.80; 5 sites)
  - baloxavir marboxil (n=11,366.68; 3 sites)
Domain expert review of these concepts found that they were likely missing as a result of hospital vendor differences or were new high-risk biologics whose safety and efficacy had not yet been tested or confirmed in pediatric populations.
Database Indices - 1: IQVIA US Ambulatory Electronic Medical Record; 2: IQVIA Longitudinal Patient Data Australia; 3: IQVIA Disease Analyzer Germany; 4: IQVIA US Hospital Charge Data Master; 5: IBM MarketScan Commercial Database; 6: IBM MarketScan Multi-State Medicaid Database; 7: IBM MarketScan Medicare Supplemental Database; 8: Japan Medical Data Center database; 9: Optum De-Identified Clinformatics Data-Mart-Database— Socio-Economic Status; 10: Optum De-identified Electronic Health Record Dataset; 11: Optum De-identified Electronic Health Record Dataset; 12: Premier Healthcare Database; 13: University of Southern California PScanner; 14: Stanford Medicine Research Data Repository; 15: Tufts Medical Center Database; 16: University of Colorado Anschutz Medical Campus Health Group; 17: Australian Electronic Practice-based Research Network; 18: Columbia University Medical Center Database.
(A) Across the Concept Prevalence study sites, coverage ranged from 91.23-98.35%. A Chi-Square test of independence with Yates' correction revealed a significant association between the site and coverage (p<0.0001). (B) Post-hoc tests with Bonferroni adjustment to correct for multiple comparisons confirmed that 53 of the 153 database comparisons had significantly different coverage (ps<0.001). (C) Frequency of covered OMOP2OBO concepts at each Concept Prevalence site. (D) Frequency of Concept Prevalence site concepts not covered by OMOP2OBO.
The OHDSI Concept Prevalence data contained 23,513 unique OMOP vocabulary concepts from 18 sites. The OMOP vocabulary concepts from each of these sites were compared to the list of concepts from the OMOP2OBO mappings. After filtering the OMOP2OBO mappings to remove all non-standard concepts and all entries where all ontologies were "NONE", "UNSPECIFIED SAMPLE" or "UNMAPPED TEST TYPE", 3,827 concepts (10,673 lab test results) remained eligible for use in the coverage study. These data were utilized for all measurement result coverage experiments.
The OMOP2OBO measurement result set contained 3,827 OMOP measurement concept ids (10,673 lab test results), which covered 11.14% (weighted coverage: 67.72%) of the 23,513 Concept Prevalence concepts. There were 1,207 OMOP2OBO concepts that were not included in the Concept Prevalence set and 20,893 Concept Prevalence concepts that were not covered by the OMOP2OBO mappings. These findings are organized into three sets and visualized in Figure 1 (C):
- Overlap: 2,620 OMOP2OBO concepts (1,393 Concepts Used in Practice, 1,207 Standard Concepts Not Used in Practice) existed in OMOP2OBO and Concept Prevalence. On average, these concepts occurred 3,072.33 times (100.0-183,333,482.38).
- OMOP2OBO Only: 1,207 OMOP2OBO concepts (42 Concepts Used in Practice, 1,164 Standard Concepts Not Used in Practice) existed only in the OMOP2OBO set. On average, these concepts occurred 346.92 times (100.0-,842,485.0).
- Concept Prevalence Only: 20,893 OMOP concepts existed only in the Concept Prevalence set. On average, these concepts occurred 669.55 times (100.0-1,219,846,862.0).
Coverage by Site
This phase of the experiment aimed to demonstrate the coverage of the OMOP2OBO measurement concepts for each Concept Prevalence study site (Figure 4). Across the Concept Prevalence study sites, coverage ranged from 4.22-75%. A Chi-Square test of independence with Yates' correction revealed a significant association between the database and coverage (χ²(17) = 195.640, p<0.0001). To better understand these findings, post-hoc tests were run using a Bonferroni adjustment and confirmed that 93 of the 153 pairwise database comparisons had significantly different OMOP2OBO coverage (ps<0.001).
Error Analysis
Results are visualized in Figure 1 (F). Of the 20,893 concepts not covered by OMOP2OBO, 13 could be accounted for by the current version of the OMOP CDM (i.e., Error Analysis Concepts), 158 were accounted for in the set of excluded mappings from the original mapping set (i.e., Excluded Concepts), and 20,722 concepts were missing and unable to be explained by existing data sources (i.e., Truly Missing Concepts). Additional details on each of these concept sets are provided below:
- Error Analysis Concepts: A total of 13 (0.06%) missing concepts could be accounted for by a newer version of the OMOP CDM by tracing their original concept id to their new concept id using the OMOP concept_relationship table. These concepts occurred in an average of 3.23 Concept Prevalence study sites and had a mean frequency of 9,836.25 (100-29,098.2). The 13 missing concepts could be traced to 13 source_concept_ids in the original OMOP2OBO map set using the following relationship_ids: Maps to (n=2) and Concept replaced by (n=11).
- Excluded Concepts: A total of 158 (0.76%) OMOP concepts could be found in the set of data which was initially filtered from the original OMOP2OBO data source. These concepts occurred in an average of 5.18 Concept Prevalence study sites and had a mean frequency of 282,115.28 (100-14,317,951.9). These concepts were initially excluded for one of the following reasons:
  - 76 OMOP concepts had an "UNSPECIFIED SAMPLE"
  - 79 OMOP concepts had an "UNMAPPED TEST TYPE"
  - 3 OMOP concepts were unable to be mapped to an ontology
- Truly Missing Concepts: A total of 20,722 (99.18%) missing concepts were truly missing and unable to be accounted for by a current data source. These concepts occurred in an average of 2.82 Concept Prevalence study sites and had a mean frequency of 218,874.03 (100-1,219,846,862). The top five most frequently occurring missing concepts were (with average frequency across the 24 sites and number of sites with concept):
  - pulse intensity of unspecified artery palpation (n=1,219,846,862; 1 site)
  - penicillin g potassium [mass] of dose (n=253,609,945; 1 site)
  - sodium [moles/volume] in saliva (oral fluid) (n=246,641,211; 1 site)
  - cotinine/creatinine [mass ratio] in urine (n=246,063,202; 1 site)
  - chloride [moles/volume] in saliva (oral fluid) (n=234,931,483; 1 site)
Domain expert review of these concepts confirmed that missing concepts were likely due to inconsistencies in the use of LOINC. This finding is consistent with what has been observed in the literature (PMID:22306382).
Database Indices - 1: IQVIA US Ambulatory Electronic Medical Record; 2: IQVIA Longitudinal Patient Data Australia; 3: IQVIA Disease Analyzer France; 4: IQVIA Disease Analyzer Germany; 5: IBM MarketScan Commercial Database; 6: IBM MarketScan Medicare Supplemental Database; 7: Japan Medical Data Center database; 8: Medical Information Mart for Intensive Care III; 9: Korea National Health Insurance Service/National Sample Cohort; 10: Optum De-Identified Clinformatics Data-Mart-Database—Date of Death; 11: Optum De-Identified Clinformatics Data-Mart-Database— Socio-Economic Status; 12: Optum De-identified Electronic Health Record Dataset; 13: Premier Healthcare Database; 14: University of Southern California PScanner; 15: Stanford Medicine Research Data Repository; 16: University of Colorado Anschutz Medical Campus Health Group; 17: Australian Electronic Practice-based Research Network; 18: Columbia University Medical Center Database.
(A) Across the Concept Prevalence study sites, coverage ranged from 4.22-75%. A Chi-Square test of independence with Yates' correction revealed a significant association between the site and coverage (p<0.0001). (B) Post-hoc tests with Bonferroni adjustment to correct for multiple comparisons confirmed that 93 of the 153 database comparisons had significantly different coverage (ps<0.001). (C) Frequency of covered OMOP2OBO concepts at each Concept Prevalence site. (D) Frequency of Concept Prevalence site concepts not covered by OMOP2OBO.