---
layout: default
title: "Building Variables defs"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Organizing Phenotypes}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
# output: html_document
date: "2023-04-28"
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
## Get the data

One important feature of the PHESANT package is that it provides a characterization of every phenotype, which makes analysis easier.

We aim to provide a similar capability for NHANES. This document both describes what we did and serves as a template for others to enhance or correct our classifications.

We first load `phonto` and extract all questions together with the table name, the variable name, the description, and the SAS label. In the first part of this vignette we label the variables that are probably not useful as phenotypes, such as survey weights and comment fields.

```{r cars}
library(phonto)
# t1 = paste0("SELECT TableName, Variable, Description, SaSLabel FROM ",phonto:::MetadataTable("QuestionnaireVariables"))
xx = phonto:::metadata_var() |> dplyr::select(TableName, Variable, Description, SasLabel)
# make a character vector to store our variable labels
outPut = rep(NA_character_, nrow(xx))
# count the missing values in each column
colSums(is.na(xx))
## we can see that only Description and SasLabel have missing values
##   TableName    Variable Description    SasLabel
##           0           0         505         338
# flag every variable where either text field is missing
missLab = is.na(xx$SasLabel) | is.na(xx$Description)
usedIndex = which(missLab)
outPut[missLab] = "Missing"
```
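
The missing-value step can be tried without database access. The sketch below is only an illustration: `toy` and its rows are invented stand-ins for the metadata table, not real NHANES metadata, and `toyLabels` plays the role of `outPut`.

```r
# Toy stand-in for the metadata table; all rows are invented for illustration
toy = data.frame(
  TableName   = c("TAB_A", "TAB_A", "TAB_B", "TAB_C"),
  Variable    = c("WTXXX01", "AGE01", "BP01", "ID01"),
  Description = c(NA, "Age of the participant", "Systolic blood pressure", "Interviewer ID code"),
  SasLabel    = c("Interview weight", "Age", NA, "Interviewer ID code"),
  stringsAsFactors = FALSE
)

# One label slot per variable, initially unclassified
toyLabels = rep(NA_character_, nrow(toy))

# Number of missing values in each column
naCounts = colSums(is.na(toy))
print(naCounts)

# A variable is flagged when either text field is missing
toyMissing = is.na(toy$SasLabel) | is.na(toy$Description)
toyLabels[toyMissing] = "Missing"
print(data.frame(Variable = toy$Variable, Label = toyLabels))
```

Starting from `NA_character_` rather than a plain `NA` keeps the label vector of type character from the outset, so later assignments do not change its type.
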
## Survey weights and other administrative variables

Some of the variables whose names start with WT have a missing Description field or a missing SasLabel field. The assignments below overwrite the "Missing" label for those variables.

```{r weights}
## survey weights start with WT
isWT = grep("^WT", xx$Variable)
outPut[isWT] = "Survey Weight"
## the variance estimation variables (SDMV*) are grouped with the survey weights
isSDMV = grep("^SDMV", xx$Variable)
outPut[isSDMV] = "Survey Weight"
## interview questions, e.g. which language the interview was conducted in;
## unlikely to be relevant, but maybe
g1 = grep("[Ii]nterview\\??$", xx$SasLabel)
outPut[g1] = "Interview"
## comments: most comment variables end in LC, but not all, so we also
## search the Description; the SasLabel is not reliable here because
## "comment" is abbreviated inconsistently
g1 = grep("LC$", xx$Variable)
g2 = grep("[Cc]omment [Cc]ode$", xx$Description)
outPut[union(g1,g2)] = "Comment"
```
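
Because each assignment writes into the same vector, a later label replaces an earlier one for any variable that matches both rules. The sketch below illustrates this on invented variable names and descriptions (`toyVars`, `toyDescr` and `toyLab` are assumptions for illustration only), and also shows how `value = TRUE` can be used to inspect what a pattern matches before it is used for labelling.

```r
# Invented variable names and descriptions, for illustration only
toyVars  = c("WTXXX01", "SDMVXXX", "XYZ01LC", "XYZ02", "BP01")
toyDescr = c(NA, "Masked variance unit", "Some comment code",
             "Another Comment Code", "Systolic blood pressure")

toyLab = rep(NA_character_, length(toyVars))
toyLab[is.na(toyDescr)] = "Missing"

# value = TRUE returns the matching strings rather than their positions
print(grep("^WT", toyVars, value = TRUE))

# The first variable was "Missing"; this assignment overwrites that label
toyLab[grep("^WT|^SDMV", toyVars)] = "Survey Weight"

# union() combines matches on the variable name and on the description
isComment = union(grep("LC$", toyVars),
                  grep("[Cc]omment [Cc]ode$", toyDescr))
toyLab[isComment] = "Comment"

print(setNames(toyLab, toyVars))
```

If the "Missing" label should take priority instead, the rules can simply be applied in the opposite order.
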

Another set of variables that are unlikely to be phenotypes are the interviewer ID codes, together with the recall status and interview language variables.

```{r interviewerID}
g1 = grep("DR[12D]EXMER", xx$Variable)
outPut[g1] = "Interviewer ID code"
g2 = grep("[Rr]ecall [Ss]tatus", xx$SasLabel)
outPut[g2] = "Recall Status"
g3 = grep("LANG", xx$Variable)
## table(outPut[g3], useNA = "always")
outPut[g3] = "Language Used"
## the interpreter variables are already captured with Interview
## g4 = grep("INTRP", xx$Variable)
## table(outPut[g4], useNA = "always")
```
```{r classes, echo=TRUE}
table(outPut, useNA="always")
```
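
Once the labels are assigned, they can be attached to the metadata so that the still unclassified variables are easy to pull out for review. This is a minimal sketch on invented data; `toy`, `Category` and `unclassified` are hypothetical names, not part of `phonto`.

```r
# Invented variables and labels standing in for xx and outPut
toy = data.frame(
  Variable = c("WTXXX01", "ID01", "BP01", "LDL01"),
  stringsAsFactors = FALSE
)
toyLabels = c("Survey Weight", "Interviewer ID code", NA, NA)

# Attach the labels as a new column
toy$Category = toyLabels

# Variables that no rule has matched so far are candidate phenotypes
unclassified = toy[is.na(toy$Category), ]
print(unclassified)

# Fraction of variables that have been given a label
fracLabelled = mean(!is.na(toy$Category))
print(fracLabelled)
```
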
## Build a searchable corpus

The code below builds a searchable corpus with the `corpustools` package. Once constructed, the corpus is saved into the `inst/extdata` subdirectory, from where it can be loaded and searched. The corpus should be rebuilt for every new release, because the ontologies, the mappings to them, and the database may all have changed.

This chunk is not run when the vignette is built; it documents how to build the corpus for a release.

```{r eval=FALSE}
library("corpustools")
nhanes_df = xx
## build a unique document ID from the table name and the variable name
nhanes_df$Unique = paste0(nhanes_df$TableName, "_", nhanes_df$Variable)
nhanes_tc = create_tcorpus(nhanes_df, doc_column = 'Unique', text_columns = 'Description')
nhanes_tc$preprocess(use_stemming = TRUE, remove_stopwords = TRUE)
## example searches; a multi-word phrase is wrapped in double quotes
h1 = search_features(nhanes_tc, query = '"blood pressure"')
h2 = search_features(nhanes_tc, query = "hypertension")
h3 = search_features(nhanes_tc, query = "LDL")
## set to TRUE to write the corpus and the data frame to inst/extdata
doSave = FALSE
if (doSave) {
  path = "/HostData/Laha/phonto/inst/extdata"
  save(nhanes_tc, file = file.path(path, "nhanes_tc.rda"), compress = "xz")
  save(nhanes_df, file = file.path(path, "nhanes_df.rda"), compress = "xz")
}
```
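
The document ID column is expected to hold one distinct value per row, and the same variable name can appear in more than one table. A quick base-R check before building the corpus might look like the sketch below; `toy` and its rows are invented for illustration, and combining the table name with the variable name is an assumption about what makes a row unique.

```r
# Invented rows: the same variable name appears in two tables
toy = data.frame(
  TableName = c("TAB_A", "TAB_B", "TAB_C"),
  Variable  = c("BP01", "BP01", "AGE01"),
  stringsAsFactors = FALSE
)

# The variable name alone is not unique
varRepeats = anyDuplicated(toy$Variable) > 0
print(varRepeats)

# Combining table name and variable name gives one ID per row
toy$Unique = paste0(toy$TableName, "_", toy$Variable)
dupIDs = toy$Unique[duplicated(toy$Unique)]

# Stop early if any ID is repeated
stopifnot(length(dupIDs) == 0)
print(toy$Unique)
```
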

A few more variables from DEMO_I do not seem to be useful phenotypes, although this is less clear-cut:

- `SDDSRVYR` - Data release cycle...
- `RIDSTATR` - Interview/Examination status
- `SIAPROXY` - Proxy used in SP Interview?
- `SIAINTRP` - Interpreter used in SP Interview?
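
Variables like the ones listed above have no common naming pattern, so they could be labelled by exact name rather than by regular expression. The sketch below shows one way to do that; the label "Administrative" and the names `toyVars`, `toyLab` and `adminVars` are hypothetical and are not used elsewhere in this vignette.

```r
# A few variable names; the last one stands in for an ordinary phenotype
toyVars = c("SDDSRVYR", "RIDSTATR", "SIAPROXY", "SIAINTRP", "AGE01")
toyLab  = rep(NA_character_, length(toyVars))

# Exact names to label; %in% avoids accidental partial matches
adminVars = c("SDDSRVYR", "RIDSTATR", "SIAPROXY", "SIAINTRP")
toyLab[toyVars %in% adminVars] = "Administrative"  # hypothetical label

print(setNames(toyLab, toyVars))
```
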