---
layout: default
title: Building Variables defs
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Organizing Phenotypes}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
date: 2023-04-28
---
One of the important aspects of the PHESANT package is that it provides a characterization of all the phenotypes, thereby making analysis easier. We aim to provide a similar capability for NHANES; this document both describes what we did and serves as a template for others to enhance or correct our classifications.

We first load phonto and extract all questions together with the table name, variable name, SaSLabel, and so on. In the first part of this vignette we then label the variables that are probably not useful as phenotypes, such as survey weights and comment fields.
```r
library(phonto)

## extract all questions together with the table name, variable name,
## description and SAS label
t1 = "SELECT DISTINCT TableName, Variable, Description, SaSLabel FROM Metadata.QuestionnaireVariables"
xx = nhanesQuery(t1)

## make a vector to store our variable labels
outPut = rep(NA, nrow(xx))

## count the missing values in each column
apply(xx, 2, function(x) sum(is.na(x)))
```
```
## TableName Variable Description SaSLabel
##         0        0        3639     3472
```

We can see that only `Description` and `SaSLabel` have missing values.
```r
missLab = is.na(xx$SaSLabel) | is.na(xx$Description)
usedIndex = which(missLab)
outPut[missLab] = "Missing"
```
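If you want to see which variables these are, a quick look at the affected rows can help. This is purely an illustrative check, not part of the labeling itself:

```r
## peek at a few of the variables with a missing label or description
head(xx[missLab, c("TableName", "Variable")])
```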
And we can already see that some of the variables that start with WT have either a missing Description field or a missing SaSLabel field.
```r
## survey weights start with WT
isWT = grep("^WT", xx$Variable)
outPut[isWT] = "Survey Weight"
## variance estimates are also survey weights
isWT = grep("^SDMV", xx$Variable)
outPut[isWT] = "Survey Weight"

## now for interview questions - e.g. what language was it done in, etc.
## unlikely to be relevant, but maybe
g1 = grep("[Ii]nterview\\??$", xx$SaSLabel)
outPut[g1] = "Interview"

## comments - most variables end with LC, but not all... so we also look
## at the Description - the SaSLabel doesn't work, as they abbreviate
## comments in odd ways
g1 = grep("LC$", xx$Variable)
g2 = grep("[Cc]omment [Cc]ode$", xx$Description)
outPut[union(g1, g2)] = "Comment"
```
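As a quick sanity check (purely illustrative), you can compare how much the two comment criteria overlap:

```r
## how many variables match each criterion, and how many match both?
c(byName = length(g1), byDescription = length(g2),
  both = length(intersect(g1, g2)))
```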
Another set of variables that are not likely to be phenotypes are the interviewer ID codes.
g1 = grep("DR[12D]EXMER", xx$Variable)
outPut[g1] = "Interviewer ID code"
g2 = grep("[Rr]ecall [Ss]tatus", xx$SaSLabel)
outPut[g2] = "Recall Status"
g3 = grep("*LANG*", xx$Variable)
##table(outPut[g3], useNA = "always")
outPut[g3] = "Language Used"
## captured with Interview
## g4 = grep("*INTRP*", xx$Variable)
## table(outPut[g4], useNA = "always")
table(outPut, useNA="always")
```
## outPut
## Comment Interview Interviewer ID code Language Used Missing
##    2792       162                  66           106    3402
## Recall Status Survey Weight  <NA>
##            60          1214 42107
```
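To see which variables ended up under a given label, index back into the metadata. The first part of the snippet below is illustrative; the second part sketches how you might enhance or correct the classification with a rule of your own (the regular expression and the label string are assumptions, not part of our classification):

```r
## look at a few of the variables labeled as survey weights
head(xx$Variable[which(outPut == "Survey Weight")])

## sketch of adding your own rule, e.g. pregnancy-related variables
## (pattern and label are assumptions - adjust as needed)
## gNew = grep("[Pp]regnan", xx$SaSLabel)
## outPut[gNew] = "Pregnancy Status"
```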
The code below can be used to build a searchable corpus with the corpustools package. Once constructed, the corpus is put into the inst/extdata subdirectory of the package, where it can be reused. The corpus should be rebuilt for every new release, as the mappings to the ontologies, the ontologies themselves, and the database may all have changed. This code is not run as part of the vignette testing; rather, it documents how to build the corpus for a release.
library("corpustools")
nhanes_df = xx
nhanes_df$Unique = paste0(nhanes_df$Questionnaire,"_", nhanes_df$Variable)
nhanes_tc = create_tcorpus(nhanes_df, doc_column = 'Unique', text_columns = 'Description')
nhanes_tc$preprocess(use_stemming = TRUE, remove_stopwords=TRUE)
h1 = search_features(nhanes_tc, query = c(`"blood pressure"`))
h2 = search_features(nhanes_tc, query = "hypertension")
h3 = search_features(nhanes_tc, query="LDL")
save=FALSE
if( save ) {
path="/HostData/Laha/phonto/inst/extdata"
save(nhanes_tc, file= paste0(path, "/nhanes_tc.rda"), compress="xz")
save(nhanes_df, file=paste0(path, "/nhanes_df.rda"), compress="xz")
}
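Once saved and installed with the package, the corpus can be loaded and queried. The snippet below is a minimal sketch: it assumes the .rda file was installed under extdata as above, and that the hits of a featureHits object can be inspected via its `$hits` slot (see `?search_features` for the authoritative interface):

```r
## load the prebuilt corpus shipped with phonto (assumes the .rda file
## was installed under extdata as above)
load(system.file("extdata", "nhanes_tc.rda", package = "phonto"))

## inspect the hits for the "blood pressure" query; the doc_id column
## holds the TableName_Variable identifier we constructed earlier
h1 = search_features(nhanes_tc, query = '"blood pressure"')
head(h1$hits)
```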
And more from DEMO_I - it is not entirely clear, but they do not seem to be useful phenotypes:

- SDDSRVYR - Data release cycle
- RIDSTATR - Interview/Examination status
- SIAPROXY - Proxy used in SP Interview?
- SIAINTRP - Interpreter used in SP Interview?
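These could be folded into the labeling above; a small sketch (the label string is an assumption, choose whatever fits your scheme):

```r
## mark the DEMO administrative variables as non-phenotypes (illustrative;
## the label string is an assumption)
gDemo = which(xx$Variable %in% c("SDDSRVYR", "RIDSTATR", "SIAPROXY", "SIAINTRP"))
outPut[gDemo] = "Study Administration"
```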