---
layout: default
title: "Building Variables defs"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Organizing Phenotypes}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
date: "2023-04-28"
---

Get the data....

One of the important aspects of the PHESANT package is that it provides a characterization of all the phenotypes, which makes analysis easier.

We aim to provide a similar capability for NHANES. This document provides both a description of what we did and a template for others to enhance or correct our classifications.

We first load phonto and extract all questions together with the table name, the variable name, the Description, and the SaSLabel. In the first part of this vignette we then label the variables that are probably not useful as phenotypes, such as survey weights and comment fields.

library(phonto)
## pull every questionnaire variable with its table, description, and SAS label
t1 = "SELECT DISTINCT TableName, Variable, Description, SaSLabel FROM Metadata.QuestionnaireVariables"
xx = nhanesQuery(t1)
## make a vector to store our variable labels
outPut = rep(NA, nrow(xx))
## count the missing values in each column
apply(xx, 2, function(x) sum(is.na(x)))
##   TableName    Variable Description    SaSLabel 
##           0           0        3639        3472
## we can see that only Description and SaSLabel are missing
##  TableName    Variable Description    SaSLabel
##          0           0         505         338
missLab = is.na(xx$SaSLabel) | is.na(xx$Description)
usedIndex = which(missLab)
outPut[missLab] = "Missing"
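
To make the missing-value step concrete, here is a minimal sketch on a small data frame. The rows of `toy` are invented for illustration and are not taken from the database, and only base R is assumed. `colSums(is.na(...))` gives the same counts as the `apply()` call.

```r
# Toy stand-in for the query result; the rows are invented for illustration.
toy = data.frame(
  TableName   = c("DEMO_I", "DEMO_I", "BPX_I"),
  Variable    = c("SEQN", "WTINT2YR", "BPXSY1"),
  Description = c("Respondent sequence number", NA, "Systolic blood pressure"),
  SaSLabel    = c("Respondent sequence number", NA, NA),
  stringsAsFactors = FALSE
)

# Number of missing values in each column.
naCounts = colSums(is.na(toy))

# Rows where either text field is missing get the "Missing" label.
toyLabels = rep(NA_character_, nrow(toy))
toyLabels[is.na(toy$SaSLabel) | is.na(toy$Description)] = "Missing"
```

A row is flagged as soon as one of the two text fields is missing, so `BPXSY1` is labelled here even though it has a Description.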

Any survey weights

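We can already see that some of the variables that start with WT have either a missing Description field or a missing SaSLabel field. Because the survey-weight label is assigned after the "Missing" label, it overwrites it for those variables.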

isWT = grep("^WT", xx$Variable)
outPut[isWT] = "Survey Weight"

## variance estimation variables are also labelled as survey weights
isWT = grep("^SDMV", xx$Variable)
outPut[isWT] = "Survey Weight"

## now for interview questions, e.g. what language the interview was done in;
## unlikely to be relevant, but maybe

## grep() takes a regular expression, not a glob, so no leading "*" is needed
g1 = grep("[Ii]nterview\\??$", xx$SaSLabel)
outPut[g1] = "Interview"

## comments - most comment variables end with LC, but not all, so we also
## look at the Description; the SaSLabel doesn't work because it abbreviates
## "comment" in inconsistent ways
g1 = grep("LC$", xx$Variable)
g2 = grep("[Cc]omment [Cc]ode$", xx$Description)
outPut[union(g1, g2)] = "Comment"
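
The two end-anchored patterns are easy to misread, so here is a small sketch of what they do and do not match. The example strings are assumptions: two of the labels are modeled on labels quoted later in this document and the rest are invented.

```r
# Example SAS labels (illustrative only).
labels = c("Proxy used in SP Interview?",
           "Interview/Examination status",
           "Language of SP Interview",
           "Interviewer ID code")

# Matches labels that end in "Interview" or "interview", with an optional "?".
isInterview = grepl("[Ii]nterview\\??$", labels)

# Example descriptions (invented for illustration).
descriptions = c("Blood lead comment code",
                 "Comment code for cotinine",
                 "Blood lead (ug/dL)")

# Only descriptions that END in "comment code" match, because of the "$" anchor.
isComment = grepl("[Cc]omment [Cc]ode$", descriptions)
```

Because both patterns are anchored at the end of the string, a label such as "Interview/Examination status" is not caught and would need its own rule.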

Another set of variables that are unlikely to be phenotypes are the interviewer ID codes, along with the recall status and interview language fields.

g1 = grep("DR[12D]EXMER", xx$Variable)
outPut[g1] = "Interviewer ID code"

g2 = grep("[Rr]ecall [Ss]tatus", xx$SaSLabel)
outPut[g2] = "Recall Status"

## variable names containing LANG
g3 = grep("LANG", xx$Variable)
##table(outPut[g3], useNA = "always")

outPut[g3] = "Language Used"

## captured with Interview
## g4 = grep("*INTRP*", xx$Variable)
## table(outPut[g4], useNA = "always")
table(outPut, useNA = "always")
## outPut
##             Comment           Interview Interviewer ID code       Language Used             Missing 
##                2792                 162                  66                 106                3402 
##       Recall Status       Survey Weight                <NA> 
##                  60                1214               42107
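
The labelling steps above are applied one after another, so a later rule overwrites an earlier one for any variable that both match. The following sketch expresses the same idea as a table of rules applied in order. The `rules` data frame and the toy rows are hypothetical and are not part of phonto; only base R is assumed.

```r
# Toy variable metadata; names follow NHANES conventions but the rows are invented.
toy = data.frame(
  Variable = c("WTINT2YR", "SDMVPSU", "DR1EXMER", "BPXSY1", "SIALANG"),
  SaSLabel = c(NA, "Masked variance pseudo-PSU", "Interviewer ID code",
               "Systolic blood pressure", "Language of SP Interview"),
  stringsAsFactors = FALSE
)

# Hypothetical rule table: the label to assign, the column to search, and a regex.
rules = data.frame(
  label   = c("Survey Weight", "Survey Weight", "Interviewer ID code", "Language Used"),
  column  = c("Variable", "Variable", "Variable", "Variable"),
  pattern = c("^WT", "^SDMV", "DR[12D]EXMER", "LANG"),
  stringsAsFactors = FALSE
)

# Start with the "Missing" label, then apply the rules in order.
toyLabels = rep(NA_character_, nrow(toy))
toyLabels[is.na(toy$SaSLabel)] = "Missing"

for (i in seq_len(nrow(rules))) {
  hits = grep(rules$pattern[i], toy[[rules$column[i]]])
  toyLabels[hits] = rules$label[i]  # later rules overwrite earlier labels
}
```

In this toy example `WTINT2YR` starts out as "Missing" and ends up as "Survey Weight", which is the overwriting behaviour described above. Keeping the rules in a table makes the order explicit and gives others one place to add or correct a classification.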

Build a searchable corpus

The code below builds a searchable corpus using the corpustools package. Once constructed, the corpus is saved in the inst/extdata subdirectory of the package, from which it can be loaded and searched. The corpus should be rebuilt for every new release, because the ontologies, the mappings to them, and the DB may all have changed.

This code is not run as part of the vignette testing; it documents how to rebuild the corpus for a release.

library("corpustools")
nhanes_df = xx
## unique document id from the table name (the TableName column of xx) and the variable name
nhanes_df$Unique = paste0(nhanes_df$TableName, "_", nhanes_df$Variable)
nhanes_tc = create_tcorpus(nhanes_df, doc_column = 'Unique', text_columns = 'Description')
nhanes_tc$preprocess(use_stemming = TRUE, remove_stopwords = TRUE)

## a quoted phrase matches the words in sequence
h1 = search_features(nhanes_tc, query = '"blood pressure"')

h2 = search_features(nhanes_tc, query = "hypertension")
h3 = search_features(nhanes_tc, query = "LDL")

## set to TRUE to write the corpus and data frame into the package for a release
doSave = FALSE
if (doSave) {
  path = "/HostData/Laha/phonto/inst/extdata"
  save(nhanes_tc, file = file.path(path, "nhanes_tc.rda"), compress = "xz")
  save(nhanes_df, file = file.path(path, "nhanes_df.rda"), compress = "xz")
}
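
Files placed in inst/extdata are installed under extdata in the package, so a saved corpus can be located with `system.file()`. The helper below is a sketch only: `loadCorpus` is a hypothetical function, not part of phonto, and it assumes the installed package actually ships `nhanes_tc.rda` containing an object named `nhanes_tc`.

```r
# Hypothetical helper: load a saved object from a package's extdata directory.
# Returns NULL when the package or the file is not available.
loadCorpus = function(pkg = "phonto", file = "nhanes_tc.rda", objName = "nhanes_tc") {
  path = system.file("extdata", file, package = pkg)
  if (!nzchar(path)) {
    return(NULL)  # system.file() returns "" when nothing is found
  }
  env = new.env()
  load(path, envir = env)  # load into a private environment, not the workspace
  if (!exists(objName, envir = env, inherits = FALSE)) {
    return(NULL)
  }
  get(objName, envir = env, inherits = FALSE)
}

nhanes_tc_saved = loadCorpus()
```

Loading into a separate environment avoids silently overwriting an object of the same name in the user's workspace.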

A few more variables from DEMO_I do not seem to be useful phenotypes, although this is less clear-cut:

- SDDSRVYR - Data release cycle...
- RIDSTATR - Interview/Examination status
- SIAPROXY - Proxy used in SP Interview?
- SIAINTRP - Interpreter used in SP Interview?
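
Variables like these are few and have fixed names, so they can be labelled by exact match rather than by pattern. The sketch below uses a toy vector of variable names, and the label "Administrative" is an assumption chosen for illustration; it is not a label used elsewhere in this vignette.

```r
# Toy vector of DEMO_I variable names; SEQN and RIDAGEYR are included as non-matches.
toyVars = c("SEQN", "SDDSRVYR", "RIDSTATR", "RIDAGEYR", "SIAPROXY", "SIAINTRP")

# The four variables listed above.
adminVars = c("SDDSRVYR", "RIDSTATR", "SIAPROXY", "SIAINTRP")

# Exact matching with %in% avoids accidental partial matches.
toyLabels = rep(NA_character_, length(toyVars))
toyLabels[toyVars %in% adminVars] = "Administrative"  # hypothetical label
```

Applied to the real data, the same assignment would index `outPut` with `xx$Variable %in% adminVars`.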