vignettes/VariableClassification.Rmd

---
layout: default
title: "Building Variable Definitions"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Organizing Phenotypes}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
# output: html_document
date: "2023-04-28"
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

## Get the data
One important aspect of the PHESANT package is that it provides some characterization of all the phenotypes, which makes analysis easier.

We aim to provide a similar capability for NHANES. This document both describes what we did and serves as a template for others to enhance or correct our classifications.

We first load `phonto` and extract all questions together with the table name, the variable name, the Description, and the SasLabel.  In the first part of this vignette we label the variables that are probably not useful as phenotypes, such as survey weights and comment fields.

```{r cars}
library(phonto)
# t1 = paste0("SELECT TableName, Variable, Description, SaSLabel FROM ",phonto:::MetadataTable("QuestionnaireVariables"))
xx = phonto:::metadata_var() |> dplyr::select(TableName, Variable, Description, SasLabel)
## make a vector to store our variable labels
outPut = rep(NA, nrow(xx))
## count the missing values in each column
apply(xx, 2, function(x) sum(is.na(x)))
## only Description and SasLabel have missing values
##  TableName    Variable Description    SasLabel
##          0           0         505         338
missLab = is.na(xx$SasLabel) | is.na(xx$Description)
usedIndex = which(missLab)
outPut[missLab] = "Missing"
```
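
To see what this step does without a database connection, here is a minimal sketch on a toy data frame. The rows and the names `toy`, `toyNA`, `toyMissing` and `toyLabels` are made up for illustration and are not taken from the NHANES metadata.

```r
# Toy stand-in for the metadata table; the rows are invented for illustration.
toy = data.frame(
  TableName   = c("DEMO_I", "DEMO_I", "BPX_I"),
  Variable    = c("RIAGENDR", "WTINT2YR", "BPXSY1"),
  Description = c("Gender of the participant.", NA, "Systolic blood pressure, first reading"),
  SasLabel    = c("Gender", "Full sample 2 year interview weight", NA),
  stringsAsFactors = FALSE
)

# Number of missing values per column (same result as the apply() call above).
toyNA = colSums(is.na(toy))

# A row is flagged when either text field is missing.
toyMissing = is.na(toy$SasLabel) | is.na(toy$Description)
toyLabels = rep(NA_character_, nrow(toy))
toyLabels[toyMissing] = "Missing"
toyLabels
```

Rows flagged as "Missing" here can still be relabeled by later rules, because each later assignment simply overwrites the earlier value.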

## Any survey weights


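Some of the variables whose names start with WT have a missing Description or SasLabel, so they were labeled "Missing" above. The code below relabels them as survey weights. The same chunk also labels interview questions and comment codes.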
```{r weights}
## survey weights all start with WT
isWT = grep("^WT", xx$Variable)
outPut[isWT] = "Survey Weight"

## the variance estimation variables (SDMV prefix) are grouped with the survey weights
isWT = grep("^SDMV", xx$Variable)
outPut[isWT] = "Survey Weight"

## now for interview questions, e.g. what language the interview was done in;
## unlikely to be relevant, but maybe
g1 = grep("[Ii]nterview\\??$", xx$SasLabel)
outPut[g1] = "Interview"

## comments: most comment variables end with LC, but not all, so we also
## look at the Description; the SasLabel doesn't work because comments
## are abbreviated inconsistently there
g1 = grep("LC$", xx$Variable)
g2 = grep("[Cc]omment [Cc]ode$", xx$Description)
outPut[union(g1, g2)] = "Comment"
```
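
For anyone extending the classification, the sequence of `grep` calls above could also be written as a table of rules. The following is only a sketch: `applyRules`, `rules` and `toyMeta` are hypothetical names that are not part of `phonto`, and the toy rows are illustrative rather than copied from the database.

```r
# Hypothetical helper: apply (column, pattern, label) rules in order.
# Later rules overwrite earlier ones, as with the sequential assignments above.
applyRules = function(meta, rules, labels = rep(NA_character_, nrow(meta))) {
  for (i in seq_len(nrow(rules))) {
    hits = grep(rules$pattern[i], meta[[rules$column[i]]])
    labels[hits] = rules$label[i]
  }
  labels
}

# One row per rule; add new rows here to extend the classification.
rules = data.frame(
  column  = c("Variable", "Variable", "SasLabel", "Variable"),
  pattern = c("^WT", "^SDMV", "[Ii]nterview\\??$", "LC$"),
  label   = c("Survey Weight", "Survey Weight", "Interview", "Comment"),
  stringsAsFactors = FALSE
)

# Toy metadata, invented for illustration.
toyMeta = data.frame(
  Variable = c("WTINT2YR", "SDMVPSU", "SIAPROXY", "URDUMALC", "BPXSY1"),
  SasLabel = c("Full sample 2 year interview weight", "Masked variance pseudo-PSU",
               "Proxy used in SP Interview?", "Albumin, urine comment code",
               "Systolic: Blood pres (1st rdg) mm Hg"),
  stringsAsFactors = FALSE
)

applyRules(toyMeta, rules)
```

Keeping the rules in a data frame makes the order of precedence explicit and lets a contributor correct a classification by editing one row.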

Another set of variables that are not likely to be phenotypes are the interviewer IDs, along with the recall status and the language used.

```{r interviewerID}
g1 = grep("DR[12D]EXMER", xx$Variable)
outPut[g1] = "Interviewer ID code"

g2 = grep("[Rr]ecall [Ss]tatus", xx$SasLabel)
outPut[g2] = "Recall Status"

## match any variable name that contains LANG
g3 = grep("LANG", xx$Variable)
##table(outPut[g3], useNA = "always")

outPut[g3] = "Language Used"

## captured with Interview
## g4 = grep("*INTRP*", xx$Variable)
## table(outPut[g4], useNA = "always")
```


```{r classes, echo=TRUE}
 table(outPut, useNA="always")
```
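
The variables that are still `NA` after these steps are the ones left as candidate phenotypes. A minimal sketch of how they might be listed, using toy vectors in place of `xx` and `outPut` (the names `toyVars`, `toyOut` and `remaining` are hypothetical):

```r
# Toy stand-ins for xx$Variable and outPut; values invented for illustration.
toyVars = data.frame(
  Variable = c("WTINT2YR", "BPXSY1", "LBDLDL", "DR1EXMER"),
  stringsAsFactors = FALSE
)
toyOut = c("Survey Weight", NA, NA, "Interviewer ID code")

# Attach the labels to the table so they travel with the variable names.
toyVars$Class = toyOut

# Variables that no rule has labeled yet.
remaining = toyVars$Variable[is.na(toyVars$Class)]
remaining
```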

## Build a searchable corpus

The code below builds a searchable corpus using the `corpustools` package.  Once constructed, the corpus is saved in the `inst/extdata` subdirectory, where it is available for use.  The corpus should be rebuilt for every new release, since the ontologies, the mappings to them, and the DB may all have changed.

This chunk is not run as part of the vignette testing; it documents how to build the corpus for a release.

```{r eval=FALSE}
library("corpustools")
nhanes_df = xx
## build a unique document ID from the table name and the variable name
nhanes_df$Unique = paste0(nhanes_df$TableName, "_", nhanes_df$Variable)
nhanes_tc = create_tcorpus(nhanes_df, doc_column = 'Unique', text_columns = 'Description')
nhanes_tc$preprocess(use_stemming = TRUE, remove_stopwords = TRUE)

## example queries; a multi-word phrase is wrapped in double quotes inside the query string
h1 = search_features(nhanes_tc, query = '"blood pressure"')
h2 = search_features(nhanes_tc, query = "hypertension")
h3 = search_features(nhanes_tc, query = "LDL")

## set saveCorpus to TRUE to write the corpus and the data frame into the package
saveCorpus = FALSE
if (saveCorpus) {
  path = "/HostData/Laha/phonto/inst/extdata"
  save(nhanes_tc, file = paste0(path, "/nhanes_tc.rda"), compress = "xz")
  save(nhanes_df, file = paste0(path, "/nhanes_df.rda"), compress = "xz")
}
```
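
Once the corpus has been saved and the package reinstalled, it might be loaded along the following lines. This is a sketch that assumes `nhanes_tc.rda` ends up in the installed package's `extdata` directory under that name. `system.file()` returns an empty string when the file or the package is not found, so the code falls back to a message in that case.

```r
# Assumption: the installed phonto package ships extdata/nhanes_tc.rda.
corpusPath = system.file("extdata", "nhanes_tc.rda", package = "phonto")

if (nzchar(corpusPath) && requireNamespace("corpustools", quietly = TRUE)) {
  load(corpusPath)  # restores the object that was saved as `nhanes_tc`
  hits = corpustools::search_features(nhanes_tc, query = "hypertension")
  print(hits)
} else {
  message("Corpus file or corpustools not available; rebuild the corpus first.")
}
```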
A few more variables from DEMO_I are of unclear use, but they do not seem to be useful phenotypes:

- SDDSRVYR - Data release cycle...
- RIDSTATR - Interview/Examination status
- SIAPROXY - Proxy used in SP Interview?
- SIAINTRP - Interpreter used in SP Interview?
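
These variables could be flagged by name in the same way as the earlier groups. In the sketch below, the label "Administrative" and the names `adminVars`, `demoVars` and `demoLabels` are hypothetical choices made for illustration, and the toy vector stands in for `xx$Variable`.

```r
# The four DEMO_I variables listed above.
adminVars = c("SDDSRVYR", "RIDSTATR", "SIAPROXY", "SIAINTRP")

# Toy stand-in for xx$Variable; SEQN and RIAGENDR are left unlabeled.
demoVars = c("SEQN", "SDDSRVYR", "RIDSTATR", "RIAGENDR", "SIAPROXY", "SIAINTRP")
demoLabels = rep(NA_character_, length(demoVars))

# Exact name matching avoids accidental hits from a regular expression.
demoLabels[demoVars %in% adminVars] = "Administrative"
demoLabels
```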