TEAM2_Final_Summary.Rmd

---
title: "Heartbreakers: Estimating the Risk Factors of Cardiovascular Disease"
author: "By: Hovhannes Arestakesyan, Karina Martinez, Timberley Thompson, Chenxu Wang, & Zhe Yu"
date: "`r Sys.Date()`"
output:
  html_document:
    code_folding: hide
    number_sections: false
    toc: yes
    toc_depth: 3
    toc_float: yes
  pdf_document:
    toc: yes
    toc_depth: '3'
---

```{r init, include=FALSE}
#RMD Settings
knitr::opts_chunk$set(warning = F, results = "hide", message = F)
options(scientific=T, digits = 3) 

#Packages
library(ezids)
library(tidyverse)
library(tinytex)
require(rmarkdown)

#Data
cardio_orig = data.frame(read.csv("cardio_train.csv", header = TRUE, sep=';'))
str(cardio_orig)

cardio = cardio_orig[-c(1)] #get rid of id
cardio$age = cardio$age/365 # transfer days to years
cardio$bmi = cardio$weight/((cardio$height/100)**2) #add bmi variable

cardio_clean = outlierKD2(cardio, bmi, TRUE, TRUE, TRUE, TRUE) 
cardio_clean = outlierKD2(cardio_clean, ap_hi, TRUE, TRUE, TRUE, TRUE) 
cardio_clean = outlierKD2(cardio_clean, ap_lo, TRUE, TRUE, TRUE, TRUE)
cardio_clean = na.omit(cardio_clean) #remove NA rows

cardio_clean_final = cardio_clean
cardio_clean_final$gender = as.factor(cardio_clean$gender)
cardio_clean_final$cholesterol = as.factor(cardio_clean$cholesterol) 
cardio_clean_final$gluc = as.factor(cardio_clean$gluc)
cardio_clean_final$smoke = as.factor(cardio_clean$smoke)
cardio_clean_final$alco = as.factor(cardio_clean$alco)
cardio_clean_final$active = as.factor(cardio_clean$active)
cardio_clean_final$cardio = as.factor(cardio_clean$cardio)
```
# I. Background

Cardiovascular disease is "the leading cause of death globally," killing an estimated 17.9 million people each year (World Health Organization). Cardiovascular disease refers to any of a group of conditions affecting one's heart and blood vessels; these conditions can include heart failure, arrhythmia, aortic disease and more (Cleveland Clinic).

The World Health Organization asserts that an "unhealthy diet, physical inactivity, tobacco use" and alcohol abuse are the most important behavioral risk factors in developing heart disease. Engaging in such activities can often lead to "raised blood pressure, raised blood glucose... and obesity" (World Health Organization).

This research paper is an investigation of The World Health Organization's statements. The research team analyzed data on cardiovascular disease to determine the risk factors of the disease and recommend a model to predict cardiovascular disease in patients.


## SMART Questions

Specifically, the research team answered the following Specific, Measurable, Achievable, Relevant, and Time-oriented (SMART) questions:  

  1. How accurately can the model predict the probability that a patient has cardiovascular disease?
  2. What are the top three factors associated with cardiovascular disease? Does this corroborate the risk factors observed by the World Health Organization?  
  3. Is there any association between an individual's sex and whether they have 
cardiovascular disease? Are men more prone to cardiovascular diseases than women?


## Dataset Description

Data analysis was performed on the Cardiovascular Disease dataset from [Kaggle.com](https://www.kaggle.com/datasets/sulianova/cardiovascular-disease-dataset).  To clean the data, we generated a `bmi` variable from `height` and `weight`, removed outliers for `bmi`, `ap_hi` and `ap_lo`, then converted variables to factors where appropriate. The final dataset, `cardio_clean_final`, has `r nrow(cardio_clean_final)` observations and the following `r ncol(cardio_clean_final)` variables:


### Variables

Variable | Type | Definition  
  :-     |  :-- |  :----
`age`	 | Number | Age in years
`gender` |Factor w/ 2 levels | Gender (1: women, 2: men)
`height` | Number | Height in centimeters (cm)
`weight` |Number | Weight in kilograms (kg)
`bmi` | Number | Body Max Index (kg/m^2^)
`ap_hi` | Integer | Systolic blood pressure
`ap_lo` | Integer | Diastolic blood pressure
`cholesterol` |Factor w/ 3 levels| Cholesterol (1: normal, 2: above normal, 3: well above normal)
`gluc` |Factor w/ 3 levels| Glucose (1: normal, 2: above normal, 3: well above normal)
`smoke` |Factor w/ 2 levels| Whether patient smokes or not (0: no, 1: yes)
`alco` |Factor w/ 2 levels| Whether patient consumes alcohol or not (0: no, 1: yes)
`active` |Factor w/ 2 levels| Whether patient is physically active or not (0: no, 1: yes)
`cardio` |Factor w/ 2 levels| Presence or absence of cardiovascular disease (0: absent, 1: present)


Summary statistics for the cleaned dataset are presented below.  
```{r summary, results='markup'}
xkablesummary(cardio_clean_final, title = "Cardiovascular Disease Dataset Summary" )
```


# II. Exploratory Data Analysis
## Categorical Variables

```{r cat.graphs, results='markup'}
require(patchwork)
disease <- subset(cardio_clean_final, cardio==1)
healthy <- subset(cardio_clean_final, cardio==0)

#cholesterol
cholesterol.1 <- ggplot(cardio_clean_final, aes(x=cardio, fill= cholesterol)) +
  geom_bar(position="fill") +
  scale_x_discrete(labels=c("1" = "Has CVD", "0" = "No CVD")) +
  scale_y_continuous(labels = scales::percent) +
  labs( title = "Cholesterol", y = "", x = "",
        fill = "Cholesterol Level") +
  scale_fill_manual(values=c("#f4cccc", "#e06666", "#990000"), 
                    labels = c("Normal", "Above Normal", "Well Above Normal")) +
  theme_minimal()

#Glucose
gluc.1 <- ggplot(cardio_clean_final, aes(x=cardio, fill= gluc)) +
  geom_bar(position="fill") +
  scale_x_discrete(labels=c("1" = "Has CVD", "0" = "No CVD")) +
  scale_y_continuous(labels = scales::percent) +
  labs( title = "Glucose", y = "", x = "",
        fill = "Glucose Level") +
  scale_fill_manual(values=c("#f4cccc", "#e06666", "#990000"), 
                    labels = c("Normal", "Above Normal", "Well Above Normal")) +
  theme_minimal()

#Gender
cardio_clean_final$gender.yn<- as.factor(ifelse(cardio_clean_final$gender== "1", "Women", "Men"))

gender.1 <- ggplot(cardio_clean_final, aes(x=cardio, fill= gender.yn)) +
  geom_bar(position="fill") +
  scale_x_discrete(labels=c("1" = "Has CVD", "0" = "No CVD")) +
  scale_y_continuous(labels = scales::percent) +
  labs( title = "Gender",
        y = "", x = "", fill = "") +
  scale_fill_manual(values=c("#0000da", "#990000")) +
  theme_minimal()

#Smoke
cardio_clean_final$smoke.yn<- as.factor(ifelse(cardio_clean_final$smoke== "1", "Yes", "No"))

smoke.1 <- ggplot(cardio_clean_final, aes(x=cardio, fill= smoke.yn)) +
  geom_bar(position="fill") +
  scale_x_discrete(labels=c("1" = "Has CVD", "0" = "No CVD")) +
  scale_y_continuous(labels = scales::percent) +
  labs(title = "Tobacco Use", y = "", x = "", fill = "Smoker") +
  scale_fill_manual(values=c("#0000da", "#990000")) +
  theme_minimal()

#alcohol
  cardio_clean_final$alco.yn<- as.factor(ifelse(cardio_clean_final$alco== "1", "Yes", "No"))

alco.1 <- ggplot(cardio_clean_final, aes(x=cardio, fill= alco.yn)) +
  geom_bar(position="fill") +
  scale_x_discrete(labels=c("1" = "Has CVD", "0" = "No CVD")) +
  scale_y_continuous(labels = scales::percent) +
  labs( title = "Alcohol Use", y = "", x = "", fill = "Drinks Alcohol") +
  scale_fill_manual(values=c("#0000da", "#990000")) +
  theme_minimal()

#active
cardio_clean_final$active.yn<- as.factor(ifelse(cardio_clean_final$active== "1", "Yes", "No"))

active.1 <- ggplot(cardio_clean_final, aes(x=cardio, fill= active.yn)) +
  geom_bar(position="fill") +
  scale_x_discrete(labels=c("1" = "Has CVD", "0" = "No CVD")) +
  scale_y_continuous(labels = scales::percent) +
  labs( title = "Physical Activity", y = "", x = "", fill = "Physically Active") +
  scale_fill_manual(values=c("#0000da", "#990000")) +
  theme_minimal() 

gender.1 + cholesterol.1 + gluc.1 + smoke.1 +  alco.1 + active.1 + plot_layout(ncol = 2)
```

By exploring the data, we found that the most noticeable differences between people with and without cardiovascular disease (CVD) were in cholesterol and glucose. `r sprintf("%.1f", (sum(disease$cholesterol!= "1")/nrow(disease)) * 100)`% of people with cardiovascular disease have above normal cholesterol levels, while only `r sprintf("%.1f", (sum(healthy$cholesterol!= "1")/nrow(disease)) * 100)`% of people *without* the disease have above normal cholesterol. Similarly, `r sprintf("%.1f", (sum(disease$gluc!="1")/nrow(disease)) * 100)`% of people with the disease have above normal glucose, while only `r sprintf("%.1f", (sum(healthy$gluc!="1")/nrow(disease)) * 100)`% of people *without* the disease have above normal glucose.

The differences between the two groups (people with and without cardiovascular disease) are less noticeable when examining gender and lifestyle factors. The percentage of women in the "has CVD" group is nearly equal to the percentage of women in the "no CVD" group. Likewise, the percentage of alcohol drinkers is nearly the same in the "has CVD" group and the "no CVD" group.

Unsurprisingly, there is a slightly higher percentage of physically active people in the "no CVD" group than in the "has CVD" group. However, it *is* surprising that there is a slightly higher percentage of smokers in the "*no CVD*" group" than in the "CVD" group. We found that `r sprintf("%.1f", (sum(healthy$smoke==1)/nrow(healthy)) * 100)`% of people without CVD were smokers, while only `r sprintf("%.1f", (sum(disease$smoke== 1)/nrow(disease)) * 100)`% of people with CVD were smokers. This finding seems to contradict experts' assertions that tobacco use can cause cardiovascular disease (FDA).


## Chi-Squared Tests
We performed $\chi^2$ tests to further explore the data and determine if the differences we observed between the "CVD" and "no CVD" groups were statistically significant.

```{r chi}
disease <- subset(cardio_clean_final, cardio==1)
healthy <- subset(cardio_clean_final, cardio==0)

#Gender
cardio_vs_gender = table(cardio_clean_final$cardio, cardio_clean_final$gender)
chi_gender = chisq.test(cardio_vs_gender)

#Cholesterol
cardio_vs_cholesterol=table(cardio_clean_final$cardio, cardio_clean_final$cholesterol)
chi_cholesterol = chisq.test(cardio_vs_cholesterol)

#Glucose
cardio_vs_glucose = table(cardio_clean_final$cardio, cardio_clean_final$gluc)
chi_glucose = chisq.test(cardio_vs_glucose)

#Smoking
cardio_vs_smoke = table(cardio_clean_final$cardio, cardio_clean_final$smoke)
chi_smoke = chisq.test(cardio_vs_smoke)

#Alcohol
cardio_vs_alco = table(cardio_clean_final$cardio, cardio_clean_final$alco)
chi_alco = chisq.test(cardio_vs_alco)

#Physical Activity
cardio_vs_active = table(cardio_clean_final$cardio, cardio_clean_final$active)
chi_active = chisq.test(cardio_vs_active)

```
Categorical Variable | Chi-Squared | P-Value | df  
  :-     |  :- |  :-| :-
  `gender`  | `r format(chi_gender$statistic, digits=3)` | `r format(chi_gender$p.value, digits=3)` | `r chi_gender$parameter`
  `cholesterol` | `r format(chi_cholesterol$statistic, digits=3)` | `r format(chi_cholesterol$p.value, digits=3)` | `r chi_cholesterol$parameter`
  `glucose` | `r format(chi_glucose$statistic, digits=3)` | `r format(chi_glucose$p.value, digits=3)` | `r chi_glucose$parameter`
  `smoke` | `r format(chi_smoke$statistic, digits=3)` | `r format(chi_smoke$p.value, digits=3)` | `r chi_smoke$parameter`
  `alco` | `r format(chi_alco$statistic, digits=3)` | `r format(chi_alco$p.value, digits=3)` | `r chi_alco$parameter`
  `active` | `r format(chi_active$statistic, digits=3)` | `r format(chi_active$p.value, digits=3)` | `r chi_active$parameter`
  

We use these results to test the following hypothesis for each variable:  

**Is the variable independent of cardiovascular disease?**  
  $H_0$: They are independent.  
  $H_1$: They are not independent. 

The p-value for `gender` is greater than 0.05, so we can conclude that gender is independent of cardiovascular disease and may not have a significant impact on whether a patient has CVD. The p-values for all other variables are less than 0.01; in fact, `cholesterol`, `glucose`, `smoke` and `active` are almost 0. Thus, we reject the null hypothesis and conclude that these variables are related to cardiovascular disease. 

The $\chi^2$ tests corroborate the observations from the bar graphs. The variables with the smallest p-values (cholesterol and glucose) are the same ones that had the most obvious differences in their bar graphs.

## Quantitative Variables

```{r cont.graphs}
disease <- subset(cardio_clean_final, cardio==1)
healthy <- subset(cardio_clean_final, cardio==0)

#age
cardio_clean_final$cardio.yn<- as.factor(ifelse(cardio_clean_final$active== "1", "Yes", "No"))

age.kde <- ggplot(cardio_clean_final, aes(x = age, fill= cardio.yn)) + 
  geom_density(position = "stack") +
  geom_vline(aes(xintercept=mean(disease$age)), color="pink3", linetype="dashed", size=1) +
  geom_vline(aes(xintercept=mean(healthy$age)), color="steelblue", linetype="dashed", size=1) +
  scale_x_continuous(limits = c(30,70)) +
  scale_y_continuous(labels = scales::percent) +
  labs(title = "Age Distribution (Stacked)",
       fill = "Has CVD", x = "Age", y = " ") +
  scale_fill_manual(values=c("#0000da", "#990000")) +
  theme_minimal()


#height
height.kde <- ggplot(cardio_clean_final, aes(x = height, fill= cardio.yn)) + 
  geom_density(position = "stack") +
  geom_vline(aes(xintercept=mean(disease$height)), color="pink3", linetype="dashed", size=1) +
  geom_vline(aes(xintercept=mean(healthy$height)), color="steelblue", linetype="dashed", size=1) +
  scale_y_continuous(labels = scales::percent) +
  labs(title = "Height Distribution (Stacked)",
       fill = "Has CVD", x = "Height (cm)", y = " ") +
  scale_fill_manual(values=c("#0000da", "#990000")) +
  theme_minimal()


#weight
weight.kde <- ggplot(cardio_clean_final, aes(x = weight, fill= cardio.yn)) + 
  geom_density(position = "stack") +
  geom_vline(aes(xintercept=mean(disease$weight)), color="pink3", linetype="dashed", size=1) +
  geom_vline(aes(xintercept=mean(healthy$weight)), color="steelblue", linetype="dashed", size=1) +
  scale_y_continuous(labels = scales::percent) +
  labs(title = "Weight Distribution (Stacked)",
       fill = "Has CVD", x = "Weight (kg)", y = " ") +
  scale_fill_manual(values=c("#0000da", "#990000")) +
  theme_minimal()


#bmi
bmi.kde <- ggplot(cardio_clean_final, aes(x = bmi, fill= cardio.yn)) + 
  geom_density(position = "stack") +
  geom_vline(aes(xintercept=mean(disease$bmi)), color="pink3", linetype="dashed", size=1) +
  geom_vline(aes(xintercept=mean(healthy$bmi)), color="steelblue", linetype="dashed", size=1) +
  scale_y_continuous(labels = scales::percent) +
  labs(title = "BMI Distribution (Stacked)",
       fill = "Has CVD", x = "BMI", y = " ") +
  scale_fill_manual(values=c("#0000da", "#990000")) +
  theme_minimal()


#systolic bp
ap_hi.kde <- ggplot(cardio_clean_final, aes(x = ap_hi, fill= cardio.yn)) + 
  geom_density(position = "stack") +
  geom_vline(aes(xintercept=mean(disease$ap_hi)), color="pink3", linetype="dashed", size=1) +
  geom_vline(aes(xintercept=mean(healthy$ap_hi)), color="steelblue", linetype="dashed", size=1) +
  scale_y_continuous(labels = scales::percent) +
  scale_x_continuous(breaks = c(seq(90,170,20))) +
  labs(title = "Systolic BP Distribution (Stacked)",
       fill = "Has CVD", x = "Systolic BP", y = " ") +
  scale_fill_manual(values=c("#0000da", "#990000")) +
  theme_minimal()


#diastolic bp
ap_lo.kde <- ggplot(cardio_clean_final, aes(x = ap_lo, fill= cardio.yn)) + 
  geom_density(position = "stack") +
  geom_vline(aes(xintercept=mean(disease$ap_lo)), color="pink3", linetype="dashed", size=1) +
  geom_vline(aes(xintercept=mean(healthy$ap_lo)), color="steelblue", linetype="dashed", size=1) +
  scale_y_continuous(labels = scales::percent) +
  scale_x_continuous(breaks = c(seq(65,105,10))) +
  labs(title = "Diastolic BP Distribution (Stacked)",
       fill = "Has CVD", x = "Diastolic BP", y = " ") +
  scale_fill_manual(values=c("#0000da", "#990000")) +
  theme_minimal()


require(ggpubr)
ggarrange(age.kde, height.kde, weight.kde, bmi.kde, ap_hi.kde, ap_lo.kde, ncol=2, nrow=3, common.legend = TRUE, legend = "bottom", heights=c(2,2,2))


```

We used density plots to show the distribution of the quantitative variables in the dataset. Dashed lines were added to the plots to indicate the means for the CVD group (pink line) and the means for the non-CVD  groups (blue line).  

The plots indicate that the CVD distribution differs from the non-CVD distribution for all variables except for `height`. Patients with CVD are likely to be older and heavier than "no CVD" patients. The "has CVD" group also tends to have higher systolic and diastolic blood pressure.

## Welch Two-Sample T-Test
We used two-sample t-tests to test whether there was a statistically significant difference in means between the "CVD" and "no CVD" groups for each quantitative variable.

```{r ttest}

cardio_pos = subset(cardio_clean_final, cardio_clean_final$cardio == 1)
cardio_neg = subset(cardio_clean_final, cardio_clean_final$cardio == 0)

#age
ttest_age = t.test(cardio_pos$age, cardio_neg$age)

#height
ttest_height = t.test(cardio_pos$height, cardio_neg$height)

#weight
ttest_weight = t.test(cardio_pos$weight, cardio_neg$weight)

#bmi
ttest_bmi = t.test(cardio_pos$bmi, cardio_neg$bmi)

#ap.hi
ttest_ap_hi = t.test(cardio_pos$ap_hi, cardio_neg$ap_hi)

#ap.lo
ttest_ap_lo = t.test(cardio_pos$ap_lo, cardio_neg$ap_lo)

```  

Continuous Variable | CVD mean | No CVD mean | t | p-value
:- | :- | :- | :- | :-
`age` | `r format(ttest_age$estimate[1], digits=3)` | `r format(ttest_age$estimate[2], digits=3)` | `r format(ttest_age$statistic)` | `r format(ttest_age$p.value, digits=3)`
`height` | `r format(ttest_height$estimate[1], digits=3)` | `r format(ttest_height$estimate[2], digits=3)` | `r format(ttest_height$statistic)` | `r format(ttest_height$p.value, digits=3)`
`weight`| `r format(ttest_weight$estimate[1], digits=3)` | `r format(ttest_weight$estimate[2], digits=3)` | `r format(ttest_weight$statistic)` | `r format(ttest_weight$p.value, digits=3)`
`bmi` | `r format(ttest_bmi$estimate[1], digits=3)` | `r format(ttest_bmi$estimate[2], digits=3)` | `r format(ttest_bmi$statistic)` | `r format(ttest_bmi$p.value, digits=3)`
`ap_hi` | `r format(ttest_ap_hi$estimate[1], digits=3)` | `r format(ttest_ap_hi$estimate[2], digits=3)` | `r format(ttest_ap_hi$statistic)` | `r format(ttest_ap_hi$p.value, digits=3)`
`ap_lo` | `r format(ttest_ap_lo$estimate[1], digits=3)` | `r format(ttest_ap_lo$estimate[2], digits=3)` | `r format(ttest_ap_lo$statistic)` | `r format(ttest_ap_lo$p.value, digits=3)`

We used these results to test the following hypothesis for each variable:  
**Is the variable mean different in CVD vs non-CVD patients?**  
  $H_0$: The mean is the same  
  $H_1$: The mean is not the same 

The p-value was very low for all quantitative variables indicating that there is a statistically significant difference in the average `age`, `height`, `weight`, `bmi`, `ap_hi` and `ap_lo` of CVD and non-CVD patients. For all of the continuous/quantitative variables, we reject the null hypothesis and accept that there is a difference between the means of the CVD and non-CVD groups.  

It is worth noting that when using t-tests with large sample sizes, the hypothesis tests are able to detect even small differences, as was the case for height. The mean for both CVD and non-CVD patients was 165cm, with a `r ttest_height$conf.int[1]` cm to `r ttest_height$conf.int[2]` cm difference at 95% confidence. The t-tests tests align with the observations from the density plots, as there was a noticeable difference in the means of all continuous variables except height.  

## Correlations
We examined correlations between each of the variables to determine if a linear relationship exists with cardiovascular disease or any other variables. Because a number of variables are factors, we used the Spearman method to generate the correlation coefficients.

```{r corrplot}
loadPkg("corrplot")

col_order <- c("age", "gender", "height", "weight", "ap_hi", "ap_lo", "cholesterol", "gluc", "smoke", "alco", "active", "bmi", "cardio")
cardio_clean_cor <- cardio_clean[, col_order]

# corrplot with spearman method for categorical variables
cardio_cor <- cor(cardio_clean_cor, method="spearman")
corrplot(cardio_cor, type="lower", addCoef.col="darkred", number.cex=0.6, tl.cex=1,title="Correlation Matrix", mar=c(0,0,1,0))
```  

The correlation coefficient for each pair of variables is displayed in the correlation matrix. The coefficient ranges from 1 to -1, indicating a positive or negative correlation. The further away the correlation coefficient is from zero, the stronger the relationship between the two variables. 

The factors most strongly correlated with cardiovascular disease are `ap_hi`, `ap_low`, `cholesterol`, `age` and `bmi.` There is also a strong correlation between `height` and `gender`, `ap_hi` and `ap_low`, and `cholesterol` and `gluc`.  

Some of these correlations are expected, for instance systolic blood pressure and diastolic blood pressure typically increase and decrease together. Weight and BMI are also strongly correlated, which makes sense because BMI is a measurement of body stature that heavily relies on weight.  While the correlation matrix does not show how the relationship between variables impacts the presence or absence of CVD, it is important to keep these relationships in mind when we begin building models.   

#### Is there a relationship between blood pressure and other variables in patients with cardiovascular disease?  
```{r, results='markup'}
library(patchwork)
#install.packages("ggExtra")
#library(ggExtra)

ap_hi_lo <- ggplot(cardio_clean_final, aes(x=ap_lo, y=ap_hi, color=cardio)) +
  geom_point(size = 1.5, alpha=0.7) + 
  labs(title = "Systolic vs Diastolic BP", color='Has CVD') + 
  xlab("Diastolic BP ") + ylab("Systolic BP") +
  scale_color_manual(values=c("#0000da", "#990000"), labels = c("No", "Yes")) +
  theme_minimal()

ap1_hi <- ggplot(cardio_clean_final,aes(x = cholesterol , y = ap_hi, fill = cardio)) + 
  geom_violin(scale="width", width=1, alpha=.5, aes(linetype=NA)) + 
  geom_boxplot(width=.1, cex=.5, position=position_dodge(1), outlier.size=.7)+
  labs(title = "Systolic BP vs Cholesterol", fill='Has CVD') +
  xlab("Cholesterol") + ylab("Systolic BP") + 
  scale_fill_manual(values=c("#0000da", "#990000"),labels=c("No", "Yes")) +
  scale_x_discrete(labels=c("1" = "normal", "2" = "above normal","3" = "well above normal")) +
  theme_minimal()

ap2 <- ggplot(cardio_clean_final, aes(x=age, y=ap_hi, color=cardio)) +
  geom_point(size = 1.5, alpha=0.7) + 
  labs(title = "Systolic BP vs Age", color="Has CVD") + 
  xlab("Age ") + ylab("Systolic BP") +
  scale_color_manual(values=c("#0000da", "#990000"), labels = c("No", "Yes")) +
  theme_minimal()

ap3 <- ggplot(cardio_clean_final, aes(x=bmi, y=ap_hi, color=cardio)) +
  geom_point(size=1.5,alpha=0.7) + 
  labs(title = "Systolic BP vs BMI", color="Has CVD") + 
  xlab("BMI ") + ylab("Systolic BP") +
  scale_color_manual(values=c("#0000da", "#990000"), labels = c("No", "Yes")) +
  theme_minimal()

ggarrange(ap_hi_lo, ap1_hi, ap2, ap3, ncol=2, nrow=2, common.legend = TRUE, legend = "bottom", heights=c(2,2))
```  

We can see a strong correlation between systolic and diastolic blood pressure, as these two measurements are known to be physiologically related. According to the NIH, for most adults:  

* Normal blood pressure is systolic BP < 120 and diastolic BP < 80  
* Elevated blood pressure is systolic BP between 120 and 129 and diastolic BP < 80  
* High blood pressure is systolic BP is 130 or higher and diastolic BP 80 or higher  

The incidence of CVD increases as both systolic and diastolic BP increase.  

Due to the relationship between systolic and diastolic BP, we chose to use only systolic BP to investigate relationships between BP and other variables. We found that there is not a strong correlation between BP and cholesterol, age or BMI. However, across all cholesterol, age and BMI groups, CVD patients have higher BP than non-CVD patients. They also tend to be older and have higher BMI.

#### Is there a relationship between cholesterol, bmi, and age in patients with cardiovascular disease?  
```{r, results='markup'}

chol1 <- ggplot(cardio_clean_final,aes(x = cholesterol , y = age, fill = cardio)) + 
  geom_violin(width=1, alpha=0.5, aes(linetype=NA)) + 
  geom_boxplot(width=.1, cex=1, position=position_dodge(1), outlier.size=.7)+
  labs(title = "Age vs Cholesterol", fill="Has CVD") +
  xlab("Cholesterol") + ylab("Age") + 
  scale_fill_manual(values=c("#0000da", "#990000"),labels=c("No", "Yes")) +
  scale_x_discrete(labels=c("1" = "normal", "2" = "above normal","3" = "well above normal")) +
  theme_minimal()

chol2 <- ggplot(cardio_clean_final,aes(x = cholesterol , y = bmi, fill = cardio)) + 
  geom_violin(width=1, alpha=0.5, aes(linetype=NA)) + 
  geom_boxplot(width=.1, cex=.5, position=position_dodge(1), outlier.size=.7)+
  labs(title = "BMI vs Cholesterol", fill="Has CVD") +
  xlab("Cholesterol") + ylab("BMI") + 
  scale_fill_manual(values=c("#0000da", "#990000"),labels=c("No", "Yes")) +
  scale_x_discrete(labels=c("1" = "normal", "2" = "above normal","3" = "well above normal")) +
  theme_minimal()

age1 <- ggplot(cardio_clean_final, aes(x=age, y=bmi, color=cardio)) +
  geom_point(size=1,alpha=0.5) + 
  labs(title = "BMI vs Age", color="Has CVD") + 
  xlab("Age") + ylab("BMI") +
  scale_color_manual(values=c("#0000da", "#990000"),labels=c("No", "Yes")) +
  theme_minimal()

#ggarrange(chol1, chol2, age1, ncol=2, nrow=2, common.legend = TRUE, legend = "bottom")
#(smoke3 + theme(legend.position = "none")) + (active3 + theme(legend.position = "none")) + (gluc3 + theme(legend.position = "none")) + (alco3 + theme(legend.position = "none")) + alco1 + guide_area() + plot_layout(guides = 'collect')

((chol1 + theme(legend.position = "none")) + (chol2 + theme(legend.position = "none"))) / age1 + plot_layout(guides = 'collect')
```  

Violin plots combine blox plots and density plots to show summary statistics and where the values are clustered. These plots show that there is a positive trend between `age` and `cholesterol`, as well as `bmi` and `cholesterol.` We can also see an increase in age and BMI between the CVD and non-CVD groups in each cholesterol group. 

The scatterplot of BMI vs age does not indicate that there is a correlation between these variables. However, patients with higher BMI and age more likely to have CVD. Also, there appears to be a higher incidence of CVD in older people with low BMI than younger people with high BMI, indicating that age may have more of an effect on CVD than body mass.

#### How do lifestyle choices influence cardiovascular disease with other variables?  

```{r, results='markup'}


#Smoking
smoke1 <- ggplot(cardio_clean_final,aes(x = smoke , y = ap_hi, fill = cardio)) + 
  geom_violin(width=1, alpha=0.5, scale='width', aes(linetype=NA)) + 
  geom_boxplot(width=.1, cex=1, position=position_dodge(1), outlier.size=.7)+
  labs(title = "Systolic Blood Pressure vs Smoke", fill="Has CVD") +
  xlab("Smoke") + ylab("Systolic BP") + 
  scale_fill_manual(values=c("#0000da", "#990000"),labels=c("No", "Yes"))+
  scale_x_discrete(labels=c("0" = "No", "1" = "Yes")) +
  theme_minimal()
  #geom_hline(yintercept = 120, linetype = 'dashed')

smoke2 <- ggplot(cardio_clean_final,aes(x = smoke , y = age, fill = cardio)) + 
  geom_violin(width=1, alpha=0.5, aes(linetype=NA)) + 
  geom_boxplot(width=.1, cex=1, position=position_dodge(1), outlier.size=.7)+
  labs(title = "Age vs Smoke", fill="Has CVD") +
  xlab("Smoke") + ylab("Age") + 
  scale_fill_manual(values=c("#0000da", "#990000"),labels=c("No", "Yes"))+
  scale_x_discrete(labels=c("0" = "No", "1" = "Yes")) +
  theme_minimal()

smoke3 <- ggplot(cardio_clean_final,aes(x = smoke , y = bmi, fill = cardio)) + 
  geom_violin(width=1, alpha=0.5, aes(linetype=NA)) + 
  geom_boxplot(width=.1, cex=1, position=position_dodge(1), outlier.size=.7)+
  labs(title = "BMI vs Smoke", fill="Has CVD") +
  xlab("Smoke") + ylab("BMI") + 
  scale_fill_manual(values=c("#0000da", "#990000"),labels=c("No", "Yes"))+
  scale_x_discrete(labels=c("0" = "No", "1" = "Yes")) +
  theme_minimal()


#Alcohol
alco1 <- ggplot(cardio_clean_final,aes(x = alco , y = ap_hi, fill = cardio)) + 
  geom_violin(width=1, alpha=0.5, scale="width", aes(linetype=NA)) + 
  geom_boxplot(width=.1, cex=1, position=position_dodge(1), outlier.size=.7)+
  labs(title = "Systolic BP vs Alcohol", fill="Has CVD") +
  xlab("Alcohol") + ylab("Systolic BP") + 
  scale_fill_manual(values=c("#0000da", "#990000"),labels=c("No", "Yes"))+
  scale_x_discrete(labels=c("0" = "No", "1" = "Yes")) +
  theme_minimal()

alco2 <- ggplot(cardio_clean_final,aes(x = alco , y = age, fill = cardio)) + 
  geom_violin(width=1, alpha=0.5, aes(linetype=NA)) + 
  geom_boxplot(width=.1, cex=1, position=position_dodge(1), outlier.size=.7)+
  labs(title = "Age vs Alcohol", fill="Has CVD") +
  xlab("Alcohol") + ylab("Age") + 
  scale_fill_manual(values=c("#0000da", "#990000"),labels=c("No", "Yes"))+
  scale_x_discrete(labels=c("0" = "No", "1" = "Yes"))

alco3 <- ggplot(cardio_clean_final,aes(x = alco , y = bmi, fill = cardio)) + 
  geom_violin(width=1, alpha=0.5, aes(linetype=NA)) + 
  geom_boxplot(width=.1, cex=1, position=position_dodge(1), outlier.size=.7)+
  labs(title = "BMI vs Alcohol", fill="Has CVD") +
  xlab("Alcohol") + ylab("BMI") + 
  scale_fill_manual(values=c("#0000da", "#990000"),labels=c("No", "Yes"))+
  scale_x_discrete(labels=c("0" = "No", "1" = "Yes")) +
  theme_minimal()


#Physical activity
active1 <- ggplot(cardio_clean_final,aes(x = active , y = ap_hi, fill = cardio)) + 
  geom_violin(width=1, alpha=0.5, scale="count", aes(linetype=NA)) + # Scale maximum width proportional to sample size
  geom_boxplot(width=.1, cex=1, position=position_dodge(.5), outlier.size=.7)+
  labs(title = "Systolic Blood Pressure vs Activity (Count Scaled)", fill="Has CVD") +
  xlab("Activity") + ylab("Systolic BP") + 
  scale_fill_manual(values=c("#0000da", "#990000"),labels=c("No", "Yes"))+
  scale_x_discrete(labels=c("0" = "No", "1" = "Yes"))

active2 <- ggplot(cardio_clean_final,aes(x = active , y = age, fill = cardio)) + 
  geom_violin(width=1, alpha=0.5, aes(linetype=NA)) + 
  geom_boxplot(width=.1, cex=1, position=position_dodge(1), outlier.size=.7)+
  labs(title = "Age vs Activity", fill="Has CVD") +
  xlab("Activity") + ylab("Age") + 
  scale_fill_manual(values=c("#0000da", "#990000"),labels=c("No", "Yes"))+
  scale_x_discrete(labels=c("0" = "No", "1" = "Yes"))

active3 <- ggplot(cardio_clean_final,aes(x = active , y = bmi, fill = cardio)) + 
  geom_violin(width=1, alpha=0.5, aes(linetype=NA)) + 
  geom_boxplot(width=.1, cex=1, position=position_dodge(1), outlier.size=.7)+
  labs(title = "BMI vs Activity", fill="Has CVD") +
  xlab("Activity") + ylab("BMI") + 
  scale_fill_manual(values=c("#0000da", "#990000"),labels=c("No", "Yes"))+
  scale_x_discrete(labels=c("0" = "No", "1" = "Yes")) +
  theme_minimal()


#Blood Glucose
gluc1 <- ggplot(cardio_clean_final,aes(x = gluc , y = ap_hi, fill = cardio)) + 
  geom_violin(aes(linetype=NA)) + 
  labs(title = "Systolic Blood Pressure vs Blood Glucose") +
  xlab("Blood Glucose") + ylab("Systolic BP") + 
  scale_fill_manual(values=c("pink3", "steelblue"),labels=c("Absent", "Present")) +
  scale_x_discrete(labels=c("1" = "normal", "2" = "above normal","3" = "well above normal"))+

  geom_violin() + 
  labs(title = "Systolic Blood Pressure vs Blood Glucose",
       fill = "Has CVD") +
  xlab("Blood Glucose") + ylab("Systolic BP") + 
  scale_fill_manual(values=c("#0000da", "#990000"),labels=c("No", "Yes")) +
  scale_x_discrete(labels=c("1" = "Normal", "2" = "Above Normal","3" = "Well Above Normal"))+
  geom_hline(yintercept = 120, linetype = 'dashed') +
  theme_minimal()

gluc2 <- ggplot(cardio_clean_final,aes(x = gluc , y = age, fill = cardio)) + 
  geom_violin(width=1, alpha=0.5, aes(linetype=NA)) + 
  geom_boxplot(width=.1, cex=1, position=position_dodge(1), outlier.size=.7)+
  labs(title = "Age vs Blood Glucose", fill="Has CVD") +
  xlab("Blood Glucose") + ylab("Age") + 
  scale_fill_manual(values=c("#0000da", "#990000"),labels=c("No", "Yes"))+
  scale_x_discrete(labels=c("1" = "normal", "2" = "above normal","3" = "well above normal")) +
  theme_minimal()
gluc2


gluc3 <- ggplot(cardio_clean_final,aes(x = gluc , y = bmi, fill = cardio)) + 
  geom_violin(width=1, alpha=0.5, aes(linetype=NA)) + 
  geom_boxplot(width=.1, cex=1, position=position_dodge(1), outlier.size=.7)+
  labs(title = "BMI vs Blood Glucose", fill="Has CVD") +
  xlab("Blood Glucose") + ylab("BMI") + 
  scale_fill_manual(values=c("#0000da", "#990000"),labels=c("No", "Yes"))+
  scale_x_discrete(labels=c("1" = "normal", "2" = "above normal","3" = "well above normal")) +
  theme_minimal()

#(smoke3 + theme(legend.position = "none")) + (active3 + theme(legend.position = "none")) + (gluc3 + theme(legend.position = "none")) + (alco3 + theme(legend.position = "none")) + alco1 + guide_area() + plot_layout(guides = 'collect')

(smoke3 + theme(legend.position = "none")) + (gluc3 + theme(legend.position = "none")) + (alco3 + theme(legend.position = "none")) + alco1 + plot_layout(guides = 'collect')
```  

Lastly, we looked at lifestyle choices to assess their relationship with other variables in the presence or absence of cardiovascular disease. The main measurement that was impacted by lifestyle is `bmi.`  BMI tends to decrease with smoking, but increase with blood glucose and alcohol. Because BMI is correlated with cardiovascular disease, it appears that diet and alcohol may help to predict CVD as well.  

Alcohol consumption also impacted systolic blood pressure. Blood pressure was higher among people who consume alcohol, and the difference between CVD and non-CVD patients was more dramatic in alcohol drinkers. This indicates tht alcohol may also impact the presence of CVD.

# III. Statistical Models
## Feature selection  
In the EDA, t-tests and $\chi^2$ tests were used to determine whether the independent variables `age`, `gender`, `height`, `weight`, `ap_hi`, `ap_lo`, `cholesterol`, `gluc`, `smoke`, `alco`, `active` and `bmi` had effects on the response variable `cardio`. The p-values of the tests are shown in the table below:  
```{r pval table} 
# results='markup'
tab <- matrix(c( '<2e-16','0.6', '8e-07', '<2e-16', '<2e-16', '<2e-16', '<2e-16', '<2e-16', '1e-06', '0.006', '<2e-16', '<2e-16'), ncol=12, byrow = TRUE)
rownames(tab) <- "p_value"
colnames(tab) <- c("Age_year", "Gender", "Height","Weight","Systolic(ap_hi)", "Diastolic(ap_low)", "Cholesterol","Glucose","smoke", "Alcohol", "Physical_activity","BMI")
tab <- as.table(tab)
xkabledply(tab, "p-value Comparison")  
```

The p-values suggest that all the independent variables, except for `gender`, have a relationship with `cardio`. We used the `bestglm` function to further confirm this. 

```{r bestglm}
#results='markup'
loadPkg("bestglm")
loadPkg("leaps")
loadPkg("ggplot2")
cardio_clean_final_glm <- cardio_clean_final
colnames(cardio_clean_final_glm)[12] <- "y"
cardio_clean_final_glm <- cardio_clean_final_glm[c("age", "gender", "height", "weight","ap_hi", "ap_lo", "cholesterol", "gluc", "smoke", "alco", "active", "bmi", "y")]

res.bestglm <- bestglm(Xy = cardio_clean_final_glm, family = binomial,
            IC = "AIC", method = "exhaustive")
```  

```{r results='markup'}
#summary(res.bestglm)
res.bestglm$BestModels
#summary(res.bestglm$BestModels)
unloadPkg("bestglm") 
unloadPkg("leaps")  
```  
`gender` and `bmi` were excluded from the model that has lowest criterion value, indicating these two variables may not have strong effects on `cardio`. But among the 5 models the function generated, the criterion values are very close, and `gender` was voted as `FALSE` three times, while `bmi` was only voted twice. So we are not confident in removing these two variables from the following modeling steps. In the KNN modeling, we compare whether these two variables can affect the model performances. Also, since the correlations matrix showed the variables `age`, `cholesterol`, `bmi`, `ap_hi` and `ap_lo` had relatively high coefficient values with `cardio`, we also performed the KNN modeling with these 5 variables.   


## K-Nearest Neighbor (KNN)     

Before KNN modeling, all the predictor variables were converted to numeric, and were scaled and centered. The dataframe structure after the pre-modeling treatment is shown in the table below.
```{r}
results='markup'
#Data Prep
#Scale and center data
cardio_num <- cardio_clean_final_glm
colnames(cardio_num)[13] <- "cardioi"
cardio_num[1:12] <- sapply(cardio_num[1:12],as.numeric)
cardio_scale <- cardio_num
cardio_scale[1:12] <- as.data.frame(scale(cardio_num[1:12], center = TRUE, scale = TRUE))
str(cardio_scale)
```  


### Model 1 (All 13 variables)  
**Train/Test Split**  
The dataframe was split into a training subset (70% of the total dataframe) and a test subset (30% of the total dataframe).  
```{r}
set.seed(42)
cardio_sample <- sample(2, nrow(cardio_scale), replace=TRUE, prob=c(0.7, 0.3))
cardio_training_x <- cardio_scale[cardio_sample==1, 1:12]
cardio_test_x <- cardio_scale[cardio_sample==2, 1:12]
cardio_training_y <- cardio_scale[cardio_sample==1, 13]
cardio_test_y <- cardio_scale[cardio_sample==2, 13]
```

**Finding the Best K-Value**  
Odd numbers from 3 to 40 were tested one by one to get the best K value. The accuracy values from the model that used these K values are shown in the line graph.

```{r}
loadPkg("FNN")
loadPkg("gmodels")
loadPkg("caret")
K_test <- seq(3, 40, by = 2)
ResultDf = data.frame()
for (k_v in K_test) {
  cardio_pred <- knn(train = cardio_training_x, test = cardio_test_x, cl=cardio_training_y, k=k_v)
  
  cm = confusionMatrix(cardio_pred, reference = cardio_test_y )
  
  cmaccu = cm$overall['Accuracy']

  cmt = data.frame(k=k_v, Total.Accuracy = cmaccu, row.names = NULL ) 
  
  ResultDf = rbind(ResultDf, cmt)
}
```

```{r}
xkabledply(ResultDf, "Total Accuracy Summary for KNN Model 1")
```

```{r}
library(ggplot2)

ggplot(ResultDf, aes(x=k, y=Total.Accuracy)) +
  geom_line(color = "skyblue", size = 1.5) +
  geom_point(size = 3, color = "blue") + 
  labs(title = "Total Accuracy vs K (Model 1)",
       x = "K", y = "Total Accuracy") +
  theme_minimal()
```  

According the results, K = 19 is the best K value for Model 1.  

**Modeling Results with the Best K Value (K = 19)**  

Detailed results and the confusion matrix using K=19 were shown below.  

```{r}
 #results='markup'
   cardio_pred_19 <- knn(train = cardio_training_x, test = cardio_test_x, cl=cardio_training_y, k=19)
  KNN_Mode_1 = confusionMatrix(cardio_pred_19, reference = cardio_test_y)
  xkabledply(as.matrix(KNN_Mode_1), title = paste("Confusion Matrix for KNN Model 1") ) 
  xkabledply(data.frame(KNN_Mode_1$overall['Accuracy']), title=paste("Total Accuracy of KNN Model 1"))
  xkabledply(data.frame(KNN_Mode_1$byClass), title=paste("Metrics of KNN Model 1"))
```  
The overall performance is acceptable, but we still want to see whether removing `gender` and `bmi` will improve the modeling. 

### Model 2 (Excluding Gender & BMI)

**Train / Test Split**  
The dataframe was split into a training subset (70% of the total dataframe) and a testing subset (30% of the total dataframe). The variables `gender` and `bmi` were excluded from the train/test split.
```{r}
cardio_scale2 <- cardio_scale[c(1, 3:11, 13)]
cardio_scale2
set.seed(42)
cardio_sample2 <- sample(2, nrow(cardio_scale2), replace=TRUE, prob=c(0.7, 0.3))
cardio_training_x2 <- cardio_scale2[cardio_sample2==1, 1:10]
cardio_test_x2 <- cardio_scale2[cardio_sample2==2, 1:10]
cardio_training_y2 <- cardio_scale2[cardio_sample2==1, 11]
cardio_test_y2 <- cardio_scale2[cardio_sample2==2, 11]
```


**Finding the Best K-Value**   
Odd numbers from 3 to 40 were used for the test. The accuracy values from the model that used these K values are shown in the table and the line graph. According the results, K = 29 is the best K value for the KNN modeling without `gender` and `bmi`.

```{r}
K_test <- seq(3, 40, by = 2)
ResultDf2 = data.frame()
for (k_v in K_test) {
  cardio_pred2 <- knn(train = cardio_training_x2, test = cardio_test_x2, cl=cardio_training_y2, k=k_v)
  
  cm2 = confusionMatrix(cardio_pred2, reference = cardio_test_y2 )
  
  cmaccu2 = cm2$overall['Accuracy']

  cmt2 = data.frame(k=k_v, Total.Accuracy = cmaccu2, row.names = NULL ) 
  
  ResultDf2 = rbind(ResultDf2, cmt2)
}
```
```{r results='markup'}
xkabledply(ResultDf2, "Total Accuracy Summary for KNN Model 2")
```

```{r}
library(ggplot2)
ggplot(ResultDf2, aes(x=k, y=Total.Accuracy)) +
  geom_line(color = "skyblue", size = 1.5) +
  geom_point(size = 3, color = "blue") + 
  labs(title = "Total.Accuracy vs K (Model 2)")
```  
**Modeling Results with the Best K Value (K = 29)** 

Detailed results and the confusion matrix of model 2 using k = 29 were shown below. Removing `gender` and `bmi` slightly improved the performance. 

```{r results='markup'}
  cardio_pred_29 <- knn(train = cardio_training_x2, test = cardio_test_x2, cl=cardio_training_y2, k=29)
  KNN_Mode_2 = confusionMatrix(cardio_pred_29, reference = cardio_test_y2)
  xkabledply(as.matrix(KNN_Mode_2), title = paste("Confusion Matrix for KNN Model 2") ) 
  xkabledply(data.frame(KNN_Mode_2$overall['Accuracy']), title=paste("Total Accuracy of KNN Model 2"))
  xkabledply(data.frame(KNN_Mode_2$byClass), title=paste("Metrics of KNN Model 2"))
```  

Next, we want to check whether only using the 5 variables  provided by the correlation matrix will make a better model.  

### Model 3 (Selected Variables)
This KNN model only use `age`, `cholesterol`, `bmi`, `ap_hi` and `ap_lo`.

**Train / Test Split**  
The dataframe was split into training subset (70% of the total dataframe) and test subset (30% of the total dataframe). Only `age`, `cholesterol`, `bmi`, `ap_hi` and `ap_lo` were used for the modeling. 
```{r}
cardio_scale3 <- cardio_scale[c(1, 5:7, 12:13)]
cardio_scale3
set.seed(42)
cardio_sample3 <- sample(2, nrow(cardio_scale3), replace=TRUE, prob=c(0.7, 0.3))
cardio_training_x3 <- cardio_scale3[cardio_sample3==1, 1:5]
cardio_test_x3 <- cardio_scale3[cardio_sample3==2, 1:5]
cardio_training_y3 <- cardio_scale3[cardio_sample3==1, 6]
cardio_test_y3 <- cardio_scale3[cardio_sample3==2, 6]
```

**Looking for the best K value** 
Odd numbers from 3 to 40 were used for the test. The accuracy values from the model that used these K values were shown in the table and the line graph. According the results, K = 31 is the best K value for the KNN modeling.

```{r}
K_test <- seq(3, 40, by = 2)
ResultDf3 = data.frame()
for (k_v in K_test) {
  cardio_pred3 <- knn(train = cardio_training_x3, test = cardio_test_x3, cl=cardio_training_y3, k=k_v)
  
  cm3 = confusionMatrix(cardio_pred3, reference = cardio_test_y3 )
  
  cmaccu3 = cm3$overall['Accuracy']

  cmt3 = data.frame(k=k_v, Total.Accuracy = cmaccu3, row.names = NULL ) 
  
  ResultDf3 = rbind(ResultDf3, cmt3)
}
```
```{r results='markup'}
xkabledply(ResultDf3, "Total Accuracy Summary")
```

```{r}
library(ggplot2)
ggplot(ResultDf3, aes(x=k, y=Total.Accuracy)) +
  geom_line(color = "skyblue", size = 1.5) +
  geom_point(size = 3, color = "blue") + 
  labs(title = "Total.Accuracy vs K")
```  

**Modeling results with the best K value (Model 3, K = 31)**  

The confusion matrix of model 3 using k = 31 were shown below. 

```{r results='markup'}
  cardio_pred_31 <- knn(train = cardio_training_x3, test = cardio_test_x3, cl=cardio_training_y3, k=31)
  KNN_Mode_3 = confusionMatrix(cardio_pred_31, reference = cardio_test_y3 )
  xkabledply(as.matrix(KNN_Mode_3), title = paste("Confusion Matrix for KNN Model 3") )
```  

```{r results='markup'}
 xkabledply(data.frame(KNN_Mode_1$overall['Accuracy'], KNN_Mode_2$overall['Accuracy'], KNN_Mode_3$overall['Accuracy']), title=paste("Total Accuracy Comparision of the Three KNN Models"))

 xkabledply(data.frame(KNN_Mode_1$byClass, KNN_Mode_2$byClass, KNN_Mode_3$byClass), title=paste("Metrics Comparision of KNN the Three KNN Models"))
```

The total accuracy and other metrics of the three models were listed in the above tables. For summary, all three models have an acceptable performance, but not outstanding. More specifically, model 2 has a overall better performance comparing to model 1. But model 2 and model 3 have their own strength and weaknesses. The accuracy and recall value of model 2 are higher than model 3, while the precision value is higher in model 3. Recall value is more related to false negative rate and the precision value is used to represent to the false positive rate. It's hard to say which model is the best one, but we should choose the proper model according to our aims. If minimizing false positives is our focus, model 3 might be a good option, and model 2 is more suitable is we want to minimize  false negative rate.   
One of the limitation of the KNN modeling is this model has difficulties in identifying feature importance. So from here we can't answer the question that which are the top 3 factors that affect the onset of cardiovascular diseases. Our following modeling using logistic regression and classification tree will answer the question.  

## Logistic Regression Model

The second type of model we used for the disease prediction is logistic regression. Logistic regression models, also called logit models, are used to model dichotomous outcome variables. In a logit model the log odds of the outcome are modeled as a linear combination of the predictor variables. 

The data set has 13 variables. In contrast to KNN modeling, here we used logistic regression to remove variables one by one. The alpha value was set equal to 0.1. This means that p-values greater than 0.1 will be removed.

For consistency between models used in this study we used 70% of data set as training data set, and the remaining 30% as testing data set. The seed point was set to 42.

In this study we built 4 logistic regression models based on the variables that have significant affect on the prediction.
```{r}
#load dataset
data <- cardio_clean_final

#view summary of dataset
summary(data)

 #find total observations in dataset
nrow(data)

```

```{r}

#Create Training and Test Samples

#make this example reproducible
set.seed(42)

#Use 70% of dataset as training set and remaining 30% as testing set
sample <- sample(c(TRUE, FALSE), nrow(data), replace=TRUE, prob=c(0.7 , 0.3))
train <- data[sample, ]
test <- data[!sample, ]

#view train summary
summary(train)

```


### Modeling

**Logistic Regression Model 1**

For the 1st model we use all 13 variables.

```{r results='markup'}
#Fit the logistic regression model including all variables

logit_reg_model_ALL <- glm(cardio ~ gender + age + bmi + height + weight + ap_hi + ap_lo + cholesterol + gluc + smoke + alco + active, family="binomial", data=cardio_clean_final)

#disable scientific notation for model summary
options(scipen = 999)

#view model summary
#summary(logit_reg_model_ALL)
xkabledply(logit_reg_model_ALL, title = "logit_reg_model_ALL")

```

According to the first model, `gender`, `bmi`, `height`, `weight` and `gluc` do not have significant affect.

**Logistic Regression Model 2**

For the second model we included all variables but gender.

```{r results='markup'}
#Fit the logistic regression model excluding gender

logit_reg_model_wogender <- glm(cardio ~ age + bmi + height + weight + ap_hi + ap_lo + cholesterol + gluc + smoke + alco + active, family="binomial", data=cardio_clean_final)

#disable scientific notation for model summary
options(scipen = 999)

#view model summary
#summary(logit_reg_model_wogender)
xkabledply(logit_reg_model_wogender, title = "logit_reg_model_wogender")

```

The results showed that the bmi, height, weight and gluc are still remaining not detrimental for the disease prediction.

**Logistic Regression Model 3**

Next we excluded gender and bmi to build the third model.  

```{r results='markup'}
#Fit the logistic regression model excluding gender and bmi

logit_reg_model_wogenderbmi <- glm(cardio ~ age + height + weight + ap_hi + ap_lo + cholesterol + gluc + smoke + alco + active, family="binomial", data=cardio_clean_final)

#disable scientific notation for model summary
options(scipen = 999)

#view model summary
#summary(logit_reg_model_wogenderbmi)
xkabledply(logit_reg_model_wogenderbmi, title = "logit_reg_model_wogenderbmi")


```

Interestingly, when we removed the `bmi` variable from the model, `height` and `weight` started to show significant impact on disease prediction. However, gluc2 is still not significant. 

**Logistic Regression Model 4**

For the last model we excluded the `gluc` variable from our dataset. 

```{r results='markup'}
#Remove gender, bmi and gluc from the modeling because they do not any effect on cardio

logit_reg_model_cleaned <- glm(cardio ~ age + height + weight + ap_hi + ap_lo + cholesterol + smoke + alco + active, family="binomial", data=cardio_clean_final)

#disable scientific notation for model summary
options(scipen = 999)

#view model summary
#summary(logit_reg_model_cleaned)
xkabledply(logit_reg_model_cleaned, title = "logit_reg_model_cleaned")

```

As we can see from the above presented outcome of the Model 4, all remaining 10 variables have an significant effect on disease prediction. Therefore for the further evaluation we will use Model 4.

### Evaluation

**Coefficient Exponential** 

To confirm which variable have positive or negative impact on disease prediction we calculated the coefficient exponential for all variables from Model 4.

```{r exp, results=T}
expcoeff1 = exp(coef(logit_reg_model_cleaned))
#summary(expcoeff1)
xkabledply( as.table(expcoeff1), title = "Exponential of coefficients in cardio" )

```
For exponential of coefficients, there are five variables having positive effect on dependent variables `cardio`, and four having negative effect. For example, if person smoke, they will have more possibility to develop cardiovascular diseases.

**Confusion matrix**

Confusion matrix is a table that will categorize the predictions against the actual values. It includes two dimensions, among them one will indicate the predicted values and another one will represent the actual values.
```{r confusionMatrix, results='markup'}
loadPkg("regclass")
#confusion_matrix(logit_reg_model_cleaned)
xkabledply( confusion_matrix(logit_reg_model_cleaned), title = "Confusion matrix from Logit Model" )
unloadPkg("regclass")
```

```{r confusion matrix, results=F}
loadPkg("regclass")
cfmatrix1 = confusion_matrix(logit_reg_model_cleaned)
accuracy1 <- (cfmatrix1[1,1]+cfmatrix1[2,2])/cfmatrix1[3,3]
precision1 <- cfmatrix1[2,2]/(cfmatrix1[2,2]+cfmatrix1[1,2])
recall1 <- cfmatrix1[2,2]/(cfmatrix1[2,2]+cfmatrix1[2,1])
specificity1 <- cfmatrix1[1,1]/(cfmatrix1[1,1]+cfmatrix1[1,2])
F1_score1 <- 2*(precision1)*(recall1)/(precision1 + recall1)
accuracy1
precision1
recall1
specificity1
F1_score1
```
From confusion matrix, we can conclude that 

 - Accuracy =`r accuracy1`
 - Precision=`r precision1`
 - Recall=`r recall1`
 - Specificity=`r specificity1`
 - F1 score=`r F1_score1`
 
 Now let us explain what each value means:
 
 The success rate or the accuracy of the model can be easily calculated using 2x2 confusion matrix. The accuracy score reads as 72.3% for the given data and observations. It is straightforward that the error rate will be 27.7%.
 Precision is a measure of how many of the positive predictions made are correct (true positives). It is 75.2% for our model.
 Recall is a measure of how many of the positive cases the classifier correctly predicted, over all the positive cases in the data. It is sometimes also referred to as Sensitivity. For our model it is 65.6%.
 Specificity is a measure of how many negative predictions made are correct (true negatives). The value for our model is 78.9%.
 F1-Score is a measure combining both precision and recall. It is generally described as the harmonic mean of the two. Harmonic mean is just another way to calculate an "average" of values, geneerally described as more suitable for ratios (such as precision and recall) than the traditional arithmetic mean. F1-Score for our model is 0.7 which is 70% and pretty high number.

**ROC and AUC**

The Area Under the ROC Curve (AUC) is an aggregated metric that evaluates how well a logistic regression model classifies positive and negative outcomes at all possible cutoffs. It can range from 0.5 to 1, and the larger is the number it is better. AUC represents the probability that a random positive example is positioned to the right of a random negative example. A model whose predictions are 100% wrong has an AUC of 0.0; one whose predictions are 100% correct has an AUC of 1.0.

```{r roc_auc}
loadPkg("pROC") 
prob=predict(logit_reg_model_cleaned, type = "response" )
logit_reg_model_cleaned$prob=prob
h = roc(cardio~prob, data=cardio_clean_final)
auc(h) 
plot(h)
```
We have here the area-under-curve of `r auc(h)`, which is greater than 0.5. This means that the proposed model (Model 4) is considered a good fit.

**McFadden**

In typical linear regression, we use R-squared as a way to assess how well a model fits the data. this number ranges from 0 to 1, with higher values indicating better model fit. However, there is no such R-squared value for logistic regression. Instead, we can compute a metric known as McFadden's R-squared or pseudo-R-squared, which ranges from 0 to just under 1. Values close to 0 indicate that the model fits the data very well.

```{r McFadden}
loadPkg("pscl")
cardio_logit_reg = pR2(logit_reg_model_cleaned)
cardio_logit_reg
unloadPkg("pscl") 
```
With the McFadden value of 0.189, which is analogous to the coefficient of determination $R^2$, only about 18.9% of the variations in y is explained by the explanatory variables in the model.

**Conclusion:** 

Overall the Logistic Regression Model 4 is considered a good model. The model has correctly predicted 72.3% for the given data. precision of the model is 75.2%. Specificity of correct negative predictions is 78.9%, with the F1-Score of 70%. The only limitation of the model is relatively low McFadden value (0.189)


## Classification Tree

This model is Classification Tree model. We could use the classification tree model to predict the absence and presence of disease visually. It is very convenient. And we will discuss the performance of this model below.

First step, cleaning the dataset for preparing to build the model. Ensure the variables are all numerical variables. And delete the ID, which has no relationship to the results.

```{r , message=FALSE}

#read in original data set
cardio_tree_orig = data.frame(read.csv("cardio_train.csv", header = TRUE, sep=';'))

#remove id column, convert age to years, add bmi col
cardio_tree_new = cardio_tree_orig[-c(1)]
cardio_tree_new$age = cardio_tree_new$age/365
cardio_tree_new$bmi = cardio_tree_new$weight/(cardio_tree_new$height/100)**2

#remove outliers from bmi, api_hi and api_lo
cardio_clean_1 = outlierKD2(cardio_tree_new, bmi, TRUE, FALSE, FALSE, FALSE) 
cardio_clean_2 = outlierKD2(cardio_clean_1, ap_hi, TRUE, FALSE, FALSE, FALSE) 
cardio_clean_3 = outlierKD2(cardio_clean_2, ap_lo, TRUE, FALSE, FALSE, FALSE)

#remove NA rows
cardio_clean_3 = na.omit(cardio_clean_3)
cardio_clean_tree_final <- transform(cardio_clean_3, cardio=as.factor(cardio))
cardio_tree = cardio_clean_tree_final[,c(1,2,3,4,13,5,6,7,8,9,10,11,12)]
str(cardio_tree)


```

Then we did the feature selection by random forest. This is a fundamental outcome of the random forest and it shows, for each variable, how important it is in classifying the data.The mean decrease in Gini coefficient is a measure of how each variable contributes to the homogeneity of the nodes and leaves in the resulting random forest. The higher the value of mean decrease Gini score, the higher the importance of the variable in the model. And as we can see from the picture, we chose the top 6 features to build this classification tree model{ap_lo (the opposite of ap_hi) is highly correlated to ap_hi, so it doesn’t gain anything to use it together.}.

```{r Classification Tree feature selection, results=TRUE}
library(randomForest)
fit_im = randomForest(cardio_tree$cardio~., data=cardio_tree)
# Create an importance based on mean decreasing gini
#importance(fit_im)
varImpPlot(fit_im)
```

Next, we try to find the best depths in this model. We tried different depths and summary the results
```{r Classification Tree depths result}
loadPkg("rpart")
loadPkg("caret")


# create an empty dataframe to store the results from confusion matrices
confusionMatrixResultDf = data.frame( Depth=numeric(0), Accuracy= numeric(0), Sensitivity=numeric(0), Specificity=numeric(0), Pos.Pred.Value=numeric(0), Neg.Pred.Value=numeric(0), Precision=numeric(0), Recall=numeric(0), F1=numeric(0), Prevalence=numeric(0), Detection.Rate=numeric(0), Detection.Prevalence=numeric(0), Balanced.Accuracy=numeric(0), row.names = NULL )

for (deep in 2:8) {
  kfit <- rpart(cardio ~age+ap_lo+bmi+weight+height+cholesterol, data= cardio_tree, method="class", control = list(maxdepth = deep) )
  # 
  cm = confusionMatrix( predict(kfit, type = "class"), reference = cardio_tree[, "cardio"] ) # from caret library
  # 
  cmaccu = cm$overall['Accuracy']
  # print( paste("Total Accuracy = ", cmaccu ) )
  # 
  cmt = data.frame(Depth=deep, Accuracy = cmaccu, row.names = NULL ) # initialize a row of the metrics 
  cmt = cbind( cmt, data.frame( t(cm$byClass) ) ) # the dataframe of the transpose, with k valued added in front
  confusionMatrixResultDf = rbind(confusionMatrixResultDf, cmt)
  # print("Other metrics : ")
}

unloadPkg("caret")
```

As we can see from the results,from the depths 4, the accuracy of different depths is almost same. Therefore, we choosed the depths 6 to build the tree model.
```{r , results="asis", results=TRUE}
xkabledply(confusionMatrixResultDf, title="Cardio Classification Trees summary with varying MaxDepth")
```


Then we built this classification tree model and plot it.
```{r , echo = T, fig.dim=c(6,4), results=TRUE}
set.seed(1)
cardio_tree_fit <- rpart(cardio ~ age+ap_lo+bmi+weight+height+cholesterol , data= cardio_tree, method="class", control = list(maxdepth = 6) )

#printcp(cardio_tree_fit) # display the results 
plotcp(cardio_tree_fit) # visualize cross-validation results 
#summary(cardio_tree_fit) # detailed summary of splits

# plot tree 
plot(cardio_tree_fit, uniform=TRUE, main="Classification Tree for Cardio")
text(cardio_tree_fit, use.n=TRUE, all=TRUE, cex=.8)

```


```{r  }
# create attractive postcript plot of tree 
post(cardio_tree_fit, file = "cardiotreeTree2.ps", title = "Classification Tree for Cardio")
```


### Results and confusion matrix
And these are the summary results of this model and its confusion matrix.
```{r , include=T, results=TRUE}
loadPkg("caret") 
cm = confusionMatrix( predict(cardio_tree_fit, type = "class") , reference = cardio_tree[, "cardio"])
#print('Overall: ')
#cm$overall
#print('Class: ')
xkabledply(data.frame(cm$byClass), title="Tree Metrics")
unloadPkg("caret")
```

We can see the accuracy of classification tree is 0.695, precision is 0.673, recall is 0.773 and F1 scores is 0.72.


```{r Classification Tree, results="asis"}
xkabledply(cm$table, "confusion matrix")
```

24579 rows of data with outcome as 0 are correctly predicted by model(True Negative)
19019 rows of data with outcome as 1 are correctly predicted by model(True Positive)
11943 rows of data which has actual outcome of 0 are predicted wrongly as 1 by the model(False Positive)
7207 rows of data which has actual outcome of 1 are predicted wrongly as 0 by the model(False Negative)


### Tree plot
We also tried two other ways to plot the tree, with library `rpart.plot` and a "fancy" plot using the library `rattle` to make the tree plot more beautifully. 
```{r Classification Tree fancyplot, results=TRUE}
loadPkg("rpart.plot")
rpart.plot(cardio_tree_fit)
loadPkg("rattle") 
fancyRpartPlot(cardio_tree_fit)
```

Take a look at a plot of the tree, we can find that the leaf nodes predict the label (level of factor variable) .The tree selected contains 3 variables with 5 splits and displayed 6 terminal leaf nodes. 


### ROC curve
Finally, we also checked the ROC and AUC of this classification tree model.   
```{R ROC curve of Classification Tree, results=TRUE}
library(rpart)
rp <- rpart(cardio ~ age+ap_lo+bmi+weight+height+cholesterol, data = cardio_tree)
library(ROCR)
pred <- prediction(predict(cardio_tree_fit, type = "prob")[, 2], cardio_tree$cardio)
tree.predict.prob <- predict(cardio_tree_fit, type = "prob")[, 2]
plot(performance(pred, "tpr", "fpr"), main = "ROC Cardio")
auc = performance(pred, 'auc')
#slot(auc, 'y.values')
abline(0, 1, lty = 2)

```

The area-under-curve is 0.727 which is close to 0.8. Hence,this model is a good model.


**Conclusion:** 
According to the mean decrease in Gini, the top 3 risk factors are age, ap_hi and bmi.
Although the accuracy of this model is 69.5%, the area-under-curve is 0.727 which is close to 0.8. 
Hence, this model is kind of a good model. 


## Model Comparison

Metric    | KNN | Logistic | Tree 
  :---     |  :-- |  :-- |  :--
Accuracy | 0.722  | 0.695 | 0.723
Precision | 0.740 |0.679 |  0.752
Recall | 0.676 | 0.773 | 0.656 
F1 Score | 0.753 | 0.72 | 0.7


# IV. Conclusion
**1. How accurately can the model predict the probability that a patient has cardiovascular disease?**  
The logistic model had the highest accuracy in predicting cardiovascular disease. It's accuracy rate is 72.3%.
  
**2. What are the top three factors associated with cardiovascular disease? Does this corroborate the risk factors observed by the World Health Organization?**  
According to the classification tree, the top 3 risk factors are age, diastolic blood pressure and cholesterol.

**3. Is there an association between and individual's sex and whether they have cardiovascular disease?**  
The data does not support the idea that gender influences the presence of cardiovascular disease.


# References

Cleveland Clinic medical professional (Ed.). (2021, October 5). *Cardiovascular disease: Heart disease causes and symptoms.* Cleveland Clinic. Retrieved August 2, 2022, from https://my.clevelandclinic.org/health/diseases/21493-cardiovascular-disease#:~:text=Cardiovascular%20disease%20is%20the%20leading,women%20die%20from%20cardiovascular%20disease. 

FDA Center for Tobacco Products. (2021, November 9). *How smoking affects heart health*. U.S. Food and Drug Administration. Retrieved August 5, 2022, from https://www.fda.gov/tobacco-products/health-effects-tobacco-use/how-smoking-affects-heart-health#References 

National Institute of Health. *High Blood Pressure and Older Adults.* Retrieved August 7, 2022 from https://www.nia.nih.gov/health/high-blood-pressure-and-older-adults

World Health Organization. (n.d.). *Cardiovascular diseases*. World Health Organization. Retrieved August 2, 2022, from https://www.who.int/health-topics/cardiovascular-diseases#tab=tab_1