forked from harestakesyan/Team2_final-project
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathTEAM2_Final_Summary.Rmd
1260 lines (942 loc) · 61.7 KB
/
TEAM2_Final_Summary.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
---
title: "Heartbreakers: Estimating the Risk Factors of Cardiovascular Disease"
author: "By: Hovhannes Arestakesyan, Karina Martinez, Timberley Thompson, Chenxu Wang, & Zhe Yu"
date: "`r Sys.Date()`"
output:
html_document:
code_folding: hide
number_sections: false
toc: yes
toc_depth: 3
toc_float: yes
pdf_document:
toc: yes
toc_depth: '3'
---
```{r init, include=FALSE}
#RMD Settings
knitr::opts_chunk$set(warning = F, results = "hide", message = F)
options(scientific=T, digits = 3)
#Packages
library(ezids)
library(tidyverse)
library(tinytex)
require(rmarkdown)
#Data
cardio_orig = data.frame(read.csv("cardio_train.csv", header = TRUE, sep=';'))
str(cardio_orig)
cardio = cardio_orig[-c(1)] #get rid of id
cardio$age = cardio$age/365 # transfer days to years
cardio$bmi = cardio$weight/((cardio$height/100)**2) #add bmi variable
cardio_clean = outlierKD2(cardio, bmi, TRUE, TRUE, TRUE, TRUE)
cardio_clean = outlierKD2(cardio_clean, ap_hi, TRUE, TRUE, TRUE, TRUE)
cardio_clean = outlierKD2(cardio_clean, ap_lo, TRUE, TRUE, TRUE, TRUE)
cardio_clean = na.omit(cardio_clean) #remove NA rows
cardio_clean_final = cardio_clean
cardio_clean_final$gender = as.factor(cardio_clean$gender)
cardio_clean_final$cholesterol = as.factor(cardio_clean$cholesterol)
cardio_clean_final$gluc = as.factor(cardio_clean$gluc)
cardio_clean_final$smoke = as.factor(cardio_clean$smoke)
cardio_clean_final$alco = as.factor(cardio_clean$alco)
cardio_clean_final$active = as.factor(cardio_clean$active)
cardio_clean_final$cardio = as.factor(cardio_clean$cardio)
```
# I. Background
Cardiovascular disease is "the leading cause of death globally," killing an estimated 17.9 million people each year (World Health Organization). Cardiovascular disease refers to any of a group of conditions affecting one's heart and blood vessels; these conditions can include heart failure, arrhythmia, aortic disease and more (Cleveland Clinic).
The World Health Organization asserts that an "unhealthy diet, physical inactivity, tobacco use" and alcohol abuse are the most important behavioral risk factors in developing heart disease. Engaging in such activities can often lead to "raised blood pressure, raised blood glucose... and obesity" (World Health Organization).
This research paper is an investigation of The World Health Organization's statements. The research team analyzed data on cardiovascular disease to determine the risk factors of the disease and recommend a model to predict cardiovascular disease in patients.
## SMART Questions
Specifically, the research team answered the following Specific, Measurable, Achievable, Relevant, and Time-oriented (SMART) questions:
1. How accurately can the model predict the probability that a patient has cardiovascular disease?
2. What are the top three factors associated with cardiovascular disease? Does this corroborate the risk factors observed by the World Health Organization?
3. Is there any association between an individual's sex and whether they have
cardiovascular disease? Are men more prone to cardiovascular diseases than women?
## Dataset Description
Data analysis was performed on the Cardiovascular Disease dataset from [Kaggle.com](https://www.kaggle.com/datasets/sulianova/cardiovascular-disease-dataset). To clean the data, we generated a `bmi` variable from `height` and `weight`, removed outliers for `bmi`, `ap_hi` and `ap_lo`, then converted variables to factors where appropriate. The final dataset, `cardio_clean_final`, has `r nrow(cardio_clean_final)` observations and the following `r ncol(cardio_clean_final)` variables:
### Variables
Variable | Type | Definition
:- | :-- | :----
`age` | Number | Age in years
`gender` |Factor w/ 2 levels | Gender (1: women, 2: men)
`height` | Number | Height in centimeters (cm)
`weight` |Number | Weight in kilograms (kg)
`bmi` | Number | Body Max Index (kg/m^2^)
`ap_hi` | Integer | Systolic blood pressure
`ap_lo` | Integer | Diastolic blood pressure
`cholesterol` |Factor w/ 3 levels| Cholesterol (1: normal, 2: above normal, 3: well above normal)
`gluc` |Factor w/ 3 levels| Glucose (1: normal, 2: above normal, 3: well above normal)
`smoke` |Factor w/ 2 levels| Whether patient smokes or not (0: no, 1: yes)
`alco` |Factor w/ 2 levels| Whether patient consumes alcohol or not (0: no, 1: yes)
`active` |Factor w/ 2 levels| Whether patient is physically active or not (0: no, 1: yes)
`cardio` |Factor w/ 2 levels| Presence or absence of cardiovascular disease (0: absent, 1: present)
Summary statistics for the cleaned dataset are presented below.
```{r summary, results='markup'}
xkablesummary(cardio_clean_final, title = "Cardiovascular Disease Dataset Summary" )
```
# II. Exploratory Data Analysis
## Categorical Variables
```{r cat.graphs, results='markup'}
require(patchwork)
disease <- subset(cardio_clean_final, cardio==1)
healthy <- subset(cardio_clean_final, cardio==0)
#cholesterol
cholesterol.1 <- ggplot(cardio_clean_final, aes(x=cardio, fill= cholesterol)) +
geom_bar(position="fill") +
scale_x_discrete(labels=c("1" = "Has CVD", "0" = "No CVD")) +
scale_y_continuous(labels = scales::percent) +
labs( title = "Cholesterol", y = "", x = "",
fill = "Cholesterol Level") +
scale_fill_manual(values=c("#f4cccc", "#e06666", "#990000"),
labels = c("Normal", "Above Normal", "Well Above Normal")) +
theme_minimal()
#Glucose
gluc.1 <- ggplot(cardio_clean_final, aes(x=cardio, fill= gluc)) +
geom_bar(position="fill") +
scale_x_discrete(labels=c("1" = "Has CVD", "0" = "No CVD")) +
scale_y_continuous(labels = scales::percent) +
labs( title = "Glucose", y = "", x = "",
fill = "Glucose Level") +
scale_fill_manual(values=c("#f4cccc", "#e06666", "#990000"),
labels = c("Normal", "Above Normal", "Well Above Normal")) +
theme_minimal()
#Gender
cardio_clean_final$gender.yn<- as.factor(ifelse(cardio_clean_final$gender== "1", "Women", "Men"))
gender.1 <- ggplot(cardio_clean_final, aes(x=cardio, fill= gender.yn)) +
geom_bar(position="fill") +
scale_x_discrete(labels=c("1" = "Has CVD", "0" = "No CVD")) +
scale_y_continuous(labels = scales::percent) +
labs( title = "Gender",
y = "", x = "", fill = "") +
scale_fill_manual(values=c("#0000da", "#990000")) +
theme_minimal()
#Smoke
cardio_clean_final$smoke.yn<- as.factor(ifelse(cardio_clean_final$smoke== "1", "Yes", "No"))
smoke.1 <- ggplot(cardio_clean_final, aes(x=cardio, fill= smoke.yn)) +
geom_bar(position="fill") +
scale_x_discrete(labels=c("1" = "Has CVD", "0" = "No CVD")) +
scale_y_continuous(labels = scales::percent) +
labs(title = "Tobacco Use", y = "", x = "", fill = "Smoker") +
scale_fill_manual(values=c("#0000da", "#990000")) +
theme_minimal()
#alcohol
cardio_clean_final$alco.yn<- as.factor(ifelse(cardio_clean_final$alco== "1", "Yes", "No"))
alco.1 <- ggplot(cardio_clean_final, aes(x=cardio, fill= alco.yn)) +
geom_bar(position="fill") +
scale_x_discrete(labels=c("1" = "Has CVD", "0" = "No CVD")) +
scale_y_continuous(labels = scales::percent) +
labs( title = "Alcohol Use", y = "", x = "", fill = "Drinks Alcohol") +
scale_fill_manual(values=c("#0000da", "#990000")) +
theme_minimal()
#active
cardio_clean_final$active.yn<- as.factor(ifelse(cardio_clean_final$active== "1", "Yes", "No"))
active.1 <- ggplot(cardio_clean_final, aes(x=cardio, fill= active.yn)) +
geom_bar(position="fill") +
scale_x_discrete(labels=c("1" = "Has CVD", "0" = "No CVD")) +
scale_y_continuous(labels = scales::percent) +
labs( title = "Physical Activity", y = "", x = "", fill = "Physically Active") +
scale_fill_manual(values=c("#0000da", "#990000")) +
theme_minimal()
gender.1 + cholesterol.1 + gluc.1 + smoke.1 + alco.1 + active.1 + plot_layout(ncol = 2)
```
By exploring the data, we found that the most noticeable differences between people with and without cardiovascular disease (CVD) were in cholesterol and glucose. `r sprintf("%.1f", (sum(disease$cholesterol!= "1")/nrow(disease)) * 100)`% of people with cardiovascular disease have above normal cholesterol levels, while only `r sprintf("%.1f", (sum(healthy$cholesterol!= "1")/nrow(disease)) * 100)`% of people *without* the disease have above normal cholesterol. Similarly, `r sprintf("%.1f", (sum(disease$gluc!="1")/nrow(disease)) * 100)`% of people with the disease have above normal glucose, while only `r sprintf("%.1f", (sum(healthy$gluc!="1")/nrow(disease)) * 100)`% of people *without* the disease have above normal glucose.
The differences between the two groups (people with and without cardiovascular disease) are less noticeable when examining gender and lifestyle factors. The percentage of women in the "has CVD" group is nearly equal to the percentage of women in the "no CVD" group. Likewise, the percentage of alcohol drinkers is nearly the same in the "has CVD" group and the "no CVD" group.
Unsurprisingly, there is a slightly higher percentage of physically active people in the "no CVD" group than in the "has CVD" group. However, it *is* surprising that there is a slightly higher percentage of smokers in the "*no CVD*" group" than in the "CVD" group. We found that `r sprintf("%.1f", (sum(healthy$smoke==1)/nrow(healthy)) * 100)`% of people without CVD were smokers, while only `r sprintf("%.1f", (sum(disease$smoke== 1)/nrow(disease)) * 100)`% of people with CVD were smokers. This finding seems to contradict experts' assertions that tobacco use can cause cardiovascular disease (FDA).
## Chi-Squared Tests
We performed $\chi^2$ tests to further explore the data and determine if the differences we observed between the "CVD" and "no CVD" groups were statistically significant.
```{r chi}
disease <- subset(cardio_clean_final, cardio==1)
healthy <- subset(cardio_clean_final, cardio==0)
#Gender
cardio_vs_gender = table(cardio_clean_final$cardio, cardio_clean_final$gender)
chi_gender = chisq.test(cardio_vs_gender)
#Cholesterol
cardio_vs_cholesterol=table(cardio_clean_final$cardio, cardio_clean_final$cholesterol)
chi_cholesterol = chisq.test(cardio_vs_cholesterol)
#Glucose
cardio_vs_glucose = table(cardio_clean_final$cardio, cardio_clean_final$gluc)
chi_glucose = chisq.test(cardio_vs_glucose)
#Smoking
cardio_vs_smoke = table(cardio_clean_final$cardio, cardio_clean_final$smoke)
chi_smoke = chisq.test(cardio_vs_smoke)
#Alcohol
cardio_vs_alco = table(cardio_clean_final$cardio, cardio_clean_final$alco)
chi_alco = chisq.test(cardio_vs_alco)
#Physical Activity
cardio_vs_active = table(cardio_clean_final$cardio, cardio_clean_final$active)
chi_active = chisq.test(cardio_vs_active)
```
Categorical Variable | Chi-Squared | P-Value | df
:- | :- | :-| :-
`gender` | `r format(chi_gender$statistic, digits=3)` | `r format(chi_gender$p.value, digits=3)` | `r chi_gender$parameter`
`cholesterol` | `r format(chi_cholesterol$statistic, digits=3)` | `r format(chi_cholesterol$p.value, digits=3)` | `r chi_cholesterol$parameter`
`glucose` | `r format(chi_glucose$statistic, digits=3)` | `r format(chi_glucose$p.value, digits=3)` | `r chi_glucose$parameter`
`smoke` | `r format(chi_smoke$statistic, digits=3)` | `r format(chi_smoke$p.value, digits=3)` | `r chi_smoke$parameter`
`alco` | `r format(chi_alco$statistic, digits=3)` | `r format(chi_alco$p.value, digits=3)` | `r chi_alco$parameter`
`active` | `r format(chi_active$statistic, digits=3)` | `r format(chi_active$p.value, digits=3)` | `r chi_active$parameter`
We use these results to test the following hypothesis for each variable:
**Is the variable independent of cardiovascular disease?**
$H_0$: They are independent.
$H_1$: They are not independent.
The p-value for `gender` is greater than 0.05, so we can conclude that gender is independent of cardiovascular disease and may not have a significant impact on whether a patient has CVD. The p-values for all other variables are less than 0.01; in fact, `cholesterol`, `glucose`, `smoke` and `active` are almost 0. Thus, we reject the null hypothesis and conclude that these variables are related to cardiovascular disease.
The $\chi^2$ tests corroborate the observations from the bar graphs. The variables with the smallest p-values (cholesterol and glucose) are the same ones that had the most obvious differences in their bar graphs.
## Quantitative Variables
```{r cont.graphs}
disease <- subset(cardio_clean_final, cardio==1)
healthy <- subset(cardio_clean_final, cardio==0)
#age
cardio_clean_final$cardio.yn<- as.factor(ifelse(cardio_clean_final$active== "1", "Yes", "No"))
age.kde <- ggplot(cardio_clean_final, aes(x = age, fill= cardio.yn)) +
geom_density(position = "stack") +
geom_vline(aes(xintercept=mean(disease$age)), color="pink3", linetype="dashed", size=1) +
geom_vline(aes(xintercept=mean(healthy$age)), color="steelblue", linetype="dashed", size=1) +
scale_x_continuous(limits = c(30,70)) +
scale_y_continuous(labels = scales::percent) +
labs(title = "Age Distribution (Stacked)",
fill = "Has CVD", x = "Age", y = " ") +
scale_fill_manual(values=c("#0000da", "#990000")) +
theme_minimal()
#height
height.kde <- ggplot(cardio_clean_final, aes(x = height, fill= cardio.yn)) +
geom_density(position = "stack") +
geom_vline(aes(xintercept=mean(disease$height)), color="pink3", linetype="dashed", size=1) +
geom_vline(aes(xintercept=mean(healthy$height)), color="steelblue", linetype="dashed", size=1) +
scale_y_continuous(labels = scales::percent) +
labs(title = "Height Distribution (Stacked)",
fill = "Has CVD", x = "Height (cm)", y = " ") +
scale_fill_manual(values=c("#0000da", "#990000")) +
theme_minimal()
#weight
weight.kde <- ggplot(cardio_clean_final, aes(x = weight, fill= cardio.yn)) +
geom_density(position = "stack") +
geom_vline(aes(xintercept=mean(disease$weight)), color="pink3", linetype="dashed", size=1) +
geom_vline(aes(xintercept=mean(healthy$weight)), color="steelblue", linetype="dashed", size=1) +
scale_y_continuous(labels = scales::percent) +
labs(title = "Weight Distribution (Stacked)",
fill = "Has CVD", x = "Weight (kg)", y = " ") +
scale_fill_manual(values=c("#0000da", "#990000")) +
theme_minimal()
#bmi
bmi.kde <- ggplot(cardio_clean_final, aes(x = bmi, fill= cardio.yn)) +
geom_density(position = "stack") +
geom_vline(aes(xintercept=mean(disease$bmi)), color="pink3", linetype="dashed", size=1) +
geom_vline(aes(xintercept=mean(healthy$bmi)), color="steelblue", linetype="dashed", size=1) +
scale_y_continuous(labels = scales::percent) +
labs(title = "BMI Distribution (Stacked)",
fill = "Has CVD", x = "BMI", y = " ") +
scale_fill_manual(values=c("#0000da", "#990000")) +
theme_minimal()
#systolic bp
ap_hi.kde <- ggplot(cardio_clean_final, aes(x = ap_hi, fill= cardio.yn)) +
geom_density(position = "stack") +
geom_vline(aes(xintercept=mean(disease$ap_hi)), color="pink3", linetype="dashed", size=1) +
geom_vline(aes(xintercept=mean(healthy$ap_hi)), color="steelblue", linetype="dashed", size=1) +
scale_y_continuous(labels = scales::percent) +
scale_x_continuous(breaks = c(seq(90,170,20))) +
labs(title = "Systolic BP Distribution (Stacked)",
fill = "Has CVD", x = "Systolic BP", y = " ") +
scale_fill_manual(values=c("#0000da", "#990000")) +
theme_minimal()
#diastolic bp
ap_lo.kde <- ggplot(cardio_clean_final, aes(x = ap_lo, fill= cardio.yn)) +
geom_density(position = "stack") +
geom_vline(aes(xintercept=mean(disease$ap_lo)), color="pink3", linetype="dashed", size=1) +
geom_vline(aes(xintercept=mean(healthy$ap_lo)), color="steelblue", linetype="dashed", size=1) +
scale_y_continuous(labels = scales::percent) +
scale_x_continuous(breaks = c(seq(65,105,10))) +
labs(title = "Diastolic BP Distribution (Stacked)",
fill = "Has CVD", x = "Diastolic BP", y = " ") +
scale_fill_manual(values=c("#0000da", "#990000")) +
theme_minimal()
require(ggpubr)
ggarrange(age.kde, height.kde, weight.kde, bmi.kde, ap_hi.kde, ap_lo.kde, ncol=2, nrow=3, common.legend = TRUE, legend = "bottom", heights=c(2,2,2))
```
We used density plots to show the distribution of the quantitative variables in the dataset. Dashed lines were added to the plots to indicate the means for the CVD group (pink line) and the means for the non-CVD groups (blue line).
The plots indicate that the CVD distribution differs from the non-CVD distribution for all variables except for `height`. Patients with CVD are likely to be older and heavier than "no CVD" patients. The "has CVD" group also tends to have higher systolic and diastolic blood pressure.
## Welch Two-Sample T-Test
We used two-sample t-tests to test whether there was a statistically significant difference in means between the "CVD" and "no CVD" groups for each quantitative variable.
```{r ttest}
cardio_pos = subset(cardio_clean_final, cardio_clean_final$cardio == 1)
cardio_neg = subset(cardio_clean_final, cardio_clean_final$cardio == 0)
#age
ttest_age = t.test(cardio_pos$age, cardio_neg$age)
#height
ttest_height = t.test(cardio_pos$height, cardio_neg$height)
#weight
ttest_weight = t.test(cardio_pos$weight, cardio_neg$weight)
#bmi
ttest_bmi = t.test(cardio_pos$bmi, cardio_neg$bmi)
#ap.hi
ttest_ap_hi = t.test(cardio_pos$ap_hi, cardio_neg$ap_hi)
#ap.lo
ttest_ap_lo = t.test(cardio_pos$ap_lo, cardio_neg$ap_lo)
```
Continuous Variable | CVD mean | No CVD mean | t | p-value
:- | :- | :- | :- | :-
`age` | `r format(ttest_age$estimate[1], digits=3)` | `r format(ttest_age$estimate[2], digits=3)` | `r format(ttest_age$statistic)` | `r format(ttest_age$p.value, digits=3)`
`height` | `r format(ttest_height$estimate[1], digits=3)` | `r format(ttest_height$estimate[2], digits=3)` | `r format(ttest_height$statistic)` | `r format(ttest_height$p.value, digits=3)`
`weight`| `r format(ttest_weight$estimate[1], digits=3)` | `r format(ttest_weight$estimate[2], digits=3)` | `r format(ttest_weight$statistic)` | `r format(ttest_weight$p.value, digits=3)`
`bmi` | `r format(ttest_bmi$estimate[1], digits=3)` | `r format(ttest_bmi$estimate[2], digits=3)` | `r format(ttest_bmi$statistic)` | `r format(ttest_bmi$p.value, digits=3)`
`ap_hi` | `r format(ttest_ap_hi$estimate[1], digits=3)` | `r format(ttest_ap_hi$estimate[2], digits=3)` | `r format(ttest_ap_hi$statistic)` | `r format(ttest_ap_hi$p.value, digits=3)`
`ap_lo` | `r format(ttest_ap_lo$estimate[1], digits=3)` | `r format(ttest_ap_lo$estimate[2], digits=3)` | `r format(ttest_ap_lo$statistic)` | `r format(ttest_ap_lo$p.value, digits=3)`
We used these results to test the following hypothesis for each variable:
**Is the variable mean different in CVD vs non-CVD patients?**
$H_0$: The mean is the same
$H_1$: The mean is not the same
The p-value was very low for all quantitative variables indicating that there is a statistically significant difference in the average `age`, `height`, `weight`, `bmi`, `ap_hi` and `ap_lo` of CVD and non-CVD patients. For all of the continuous/quantitative variables, we reject the null hypothesis and accept that there is a difference between the means of the CVD and non-CVD groups.
It is worth noting that when using t-tests with large sample sizes, the hypothesis tests are able to detect even small differences, as was the case for height. The mean for both CVD and non-CVD patients was 165cm, with a `r ttest_height$conf.int[1]` cm to `r ttest_height$conf.int[2]` cm difference at 95% confidence. The t-tests tests align with the observations from the density plots, as there was a noticeable difference in the means of all continuous variables except height.
## Correlations
We examined correlations between each of the variables to determine if a linear relationship exists with cardiovascular disease or any other variables. Because a number of variables are factors, we used the Spearman method to generate the correlation coefficients.
```{r corrplot}
loadPkg("corrplot")
col_order <- c("age", "gender", "height", "weight", "ap_hi", "ap_lo", "cholesterol", "gluc", "smoke", "alco", "active", "bmi", "cardio")
cardio_clean_cor <- cardio_clean[, col_order]
# corrplot with spearman method for categorical variables
cardio_cor <- cor(cardio_clean_cor, method="spearman")
corrplot(cardio_cor, type="lower", addCoef.col="darkred", number.cex=0.6, tl.cex=1,title="Correlation Matrix", mar=c(0,0,1,0))
```
The correlation coefficient for each pair of variables is displayed in the correlation matrix. The coefficient ranges from 1 to -1, indicating a positive or negative correlation. The further away the correlation coefficient is from zero, the stronger the relationship between the two variables.
The factors most strongly correlated with cardiovascular disease are `ap_hi`, `ap_low`, `cholesterol`, `age` and `bmi.` There is also a strong correlation between `height` and `gender`, `ap_hi` and `ap_low`, and `cholesterol` and `gluc`.
Some of these correlations are expected, for instance systolic blood pressure and diastolic blood pressure typically increase and decrease together. Weight and BMI are also strongly correlated, which makes sense because BMI is a measurement of body stature that heavily relies on weight. While the correlation matrix does not show how the relationship between variables impacts the presence or absence of CVD, it is important to keep these relationships in mind when we begin building models.
#### Is there a relationship between blood pressure and other variables in patients with cardiovascular disease?
```{r, results='markup'}
library(patchwork)
#install.packages("ggExtra")
#library(ggExtra)
ap_hi_lo <- ggplot(cardio_clean_final, aes(x=ap_lo, y=ap_hi, color=cardio)) +
geom_point(size = 1.5, alpha=0.7) +
labs(title = "Systolic vs Diastolic BP", color='Has CVD') +
xlab("Diastolic BP ") + ylab("Systolic BP") +
scale_color_manual(values=c("#0000da", "#990000"), labels = c("No", "Yes")) +
theme_minimal()
ap1_hi <- ggplot(cardio_clean_final,aes(x = cholesterol , y = ap_hi, fill = cardio)) +
geom_violin(scale="width", width=1, alpha=.5, aes(linetype=NA)) +
geom_boxplot(width=.1, cex=.5, position=position_dodge(1), outlier.size=.7)+
labs(title = "Systolic BP vs Cholesterol", fill='Has CVD') +
xlab("Cholesterol") + ylab("Systolic BP") +
scale_fill_manual(values=c("#0000da", "#990000"),labels=c("No", "Yes")) +
scale_x_discrete(labels=c("1" = "normal", "2" = "above normal","3" = "well above normal")) +
theme_minimal()
ap2 <- ggplot(cardio_clean_final, aes(x=age, y=ap_hi, color=cardio)) +
geom_point(size = 1.5, alpha=0.7) +
labs(title = "Systolic BP vs Age", color="Has CVD") +
xlab("Age ") + ylab("Systolic BP") +
scale_color_manual(values=c("#0000da", "#990000"), labels = c("No", "Yes")) +
theme_minimal()
ap3 <- ggplot(cardio_clean_final, aes(x=bmi, y=ap_hi, color=cardio)) +
geom_point(size=1.5,alpha=0.7) +
labs(title = "Systolic BP vs BMI", color="Has CVD") +
xlab("BMI ") + ylab("Systolic BP") +
scale_color_manual(values=c("#0000da", "#990000"), labels = c("No", "Yes")) +
theme_minimal()
ggarrange(ap_hi_lo, ap1_hi, ap2, ap3, ncol=2, nrow=2, common.legend = TRUE, legend = "bottom", heights=c(2,2))
```
We can see a strong correlation between systolic and diastolic blood pressure, as these two measurements are known to be physiologically related. According to the NIH, for most adults:
* Normal blood pressure is systolic BP < 120 and diastolic BP < 80
* Elevated blood pressure is systolic BP between 120 and 129 and diastolic BP < 80
* High blood pressure is systolic BP is 130 or higher and diastolic BP 80 or higher
The incidence of CVD increases as both systolic and diastolic BP increase.
Due to the relationship between systolic and diastolic BP, we chose to use only systolic BP to investigate relationships between BP and other variables. We found that there is not a strong correlation between BP and cholesterol, age or BMI. However, across all cholesterol, age and BMI groups, CVD patients have higher BP than non-CVD patients. They also tend to be older and have higher BMI.
#### Is there a relationship between cholesterol, bmi, and age in patients with cardiovascular disease?
```{r, results='markup'}
chol1 <- ggplot(cardio_clean_final,aes(x = cholesterol , y = age, fill = cardio)) +
geom_violin(width=1, alpha=0.5, aes(linetype=NA)) +
geom_boxplot(width=.1, cex=1, position=position_dodge(1), outlier.size=.7)+
labs(title = "Age vs Cholesterol", fill="Has CVD") +
xlab("Cholesterol") + ylab("Age") +
scale_fill_manual(values=c("#0000da", "#990000"),labels=c("No", "Yes")) +
scale_x_discrete(labels=c("1" = "normal", "2" = "above normal","3" = "well above normal")) +
theme_minimal()
chol2 <- ggplot(cardio_clean_final,aes(x = cholesterol , y = bmi, fill = cardio)) +
geom_violin(width=1, alpha=0.5, aes(linetype=NA)) +
geom_boxplot(width=.1, cex=.5, position=position_dodge(1), outlier.size=.7)+
labs(title = "BMI vs Cholesterol", fill="Has CVD") +
xlab("Cholesterol") + ylab("BMI") +
scale_fill_manual(values=c("#0000da", "#990000"),labels=c("No", "Yes")) +
scale_x_discrete(labels=c("1" = "normal", "2" = "above normal","3" = "well above normal")) +
theme_minimal()
age1 <- ggplot(cardio_clean_final, aes(x=age, y=bmi, color=cardio)) +
geom_point(size=1,alpha=0.5) +
labs(title = "BMI vs Age", color="Has CVD") +
xlab("Age") + ylab("BMI") +
scale_color_manual(values=c("#0000da", "#990000"),labels=c("No", "Yes")) +
theme_minimal()
#ggarrange(chol1, chol2, age1, ncol=2, nrow=2, common.legend = TRUE, legend = "bottom")
#(smoke3 + theme(legend.position = "none")) + (active3 + theme(legend.position = "none")) + (gluc3 + theme(legend.position = "none")) + (alco3 + theme(legend.position = "none")) + alco1 + guide_area() + plot_layout(guides = 'collect')
((chol1 + theme(legend.position = "none")) + (chol2 + theme(legend.position = "none"))) / age1 + plot_layout(guides = 'collect')
```
Violin plots combine blox plots and density plots to show summary statistics and where the values are clustered. These plots show that there is a positive trend between `age` and `cholesterol`, as well as `bmi` and `cholesterol.` We can also see an increase in age and BMI between the CVD and non-CVD groups in each cholesterol group.
The scatterplot of BMI vs age does not indicate that there is a correlation between these variables. However, patients with higher BMI and age more likely to have CVD. Also, there appears to be a higher incidence of CVD in older people with low BMI than younger people with high BMI, indicating that age may have more of an effect on CVD than body mass.
#### How do lifestyle choices influence cardiovascular disease with other variables?
```{r, results='markup'}
#Smoking
smoke1 <- ggplot(cardio_clean_final,aes(x = smoke , y = ap_hi, fill = cardio)) +
geom_violin(width=1, alpha=0.5, scale='width', aes(linetype=NA)) +
geom_boxplot(width=.1, cex=1, position=position_dodge(1), outlier.size=.7)+
labs(title = "Systolic Blood Pressure vs Smoke", fill="Has CVD") +
xlab("Smoke") + ylab("Systolic BP") +
scale_fill_manual(values=c("#0000da", "#990000"),labels=c("No", "Yes"))+
scale_x_discrete(labels=c("0" = "No", "1" = "Yes")) +
theme_minimal()
#geom_hline(yintercept = 120, linetype = 'dashed')
smoke2 <- ggplot(cardio_clean_final,aes(x = smoke , y = age, fill = cardio)) +
geom_violin(width=1, alpha=0.5, aes(linetype=NA)) +
geom_boxplot(width=.1, cex=1, position=position_dodge(1), outlier.size=.7)+
labs(title = "Age vs Smoke", fill="Has CVD") +
xlab("Smoke") + ylab("Age") +
scale_fill_manual(values=c("#0000da", "#990000"),labels=c("No", "Yes"))+
scale_x_discrete(labels=c("0" = "No", "1" = "Yes")) +
theme_minimal()
smoke3 <- ggplot(cardio_clean_final,aes(x = smoke , y = bmi, fill = cardio)) +
geom_violin(width=1, alpha=0.5, aes(linetype=NA)) +
geom_boxplot(width=.1, cex=1, position=position_dodge(1), outlier.size=.7)+
labs(title = "BMI vs Smoke", fill="Has CVD") +
xlab("Smoke") + ylab("BMI") +
scale_fill_manual(values=c("#0000da", "#990000"),labels=c("No", "Yes"))+
scale_x_discrete(labels=c("0" = "No", "1" = "Yes")) +
theme_minimal()
#Alcohol
alco1 <- ggplot(cardio_clean_final,aes(x = alco , y = ap_hi, fill = cardio)) +
geom_violin(width=1, alpha=0.5, scale="width", aes(linetype=NA)) +
geom_boxplot(width=.1, cex=1, position=position_dodge(1), outlier.size=.7)+
labs(title = "Systolic BP vs Alcohol", fill="Has CVD") +
xlab("Alcohol") + ylab("Systolic BP") +
scale_fill_manual(values=c("#0000da", "#990000"),labels=c("No", "Yes"))+
scale_x_discrete(labels=c("0" = "No", "1" = "Yes")) +
theme_minimal()
alco2 <- ggplot(cardio_clean_final,aes(x = alco , y = age, fill = cardio)) +
geom_violin(width=1, alpha=0.5, aes(linetype=NA)) +
geom_boxplot(width=.1, cex=1, position=position_dodge(1), outlier.size=.7)+
labs(title = "Age vs Alcohol", fill="Has CVD") +
xlab("Alcohol") + ylab("Age") +
scale_fill_manual(values=c("#0000da", "#990000"),labels=c("No", "Yes"))+
scale_x_discrete(labels=c("0" = "No", "1" = "Yes"))
alco3 <- ggplot(cardio_clean_final,aes(x = alco , y = bmi, fill = cardio)) +
geom_violin(width=1, alpha=0.5, aes(linetype=NA)) +
geom_boxplot(width=.1, cex=1, position=position_dodge(1), outlier.size=.7)+
labs(title = "BMI vs Alcohol", fill="Has CVD") +
xlab("Alcohol") + ylab("BMI") +
scale_fill_manual(values=c("#0000da", "#990000"),labels=c("No", "Yes"))+
scale_x_discrete(labels=c("0" = "No", "1" = "Yes")) +
theme_minimal()
#Physical activity
active1 <- ggplot(cardio_clean_final,aes(x = active , y = ap_hi, fill = cardio)) +
geom_violin(width=1, alpha=0.5, scale="count", aes(linetype=NA)) + # Scale maximum width proportional to sample size
geom_boxplot(width=.1, cex=1, position=position_dodge(.5), outlier.size=.7)+
labs(title = "Systolic Blood Pressure vs Activity (Count Scaled)", fill="Has CVD") +
xlab("Activity") + ylab("Systolic BP") +
scale_fill_manual(values=c("#0000da", "#990000"),labels=c("No", "Yes"))+
scale_x_discrete(labels=c("0" = "No", "1" = "Yes"))
active2 <- ggplot(cardio_clean_final,aes(x = active , y = age, fill = cardio)) +
geom_violin(width=1, alpha=0.5, aes(linetype=NA)) +
geom_boxplot(width=.1, cex=1, position=position_dodge(1), outlier.size=.7)+
labs(title = "Age vs Activity", fill="Has CVD") +
xlab("Activity") + ylab("Age") +
scale_fill_manual(values=c("#0000da", "#990000"),labels=c("No", "Yes"))+
scale_x_discrete(labels=c("0" = "No", "1" = "Yes"))
active3 <- ggplot(cardio_clean_final,aes(x = active , y = bmi, fill = cardio)) +
geom_violin(width=1, alpha=0.5, aes(linetype=NA)) +
geom_boxplot(width=.1, cex=1, position=position_dodge(1), outlier.size=.7)+
labs(title = "BMI vs Activity", fill="Has CVD") +
xlab("Activity") + ylab("BMI") +
scale_fill_manual(values=c("#0000da", "#990000"),labels=c("No", "Yes"))+
scale_x_discrete(labels=c("0" = "No", "1" = "Yes")) +
theme_minimal()
#Blood Glucose
gluc1 <- ggplot(cardio_clean_final,aes(x = gluc , y = ap_hi, fill = cardio)) +
geom_violin(aes(linetype=NA)) +
labs(title = "Systolic Blood Pressure vs Blood Glucose") +
xlab("Blood Glucose") + ylab("Systolic BP") +
scale_fill_manual(values=c("pink3", "steelblue"),labels=c("Absent", "Present")) +
scale_x_discrete(labels=c("1" = "normal", "2" = "above normal","3" = "well above normal"))+
geom_violin() +
labs(title = "Systolic Blood Pressure vs Blood Glucose",
fill = "Has CVD") +
xlab("Blood Glucose") + ylab("Systolic BP") +
scale_fill_manual(values=c("#0000da", "#990000"),labels=c("No", "Yes")) +
scale_x_discrete(labels=c("1" = "Normal", "2" = "Above Normal","3" = "Well Above Normal"))+
geom_hline(yintercept = 120, linetype = 'dashed') +
theme_minimal()
gluc2 <- ggplot(cardio_clean_final,aes(x = gluc , y = age, fill = cardio)) +
geom_violin(width=1, alpha=0.5, aes(linetype=NA)) +
geom_boxplot(width=.1, cex=1, position=position_dodge(1), outlier.size=.7)+
labs(title = "Age vs Blood Glucose", fill="Has CVD") +
xlab("Blood Glucose") + ylab("Age") +
scale_fill_manual(values=c("#0000da", "#990000"),labels=c("No", "Yes"))+
scale_x_discrete(labels=c("1" = "normal", "2" = "above normal","3" = "well above normal")) +
theme_minimal()
gluc2
gluc3 <- ggplot(cardio_clean_final,aes(x = gluc , y = bmi, fill = cardio)) +
geom_violin(width=1, alpha=0.5, aes(linetype=NA)) +
geom_boxplot(width=.1, cex=1, position=position_dodge(1), outlier.size=.7)+
labs(title = "BMI vs Blood Glucose", fill="Has CVD") +
xlab("Blood Glucose") + ylab("BMI") +
scale_fill_manual(values=c("#0000da", "#990000"),labels=c("No", "Yes"))+
scale_x_discrete(labels=c("1" = "normal", "2" = "above normal","3" = "well above normal")) +
theme_minimal()
#(smoke3 + theme(legend.position = "none")) + (active3 + theme(legend.position = "none")) + (gluc3 + theme(legend.position = "none")) + (alco3 + theme(legend.position = "none")) + alco1 + guide_area() + plot_layout(guides = 'collect')
(smoke3 + theme(legend.position = "none")) + (gluc3 + theme(legend.position = "none")) + (alco3 + theme(legend.position = "none")) + alco1 + plot_layout(guides = 'collect')
```
Lastly, we looked at lifestyle choices to assess their relationship with other variables in the presence or absence of cardiovascular disease. The main measurement that was impacted by lifestyle is `bmi.` BMI tends to decrease with smoking, but increase with blood glucose and alcohol. Because BMI is correlated with cardiovascular disease, it appears that diet and alcohol may help to predict CVD as well.
Alcohol consumption also impacted systolic blood pressure. Blood pressure was higher among people who consume alcohol, and the difference between CVD and non-CVD patients was more dramatic in alcohol drinkers. This indicates tht alcohol may also impact the presence of CVD.
# III. Statistical Models
## Feature selection
In the EDA, t-tests and $\chi^2$ tests were used to determine whether the independent variables `age`, `gender`, `height`, `weight`, `ap_hi`, `ap_lo`, `cholesterol`, `gluc`, `smoke`, `alco`, `active` and `bmi` had effects on the response variable `cardio`. The p-values of the tests are shown in the table below:
```{r pval table}
# results='markup'
tab <- matrix(c( '<2e-16','0.6', '8e-07', '<2e-16', '<2e-16', '<2e-16', '<2e-16', '<2e-16', '1e-06', '0.006', '<2e-16', '<2e-16'), ncol=12, byrow = TRUE)
rownames(tab) <- "p_value"
colnames(tab) <- c("Age_year", "Gender", "Height","Weight","Systolic(ap_hi)", "Diastolic(ap_low)", "Cholesterol","Glucose","smoke", "Alcohol", "Physical_activity","BMI")
tab <- as.table(tab)
xkabledply(tab, "p-value Comparison")
```
The p-values suggest that all the independent variables, except for `gender`, have a relationship with `cardio`. We used the `bestglm` function to further confirm this.
```{r bestglm}
#results='markup'
loadPkg("bestglm")
loadPkg("leaps")
loadPkg("ggplot2")
cardio_clean_final_glm <- cardio_clean_final
colnames(cardio_clean_final_glm)[12] <- "y"
cardio_clean_final_glm <- cardio_clean_final_glm[c("age", "gender", "height", "weight","ap_hi", "ap_lo", "cholesterol", "gluc", "smoke", "alco", "active", "bmi", "y")]
res.bestglm <- bestglm(Xy = cardio_clean_final_glm, family = binomial,
IC = "AIC", method = "exhaustive")
```
```{r results='markup'}
#summary(res.bestglm)
res.bestglm$BestModels
#summary(res.bestglm$BestModels)
unloadPkg("bestglm")
unloadPkg("leaps")
```
`gender` and `bmi` were excluded from the model that has lowest criterion value, indicating these two variables may not have strong effects on `cardio`. But among the 5 models the function generated, the criterion values are very close, and `gender` was voted as `FALSE` three times, while `bmi` was only voted twice. So we are not confident in removing these two variables from the following modeling steps. In the KNN modeling, we compare whether these two variables can affect the model performances. Also, since the correlations matrix showed the variables `age`, `cholesterol`, `bmi`, `ap_hi` and `ap_lo` had relatively high coefficient values with `cardio`, we also performed the KNN modeling with these 5 variables.
## K-Nearest Neighbor (KNN)
Before KNN modeling, all the predictor variables were converted to numeric, and were scaled and centered. The dataframe structure after the pre-modeling treatment is shown in the table below.
```{r}
results='markup'
#Data Prep
#Scale and center data
cardio_num <- cardio_clean_final_glm
colnames(cardio_num)[13] <- "cardioi"
cardio_num[1:12] <- sapply(cardio_num[1:12],as.numeric)
cardio_scale <- cardio_num
cardio_scale[1:12] <- as.data.frame(scale(cardio_num[1:12], center = TRUE, scale = TRUE))
str(cardio_scale)
```
### Model 1 (All 13 variables)
**Train/Test Split**
The dataframe was split into a training subset (70% of the total dataframe) and a test subset (30% of the total dataframe).
```{r}
set.seed(42)
cardio_sample <- sample(2, nrow(cardio_scale), replace=TRUE, prob=c(0.7, 0.3))
cardio_training_x <- cardio_scale[cardio_sample==1, 1:12]
cardio_test_x <- cardio_scale[cardio_sample==2, 1:12]
cardio_training_y <- cardio_scale[cardio_sample==1, 13]
cardio_test_y <- cardio_scale[cardio_sample==2, 13]
```
**Finding the Best K-Value**
Odd numbers from 3 to 40 were tested one by one to get the best K value. The accuracy values from the model that used these K values are shown in the line graph.
```{r}
loadPkg("FNN")
loadPkg("gmodels")
loadPkg("caret")
K_test <- seq(3, 40, by = 2)
ResultDf = data.frame()
for (k_v in K_test) {
cardio_pred <- knn(train = cardio_training_x, test = cardio_test_x, cl=cardio_training_y, k=k_v)
cm = confusionMatrix(cardio_pred, reference = cardio_test_y )
cmaccu = cm$overall['Accuracy']
cmt = data.frame(k=k_v, Total.Accuracy = cmaccu, row.names = NULL )
ResultDf = rbind(ResultDf, cmt)
}
```
```{r}
xkabledply(ResultDf, "Total Accuracy Summary for KNN Model 1")
```
```{r}
library(ggplot2)
ggplot(ResultDf, aes(x=k, y=Total.Accuracy)) +
geom_line(color = "skyblue", size = 1.5) +
geom_point(size = 3, color = "blue") +
labs(title = "Total Accuracy vs K (Model 1)",
x = "K", y = "Total Accuracy") +
theme_minimal()
```
According the results, K = 19 is the best K value for Model 1.
**Modeling Results with the Best K Value (K = 19)**
Detailed results and the confusion matrix using K=19 were shown below.
```{r}
#results='markup'
cardio_pred_19 <- knn(train = cardio_training_x, test = cardio_test_x, cl=cardio_training_y, k=19)
KNN_Mode_1 = confusionMatrix(cardio_pred_19, reference = cardio_test_y)
xkabledply(as.matrix(KNN_Mode_1), title = paste("Confusion Matrix for KNN Model 1") )
xkabledply(data.frame(KNN_Mode_1$overall['Accuracy']), title=paste("Total Accuracy of KNN Model 1"))
xkabledply(data.frame(KNN_Mode_1$byClass), title=paste("Metrics of KNN Model 1"))
```
The overall performance is acceptable, but we still want to see whether removing `gender` and `bmi` will improve the modeling.
### Model 2 (Excluding Gender & BMI)
**Train / Test Split**
The dataframe was split into a training subset (70% of the total dataframe) and a testing subset (30% of the total dataframe). The variables `gender` and `bmi` were excluded from the train/test split.
```{r}
cardio_scale2 <- cardio_scale[c(1, 3:11, 13)]
cardio_scale2
set.seed(42)
cardio_sample2 <- sample(2, nrow(cardio_scale2), replace=TRUE, prob=c(0.7, 0.3))
cardio_training_x2 <- cardio_scale2[cardio_sample2==1, 1:10]
cardio_test_x2 <- cardio_scale2[cardio_sample2==2, 1:10]
cardio_training_y2 <- cardio_scale2[cardio_sample2==1, 11]
cardio_test_y2 <- cardio_scale2[cardio_sample2==2, 11]
```
**Finding the Best K-Value**
Odd numbers from 3 to 40 were used for the test. The accuracy values from the model that used these K values are shown in the table and the line graph. According the results, K = 29 is the best K value for the KNN modeling without `gender` and `bmi`.
```{r}
K_test <- seq(3, 40, by = 2)
ResultDf2 = data.frame()
for (k_v in K_test) {
cardio_pred2 <- knn(train = cardio_training_x2, test = cardio_test_x2, cl=cardio_training_y2, k=k_v)
cm2 = confusionMatrix(cardio_pred2, reference = cardio_test_y2 )
cmaccu2 = cm2$overall['Accuracy']
cmt2 = data.frame(k=k_v, Total.Accuracy = cmaccu2, row.names = NULL )
ResultDf2 = rbind(ResultDf2, cmt2)
}
```
```{r results='markup'}
xkabledply(ResultDf2, "Total Accuracy Summary for KNN Model 2")
```
```{r}
library(ggplot2)
ggplot(ResultDf2, aes(x=k, y=Total.Accuracy)) +
geom_line(color = "skyblue", size = 1.5) +
geom_point(size = 3, color = "blue") +
labs(title = "Total.Accuracy vs K (Model 2)")
```
**Modeling Results with the Best K Value (K = 29)**
Detailed results and the confusion matrix of model 2 using k = 29 were shown below. Removing `gender` and `bmi` slightly improved the performance.
```{r results='markup'}
cardio_pred_29 <- knn(train = cardio_training_x2, test = cardio_test_x2, cl=cardio_training_y2, k=29)
KNN_Mode_2 = confusionMatrix(cardio_pred_29, reference = cardio_test_y2)
xkabledply(as.matrix(KNN_Mode_2), title = paste("Confusion Matrix for KNN Model 2") )
xkabledply(data.frame(KNN_Mode_2$overall['Accuracy']), title=paste("Total Accuracy of KNN Model 2"))
xkabledply(data.frame(KNN_Mode_2$byClass), title=paste("Metrics of KNN Model 2"))
```
Next, we want to check whether only using the 5 variables provided by the correlation matrix will make a better model.
### Model 3 (Selected Variables)
This KNN model only use `age`, `cholesterol`, `bmi`, `ap_hi` and `ap_lo`.
**Train / Test Split**
The dataframe was split into training subset (70% of the total dataframe) and test subset (30% of the total dataframe). Only `age`, `cholesterol`, `bmi`, `ap_hi` and `ap_lo` were used for the modeling.
```{r}
cardio_scale3 <- cardio_scale[c(1, 5:7, 12:13)]
cardio_scale3
set.seed(42)
cardio_sample3 <- sample(2, nrow(cardio_scale3), replace=TRUE, prob=c(0.7, 0.3))
cardio_training_x3 <- cardio_scale3[cardio_sample3==1, 1:5]
cardio_test_x3 <- cardio_scale3[cardio_sample3==2, 1:5]
cardio_training_y3 <- cardio_scale3[cardio_sample3==1, 6]
cardio_test_y3 <- cardio_scale3[cardio_sample3==2, 6]
```
**Looking for the best K value**
Odd numbers from 3 to 40 were used for the test. The accuracy values from the model that used these K values were shown in the table and the line graph. According the results, K = 31 is the best K value for the KNN modeling.
```{r}
K_test <- seq(3, 40, by = 2)
ResultDf3 = data.frame()
for (k_v in K_test) {
cardio_pred3 <- knn(train = cardio_training_x3, test = cardio_test_x3, cl=cardio_training_y3, k=k_v)
cm3 = confusionMatrix(cardio_pred3, reference = cardio_test_y3 )
cmaccu3 = cm3$overall['Accuracy']
cmt3 = data.frame(k=k_v, Total.Accuracy = cmaccu3, row.names = NULL )
ResultDf3 = rbind(ResultDf3, cmt3)
}
```
```{r results='markup'}
xkabledply(ResultDf3, "Total Accuracy Summary")
```
```{r}
library(ggplot2)
ggplot(ResultDf3, aes(x=k, y=Total.Accuracy)) +
geom_line(color = "skyblue", size = 1.5) +
geom_point(size = 3, color = "blue") +
labs(title = "Total.Accuracy vs K")
```
**Modeling results with the best K value (Model 3, K = 31)**
The confusion matrix of model 3 using k = 31 were shown below.
```{r results='markup'}
cardio_pred_31 <- knn(train = cardio_training_x3, test = cardio_test_x3, cl=cardio_training_y3, k=31)
KNN_Mode_3 = confusionMatrix(cardio_pred_31, reference = cardio_test_y3 )
xkabledply(as.matrix(KNN_Mode_3), title = paste("Confusion Matrix for KNN Model 3") )
```
```{r results='markup'}
xkabledply(data.frame(KNN_Mode_1$overall['Accuracy'], KNN_Mode_2$overall['Accuracy'], KNN_Mode_3$overall['Accuracy']), title=paste("Total Accuracy Comparision of the Three KNN Models"))
xkabledply(data.frame(KNN_Mode_1$byClass, KNN_Mode_2$byClass, KNN_Mode_3$byClass), title=paste("Metrics Comparision of KNN the Three KNN Models"))
```
The total accuracy and other metrics of the three models were listed in the above tables. For summary, all three models have an acceptable performance, but not outstanding. More specifically, model 2 has a overall better performance comparing to model 1. But model 2 and model 3 have their own strength and weaknesses. The accuracy and recall value of model 2 are higher than model 3, while the precision value is higher in model 3. Recall value is more related to false negative rate and the precision value is used to represent to the false positive rate. It's hard to say which model is the best one, but we should choose the proper model according to our aims. If minimizing false positives is our focus, model 3 might be a good option, and model 2 is more suitable is we want to minimize false negative rate.
One of the limitation of the KNN modeling is this model has difficulties in identifying feature importance. So from here we can't answer the question that which are the top 3 factors that affect the onset of cardiovascular diseases. Our following modeling using logistic regression and classification tree will answer the question.
## Logistic Regression Model
The second type of model we used for the disease prediction is logistic regression. Logistic regression models, also called logit models, are used to model dichotomous outcome variables. In a logit model the log odds of the outcome are modeled as a linear combination of the predictor variables.
The data set has 13 variables. In contrast to KNN modeling, here we used logistic regression to remove variables one by one. The alpha value was set equal to 0.1. This means that p-values greater than 0.1 will be removed.
For consistency between models used in this study we used 70% of data set as training data set, and the remaining 30% as testing data set. The seed point was set to 42.
In this study we built 4 logistic regression models based on the variables that have significant affect on the prediction.
```{r}
#load dataset
data <- cardio_clean_final
#view summary of dataset
summary(data)
#find total observations in dataset
nrow(data)
```
```{r}
#Create Training and Test Samples
#make this example reproducible
set.seed(42)
#Use 70% of dataset as training set and remaining 30% as testing set
sample <- sample(c(TRUE, FALSE), nrow(data), replace=TRUE, prob=c(0.7 , 0.3))
train <- data[sample, ]
test <- data[!sample, ]
#view train summary
summary(train)
```
### Modeling
**Logistic Regression Model 1**
For the 1st model we use all 13 variables.
```{r results='markup'}
#Fit the logistic regression model including all variables
logit_reg_model_ALL <- glm(cardio ~ gender + age + bmi + height + weight + ap_hi + ap_lo + cholesterol + gluc + smoke + alco + active, family="binomial", data=cardio_clean_final)
#disable scientific notation for model summary
options(scipen = 999)
#view model summary
#summary(logit_reg_model_ALL)
xkabledply(logit_reg_model_ALL, title = "logit_reg_model_ALL")
```
According to the first model, `gender`, `bmi`, `height`, `weight` and `gluc` do not have significant affect.
**Logistic Regression Model 2**
For the second model we included all variables but gender.
```{r results='markup'}
#Fit the logistic regression model excluding gender
logit_reg_model_wogender <- glm(cardio ~ age + bmi + height + weight + ap_hi + ap_lo + cholesterol + gluc + smoke + alco + active, family="binomial", data=cardio_clean_final)
#disable scientific notation for model summary
options(scipen = 999)
#view model summary
#summary(logit_reg_model_wogender)
xkabledply(logit_reg_model_wogender, title = "logit_reg_model_wogender")
```
The results showed that the bmi, height, weight and gluc are still remaining not detrimental for the disease prediction.
**Logistic Regression Model 3**
Next we excluded gender and bmi to build the third model.
```{r results='markup'}
#Fit the logistic regression model excluding gender and bmi
logit_reg_model_wogenderbmi <- glm(cardio ~ age + height + weight + ap_hi + ap_lo + cholesterol + gluc + smoke + alco + active, family="binomial", data=cardio_clean_final)
#disable scientific notation for model summary
options(scipen = 999)
#view model summary
#summary(logit_reg_model_wogenderbmi)
xkabledply(logit_reg_model_wogenderbmi, title = "logit_reg_model_wogenderbmi")
```
Interestingly, when we removed the `bmi` variable from the model, `height` and `weight` started to show significant impact on disease prediction. However, gluc2 is still not significant.
**Logistic Regression Model 4**
For the last model we excluded the `gluc` variable from our dataset.
```{r results='markup'}
#Remove gender, bmi and gluc from the modeling because they do not any effect on cardio
logit_reg_model_cleaned <- glm(cardio ~ age + height + weight + ap_hi + ap_lo + cholesterol + smoke + alco + active, family="binomial", data=cardio_clean_final)
#disable scientific notation for model summary
options(scipen = 999)
#view model summary
#summary(logit_reg_model_cleaned)
xkabledply(logit_reg_model_cleaned, title = "logit_reg_model_cleaned")
```
As we can see from the above presented outcome of the Model 4, all remaining 10 variables have an significant effect on disease prediction. Therefore for the further evaluation we will use Model 4.
### Evaluation
**Coefficient Exponential**
To confirm which variable have positive or negative impact on disease prediction we calculated the coefficient exponential for all variables from Model 4.
```{r exp, results=T}
expcoeff1 = exp(coef(logit_reg_model_cleaned))
#summary(expcoeff1)
xkabledply( as.table(expcoeff1), title = "Exponential of coefficients in cardio" )
```
For exponential of coefficients, there are five variables having positive effect on dependent variables `cardio`, and four having negative effect. For example, if person smoke, they will have more possibility to develop cardiovascular diseases.