forked from HimesGroup/BMIN503_Final_Project
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathBMIN503 Final Project OM.qmd
975 lines (790 loc) · 38.4 KB
/
BMIN503 Final Project OM.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
---
title: "Evaluating Family Engagement and Disparities in Health Information Access through Patient Portal Use within a NICU"
subtitle: "BMIN503 Final Project"
author: "Osvaldo Mercado"
date: "`r Sys.Date()`"
format:
html:
fig-width: 8
fig-height: 6
toc: true
toc-location: left
css: "https://github.research.chop.edu/pages/CQI/chop-bootstrap/bootstrap-3/bootstrap.min.css"
editor: visual
embed-resources: true
execute:
warning: false
message: false
---
```{=html}
<style>
#TOC {
background-color: #f9f9f9; /* Light gray background */
border: 1px solid #ddd; /* Add a subtle border */
padding: 10px; /* Add padding inside the box */
border-radius: 5px; /* Rounded corners */
box-shadow: 2px 2px 5px rgba(0, 0, 0, 0.1); /* Optional shadow for depth */
}
</style>
```
```{r setup, include=FALSE}
knitr::opts_knit$set(root.dir = "C:/Users/mercadoo/OneDrive - Children's Hospital of Philadelphia/Informatics/Penn Informatics Courses/BMIN 503")
```
# Overview
This project aims to analyze how families in the neonatal intensive care unit (NICU) engage with their infant's patient portal or MyCHOP accounts (an online electronic health record) to access various aspects of their child’s care. The project will examine patterns of patient portal use, including frequency, types of information accessed, and timing, with a focus on understanding how these interactions vary across demographics and socioeconomic factors.
# Introduction
Families with infants in the NICU experience high levels of stress and complexity in navigating their child’s healthcare. Access to MyCHOP, an online portal for EHR access, provides a valuable resource for families to view updates on their child’s treatment, review provider notes, and stay informed about care decisions (1). However, disparities may exist in how effectively families from different backgrounds can use this tool to stay informed and engaged. Understanding how families access their MyCHOP accounts, what types of data they prioritize, and when they engage with the portal could reveal insights into potential barriers to access or disparities in engagement. These insights are especially crucial in the NICU setting, where clear communication and access to information are critical for family-centered care. Previous studies have shown disparities in patient portal access within the outpatient setting, specifically lower activation rates for non-white and non-English preferred families (2). This study will analyze usage patterns among families, considering factors such as race, ethnicity, patient characteristics (such as length of stay, gestational age, weight), to identify trends and inform interventions that could enhance equitable access to health information within a NICU.
The U.S. government first mandated patient access to medical information with the Health Insurance Portability and Accountability Act (HIPAA) in 1996, which gave patients the right to access their records.(3) A major update came with the 21st Century Cures Act, signed into law in 2016, which enhanced data interoperability and expanded electronic access to health records.(4) The most recent revision, effective April 5, 2021, under the Cures Act, mandates that eight categories of clinical notes in electronic health records (EHRs) be immediately available to patients via a secure online portal, prohibiting healthcare providers from blocking or delaying access to this information.(5)
The Children's Hospital of Philadelphia (CHOP) has provided access to MyCHOP for many years. During the pandemic, there was a significant effort to enhance the delivery of online information to families. As a result, MyCHOP utilization and access saw a substantial increase post-pandemic. This project delves deeper into the characteristics and trends of MyCHOP usage for CHOP NICU patients.
# Methods
The dataset for this project was sourced from the Children's Hospital of Philadelphia (CHOP) Clinical Data Warehouse and Chronicles database. It was retrieved from tables stored within both Clarity and the Trusted Data Layer. As this dataset is not publicly accessible, the data tables cannot be shared. This project serves as an exploratory analysis of the variables described earlier. The methods section begins with loading the required packages for this analysis.
## Install packages
```{r}
#| eval: false
install.packages("readr")
install.packages("tidyverse")
install.packages("ggplot2")
install.packages("lubridate")
install.packages("gtsummary")
install.packages("dplyr")
install.packages("forcats")
install.packages("tinymodels")
install.packages("ggwordcloud")
install.packages("vip")
install.packages("dotwhisker")
install.packages("randomforest")
install.packages("glmnet")
install.packages("xgboost")
```
## Load packages
```{r}
library(readr) # Importing data
library(tidyverse) # Data manipulation and analysis
library(ggplot2) # Creating visualizations
library(lubridate) # Working with dates
library(gtsummary) # Summary statistics and tables
library(dplyr) # Data cleaning and filtering
library(forcats) # Handling factor variables
library(tidymodels) # Building and tuning models
library(RColorBrewer) # Additional color options
library(ggwordcloud) # Creating word clouds
library(vip) # Visualizing variable importance
library(dotwhisker) # Coefficient plots
library(randomForest) # Random forest modeling
library(glmnet) # Regression modeling
library(xgboost) # Gradient boosting for machine learning
library(kernlab) # Kernel-based machine learning
```
## Upload data
```{r}
#Upload necessary tables
patient_data <- read.csv(file.choose())
access_data <- read.csv(file.choose())
epic_lookup_table <- read.csv(file.choose())
```
## Exploration and Manipulation of Patient Data Table
```{r}
# Converting various column dates from "character" to "date + time" to begin filtering data
patient_data <- patient_data |>
mutate(
DOB = as.POSIXct(DOB, format = "%Y-%m-%d %H:%M:%S"),
HOSPITAL_ADMIT_DATE = as.POSIXct(HOSPITAL_ADMIT_DATE, format = "%Y-%m-%d %H:%M:%S"),
HOSPITAL_DISCHARGE_DATE = as.POSIXct(HOSPITAL_DISCHARGE_DATE, format = "%Y-%m-%d %H:%M:%S"),
EPISODE_START_DATE = as.POSIXct(EPISODE_START_DATE, format = "%Y-%m-%d %H:%M:%S"),
EPISODE_END_DATE = as.POSIXct(EPISODE_END_DATE, format = "%Y-%m-%d %H:%M:%S")
)
# Count total unique IDs, gives unique patients, prior to September 30, 2024 (available access data)
patient_data <- patient_data |>
filter(HOSPITAL_ADMIT_DATE < "2024-09-30")
total_unique <- length(unique(patient_data$MRN))
# Count occurrences of repeated IDs, gives sum of repeat patient encounters
repeated <- sum(table(patient_data$MRN) > 1)
# Print results
cat("Total unique IDs:", total_unique, "\nNumber of repeated IDs:", repeated, "\n")
```
```{r}
# Sort the data by MRN and NICU episode admission date
patient_data_ordered <- patient_data[order(patient_data$MRN, patient_data$EPISODE_START_DATE), ]
# Keep only the first occurrence of each MRN
initial_admissions <- patient_data_ordered[!duplicated(patient_data_ordered$MRN), ]
# Count total records
total_records_initial <- nrow(initial_admissions)
# Display the total count
cat("Total records in the intial admission table:", total_records_initial, "\n") #Equivalent count to unique ID's above
```
```{r}
# Calculate total gestational age in TOTAL days: combination of two columns
initial_admissions$GESTATIONAL_AGE_TOTAL_DAYS <-
initial_admissions$GESTATIONAL_AGE_COMPLETE_WEEKS * 7 + #multiply by 7 days in week
initial_admissions$GESTATIONAL_AGE_REMAINDER_DAYS
# Calculation of average GA in TOTAL days
average_GA_days <- mean(na.omit(initial_admissions$GESTATIONAL_AGE_TOTAL_DAYS))
# Convert the average back to weeks and days for interpretation
average_weeks <- floor(average_GA_days / 7) #Using floor to round down weeks
average_days <- round(average_GA_days %% 7, 1) #Using %% to obtain remainder when dividing by 7
# Combine the results
cat("Average Gestational Age:", average_weeks, "weeks and", average_days, "days\n")
```
To analyze the stored data effectively, the data table needs to be properly converted and formatted. Below are steps to prepare the data so it can be summarized and manipulated correctly using factors.
Sex Column Exploration and Conversion
```{r}
# Additional analysis and conversion of data stored: SEX
unique(initial_admissions$SEX) # Stored as M, F, U
initial_admissions <- initial_admissions |> # Remove limited 3 U
dplyr::filter(SEX != "U")
# Conversion to factor
initial_admissions$SEX <- as.factor(initial_admissions$SEX)
initial_admissions$SEX <- droplevels(initial_admissions$SEX)
table(initial_admissions$SEX)
```
Race Column Exploration and Conversion
```{r}
# Additional analysis and conversion of data stored: RACE
## Will combine/group certain variables for clarity and simplicity
unique(initial_admissions$RACE)
initial_admissions <- initial_admissions |>
dplyr::mutate(RACE = case_when(
RACE %in% c("Refused", "Choose not to disclose") ~ "Refused",
RACE %in% c("Unknown", "Asked but unknown", "") ~ "Unknown",
is.na(RACE) ~ "Unknown",
TRUE ~ RACE
)
)
# Conversion to factor
initial_admissions$RACE <- as.factor(initial_admissions$RACE)
initial_admissions$RACE <- fct_infreq(initial_admissions$RACE) #rearrange
table(initial_admissions$RACE)
```
Ethnicity Column Exploration and Conversion
```{r}
# Additional analysis and conversion of data stored: ETHNICITY
## Will combine/group certain variables for clarity and simplicity
unique(initial_admissions$ETHNICITY)
initial_admissions <- initial_admissions |>
dplyr::mutate(ETHNICITY = case_when(
ETHNICITY %in% c("Refused", "Choose Not To Disclose") ~ "Refused",
ETHNICITY %in% c("Non-hispanic Or Non-latino") ~ "Non-Hispanic or Non-Latino",
ETHNICITY %in% c("Hispanic Or Latino") ~ "Hispanic or Latino",
ETHNICITY %in% c("Asked But Unknown", "", NA) ~ "Unknown", # Include NA and blank spaces as Unknown
TRUE ~ ETHNICITY
)
)
# Conversion to factor
initial_admissions$ETHNICITY <- as.factor(initial_admissions$ETHNICITY)
initial_admissions$ETHNICITY <- fct_infreq(initial_admissions$ETHNICITY)
table(initial_admissions$ETHNICITY)
```
Preferred Language Column Exploration and Conversion
```{r}
# Additional analysis and conversion of data stored: PREFERRED LANGUAGE
## Will combine/group certain variables for clarity and simplicity
unique(initial_admissions$PREFERRED_LANGUAGE)
# Group languages with fewer than 15 occurrences as "Other" and keep "Unknown" and NULL seperate
initial_admissions <- initial_admissions |>
dplyr::mutate(PREFERRED_LANGUAGE = case_when(
PREFERRED_LANGUAGE %in% names(which(table(PREFERRED_LANGUAGE) < 10)) ~ "Other", # Group languages with <10 occurrences
PREFERRED_LANGUAGE %in% c("", NA) ~ "Unknown", # Group blank and NA values into Unknown
PREFERRED_LANGUAGE == "Other" ~ "Other", # Ensure "Other" stays as "Other"
TRUE ~ str_to_title(PREFERRED_LANGUAGE) # Capitalize appropriately
)
)
# Conversion to factor
initial_admissions$PREFERRED_LANGUAGE <- as.factor(initial_admissions$PREFERRED_LANGUAGE)
initial_admissions$PREFERRED_LANGUAGE <- fct_infreq(initial_admissions$PREFERRED_LANGUAGE)
table(initial_admissions$PREFERRED_LANGUAGE)
```
Payor Group Column Exploration and Conversion
```{r}
# Additional analysis and conversion of data stored: PAYOR GROUP
unique(initial_admissions$PAYOR_GROUP)
#Recode Medical Assistance as Medicaid
initial_admissions$PAYOR_GROUP <-
forcats::fct_recode(initial_admissions$PAYOR_GROUP, "Medicaid" = "MEDICAL ASSISTANCE")
initial_admissions <- initial_admissions |>
dplyr::mutate(PAYOR_GROUP = case_when(
PAYOR_GROUP %in% c("COMMERCIAL") ~ "Commercial", # Group Commercial
PAYOR_GROUP %in% c("CHARITY CARE") ~ "Other", # Group Charity Care into Other
PAYOR_GROUP %in% c("OTHER") ~ "Other", # Ensure Other remains Other
PAYOR_GROUP %in% c("", NA) ~ "Unknown", # Group missing or blank values into Unknown
TRUE ~ PAYOR_GROUP # Leave all other values unchanged
))
# Conversion to factor
initial_admissions$PAYOR_GROUP <- as.factor(initial_admissions$PAYOR_GROUP)
initial_admissions$PAYOR_GROUP <- fct_infreq(initial_admissions$PAYOR_GROUP)
table(initial_admissions$PAYOR_GROUP)
```
Additional data manipulation (include dates, MyCHOP activation, MyCHOP declined, inborn)
```{r}
# Mutating access table so that access time (UA_time) can be more readily used
access_data <- access_data |>
mutate(UA_TIME = substr(UA_TIME, 1, 19)) |>
mutate(UA_TIME = as.POSIXct(UA_TIME, format = "%Y-%m-%d %H:%M:%S"))
# Exploration of MyCHOP Columns - Noting that this is MyCHOP activation at ANY point (can occur after discharge which is not part of our study goal)
# Count rows where MYCHOP_ACTIVATION_IND is 1
activation_count <- sum(initial_admissions$MYCHOP_ACTIVATION_IND == 1, na.rm = TRUE)
# Count rows where MYCHOP_DECLINED_IND is 1
declined_count <- sum(initial_admissions$MYCHOP_DECLINED_IND == 1, na.rm = TRUE)
# Display the results
cat("Number of activations (MYCHOP_ACTIVATION_IND = 1):", activation_count,
"\nNumber of declines (MYCHOP_DECLINED_IND = 1):", declined_count, "\n")
# Count the number of inborn patients (inborn_IND = 1) - patients born inside CHOP and not transferred or admitted from ED / general floors
inborn_count <- sum(initial_admissions$INBORN_IND == 1, na.rm = TRUE)
cat("Number of inborn patients:", inborn_count, "out of", total_unique, "\n")
```
Exploration of LOS column
```{r}
# LOS exploration
summary(patient_data$NICU_LOS_DAYS)
# Exploration of the null values demonstrated that these patients were still admitted. These patients were removed from our datasets
patient_data <- patient_data |>
filter(!is.na(NICU_LOS_DAYS))
initial_admissions <- initial_admissions |>
filter(!is.na(NICU_LOS_DAYS))
# Exploring the lowest LOS patients. Ultimately removed if admission was less than 4 hours
lowest_los_patients <- patient_data |>
filter(NICU_LOS_DAYS < .17) # Filtering patients less than 4 hours >> 46 patients
patient_data <- patient_data |>
filter(NICU_LOS_DAYS > 0.17)
initial_admissions <- initial_admissions |>
filter(NICU_LOS_DAYS > 0.17)
```
Exploration of birth weight column
```{r}
summary(initial_admissions$BIRTH_WEIGHT_GRAMS)
# After manual chart review two lowest are incorrect values. Removed from data set.
BW_350 <- patient_data |>
filter(BIRTH_WEIGHT_GRAMS < 350)
patient_data <- patient_data |>
filter(is.na(BIRTH_WEIGHT_GRAMS) | BIRTH_WEIGHT_GRAMS > 350)
initial_admissions <- initial_admissions |>
filter(is.na(BIRTH_WEIGHT_GRAMS) | BIRTH_WEIGHT_GRAMS > 350)
```
## Exploration and Manipulation of Access Table
```{r}
#When I initially queried the access information table for my included patients, it contained 37 million rows of information. After some data cleaning and in the interest of simpliying my data utilization in RStudio, I filtered specifically for pta.ua_time BETWEEN fne.episode_start_date AND fne.episode_end_date to make sure that the results only occured while a patient was admitted during my time window.
# Count total unique rows, gives total number of messages provided in filtered data set
total_unique_access <- nrow(access_data)
# Count total unique patients, gives total number of patients who had their MyChart accessed during NICU admission
total_unique_access_pat <- length(unique(access_data$MRN))
# Display results
cat("Total unique accesses:", total_unique_access, "\nNumber of unique patients:", total_unique_access_pat, "\n")
```
Filtering table to only include patient access to MyChart (remove ShareEveryhere and External_Client)
```{r}
# Table to show counts of different individuals accessing MyChart
table(access_data$UA_PERSON_TYPE_C)
#Filtering for only patient and user (proxy)
access_data <- access_data |>
dplyr::filter(access_data$UA_PERSON_TYPE_C %in% c(1, 2))
```
Utilization of a lookup table to understand the numeric values of MYC_UA_TYPE_C. This column uses a numeric value which I then associated with the lookup table I created from the Epic User Web to understand the action being completed.
```{r}
#Visualizing the top 30 actions of a total 466 variables
head(epic_lookup_table, n = 30)
# Join `access_data` with `epic_lookup_table`
access_data <- access_data |>
left_join(epic_lookup_table, by = c("MYC_UA_TYPE_C" = "Code")) |>
rename(Activity_Description = Description)
```
In exploring the data and process mapping the steps involved in accessing MyChart, we identified that meaningful use of MyChart extends beyond initial setup and login activity. To ensure we capture meaningful engagement, we defined 'meaningful access' as instances where a patient had three or more actions in MyChart.
```{r}
# Filtering patients with less than 3
patients_less_than_3 <- access_data |>
group_by(MRN) |>
summarise(access_count = n()) |>
filter(access_count < 3)
# Bring along activity descriptions by filtering the original dataset
patients_less_than_3_activities <- access_data |>
semi_join(patients_less_than_3, by = "MRN") |> # Filter to keep only patients with less than 5 accesses
left_join(epic_lookup_table, by = c("MYC_UA_TYPE_C" = "Code")) # Join to get activity descriptions
# Summarize
summary_activities <- patients_less_than_3_activities |>
group_by(Activity_Description) |>
summarise(
Num_Patients = n_distinct(MRN),
Total_Accesses = n()
) |>
arrange(desc(Total_Accesses))
# Display results
summary_activities
```
Addition of a column into the initial admissions table to account for whether or not a patient accessed their MyChart account DURING their hospital stay
```{r}
# Identify patients who accessed their MyChart DURING their initial NICU admission
accessed_mychart <- access_data |>
inner_join(initial_admissions, by = "MRN") |>
filter(UA_TIME >= EPISODE_START_DATE & UA_TIME <= EPISODE_END_DATE) |>
select(MRN) |>
distinct()
# Remove MRNs with less than 3 accesses from accessed_mychart
accessed_mychart_filtered <- accessed_mychart |>
filter(!MRN %in% patients_less_than_3$MRN)
# Add MyChart_Access column to initial_admissions
initial_admissions <- initial_admissions |>
mutate(MyChart_Access = ifelse(MRN %in% accessed_mychart_filtered$MRN, 1, 0))
# Conversion to factor
initial_admissions$MyChart_Access <- factor(initial_admissions$MyChart_Access,
levels = c(0, 1),
labels = c("No Access", "Accessed"))
# Total number of patients who accessed MyChart meaningfully
patients_accessed_mychart <- sum(initial_admissions$MyChart_Access == "Accessed", na.rm = TRUE)
# Update total records initial after filtering tables
total_records_initial <- nrow(initial_admissions)
# Display results
cat("Total number of patients who accessed MyChart:", patients_accessed_mychart, "\nNumber of patients admitted:", total_records_initial, "\n")
```
# Results
## NICU Patient Characteristics
The initial part of this section contains a demographics and characteristics table for patients admitted to the NICU from January 2018 to September 2024. Displayed below is the summary table for the overall patient data set split by access to MyChart.
```{r}
# Split initial_admissions into two groups
accessed_mychart_plot <- initial_admissions |> filter(MyChart_Access == "Accessed")
not_accessed_mychart_plot <- initial_admissions |> filter(MyChart_Access == "No Access")
# Create summary tables for each group
summary_accessed <- accessed_mychart_plot |>
tbl_summary(
include = c(SEX, RACE, ETHNICITY, PREFERRED_LANGUAGE, PAYOR_GROUP,
GESTATIONAL_AGE_TOTAL_DAYS, BIRTH_WEIGHT_GRAMS, NICU_LOS_DAYS,
INBORN_IND),
statistic = list(
all_continuous() ~ "{median} ({p25}, {p75})",
all_categorical() ~ "{n} ({p}%)",
INBORN_IND ~ "{n}/{N} ({p}%)"
),
label = list(
SEX ~ "Gender",
RACE ~ "Race",
ETHNICITY ~ "Ethnicity",
PREFERRED_LANGUAGE ~ "Preferred Language",
PAYOR_GROUP ~ "Insurance",
GESTATIONAL_AGE_TOTAL_DAYS ~ "Gestational Age (days)",
BIRTH_WEIGHT_GRAMS ~ "Birth Weight (grams)",
NICU_LOS_DAYS ~ "NICU Length of Stay (days)",
INBORN_IND ~ "Inborn"
),
missing = "no" #Exclude missing values
) %>%
modify_header(label = "**Characteristic**") %>%
modify_header(stat_0 = "**{N} (%)**") %>%
bold_labels() %>%
modify_caption("**Summary Table: Accessed**")
summary_not_accessed <- not_accessed_mychart_plot |>
tbl_summary(
include = c(SEX, RACE, ETHNICITY, PREFERRED_LANGUAGE, PAYOR_GROUP,
GESTATIONAL_AGE_TOTAL_DAYS, BIRTH_WEIGHT_GRAMS, NICU_LOS_DAYS, INBORN_IND),
statistic = list(
all_continuous() ~ "{median} ({p25}, {p75})", # Median (IQR) for continuous variables
all_categorical() ~ "{n} ({p}%)", # Count and percentage for categorical variables
INBORN_IND ~ "{n}/{N} ({p}%)"
),
label = list(
SEX ~ "Gender",
RACE ~ "Race",
ETHNICITY ~ "Ethnicity",
PREFERRED_LANGUAGE ~ "Preferred Language",
PAYOR_GROUP ~ "Insurance",
GESTATIONAL_AGE_TOTAL_DAYS ~ "Gestational Age (days)",
BIRTH_WEIGHT_GRAMS ~ "Birth Weight (grams)",
NICU_LOS_DAYS ~ "NICU Length of Stay (days)",
INBORN_IND ~ "Inborn"
),
missing = "no" #Exclude missing values
) %>%
modify_header(label = "**Characteristic**") %>%
modify_header(stat_0 = "**{N} (%)**") %>%
bold_labels() %>%
modify_caption("**Summary Table: Did Not Access**")
# Combine the two summary tables side-by-side
final_summary <- tbl_merge(
tbls = list(summary_accessed, summary_not_accessed),
tab_spanner = c("**Accessed**", "**Did Not Access**")
)
# Print the final summary table
final_summary
```
```{r}
#Distrubution of Patients by Sex
sex_plot <- ggplot(initial_admissions, aes(x = SEX)) +
geom_bar(fill = "steelblue") +
labs(
title = "Sex",
x = "Sex",
y = "Count"
)
sex_plot
# Distribution of Patients by Race
race_plot <- ggplot(initial_admissions, aes(x = fct_infreq(RACE))) +
geom_bar(fill = "#008080") +
geom_text(
stat = "count", # Use counts as the statistic
aes(label = after_stat(count)), # Display the count on top of bars
vjust = -0.5, # Position the text slightly above the bars
size = 3 # Adjust text size
) +
labs(
title = "Race",
x = "Race",
y = "Count"
) +
theme(
axis.text.x = element_text(angle = 45, hjust = 1) # Rotate x-axis labels
)
race_plot
# Distribution of Patients by Ethnicity
ethnicity_plot <- ggplot(initial_admissions, aes(x = fct_infreq(ETHNICITY))) +
geom_bar(fill = "#DAA520") +
geom_text(
stat = "count",
aes(label = after_stat(count)),
vjust = -0.5,
size = 3
) +
labs(
title = "Ethnicity",
x = "Ethnicity",
y = "Count"
)
ethnicity_plot
# Distribution of Patients by Payor Group
payor_plot <- ggplot(initial_admissions, aes(x = fct_infreq(PAYOR_GROUP))) +
geom_bar(fill = "#FF7F50") +
geom_text(
stat = "count",
aes(label = after_stat(count)),
vjust = -0.5,
size = 3
) +
labs(
title = "Payor Group",
x = "Payor Group",
y = "Count"
)
payor_plot
```
```{r}
# Create violin plots for each continuous variable
# NICU_LOS_Days
ggplot(initial_admissions, aes(x = MyChart_Access, y = NICU_LOS_DAYS, fill = MyChart_Access)) +
geom_violin(trim = FALSE) +
labs(title = "Violin Plot of NICU Length of Stay by MyChart Access",
x = "MyChart Access",
y = "NICU Length of Stay (Days)") +
theme_minimal() +
theme(legend.position = "none")
# BIRTH_WEIGHT_GRAMS
ggplot(initial_admissions, aes(x = MyChart_Access, y = BIRTH_WEIGHT_GRAMS, fill = MyChart_Access)) +
geom_violin(trim = FALSE) +
labs(title = "Violin Plot of Birth Weight by MyChart Access",
x = "MyChart Access",
y = "Birth Weight (Grams)") +
theme_minimal() +
theme(legend.position = "none")
# GESTATIONAL_AGE_TOTAL_DAYS
ggplot(initial_admissions, aes(x = MyChart_Access, y = GESTATIONAL_AGE_TOTAL_DAYS, fill = MyChart_Access)) +
geom_violin(trim = FALSE) +
labs(title = "Violin Plot of Gestational Age by MyChart Access",
x = "MyChart Access",
y = "Gestational Age (Days)") +
theme_minimal() +
theme(legend.position = "none")
```
```{r}
# Graph showing overall admissions to the NICU (includes re-admitted patients to NICU)
# Aggregate data by month
monthly_data_all <- patient_data |>
mutate(month = as.Date(cut(EPISODE_START_DATE, breaks = "month"))) |>
group_by(month) |>
summarise(frequency = n())
# Creating a timeline
ggplot(monthly_data_all, aes(x = month, y = frequency)) +
geom_col(fill = "skyblue", color = "black") +
labs(
title = "Monthly Frequency of Patient Admissions",
x = "Month",
y = "Number of Admissions"
) +
theme(
plot.title = element_text(hjust = 0.5, size = 16, face = "bold"),
axis.title = element_text(size = 14),
axis.text = element_text(size = 12)
)
```
## MyChart Access Data Display
```{r}
# Aggregate monthly totals directly from access_data
monthly_access_data <- access_data |>
mutate(month = as.Date(cut(UA_TIME, breaks = "month"))) |>
group_by(month) |>
summarise(
total_accesses = n() # Total MyChart accesses per month
)
# Create a line plot for total MyChart accesses by month
ggplot(monthly_access_data, aes(x = month, y = total_accesses)) +
geom_line() +
geom_point() +
labs(
title = "Monthly MyChart Accesses",
x = "Month",
y = "Total Accesses"
) +
scale_x_date(date_labels = "%b %Y") # Format x-axis labels as "Month Year"
```
```{r}
#Exploration of MyChart access. First we will filter by MRN to assess how many times MyChart is accessed per MRN
access_summary_per_patient <- access_data |>
group_by(MRN) |>
summarise(access_count = n())
access_summary_per_patient <- access_summary_per_patient |>
arrange(desc(access_count))
head(access_summary_per_patient$access_count, n = 10) #showing some of the most active users having 56k, 45k, 35k, 33k etc
summary(access_summary_per_patient$access_count)
access_summary_per_patient_graph <- access_summary_per_patient %>%
mutate(
capped_access_count = ifelse(access_count > 5000, 5000, access_count)
)
# Plot the data with grouped values
ggplot(access_summary_per_patient_graph, aes(x = capped_access_count)) +
geom_histogram(bins = 100, fill = "blue", color = "black") +
scale_x_continuous(
breaks = seq(0, 5000, by = 500),
labels = c(seq(0, 9000, by = 1000), ">10k")
) +
labs(
title = "Distribution of MyChart Access",
x = "Number of MyChart Accesses Grouped",
y = "Number of patients"
)
```
```{r}
# Exploration of what people are accessing most frequently accessing
# Count the frequency of codes in access_data
code_frequencies <- access_data %>%
group_by(MYC_UA_TYPE_C) %>%
summarise(Count = n()) %>%
arrange(desc(Count)) # Sort by descending order of frequency
# Join with the epic_lookup_table to get descriptions
code_with_description <- code_frequencies %>%
left_join(epic_lookup_table, by = c("MYC_UA_TYPE_C" = "Code"))
# View the result (top 50 most frequent actions)
top_actions <- code_with_description |>
select(MYC_UA_TYPE_C, Description, Count) |>
head(50)
top_actions
# Creation of word cloud
ggplot(top_actions, aes(label = Description, size = Count, color = Count)) +
geom_text_wordcloud() +
scale_size_area(max_size = 15) + # Adjust the maximum size of the words
scale_color_gradient(low = "red", high = "blue") +
theme_minimal() +
labs(
title = "Word Cloud of MyChart Actions"
)
```
```{r}
# Exploration of time of day most often accessed
# Group access by hour
access_by_hour <- access_data |>
mutate(hour = hour(UA_TIME)) |> # Extract the hour from UA_TIME
group_by(hour) |>
summarize(access_count = n(), .groups = "drop")
ggplot(access_by_hour, aes(x = hour, y = access_count)) +
geom_line(color = "blue") +
geom_point(color = "red") +
labs(
title = "MyChart Access Frequency by Hour",
x = "Hour of the Day (0-23)",
y = "Access Count"
) +
scale_x_continuous(breaks = 0:23) + # Ensure all hours are displayed
scale_y_continuous(labels = comma)
```
## Predictive Modeling
Below are a few predictive models that I created based on the datasets I created. I recognize the fair amount of missing variables found within certain characterisitcs such as Race, Ethnicity, Language, etc. I am hopeful to improve my datasets through future iterative data exploration within CHOP's data warehouse.
```{r}
## Create the primary table that I will use for various models
primary_model <- initial_admissions %>%
select(
SEX,
RACE,
ETHNICITY,
PREFERRED_LANGUAGE,
PAYOR_GROUP,
BIRTH_WEIGHT_GRAMS,
EPISODE_START_DATE,
NICU_LOS_DAYS,
INBORN_IND,
GESTATIONAL_AGE_TOTAL_DAYS,
MyChart_Access
)
# Converting date to time periods and factor we can use
primary_model$date_period <- dplyr::case_when(
primary_model$EPISODE_START_DATE < as.Date("2020-01-01") ~ "01-2018 to 12-2019",
primary_model$EPISODE_START_DATE >= as.Date("2020-01-01") & primary_model$EPISODE_START_DATE < as.Date("2022-01-01") ~ "01-2020 to 12-2021",
primary_model$EPISODE_START_DATE >= as.Date("2022-01-01") ~ "01-2022 to 09-2024"
)
# Convert outcome variables to factor
primary_model$INBORN_IND <- as.factor(primary_model$INBORN_IND)
primary_model$date_period <- as.factor(primary_model$date_period)
```
```{r}
## Logistic regression model
# Set the engine for logistic regression
logistic_model <-
logistic_reg() |>
set_engine("glm")
# Fit the model
logistic_fit <-
logistic_model |>
fit(MyChart_Access ~ SEX + RACE + ETHNICITY + PREFERRED_LANGUAGE + PAYOR_GROUP +
BIRTH_WEIGHT_GRAMS + NICU_LOS_DAYS + INBORN_IND + GESTATIONAL_AGE_TOTAL_DAYS + date_period
, data = primary_model)
# Visualize information
tidy(logistic_fit)
# Creation of whisker plot fit coefficients
tidy(logistic_fit) |>
dwplot(dot_args = list(size = 2, color = "black"),
whisker_args = list(color = "black"),
vline = geom_vline(xintercept = 0, colour = "grey50", linetype = 2)) +
theme_bw()
#
logistic_training_pred <- bind_cols(
truth = primary_model$MyChart_Access,
predict(logistic_fit, primary_model, type = "class"),
predict(logistic_fit, primary_model, type = "prob")
)
logistic_training_pred
# Plot of ROC curve
logistic_training_pred |>
roc_curve(truth, ".pred_No Access") |>
ggplot(aes(x = 1 - specificity, y = sensitivity)) +
geom_path(color = "darkblue") +
geom_abline(color = "darkgrey") +
coord_equal() +
theme_bw()
# Area under the ROC curve
roc_auc(logistic_training_pred,
truth,
".pred_No Access")
```
```{r}
## 10-Fold Cross Validation and use of a Logistic Regression Model
set.seed(1234)
pmodel_folds <- vfold_cv(primary_model, v = 10)
pmodel_folds
# Create a workflow() for fitting the glm
glm_wf <- workflow() |>
add_model(logistic_model) |>
add_formula(MyChart_Access ~ SEX + RACE + ETHNICITY + PREFERRED_LANGUAGE + PAYOR_GROUP +
BIRTH_WEIGHT_GRAMS + NICU_LOS_DAYS + INBORN_IND + GESTATIONAL_AGE_TOTAL_DAYS + date_period)
# Use workflow to fit model with each fold of resampled data
glm_fit_cv <- glm_wf |>
fit_resamples(pmodel_folds, control = control_resamples(save_pred = TRUE))
# Collect predictions out of folds into one tibble
pmodels_glm_cv_preds <- collect_predictions(glm_fit_cv)
pmodels_glm_cv_preds
# Overall metrics
collect_metrics(glm_fit_cv)
# To see performance of each fold
pmodels_glm_cv_preds |>
group_by(id) |>
roc_auc(MyChart_Access, ".pred_No Access")
# ROC curves of each fold
pmodels_glm_cv_preds |>
group_by(id) |>
roc_curve(MyChart_Access, ".pred_No Access") |>
autoplot()
```
```{r}
## Random Forest Modeling
# Check for missing data
summary(primary_model)
# Impute missing numerical values with the median
primary_model <- primary_model |>
mutate(
BIRTH_WEIGHT_GRAMS = ifelse(is.na(BIRTH_WEIGHT_GRAMS), median(BIRTH_WEIGHT_GRAMS, na.rm = TRUE), BIRTH_WEIGHT_GRAMS),
GESTATIONAL_AGE_TOTAL_DAYS = ifelse(is.na(GESTATIONAL_AGE_TOTAL_DAYS), median(GESTATIONAL_AGE_TOTAL_DAYS, na.rm = TRUE), GESTATIONAL_AGE_TOTAL_DAYS)
)
# Create a random index for the training set
set.seed(123)
train_index <- sample(seq_len(nrow(primary_model)), size = 0.8 * nrow(primary_model))
# Split the primary_model into training and testing sets
train_data <- primary_model[train_index, ] # 80% of the data
test_data <- primary_model[-train_index, ] # Remaining 20%
# Random Forest Model specification
rf_spec <-
rand_forest(trees = 1000, min_n = 5) |> #number of trees and min # of data points
set_engine("randomForest", importance = TRUE) |>
set_mode("classification")
# Fit our data to the random forest model
rf_fit <- rf_spec |>
fit(MyChart_Access ~ SEX + RACE + ETHNICITY + PREFERRED_LANGUAGE + PAYOR_GROUP +
BIRTH_WEIGHT_GRAMS + NICU_LOS_DAYS + INBORN_IND + GESTATIONAL_AGE_TOTAL_DAYS + date_period, data = train_data)
rf_fit
pmodel.rf.pred.values <- bind_cols(
truth = test_data$MyChart_Access,
predict(rf_fit, test_data),
predict(rf_fit, test_data, type = "prob")
)
pmodel.rf.pred.values
roc_auc(pmodel.rf.pred.values,
truth,
".pred_No Access")
# Creation of the ROC curve
autoplot(roc_curve(pmodel.rf.pred.values,
truth,
".pred_No Access"))
# Helps to understand the variables that contributed the most to the classification
rf_fit |>
extract_fit_engine() |>
importance()
rf_fit |>
extract_fit_engine() |>
vip()
```
```{r}
## XGBoost Modeling
# Model specification
bt_spec <-
boost_tree(trees = 50,
tree_depth = 4) |>
set_mode("classification") |>
set_engine("xgboost")
bt_recipe <-
recipe(MyChart_Access ~ SEX + RACE + ETHNICITY + PREFERRED_LANGUAGE + PAYOR_GROUP +
BIRTH_WEIGHT_GRAMS + NICU_LOS_DAYS + INBORN_IND + GESTATIONAL_AGE_TOTAL_DAYS + date_period, data = primary_model) |>
step_dummy(all_nominal_predictors()) |>
step_normalize(all_predictors())
# Workflow specification
bt_workflow <- workflow() |>
add_model(bt_spec) |>
add_recipe(bt_recipe)
# Model fit to training data
bt_fit <- fit(bt_workflow, data = primary_model)
bt_fit
pmodel.bt.pred.values <- bind_cols(
truth = primary_model$MyChart_Access,
predict(bt_fit, primary_model),
predict(bt_fit, primary_model, type = "prob")
)
# Model AUC suprisingly highly successful
roc_auc(pmodel.bt.pred.values,
truth,
".pred_No Access")
# Plot ROC curve
autoplot(roc_curve(pmodel.bt.pred.values,
truth,
".pred_No Access"))
# Variable importance extracted with vip
vip(bt_fit)
```
# Conclusion
In this project, we explored the association between MyChart access and patient characteristics among NICU admissions. Patients who accessed MyChart were found to have longer NICU stays, lower gestational ages, and lower birth weights—factors commonly associated with extended NICU admissions. Monthly trends revealed a surge in MyChart usage during the pandemic, peaking in 2022, followed by a slight decline in subsequent post-pandemic years, a trend warranting further exploration. As expected, the most frequently accessed activities in MyChart included "Visits," "Inpatient Admissions," and "Test Results," though it was notable that clinical notes were accessed less frequently than these features. MyChart access was observed to occur most frequently around midday, with usage tapering off overnight. Interestingly, there was a wide variation in access frequencies among patients, with some showing minimal engagement and others exhibiting extremely high utilization. Lastly, the tabular summary highlighted significant differences between MyChart access groups, including disparities in race, language preference, and insurance types, as well as longer NICU stays and a higher prevalence of inborn deliveries among those accessing MyChart. These findings highlight patterns of MyChart utilization and suggest potential links to patient demographics and care outcomes.
The predictive modeling aspect of this project was primarily an interest of mine to better understand the data. I was pleasantly surprised by the results of the ROC curves, as I had not anticipated achieving such respectable outcomes. With ROC AUCs predominantly ranging from 0.8 to 0.9, the models demonstrated good discrimination overall, with the XGBoost model emerging as the most successful. While I do not believe these models have current practical implications, they provide valuable insights into the characteristics and key factors influencing patients most likely to access MyChart.
## Limitations
Our team acknowledges the presence of a significant amount of missing information, particularly in key demographic categories such as race, ethnicity, language, and payor group. To address these gaps, we are planning several initiatives, including leveraging birth certificate data, examining interpreter utilization records at CHOP, and exploring payor group information within billing records. The predictive modeling aspect of this project primarily reflects my personal interest in practicing and exploring prediction modeling techniques. I also recognize the need to fine-tune the models to address the skewed nature of MyChart access data by date, which likely contributes substantially to the models' success.
# References
1. Children's Hospital of Philadelphia. MyChart FAQ. Retrieved December 5, 2024, from https://mychop.chop.edu/MyChart/Authentication/Login?mode=stdfile&option=faq#EQ_what
2. Bush RA, Vemulakonda VM, Richardson AC, Deakyne Davies SJ, Chiang GJ. Providing Access: Differences in Pediatric Portal Activation Begin at Patient Check-in. Appl Clin Inform. 2019 Aug;10(4):670-678. doi: 10.1055/s-0039-1695792. Epub 2019 Sep 11. PMID: 31509879; PMCID: PMC6739202.
3. Office of the Assistant Secretary for Planning and Evaluation. (August 20, 1996). Health Insurance Portability and Accountability Act of 1996. U.S. Department of Health and Human Services. https://aspe.hhs.gov/reports/health-insurance-portability-accountability-act-1996
4. U.S. Food and Drug Administration. (n.d.). 21st Century Cures Act. U.S. Department of Health and Human Services. https://www.fda.gov/regulatory-information/selected-amendments-fdc-act/21st-century-cures-act
5. Harvard Risk Management Foundation. (2021, December 9). Cures Act overview. Harvard Risk Management Foundation. https://www.rmf.harvard.edu/Risk-Prevention-and-Education/Article-Catalog-Page/Articles/2021/Cures-Act-Overview