---
title: "PURPOSe: Predicting Utilization of Resources in Psychiatry Outpatient Services"
subtitle: "BMIN503/EPID600 Final Project"
author: "Nicolas Lescano"
editor: visual
format:
  html:
    css: "style.css"
    self-contained: true
    embed-resources: true
    toc: true
    toc-depth: 5
    toc-location: left
    code-fold: true
    code-fold-default: true
    code-tools: true
execute:
  message: false
  warning: false
---

[![](images/banner.png){fig-alt="banner"}](https://ibi.med.upenn.edu/)

------------------------------------------------------------------------

## Overview {#sec-overview}

This project leverages a decade of data from the Outpatient Psychiatry Clinic (OPC) of the Penn Medicine Department of Psychiatry to develop predictive models aimed at optimizing resource utilization. The goal is to address operational inefficiencies in scheduling and care allocation, ultimately enhancing patient outcomes, clinic efficiency, and provider satisfaction. Integrating principles from psychiatry, healthcare operations, and data science, the project aims to offer actionable insights into improving outpatient mental health care delivery.

Key insights were gathered from Dr. Theodore Satterthwaite (Director of Penn Lifespan Informatics and Neuroimaging Center), who provided guidance on refining the project scope and advanced predictive analytics, and Rucha Kelkar (Epic Cogito Technical Services for Penn Medicine), who offered expertise on Epic Analytics features and potential Epic Cognitive Developer Platform custom model integration.

The complete project repository, including scripts and datasets, is available [here](https://github.com/lescanico/BMIN503_Final_Project).

## Introduction {#sec-introduction}

### Challenges in Outpatient Psychiatry

Outpatient psychiatric services face unique challenges in managing the balance between provider availability and patient demand. These challenges are exacerbated by the inherent unpredictability of patient attendance and the varying degrees of care required. Inefficient scheduling can lead to prolonged wait times, rushed appointments, and increased stress for both patients and providers, thereby impacting the quality of care delivered. Additionally, the mismatch between the type of care provided and the specific needs of patients can result in suboptimal treatment outcomes and reduced overall clinic efficiency.

### Need for Advanced Analytical Approaches

The traditional appointment systems in psychiatric outpatient clinics often rely on static scheduling rules that do not account for the dynamic nature of mental health conditions and treatment responses. This project recognizes the potential of leveraging historical data and machine learning to create a more adaptive scheduling system. By predicting patient demand and resource needs more accurately, the clinic can not only improve patient care but also enhance operational efficiency, ultimately fostering a more supportive environment for both patients and staff.

![Current OPC Operational Flow](images/flow.png){fig-alt="PBH OPC Operational Flow"}

### Objectives

The final goal of this project is to integrate predictive analytics into the scheduling and resource allocation processes of outpatient psychiatric services. The aim is to develop a model that can forecast patient demand and determine optimal resource distribution. This will enable clinics to tailor their staffing and scheduling strategies in real time, thereby minimizing wait times, reducing provider workload imbalances, and ensuring that patients receive timely and appropriate care.

## Methods {#sec-methods}

### Data Sourcing & Anonymization

Data for this project were sourced from Epic Analytics and cover OPC visits from October 1, 2014, to September 30, 2024. The raw exports were in .xlsx format; they were converted to .csv for easier manipulation and stored securely. During anonymization, medical record numbers (MRNs) were replaced with unique anonymized identifiers. Sensitive demographic fields were also generalized: exact birth dates were reduced to year of birth, and postal codes were truncated to their first three characters. These steps were necessary to comply with data privacy regulations and ethical standards for handling patient data.

```{r anonymization, eval=FALSE}
# Load required libraries
library(readr)
library(dplyr)
library(lubridate)

# Import raw data
patient_data_raw <- read_csv("H:/secure/patient_data.csv")
visit_data_raw <- read_csv("H:/secure/visit_data.csv")

# Generate unique patient IDs for anonymization
mrn_mapping <- tibble(
  MRN = unique(patient_data_raw$MRN),
  patient_id = sprintf("%05d", seq_along(unique(patient_data_raw$MRN)))
)

# Save MRN mapping for potential reversibility
mrn_mapping |> saveRDS("H:/secure/mrn_mapping.rds")

# Define anonymization function
anonymize <- function(data, mapping) {
  data |>
    left_join(mapping, by = "MRN") |>
    mutate(
      year_of_birth = if ("Birth Date (UTC)" %in% names(data)) {
        `Birth Date (UTC)` |> mdy() |> year()
      } else NA,
      postal_code = if ("Postal Code" %in% names(data)) {
        `Postal Code` |> substr(1, 3)
      } else NA
    ) |>
    # any_of() avoids an error when a dataset lacks one of these columns
    select(-any_of(c("MRN", "Birth Date (UTC)", "Postal Code")))
}

# Ensure output directory exists
"datasets/" |> dir.create(showWarnings = FALSE, recursive = TRUE)

# Apply anonymization and save
patient_data_anonymized <- patient_data_raw |> anonymize(mrn_mapping)
visit_data_anonymized <- visit_data_raw |> anonymize(mrn_mapping)

patient_data_anonymized |> saveRDS("datasets/patient_data_anonymized.rds")
visit_data_anonymized |> saveRDS("datasets/visit_data_anonymized.rds")
```
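
To make the effect of these steps concrete, the following is a minimal base R sketch on made-up records. The MRNs, birth dates, and postal codes are invented for illustration, and `match()` stands in for the `left_join()` used in the project code.

```r
# Made-up raw records (not project data)
raw <- data.frame(
  MRN   = c("A100", "B200", "A100"),
  birth = c("03/15/1980", "11/02/1995", "03/15/1980"),
  zip   = c("19104", "08540", "19104"),
  stringsAsFactors = FALSE
)

# One zero-padded identifier per unique MRN
ids <- unique(raw$MRN)
mapping <- data.frame(
  MRN = ids,
  patient_id = sprintf("%05d", seq_along(ids)),
  stringsAsFactors = FALSE
)

# Replace the MRN, keep only the birth year and the first three postal characters
anon <- data.frame(
  patient_id    = mapping$patient_id[match(raw$MRN, mapping$MRN)],
  year_of_birth = as.integer(format(as.Date(raw$birth, format = "%m/%d/%Y"), "%Y")),
  postal_code   = substr(raw$zip, 1, 3),
  stringsAsFactors = FALSE
)

print(anon)
```

Because the mapping is built from unique MRNs, repeated records for the same patient receive the same `patient_id`.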

### Data Preprocessing

#### Standardization

Columns across the datasets were renamed to a consistent convention: names were converted to lowercase, units (such as mmHg and kg/m^2) were stripped, and spaces and special characters were replaced with underscores. Columns present in both datasets were given a source suffix (`_from_patient_dataset` or `_from_visit_dataset`) so they could be told apart after merging. A consistent naming convention makes the datasets easier to use and keeps downstream code compatible.

Each variable was then manually classified by data type (identifier, numeric, factor, logical, date, time, or list stored as character) based on its content and role in the analysis. Correct typing is a precondition for reliable data manipulation in every later step.

Once classified, variables were converted to their designated types so that each was in the format expected by the predictive models.

Finally, a mapping of each variable's original and converted types was created. This mapping documents the transformations and supports traceability and debugging.

##### Renaming

```{r renaming, eval=FALSE}
# Load required libraries
library(readr)
library(dplyr)
library(stringr)

# Load anonymized data
patient_data <- readRDS("datasets/patient_data_anonymized.rds")
visit_data <- readRDS("datasets/visit_data_anonymized.rds")

# Function to standardize column names
standardize_column_names <- function(dataset) {
  dataset |>
    rename_with(~ . |>
                  str_replace_all(" \\(mmHg\\)| \\(kg/m\\^2\\)", "") |>
                  str_to_lower() |>
                  str_replace_all("[\\s\\.\\/\\?\\-\\(\\)\\%\\$]+", "_") |>
                  str_replace_all("_+", "_") |>
                  str_replace_all("_$", ""))
}

# Function to create column name mapping
create_name_mapping <- function(original_dataset, standardized_dataset) {
  tibble(
    Original = colnames(original_dataset),
    Standardized = colnames(standardized_dataset)
  )
}

# Remove columns created automatically by Epic export process
removed_columns <- c('Start Date', 'End Date')

patient_data <- patient_data |> select(-any_of(removed_columns))
visit_data <- visit_data |> select(-any_of(removed_columns))

# Standardize column names
patient_data_renamed <- patient_data |> standardize_column_names()
visit_data_renamed <- visit_data |> standardize_column_names()

# Add suffixes to common columns, excluding "patient_id"
common_cols <- colnames(patient_data_renamed) |>
  intersect(colnames(visit_data_renamed)) |>
  setdiff("patient_id")

patient_data_renamed <- patient_data_renamed |> 
  rename_with(~ paste0(., "_from_patient_dataset"), .cols = common_cols)

visit_data_renamed <- visit_data_renamed |>
  rename_with(~ paste0(., "_from_visit_dataset"), .cols = common_cols)

# Create column name mappings
patient_name_mapping <- create_name_mapping(patient_data, patient_data_renamed) |>
  mutate(Source = "Patient Dataset")

visit_name_mapping <- create_name_mapping(visit_data, visit_data_renamed) |>
  mutate(Source = "Visit Dataset")

# Combine into a single dataframe
name_mapping <- bind_rows(patient_name_mapping, visit_name_mapping)
```
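
As a quick check of what the renaming rules produce, the sketch below restates them in base R and applies them to a few made-up column names. The function `standardize_names()` and the example names are illustrative assumptions; the project itself uses the `stringr`-based `standardize_column_names()` above.

```r
# Base R restatement of the renaming steps (perl = TRUE so that \\s works
# inside the character class); not the project's own function
standardize_names <- function(x) {
  x <- gsub(" \\(mmHg\\)| \\(kg/m\\^2\\)", "", x, perl = TRUE)  # strip units
  x <- tolower(x)                                                # lowercase
  x <- gsub("[\\s./?()%$-]+", "_", x, perl = TRUE)               # separators -> "_"
  x <- gsub("_+", "_", x, perl = TRUE)                           # collapse repeats
  sub("_$", "", x, perl = TRUE)                                  # drop trailing "_"
}

# Made-up column names in the style of an export
example_names <- c(
  "BP Systolic (mmHg)",
  "BMI (kg/m^2)",
  "Age at Visit (Years)",
  "New to Provider?"
)

standardized <- standardize_names(example_names)
print(data.frame(Original = example_names, Standardized = standardized))
```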

##### Retyping

###### Classification

```{r classification, eval=FALSE}
# Load required libraries
library(tibble)
library(dplyr)
library(readr)

# Manually classify variable types
variable_type_mapping <- tibble(
  Variable = c(
    # Identifier
    "patient_id",
    
    # Patient Dataset - Numeric
    "adi_national_percentile", "adi_state_decile", "bmi_from_patient_dataset", "bp_diastolic_from_patient_dataset", "bp_systolic_from_patient_dataset", "svi_2020_socioeconomic_percentile_census_tract", "year_of_birth_from_patient_dataset", "general_risk_score",
    
    # Patient Dataset - List as Character
    "allergies_and_contraindications", "chief_complaint", "diagnosis_from_patient_dataset", "hospital_or_clinic_administered_medications", "level_of_service_from_patient_dataset" , "medical_history", "medications", "medications_ordered_from_patient_dataset", "outpatient_medications", "phq_2_total_score", "phq_9", "procedures", "procedures_ordered_from_patient_dataset", "sdoh_domains", "sdoh_risk_level",
    
    # Patient Dataset - Factor
    "country_from_patient_dataset", "country_county_from_patient_dataset", "gender_identity_from_patient_dataset", "language_from_patient_dataset", "legal_sex_from_patient_dataset", "marital_status", "patient_ethnic_group_from_patient_dataset", "patient_race_from_patient_dataset", "religion_from_patient_dataset", "rural_urban_commuting_area_primary_from_patient_dataset", "rural_urban_commuting_area_secondary_from_patient_dataset", "sex_assigned_at_birth_from_patient_dataset", "sexual_orientation_from_patient_dataset", "state_from_patient_dataset", "postal_code_from_patient_dataset", "mychart_status_from_patient_dataset",
    
    # Patient Dataset - Logical
    "interpreter_needed_from_patient_dataset", "university_of_pennsylvania_student_from_patient_dataset",
    
    # Visit Dataset - Numeric
    "age_at_visit_years", "appointment_length_minutes", "bmi_from_visit_dataset", "bp_diastolic_from_visit_dataset", "bp_systolic_from_visit_dataset", "continuity_of_care", "copay_collected", "copay_due", "encounter_to_close_day", "lead_time_days", "no_show_probability", "prepayment_collected", "prepayment_due", "time_physician_spent_post_charting_minutes", "time_physician_spent_pre_charting_minutes", "time_waiting_for_physician_minutes", "time_with_physician_minutes", "year_of_birth_from_visit_dataset",
    
    # Visit Dataset - List as Character
    "diagnosis_from_visit_dataset", "medications_ordered_from_visit_dataset", "procedures_ordered_from_visit_dataset",
    
    # Visit Dataset - Factor
    "appointment_status", "country_from_visit_dataset", "country_county_from_visit_dataset", "encounter_type", "gender_identity_from_visit_dataset", "language_from_visit_dataset", "legal_sex_from_visit_dataset", "level_of_service_from_visit_dataset", "patient_ethnic_group_from_visit_dataset", "patient_race_from_visit_dataset", "primary_benefit_plan", "primary_diagnosis", "primary_payer", "primary_payer_financial_class", "primary_provider_title", "primary_provider_type", "religion_from_visit_dataset", "rural_urban_commuting_area_primary_from_visit_dataset", "rural_urban_commuting_area_secondary_from_visit_dataset", "scheduling_source", "sex_assigned_at_birth_from_visit_dataset", "sexual_orientation_from_visit_dataset", "state_from_visit_dataset", "visit_type", "postal_code_from_visit_dataset", "primary_subscriber_group_number", "mychart_status_from_visit_dataset",
    
    # Visit Dataset - Logical
    "interpreter_needed_from_visit_dataset", "new_to_department_specialty", "new_to_facility", "new_to_provider", "portal_active_at_scheduling", "self_pay", "university_of_pennsylvania_student_from_visit_dataset",
    
    # Visit Dataset - Date
    "appointment_creation_date", "visit_date",
    
    # Visit Dataset - hms
    "appointment_time"
  ),
  
  Type = c(
    # Identifier
    rep("identifier", 1),
    # Patient Dataset - Numeric
    rep("numeric", 8),
    # Patient Dataset - List as Character
    rep("list_as_character", 15),
    # Patient Dataset - Factor
    rep("factor", 16),
    # Patient Dataset - Logical
    rep("logical", 2),
    # Visit Dataset - Numeric
    rep("numeric", 18),
    # Visit Dataset - List as Character
    rep("list_as_character", 3),
    # Visit Dataset - Factor
    rep("factor", 27),
    # Visit Dataset - Logical
    rep("logical", 7),
    # Visit Dataset - Date
    rep("Date", 2),
    # Visit Dataset - hms
    rep("hms", 1)
  )
)

# Group variables by type
group_summary <- variable_type_mapping |>
  group_by(Type) |>
  summarise(Variables = list(Variable), .groups = "drop") |>
  mutate(Count = lengths(Variables)) |>
  select(Type, Count, Variables)

# Save classification mapping
variable_type_mapping |> saveRDS("datasets/mappings/types.rds")
```

###### Conversion

```{r conversion, eval=FALSE}
# Load conversion function
source("scripts/helper-functions/convert-types.R")

# Apply function
patient_data_converted <- patient_data_renamed |> convert_types()
visit_data_converted <- visit_data_renamed |> convert_types()
```

> See [Function to Convert Data Types].
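
The helper itself lives in a separate script, so the following is only a hypothetical sketch of how a mapping-driven conversion could work. `convert_by_mapping()` is not the project's `convert_types()`. The "Yes"/"No" coding of logical fields and the month/day/year date format are assumptions made for the toy data.

```r
# Hypothetical mapping-driven converter (illustration only)
convert_by_mapping <- function(df, mapping) {
  for (i in seq_len(nrow(mapping))) {
    var <- mapping$Variable[i]
    if (!var %in% names(df)) next  # skip variables absent from this dataset
    df[[var]] <- switch(
      mapping$Type[i],
      numeric = suppressWarnings(as.numeric(df[[var]])),  # unparseable -> NA
      factor  = as.factor(df[[var]]),
      logical = df[[var]] == "Yes",                       # assumes "Yes"/"No" coding
      Date    = as.Date(df[[var]], format = "%m/%d/%Y"),  # assumes m/d/Y strings
      df[[var]]                                           # other types left unchanged
    )
  }
  df
}

# Made-up data and mapping
toy <- data.frame(
  patient_id     = c("00001", "00002"),
  lead_time_days = c("12", "n/a"),
  visit_type     = c("Return", "New"),
  self_pay       = c("Yes", "No"),
  visit_date     = c("10/01/2014", "09/30/2024"),
  stringsAsFactors = FALSE
)
toy_mapping <- data.frame(
  Variable = c("patient_id", "lead_time_days", "visit_type", "self_pay", "visit_date"),
  Type     = c("identifier", "numeric", "factor", "logical", "Date"),
  stringsAsFactors = FALSE
)

toy_converted <- convert_by_mapping(toy, toy_mapping)
print(sapply(toy_converted, class))
```

Driving the conversion from the mapping table keeps the type decisions in one place, which is the role `variable_type_mapping` plays in this project.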

###### Mapping

```{r mapping, eval=FALSE}
# Load required libraries
library(tibble)
library(dplyr)

# Function to create variable type conversions table
create_type_conversions_mapping <- function(original_df, converted_df) {
  # Get data types for original and converted datasets
  tibble(
    Variable = names(original_df),
    Original = sapply(original_df, function(x) paste(class(x), collapse = ", ")),
    Converted = sapply(converted_df, function(x) paste(class(x), collapse = ", "))
  )
}

# Create the variable type mapping tables and combine them
type_conversions_mapping <- bind_rows(
  create_type_conversions_mapping(patient_data_renamed, patient_data_converted),
  create_type_conversions_mapping(visit_data_renamed, visit_data_converted)
)

# Capture as HTML table
source("scripts/helper-functions/capture-to-html.R")

capture_output_to_html(
  "data_type_conversions.html",
  "Data Type Conversions" = type_conversions_mapping
)
```

```{r, results='asis', echo=FALSE}
cat(readLines("outputs/data_type_conversions.html"), sep = "\n")
```

###### Standardization Tables

```{r standarization, eval=FALSE}
# Capture as HTML table
source("scripts/helper-functions/capture-to-html.R")

capture_output_to_html(
  "standarization_tables.html",
  "Standarization",
  "Renaming Table" = name_mapping,
  "Retyping Table" = type_conversions_mapping
)
```

> See [Function to Capture Output as HTML](#capture-output-as-html).

```{r, results='asis', echo=FALSE}
cat(readLines("outputs/standarization_tables.html"), sep = "\n")
```

#### Cleaning

In the cleaning process, the patient and visit datasets were first checked to confirm that they shared the same set of patient identifiers. A left join on patient ID then attached patient-level information to each visit record, producing a single dataset. After merging, pairs of columns that appeared in both source datasets were compared, and pairs with identical content were consolidated into a single column to remove redundancy.

For organization, the dataset was structured into logical groups to facilitate analysis. Columns were arranged by category, such as identifiers, demographic information, clinical details, and encounter specifics. This reorganization was aimed at improving the readability and accessibility of the dataset for subsequent analyses, ensuring that related data points were grouped together logically.

##### Missing Data Analysis

In the missing data analysis, the proportion of missing values was computed for every field. Visualizations of the percentage of missing data per column were created for four bands of missingness (0–5%, 5–30%, 30–50%, and 50–100%), corresponding to slight, moderate, significant, and extreme levels. These plots give a clear picture of data completeness and help prioritize handling strategies.

```{r plotting, eval=FALSE}
# Visualize Missing Data
source("scripts/helper-functions/plot-missing-values.R")

# Patient Data
plot_missing_values_by_type(patient_data_converted, 0, 5)
plot_missing_values_by_type(patient_data_converted, 5, 30)
plot_missing_values_by_type(patient_data_converted, 30, 50)
plot_missing_values_by_type(patient_data_converted, 50, 100)

# Visit Data
plot_missing_values_by_type(visit_data_converted, 0, 5)
plot_missing_values_by_type(visit_data_converted, 5, 30)
plot_missing_values_by_type(visit_data_converted, 30, 50)
plot_missing_values_by_type(visit_data_converted, 50, 100)

# Capture as HTML
source("scripts/helper-functions/capture-to-html.R")

capture_output_to_html(
  "missing_data_plots.html",
  "Missing Data Plots",
  "Patient Data" = list(
    "figures/plots/patient_data_converted_0-5_missing_plot.png",
    "figures/plots/patient_data_converted_5-30_missing_plot.png",
    "figures/plots/patient_data_converted_30-50_missing_plot.png",
    "figures/plots/patient_data_converted_50-100_missing_plot.png"
    ),
  "Visit Data" = list(
    "figures/plots/visit_data_converted_0-5_missing_plot.png",
    "figures/plots/visit_data_converted_5-30_missing_plot.png",
    "figures/plots/visit_data_converted_30-50_missing_plot.png",
    "figures/plots/visit_data_converted_50-100_missing_plot.png"
    )
)
```

> See [Function to Plot Missing Values by Type](#plot-missing-values-by-type).

```{r, results='asis', echo=FALSE}
cat(readLines("outputs/missing_data_plots.html"), sep = "\n")
```
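
The banding used for these plots can be sketched in a few lines of base R. This is an illustration on made-up data, not the project's plotting helper; in particular, how the helper treats values that fall exactly on a boundary is not shown in this report, so the closed-on-the-right intervals below are an assumption.

```r
# Made-up data with 0%, 10%, 40%, and 80% missing values
toy <- data.frame(
  a = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10),
  b = c(1, NA, 3, 4, 5, 6, 7, 8, 9, 10),
  c = c(NA, NA, NA, NA, 5, 6, 7, 8, 9, 10),
  d = c(NA, NA, NA, NA, NA, NA, NA, NA, 9, 10)
)

# Percentage of missing values per column
pct_missing <- colMeans(is.na(toy)) * 100

# Assign each column to one of the four bands
bands <- cut(
  pct_missing,
  breaks = c(0, 5, 30, 50, 100),
  labels = c("0-5%", "5-30%", "30-50%", "50-100%"),
  include.lowest = TRUE  # so that columns with no missing values fall in the first band
)

missing_summary <- data.frame(
  variable    = names(pct_missing),
  pct_missing = unname(pct_missing),
  band        = bands
)
print(missing_summary)
```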

##### Missing Data Handling

For each category of missingness, suitable data handling strategies were proposed, considering the nature of the data and the percentage of missing values. Tables summarizing these strategies were generated to guide the implementation of appropriate methods for missing data imputation or exclusion, ensuring that the dataset's integrity was maintained while preparing it for further analysis.

```{r handling, eval=FALSE}
# Generate data handling tables with all options
source("scripts/helper-functions/generate-missing-handling-table.R")

patient_handling_table <- generate_data_handling_table(patient_data_converted)
visit_handling_table <- generate_data_handling_table(visit_data_converted)

# Capture as HTML
source("scripts/helper-functions/capture-to-html.R")

capture_output_to_html(
  "missing_data_handling_options.html",
  "Data Handling Tables",
  "Patient Data" = patient_handling_table,
  "Visit Data" = visit_handling_table
)
```

> See [Function to Generate Missing Data Handling Strategies](#generate-data-handling-table) and [Function to Calculate Summary Statistics](#calculate-summary-stats).

```{r, results='asis', echo=FALSE}
cat(readLines("outputs/missing_data_handling_options.html"), sep = "\n")
```

##### Merging

```{r merging, eval=FALSE}
# Load required libraries
library(dplyr)

# Verify that both datasets contain the same set of patient_id values
all_shared <- setequal(
  patient_data_converted$patient_id,
  visit_data_converted$patient_id
)

# Stop early on a mismatch: the sampling step below requires data_merged
if (!all_shared) {
  stop("There are patient_id values that are not shared between the two datasets.")
}

# Perform the merge, keeping every visit row
data_merged <- visit_data_converted |>
  left_join(patient_data_converted, by = "patient_id")

print("Merging complete.")

# Sample some rows
set.seed(123)
sample_data_merged <- data_merged |>
  slice_sample(n = 10)
```

##### Deduplication

```{r deduplication, eval=FALSE}
# Load required libraries
library(dplyr)
library(stringr)

# Function to process and rename only identical duplicate columns
process_identical_duplicates <- function(df, suffix_1 = "_from_patient_dataset", suffix_2 = "_from_visit_dataset") {
  # Identify columns with specified suffixes
  cols_suffix_1 <- grep(paste0(suffix_1, "$"), names(df), value = TRUE)
  cols_suffix_2 <- grep(paste0(suffix_2, "$"), names(df), value = TRUE)
  
  for (col in cols_suffix_1) {
    # Find corresponding column with the second suffix
    corresponding_col <- gsub(suffix_1, suffix_2, col)
    
    # Check if the corresponding column exists
    if (corresponding_col %in% cols_suffix_2) {
      # Check if the columns are identical
      if (identical(df[[col]], df[[corresponding_col]])) {
        # Consolidate identical columns by renaming the first and removing the second
        new_name <- gsub(paste0(suffix_1, "|", suffix_2), "", col)
        names(df)[names(df) == col] <- new_name
        df <- df |> select(-all_of(corresponding_col))
      }
    }
  }
  
  # Return the deduplicated data
  return(df)
}

# Apply the function to the merged dataset
data_deduplicated <- process_identical_duplicates(data_merged)

# Update and save Variable Type Mapping
source("scripts/helper-functions/update-type-mapping.R")

updated_mapping <- variable_type_mapping |> update_type_mapping(data_deduplicated)
variable_type_mapping <- updated_mapping
variable_type_mapping |> saveRDS("datasets/mappings/types.rds")

# Sample some rows
set.seed(123)
sample_data_deduplicated <- data_deduplicated |>
  slice_sample(n = 10)
```

> See [Function to Update Variable Type Mapping](#update-variable-type-mapping).
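
The consolidation rule is easiest to see on a small example. The base R sketch below uses made-up data and mirrors the logic of `process_identical_duplicates()` without calling it: a suffixed pair is collapsed only when both columns are identical, and pairs that differ keep their suffixes.

```r
# Made-up merged data: 'language' matches across sources, 'bmi' does not
toy <- data.frame(
  patient_id                    = c("00001", "00002"),
  language_from_patient_dataset = c("English", "Spanish"),
  language_from_visit_dataset   = c("English", "Spanish"),
  bmi_from_patient_dataset      = c(24.1, 30.2),
  bmi_from_visit_dataset        = c(23.8, 30.2),
  stringsAsFactors = FALSE
)

suffix_p <- "_from_patient_dataset"
suffix_v <- "_from_visit_dataset"

for (col_p in grep(paste0(suffix_p, "$"), names(toy), value = TRUE)) {
  base_name <- sub(paste0(suffix_p, "$"), "", col_p)
  col_v <- paste0(base_name, suffix_v)

  # Collapse the pair only when the two columns are exactly identical
  if (col_v %in% names(toy) && identical(toy[[col_p]], toy[[col_v]])) {
    names(toy)[names(toy) == col_p] <- base_name
    toy[[col_v]] <- NULL
  }
}

print(names(toy))
```

Keeping non-identical pairs under their suffixed names preserves the disagreement between sources for later inspection rather than silently picking one.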

##### Organization

```{r organization, eval=FALSE}
# Load required libraries
library(readr)
library(dplyr)

logical_groups <- list(

  Identifiers = c("patient_id"),

  Demographic_Info = c(
    "year_of_birth", "age_at_visit_years", "legal_sex", 
    "gender_identity_from_patient_dataset", "gender_identity_from_visit_dataset", 
    "sex_assigned_at_birth_from_patient_dataset", "sex_assigned_at_birth_from_visit_dataset", 
    "sexual_orientation_from_patient_dataset", "sexual_orientation_from_visit_dataset", 
    "patient_race_from_patient_dataset", "patient_race_from_visit_dataset", 
    "patient_ethnic_group", "marital_status", "language", 
    "religion_from_patient_dataset", "religion_from_visit_dataset"
  ),
  
  Sociogeographic_Info = c(
    "country", "state", 
    "country_county_from_patient_dataset", "country_county_from_visit_dataset", 
    "postal_code_from_patient_dataset", "postal_code_from_visit_dataset", 
    "rural_urban_commuting_area_primary", "rural_urban_commuting_area_secondary", 
    "svi_2020_socioeconomic_percentile_census_tract", "adi_national_percentile", "adi_state_decile"
  ),
  
  Clinical_Info = c(
    "primary_diagnosis", "diagnosis_from_patient_dataset", "diagnosis_from_visit_dataset", 
    "medical_history", "allergies_and_contraindications"
  ),
  
  Treatment_Info = c(
    "procedures", "procedures_ordered_from_patient_dataset", "procedures_ordered_from_visit_dataset", 
    "medications", "hospital_or_clinic_administered_medications", "outpatient_medications", 
    "medications_ordered_from_patient_dataset", "medications_ordered_from_visit_dataset"
  ),
  
  Health_Metrics = c(
    "bmi_from_patient_dataset", "bmi_from_visit_dataset", 
    "bp_systolic_from_patient_dataset", "bp_systolic_from_visit_dataset", 
    "bp_diastolic_from_patient_dataset", "bp_diastolic_from_visit_dataset", 
    "phq_2_total_score", "phq_9", "general_risk_score", "sdoh_risk_level", "sdoh_domains"
  ),
  
  Encounter_Info = c(
    "visit_date", "visit_type", "continuity_of_care", "encounter_type", 
    "chief_complaint", "interpreter_needed", "appointment_status", 
    "appointment_creation_date", "appointment_time", "appointment_length_minutes", 
    "lead_time_days", "encounter_to_close_day", "scheduling_source", "no_show_probability"
  ),
  
  Provider_Info = c(
    "primary_provider_type", "primary_provider_title", 
    "time_physician_spent_pre_charting_minutes", "time_physician_spent_post_charting_minutes", 
    "time_with_physician_minutes", "time_waiting_for_physician_minutes"
  ),
  
  Financial_Info = c(
    "primary_benefit_plan", "primary_payer", "primary_payer_financial_class", "self_pay", 
    "copay_due", "copay_collected", "prepayment_due", "prepayment_collected", 
    "primary_subscriber_group_number", "level_of_service_from_patient_dataset", 
    "level_of_service_from_visit_dataset"
  ),
  
  Status_Info = c(
    "portal_active_at_scheduling", "mychart_status", 
    "new_to_department_specialty", "new_to_facility", "new_to_provider", 
    "university_of_pennsylvania_student"
  )
)

# Reorganize data
data_organized <- data_deduplicated |>
  select(all_of(unlist(logical_groups, use.names = FALSE))) |>
  arrange(patient_id, visit_date)

# Sample some rows
set.seed(123)
sample_data_organized <- data_organized |>
  slice_sample(n = 10)
```

##### Cleaning Samples

```{r cleaning_samples, eval=FALSE}
# Capture as HTML tables
source("scripts/helper-functions/capture-to-html.R")

capture_output_to_html(
  "cleaning_samples.html",
  "Samples",
  "Merged Data" = sample_data_merged,
  "Deduplicated Data" = sample_data_deduplicated,
  "Organized Data" = sample_data_organized
)
```

```{r, results='asis', echo=FALSE}
cat(readLines("outputs/cleaning_samples.html"), sep = "\n")
```

#### Preparation

During aggregation, multi-valued fields were reduced to one value per record. Character encoding was first normalized to UTF-8 across all text fields. List-like character columns (newline-separated entries) were then replaced by the count of unique entries they contained. The newline-separated scores in `phq_2_total_score` and `phq_9` were instead averaged, and `sdoh_risk_level` was summarized by its mode using a custom function that ignores "Unknown" entries.

For encoding, all factor variables were mapped to integer codes and then stored again as factors whose levels are those codes. A mapping from the original factor levels to their codes was saved so that the transformation can be reversed or interpreted in later analyses.

Additionally, the type mapping was updated to reflect changes in data structure and types post-aggregation and encoding. This involved adjusting the classifications of certain variables to better suit their role in the subsequent analysis, ensuring data integrity and appropriateness for the modeling steps that follow.

Reassessment of missing data was performed after these transformations to ensure that the data handling strategies were still appropriate given the new data structure. This included visualizing missing data by type and reassessing the handling strategies to adapt to the encoded and aggregated data's needs.

> **Note**: Following the aggregation process, the resulting datasets are substantially reduced in size. Therefore, they can now be saved and reloaded between code chunks to enable independent execution.

##### Aggregation

```{r aggregation, eval=FALSE}
# Load required libraries
library(data.table)
library(stringr)
library(dplyr)

# Function to normalize encoding
normalize_encoding <- function(df) {
  df[] <- lapply(df, function(col) {
    if (is.character(col)) {
      iconv(col, from = "", to = "UTF-8", sub = "byte") # Replace invalid characters
    } else {
      col
    }
  })
  df
}

# Custom function to compute mode for 'sdoh_risk_level'
compute_mode <- function(x) {
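  x <- trimws(x)                      # count "Low Risk " and "Low Risk" as one level
  x <- x[!is.na(x) & x != "Unknown"]  # drop missing and uninformative entries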
  if (length(x) == 0) return(NA_character_)
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

# Function to convert strings of numbers separated by "\r\n" to means
mean_from_string <- function(x) {
  sapply(strsplit(x, "\\r?\\n", perl = TRUE), function(vals) {
    vals_num <- suppressWarnings(as.numeric(vals))
    if (all(is.na(vals_num))) NA_real_ else mean(vals_num, na.rm = TRUE)
  })
}

# Convert to data.table for faster processing
data_aggregated <- copy(data_organized)
setDT(data_aggregated)

# Normalize encoding for character columns
data_aggregated <- normalize_encoding(data_aggregated)

# Select columns to aggregate by count
cols_to_aggregate <- setdiff(
  names(data_aggregated)[sapply(data_aggregated, is.character)],
  c("patient_id", "phq_2_total_score", "phq_9", "sdoh_risk_level")
)

# Aggregate by count
for (col in cols_to_aggregate) {
  data_aggregated[[col]] <- ifelse(
    is.na(data_aggregated[[col]]),
    NA_real_,
    vapply(strsplit(data_aggregated[[col]], "\\r?\\n", perl = TRUE),
           function(x) length(unique(x)),
           numeric(1))
  )
}

# Aggregate 'phq_2_total_score' and 'phq_9' by mean
data_aggregated[, phq_2_total_score := mean_from_string(phq_2_total_score)]
data_aggregated[, phq_9 := mean_from_string(phq_9)]

# Aggregate 'sdoh_risk_level' by mode
data_aggregated[, sdoh_risk_level := {
  splits <- strsplit(sdoh_risk_level, "\\r?\\n", perl = TRUE)
  sapply(splits, compute_mode)
}]

# Inspect sdoh_risk_level unique values
unique(data_aggregated$sdoh_risk_level)

# Consolidate 'Low Risk ' (extra space) with 'Low Risk'
data_aggregated <- data_aggregated |>
  mutate(sdoh_risk_level = str_replace_all(sdoh_risk_level, fixed("Low Risk "), "Low Risk"))

# Convert 'sdoh_risk_level' to factor
data_aggregated$sdoh_risk_level <- as.factor(data_aggregated$sdoh_risk_level)

# Save aggregated data
data_aggregated |> saveRDS("datasets/processed/data_aggregated.rds")
```
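
The three aggregation rules can be illustrated on a few made-up records. This is a base R sketch with invented values, independent of the `data.table` pipeline above; `split_entries()` is a hypothetical helper introduced only for this example.

```r
# Made-up records; entries within a record are newline-separated
medications     <- c("sertraline\r\nbupropion\r\nsertraline", "lithium", NA)
phq_9           <- c("10\r\n14", "7", NA)
sdoh_risk_level <- c("Low Risk\nHigh Risk\nLow Risk", "Unknown", NA)

# Split on "\n" or "\r\n"
split_entries <- function(x) strsplit(x, "\\r?\\n", perl = TRUE)

# Rule 1: count of unique entries (missing records stay missing)
n_unique <- ifelse(
  is.na(medications),
  NA_real_,
  vapply(split_entries(medications), function(v) length(unique(v)), numeric(1))
)

# Rule 2: mean of the numeric entries
mean_score <- vapply(split_entries(phq_9), function(v) {
  v_num <- suppressWarnings(as.numeric(v))
  if (all(is.na(v_num))) NA_real_ else mean(v_num, na.rm = TRUE)
}, numeric(1))

# Rule 3: most frequent entry, ignoring missing and "Unknown" values
most_common <- vapply(split_entries(sdoh_risk_level), function(v) {
  v <- v[!is.na(v) & v != "Unknown"]
  if (length(v) == 0) return(NA_character_)
  uv <- unique(v)
  uv[which.max(tabulate(match(v, uv)))]
}, character(1))

print(data.frame(n_unique, mean_score, most_common))
```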

##### Encoding

```{r encoding, eval=FALSE}
# Load required libraries
library(readr)
library(dplyr)

# Load Aggregated Data
data_aggregated <- readRDS("datasets/processed/data_aggregated.rds")

# Initialize a list to store encoding mapping
mapping_list <- list()

# Encode all factor variables in the dataset
for (col in colnames(data_aggregated)) {
  if (is.factor(data_aggregated[[col]])) {
    # Store the mapping
    mapping_list[[col]] <- data.frame(
      Variable = col,
      Encoded = seq_along(levels(data_aggregated[[col]])),
      Value = levels(data_aggregated[[col]])
    )
    
    # Replace the factor variable with its integer encoding
    data_aggregated[[col]] <- as.integer(data_aggregated[[col]])
  }
}

# Reconvert only the columns encoded above to factors
# (other integer columns keep their type)
data_aggregated <- data_aggregated |>
  mutate(across(all_of(names(mapping_list)), as.factor))

# Save mapping and encoded dataset
mapping_list |> saveRDS("datasets/mappings/encoding.rds")
data_aggregated |> saveRDS("datasets/processed/data_encoded.rds")

print("Encoding complete. Mapping and encoded dataset saved.")
```
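
The saved mapping is what makes the integer codes interpretable. As a minimal sketch on a made-up factor, the round trip from levels to codes and back looks like this:

```r
# Made-up factor with a missing value
visit_type <- factor(c("Return", "New", "Return", NA))

# Mapping in the same shape as the entries stored in mapping_list
encoding_map <- data.frame(
  Variable = "visit_type",
  Encoded  = seq_along(levels(visit_type)),
  Value    = levels(visit_type),
  stringsAsFactors = FALSE
)

# Encode: integer position of each value among the (alphabetically sorted) levels
encoded <- as.integer(visit_type)

# Decode: look the codes up in the mapping; NA stays NA
decoded <- encoding_map$Value[match(encoded, encoding_map$Encoded)]

print(encoding_map)
print(data.frame(encoded, decoded))
```

Note that the codes follow the order of the factor levels, so they carry no ordinal meaning unless the levels themselves were ordered deliberately.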

```{r retyping, eval=FALSE}
# Load required libraries
library(readr)
library(dplyr)

# Load type mapping
types <- readRDS("datasets/mappings/types.rds")

# Update type mapping
retypes <- types |>
  mutate(Type = case_when(
    Variable == "sdoh_risk_level" ~ "factor",
    Type == "list_as_character" ~ "numeric",
    TRUE ~ Type
  ))

retypes |> saveRDS("datasets/mappings/retypes.rds")

print("Retyped mapping saved.")
```

##### Missing Data Analysis Reassessment

```{r replotting, eval=FALSE}
# Load required libraries
library(readr)

# Load encoded dataset and retyped mapping
data_encoded <- readRDS("datasets/processed/data_encoded.rds")
variable_type_mapping <- readRDS("datasets/mappings/retypes.rds")

# Visualize Missing Data
source("scripts/helper-functions/plot-missing-values.R")

# Encoded Data
plot_missing_values_by_type(data_encoded, 0, 5)
plot_missing_values_by_type(data_encoded, 5, 30)
plot_missing_values_by_type(data_encoded, 30, 50)
plot_missing_values_by_type(data_encoded, 50, 100)

# Capture as HTML
source("scripts/helper-functions/capture-to-html.R")
capture_output_to_html(
  "missing_data_replots.html",
  "Missing Data Plots",
  "Encoded Data" = list(
    "figures/plots/data_encoded_0-5_missing_plot.png",
    "figures/plots/data_encoded_5-30_missing_plot.png",
    "figures/plots/data_encoded_30-50_missing_plot.png",
    "figures/plots/data_encoded_50-100_missing_plot.png"
    )
)
```

```{r, results='asis', echo=FALSE}
cat(readLines("outputs/missing_data_replots.html"), sep = "\n")
```

##### Missing Data Handling Reassessment

```{r rehandling, eval=FALSE}
# Load required libraries
library(readr)

# Load encoded dataset and retyped mapping
data_encoded <- readRDS("datasets/processed/data_encoded.rds")
variable_type_mapping <- readRDS("datasets/mappings/retypes.rds")

# Generate data handling tables with all options
source("scripts/helper-functions/generate-missing-handling-table.R")

encoded_data_handling_table <- generate_data_handling_table(data_encoded)

# Capture as HTML
source("scripts/helper-functions/capture-to-html.R")

capture_output_to_html(
  "encoded_data_handling_options.html",
  "Data Handling Table",
  "Encoded Data" = encoded_data_handling_table
)
```

```{r, results='asis', echo=FALSE}
cat(readLines("outputs/encoded_data_handling_options.html"), sep = "\n")
```

##### Imputation

```{r imputation, eval=FALSE}
# Load required libraries
library(readr)
library(tidyr)
library(dplyr)
library(mice)

# Load encoded dataset
data_encoded <- readRDS("datasets/processed/data_encoded.rds")

# Calculate missingness percentage for numeric variables
missingness_info <- data_encoded |>
  select(where(is.numeric)) |>
  summarise(across(everything(), ~ mean(is.na(.)) * 100)) |>
  pivot_longer(cols = everything(), names_to = "variable", values_to = "missing_percent")

# Categorize variables based on missingness thresholds
missingness_info <- missingness_info |>
  mutate(category = case_when(
    missing_percent < 5 ~ "<5%",
    missing_percent >= 5 & missing_percent < 30 ~ "5-30%",
    missing_percent >= 30 & missing_percent < 50 ~ "30-50%",
    missing_percent >= 50 ~ ">50%"
  ))

# Impute data based on categories
data_imputed <- data_encoded

# Loop through categories and apply appropriate imputation
for (cat in unique(missingness_info$category)) {
  vars_in_category <- missingness_info |>
    filter(category == cat) |>
    pull(variable)
  
  # Skip if no variables fall into the current category
  if (length(vars_in_category) == 0) next
  
  if (cat == "<5%") {
    # Impute with mean
    data_imputed[vars_in_category] <- data_imputed[vars_in_category] |>
      mutate(across(everything(), ~ ifelse(is.na(.), mean(., na.rm = TRUE), .)))
  } else if (cat == "5-30%") {
    # Impute with median
    data_imputed[vars_in_category] <- data_imputed[vars_in_category] |>
      mutate(across(everything(), ~ ifelse(is.na(.), median(., na.rm = TRUE), .)))
  } else if (cat == "30-50%") {
    # Use MICE for imputation
    # Subset the data for variables in this category
    mice_data <- data_imputed[vars_in_category]
    
    # Check if MICE can run
    if (all(sapply(mice_data, function(x) all(is.na(x))))) {
      warning(paste("Skipping MICE for category:", cat, "- no non-NA data available."))
      next
    }
    
    # Run MICE
    mice_imputed <- mice(mice_data, m = 1, method = 'pmm', maxit = 5, seed = 123)
    
    # Extract the completed dataset
    imputed_subset <- complete(mice_imputed)
    
    # Replace in the main dataset
    data_imputed[vars_in_category] <- imputed_subset
  } else if (cat == ">50%") {
    # Handle separately - may decide to drop these columns
    message(paste("Dropping variables with >50% missingness:", paste(vars_in_category, collapse = ", ")))
    data_imputed <- data_imputed |> select(-all_of(vars_in_category))
  }
}

# Save the imputed dataset
data_imputed |> saveRDS("datasets/processed/data_imputed.rds")

# Print completion message
message("Imputation completed and dataset saved to 'datasets/processed/data_imputed.rds'")
```
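
The tiering logic can be checked on a small example before running it on the full dataset. The sketch below uses base R only and invented columns (`low`, `mid`, `high`); it reproduces the thresholds from the chunk above but leaves out the MICE step.

```r
# Toy data: column names and values are invented for illustration
toy <- data.frame(
  low  = c(NA, 1:39),           # 1 of 40 missing  (2.5%)
  mid  = c(rep(NA, 4), 1:36),   # 4 of 40 missing  (10%)
  high = c(rep(NA, 24), 1:16)   # 24 of 40 missing (60%)
)

# Percentage of missing values per column
missing_percent <- colMeans(is.na(toy)) * 100

# Same thresholds as the case_when() above; intervals are closed on the left
tier <- as.character(cut(
  missing_percent,
  breaks = c(-Inf, 5, 30, 50, Inf),
  labels = c("<5%", "5-30%", "30-50%", ">50%"),
  right = FALSE
))
names(tier) <- names(missing_percent)

imputed <- toy
for (v in names(toy)) {
  x <- toy[[v]]
  if (tier[[v]] == "<5%") {
    imputed[[v]][is.na(x)] <- mean(x, na.rm = TRUE)     # mean imputation
  } else if (tier[[v]] == "5-30%") {
    imputed[[v]][is.na(x)] <- median(x, na.rm = TRUE)   # median imputation
  } else if (tier[[v]] == ">50%") {
    imputed[[v]] <- NULL                                # column dropped
  }
  # The "30-50%" tier would be passed to mice::mice() as in the chunk above
}

print(tier)
str(imputed)
```

Using `cut()` with `right = FALSE` keeps the boundary cases (exactly 5%, 30%, or 50% missing) in the same tier as the `case_when()` version.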

##### Preparation Samples

```{r preparation_samples, eval=FALSE}
# Load required libraries
library(readr)
library(dplyr)

# Load preparation datasets
data_aggregated <- readRDS("datasets/processed/data_aggregated.rds")
data_encoded <- readRDS("datasets/processed/data_encoded.rds")
data_imputed <- readRDS("datasets/processed/data_imputed.rds")

set.seed(123)
sample_data_aggregated <- data_aggregated |>
  slice_sample(n = 10)
sample_data_encoded <- data_encoded |>
  slice_sample(n = 10)
sample_data_imputed <- data_imputed |>
  slice_sample(n = 10)

# Capture Data Samples
source("scripts/helper-functions/capture-to-html.R")

capture_output_to_html(
  "preparation_samples.html",
  "Samples",
  "Aggregated Data" = sample_data_aggregated,
  "Encoded Data" = sample_data_encoded,
  "Imputed Data" = sample_data_imputed
)
```

```{r, results='asis', echo=FALSE}
cat(readLines("outputs/preparation_samples.html"), sep = "\n")
```

#### Feature Engineering

The following basic features were obtained:

-   **Visit Count**: The total number of visits each patient made.

-   **First Visit**: The date of the first visit within the dataset's timeframe.

-   **Last Visit**: The date of the last visit within the dataset's timeframe.

-   **Visit Span Years**: The total number of years between the first and last visit, calculated as a continuous variable.

-   **Near Start Boundary**: A boolean indicator showing whether the first visit occurred within the first six months of the study period (six months is the standard OPC timeframe for defining "loss to follow-up").

-   **Near End Boundary**: A boolean indicator showing whether the last visit occurred within the last six months of the study period.

-   **Boundary Flag**: A boolean indicator that is true if either the first or last visit is near the study period boundaries.

-   **Adjusted Span Years**: The number of years between the first and last visit, adjusted for boundary effects.

-   **Visits Per Year**: The average number of visits per year, adjusted for any boundary effects.

```{r feature_engineering, eval=FALSE}
# Load required libraries
library(dplyr)
library(tidyr)
library(lubridate)

# Load datasets
data_encoded <- readRDS("datasets/processed/data_encoded.rds")
encoding <- readRDS("datasets/mappings/encoding.rds")

# Define dataset boundaries
dataset_start <- as.Date("2014-10-01")
dataset_end <- as.Date("2024-09-30")

# Select provider types
selected_providers <- c("Physician", "Psychiatrist", "Resident", "Nurse Practitioner")

# Look up the integer codes for the selected provider types
selected_provider_codes <- encoding$primary_provider_type |>
  filter(Value %in% selected_providers) |>
  pull(Encoded)

# Filter `data_encoded` to visits with the selected provider types
data_filtered <- data_encoded |>
  filter(primary_provider_type %in% selected_provider_codes)

# Feature Extraction: Summarize by patient_id
data_featured <- data_filtered |>
  group_by(patient_id) |>
  summarise(
    visit_count = n(),
    first_visit = min(visit_date, na.rm = TRUE),
    last_visit = max(visit_date, na.rm = TRUE),
    visit_date_min = min(visit_date, na.rm = TRUE),
    visit_date_max = max(visit_date, na.rm = TRUE)
  ) |>
  ungroup()

# Filter patients with more than 2 visits
data_featured <- data_featured |>
  filter(visit_count > 2)

# Feature Engineering: Boundary Handling and Adjusted Metrics
data_featured <- data_featured |>
  mutate(
    visit_span_years = as.numeric(difftime(last_visit, first_visit, units = "days")) / 365.25,
    near_start_boundary = first_visit <= (dataset_start + months(6)), # Within 6 months of start
    near_end_boundary = last_visit >= (dataset_end - months(6)),      # Within 6 months of end
    boundary_flag = near_start_boundary | near_end_boundary,          # Any boundary proximity
    adjusted_span_years = case_when(
      near_start_boundary ~ as.numeric(difftime(last_visit, dataset_start, units = "days")) / 365.25,
      near_end_boundary ~ as.numeric(difftime(dataset_end, first_visit, units = "days")) / 365.25,
      TRUE ~ visit_span_years
    ),
    visits_per_year = case_when(
      adjusted_span_years <= 0.01 ~ NA_real_, # Handle very small spans
      TRUE ~ visit_count / adjusted_span_years
    )
  )

# Merge back with data_encoded
data_final <- data_encoded |>
  left_join(data_featured, by = "patient_id")

# Save the updated dataset
data_final |> saveRDS("datasets/processed/data_featured.rds")
```
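
As a worked example of the boundary adjustment, consider a hypothetical patient whose first visit falls two months after the dataset starts. The dates and visit count below are invented; `seq(..., by = "6 months")` is used as a base R stand-in for `dataset_start + months(6)`.

```r
dataset_start <- as.Date("2014-10-01")
dataset_end   <- as.Date("2024-09-30")

# Hypothetical patient (not from the OPC data)
first_visit <- as.Date("2014-12-01")
last_visit  <- as.Date("2016-12-01")
visit_count <- 12

# Six-month boundary windows
start_cutoff <- seq(dataset_start, by = "6 months", length.out = 2)[2]
end_cutoff   <- seq(dataset_end, by = "-6 months", length.out = 2)[2]

near_start_boundary <- first_visit <= start_cutoff   # TRUE
near_end_boundary   <- last_visit >= end_cutoff      # FALSE

# Unadjusted span: first visit to last visit
visit_span_years <- as.numeric(difftime(last_visit, first_visit, units = "days")) / 365.25

# Adjusted span: the patient may have been in care before the window opened,
# so the span is measured from the dataset start instead of the first visit
adjusted_span_years <- if (near_start_boundary) {
  as.numeric(difftime(last_visit, dataset_start, units = "days")) / 365.25
} else if (near_end_boundary) {
  as.numeric(difftime(dataset_end, first_visit, units = "days")) / 365.25
} else {
  visit_span_years
}

visits_per_year <- visit_count / adjusted_span_years

round(c(unadjusted = visit_count / visit_span_years, adjusted = visits_per_year), 2)
```

For this patient the adjustment lengthens the span from 731 to 792 days, so the adjusted rate (about 5.5 visits per year) is lower than the unadjusted one (about 6.0).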

### Exploratory Data Analysis

#### Data inspection

```{r inspection}
# Load required library
library(readr)

# Load dataset
data_featured <- readRDS("datasets/processed/data_featured.rds")

# Inspect data
dim(data_featured)
head(data_featured)
str(data_featured)
summary(data_featured)
```

#### Feature EDA

The exploratory data analysis of the `visits_per_year` feature reveals a right-skewed distribution: most patients have a moderate visit frequency, while a smaller subset has extremely high frequencies. The statistics are computed after excluding missing values and values of 100 or more visits per year. The minimum is 0.35 visits per year, and the 1st quartile (Q1) is 5.34, meaning 25% of patients have fewer than about 5.34 visits annually. The median is 7.64, so half of the patients have fewer than 8 visits annually. The mean of 12.21 exceeds the median, which indicates that high-frequency outliers (e.g., patients with over 99 visits per year) pull the average upward. The 3rd quartile (Q3) of 12.74 means that 75% of patients have fewer than 13 visits annually.

The histogram shows a steep drop in frequency as visits per year increase, confirming the right-skewness of the data. Meanwhile, the boxplot highlights significant outliers above 25 visits per year, suggesting a small number of high-frequency patients that may require special attention or further investigation. Overall, the majority of patients fall within a reasonable visit range of 5 to 13 visits per year, making this range suitable for regression-based predictive models, while the outliers could be better addressed using classification-based predictive models to identify and analyze their unique characteristics.

```{r eda}
# Load required libraries
library(ggplot2)
library(dplyr)

# Load featured dataset
data_featured <- readRDS("datasets/processed/data_featured.rds")

# Filter out missing or extreme values for cleaner visualization
filtered_data <- data_featured |>
  filter(!is.na(visits_per_year) & visits_per_year < 100)

# Summary statistics
summary_stats <- filtered_data |>
  summarise(
    min = min(visits_per_year, na.rm = TRUE),
    q1 = quantile(visits_per_year, 0.25, na.rm = TRUE),
    median = median(visits_per_year, na.rm = TRUE),
    mean = mean(visits_per_year, na.rm = TRUE),
    q3 = quantile(visits_per_year, 0.75, na.rm = TRUE),
    max = max(visits_per_year, na.rm = TRUE)
  )
print(summary_stats)

# Plot histogram
ggplot(filtered_data, aes(x = visits_per_year)) +
  geom_histogram(bins = 30, fill = "blue", alpha = 0.7, color = "black") +
  labs(
    title = "Distribution of Visits Per Year",
    x = "Visits Per Year",
    y = "Count"
  ) +
  theme_minimal()

# Plot boxplot
ggplot(filtered_data, aes(y = visits_per_year)) +
  geom_boxplot(fill = "red", alpha = 0.7) +
  labs(
    title = "Boxplot of Visits Per Year",
    y = "Visits Per Year"
  ) +
  theme_minimal()
```
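
The outlier threshold visible in the boxplot can be approximated from the quartiles quoted above. `geom_boxplot()` draws whiskers out to 1.5 times the interquartile range by default, so a rough calculation, assuming the reported Q1 and Q3, is:

```r
# Quartiles as reported in the Feature EDA text
q1 <- 5.34
q3 <- 12.74

iqr <- q3 - q1                  # interquartile range: 7.40
upper_fence <- q3 + 1.5 * iqr   # 23.84
lower_fence <- q1 - 1.5 * iqr   # -5.76, below the observed minimum

c(iqr = iqr, upper_fence = upper_fence, lower_fence = lower_fence)
```

Points above roughly 24 visits per year are therefore drawn as outliers, which is in line with the boxplot reading of outliers beginning in the mid-20s. The lower fence is negative, so no low-end outliers are possible for this variable.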

## Results

### Linear Regression

The linear regression model demonstrated a relatively strong adjusted R-squared value of 82.91%, explaining a significant proportion of the variance in `visits_per_year`. However, while this performance metric appears promising, it masks critical limitations in the model's design and implementation that must be addressed to ensure reliability, interpretability, and fairness. A key issue lies in the reliance on principal components (PC1, PC2, PC3, and PC4) as dominant predictors. Although these components capture substantial variance, they abstract away the original variables, making it difficult to draw actionable insights or understand the underlying relationships between the predictors and the outcome. This undermines the interpretability of the model, particularly in practical applications where domain experts require clarity on which features drive predictions.

Several original predictors, such as `legal_sex`, were found to be statistically non-significant, raising questions about their inclusion in the model. Their presence likely adds noise rather than value, potentially diminishing the model's predictive efficiency. Additionally, the statistically significant predictors, including `visit_date`, `primary_payer`, and `primary_payer_financial_class`, may reflect temporal or systemic biases that do not generalize well across populations or timeframes.

Residual diagnostics have not yet been analyzed, which is another significant limitation. Without a thorough evaluation of residuals, it is unclear whether the model adequately captures non-linear relationships or whether heteroscedasticity is present; either would violate key assumptions of linear regression. The skewness and extreme outliers in the target variable, `visits_per_year`, compound these issues. Although a log-transformed copy of the target (`visits_per_year_log`) is computed, the model is fit to the untransformed `visits_per_year`, so it remains susceptible to high-leverage points that could distort coefficients and predictions. In addition, `visits_per_year_log` is not removed from the predictors before PCA, so the principal components may carry information about the target and inflate the apparent fit.

The approach to handling missing data also warrants critical scrutiny. By excluding columns with missing values, the model sacrifices potentially valuable information and may inadvertently introduce bias if the missingness is not random. More sophisticated imputation methods should be considered to preserve data integrity and improve generalizability. Moreover, the exclusion of potentially relevant variables due to multicollinearity, while necessary for linear regression, limits the scope of the analysis and may result in the loss of subtle but important predictors.

Temporal variables, such as `visit_date`, add another layer of complexity. While these predictors capture time-related patterns, they are inherently tied to the specific timeframe of the dataset and may not generalize to future datasets or different populations. This introduces the risk of overfitting and reduces the model's long-term utility. The lack of fairness audits further compounds these concerns, particularly given the inclusion of variables such as `legal_sex`, which may inadvertently encode systemic biases.

While the RMSE of 5.28 suggests a relatively low level of prediction error, this metric alone does not account for the model's underlying weaknesses. The focus on maximizing predictive accuracy has come at the expense of interpretability, fairness, and robustness. Future iterations of the model must address these critical limitations by incorporating alternative modeling approaches, such as tree-based methods or ensemble learning, to capture non-linear relationships and interactions. Finally, a more rigorous evaluation of residuals, handling of outliers, and imputation of missing data are necessary to build a model that is both reliable and equitable. These steps are essential to move beyond performance metrics and create a model that truly adds value to healthcare analytics.

```{r linear_regression}
# Load Required Libraries
library(dplyr)
library(caret)
library(ggplot2)
library(pROC)
library(doParallel)
library(car)

# Load Featured Dataset
data_featured <- readRDS("datasets/processed/data_featured.rds")

# Prepare model data
non_na_cols <- colnames(data_featured)[colSums(is.na(data_featured)) == 0]
non_na_cols <- union(non_na_cols, "visits_per_year") # Ensure `visits_per_year` is included
data_model <- data_featured |>
  select(all_of(non_na_cols)) |>
  filter(!is.na(visits_per_year)) # Remove rows with missing `visits_per_year`

# Exclude `patient_id` and similar unique identifiers
data_model <- data_model |> select(-patient_id)

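# Log-transform `visits_per_year`
# Note: the model below is trained on the untransformed `visits_per_year`;
# `visits_per_year_log` is not used as the outcome and stays among the predictors
data_model <- data_model |>
  mutate(visits_per_year_log = log1p(visits_per_year)) # Use log1p to handle zeros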

# Train-Test Split
set.seed(123)
train_index <- createDataPartition(data_model$visits_per_year, p = 0.8, list = FALSE)
train_data <- data_model[train_index, ]
test_data <- data_model[-train_index, ]

# Remove Low-Variance Features
nzv_metrics <- nearZeroVar(train_data, saveMetrics = TRUE)
retained_features <- colnames(train_data)[!nzv_metrics$nzv] # Columns that are NOT near-zero variance

# Ensure `train_data` and `test_data` are data frames
train_data <- as.data.frame(train_data)
test_data <- as.data.frame(test_data)

# Keep only the retained (non-low-variance) features
train_data <- train_data |> select(all_of(retained_features))
test_data <- test_data |> select(all_of(retained_features))

# Dimensionality Reduction Using PCA
train_visits_per_year <- train_data$visits_per_year
test_visits_per_year <- test_data$visits_per_year
train_data <- train_data[, setdiff(colnames(train_data), "visits_per_year")]
test_data <- test_data[, setdiff(colnames(test_data), "visits_per_year")]

pca_preprocess <- preProcess(as.data.frame(train_data), method = "pca", pcaComp = 30)
train_data_pca <- predict(pca_preprocess, as.data.frame(train_data))
test_data_pca <- predict(pca_preprocess, as.data.frame(test_data))

# Check PCA-transformed datasets
stopifnot(all(colnames(train_data_pca) == colnames(test_data_pca)))

# Add visits_per_year back into PCA-transformed data
train_data_pca$visits_per_year <- train_visits_per_year
test_data_pca$visits_per_year <- test_visits_per_year

# Downsample training data
if (nrow(train_data_pca) > 5000) {
  set.seed(123)
  train_data_pca <- train_data_pca[sample(nrow(train_data_pca), size = 5000), ]
}

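# Replace infinite, NaN, and missing values with 0 before the collinearity checks
train_data_pca <- train_data_pca |>
  mutate(across(everything(), ~ ifelse(is.infinite(.x) | is.nan(.x) | is.na(.x), 0, .x)))

# Address Multicollinearity in Linear Regression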
linear_combos <- findLinearCombos(as.matrix(train_data_pca))
if (!is.null(linear_combos$remove)) {
  train_data_pca <- train_data_pca[, -linear_combos$remove]
}

vif_check <- vif(lm(visits_per_year ~ ., data = train_data_pca))
high_vif_cols <- names(vif_check[vif_check > 10])
if (length(high_vif_cols) > 0) {
  train_data_pca <- train_data_pca[, !colnames(train_data_pca) %in% high_vif_cols]
}

# Ensure consistent column types before training
cols_to_numeric <- c("self_pay", "portal_active_at_scheduling")
for (col in cols_to_numeric) {
  if (col %in% colnames(train_data_pca)) {
    train_data_pca[[col]] <- as.numeric(train_data_pca[[col]])
  }
}

# Train Linear Regression Model
cl <- makeCluster(detectCores() - 1)
registerDoParallel(cl)
lm_model <- train(
  visits_per_year ~ ., 
  data = train_data_pca, 
  method = "lm", 
  trControl = trainControl(method = "cv", number = 5, verboseIter = TRUE)
)
stopCluster(cl)

# Fix mismatched column types in test_data_pca
factor_to_integer_cols <- c("legal_sex", "encounter_type", "primary_benefit_plan", 
                            "primary_payer", "primary_payer_financial_class")
for (col in factor_to_integer_cols) {
  if (col %in% colnames(test_data_pca)) {
    test_data_pca[[col]] <- as.integer(as.factor(test_data_pca[[col]]))
  }
}

if ("visit_date" %in% colnames(test_data_pca)) {
  test_data_pca$visit_date <- as.numeric(test_data_pca$visit_date)
}

for (col in cols_to_numeric) {
  if (col %in% colnames(test_data_pca)) {
    test_data_pca[[col]] <- as.numeric(test_data_pca[[col]])
  }
}

# Ensure consistent column types between training and test datasets
consistent_cols <- intersect(colnames(train_data_pca), colnames(test_data_pca))
factor_and_logical_cols <- consistent_cols[sapply(train_data_pca[consistent_cols], function(col) is.factor(col) || is.logical(col))]
for (col in factor_and_logical_cols) {
  train_data_pca[[col]] <- as.numeric(as.factor(train_data_pca[[col]]))
  test_data_pca[[col]] <- as.numeric(as.factor(test_data_pca[[col]]))
}

# Evaluate Linear Regression Model
lm_predictions <- predict(lm_model, newdata = test_data_pca)
lm_rmse <- sqrt(mean((lm_predictions - test_data_pca$visits_per_year)^2))
cat("Linear Regression RMSE:", lm_rmse, "\n")

# Visualize Regression Results
ggplot(data = NULL, aes(x = test_data_pca$visits_per_year, y = lm_predictions)) +
  geom_point(alpha = 0.7, color = "darkblue") +
  geom_abline(slope = 1, intercept = 0, color = "darkred", linewidth = 1.5) +
  labs(
    title = "Actual vs. Predicted Visits Per Year",
    x = "Actual Visits Per Year",
    y = "Predicted Visits Per Year"
  ) +
  theme_minimal()

# Summary Outputs
cat("Linear Regression Summary:\n")
print(summary(lm_model$finalModel))
```
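
The residual checks noted as outstanding could start along the following lines. This is a sketch on simulated data, not an analysis of the fitted model: the data-generating process is invented so that the variance grows with the predictor, and the test is a Breusch-Pagan-style auxiliary regression written in base R rather than a packaged implementation.

```r
# Simulated data with error variance that grows with x (illustrative only)
set.seed(123)
n <- 500
x <- runif(n, 1, 10)
y <- 2 + 3 * x + rnorm(n, sd = 0.5 * x)

fit <- lm(y ~ x)
res <- residuals(fit)
fit_vals <- fitted(fit)

# Residuals vs fitted: a funnel shape suggests heteroscedasticity,
# a curved band suggests a missed non-linear relationship
plot(fit_vals, res, xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)

# Normal Q-Q plot of residuals
qqnorm(res)
qqline(res)

# Breusch-Pagan-style check: regress squared residuals on the fitted values;
# n * R^2 is compared against a chi-squared distribution with 1 df
aux <- lm(I(res^2) ~ fit_vals)
bp_stat <- n * summary(aux)$r.squared
bp_p <- pchisq(bp_stat, df = 1, lower.tail = FALSE)

cat("BP-style statistic:", round(bp_stat, 2), " p-value:", signif(bp_p, 3), "\n")
```

For the model trained above, `residuals(lm_model$finalModel)` and `fitted(lm_model$finalModel)` would supply the same inputs, since `finalModel` is an ordinary `lm` object.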

This linear regression model is an initial exploratory step in the broader development of predictive modeling capabilities for outpatient psychiatry services. As the presentation deadline approached, the focus was on establishing a foundational understanding of the data's structure and the preliminary relationships between variables. This phase involved setting up basic model configurations, handling multicollinearity, and assessing initial model performance through measures such as RMSE.

Looking ahead, future iterations are planned to significantly expand and refine predictive modeling efforts:

1.  **Feature Engineering Expansion**: A `Resource Utilization Score (RUS)` will be developed as:

    -   A composite score (`cRUS`) for regression models, made of:

        -   Schedule Utilization (`cRUS-S`): such as minutes per week (to be matched with available provider schedules).

        -   Clinical Utilization (`cRUS-C`)

            -   Diagnostic (`RUS-CD`): such as the number of unique diagnoses.
            -   Therapeutic (`RUS-CT`): such as the number of unique medications or procedures.

    -   For classification models by categories:

        -   Low (`RUS-L`)

        -   Medium (`RUS-M`)

        -   High (`RUS-H`)

        -   Very High (`RUS-VH`)

    -   More granular features, including interaction terms between existing variables, time-series analysis features like rolling averages and lag variables, and polynomial features to capture non-linearity, are to be crafted.

2.  **Advanced Machine Learning Algorithms**: Beyond linear regression, an array of sophisticated machine learning algorithms such as Gradient Boosting Machines (GBM), Random Forests, and Deep Learning models are to be evaluated. These models offer the potential for handling complex, non-linear patterns in the data more effectively. Classification models will also be needed to predict high resource utilization.

3.  **Hyperparameter Optimization**: Techniques like grid search, random search, and Bayesian optimization are to be employed to fine-tune model parameters, ensuring optimal performance.

4.  **Dimensionality Reduction Techniques**: In addition to Principal Component Analysis (PCA), methods like t-Distributed Stochastic Neighbor Embedding (t-SNE) and Autoencoders are to be explored for their efficacy in reducing the dimensionality of data while preserving important informational cues.

5.  **Model Ensembles and Stacking**: The implementation of ensemble methods, which combine predictions from multiple models to improve accuracy, and stacking techniques, where a new model is trained to synthesize the output of multiple other models, are to be considered.

6.  **Robust Anomaly Detection**: To improve model reliability, anomaly detection algorithms will be integrated to identify and handle outliers and unusual patterns in the data that could skew predictions.

7.  **Incremental Learning Models**: Models capable of learning incrementally from new data without the need to retrain from scratch are to be investigated to adapt to new trends in patient data over time.

8.  **Time-Series Forecasting Models**: Specific models that are robust in handling sequential data, like ARIMA (AutoRegressive Integrated Moving Average) and LSTM (Long Short-Term Memory) networks, are to be assessed for predicting trends and seasonal variations in appointment scheduling.

9.  **Simulations and Scenario Analysis**: Advanced simulations are to be conducted to understand the potential impacts of various scheduling scenarios. This will help in strategic planning and managing resource allocation under different hypothetical conditions.

10. **Automated Data Pipelines**: To support ongoing analysis, automated data pipelines are to be developed for real-time data ingestion, cleaning, transformation, and loading (ETL). This will facilitate a seamless flow of updated data into the modeling environment.

11. **Custom Model Evaluation Metrics**: Custom metrics that directly relate to clinical outcomes and operational efficiency, such as patient wait times reduction percentage and provider utilization rates, are to be developed and integrated into the model evaluation process.

These technical enhancements aim not only to bolster the predictive accuracy and robustness of the models but also to align them closely with the operational dynamics and clinical decision-making processes within outpatient psychiatry settings.
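
For the categorical form of the Resource Utilization Score described in item 1, one possible binning rule is quartiles of the composite score. The sketch below assumes this rule and uses invented scores, since neither the `cRUS` formula nor the category thresholds have been defined yet.

```r
# Hypothetical composite scores; real cRUS values are not yet defined
crus <- c(0.5, 1.2, 2.8, 3.1, 4.7, 6.0, 7.9, 9.5)

# Quartile cut points (an assumed binning rule, not a project decision)
breaks <- quantile(crus, probs = c(0, 0.25, 0.5, 0.75, 1))

# Map each score to one of the four planned categories
rus_class <- cut(
  crus,
  breaks = breaks,
  labels = c("RUS-L", "RUS-M", "RUS-H", "RUS-VH"),
  include.lowest = TRUE
)

table(rus_class)
```

Quartile bins give balanced classes by construction, which is convenient for training classifiers. Clinically motivated thresholds would produce unbalanced classes but may be easier for clinic staff to interpret.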

## Conclusion {#sec-conclusion}

### Reflections on the Project's Limitations

This project, while comprehensive in its preparation for predicting resource utilization in outpatient psychiatric services, faces several limitations that need addressing in future work. First, the data used in the study is confined to a single healthcare system, which may not fully represent the demographic and clinical variation found across different geographic and healthcare settings. This limitation could affect the generalizability of the predictive models to other psychiatric outpatient services. Additionally, the inherent variability in psychiatric conditions and their treatments poses a challenge to creating universally applicable models. Patient behavior, such as adherence to scheduled visits, can be unpredictable and is influenced by numerous factors that are not always captured in the data.

### Future Directions and Enhancements

To overcome these limitations, future iterations of the project will aim to integrate multi-site data, enhancing the diversity and representativeness of the input data. This approach will likely help in developing more robust models that can adapt to various clinical environments. Furthermore, incorporating real-time data streams could significantly improve the model's responsiveness to changes in patient behavior and clinic operations. The introduction of natural language processing to analyze unstructured data such as clinician notes could also uncover additional insights that would enhance predictive accuracy.

### Anticipated Challenges

The next phases of the project will also address several anticipated challenges. One major challenge is the integration of predictive models into existing electronic health record (EHR) systems, which requires not only technical solutions but also collaboration with IT departments and adherence to strict privacy regulations. Another challenge is potential resistance from clinic staff, who may be skeptical of AI and machine learning-based tools. Addressing these concerns through training and demonstration of value will be crucial for successful implementation.

#### Anticipated Solutions

To address the challenge of integrating predictive models with electronic health records (EHRs), the Epic Cognitive Developer Platform emerges as a pivotal solution. This platform offers a structured approach to embedding advanced analytics directly into the EHR environment, which can significantly streamline the integration process. By leveraging this platform, the models can operate within the existing EHR infrastructure, reducing the need for extensive custom development and simplifying compliance with healthcare data standards and privacy regulations.

The platform's capabilities include seamless access to real-time patient data, which is crucial for the accuracy and relevance of predictive models. It also allows for the deployment of models that can interact directly with clinical workflows, making the insights generated by the models immediately actionable. Furthermore, the platform supports iterative updates and enhancements to the models, facilitating continuous improvement without disrupting the EHR system's core functionalities.

Utilizing the Epic Cognitive Developer Platform not only addresses the technical barriers to EHR integration but also helps ensure that the models are scalable and adaptable to different healthcare settings. This makes it an essential component in overcoming one of the major hurdles in the widespread adoption of AI-driven tools in healthcare.

### Closing Thoughts

Despite these challenges, the potential of predictive modeling to enhance resource utilization in psychiatric outpatient services is vast. By continuously refining the models and incorporating feedback from end-users, the project aims to create a dynamic tool that supports clinical decision-making and improves patient care. The ultimate goal is not only to optimize scheduling and resource allocation but also to contribute to the broader field of healthcare analytics by providing a template for similar initiatives in other specialties. The journey of this project highlights the importance of interdisciplinary collaboration, continual learning, and adaptation to technological advancements in healthcare.

## Appendix

### Helper Functions

#### Function to Capture Output as HTML {#capture-output-as-html}

``` r
{{< include scripts/helper-functions/capture-to-html.R >}}
```

#### Function to Convert Data Types

``` r
{{< include scripts/helper-functions/convert-types.R >}}
```

#### Function to Update Variable Type Mapping {#update-variable-type-mapping}

``` r
{{< include scripts/helper-functions/update-type-mapping.R >}}
```

#### Function to Plot Missing Values by Type {#plot-missing-values-by-type}

``` r
{{< include scripts/helper-functions/plot-missing-values.R >}}
```

#### Function to Calculate Summary Statistics {#calculate-summary-stats}

``` r
{{< include scripts/helper-functions/calculate-summary-stats.R >}}
```

#### Function to Generate Missing Data Handling Strategies {#generate-data-handling-table}

``` r
{{< include scripts/helper-functions/generate-missing-handling-table.R >}}
```