Hospital Readmission Prediction based #484

Open
wants to merge 1 commit into
base: master
96 changes: 46 additions & 50 deletions README.md
@@ -1,100 +1,96 @@
# healthcareai

[![Code Health](https://landscape.io/github/HealthCatalyst/healthcareai-py/master/landscape.svg?style=flat)](https://landscape.io/github/HealthCatalyst/healthcareai-py/master)
[![Appveyor build status](https://ci.appveyor.com/api/projects/status/github/HealthCatalyst/healthcareai-py?branch=master&svg=true)](https://ci.appveyor.com/project/CatalystAdmin/healthcareai-py/branch/master)
[![Build Status](https://travis-ci.org/HealthCatalyst/healthcareai-py.svg?branch=master)](https://travis-ci.org/HealthCatalyst/healthcareai-py)
<!--[![Anaconda-Server Badge](https://anaconda.org/catalyst/healthcareai/badges/version.svg)](https://anaconda.org/catalyst/healthcareai)
[![Anaconda-Server Badge](https://anaconda.org/catalyst/healthcareai/badges/installer/conda.svg)](https://conda.anaconda.org/catalyst)-->
[![PyPI version](https://badge.fury.io/py/healthcareai.svg)](https://badge.fury.io/py/healthcareai)
[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.999010.svg)](https://doi.org/10.5281/zenodo.999010)
[![Code Health](https://landscape.io/github/HealthCatalyst/healthcareai-py/master/landscape.svg?style=flat)](https://landscape.io/github/HealthCatalyst/healthcareai-py/master)
[![Appveyor build status](https://ci.appveyor.com/api/projects/status/github/HealthCatalyst/healthcareai-py?branch=master&svg=true)](https://ci.appveyor.com/project/CatalystAdmin/healthcareai-py/branch/master)
[![Build Status](https://travis-ci.org/HealthCatalyst/healthcareai-py.svg?branch=master)](https://travis-ci.org/HealthCatalyst/healthcareai-py)
[![PyPI version](https://badge.fury.io/py/healthcareai.svg)](https://badge.fury.io/py/healthcareai)
[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.999010.svg)](https://doi.org/10.5281/zenodo.999010)
[![GitHub license](https://img.shields.io/badge/license-MIT-blue.svg)](https://raw.githubusercontent.com/HealthCatalyst/healthcareai-py/master/LICENSE)

The aim of **healthcareai** is to streamline machine learning in healthcare. The package has two main goals:

- Allow one to easily create models based on tabular data, and deploy a best model that pushes predictions to a database such as MSSQL, MySQL, SQLite or csv flat file.
- Provide tools related to data cleaning, manipulation, and imputation.
- Allow one to easily create models based on tabular data, and deploy a best model that pushes predictions to a database such as MSSQL, MySQL, SQLite, or CSV flat file.
- Provide tools related to data cleaning, manipulation, and imputation.

## Installation

### Windows

- If you haven't, install 64-bit Python 3.5 via [the Anaconda distribution](https://repo.continuum.io/archive/Anaconda3-4.2.0-Windows-x86_64.exe)
- **Important** When prompted for the **Installation Type**, select **Just Me (recommended)**. This makes permissions later in the process much simpler.
- **Important** When prompted for the **Installation Type**, select **Just Me (recommended)**. This makes permissions later in the process much simpler.
- Open the terminal (i.e., CMD or PowerShell, if using Windows)
- Run `conda install pyodbc`
- Upgrade to latest scipy (note that upgrade command took forever)
- Run `conda remove scipy`
- Run `conda install scipy`
- Upgrade to the latest version of scipy
- Run `conda remove scipy`
- Run `conda install scipy`
- Run `conda install scikit-learn`
- Install healthcareai using **one and only one** of these three methods (ordered from easiest to hardest).
<!--1. **Recommended:** Install the latest release with conda by running `conda install -c catalyst healthcareai`-->
2. **Recommended:** Install the latest release with pip run `pip install healthcareai`
3. If you know what you're doing, and instead want the bleeding-edge version direct from our github repo, run `pip install https://github.com/HealthCatalyst/healthcareai-py/zipball/master`

#### Why Anaconda?

We recommend using the Anaconda python distribution when working on Windows. There are a number of reasons:
- When running anaconda and installing packages using the `conda` command, you don't need to worry about [dependency hell](https://en.wikipedia.org/wiki/Dependency_hell), particularly because packages aren't compiled on your machine; `conda` installs pre-compiled binaries.
- A great example of the pain the using `conda` saves you is with the python package **scipy**, which, by [their own admission](http://www.scipy.org/scipylib/building/windows.html) *"is difficult"*.
- **Recommended:** Install the latest release with pip by running `pip install healthcareai`
- If you know what you're doing, and instead want the bleeding-edge version direct from our GitHub repo, run `pip install https://github.com/HealthCatalyst/healthcareai-py/zipball/master`

### Linux

You may need to install the following dependencies:
- `sudo apt-get install python-tk`
- `sudo pip install pyodbc`
- Note you'll might run into trouble with the `pyodbc` dependency. You may first need to run `sudo apt-get install
unixodbc-dev` then retry `sudo pip install pyodbc`. Credit [stackoverflow](http://stackoverflow.com/questions/2960339/unable-to-install-pyodbc-on-linux)
- You might run into trouble with the `pyodbc` dependency. If so, try running `sudo apt-get install unixodbc-dev` then retry `sudo pip install pyodbc`.

Once you have the dependencies satisfied run `pip install healthcareai` or `sudo pip install healthcareai`
Once you have the dependencies satisfied, run `pip install healthcareai` or `sudo pip install healthcareai`.

### macOS

- `pip install healthcareai` or `sudo pip install healthcareai`

### Linux and macOS (via docker)
### Linux and macOS (via Docker)

- Install [docker](https://docs.docker.com/engine/installation/)
- Clone this repo (look for the green button on the repo main page)
- cd into the cloned directory
- run `docker build -t healthcareai .`
- run the docker instance with `docker run -p 8888:8888 healthcareai`
- You should then have a jupyter notebook available on `http://localhost:8888`.
- Run `docker build -t healthcareai .`
- Run the docker instance with `docker run -p 8888:8888 healthcareai`
- You should then have a Jupyter notebook available on `http://localhost:8888`.

### Verify Installation

To verify that *healthcareai* installed correctly, open a terminal and run `python`. This opens an interactive python
console (also known as a [REPL](https://en.wikipedia.org/wiki/Read%E2%80%93eval%E2%80%93print_loop)). Then enter this
command: `from healthcareai import SupervisedModelTrainer` and hit enter. If no error is thrown, you are ready to rock.
To verify that *healthcareai* installed correctly, open a terminal and run `python`. This opens an interactive Python console (also known as a [REPL](https://en.wikipedia.org/wiki/Read%E2%80%93eval%E2%80%93print_loop)). Then enter this command: `from healthcareai import SupervisedModelTrainer` and hit enter. If no error is thrown, you are ready to rock.

If you did get an error, or run into other installation issues, please [let us know](http://healthcare.ai/contact.html)
or better yet post on [Stack Overflow](http://stackoverflow.com/questions/tagged/healthcare-ai) (with the healthcare-ai
tag) so we can help others along this process.
If you did get an error, or run into other installation issues, please [let us know](http://healthcare.ai/contact.html) or better yet post on [Stack Overflow](http://stackoverflow.com/questions/tagged/healthcare-ai) (with the healthcare-ai tag) so we can help others along this process.

## Getting started
---

## Getting Started

1. Read through the [Getting Started](http://healthcareai-py.readthedocs.io/en/latest/getting_started/) section of the [healthcareai-py](http://healthcareai-py.readthedocs.io/en/latest/) documentation.

2. Read through the example files to learn how to use the healthcareai-py API.
* For examples of how to train and evaluate a supervised model, inspect and run either `example_regression_1.py` or `example_classification_1.py` using our sample diabetes dataset.
* For examples of how to use a model to make predictions, inspect and run either `example_regression_2.py` or `example_classification_2.py` after running one of the first examples.
* For examples of more advanced use cases, inspect and run `example_advanced.py`.
- For examples of how to train and evaluate a supervised model, inspect and run either `example_regression_1.py` or `example_classification_1.py` using our sample diabetes dataset.
- For examples of how to use a model to make predictions, inspect and run either `example_regression_2.py` or `example_classification_2.py` after running one of the first examples.
- For examples of more advanced use cases, inspect and run `example_advanced.py`.

3. To train and evaluate your own model, modify the queries and parameters in either `example_regression_1.py` or `example_classification_1.py` to match your own data.

4. Decide what type of prediction output you want. See [Choosing a Prediction Output Type](http://healthcareai-py.readthedocs.io/en/latest/prediction_types/) for details.

5. Set up your database tables to match the schema of the output type you chose.
* If you are working in a Health Catalyst EDW ecosystem (primarily MSSQL), please see the [Health Catalyst EDW Instructions](http://healthcareai-py.readthedocs.io/en/latest/catalyst_edw_instructions/) for setup.
* Otherwise, please see [Working With Other Databases](http://healthcareai-py.readthedocs.io/en/latest/databases/)
for details about writing to different databases (MSSQL, MySQL, SQLite, CSV)
- If you are working in a Health Catalyst EDW ecosystem (primarily MSSQL), please see the [Health Catalyst EDW Instructions](http://healthcareai-py.readthedocs.io/en/latest/catalyst_edw_instructions/) for setup.
- Otherwise, please see [Working With Other Databases](http://healthcareai-py.readthedocs.io/en/latest/databases/) for details about writing to different databases (MSSQL, MySQL, SQLite, CSV).

6. Congratulations! After running one of the example files with your own data, you should have a trained model. To use your model to make predictions, modify either `example_regression_2.py` or `example_classification_2.py` to use your new model. You can then run it to see the results.

---

## Logistic Regression for Hospital Readmission Prediction

### Overview
A new feature implements a **Logistic Regression model** that predicts hospital readmission from patient data. The model is trained on a dataset of diabetic patients and uses features such as race, gender, age, admission type, discharge disposition, and medical history. The script labels any readmission as positive (every `readmitted` value other than `NO`), not only readmissions within 30 days.
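
As a rough illustration of the target encoding, the sketch below builds a binary label from a `readmitted` column in two ways. The sample rows are made up, and the sketch assumes the CSV codes readmission as `NO`, `<30`, or `>30`; only the "any readmission" variant corresponds to what `hospital_readmission_prediction.py` does.

```python
import pandas as pd

# Hypothetical sample rows; assumes the CSV codes readmission as NO, <30, or >30
sample = pd.DataFrame({"readmitted": ["NO", "<30", ">30", "NO", "<30"]})

# Any readmission counts as positive (the encoding used by the script in this PR)
any_readmission = (sample["readmitted"] != "NO").astype(int)

# Alternative: only readmissions within 30 days count as positive
within_30_days = (sample["readmitted"] == "<30").astype(int)

print(any_readmission.tolist())
print(within_30_days.tolist())
```

The two encodings answer different questions, so the one chosen should match how the model is described to users.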

### How to Use the Hospital Readmission Prediction Model

6. Congratulations! After running one of the example files with your own data, you should have a trained model. To use your model to make predictions, modify either `example_regression_2.py` or `example_classification_2.py` to use your new model. You can then run it to see the results.
1. **Dataset**:
Ensure you have the **Diabetic Hospital Readmission Dataset** available. The script reads it from `diabetic_data.csv` in the current working directory.

## For Issues
2. **Running the Model**:
The model is implemented in the Python script `hospital_readmission_prediction.py`. You can run this script to train and test the model as follows:

- Double check that the code follows the examples [here](http://healthcareai-py.readthedocs.io/en/latest/)
- If you're still seeing an error, create a post in [Stack Overflow](http://stackoverflow.com/questions/tagged/healthcare-ai) (with the healthcare-ai tag) that contains
* Details on your environment (OS, database type, R vs Py)
* Goals (ie, what are you trying to accomplish)
* Crystal clear steps for reproducing the error
- You can also log a new issue in the GitHub repo by clicking [here](https://github.com/HealthCatalyst/healthcareai-py/issues/new)
```bash
python hospital_readmission_prediction.py
```
81 changes: 81 additions & 0 deletions hospital_readmission_prediction.py
@@ -0,0 +1,81 @@
# -*- coding: utf-8 -*-
"""hospital_readmission_prediction.py

This script performs data preprocessing and trains a logistic regression model
to predict hospital readmission based on patient data.
"""

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# Load the dataset
data = pd.read_csv('diabetic_data.csv')
print("Dataset loaded successfully!")
print(data.head())

# Replace '?' placeholders with missing values, then forward-fill them.
# DataFrame.ffill() replaces the deprecated fillna(method='ffill').
data = data.replace('?', pd.NA).ffill()

# One-hot encode 'race', 'gender', and 'age' if all three are still present
categorical_columns = ['race', 'gender', 'age']

if all(col in data.columns for col in categorical_columns):
    data = pd.get_dummies(data, columns=categorical_columns)
    print(data.head())

# Applying pd.get_dummies to the identified non-numeric columns
non_numeric_columns = ['weight', 'diag_1', 'diag_2', 'diag_3', 'max_glu_serum', 'A1Cresult',
'metformin', 'repaglinide', 'nateglinide', 'chlorpropamide',
'glimepiride', 'acetohexamide', 'glipizide', 'glyburide', 'tolbutamide',
'pioglitazone', 'rosiglitazone', 'acarbose', 'miglitol', 'troglitazone',
'tolazamide', 'examide', 'citoglipton', 'insulin', 'glyburide-metformin',
'glipizide-metformin', 'glimepiride-pioglitazone', 'metformin-rosiglitazone',
'metformin-pioglitazone', 'change', 'diabetesMed']

# Applying pd.get_dummies to convert these non-numeric columns to numerical format
data = pd.get_dummies(data, columns=non_numeric_columns)
print(data.head())

# Define the target variable
y = data['readmitted']

# Convert 'readmitted' column to binary: NO = 0, all others = 1
y = y.apply(lambda x: 0 if x == 'NO' else 1)

# Define the features (exclude 'readmitted' and any other irrelevant columns)
X = data.drop(columns=['readmitted', 'encounter_id', 'patient_nbr', 'payer_code', 'medical_specialty']) # Exclude any other irrelevant columns

# Display the shape of X and y to confirm the setup
print("Features shape:", X.shape)
print("Target shape:", y.shape)

# Split the data into training and test sets (70% training, 30% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Display the shapes of the split data
print("Training set shape:", X_train.shape, y_train.shape)
print("Test set shape:", X_test.shape, y_test.shape)

# Initialize the Logistic Regression model
model = LogisticRegression(max_iter=2000)

# Train the model
model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = model.predict(X_test)

from sklearn.metrics import accuracy_score

# Print accuracy score
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy * 100:.2f}%")
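
Accuracy alone can be misleading when the classes are imbalanced, and `classification_report` is imported in the script above but never called. A minimal sketch of a fuller evaluation follows. It uses synthetic data from `make_classification` as a stand-in for the real dataset; the data, the class weights, and the scaling pipeline are illustrative assumptions, not part of this PR. Scaling features before logistic regression is a common way to help the solver converge.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the readmission data (assumption)
X, y = make_classification(
    n_samples=1000, n_features=20, weights=[0.8, 0.2], random_state=42
)

# stratify=y keeps the class ratio the same in the training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Standardize the features, then fit the logistic regression
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=2000))
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)

# Report per-class precision, recall, and F1 alongside accuracy
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)
print(f"Accuracy: {accuracy * 100:.2f}%")
print(report)
```
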
42 changes: 42 additions & 0 deletions test_hospital_readmission.py
@@ -0,0 +1,42 @@
# -*- coding: utf-8 -*-
"""
test_hospital_readmission.py

Unit tests for the hospital readmission prediction model.
"""

import unittest
import pandas as pd
from hospital_readmission_prediction import model, X_train, y_train, X_test, y_test


class TestHospitalReadmission(unittest.TestCase):
    """
    This class tests the data loading and the trained model
    from the hospital_readmission_prediction.py script.
    """

    @classmethod
    def setUpClass(cls):
        # Load the dataset
        cls.data = pd.read_csv('diabetic_data.csv')

    def test_data_loading(self):
        """Test if the data is loaded correctly"""
        self.assertFalse(self.data.empty, "Dataset is empty!")

    def test_model_training(self):
        """Test if the logistic regression model is trained correctly"""
        # Ensure the model can be fitted; a fitted model exposes coef_
        model.fit(X_train, y_train)
        self.assertTrue(hasattr(model, "coef_"), "Model training failed.")

    def test_model_accuracy(self):
        """Test if the model achieves a reasonable accuracy"""
        # Score the model on the held-out test set
        accuracy = model.score(X_test, y_test)
        self.assertGreaterEqual(accuracy, 0.5, "Model accuracy is too low!")


if __name__ == '__main__':
    unittest.main()
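
Because these tests import from the script, the full training run executes at import time, and the tests cannot run without `diabetic_data.csv`. One possible way around this is sketched below: wrap the encoding and fitting in a function and test it on a small synthetic frame. The helper `build_model`, the column names, and the sample rows are all hypothetical and do not exist in this PR.

```python
import unittest

import pandas as pd
from sklearn.linear_model import LogisticRegression


def build_model(frame, target="readmitted"):
    """Hypothetical helper: one-hot encode the features and fit a logistic regression."""
    y = (frame[target] != "NO").astype(int)
    X = pd.get_dummies(frame.drop(columns=[target]), dtype=float)
    model = LogisticRegression(max_iter=2000)
    model.fit(X, y)
    return model, X, y


class TestBuildModel(unittest.TestCase):
    def setUp(self):
        # Tiny made-up frame, so the test does not depend on the CSV file
        self.frame = pd.DataFrame({
            "gender": ["Male", "Female"] * 10,
            "time_in_hospital": list(range(1, 21)),
            "readmitted": ["NO"] * 10 + ["<30"] * 5 + [">30"] * 5,
        })

    def test_fit_and_predict(self):
        model, X, y = build_model(self.frame)
        # A fitted LogisticRegression exposes coef_
        self.assertTrue(hasattr(model, "coef_"))
        preds = model.predict(X)
        self.assertEqual(len(preds), len(y))
        self.assertTrue(set(preds) <= {0, 1})


# Run the suite programmatically so the result can be inspected
suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestBuildModel)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```
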