This is an assignment given in The Open University of Israel's course: Data Mining, in which I was asked to define the problem, preprocess the data, and solve the problem using classification algorithms.
The purpose of the data mining process regarding CKD is to reveal patterns, dependencies or hidden meanings in the following set of features describing people's health conditions. Using statistical tools, algorithms and models, we can examine the influence of the given 24 attributes on the likelihood of being diagnosed with CKD. This process will help us understand the factors affecting kidney disease, diagnose it earlier, and match the best treatment method, or even prevent the disease in the future.
# | Attribute name | Description | Data type | Unit measures | Range of values | Average | Standard deviation |
---|---|---|---|---|---|---|---|
1 | age | Age | numeric | years | 2-90 | 51.48338 | 17.16971 |
2 | bp | Blood pressure | numeric | Mm/Hg | 50-180 | 76.46907 | 13.68364 |
3 | sg | Specific gravity | nominal | | {1.005, 1.010, 1.015, 1.020, 1.025} | | |
4 | al | Albumin | nominal | | {0, 1, 2, 3, 4, 5} | | |
5 | su | Sugar | nominal | | {0, 1, 2, 3, 4, 5} | | |
6 | rbc | Red blood cells | nominal | | {Normal, Abnormal} | | |
7 | pc | Pus cell | nominal | | {Normal, Abnormal} | | |
8 | pcc | Pus cell clumps | nominal | | {Present, Not present} | | |
9 | ba | Bacteria | nominal | | {Present, Not present} | | |
10 | bgr | Blood glucose random | numeric | Mgs/Dl | 22-490 | 148.0365 | 72.28171 |
11 | bu | Blood urea | numeric | Mgs/Dl | 1.5-391 | 57.42572 | 50.50301 |
12 | sc | Serum creatinine | numeric | Mgs/Dl | 0.4-76 | 3.072454 | 5.741126 |
13 | sod | Sodium | numeric | mEq/L | 4.5-163 | 137.5288 | 10.40875 |
14 | pot | Potassium | numeric | mEq/L | 2.5-47 | 4.627244 | 3.19304 |
15 | hemo | Hemoglobin | numeric | Gms | 3.1-17.8 | 12.52644 | 2.912587 |
16 | pcv | Packed cell volume | numeric | | 9-54 | 38.8845 | 8.990105 |
17 | wc | White blood cell count | numeric | Cells/cumm | 2200-26400 | 8406.122 | 2944.474 |
18 | rc | Red blood cell count | numeric | Millions/cmm | 2.1-8 | 4.707435 | 1.025323 |
19 | htn | Hypertension | nominal | | {Yes, No} | | |
20 | dm | Diabetes mellitus | nominal | | {Yes, No} | | |
21 | cad | Coronary artery disease | nominal | | {Yes, No} | | |
22 | appet | Appetite | nominal | | {Good, Poor} | | |
23 | pe | Pedal edema | nominal | | {Yes, No} | | |
24 | ane | Anemia | nominal | | {Yes, No} | | |
Prediction of the presence of CKD, based on a patient's different medical indices.
The data is taken from a single dataset from Bangladesh, and therefore there is no need for integration from different sources.
The data contains missing values, so we want to fix the dataset by deleting observations, filling those values with close values, or by smoothing. In this case, I chose to fill in missing values using the mean (for continuous values) and the mode (for discrete values).
Analyzing whether there is a need for data reduction, by deleting observations or irrelevant features, as well as discretizing and normalizing the data. Note that some values greatly exceed the standard deviation, so we fix them.
I've used `Excel` for organising the data, as well as `Python` for data processing and running ML algorithms.
In order for the data to fit the different classification algorithms, we will normalize the values so that no single feature carries greater weight in the classification.
Run the chosen classification algorithms on the optimized data, using `Python`.
Analyzing the results derived from the classification algorithms and assessing them based on metrics such as accuracy, relevance and simplicity, through the use of statistical data analysis.
After examination, the results can be either of the following:
- The results are satisfactory - in this case, we move forward to the next step: conclusions.
- The results aren't satisfactory - in this case, we repeat the process while changing some parts of it (e.g. data deletion, discretization, different transformations, different algorithms, etc.). This is repeated until a model is found that satisfies the needs of the task, or until it is decided to stop the process.
Using the optimal model found to predict the presence of CKD in future patients. Depending on the chosen model, we can present it visually using a mathematical formula, an inference rule, a decision tree or the source code of the algorithm.
A statistical model employed for forecasting a categorical class from either categorical or numerical attributes, which provides the likelihood of association with a particular class.
- Advantages - easy to implement and use, and conveniently presentable. Moreover, the algorithm does not require any prior assumptions about the distribution of the random variables in the feature space, and can handle both numerical and categorical values.
- Disadvantages - the model tends to overfit when the number of examples is small compared to the number of features, and to underfit when the features are too few to capture the underlying relationship. In addition, the model is not recommended when the features are collinear.
A supervised learning model designed for classification tasks, which iteratively splits the data on the features according to the information gain measure.
- Constraints - a discretization task must be done for each continuous feature.
- Advantages - easy to implement and use, and conveniently presentable. Moreover, the algorithm can handle noisy data and missing values, by pruning the tree.
- Disadvantages - the information gain metric generally favors attributes with a broader spectrum of values (see the formula below). Moreover, the algorithm can be quite demanding in terms of computational resources.
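For reference, a standard entropy-based formulation of the information gain criterion mentioned above:

$$IG(S, A) = H(S) - \sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|}\, H(S_v), \qquad H(S) = -\sum_{c} p_c \log_2 p_c$$

where $S$ is the set of examples at a node, $A$ is a candidate attribute, $S_v$ is the subset of $S$ taking value $v$ for $A$, and $p_c$ is the proportion of examples of class $c$ in $S$. Attributes with many distinct values tend to produce many small, pure subsets $S_v$, which is the source of the bias noted above.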
An algorithm that constructs a binary decision tree, which iteratively divides the data into two based on the features, utilizing the Gini index metric, so that each split results in two subgroups that are as similar in size as feasible.
- Advantages - can manage both numerical and categorical data, along with missing values. Furthermore, the resulting tree is typically more concise compared to other decision tree algorithms.
- Disadvantages - the Gini index measure tends to prefer features with a larger range of values (see the formula below), and the running time is longer compared to other decision tree algorithms.
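For reference, the Gini-based split criterion can be written as:

$$\mathrm{Gini}(S) = 1 - \sum_{c} p_c^2, \qquad \mathrm{Gini}_{\mathrm{split}}(S, A) = \sum_{v} \frac{|S_v|}{|S|}\,\mathrm{Gini}(S_v)$$

where $p_c$ is the proportion of examples of class $c$ in node $S$; the algorithm chooses the binary split that minimizes the weighted Gini of the two resulting subgroups.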
A classifier that combines the outcomes of numerous decision trees to attain the best possible results.
- Advantages - can manage high-dimensional, extensive datasets, including those with missing values. It also has the ability to decrease the data's variance, thereby preventing overfitting, by averaging the predictions generated by the various trees it constructs.
- Disadvantages - an algorithm that operates slowly, particularly on extensive datasets, and it has a tendency to favor features with a greater number of levels. Additionally, this classifier is not as easily representable as a single decision tree.
Given that the dataset contains 24 features, any example with more than a third (8) of its features missing will be removed. Otherwise, the missing values will be replaced with the mean (for numerical values) or the mode (for categorical values).
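A minimal `pandas` sketch of this cleaning rule, assuming the raw data is loaded into a DataFrame with the 24 feature columns plus a 'class' column (the file and column names are illustrative assumptions):

```python
import pandas as pd

df = pd.read_csv("kidney_disease.csv")  # assumed file name

# Drop examples with more than a third (8) of the 24 features missing
feature_cols = [c for c in df.columns if c != "class"]
df = df[df[feature_cols].isna().sum(axis=1) <= 8]

# Impute the rest: mean for numeric columns, mode for categorical ones
for col in feature_cols:
    if pd.api.types.is_numeric_dtype(df[col]):
        df[col] = df[col].fillna(df[col].mean())
    else:
        df[col] = df[col].fillna(df[col].mode()[0])
```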
Transform all categorical values into numeric variables based on their value (for instance, 'yes' will be mapped to a numeric code), and store the values as `float` type (real numbers).
Implement transformations in accordance with the selected classification algorithm: for the `Logistic Regression` algorithm, normalize the data using the `min-max` method, so that the new range will be [0,1]; for the `Random Forest` algorithm, discretize the numerical values by splitting the data into equal-frequency intervals, so that each group of values gets equal representation (a sketch of both transformations is given below).
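A sketch of the two algorithm-specific transformations, assuming `df` from the previous step; the number of bins used for the equal-frequency discretization is an illustrative assumption:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Encode any remaining categorical columns as numeric codes
cat_cols = df.select_dtypes(exclude="number").columns.drop("class", errors="ignore")
for col in cat_cols:
    df[col] = df[col].astype("category").cat.codes.astype(float)

num_cols = [c for c in df.columns if c != "class"]

# Logistic Regression variant: min-max normalization to [0, 1]
df_lr = df.copy()
df_lr[num_cols] = MinMaxScaler().fit_transform(df_lr[num_cols])

# Random Forest variant: equal-frequency discretization into (e.g.) 5 bins
df_rf = df.copy()
for col in num_cols:
    df_rf[col] = pd.qcut(df_rf[col], q=5, labels=False, duplicates="drop")
```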
Furthermore, we'll consider the possibility of errors in the data entry process, such as an extra '0' added, or a missed decimal point, etc. Hence, for features with a high standard deviation or abnormal minimum/maximum values relative to the mean, we'll manually adjust their values.
The chosen techniques are `Logistic Regression` and `Random Forest`.
A classification algorithm that estimates the likelihood of membership in a certain prediction class, in our scenario the presence of chronic kidney disease. The algorithm examines the relationships between the class and one or more independent variables (we'll operate under the assumption that the features are indeed independent). The model employs a logistic function to model the conditional probabilities, i.e., it computes the probability of class membership given the values of the features, as shown below.
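Concretely, the conditional probability is modeled with the logistic (sigmoid) function:

$$P(y = 1 \mid \mathbf{x}) = \sigma(\mathbf{w}^\top \mathbf{x} + b) = \frac{1}{1 + e^{-(\mathbf{w}^\top \mathbf{x} + b)}}$$

where $\mathbf{x}$ is the feature vector, $\mathbf{w}$ the learned weights and $b$ the bias; an example is assigned to the CKD class when this probability exceeds a chosen threshold (commonly 0.5).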
An ensemble algorithm that combines the outcomes achieved by executing numerous decision trees, and delivers an optimal classification based on the majority vote from the results of the decision trees. In the majority of instances, the classification obtained is more precise than that derived from a single decision tree.
For both techniques, I employed 10-fold cross-validation and bootstrap sampling to obtain more precise predictions. Each method was trained using the same examples, but the examples were processed differently for each algorithm.
The algorithm seeks a set of weights that minimizes the prediction error. After data cleansing, we'll normalize the data using the `min-max` method to ensure all values fall within the [0,1] range. We'll then use `Gradient Descent` to find the optimal weights that minimize the in-sample error, defined as the least squares loss (l2-loss). Additionally, we'll employ regularization to prevent overfitting.
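A hedged sketch of this setup with scikit-learn, assuming `X` holds the min-max-normalized features and `y` the class labels; note that `LogisticRegression` minimizes the regularized log-loss rather than a least-squares loss, so this is an approximation of the setup described above, not its exact implementation:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# L2-regularized logistic regression; C is the inverse regularization strength
log_reg = LogisticRegression(penalty="l2", C=1.0, max_iter=1000)

# 10-fold cross-validation, as described above
scores = cross_val_score(log_reg, X, y, cv=10, scoring="accuracy")
print("Mean CV accuracy:", scores.mean())
```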
The data will be discretized by binning into equal-frequency intervals, so that every value group of a feature gets equal representation. I decided that in every run of the algorithm, 100 decision trees will be created, and each split will consider at most the square root of the total number of features. The splitting criterion will be based on the Gini index.
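A minimal sketch of the described forest configuration, assuming `X_binned` holds the equal-frequency-discretized features and `y` the labels (the random seed is an arbitrary assumption):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# 100 trees, Gini splitting criterion, at most sqrt(n_features) per split,
# with bootstrap sampling of the training examples for each tree
forest = RandomForestClassifier(
    n_estimators=100,
    criterion="gini",
    max_features="sqrt",
    bootstrap=True,
    random_state=42,
)

scores = cross_val_score(forest, X_binned, y, cv=10, scoring="accuracy")
print("Mean CV accuracy:", scores.mean())
```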
The selected algorithm is `Naïve-Bayes`, due to its relative simplicity when compared to the `Bayesian network`. The latter accounts for all feature interdependencies, while `Naïve-Bayes` operates under the assumption of feature independence, yet still delivers commendable results in practice. A `Bayesian network` requires a larger database to infer feature dependencies. Given that the preprocessed database only contains 378 examples, the `Naïve-Bayes` algorithm, which can handle smaller databases, is the preferred choice. Furthermore, as the `Bayesian network` necessitates numerous computations, opting for `Naïve-Bayes` provides a simpler algorithm with significantly faster training and execution times.
However, `Naïve-Bayes` performance could be compromised if there are correlated features, prompting us to carry out dimensionality reduction (for instance, by using the `PCA` algorithm) to obtain a dataset with reduced correlation.
The algorithm computes the conditional probabilities of each feature given the classification class, and then, for any new unseen example, it calculates the product of the relevant conditional probabilities for each class. The class that yields the highest probability becomes the final classification of the algorithm.
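A small sketch of this computation using scikit-learn's `GaussianNB`; the choice of a Gaussian likelihood is an assumption, as is the prior split into `X_train`, `X_test`, `y_train`, `y_test`:

```python
from sklearn.naive_bayes import GaussianNB

nb = GaussianNB()
nb.fit(X_train, y_train)

# For each new example, the model combines the per-feature conditional
# likelihoods with the class prior and picks the class with the highest posterior
posteriors = nb.predict_proba(X_test)   # P(class | features) for each class
predictions = nb.predict(X_test)        # argmax over the posteriors
```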
`K-NN` is a classification algorithm that views the data as points within a space whose dimension equals the number of features.
This algorithm doesn't necessitate training, as each example is classified independently of the others, in relation to its distances from its nearest neighbors.
Given that it necessitates these calculations, we can deduce that for larger databases or a high number of features, it won't be as effective, as it calculates the distance from all points for each new classification. Additionally, the values of each feature need to be normalized to prevent the algorithm from favoring features with higher values.
To classify a new example, the algorithm computes the distance between the new point and all other points in the database, selects the k nearest points, and assigns the new example the majority class among them.
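For comparison, a sketch of the `K-NN` procedure just described; the value of k, the scaler, and the train/test variables are assumptions:

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
X_train_s = scaler.fit_transform(X_train)   # normalize so no feature dominates the distance
X_test_s = scaler.transform(X_test)

# Each test point is assigned the majority class among its k nearest
# training points (Euclidean distance by default)
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train_s, y_train)
predictions = knn.predict(X_test_s)
```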
The chosen algorithm is `Naïve-Bayes`, primarily because it's generally faster to train and classify than `KNN`. This is because it calculates probabilities under the assumption of independence, unlike `KNN`, which necessitates storing and computing the distances of all examples for the classification task. Moreover, the assumption of independence simplifies the model, as it pertains to a smaller set of hypotheses, only those where the features are independent, unlike `KNN`, which doesn't make any simplifying assumptions. Additionally, the `Naïve-Bayes` algorithm is easier to interpret than `KNN`, since the model's decisions are based solely on conditional probabilities between each feature and class, making it clear how a result is obtained. Conversely, `KNN` requires calculating distances and then classifies based on majority vote, making the classification task less intuitive to understand.
I utilized `Python` to execute the algorithm. Analogous to the classification conducted using the previous algorithms, I implemented k-fold cross-validation.
In order to evaluate the quality of a cluster created by a cluster analysis algorithm, we'll use the following measures (a short code sketch follows the list):
- Distance within the cluster - we aim for all elements within a cluster to be as similar as possible. This is achieved by calculating the sum of squared distances between each pair of points within the cluster, with the goal of minimizing this sum.
- Distance between the clusters - we aim for each cluster to be as distinct as possible from other clusters. We perform the same calculation as before, but this time we aim for the sum to be maximal.
- Homogeneity - we desire each cluster to contain elements that belong to the same class.
- Completeness - we want examples that are classified to the same class to belong to the same cluster.
- V-measure - the harmonic mean between homogeneity and completeness, which describes the trade-off between these two.
- ARI (Adjusted Rand Index) - assesses the similarity between the actual and predicted classifications, corrected for chance agreement.
- AMI (Adjusted Mutual Information) - evaluates the mutual information between the actual and predicted classifications, corrected for chance, similar to ARI.
- Silhouette - examines the degree to which an example matches the cluster to which it belongs based on the distance from points within the cluster and points in nearby clusters.
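A sketch of how these measures can be computed with scikit-learn, assuming `X` holds the preprocessed features, `labels_true` the actual classes and `labels_pred` the cluster assignments produced by the clustering algorithm:

```python
from sklearn import metrics

homogeneity = metrics.homogeneity_score(labels_true, labels_pred)
completeness = metrics.completeness_score(labels_true, labels_pred)
v_measure = metrics.v_measure_score(labels_true, labels_pred)    # harmonic mean of the two above
ari = metrics.adjusted_rand_score(labels_true, labels_pred)
ami = metrics.adjusted_mutual_info_score(labels_true, labels_pred)
silhouette = metrics.silhouette_score(X, labels_pred)            # uses the data itself, not the true labels
```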
The selected algorithm is `DBScan`. This is a density-based clustering algorithm that operates under the assumption that clusters are high-density areas separated by lower-density areas. As it identifies clusters based on density rather than distance, it can detect clusters of arbitrary shapes, not necessarily elliptical. Furthermore, it doesn't make assumptions about the data distribution. This robust algorithm is not sensitive to outliers or noise. However, it is computationally expensive as it iterates through all points and calculates distances between them for every core point. Given that our database doesn't contain a large number of observations, we can overlook the computational cost of the algorithm and, in return, achieve more accurate clusters.
To run `DBScan`, we need to preprocess the data. This involves handling missing values (either by completion or deletion), converting values to numerical values (even though the algorithm can handle categorical values), and normalizing the data so that features like Age don't have undue importance in the clustering task. To achieve clearer, easily interpretable clusters, we'll apply a dimensionality reduction algorithm, such as `Principal Component Analysis` (PCA). The `PCA` algorithm essentially projects the data into a lower-dimensional vector space so as to retain as much variance as possible. It does this by computing the eigenvalues and eigenvectors of the data's covariance matrix and retaining the components with the largest eigenvalues.
Given that we are working with the optimized database, there's no need for further data processing as the data already meet our requirements, with the exception of dimensionality reduction.
The `DBScan` algorithm doesn't require the user to provide the expected number of clusters. However, it does use two parameters: $\varepsilon$, the radius of the neighborhood examined around each point, and MinPts, the minimum number of points required within that neighborhood for a point to be considered a core point.
The algorithm categorizes points into three groups - core points, border points, and noise points, characterized by the following features:
- A point $p$ is a core point if there are at least MinPts points within its $\varepsilon$-neighborhood.
- If the distance between points $p$ and $q$, where $p$ is a core point, is less than $\varepsilon$, we say that $q$ is directly reachable from $p$.
- If there's a path between $p$ and $q$ where each pair of consecutive points is directly reachable, we say that $q$ is reachable from $p$.
- Any point that isn't reachable is considered an outlier or noise point.
The algorithm operates as follows (a simplified code sketch follows the list):
- For each point, identify all points within its $\varepsilon$-neighborhood.
- If the number of points found is at least MinPts, mark this point as a core point.
- For each core point, find all its reachable neighbors.
- Continue this process until there are no more core points to check.
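A simplified, illustrative sketch of this procedure (a toy implementation for intuition only; the actual run below uses scikit-learn's `DBSCAN`):

```python
import numpy as np

def dbscan_sketch(X, eps, min_pts):
    """Toy DBSCAN: returns a cluster label per point, with -1 marking noise."""
    n = len(X)
    labels = np.full(n, -1)  # -1 = noise / not yet assigned
    # Pairwise Euclidean distances and epsilon-neighborhoods
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    neighborhoods = [np.flatnonzero(dists[i] <= eps) for i in range(n)]
    core = [i for i in range(n) if len(neighborhoods[i]) >= min_pts]

    cluster = 0
    for i in core:
        if labels[i] != -1:
            continue  # already absorbed into an earlier cluster
        labels[i] = cluster
        frontier = list(neighborhoods[i])
        while frontier:
            j = frontier.pop()
            if labels[j] == -1:
                labels[j] = cluster
                if len(neighborhoods[j]) >= min_pts:
                    # j is itself a core point, so expand through its neighbors;
                    # border points get a label but are not expanded further
                    frontier.extend(neighborhoods[j])
        cluster += 1
    return labels
```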
I utilized `Python` to execute the algorithm. The necessary parameters include MinPts=6, and I applied `PCA` to reduce the dimensions from 24 to 2 for visualization purposes.
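A sketch of this run, assuming `X` is the preprocessed, normalized feature matrix; the `eps` value shown is an illustrative placeholder, since the exact value used isn't reproduced here:

```python
from sklearn.decomposition import PCA
from sklearn.cluster import DBSCAN
import matplotlib.pyplot as plt

# Reduce the 24-dimensional data to 2 components for visualization
X_2d = PCA(n_components=2).fit_transform(X)

# MinPts=6 as stated; eps is an assumed placeholder value
clustering = DBSCAN(eps=0.5, min_samples=6).fit(X_2d)

plt.scatter(X_2d[:, 0], X_2d[:, 1], c=clustering.labels_, s=15)
plt.title("DBSCAN clusters in the 2-D PCA projection")
plt.show()
```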
The neural network is defined as follows. The network will consist of three types of layers:
- Input layer - contains 24 neurons, corresponding to the data being taken from a 24-dimensional space.
- Hidden layers - these are situated between the input and output layers. After a process of trial and error, we choose two hidden layers, each containing 10 neurons.
- Output layer - this is the final layer in the network and contains two neurons, corresponding to the number of classes in our task.
The network is defined as a feed-forward, fully-connected network, meaning that the direction of the network is strictly forward, from the input layer to the output layer, and each neuron in a layer is connected to all neurons in the subsequent layer.
The activation function will be `ReLU`.
The loss function will be `Cross-entropy`, with optimization performed using `Stochastic Gradient Descent`. The loss function is defined as follows:

$$L = -\frac{1}{N}\sum_{n=1}^{N}\Bigl[y_n \log p(y_n) + (1 - y_n)\log\bigl(1 - p(y_n)\bigr)\Bigr]$$

where $y_n$ is the actual class of the $n$'th point, $p(y_n)$ is the predicted probability for that point, and $N$ is the total number of points.
The optimization parameters include the batch size, learning rate, regularization parameters, and optimization algorithm. As previously stated, we've chosen `SGD` as the optimization algorithm, with each batch being of size 100. The regularization parameter will remain at the default value provided by the `SKLearn` library.
I utilized `Python` to run the network. Initially, we split the dataset into a training set and a test set.
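A sketch of this network with scikit-learn's `MLPClassifier`, matching the architecture and optimizer described above; the train/test split ratio and the `alpha` regularization value are assumptions:

```python
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# Two hidden layers of 10 neurons, ReLU activations, SGD with mini-batches of 100;
# the output layer and the cross-entropy loss are handled internally by MLPClassifier
net = MLPClassifier(
    hidden_layer_sizes=(10, 10),
    activation="relu",
    solver="sgd",
    batch_size=100,
    alpha=0.0001,          # default L2 regularization strength
    max_iter=500,
    random_state=42,
)
net.fit(X_train, y_train)
print("Test accuracy:", net.score(X_test, y_test))
```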