-
Notifications
You must be signed in to change notification settings - Fork 1
/
Copy pathclock-protein-cry1-toxicity.Rmd
2590 lines (1652 loc) · 117 KB
/
clock-protein-cry1-toxicity.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
---
title: "Ensemble Model of Core Clock Protein Cryptochrome (CRY1) Toxicity Prediction"
author: "HA W.M.ALEX"
date: "2024-12-10"
output: pdf_document
latex_engine: xelatex
toc: true
toc_depth: 2
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
\newpage
```{r packages and libraries, include=FALSE}
##############################################################################################
# #
# Beginning - Ensemble Model of Core Clock Protein Cryptochrome (CRY1) Toxicity Prediction #
# #
##############################################################################################
options(warn=-1)
# Install Necessary Packages if required
#
if(!require(tidyverse)) install.packages("tidyverse", repos = "http://cran.us.r-project.org")
if(!require(tidyr)) install.packages("tidyr", repos = "http://cran.us.r-project.org")
if(!require(dplyr)) install.packages("dplyr", repos = "http://cran.us.r-project.org")
if(!require(lubridate)) install.packages("lubridate", repos = "http://cran.us.r-project.org" )
if(!require(stringr)) install.packages("stringr", repos = "http://cran.us.r-project.org")
if(!require(ggplot2)) install.packages("ggplot2", repos = "http://cran.us.r-project.org")
if(!require(gridExtra)) install.packages("gridExtra", repos = "http://cran.us.r-project.org")
if(!require(knitr)) install.packages("knitr", repos = "http://cran.us.r-project.org")
if(!require(rstudioapi)) install.packages("rstudioapi", repos = "http://cran.us.r-project.org")
if(!require(caret)) install.packages("caret", repos = "http://cran.us.r-project.org")
if(!require(tinytex)){install.packages("tinytex", repos = "http://cran.us.r-project.org")
tinytex::install_tinytex()}
if(!require(reshape2)) install.packages("reshape2", repos = "http://cran.us.r-project.org")
if(!require(matrixStats)) install.packages("matrixStats", repos = "http://cran.us.r-project.org")
if(!require(pROC)) install.packages("pROC", repos = "http://cran.us.r-project.org")
if(!require(PRROC)) install.packages("PRROC", repos = "http://cran.us.r-project.org")
if(!require(gbm)) install.packages("gbm", repos = "http://cran.us.r-project.org")
if(!require(randomForest)) install.packages("randomForest", repos = "http://cran.us.r-project.org")
if(!require(xgboost)) install.packages("xgboost", repos = "http://cran.us.r-project.org")
# Loading Necessary Libraries
#
library(dslabs)
library(tidyverse)
library(dplyr)
library(tidyr)
library(caret)
library(knitr)
library(lubridate)
library(stringr)
library(ggplot2)
library(gridExtra)
library(reshape2)
library(matrixStats)
library(pROC)
library(PRROC)
library(gbm)
library(randomForest)
library(xgboost)
# Set number of significant digits=4 globally
options(digits = 4)
# set maximum display output= 2000 globally
options(max.print=2000)
set.seed(755)
#####################################################################################
# Initialization: #
# #
# 1) Download cryptochrome-d.csv and benchmark-13.csv Files #
# #
# 2) Read csv Files cryptochrome-d.csv and benchmark-13.csv from current directory. #
# #
#####################################################################################
#################################################
# Files Handling for R script current directory #
#################################################
# Get R script current directory
current_path_r_script <- rstudioapi::getActiveDocumentContext()$path
current_directory_r_script <- dirname(current_path_r_script)
print(current_directory_r_script)
# Join R script current directory path and file name "benchmark-13.csv"
benchmark_file_path <- str_c(current_directory_r_script,"/benchmark-13.csv")
print(benchmark_file_path)
# Join R script current directory path and file name "cryptochrome-d.csv"
cryptochrome_file_path <- str_c(current_directory_r_script,"/cryptochrome-d.csv")
print(cryptochrome_file_path)
# Download cryptochrome-d.csv File
download.file("https://github.com/alexwmlab/test-one/blob/main/cryptochrome-d.csv?raw=true",cryptochrome_file_path)
# Download benchmark-13.csv File
download.file("https://github.com/alexwmlab/test-one/blob/main/benchmark-13.csv?raw=true",benchmark_file_path)
# Read csv File cryptochrome-d.csv and Create ccpc_data dataset
ccpc_data <- read_csv(col_names=TRUE, cryptochrome_file_path)
# Read csv File benchmark-13.csv and Create benchmark_d dataset
benchmark_d <- read_csv(col_names=TRUE, benchmark_file_path)
```
\newpage
# 1 Executive Summary
The Objective of Capstone Project is to develop a **Core Clock Protein Cryptochrome (CRY1) Toxicity Prediction Ensemble Model** that **Predict Toxicity of Molecules** using **Machine Learning** based on **core_clock_protein_cry** set. The Primary task is to Leverage this data to **Predict Toxicity of CRY1 Molecules/Instances**. Original data is Sourced and Loaded from **UCI Machine Learning Repository ^[https://archive.ics.uci.edu/dataset/728/toxicity-2]**. Renamed as **ccpc_data** and **benchmark_d** dataset. The **ccpc_data** set is a **Non-Linear**, **High Dimensional** but **Small** dataset comprising with **171 of Molecules/Instances**, **1203 Molecular Descriptors/Features** and
**1 Feature of Toxicity Class (Toxic or NonToxic)**. The **benchmark_d** dataset comprising with **171 Molecules**, **13 Molecular Descriptors/Features** and **1 Feature of Toxicity Class** for Benchmarking Purpose.
**Molecular Descriptors** of **core_clock_protein_cry** set for each Model were formulated from **1203 Molecular Descriptors** of **ccpc_data** set using **Molecular Descriptors Prioritization Method**. In **ccpc_data** set, **Zero** Values hold meaningful information relevant to its **Molecular Descriptors**. As such, these **Zero** Values are preserved in their original form and are not subject to any modification or elimination during Data Cleaning process. **Scaling and Normalization** are one of the key Pre-Processing steps in data preparation. **Scaling** ensures that all **Molecular Descriptors** are on the same **Scale**, while **Normalization** adjusts the data to a normal distribution.
Prior to Models Training, **Data Exploration Analysis(EDA)** were conducted which encompassing Data Cleaning,
Data Pre-processing, Molecular Descriptors/Features Prioritization, Redundant Feature Elimination (RFE), Features Importance Selection, Redundant Molecular Descriptors Reducing and Principal Component Analysis (PCA). Model employs a **Ensemble Modelling Approach** using Multiple Models of **Gradient Boosted Machine (GBM), Random Forest, K-Nearest Neighbors (KNN) and Extreme Gradient Boosting (XGBoost)** augmented by **Hyper-Parameters Tuning** and **Cross Validation Method** to Optimize Configuration of the **Final Machine Learning Ensemble Model**. These Methods are instrumental in Optimizing magnitude of the Parameters to mitigate the **Risk of Over-fitting**.
The **core_clock_protein_cry** set is also Split into **train sets** and **test sets** for training and testing purpose. Each of the Four **Molecular Descriptors Prioritized Model** (GBM, Random Forst, KNN, XGBoost) will be trained separately with **Optimized Hyper-Parameters** and Evaluate on the **test sets**. **Predictions** will be made and an ensemble created from these Four **Models** to generate a **Majority Prediction of Toxicity Class**. The Ensemble Model **Performance** was evaluated using confusionMatrix function of Caret Package in R. Our ultimate Goal is to created a **Balanced and Robust Ensemble Model** that maximizes the **Performance**, as measured by our Evaluation Metrics such as **F1-Score, Accuracy, Precision and Recall**.
**Performance of our Final Ensemble Model is as follow:**
**F1-Score=0.880, Accuracy=0.823, Recall=0.957, Precision=0.815, Sensitivity=0.957, Specificity = 0.583, P-value=0.021**
Final **Ensemble model** combines **Strength** of GBM, Random Forest, KNN and XGBoost Algorithm to learn more complex patterns by **Analyzing Interactions of Molecular Descriptors**, making the **Model** increasingly more **Accurate**. The Model's **Performance** was Evaluated using the Metric such as **F1-Score, Accuracy, Precision and Recall**. The **reliability** and **trustworthiness** of the Model are Substantiated through incorporating **Cross-Validation Method** and **Hyper-Parameters Optimization** during Model development and training.
**Optimal Machine Learning Ensemble Model achieved F-Score of 0.880 and Accuracy of 0.823, Signifying a Remarkable Outcome. Thus, I firmly assert our confidence in adopting the ultimate Ensemble Mode and Algorithm to construct the Core Clock Protein Cryptochrome (CRY1) Toxicity Prediction System.**
\newpage
# 2 Introduction
## 2.1 Core Clock Protein Cryptochrome (CRY1)
The **Core Clock Protein** refers to a group of Proteins that are central to the regulation of **Circadian Rhythms**, which are body's internal 24-hour cycles that influence various physiological processes. ^[journals.biologists.com]
The primary **Core Clock Proteins** include CLOCK(Clock Circadian Regulator) and BMAL1(ARNTL), which form a complex that activates the transcription of other Clock genes like Period (PER) and Cryptochrome (CRY).^[www.genecards.org]
These Proteins work together in a feedback loop to maintain the **Circadian Rhythm**, which helps regulate sleep-wake
cycles, metabolism, body temperature, and many other functions. Disruptions in these Proteins can lead to various health issues, such as sleep disorders and metabolic problems.
**Cryptochrome (CRY)** proteins play a crucial role in the regulation of the **Circadian Rhythms** by **interacting** with other **Core Clock Proteins**, There are generally two types, **"CRY1"** and "CRY2", each with its own specific functions, but both are essential for the proper functioning of the **Circadian Clock**.
**"CRY" Proteins** help regulate the feedback loop that maintains our **Circadian Rhythms** by inhibiting the activity of the "CLOCk-BMAL1" complex, thus controlling the transcription of other **Clock** genes. In simpler terms, they act like a brake in the system to ensure the cycle operates smoothly and on time.
## 2.2 Toxicity Prediction
### 2.2.1 Machine Learning Model
Machine Learning Model can **Predict** potential **Toxicity** of drug candidates by analyzing their **interactions** with various biological targets such as **Core Clock Protein cryptochrome (CRY1)** Molecules. This helps in early identification of harmful effects, reducing the risk of late-stage drug failures.
Two Datasets of **Core Clock Protein Cryptochrome (CRY1)** Molecules for Toxicity Prediction were sourced and loaded from **UCI Machine Learning Repository** ^[https://archive.ics.uci.edu/dataset/728/toxicity-2]. Renamed as ccpc_data and benchmark_d dataset , which include **Molecules/Instances** and their associated **Molecular Descriptors** as well as **Toxicity Class**. The ccpc_data dataset comprising with **171 of Molecules/Instances**, **1203 Molecular Descriptors/Features/Compound** and **1 Feature of Toxicity Class**. The benchmark_d dataset comprising with **171 Molecules/Instances**, **13 Molecular Descriptors/Features** and **1 Feature of Toxicity Class** for Benchmarking purpose.
In **Toxicity** of Molecules/Instances **Model** Building, especially when working with our datasets from
UCI Machine Learning Repository for **Classification** task, **Molecular Descriptors/Features** are commonly used to help distinguish **Toxicity of Molecules/Instances (Toxic or NonToxic)**. These **Molecular Descriptors** are derived from **Quantitative Structure-Activity Relationship (QSAR)**^[https://en.wikipedia.org/wiki/Quantitative_structure%E2%80%93activity_relationship]
and are used to create a comprehensive profile of each **Molecule/Instance/Compound**. By analyzing **interactions** of these **Molecular Descriptors/Features**, Machine Learning Models can be trained to classify **Toxicity** of these **Molecules** more accurately.
\newpage
**QSAR** relies on the basic principle of chemistry that states the biological activity of any Compound is associated
with the arrangement of atoms forming the Molecular structure. In other words, structurally related Molecules posses
similar biological activities. This structural information can be defined in terms of a series of **Parameters**
called **Molecular Descriptors**.
### 2.2.2 Molecular Descriptors/Features
In the context of **Molecular Descriptors/Features**, they comprise Relevant Features from the Chemical Compounds,
such as Molecular Weight, Solubility and other Molecular Descriptors of Molecules/Instances. MATS3v and nHBint10 are
examples of specific types of Descriptors used in cheminformatics to describe Molecular Properties:
**MATS3v**
* This stands for Moran Autocorrelation Topological Structure with lag 3, weighted by Van der Waals volumes. It is a type of Descriptor that considers the spatial arrangement of atoms within a Molecule ^[www.vcclab.org]
**nHBint10**
* This represents the number of hydrogen bond acceptors within 10 bonds. It is a topological Descriptor that counts the number of hydrogen bond acceptors within a certain distance in the Molecular structure. ^[www.alvascience.com]
These Descriptors might be Redundant, meaning they provide similar information and overlap with information provided by the other similar Descriptors. **Dimensionality Reduction Method/Feature Selection** Techniques are typically employed to identify and reduce **Redundant Molecular Descriptors**, improving the **Performance** of the **Machine Learning Models**.
### 2.2.3 Machine Learning Challenges
Applying Machine Learning to **CRY1 Toxicity Predictions** presents several Challenges:
**Data Quality and Quantity**
* High Quality, labeled data is essential for training effective Machine Learning Models.
**Heterogeneity of Data**
* Toxicity data is highly Heterogeneous, with variations between **Molecular Descriptors** and within **CRY1**. This complexity makes it challenging to develop Models that generalize well across different Molecular Descriptors.
**High Dimensionality**
* CRY1 datasets often include a vast number of **Molecular Descriptors/Features**, which can lead to **Over-Fitting** and computational Challenges.
\newpage
Hence, Method/Techniques are adopt for **Dimensionality Reduction** as following:
**Principal Component Analysis (PCA)**
* Reduce the number of Features while retaining the most Important information.(i.e. **Vars Importance**)
**Recursive Feature Elimination (RFE)**
* Recursively remove the least important Features until a specified number of **Features** is reached.
**Redundant Features Reducing**
* Eliminate highly correlated Features with cutoff 0.9. (i.e. **coefficient value > 90%**)
Thus, prior to the Model Training, **Molecular Descriptors/Features Prioritization Method** will be **utilized** to Pre-Process and Select Molecular Descriptors/Features that are most relevant to Model for **Toxicity Prediction**.
\newpage
# 3 Data Exploration Analysis (EDA)
## 3.1 Data Explorations
Data was sourced from **UCI Machine Learning Repository.^[https://archive.ics.uci.edu/dataset/728/toxicity-2]**.
Two csv Files were loaded and renamed as **cryptochrome-d.csv** and **benchmark-13.csv**. The **cryptochrome-d.csv** file was read into **ccpc_data** dataframe, while **benchmark-13.csv** file was read into the **benchmark_d** dataframe.
The **ccpc_data** set, consisting of **171 rows and 1204 columns**, was Split into **df_ccpc_data** and **df_ccpc_tox**. As a result, **df_ccpc_data** set comprising of **171 Molecules/Instances** and **1203 Molecular Descriptors/Features.** was generated. And **df_ccpc_tox** set comprise **Single Toxicity Feature:Class of 171 Molecules/Instance.** Then **core_clock_protein_cry** list was created for Principal Component Analysis (PCA) and Model Training, incorporating both the Molecular Descriptors Matrix and Toxicity Vector.
Similarly, "benchmark_d" dataset consisting of **171 rows** and **14 columns.** set was split into **df_benchmark_d** and **df_benchmark_tox**. Hence, **df_benchmark_d** set comprising of **171 Molecules/Instances** and **13 Molecular Descriptors/Features.** was generated. And **df_benchmark_tox** set comprise **Single Toxicity Feature:Class of 171 Molecules/Instance.** . The **benchmark_core_clock_protein_cry** list was also created incorporating both the Molecular Descriptors Matrix and Toxicity Vector.
**Redundant Feature Reduction** was applied to **df_benchmark_d** set to eliminate **highly correlated Features**, using a **cutoff value** of **0.9**. This process resulted in the creation of **df_benchmark_d_reduced** set, which was subsequently utilized in the **Feature Importance** Techniques. And then **df_benchmark_d_reduced** set was transformed into a **Matrix** and **df_benchmark_d_tox_reduced$Class** was transformed into a **Toxicity Vector.**
Both **core_clock_protein_cry** and **benchmark_core_clock_protein_cry** were then randomly Split into 80% (train_set_x, train_set_y) and 20% (test_set_x, test_set_y). I use a 20% test set instead of 10% because our dataset is Small but High-Dimensional. This requires a sufficient amount of data to create a reliable test set for Evaluation purposes. Both train_set_x and test_set_x contained **Molecular Descriptors** and train_set_y and test_set_y contained **Toxicity Feature** ("Toxic" or "NonToxic") of Molecules/Instances.
Prior to **Model** training, the **Molecular Descriptors Prioritization Method** was applied for **Feature Selection**, reducing **dimensionality** by retaining the **Most Relevant Features**. **Feature Selection** will be performed using the Random Forest Algorithm, and the top Molecular Descriptors will be identified based on **Variable Importance.**
\newpage
### 3.1.1 Dimensions - ccpc_data set
**A) Dimensions of ccpc_data**
```{r dimension of ccpc_data, echo=FALSE}
# Dimension of ccpc_data - Number of rows and columns
dim_df <- data.frame(Rows = dim(ccpc_data)[1], Columns = dim(ccpc_data)[2])
dim_df %>% knitr::kable(caption="Dimension of ccpc_data set")
```
**B) Dimensions of benchmark_d**
```{r dimensions benchmark_d, echo=FALSE}
# Dimension of benchmark_d - Number of rows and columns
dim_df <- data.frame(Rows = dim(benchmark_d)[1], Columns = dim(benchmark_d)[2])
dim_df %>% knitr::kable(caption="Dimension of benchmark_d dataset")
```
### 3.1.2 Description of Columns/Features and Data Type
**The 1 to 1203 Columns/Features - ccpc_data set**
These Columns are **Molecular Descriptors** used in Quantitative Structure-Activity Relationship(QSAR) Models to describe Molecular properties. These **Molecular Descriptors/Features** of dataset are used in combination to capture **Core Clock Protein Cryptochrome (CRY1) Toxicity of Molecules/Instances.**
**Examples of Molecular Descriptors/Features**
* MDEC-23: This Descriptor evaluate Molecular distance edge Descriptors for C.^[docs.ochem.eu]
* apol: This Descriptor calculates sum of atomic polarizabilities.
* MATS3v: This Descriptor measures the Molecular topology and shape. ^[www.vcclab.org]
* nHBint10: This Descriptor counts the number of hydrogen bond acceptors in a Molecule.
* GATS8s: This Descriptor measures the similarity of a Molecule's properties at different distances lag 8 within
the Molecule, weighted by I-state.z ^[alvascience.com]
**The "Class" Column - ccpc_data set**
* This column contains either **Toxic or NonToxic** which represents Core Clock Protein Cryptochrome (CRY1) **Toxicity Class of Molecules/Instances.**
The **ccpc_data** set comprises a total of **1204 Columns** with **1203 Molecular Descriptors Features**
and **1 Toxicity Class**
\newpage
**Data Type of Column - ccpc_data set**
```{r data type ccpc_data, echo=FALSE}
data_type <- data.frame(Column="1 to 1203", Data_Type="Numeric", Value="")
data_type<- bind_rows(data_type,
data.frame(Column="Class", Data_Type="Character", Value="Toxic or NonToxic"))
data_type %>% knitr::kable(caption="ccpc_data set")
```
**Description of Columns/Features - benchmark_d set**
**Column Names and Data Type**
```{r data type benchmark-d, echo=FALSE}
col_data_type <- data.frame(Data_Type=sapply(benchmark_d, class))
col_data_type %>% knitr::kable(caption="benchmark_d dataset")
```
The benchmark_d dataset comprises a total of **14 Columns** with **13 Molecular Descriptors/Features**
and **1 Toxicity class.**
\newpage
### 3.1.3 Data Cleaning
**R Codes check columns with "NA" values in "ccpc_data" set**
```{r missing value, echo=TRUE}
# check and count number of missing value in ccpc_data set
na_results <- ccpc_data[apply(is.na(ccpc_data),1,any),]
nrow(na_results)
```
There are **NO missing value** in **ccpc_data** set.
\
**R Codes check "Zeros" value in the "ccpc_data" set**
```{r zero value, echo=TRUE}
# Create a logical matrix where TRUE represents zero value
zero_check <- ccpc_data == 0
# Sum the TRUE values for each column to count zeros
zero_count <- colSums(zero_check)
# Display sum of zeros of ccpc_data set
sum(zero_count)
```
In **ccpc_data** set, there are **42659 Zeros** value. It's important to note that these **Zeros** value are not Errors or Missing data. Instead, they carry Meaningful Information relevant to its **Molecular Descriptors.** Therefore, I am preserving all **zeros** value in the dataset and not altering them during the data cleaning process.
This Approach ensures that we maintain the integrity and accuracy of **ccpc_data** set, allowing for a more precise and insightful analysis.
\newpage
### 3.1.4 Pre-Processing
**3.1.4.1 Data Wrangling**
Replace **hyphens** with **underscores** in all **Column Names/Molecular Descriptors** of **ccpc_data** set.
The conversion of hyphens "-" to underscores "_" maintain consistency in data formats, also ensuring
compatibility with R. This standardize **Column Names** is also making it easier to manage in subsequent analyses and
ensuring uniformity across datasets. (e.g., **MDEC-23 to MDEC_23**)
\
**Example: The Column Name of ccpc_data is "MDEC-23" before the Replacement.**
```{r display colum 99, echo=TRUE}
# Display first 6 rows of column "MDEC-23" descriptors - column range [95:108]
head(ccpc_data[,99])
```
**R Codes that replace hyphens with underscores in the Columns Name**
```{r data wrangle column 99, echo=TRUE}
# Data Wrangling of ccpc_data for Molecular Descriptors/Features/Columns
# 1) convert hyphens "-" to under_scores "_"
#
colnames(ccpc_data) <- str_replace_all(colnames(ccpc_data), "^'|'$","")
colnames(ccpc_data) <- str_replace_all(colnames(ccpc_data), "-","_")
# Data Wrangling of benchmark_d for Molecular Descriptors/Features/Columns
# 1) convert hyphens "-" to under_scores "_"
#
colnames(benchmark_d) <- str_replace_all(colnames(benchmark_d), "^'|'$","")
colnames(benchmark_d) <- str_replace_all(colnames(benchmark_d), "-","_")
```
**Example: The Column Name of ccpc_data is "MDEC_23" after the Replacement.**
```{r After replacement display column 99, echo=TRUE}
# Display first 6 rows of column "MDEC_23" descriptors - column range [95:108]
head(ccpc_data[,99])
```
\newpage
**3.1.4.2 Dataset Splitting**
**ccpc_data** set is splitting into **df_ccpc_data** set and **df_ccpc_tox** set by Column Categories.
**benchmark_d** set is also splitting into **df_benchmark_d** and **df_benchmark_d_tox** set by Column Categories.
**A) Column Categories**
**ccpc_data set:**
* **Column Name** of Columns 1 to 1203: Contain **Name** of **1203 Molecular Descriptors**
* **Value** of Columns 1 to 1203 : Contain **Value** of **1203 Molecular Descriptors**
* **Value** of Column "Class" : Contain **Toxicity** Class (**Toxic or NonToxic**) of **Molecules/Instances**
**benchmark_d set:**
* **Column Name** of Columns 1 to 13: Contain **Name** of **13 Molecular Descriptors**
* **Value** of Columns 1 to 13 : Contain **Value** of **13 Molecular Descriptors**
* **Value** of Column "Class" : Contain **Toxicity** Class (**Toxic or NonToxic**) of **Molecules/Instances**
**B) Generate following New Dataset from ccpc_data set:**
**df_ccpc_data set:**
* **Column Name** of Columns 1 to 1203: Contain **Name** of **1203 Molecular Descriptors**
* **Value** of Columns 1 to 1203 : Contain **Value** of **1203 Molecular Descriptors**
**df_ccpc_tox set:**
* **Value** of Column "Class" : Contain **Toxicity** Class (**Toxic or NonToxic**) of **Molecules/Instances**
**C) Generate following New Dataset from benchmark_d set:**
**df_benchmark_d set:**
* **Column Name** of Columns 1 to 13: Contain **Name** of **13 Molecular Descriptors**
* **Value** of Columns 1 to 13 : Contain **Value** of **13 Molecular Descriptors**
**df_benchmark_d_tox set:**
* **Value** of Column "Class" : Contain **Toxicity** Class (**Toxic or NonToxic**) of **Molecules/Instances**
```{r df_ccpc_data benchmark_d, include=FALSE}
# Generate df_ccpc_data comprising 1203 Molecular Descriptors of 171 Molecules/Instances
df_ccpc_data <- ccpc_data %>% select(-Class)
# Generate df_ccpc_tox comprising Toxicity Class ("Toxic" or "NonToxic") of 171 Molecules/Instances
df_ccpc_tox <- ccpc_data %>% select(Class)
# Generate df_benchmark_d comprising Molecular Descriptors for benchmarking
df_benchmark_d <- benchmark_d %>% select(-Class)
# Generate df_benchmark_d_tox comprising Toxicity Class ("Toxic" or "NonToxic") for benchmarking
df_benchmark_d_tox <- benchmark_d %>% select(Class)
```
**D) Dimension of df_ccpc_data set:**
```{r dimension df_ccpc_data, echo=FALSE}
# Table Display Dimension of df_ccpc_data using kable function of knitr packagae
dim_df <- data.frame(Rows = dim(df_ccpc_data)[1], Columns = dim(df_ccpc_data)[2])
dim_df %>% knitr::kable(caption="Dimension of df_ccpc_data set")
```
\newpage
**E) Dimension of df_ccpc_tox set:**
```{r dimension of df_ccpc_tox, echo=FALSE}
# Table Display Dimension of df_ccpc_tox using kable function of knitr of package
dim_df <- data.frame(Rows = dim(df_ccpc_tox)[1], Columns = dim(df_ccpc_tox)[2])
dim_df %>% knitr::kable(caption="Dimension of df_ccpc_tox set")
```
**F) Dimension of df_benchmark_d set:**
```{r dimension df_benchmark_d, echo=FALSE}
# Table Display Dimension of df_benchmark_d using kable function of knitr package
dim_df <- data.frame(Rows = dim(df_benchmark_d)[1], Columns = dim(df_benchmark_d)[2])
dim_df %>% knitr::kable(caption="Dimension of df_benchmark_d set")
```
**G) Dimension of df_benchmark_d_tox set:**
```{r dimension of df_benchmark_d_tox, echo=FALSE}
# Table Display Dimension of df_benchmark_d_tox using kable function of knitr package
dim_df <- data.frame(Rows = dim(df_benchmark_d_tox)[1], Columns = dim(df_benchmark_d_tox)[2])
dim_df %>% knitr::kable(caption="Dimension of df_benchmark_d_tox set")
```
\newpage
**3.1.4.3 Redundant Features Reducing - benchmark set**
\
**Eliminate highly correlated Features of df_benchmark_d with coefficient over 90%.**
* **findCorrelation** is a function of Caret package in R, used to detect and remove highly correlated variables
in df_benchmark_d set.
* **cutoff:** Parameter/Threshold for correlation. Variables with correlation higher than this value will be flagged.
**R Codes that Eliminate Highly Correlated Features with cutoff=0.9**
```{r df_benchmark_d coefficient cutoff 0.9, echo=TRUE}
###############################################################################
# Eliminate highly correlated features of df_benchmark_d (coefficient > 90%) #
###############################################################################
# Compute Correlation Coefficient
benchmark_correlation <- cor(df_benchmark_d)
# Find highly correlated features (cutoff = 0.9)
highly_correlated <- findCorrelation(benchmark_correlation, cutoff = 0.9)
# Remove the highly correlated features
df_benchmark_d_reduced <- df_benchmark_d[, -highly_correlated]
# Assign df_benchmark_d_tox to Reduced toxicity dataset
df_benchmark_d_tox_reduced <- df_benchmark_d_tox
```
**Dimension of df_benchmark_d_reduced set:**
```{r df_benchmark_d_reduced, echo=FALSE}
# Table Display Dimension of df_benchmark_d_reduced using kable function of knitr package
dim_df <- data.frame(Rows = dim(df_benchmark_d_reduced)[1], Columns = dim(df_benchmark_d_reduced)[2])
dim_df %>% knitr::kable(caption="Dimension of df_benchmark_d_reduced set")
```
**Column Name of Molecular Descriptors - df_benchmark_d_reduced set:**
```{r column names df_benchmark_d_reduced, echo=FALSE}
# Column Name of Molecular Descriptors/Features - df_benchmark_d_reduced set
colnames(df_benchmark_d_reduced)
```
\newpage
## 3.2 Feature Selection
### 3.2.1 benchmark_core_clock_protein_cry set
\
**R Codes that create benchmark_core_clock_protein_cry set for Feature Selection**
```{r benchmark_core_clock_protein_cry, echo=TRUE}
#########################################################################
# Create benchmark_core_clock_protein_cry set for Feature Selection #
#########################################################################
# Transform df_benchmark_d_reduced to moecular_descriptors matrix
molecular_descriptors <- df_benchmark_d_reduced %>% as.matrix()
# Transform df_benchmark_d_tox_reduced$Class to toxicity vector
toxicity <- factor(df_benchmark_d_tox_reduced$Class)
# Generate list - benchmark_core_clock_protein_cry
benchmark_core_clock_protein_cry <- list(molecular_descriptors=molecular_descriptors,
toxicity=toxicity )
```
**Number of Molecules/Instances are in the Matrix**
```{r nrow benchmark_core_clock_protein_cry, echo=TRUE}
# Number of Molecules/Instances are in the Matrix
nrow(benchmark_core_clock_protein_cry$molecular_descriptors)
```
**Number of Molecular Descriptors/Features are in the Matrix**
```{r ncol benchmark_core_clock_protein_cry, echo=TRUE}
# Number of Molecular Descriptors/Features are in the Matrix
ncol(benchmark_core_clock_protein_cry$molecular_descriptors)
```
**Proportion of the Molecules/Instances are "Toxic"**
```{r proportion benchmark_core_clock_protein_cry toxic, echo=TRUE}
# Proportion of the Molecules/Instances are "Toxic"
mean(benchmark_core_clock_protein_cry$toxicity=="Toxic")
```
**Proportion of the Molecules/Instances are "NonToxic"**
```{r NonToxic proportion benchmark_core_cloc_protein_cry, echo=TRUE}
# Proportion of the Molecules/Instances are "NonToxic"
mean(benchmark_core_clock_protein_cry$toxicity=="NonToxic")
```
\newpage
### 3.2.2 Scaling and Normalizing
Scaling and Normalizing **benchmark_core_clock_protein_cry** set, an essential step in making sure the **Model** can
Effectively learn Patterns.
**A) Normalization**
* Adjust **benchmark_core_clock_protein_cry** set to a common **Scale** without distorting differences in the ranges of values. This is especially important for **Molecular Descriptors/Features** with different units.
**B) Standardization**
* Transform the **benchmark_core_clock_protein_cry** set to have **Mean of Zero** and **Standard Deviation of One**.
**R Codes for Scaling and Normalizing benchmark_core_clock_protein_cry set**
```{r normalizing benchmark_core_clock_protein_cry, echo=TRUE}
###################################################################
# Scaling and Normalizing benchmark_core_clock_protein_cry set #
###################################################################
# Compute Mean of each Column
col_means <- colMeans(benchmark_core_clock_protein_cry$molecular_descriptors, na.rm=TRUE)
# Compute Standard Deviations of each Column
col_sds <- colSds(benchmark_core_clock_protein_cry$molecular_descriptors, na.rm=TRUE)
# Subtract Column Means
benchmark_molecular_descriptors_centered<-sweep(benchmark_core_clock_protein_cry$molecular_descriptors,
2, col_means, FUN="-")
# Divide by Column Standard Deviations
benchmark_molecular_descriptors_scaled <- sweep(benchmark_molecular_descriptors_centered,
2 , col_sds, FUN="/")
```
**List 6 rows of scaled and normalized benchmark_molecular_descriptors_scalad set**
```{r list 6 rows benchmark_molecular_descriptors_scalad, echo=TRUE}
# List 6 rows of scaled and normalized benchmark_molecular_descriptors_scalad set
head(benchmark_molecular_descriptors_scaled)
```
\newpage
### 3.2.3 Partitions dataset
Partitions dataset into **80% training and 20% testing sets** as follow:
* Splits 80% of **benchmark_molecular_descriptors_scaled** Matrix into **train_set_x**
* Splits 20% of **benchmark_molecular_descriptors_scaled** Matrix into **test_set_x**
* Splits 80% of **benchmark_core_clock_protein_cry$toxicity** Class into **train_set_y**
* Splits 20% of **benchmark_core_clock_protein_cry$toxicity** Class into **test_set_y**
Given our dataset small size, we allocate 20% for testing rather than the typical 10%. This ensures an adequate amount of data for a reliable and effective evaluation.
\
**R Codes Split dataset into train_set_x,train_set_y (80%) & test_set_x,test_set_y (20%)**
```{r split benchmark_core_clock_protein_cry train 0.8 test 0.2, echo=TRUE}
#######################################################################################
# Splits dataset into train_set_x, train_set_y (80%) & test_set_x, test_set_y (20%) #
#######################################################################################
set.seed(1)
test_index <- createDataPartition(benchmark_core_clock_protein_cry$toxicity, times = 1,
p = 0.2, list = FALSE)
test_set_x <- benchmark_molecular_descriptors_scaled[test_index,]
test_set_y <- benchmark_core_clock_protein_cry$toxicity[test_index]
train_set_x <- benchmark_molecular_descriptors_scaled[-test_index,]
train_set_y <- benchmark_core_clock_protein_cry$toxicity[-test_index]
```
### 3.2.4 Feature Importance
**Feature Importance** Technique ranks **Molecular Descriptors/Features** of **benchmark_core_clock_protein_cry** set based on their contribution to the Model's predictions using Random Forest Algorithm. **Feature Importance** can be derived from the average decrease in impurity or the mean decrease in Accuracy when a **Molecular Descriptors/Features** is excluded.
**A) Model Optimization**
* Fine-tune the Models to improve the **Performance**, involving **Hyper-Parameter Tuning** and **Cross-Validation.**
* Optimize the Models by adjusting Parameters as following:
* Sequence **mtry** from 1 to 10
* Set Number of Tree (**ntree**) equal to 500
* Set Number of **Cross validation** equal to 5
\newpage
**R Codes select most relevant Descriptors from Vars Importance using Random Forest**
```{r molecular descriptors vars importance train_ccpc_benchmark, echo=TRUE}
##############################################################################################################
# Select Most Relevant Molecular Descriptors from Vars Importance as Benchmark using Random Forest Algorithm #
##############################################################################################################
####################################################
# Optimized Parameters of Random Forest Algorithm #
# * Sequence "mtry" from 1 to 10 #
# * Set Number of "tree" equal 500 (ntree=500) #
# * Set Number of "cross validation" equal to 5 #
####################################################
set.seed(9)
train_ccpc_benchmark <- caret::train(train_set_x, train_set_y , method="rf",
tuneGrid=data.frame(mtry=seq(1:10)),
ntree=500,
trControl=trainControl(method="cv", number=5),
importance= TRUE)
```
**B) List Vars Importance**
```{r vars importance (Two) train_ccpc_benchmark, echo=TRUE}
# Process and Generate Vars Importance of train_ccpc_benchmark
importance_rf_b <- varImp(train_ccpc_benchmark)
print(importance_rf_b)
# Vars Importance of benchmark set
benchmark_importance_rf <- rownames(importance_rf_b$importance)[order(-importance_rf_b$importance[,1])]
#benchmark_importance_rf[1:12]
```
**C) Select Top Pairwise Most Relevant Molecular Descriptors/Features.**
\
**`r benchmark_importance_rf[1]` and `r benchmark_importance_rf[2]` are selected as benchmark set.**
\newpage
```{r overall accuracy accuracy_ccpc_benchmark, include=FALSE}
# max vars importance
most_importance_rf <-rownames(importance_rf_b$importance)[which.max(importance_rf_b$importance[,1])]
most_importance_rf
```
**R codes predict and compute overall accuracy using confusionMatrix function of caret package**
```{r overall accuracy predict_ccpc_benchmark, echo=TRUE}
# Compute Overall Accuracy for Molecular Descriptors/Feature Prioritization purpose
predict_ccpc_benchmark <- predict(train_ccpc_benchmark, test_set_x)
accuracy_ccpc_benchmark <- confusionMatrix(table(predict_ccpc_benchmark, test_set_y),mode="everything")
benchmark_accuracy <- accuracy_ccpc_benchmark$overall["Accuracy"]
```
**The Overall Accuracy of `r round(benchmark_accuracy,4)` will be adopted as benchmark Accuracy.**
## 3.3 Principal Component Analysis (PCA)
In the context of **Principal Component Analysis (PCA)** for Predicting **Core Clock Protein Cryptochrome (CRY1) Toxicity.** The **core_clock_protein_cry** dataset includes various **Molecular Descriptors** of **Molecules/Instances.** These descriptors are essential for distinguishing **Toxicity Class (Toxic or NonToxic)** of the **Molecules.**
The primary goal of performing **PCA** is to reduce the **dimensionality** of the dataset from **1203 Molecular Descriptors** while retaining the most crucial information. This process transforms the Descriptors into a set of linearly uncorrelated components, known as **Principal Components (PCs)**, that capture the maximum **Variance** in the dataset. The **core_clock_protein_cry** dataset undergoes **Scaling, Normalization and Standardization** before **PCA** is performed using the **prcomp** function from the Caret package.
The results are summarized and **visualized** to understand the **principal components** (e.g., **PC1, PC2, PC3,..**) and their contributions to explaining the **Variance** in the data. Including **Molecular Descriptors** in **PCA** helps identify patterns and relationships that are crucial for distinguishing **Toxicity Class (Toxic or NonToxic)** in the dataset.
### 3.3.1 core_clock_protein_cry set
**R Codes that create core_clock_protein_cry set for PCA**
```{r normalization core_clock_protein_cry, echo=TRUE}
##############################################################################
# Create core_clock_protein_cry set (1203 Molecular Descriptors) for PCA #
##############################################################################
# Transform df_ccpc_data to molecular_descriptors Matrix
molecular_descriptors <- df_ccpc_data %>% as.matrix()
# Transform df_ccpc_tox$Class to toxicity vector as factor
toxicity <- factor(df_ccpc_tox$Class)
# Generate list - core_clock_protein_cry
core_clock_protein_cry <- list(molecular_descriptors=molecular_descriptors,
toxicity=toxicity )
# core_clock_protein_cry
```
\newpage
**Number of Molecules/Instances in the Matrix**
```{r}
# Number of Molecules/Instances/Column are in the Matrix
nrow(core_clock_protein_cry$molecular_descriptors)
```
**Number of Molecular Descriptors/Features in the Matrix**
```{r}
# Number of Molecular Descriptors/Features/Predictors are in the Matrix
ncol(core_clock_protein_cry$molecular_descriptors)
```
**Proportion of Molecules/Instances are "Toxic"**
```{r}
# Proportion of the Molecules/Instances are "Toxic"
mean(core_clock_protein_cry$toxicity=="Toxic")
```
**Proportion of Molecules/Instances are "NonToxic"**
```{r}
# Proportion of the Molecules/Instances are "NonToxic"
mean(core_clock_protein_cry$toxicity=="NonToxic")
```
### 3.3.2 Scaling and Normalization
Scaling and Normalizing **core_clock_protein_cry** set , an essential step of **PCA** in making sure the **Model** can
Effectively learn Patterns.
A) Normalization
* Adjust **core_clock_protein_cry** set to a common **Scale** without distorting differences in the ranges of
values. This is especially important for **Molecular Descriptors/Features** with different units.
B) Standardization
* Transform the **core_clock_protein_cry** set to have a **Mean of Zero** and a **Standard Deviation of One.**
\newpage
**R Codes for Scaling and Normalizing core_clock_protein_cry set**
```{r scaling normalizing core_clock_protein_cry, echo=TRUE}
###########################################################
# Scaling and Normalizing "core_clock_protein_cry" set #
###########################################################
# Compute Mean of each Column
col_means <- colMeans(core_clock_protein_cry$molecular_descriptors, na.rm=TRUE)
# Compute Standard Deviations of each Column
col_sds <- colSds(core_clock_protein_cry$molecular_descriptors, na.rm=TRUE)
# Subtract Column Means
molecular_descriptors_centered <- sweep(core_clock_protein_cry$molecular_descriptors,
2, col_means, FUN="-")
# Divide by Column Standard Deviations
molecular_descriptors_scaled <- sweep(molecular_descriptors_centered,
2 , col_sds, FUN="/")
```
**List 6 Rows of scaled and normalized molecular_descriptors_scalad set**
```{r molecular_descriptors_scaled 170 to 176 column, echo=TRUE}
# List 6 rows of scaled and normalized molecular_descriptors_scalad set
head(molecular_descriptors_scaled[,170:176])
```
\newpage
### 3.3.3 Proportion of Variance (PCA)
To compute the **Proportion of Variance** explained by **PCA**, start by creating the Covariance **Matrix** to understand how different Features vary together. Next, use Eigenvalues and Eigenvectors to identify the **Principal Components.** Finally, select the **Principal Components** that capture the **most variance** to retain the most significant patterns in the data.
**R Codes that compute Proportion of Variance explained by PCA**
```{r pca propotion of variance, echo=TRUE}
######################################
# PCA - Proportion of variance #
######################################
# PCA is performed using the "prcomp" function
pca_result_core_clock_protein_cry <- prcomp(molecular_descriptors_scaled)
# Create Dataframe with column pca_result_core_clock_protein$x
pca_data_core_clock_protein_cry <- data.frame(pca_result_core_clock_protein_cry$x)
# Add toxicity Class to the Dataframe
pca_data_core_clock_protein_cry$toxicity <- core_clock_protein_cry$toxicity
# Compute Proportion of Variance (Eigenvalues and Eigenvectors)
proportions_var_ccpc <-pca_result_core_clock_protein_cry$sdev^2/sum(pca_result_core_clock_protein_cry$sdev^2)
#proportions_var_ccpc[1:42]
```
**R Codes that generate Proportion of Variance explained by First Principal Component (PC1)**
```{r PC1, echo=TRUE}
# Proportion of Variance explained by First Principal Component (PC1)
proportions_var_ccpc_first_pc <- proportions_var_ccpc[1]
proportions_var_ccpc_first_pc
```
**R Codes compute No of Principal Components required to explain at least 90% of Variance**
```{r PCs - 90 percent variance, echo=TRUE}
# Number of Principal Components are required to explain at least 90% of Variance
cumsum_proportions_var_ccpc <- cumsum(proportions_var_ccpc)
no_pc_90percent_var_ccpc <- which(cumsum_proportions_var_ccpc >= 0.9)[1]
```
\
**`r no_pc_90percent_var_ccpc` Principal Components are required to explain at least 90% of Variance**
\newpage