-
Notifications
You must be signed in to change notification settings - Fork 6
/
Project1.Rmd
135 lines (93 loc) · 3.9 KB
/
Project1.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
---
title: "Project1"
author: Your Name
output: word_document
---
```{r setup, include=FALSE}
library(knitr)
knitr::opts_chunk$set(echo = TRUE)
# FIXME: change the input file to TCGA_breast_cancer_ERpositive_vs_ERnegative_PAM50.tsv
file <- "TCGA_breast_cancer_LumA_vs_Basal_PAM50.tsv"
first10 <- c('NAT1','BIRC5','BAG1','BCL2','BLVRA','CCNB1','CCNE1','CDC6','CDC20','CDH3')
```
## Assignment
Please paste the assignment here
## Reading data
Please add R code that reads data here -
reading file: `r file`
```{r reading_data, echo=FALSE}
system.time({
# important -- this makes sure our runs are consistent and reproducible
set.seed(0)
header <- scan(file, nlines = 1, sep="\t", what = character())
data <- read.table(file, skip = 2, header = FALSE, sep = "\t", quote = "", check.names=FALSE)
names(data) <- header
header2 <- scan(file, skip = 1, nlines = 1, sep="\t", what = character())
})
```
## Computation
Please add R code that computes the results
```{r computation, echo=FALSE}
cross_validation <- function (nfold, alg="centroid") {
# split each cancer type samples into nfold groups
LumA_groups <- split(sample(colnames(LumA)), 1+(seq_along(colnames(LumA)) %% nfold))
Basal_groups <- split(sample(colnames(Basal)), 1+(seq_along(colnames(Basal)) %% nfold))
result <- array()
# iterate from 1 to nfold groups -- to choose test group
for (test_group in 1:nfold) {
# return all samples in the chosen test group
testLumA <- LumA[,colnames(LumA) %in% unlist(LumA_groups[test_group])]
testBasal <- Basal[,colnames(Basal) %in% unlist(Basal_groups[test_group])]
# return all samples *not* in the chosen test group
trainingLumA <- LumA[,!(colnames(LumA) %in% unlist(LumA_groups[test_group]))]
trainingBasal <- Basal[,!(colnames(Basal) %in% unlist(Basal_groups[test_group]))]
if (alg == "centroid") {
# compute centroid for each cancer type -- mean for each gene based on all samples
# note -- rows are gene
centroidLumA <- rowMeans(trainingLumA)
centroidBasal <- rowMeans(trainingBasal)
# For each sample in the test set decide whether it will be classified
# distance from centroid Lum A: sum(abs(x-centroidLumA))
# distance from centroid Basal: sum(abs(x-centroidBasal))
# distance is a sum of distances over all genes
# misclassification if when the distance is greater from centroid associated with known result
misclassifiedLumA <<- sum(sapply(testLumA, function(x) { sum(abs(x-centroidLumA))>sum(abs(x-centroidBasal)) }))
misclassifiedBasal <<- sum(sapply(testBasal, function(x) { sum(abs(x-centroidLumA))<sum(abs(x-centroidBasal)) }))
}
if (alg == "GLM") {
# FIXME: ADD GLM
misclassifiedLumA <<- 0
misclassifiedBasal <<- 0
}
result[test_group] <- (misclassifiedLumA+misclassifiedBasal)/(ncol(testLumA)+ncol(testBasal))
}
# c(mean(result), sd(result))
paste("mean=",mean(result),"sd=",sd(result))
}
system.time({
# FIXME: the data file has ER "Positive" / "Negative" instead of cancer type
# FIXME: gene id field is labeled data$id instead of data$sample_id
LumA <- data[,header2=='Luminal A']
Basal <- data[,header2=='Basal-like']
kNNC_5_all <- cross_validation(nfold=5)
GLM_5_all <- cross_validation(nfold=5, alg="GLM")
kNNC_10_all <- cross_validation(nfold=20)
GLM_10_all <- cross_validation(nfold=10, alg="GLM")
# FIXME: add code to compute 10 fold cross validation for GLM and kNNC
})
```
Note that the `echo = FALSE` parameter was added to the code chunk to prevent printing of the R code.
## Results
These are our results:
### 5-fold cross validation
```{r results5, echo=FALSE}
x<-data.frame("GLM"=c(GLM_5_all,GLM_10_all),"kNNC"=c(kNNC_5_all,kNNC_10_all))
rownames(x) <- c("5-fold","10-fold")
kable(x)
```
## Discussion
This is what I found out
# Part 2
Change eval=TRUE when ready to include Project1fs.R
```{r part2, eval = FALSE, code=readLines("Project1fs.R"), echo=FALSE}
```