Project1.Rmd

---
title: "Project1"
author: Your Name
output: word_document
---

```{r setup, include=FALSE}
library(knitr)
knitr::opts_chunk$set(echo = TRUE)

# FIXME: change the input file to TCGA_breast_cancer_ERpositive_vs_ERnegative_PAM50.tsv

file <- "TCGA_breast_cancer_LumA_vs_Basal_PAM50.tsv"
first10 <- c('NAT1','BIRC5','BAG1','BCL2','BLVRA','CCNB1','CCNE1','CDC6','CDC20','CDH3')
```

## Assignment

Please paste the assignment here

## Reading data

Please add R code that reads data here - 
reading file: `r file`

```{r reading_data, echo=FALSE}
system.time({
# important -- this makes sure our runs are consistent and reproducible
set.seed(0)

header <- scan(file, nlines = 1, sep="\t", what = character())
data <- read.table(file, skip = 2, header = FALSE, sep = "\t", quote = "", check.names=FALSE)
names(data) <- header

header2 <- scan(file, skip = 1, nlines = 1, sep="\t", what = character())
})
```

## Computation

Please add R code that computes the results

```{r computation, echo=FALSE}
cross_validation <- function (nfold, alg="centroid") {

  # split each cancer type samples into nfold groups
  LumA_groups <- split(sample(colnames(LumA)), 1+(seq_along(colnames(LumA)) %% nfold))
  Basal_groups <- split(sample(colnames(Basal)), 1+(seq_along(colnames(Basal)) %% nfold))

  result <- array()

  # iterate from 1 to nfold groups -- to choose test group
  for (test_group in 1:nfold) {

    # return all samples in the chosen test group
    testLumA <- LumA[,colnames(LumA) %in% unlist(LumA_groups[test_group])]
    testBasal <- Basal[,colnames(Basal) %in% unlist(Basal_groups[test_group])]

    # return all samples *not* in the chosen test group 
    trainingLumA <- LumA[,!(colnames(LumA) %in% unlist(LumA_groups[test_group]))]
    trainingBasal <- Basal[,!(colnames(Basal) %in% unlist(Basal_groups[test_group]))]

    if (alg == "centroid") {
       # compute centroid for each cancer type -- mean for each gene based on all samples
       # note -- rows are gene
       centroidLumA <- rowMeans(trainingLumA)
       centroidBasal <- rowMeans(trainingBasal)

       # For each sample in the test set decide whether it will be classified
       # distance from centroid Lum A: sum(abs(x-centroidLumA))
       # distance from centroid Basal: sum(abs(x-centroidBasal))
       # distance is a sum of distances over all genes 
       # misclassification if when the distance is greater from centroid associated with known result
       misclassifiedLumA <<- sum(sapply(testLumA, function(x) { sum(abs(x-centroidLumA))>sum(abs(x-centroidBasal)) }))
       misclassifiedBasal <<- sum(sapply(testBasal, function(x) { sum(abs(x-centroidLumA))<sum(abs(x-centroidBasal)) }))
    }

    if (alg == "GLM") {
      # FIXME: ADD GLM
      misclassifiedLumA <<- 0
      misclassifiedBasal <<- 0
    }

    result[test_group] <- (misclassifiedLumA+misclassifiedBasal)/(ncol(testLumA)+ncol(testBasal))
 }

 # c(mean(result), sd(result))
 paste("mean=",mean(result),"sd=",sd(result))
}


system.time({

  # FIXME: the data file has ER "Positive" / "Negative" instead of cancer type
  # FIXME: gene id field is labeled data$id instead of data$sample_id
  
LumA <- data[,header2=='Luminal A']
Basal <- data[,header2=='Basal-like']

kNNC_5_all <- cross_validation(nfold=5)
GLM_5_all <- cross_validation(nfold=5, alg="GLM")

kNNC_10_all <- cross_validation(nfold=20)
GLM_10_all <- cross_validation(nfold=10, alg="GLM")

# FIXME: add code to compute 10 fold cross validation for GLM and kNNC

})

```

Note that the `echo = FALSE` parameter was added to the code chunk to prevent printing of the R code.

## Results

These are our results:

### 5-fold cross validation
```{r results5, echo=FALSE}
x<-data.frame("GLM"=c(GLM_5_all,GLM_10_all),"kNNC"=c(kNNC_5_all,kNNC_10_all))
rownames(x) <- c("5-fold","10-fold")
kable(x)
```


## Discussion

This is what I found out

# Part 2

Change eval=TRUE when ready to include Project1fs.R

```{r part2, eval = FALSE, code=readLines("Project1fs.R"), echo=FALSE}
```