Session1.3-walkthrough.Rmd

---
title: "Introduction to Solving Biological Problems Using R - Day 1"
author: Mark Dunning, Suraj Menon and Aiora Zabala. Original material by Robert Stojnić,
  Laurent Gatto, Rob Foy, John Davey, Dávid Molnár and Ian Roberts
date: '`r format(Sys.time(), "Last modified: %d %b %Y")`'
output:
  html_notebook:
    toc: yes
    toc_float: yes
---

# 3. R for data analysis

## 3 steps to Basic Data Analysis

- In this short section, we show how the data manipulation steps we have just seen can be used as part of an analysis pipeline:

1. Reading in data
    + `read.table()`
    + `read.csv(), read.delim()`
2. Analysis
    + Manipulating & reshaping the data
        + perhaps dealing with "missing data"
    + Any maths you like
    + Diagnostic Plots
3. Writing out results
    + `write.table()`
    + `write.csv()`
  
## A simple walkthrough

- We have data from 100 patients who have given consent for their data to be used in future studies
- A researcher wants to undertake a study involving people who are overweight
- We will walk through how to filter the data and write a new file with the candidates for the study
    
## The Working Directory (wd)


- Like many programs, R has a concept of a working directory
- It is the place where R will look for files to execute and where it will save files, by default
- For this course we need to set the working directory to the location of the course scripts
- In RStudio, use the mouse to browse to the directory where you saved the Course Materials

- ***Session → Set Working Directory → Choose Directory...***
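
The same thing can be done from the console with `setwd()`. A minimal sketch, assuming the materials live in a folder called `~/Course_Materials` (a made-up path; substitute your own):

```{r}
# Hypothetical location of the course materials: replace with your own path
course.dir <- "~/Course_Materials"

# Only change directory if the folder really exists, to avoid an error
if (dir.exists(course.dir)) {
  setwd(course.dir)
}

# Confirm where R is now looking for files
getwd()
```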

## 0. Locate the data

Before we even start the analysis, we need to be sure of where the data are located on our hard drive.

- Functions that import data need a file location as a character vector
- The default location is the ***working directory***
```{r}
getwd()
```

- If the file you want to read is in your working directory, you can just use the file name

```{r eval=FALSE}
list.files()
```

- The `file.exists` function does exactly what it says on the tin!
    + a good sanity check for your code

```{r}
file.exists("patient-info.txt")
```

- Otherwise you need the *path* to the file
    + you can get this using **`file.choose()`**
    
- If you are unsure about specifying a file path at the command line, this [online tutorial](http://rik.smith-unna.com/command_line_bootcamp/?id=vczhybjhtyt) will give you hands-on practice
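
One way to avoid typing path separators by hand is `file.path()`, which joins directory and file names. The sketch below assumes, purely for illustration, that the file sits in a sub-directory called `data`; that folder name is not part of the course materials:

```{r}
# "data" is a hypothetical sub-directory, used only to show how a path is built
data.file <- file.path("data", "patient-info.txt")
data.file

# TRUE only if the file really is at that location
file.exists(data.file)

# Paths can also be split back into their parts
basename(data.file)
dirname(data.file)
```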
    
## 1. Read in the data

- The data are a tab-delimited file. Each row is a record, each column is a field. Columns are separated by tabs in the text
- We need to read in the data and assign them to an object (`patients`)

```{r}
patients <- read.delim("patient-info.txt")

```

In the latest RStudio, there is the option to import data directly from the File menu. ***File*** -> ***Import Dataset*** -> ***From Csv***

- If the data are comma-separated, then either pass the argument `sep=","` to `read.delim()` or use the function `read.csv()`
- You need to make sure you use the correct function
    + can you explain the output of the following lines of code?

```{r }
tmp <- read.csv("patient-info.txt")
head(tmp)
```
- For the full list of arguments:
```{r}
?read.table
```
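
`read.delim()` and `read.csv()` are convenience wrappers around `read.table()` with different defaults. As a rough illustration, the sketch below writes a tiny made-up table (not the course data) to a temporary file and reads it back both ways:

```{r}
# Create a small tab-delimited file in a temporary location (invented values)
tmp.file <- tempfile(fileext = ".txt")
writeLines(c("ID\tHeight\tWeight",
             "P1\t170\t65",
             "P2\t182\t90"), tmp.file)

# read.delim() assumes a header line and tab separators...
a <- read.delim(tmp.file)

# ...which is what we would otherwise have to spell out for read.table()
b <- read.table(tmp.file, header = TRUE, sep = "\t")

# Check whether the two data frames match
all.equal(a, b)
```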

## 1b. Check the data
- *Always* check the object to make sure the contents and dimensions are as you expect
- R will sometimes create the object without error, but the contents may be un-usable for analysis
    + If you specify an incorrect separator, R will not be able to locate the columns in your data, and you may end up with an object with just one column
    
```{r}
# View the first 10 rows to ensure import is OK
patients[1:10,]  
```


- or use the `View()` function to get a display of the data in RStudio:
```{r}
View(patients)
```
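
These checks can also be written as code, so that a script stops early if the import went wrong. A small sketch using made-up data supplied through the `text` argument rather than a file; the expected column count of 3 is specific to this toy example:

```{r}
# Three invented records, tab-separated
toy.text <- "ID\tHeight\tWeight\nP1\t170\t65\nP2\t182\t90\nP3\tNA\t72"

good <- read.delim(text = toy.text)  # correct separator
bad  <- read.csv(text = toy.text)    # wrong separator

dim(good)  # 3 rows, 3 columns
dim(bad)   # 3 rows, but only 1 column

# str() shows the type of each column as well as the dimensions
str(good)

# stopifnot() raises an error if the condition is not TRUE
stopifnot(ncol(good) == 3)
```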

## 1c. Understanding the object

- Once we have read the data successfully, we can start to interact with it
- The object we have created is a *data frame*:
```{r}
class(patients)
```

- We can query the dimensions:

```{r}
ncol(patients)
nrow(patients)
dim(patients)
```


- The names of the columns are taken automatically from the header line of the file:

```{r}
colnames(patients)
```

- We can use any of these names to access a particular column:
    + and create a vector
    + TOP TIP: type the name of the object followed by `$` and hit TAB: you can select the column from the drop-down list!
```{r}
patients$ID

```
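
The `$` operator is one of several equivalent ways to pull out a column. A short sketch on a made-up data frame (`toy` is not part of the course data):

```{r}
toy <- data.frame(ID = c("P1", "P2", "P3"),
                  Height = c(170, 182, 165))

toy$Height          # by name, using $
toy[["Height"]]     # by name, using double brackets
toy[, "Height"]     # by name, using row/column indexing
toy[, 2]            # by position

# All four give the same numeric vector
identical(toy$Height, toy[, 2])
```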

## Word of warning


![](images/tolstoy.jpg)


![](images/hadley.jpg)

> Like families, tidy datasets are all alike but every messy dataset is messy in its own way - (Hadley Wickham - RStudio chief scientist and author of dplyr, ggplot2 and others)

You will make your life a lot easier if you keep your data **tidy** and ***organised***. Before blaming R, consider if your data are in a suitable form for analysis. The more manual manipulation you have done on the data (highlighting, formulas, copy-and-pasting), the less happy R is going to be to read it. Here are some useful links on some common pitfalls and how to avoid them:

- http://www.datacarpentry.org/spreadsheet-ecology-lesson/
- http://kbroman.org/dataorg/

## Handling missing values

- The data frame contains some **`NA`** values, which means the values are missing – a common occurrence in real data collection
- `NA` is a special value that can be present in objects of any type (logical, character, numeric etc)
- `NA` is not the same as `NULL`:
    - `NULL` is an empty R object. 
    - `NA` is one missing value within an R object (like a data frame or a vector)
- Often R functions will accept `NA`s without complaint, although the result may itself be `NA`:

```{r}
length(patients$Height)
mean(patients$Height)
```

- However, sometimes we have to tell the functions what to do with them. 
- R has some built-in functions for dealing with `NA`s, and functions often have their own arguments (like `na.rm`) for handling them:
    + annoyingly, different functions have different argument names to change their behaviour with regards to `NA` values. *Always check the documentation*

```{r}
mean(patients$Height, na.rm = TRUE)

mean(na.omit(patients$Height))
```
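
To see how many values are missing, and where, `is.na()` is a useful companion to `na.rm`. A minimal sketch on an invented vector:

```{r}
heights <- c(170, NA, 182, 165)

is.na(heights)          # logical vector marking the missing entries
sum(is.na(heights))     # number of missing values
which(is.na(heights))   # position(s) of the missing values

mean(heights)                # NA, because one value is missing
mean(heights, na.rm = TRUE)  # mean of the three observed values

# NA still occupies a slot in the vector; NULL does not
length(NA)
length(NULL)
```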

## 2. Analysis (reshaping data and maths)

- Our analysis involves identifying patients with extreme BMI
    + we will define this as being more than two standard deviations above the mean

```{r}
# Calculate BMI and the upper cut-off (mean + 2 standard deviations):
BMI <- (patients$Weight)/((patients$Height/100)^2)
upper.limit <- mean(BMI,na.rm = TRUE) + 2*sd(BMI,na.rm = TRUE)
upper.limit
```
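
As a quick sanity check of the formula, here is the same calculation on three invented patients, assuming (as the division by 100 suggests) that height is recorded in centimetres and weight in kilograms:

```{r}
# Made-up values for illustration only
weight.kg <- c(70, 95, 54)
height.cm <- c(175, 168, 160)

# The arithmetic is applied element by element across the vectors
toy.bmi <- weight.kg / (height.cm / 100)^2
round(toy.bmi, 1)
# → 22.9 33.7 21.1

# The cut-off is a single number computed from the whole vector
mean(toy.bmi) + 2 * sd(toy.bmi)
```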


- We can plot a simple chart of the BMI values 
    + add a horizontal line to indicate the cut-off
    + plotting will be covered in detail shortly

```{r}
plot(BMI)
# Add a horizontal line:
abline(h=upper.limit) 
```

- It is also useful to save the variable we have computed as a new column in the data frame

```{r}
round(BMI,1)
patients$BMI <- round(BMI,1)
head(patients)
```

- To actually select the candidates we can use a logical expression to test the values of the BMI vector being greater than the upper limit
    + if the second line looks a bit weird, remember that `<-` is doing an assignment. The value we are assigning to our new variable is the logical (`TRUE` or `FALSE`) vector given by testing each item in `BMI` against the `upper.limit`
    
```{r}
BMI > upper.limit
candidates <- BMI > upper.limit
```

We have seen that a logical vector can be used to subset a data frame

- However, in our case the result looks a bit funny
- Can you think why this might be?

```{r}
patients[candidates,]
```

The `which` function will take a logical vector and return the indices of the `TRUE` values

- This can then be used to subset the data frame

```{r}
which(BMI > upper.limit)
candidates <- which(BMI > upper.limit)
```
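
The difference between the two approaches is easiest to see on a small made-up example where one BMI value is missing (the data frame `toy` and the cut-off of 30 are invented for illustration):

```{r}
toy <- data.frame(ID  = c("P1", "P2", "P3", "P4"),
                  BMI = c(22.9, NA, 35.2, 31.0))
cutoff <- 30

# Comparing NA with anything gives NA, not TRUE or FALSE
keep <- toy$BMI > cutoff
keep

# Subsetting with the logical vector returns a row of NAs for the NA entry
toy[keep, ]

# which() drops the NA and keeps only the positions that are TRUE
which(keep)
toy[which(keep), ]
```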


## 3. Outputting the results

- We write out a data frame of candidates (patients with BMI more than two standard deviations above the mean) as a 'comma separated values' text file (CSV):

```{r}
write.csv(patients[candidates,], file="selectedSamples.csv")
```

- The output file is directly readable by Excel
- It's often helpful to double check where the data have been saved. Use the *get working directory* function:

```{r eval=FALSE}
getwd()      # print working directory
list.files() # list files in working directory

```
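
By default `write.csv()` also writes the row names as an unnamed first column. If that is not wanted, `row.names = FALSE` switches it off. A sketch using a temporary file and invented data, reading the file back to confirm what was written:

```{r}
toy <- data.frame(ID = c("P3", "P4"), BMI = c(35.2, 31.0))

# tempfile() gives a throw-away location, so nothing in the course folder is touched
out.file <- tempfile(fileext = ".csv")
write.csv(toy, file = out.file, row.names = FALSE)

# Read the file back in and check its shape and column names
check <- read.csv(out.file)
dim(check)
colnames(check)
```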


To recap, the set of R commands we have used is:

```{r}
patients <- read.delim("patient-info.txt")
BMI <- (patients$Weight)/((patients$Height/100)^2)
upper.limit <- mean(BMI,na.rm = TRUE) + 2*sd(BMI,na.rm = TRUE)
plot(BMI)
# Add a horizontal line:
abline(h=upper.limit) 
patients$BMI <- round(BMI,1)
candidates <- which(BMI > upper.limit)
write.csv(patients[candidates,], file="selectedSamples.csv")

```

## Exercise: Exercise 3

- A separate study is looking for patients that are underweight and also smoke:
  + Modify the condition in our previous code to find these patients
  + e.g. having a BMI that is more than 2 standard deviations *less* than the mean BMI
  + Write out a results file of the samples that match these criteria, and open it in a spreadsheet program


```{r}
### Your Answer Here ### 


```