01-supp-addressing-data.Rmd

---
layout: page
title: Programming with R
subtitle: Addressing data
minutes: 20
---


```{r, include = FALSE}
source('tools/chunk-options.R')
```

> ## Learning Objectives {.objectives}
>
> * Understand the 3 different ways R can address data inside a data frame.
> * Combine different methods for addressing data with the assignment operator to update subsets of data

R is a powerful language for data manipulation. There are 3 main ways for addressing data inside R objects.

* By index (slicing)
* By logical vector
* By name (columns only)

Lets start by loading some sample data:

```{r readData}
dat<-read.csv(file='data/sample.csv',header=TRUE, stringsAsFactors=FALSE)
```

> ## Tip {.callout} 
>
> The first row of this csv file is a list of column names. We used the *header=TRUE* argument to `read.csv` so that R can interpret the file correctly.
> We are using the *stringsAsFactors=FALSE* argument to override the default behaviour for R. Using factors in R is covered in a separate lesson.

Lets take a look at this data.

```{r classDat}
class(dat)
```

R has loaded the contents of the .csv file into a variable called `dat` which is a `data frame`.

```{r dimDat}
dim(dat)
```

The data has 100 rows and 9 columns.

```{r headDat}
head(dat)
```

The data is the results of an (not real) experiment, looking at the number of aneurisms that formed in the eyes of patients who undertook 3 different treatments.

### Addressing by Index

Data can be accessed by index. We have already seen how square brackets `[` can be used to subset (slice) data. The generic format is `dat[row_numbers,column_numbers]`.

> ## Challenge - Selecting values 1 {.challenge}
>
> What will be returned by `dat[1,1]`?

```{r indexing1}
dat[1,1]
```

If we leave out a dimension R will interpret this as a request for all values in that dimension.

> ## Challenge - Selecting values 2 {.challenge}
>
> What will be returned by `dat[,2]`?

The colon `:` can be used to create a sequence of integers.

```{r colonOperator}
6:9
```

Creates a vector of numbers from 6 to 9.

This can be very useful for addressing data. 

> ## Challenge - Subsetting with sequences {.challenge}
> Use the colon operator to index just the aneurism count data (columns 6 to 9).

Finally we can use the `c()` (combine) function to address non-sequential rows and columns.

```{r nonsequential_indexing}
dat[c(1,5,7,9),1:5]
```

Returns the first 5 columns for patients in rows 1,5,7 & 9

> ## Challenge - Subsetting non-sequential data {.challenge}
> Return the Age and Gender values for the first 5 patients.

### Addressing by Name

Columns in an R data frame are named.

```{r column_names}
names(dat)
```

> ## Tip {.callout} 
>
> If names are not specified e.g. using `headers=FALSE` in a `read.csv()` function, R assigns default names `V1,V2,...,Vn`

We usually use the `$` operator to address a column by name

```{r named_addressing}
dat$Gender
```

Named addressing can also be used in square brackets.
```{r names_addressing_2}
head(dat[,c('Age','Gender')])
```

> ## Best Practice {.callout} 
>
> Best practice is to address columns by name, often you will create or delete columns and the column position will change.

### Logical Indexing

A logical vector contains only the special values `TRUE` & `FALSE`.

```{r logical_vectors}
c(TRUE,TRUE,FALSE,FALSE,TRUE)
```
> ## Tip {.callout} 
>
> Note the values `TRUE` and `FALSE` are all capital letters and are not quoted.

Logical vectors can be created using `relational operators` e.g. `<, >, ==, !=, %in%`.

```{r logical_vectors_example}
x<-c(1,2,3,11,12,13)
x < 10

x %in% 1:10
```

We can use logical vectors to select data from a data frame.

```{r logical_vectors_indexing}
index <- dat$Group == 'Control'
dat[index,]$BloodPressure
```

Often this operation is written as one line of code:

```{r logical_vectors_indexing2}
plot(dat[dat$Group=='Control',]$BloodPressure)
```

> ## Challenge - Using logical indexes {.challenge}
> 1. Create a scatterplot showing BloodPressure for subjects not in the control group.
> 2. How many ways are there to index this set of subjects?

### Combining Indexing and Assignment

The assignment operator `<-` can be combined with indexing.

```{r indexing and assignment}
x<-c(1,2,3,11,12,13)
x[x < 10] <- 0
x
```

> ## Challenge - Updating a subset of values {.challenge}
> In this dataset, values for Gender have been recorded as both uppercase `M, F` and lowercase `m,f`. 
> Combine the indexing and assignment operations to convert all values to lowercase.