---
layout: page
title: Programming with R
subtitle: Addressing data
minutes: 20
---
```{r, include = FALSE}
source('tools/chunk-options.R')
```
> ## Learning Objectives {.objectives}
>
> * Understand the 3 different ways R can address data inside a data frame.
> * Combine different methods for addressing data with the assignment operator to update subsets of data.

R is a powerful language for data manipulation. There are 3 main ways of addressing data inside R objects:
* By index (slicing)
* By logical vector
* By name (columns only)

Let's start by loading some sample data:
```{r readData}
dat <- read.csv(file = 'data/sample.csv', header = TRUE, stringsAsFactors = FALSE)
```
> ## Tip {.callout}
>
> The first row of this CSV file is a list of column names. We used the `header=TRUE` argument to `read.csv` so that R can interpret the file correctly.
> We also pass `stringsAsFactors=FALSE` so that text columns are read as character vectors rather than factors. This overrides the default in versions of R before 4.0.0; from R 4.0.0 onwards `FALSE` is already the default. Using factors in R is covered in a separate lesson.

Let's take a look at this data.
```{r classDat}
class(dat)
```
R has loaded the contents of the .csv file into a variable called `dat`, which is a data frame (class `data.frame`).
```{r dimDat}
dim(dat)
```
The data has 100 rows and 9 columns.
```{r headDat}
head(dat)
```
The data are the results of a fictional experiment that looked at the number of aneurisms that formed in the eyes of patients who underwent 3 different treatments.
### Addressing by Index
Data can be accessed by index. We have already seen how square brackets `[ ]` can be used to subset (slice) data. The general form is `dat[row_numbers, column_numbers]`.
> ## Challenge - Selecting values 1 {.challenge}
>
> What will be returned by `dat[1,1]`?
```{r indexing1}
dat[1,1]
```
If we leave out a dimension, R will interpret this as a request for all values in that dimension.
> ## Challenge - Selecting values 2 {.challenge}
>
> What will be returned by `dat[,2]`?

The colon `:` can be used to create a sequence of integers.
```{r colonOperator}
6:9
```
This creates a vector of integers from 6 to 9, which can be very useful for addressing data.
> ## Challenge - Subsetting with sequences {.challenge}
> Use the colon operator to index just the aneurism count data (columns 6 to 9).

Finally, we can use the `c()` (combine) function to address non-sequential rows and columns.
```{r nonsequential_indexing}
dat[c(1,5,7,9),1:5]
```
This returns the first 5 columns for the patients in rows 1, 5, 7 and 9.
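
The index forms covered in this section can be compared side by side. This is a minimal sketch on a small made-up data frame: `toy` and its columns are hypothetical and are not part of `sample.csv`.

```{r toy_index_sketch}
# Hypothetical toy data, invented for illustration only
toy <- data.frame(
  id    = c("a", "b", "c", "d"),
  score = c(10, 20, 30, 40),
  age   = c(31, 42, 53, 64),
  stringsAsFactors = FALSE
)

toy[2, 3]              # single value: row 2, column 3
toy[2, ]               # row dimension given, column left out: all columns of row 2
toy[, 2]               # column dimension given, row left out: all rows of column 2
toy[1:2, 2:3]          # a sequence of rows and a sequence of columns
toy[c(1, 4), c(1, 3)]  # non-sequential rows and columns
```

Note that selecting a single column this way gives back a plain vector, while selecting rows gives back a data frame.
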
> ## Challenge - Subsetting non-sequential data {.challenge}
> Return the Age and Gender values for the first 5 patients.
### Addressing by Name
Columns in an R data frame are named.
```{r column_names}
names(dat)
```
> ## Tip {.callout}
>
> If names are not specified, e.g. when using `header=FALSE` in a `read.csv()` call, R assigns the default names `V1, V2, ..., Vn`.
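
To see the effect of the `header` argument without needing a file on disk, the sketch below reads CSV text directly through the `text` argument of `read.csv()`. The CSV content and the variable names are made up for illustration.

```{r toy_header_sketch}
# Hypothetical CSV content: one header line followed by two data lines
csv_text <- "Age,Group\n25,Control\n31,Treated"

with_header    <- read.csv(text = csv_text, header = TRUE,  stringsAsFactors = FALSE)
without_header <- read.csv(text = csv_text, header = FALSE, stringsAsFactors = FALSE)

names(with_header)     # names taken from the first line
names(without_header)  # default names V1, V2
nrow(without_header)   # the header line is now counted as a data row
```

With `header = FALSE` the first line is treated as data, so the result has one extra row.
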

We usually use the `$` operator to address a column by name:
```{r named_addressing}
dat$Gender
```
Named addressing can also be used in square brackets.
```{r names_addressing_2}
head(dat[,c('Age','Gender')])
```
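
The different ways of addressing a column by name return slightly different things. The following sketch uses a made-up data frame (`toy_named` is hypothetical) and also shows the double-bracket form `[[ ]]`, which is handy when the column name is stored in a variable.

```{r toy_name_sketch}
# Hypothetical toy data, invented for illustration only
toy_named <- data.frame(
  id    = c("a", "b", "c"),
  score = c(10, 20, 30),
  stringsAsFactors = FALSE
)

toy_named$score       # a vector
toy_named[["score"]]  # the same vector, with the name given as a string
toy_named[, "score"]  # the same vector, using row/column brackets
toy_named["score"]    # a data frame with one column

col_name <- "score"   # the name can also come from a variable
toy_named[[col_name]]
```
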
> ## Best Practice {.callout}
>
> Best practice is to address columns by name. You will often create or delete columns, which changes the position of the remaining columns but not their names.
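
A short sketch of why positions are fragile, using a made-up data frame (`toy_cols` is hypothetical): once a column is deleted, the same position refers to a different column or to none at all, while the name keeps working.

```{r toy_position_vs_name}
# Hypothetical toy data, invented for illustration only
toy_cols <- data.frame(
  id    = c("a", "b", "c"),
  score = c(10, 20, 30),
  age   = c(31, 42, 53),
  stringsAsFactors = FALSE
)

toy_cols[, 3]           # position 3 is the 'age' column
toy_cols$score <- NULL  # delete the 'score' column
names(toy_cols)         # 'age' has moved to position 2
toy_cols[, "age"]       # addressing by name still returns the ages
# toy_cols[, 3] would now fail with an "undefined columns selected" error
```
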
### Logical Indexing
A logical vector contains only the special values `TRUE` and `FALSE`.
```{r logical_vectors}
c(TRUE,TRUE,FALSE,FALSE,TRUE)
```
> ## Tip {.callout}
>
> Note that the values `TRUE` and `FALSE` are written in all capital letters and are not quoted.

Logical vectors can be created using relational operators such as `<`, `>`, `==` and `!=`, or with the matching operator `%in%`.
```{r logical_vectors_example}
x <- c(1, 2, 3, 11, 12, 13)
x < 10
x %in% 1:10
```
We can use logical vectors to select data from a data frame.
```{r logical_vectors_indexing}
index <- dat$Group == 'Control'
dat[index,]$BloodPressure
```
Often this operation is written as one line of code, here passed straight to `plot()`:
```{r logical_vectors_indexing2}
plot(dat[dat$Group=='Control',]$BloodPressure)
```
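
The steps involved in logical indexing can be followed on a small made-up data frame (`toy_groups` and its values are hypothetical). The comparison produces one `TRUE` or `FALSE` per row, and only the rows marked `TRUE` are kept.

```{r toy_logical_sketch}
# Hypothetical toy data, invented for illustration only
toy_groups <- data.frame(
  group = c("Control", "Treated", "Control", "Treated"),
  bp    = c(120, 135, 118, 142),
  stringsAsFactors = FALSE
)

is_control <- toy_groups$group == "Control"
is_control                    # one TRUE/FALSE value per row
toy_groups[is_control, ]      # only the rows where is_control is TRUE
toy_groups[is_control, "bp"]  # the bp values for those rows
sum(is_control)               # TRUE counts as 1, so this counts the matching rows
```
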
> ## Challenge - Using logical indexes {.challenge}
> 1. Create a scatterplot showing BloodPressure for subjects not in the control group.
> 2. How many ways are there to index this set of subjects?
### Combining Indexing and Assignment
The assignment operator `<-` can be combined with indexing.
```{r indexing_and_assignment}
x <- c(1, 2, 3, 11, 12, 13)
x[x < 10] <- 0
x
```
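
The same pattern works on a column of a data frame. In this sketch the data frame `toy_ages` is made up, and the use of `-1` as a placeholder for a missing age is an assumption chosen purely for illustration.

```{r toy_assignment_sketch}
# Hypothetical toy data: -1 is assumed to mark a missing age
toy_ages <- data.frame(
  id  = c("a", "b", "c", "d"),
  age = c(34, -1, 51, -1),
  stringsAsFactors = FALSE
)

# Select the placeholder values with a logical vector and overwrite them
toy_ages$age[toy_ages$age == -1] <- NA
toy_ages
```

Only the elements selected by the logical vector are changed. All other values in the column are left as they were.
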
> ## Challenge - Updating a subset of values {.challenge}
> In this dataset, values for Gender have been recorded as both uppercase (`M`, `F`) and lowercase (`m`, `f`).
> Combine the indexing and assignment operations to convert all values to lowercase.