---
title: "Extra Data, Use-cases and Exercises"
author: "Mark Dunning"
date: '`r format(Sys.time(), "Last modified: %d %b %Y")`'
output:
  html_notebook:
    toc: yes
    toc_float: yes
    css: stylesheets/styles.css
editor_options:
  chunk_output_type: inline
---
<img src="images/logo-sm.png" style="position:absolute;top:40px;right:10px;" width="200" />
# The COSMIC Dataset
We will illustrate some of the main functions from tidyverse with example data from [COSMIC](https://cancer.sanger.ac.uk/cosmic). An example file is freely available, but you will need to register for the full-size file.
`download.file` is an R *function* for downloading a file into your *working directory* (the directory where R is currently looking to open files from, and save files to).
```{r eval=FALSE}
## create a more_data folder
dir.create("more_data",showWarnings = FALSE)
download.file("https://github.com/sheffield-bioinformatics-core/r-online/raw/master/more_data/CosmicMutantExport.tsv",destfile = "more_data/CosmicMutantExport.tsv")
```
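If you are unsure where the file has ended up, a minimal check along the following lines may help. It uses base R only and assumes the `more_data` folder and file name from the chunk above; `cosmic_path` is just an illustrative variable name.

```{r}
## print the current working directory
getwd()
## build the path to the downloaded file, relative to the working directory
cosmic_path <- file.path("more_data", "CosmicMutantExport.tsv")
cosmic_path
## TRUE if the file is present, FALSE otherwise
file.exists(cosmic_path)
```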
## Exercise 1
<div class="exercise">
- Which function from `readr` would you use to read the file `CosmicMutantExport.tsv` into R?
- Import the file `CosmicMutantExport.tsv` into R
- How many rows and columns are in the data frame you create?
</div>
```{r echo=FALSE}
library(readr)
cosmic <- readr::read_tsv("more_data/CosmicMutantExport.tsv")
cosmic
```
## Column names containing spaces
When using `select` we can use the column name without quotation marks to print that column. Column names that contain a space present a bit of an issue for R:-
```{r eval=FALSE}
library(dplyr)
select(cosmic, Gene name)
```
This can be avoided by putting the desired column name inside backticks.
```{r}
library(dplyr)
select(cosmic, `Gene name`)
```
```{r}
select(cosmic, `Gene name`, `Accession Number`)
```
```{r}
select(cosmic, `Gene name`:`Sample name`)
```
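The same idea can be tried on a small, made-up table. The sketch below assumes nothing about the real COSMIC file beyond two of its column names; the values and the `toy` object are invented for illustration. It also shows two alternatives to backticks: passing the name as a quoted string, and using `rename` to give the column a name without spaces.

```{r}
library(dplyr)
## a made-up table; only the column names mimic the COSMIC file
toy <- tibble(`Gene name` = c("GENEA", "GENEB"),
              `Sample name` = c("S1", "S2"),
              Age = c(41, 67))
## backticks and a quoted string select the same column
with_backticks <- select(toy, `Gene name`)
with_string <- select(toy, "Gene name")
identical(with_backticks, with_string)
## rename gives the column a name without spaces (new_name = old_name)
toy_renamed <- rename(toy, gene_name = `Gene name`)
names(toy_renamed)
```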
## Exercise 2
Practice using the `select` function to identify particular columns in the dataset. Remember some of the "helper functions" that are available when using `select`.
<div class="exercise">
- Select the first five, and the last five columns from the table
- Select the columns that start with "Mutation"
- Select all columns that contain information on the tumour Histology. There should be four columns.
</div>
## Exercise 3
The `filter` function can be used to restrict a data frame to particular rows of interest.
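As a reminder of the syntax, here is a small sketch on invented data. The column names `site` and `age` are hypothetical and are not the column names used in the COSMIC file.

```{r}
library(dplyr)
## made-up data; column names and values are for illustration only
toy <- tibble(site = c("breast", "lung", "prostate", "skin"),
              age = c(45, 62, 71, 38))
## keep rows where site matches any one of several values
in_sites <- filter(toy, site %in% c("breast", "prostate"))
in_sites
## keep rows that satisfy a numeric condition
under_50 <- filter(toy, age < 50)
under_50
```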
<div class="exercise">
Use `filter` to restrict the rows of the data to:
- Mutations that occur in the breast or prostate
- Mutations that occur in patients under 50
</div>
## Separating columns
You might have noticed that the first column of the data frame comprises both the name of a gene and its identifier in the [Ensembl database](https://www.ensembl.org/index.html). Representing the data in such a way makes analyses such as counting the number of mutations in a particular gene more complicated than they should be.
A function to perform this kind of data cleaning can be found in the `tidyr` package.
```{r}
## check if tidyr package is installed, and install if it isn't
if(!require("tidyr")) install.packages("tidyr")
```
The `separate` function takes a data frame as its first argument and will split a named column into its component parts. The names of the new columns can be specified with the `into` argument. The separator character (in our case `_`) can also be specified with the `sep` argument, although this is often unnecessary because by default `separate` splits wherever it finds non-alphanumeric characters.
```{r}
library(tidyr)
separate(cosmic, `Gene name`, into=c("SYMBOL","ENSEMBL"))
```
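To see what `separate` does without the full dataset, the following sketch uses a made-up data frame whose `gene` column mimics the "symbol, underscore, identifier" layout. The identifiers are invented. It compares an explicit `sep = "_"` with the default behaviour.

```{r}
library(tidyr)
## made-up values in the same SYMBOL_IDENTIFIER layout
toy <- data.frame(gene = c("GENEA_ENST0001", "GENEB_ENST0002"))
## state the separator explicitly
split_explicit <- separate(toy, gene, into = c("SYMBOL", "ENSEMBL"), sep = "_")
split_explicit
## rely on the default, which splits on non-alphanumeric characters
split_default <- separate(toy, gene, into = c("SYMBOL", "ENSEMBL"))
identical(split_explicit, split_default)
```

Being explicit about `sep` is the safer choice if some values could contain other punctuation, such as a hyphen, that the default would also split on.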
As usual, we will need to assign the result to a variable if we want to keep this new version of the data.
```{r}
cosmic <- separate(cosmic,
`Gene name`,
into=c("SYMBOL","ENSEMBL"))
```
Now we can make a barplot of how many mutations are found for each gene.
```{r}
library(ggplot2)
ggplot(cosmic, aes(x = SYMBOL)) + geom_bar()
```
## Exercise 4
<div class="exercise">
- Is the age distribution of patients that have mutations in `NCOR2` or `SELP` different? Use a violin plot to find out.
- Use the `count` function to find the number of mutations in each sample type for the two genes
- Visualise these counts using a `geom_tile` arrangement (see below for example plot)
</div>
```{r echo=FALSE}
count(cosmic, SYMBOL, `Primary site`) %>%
ggplot(aes(x = SYMBOL, y=`Primary site`, fill= n)) + geom_tile()
```
## Exercise 5
<div class="exercise">
- Separate the position of the mutation into chromosome, start and end columns
- Arrange the rows according to chromosome and start position
</div>
# Clinical data from the TCGA project
This exercise concerns the clinical descriptions of tumours from The Cancer Genome Atlas. It was previously downloaded from [GEO](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE62944) and has undergone some minor alterations.
## Part 1
The data can be downloaded using the following code:-
```{r}
## create a more_data folder
dir.create("more_data",showWarnings = FALSE)
## will check if the file exists before downloading
if(!file.exists("more_data/tcga_clinical.tsv")) download.file("https://github.com/sheffield-bioinformatics-core/r-online/raw/master/more_data/tcga_clinical.tsv", destfile = "more_data/tcga_clinical.tsv")
```
<div class="exercise">
**Exercise**: What function from `readr` would you use to read the file `tcga_clinical.tsv` into R? Read the file in. How many rows and columns does it contain?
</div>
You should find that the data frame contains a great many columns; far too many to be useful. We would like to keep the columns containing the age of the patient and the tumour stage in our analysis. Rather than opening the file, or `View`ing it in RStudio, we can use a couple of helper functions to identify the relevant column names.
<div class="exercise">
**Exercise**: Use the `select` function in conjunction with `contains` and `starts_with` to identify columns that have `age` or `stage` in their name. The code should look like the following (you will need to fill-in the dots).
</div>
```{r eval=FALSE}
# Complete the code by replacing the ...
select(..., contains("...."))
select(..., starts_with("...."))
```
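One thing to watch for is that `contains` matches anywhere in the name, so a search for "age" will also pick up any column with "stage" in its name. The sketch below illustrates this on a made-up table; all column names here are hypothetical and are not those of the TCGA file.

```{r}
library(dplyr)
## made-up table with hypothetical column names
toy <- tibble(patient_id = c("P1", "P2"),
              age_at_diagnosis = c(54, 61),
              stage_event = c("I", "III"),
              tumor_stage_code = c("T1", "T3"),
              gender = c("FEMALE", "MALE"))
## contains() matches "age" anywhere, including inside "stage"
cols_contains <- names(select(toy, contains("age")))
cols_contains
## starts_with() only matches at the beginning of the name
cols_starts <- names(select(toy, starts_with("age")))
cols_starts
```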
<div class="exercise">
**Exercise:** Use the `select` function to create a new data frame that contains the following columns. **These are not the actual column names**
- Tumour site
- Race
- Gender
- Age at diagnosis
- Dead / Alive Status
You can add extra columns if you wish
**See below for example output**
</div>
```{r message=FALSE, echo=FALSE,warning=FALSE}
library(tidyverse)
clin <- readr::read_tsv("more_data/tcga_clinical.tsv")
data <- select(clin,
tumor_tissue_site,
race,
gender,
age_at_initial_pathologic_diagnosis,
vital_status)
data
```
<div class="exercise">
**Exercise:** Use the `dplyr` function called `count` to tabulate which sites are included in the data. Re-arrange the output from `count` using `arrange` to determine the most common type of cancer in the dataset.
**See below for example output**
</div>
```{r echo=FALSE}
count(data, tumor_tissue_site)
```
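The combination of `count` and `arrange` can be sketched on invented data as follows. The `toy` table and its `site` column are made up; `count` stores the tallies in a column called `n`, and its `sort` argument offers a shortcut for the same ordering.

```{r}
library(dplyr)
## made-up data for illustration
toy <- tibble(site = c("Lung", "Breast", "Lung", "Skin", "Lung", "Breast"))
## tabulate, then order from most to least common
site_counts <- count(toy, site)
sorted_counts <- arrange(site_counts, desc(n))
sorted_counts
## the same ordering in one step
sorted_direct <- count(toy, site, sort = TRUE)
sorted_direct
```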
<div class="exercise">
**Exercise**: Not all samples have an entry for tumour type. Use the `filter` function to create a table with valid entries for `tumor_tissue_site`. Create a barplot to display the number of occurrences of each tumour type.
HINT: An easy way to make the labels on the x-axis more legible is to use the `coord_flip` function
```{r eval=FALSE}
ggplot(data, aes(x=...)) + geom_bar() + coord_flip()
```
**See below for example output**
</div>
```{r echo=FALSE}
filter(data,!is.na(tumor_tissue_site)) %>%
filter(tumor_tissue_site != "[Not Available]") %>%
ggplot(aes(x = tumor_tissue_site)) + geom_bar() + coord_flip()
```
## Part 2
We would like to visualise the age of diagnosis, and eventually compare between different disease types. The code we might think to use initially could look like:-
```{r}
## assuming your filtered clinical data is called data
ggplot(data, aes(x = age_at_initial_pathologic_diagnosis)) + geom_density()
```
This doesn't look like the desired output though. If we re-visit the data frame and print the "age" column we notice that the entries in the column are stored as "chr", i.e. characters or text.
```{r}
select(data, age_at_initial_pathologic_diagnosis)
```
This has occurred because some entries are "`[Not Available]`" rather than a number or `NA`. As soon as R finds any text within the column, it treats everything in the column as text.
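The same behaviour can be seen with a plain vector in base R, independent of the clinical data. The values below are made up.

```{r}
## a vector of numbers only is numeric
class(c(45, 62))
## a single text entry turns the whole vector into text
ages <- c(45, "[Not Available]", 62)
class(ages)
ages
```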
These entries can be filtered in the same manner as previously (when filtering the tissue type column), but this does not solve the problem entirely.
```{r}
data %>%
filter(age_at_initial_pathologic_diagnosis != "[Not Available]") %>%
ggplot(aes(x = age_at_initial_pathologic_diagnosis)) + geom_density()
```
We need to add an additional step which will force R to treat the data in the `age_at_initial_pathologic_diagnosis` column as numerical data. Such a conversion can be done using the `as.numeric` function, and the `mutate` function can be used to modify the `age_at_initial_pathologic_diagnosis` column to contain the numeric values.
<div class="exercise">
**Exercise**: Use `mutate` and `as.numeric` to convert the values in `age_at_initial_pathologic_diagnosis` into numbers. You will still need to remove the `[Not Available]` values beforehand. Now try to create the density plot.
</div>
```{r echo=FALSE}
data %>%
filter(age_at_initial_pathologic_diagnosis != "[Not Available]") %>%
mutate(age_at_initial_pathologic_diagnosis = as.numeric(age_at_initial_pathologic_diagnosis)) %>%
ggplot(aes(x = age_at_initial_pathologic_diagnosis)) + geom_density()
```
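An alternative worth knowing about, if you are able to re-import the file, is to tell `readr` which strings represent missing values by using the `na` argument. The sketch below is not how the data were read in earlier; it uses a tiny made-up file supplied as literal text via `I()`, which assumes readr version 2.0 or later.

```{r}
library(readr)
## a tiny made-up tab-separated file, given as literal text
toy_tsv <- I("site\tage\nBreast\t45\nLung\t[Not Available]\nSkin\t62\n")
## treat "[Not Available]" as missing when importing
toy <- read_tsv(toy_tsv,
                na = c("", "NA", "[Not Available]"),
                show_col_types = FALSE)
toy
## the age column is now numeric, with NA for the missing entry
class(toy$age)
```

With this approach no `filter` or `as.numeric` step is needed afterwards, although the missing entries remain in the table as `NA`.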
<div class="exercise">
**Exercise**: Use the `facet_wrap` function to compare the distribution of ages between different tumour types
</div>
```{r echo=FALSE}
data %>%
filter(age_at_initial_pathologic_diagnosis != "[Not Available]") %>%
filter(tumor_tissue_site != "[Not Available]") %>%
mutate(age_at_initial_pathologic_diagnosis = as.numeric(age_at_initial_pathologic_diagnosis)) %>%
ggplot(aes(x = age_at_initial_pathologic_diagnosis)) + geom_density() + facet_wrap(~tumor_tissue_site)
```
<div class="exercise">
**Exercise**: Do any tumour types have a different age of diagnosis between males and females? Use a boxplot to find out.
</div>
```{r echo=FALSE}
data %>%
filter(age_at_initial_pathologic_diagnosis != "[Not Available]") %>%
filter(tumor_tissue_site != "[Not Available]") %>%
mutate(age_at_initial_pathologic_diagnosis = as.numeric(age_at_initial_pathologic_diagnosis)) %>%
ggplot(aes(x= gender, y = age_at_initial_pathologic_diagnosis)) + geom_boxplot() + facet_wrap(~tumor_tissue_site)
```
Let's now look at the gender split for each cancer type. As a first step, we can group the data by gender and tissue type and obtain counts.
```{r}
data %>%
group_by(tumor_tissue_site,gender) %>%
filter(tumor_tissue_site != "[Not Available]") %>%
summarise(N = n())
```
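These data are ready for plotting, but for comparisons we need to take into account the total number of each tissue type. We can create frequencies rather than absolute numbers by dividing each count by the total number of cases for that tissue type. This works because `summarise` removes the last level of grouping (here `gender`), so `sum(N)` is calculated within each `tumor_tissue_site`.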
```{r}
data %>%
group_by(tumor_tissue_site,gender) %>%
filter(tumor_tissue_site != "[Not Available]") %>%
summarise(N = n()) %>%
mutate(freq = N / sum(N))
```
Note that the order of the grouping is important here. If we reversed it to `gender` then `tumor_tissue_site`, the frequencies would be calculated using the total number of males or females.
```{r}
data %>%
group_by(gender,tumor_tissue_site) %>%
filter(tumor_tissue_site != "[Not Available]") %>%
summarise(N = n()) %>%
mutate(freq = N / sum(N))
```
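The effect of the grouping order is easier to verify on a small, made-up table where the totals can be checked by hand. In the sketch below the `toy` data are invented, and `.groups = "drop_last"` simply makes explicit the default behaviour of `summarise`, which is to drop the last grouping variable.

```{r}
library(dplyr)
## made-up data: 4 breast cases (3 female, 1 male), 4 lung cases (2 female, 2 male)
toy <- tibble(site = rep(c("Breast", "Lung"), each = 4),
              gender = c("FEMALE", "FEMALE", "FEMALE", "MALE",
                         "FEMALE", "FEMALE", "MALE", "MALE"))
## grouped by site first: frequencies sum to 1 within each site
by_site <- toy %>%
  group_by(site, gender) %>%
  summarise(N = n(), .groups = "drop_last") %>%
  mutate(freq = N / sum(N))
by_site
## grouped by gender first: frequencies sum to 1 within each gender
by_gender <- toy %>%
  group_by(gender, site) %>%
  summarise(N = n(), .groups = "drop_last") %>%
  mutate(freq = N / sum(N))
by_gender
```

In `by_site` the female breast cases have a frequency of 3/4 (three of the four breast cases), whereas in `by_gender` the same row has a frequency of 3/5 (three of the five female cases).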
<div class="exercise">
**Exercise**: Create a plot to show the gender split in cases of each tumor type.
</div>
```{r echo=FALSE}
data %>%
group_by(tumor_tissue_site,gender) %>%
filter(tumor_tissue_site != "[Not Available]") %>%
summarise(N = n()) %>%
mutate(freq = N / sum(N)) %>%
ggplot(aes(x = gender, y = freq)) + geom_col() + facet_wrap(~tumor_tissue_site)
```
<div class="exercise">
**Exercise**: Create a plot to show the proportion of patients dead or alive for each tumour type
</div>
```{r echo=FALSE}
data %>%
group_by(tumor_tissue_site,vital_status) %>%
filter(tumor_tissue_site != "[Not Available]") %>%
filter(vital_status != "[Not Available]") %>%
summarise(N = n()) %>%
mutate(freq = N / sum(N)) %>%
ggplot(aes(x = vital_status, y = freq)) + geom_col() + facet_wrap(~tumor_tissue_site)
```
# Palmer Penguins
This fun example concerns a dataset available from:-
https://allisonhorst.github.io/palmerpenguins/
![](https://allisonhorst.github.io/palmerpenguins/reference/figures/lter_penguins.png)
**Artwork by @allison_horst**
```{r}
## check if package already exists, and if not install from CRAN
if(!require("palmerpenguins")) install.packages("palmerpenguins")
```
Once the package is installed, we can load it and `View` the `penguins` dataset, which is included with the package.
```{r}
library(palmerpenguins)
View(penguins)
example("penguins")
```
```{r}
count(penguins, species)
```
<div class="exercise">
- Is the body mass of male penguins greater than that of female penguins? Use a suitable plot to visualise the data.
- Is the trend consistent across different species?
- Use the `summarise` function to calculate the average body mass in males / females for different species
- Can you remove observations for which no `sex` was recorded from the dataset?
</div>
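```{r}
## remove penguins with no recorded sex before plotting and summarising
penguins_sexed <- filter(penguins, !is.na(sex))
ggplot(penguins_sexed, aes(x = sex, y = body_mass_g)) + geom_boxplot() + facet_wrap(~species)
## na.rm = TRUE guards against any remaining missing body masses
group_by(penguins_sexed, sex, species) %>%
  summarise(Weight = mean(body_mass_g, na.rm = TRUE))
```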
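When calculating averages it is worth remembering that `mean` returns `NA` if any of the values are missing, which is relevant here because a few penguins have no recorded body mass. A minimal illustration with made-up numbers:

```{r}
## made-up body masses (g), one of them missing
masses <- c(3750, NA, 4200)
## a single missing value makes the result NA
mean(masses)
## na.rm = TRUE ignores the missing value
mean(masses, na.rm = TRUE)
```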
<div class="exercise">
Is the body mass related to the flipper length? Use an appropriate plot to find out.
</div>
```{r}
ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g, col = species)) + geom_point() + geom_smooth(method="lm")
```
# Machine Learning Examples
```{r}
if(!require("MLDataR")) install.packages("MLDataR")
```
```{r}
library(MLDataR)
```
```{r}
View(diabetes_data)
```
```{r}
ggplot(diabetes_data,aes(x = DiabeticClass, y = Age, fill=Gender)) + geom_boxplot()
```