---
title: "Intro to R"
author: "Kim Fitter - R-Ladies Auckland"
subtitle:
date: "2018/08/01"
output:
  html_document:
    toc: true
    toc_float: true
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(cache = TRUE, echo = TRUE)
```
## Introduction
These are the notes for the Intro to R workshop.
## What do R-users use R for?
A typical data science project follows a workflow something like this:<sup>1</sup>
<img src="http://r4ds.had.co.nz/diagrams/data-science.png" >
### Why use R and RStudio?
- R was intentionally developed to be a data analysis language
- R is free and open source
- RStudio is an [Integrated development environment (IDE)](https://en.wikipedia.org/wiki/Integrated_development_environment) designed to help users use R
1 [R for Data Science](http://r4ds.had.co.nz/), a great free ebook.
# Installation
- install R version 3.5.1 https://www.r-project.org/
- install RStudio Desktop https://www.rstudio.com/
Installation instructions [adapted](https://github.com/rladies/meetup-presentations_london/blob/master/2016-04_Beginners_DropIn/April_DropIn.Rmd) with appreciation from a previous R-Ladies workshop.
# Getting started in RStudio
### Argh, so many windows
Let's start with a couple of useful panes.
- Console
- Environment, History
- Files, Plots, Packages, Help, Viewer
If your layout doesn't look like this, go to the RStudio menu `Tools > Global Options > Pane Layout`, update the panes, and click Apply.
# How to run code?!
## Running code in the Console
The console is where you can execute single-line R commands.
The console is located, by default, in the lower left pane.
Type `3 + 2` and press `Enter`.
Assign the number 5 to an object `x` with arrow assignment `<-`.
```{r eval=FALSE}
x <- 3 + 2
```
What happens when you type `x` into the Console after assigning the value 5 to it? What do you see in the `Environment` pane?
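
As a quick illustration of the question above, the lines below can be typed into the Console one at a time. Typing the name of an object and pressing Enter prints its value, and the object `x` should also appear in the `Environment` pane.

```r
# Assign the result of 3 + 2 to an object called x.
x <- 3 + 2

# Typing the object's name prints its value.
x
#> [1] 5
```
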
## Running code in code chunks
````
```{ }`r ''`
# Here's a code chunk.
# Assign 10 to y. Run this line of code using Ctrl+Enter
y <- 10
```
````
# R Markdown
## So far we have run some code, but how do we save it?
We can use [R Markdown](http://r4ds.had.co.nz/r-markdown.html) documents (instead of R scripts).
> R Markdown provides an unified authoring framework for data science, combining your code, its results, and your prose commentary. R Markdown documents are fully reproducible and support dozens of output formats, like PDFs, Word files, slideshows, and more.
> R Markdown files are designed to be used in three ways:
1. For **communicating** to decision makers, who want to focus on the conclusions, not the code behind the analysis.
2. For **collaborating** with other data scientists (including future you!), who are interested in both your conclusions, and how you reached them (i.e. the code).
3. As an environment in which to do data science, as a modern day lab **notebook** where you can capture not only what you did, but also what you were thinking.
## Opening and knitting an R Markdown .Rmd
Open `File > New File > R Markdown`.
Follow the prompts to install any required R packages.
Give your document a title, keep the default HTML option and press **OK**.
This opens an `Untitled1` template: you have given your document a title, but you have not saved it yet!
Save your document as `Intro.Rmd`.
Now we will `knit` our document to an HTML document using the `knit` button or shortcut **control + shift + k**.
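
Knitting can also be started from the Console with `rmarkdown::render()`, which is the function the `knit` button runs for you. The following is a minimal sketch, assuming the `rmarkdown` package is installed; the throwaway document and the names `rmd_path` and `html_path` are made up for illustration.

```r
# Write a tiny R Markdown document to a temporary file (illustrative content).
rmd_path <- file.path(tempdir(), "Intro.Rmd")
writeLines(
  c(
    "---",
    "title: \"Intro\"",
    "output: html_document",
    "---",
    "",
    "Hello from R Markdown."
  ),
  rmd_path
)

# Knit it, but only if rmarkdown and pandoc are available on this machine.
if (requireNamespace("rmarkdown", quietly = TRUE) &&
    rmarkdown::pandoc_available()) {
  html_path <- rmarkdown::render(rmd_path, quiet = TRUE)
  html_path  # path of the HTML file that was created
}
```

For your own notes you would pass the path to your saved file, for example `rmarkdown::render("Intro.Rmd")`.
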
## Get these slides
We will all work from a personal copy of these slides, accessed from GitHub. Delete everything inside your .Rmd file.
### Find them yourself
1. Google `github r-ladiesakl`.
2. Go to the **meetup-presentations_auckland** repository in the pinned repositories.
3. Open **2018-08_Intro_to_R_Notes.Rmd**.
4. Click the **Raw** button.

### Or navigate directly to this link
https://github.com/R-LadiesAKL/meetup-presentations_auckland/blob/master/2018-08_Intro_to_R_Notes.Rmd
Select all, copy and paste the text into your .Rmd file, and `knit`.
# Survival tips
- Modern day coding practice consists almost entirely of searching "how do I do `x` in `language`".
- The [community](https://community.rstudio.com/) is one of your best resources; talk to each other and make friends and future collaborators.
- Type `?` followed by a function name (for example `?mean`) into the RStudio Console, and the Help pane will display its documentation.
- Cheatsheets, online or in the RStudio Help menu
- Twitter
- Stack Overflow
<center> <img src="https://i.gifer.com/1tUl.gif"> </center>
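
As a small illustration of the help tip above, here are a few ways to look up a function from the Console. `mean()` is used only as an example; any function name works the same way.

```r
# Open the documentation for the mean() function in the Help pane.
?mean

# help() does the same thing when given the name as a character string.
help("mean")

# args() shows the arguments of a function without leaving the Console.
args(mean)
```
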
# Data Structure Types in R
<img src="http://venus.ifca.unican.es/Rintro/_images/dataStructuresNew.png" style="width:60%" /> <sup>2</sup>
`Vectors`: one-dimensional arrays used to store data of the same type
`Matrices`: two-dimensional arrays to store data of the same type
`Arrays`: similar to matrices but they can be multi-dimensional
`Factors`: vectors of grouped categorical variables
`Lists`: ordered collection of objects, where the elements can be of different types
`Data Frames`: generalization of matrices where different columns can store different data types
2 [First Steps in R](http://venus.ifca.unican.es/Rintro/dataStruct.html#data-structure-types)
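
To make the list above concrete, here is a minimal sketch that builds one small object of each type using base R. All object names and values are invented for illustration.

```r
# Vector: one dimension, all elements of the same type.
ages <- c(31, 45, 27)

# Matrix: two dimensions, all elements of the same type.
m <- matrix(1:6, nrow = 2, ncol = 3)

# Array: like a matrix, but with any number of dimensions (here 2 x 3 x 2).
a <- array(1:12, dim = c(2, 3, 2))

# Factor: a vector of categories, stored with a fixed set of levels.
sizes <- factor(c("small", "large", "small"))

# List: an ordered collection whose elements can have different types.
person <- list(name = "Ada", age = 36, languages = c("R", "Python"))

# Data frame: a table whose columns can have different types.
df <- data.frame(name = c("Ada", "Grace"), age = c(36, 45))

# class() reports which kind of object each one is; str() shows its structure.
class(ages)
class(m)
class(df)
str(df)
```
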
# Packages
Packages are collections of other people's code. Often someone has already written a script that does what you want to do.
For example, we want to import data. We will use a package that helps with data wrangling tasks like this, the [`tidyverse`](https://www.tidyverse.org/).
We're going to use the metapackage `tidyverse` to help us with our data analysis.
## Functions
The most common elements of packages are functions. R also comes preloaded with a *base* set of commonly used functions.
Functions run other people's code for us, so that we don't have to reinvent the wheel.
We will use functions to install and load the `tidyverse`.
### How to spot a function
- *Functions* in R take the form `function_name()`: a name followed by parentheses, with any arguments inside the parentheses.
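
A few base R functions show the pattern. These particular functions are chosen only as examples; they all come preloaded with R.

```r
# sqrt() takes one argument and returns its square root.
sqrt(16)
#> [1] 4

# Arguments can be named; round() keeps two decimal places here.
round(3.14159, digits = 2)
#> [1] 3.14

# The result of a function can be assigned to an object, like any other value.
total <- sum(1, 2, 3)
total
#> [1] 6
```
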
# Installing and loading packages
We want to install the package `tidyverse`.
### For installation (first time only).
`install.packages("name of package")`
### For loading.
`library(name of package)`
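
One common pattern, sketched here on the assumption that you want code to work whether or not a package is already installed, wraps the two steps in a small helper. The name `ensure_package()` is made up for this example, and `stats` is used because it ships with R, so nothing is downloaded.

```r
# Hypothetical helper: install a package only if it is missing, then load it.
ensure_package <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)  # first time only
  }
  # character.only = TRUE lets library() accept the name stored in pkg.
  library(pkg, character.only = TRUE)
}

# stats is already installed with R, so this call only loads it.
ensure_package("stats")
```

Note the difference in quoting: `install.packages()` needs the package name in quotes, while `library()` is normally called with the bare name.
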
### Let's install and load the tidyverse
```{r Load the tidyverse package}
# This is a code chunk.
# We can write informative comments with a hash # at the start.
# Install the tidyverse using the install.packages() function.
# Load the tidyverse using the library() function.
# Press the green arrow in the top right corner of the chunk to run!
# Don't forget, you need to install the package before you can use it.
```
# Importing data
## Import the data
Since the data is stored on an online repository, we can import it via URL.
We can import this data using the `read_csv()` function from the `tidyverse`.
**This function takes a file argument, such as the URL, which goes between the () as a "character string".**
The Summer of Tech data is found here: "https://raw.githubusercontent.com/R-LadiesAKL/sotdata/master/Event%20attendances%20data%20Dec%202017.csv"
Try importing the data at the console by calling `read_csv()` with the URL, including the quotation marks `" "`. What output do you see?
Calling `read_csv()` with the URL as its argument produces a data object, which we can assign to a name.
Open a code chunk *here* and read the data in using `read_csv` and assign `<-` the data to an object called `eventdat`.
- **control+alt+i** to open a code chunk.
- press green **play** to run chunk.
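
The same pattern works for any CSV location, whether a URL or a file path. This is a minimal sketch, assuming the `readr` package (part of the `tidyverse`) is installed; the data and the names `csv_path` and `example_dat` are invented so the example runs without an internet connection.

```r
library(readr)  # read_csv() lives in readr, which the tidyverse loads for you

# Made-up data written to a temporary file, standing in for a real CSV.
csv_path <- file.path(tempdir(), "example.csv")
writeLines(c("event,attendees", "Workshop,25", "Meetup,40"), csv_path)

# The file argument goes between the () as a character string (stored in csv_path).
example_dat <- read_csv(csv_path)

# Typing the object's name prints the imported table.
example_dat
```
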
# Explore and Understand
Let's explore the information in this table.
## Summary functions
Many R objects work with the `summary()` function: call it as `summary(<an R object>)`.
What is the output of `summary()` when applied to the object `eventdat`?
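
To see what `summary()` returns before trying it on `eventdat`, here is a short sketch using `iris`, a small data set that ships with R. The name `sepal_summary` is made up for illustration.

```r
# summary() on a data frame summarises every column.
summary(iris)

# summary() on a single numeric column gives its minimum, quartiles, mean and maximum.
sepal_summary <- summary(iris$Sepal.Length)
sepal_summary
```
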
### An alternative to `summary`
An alternative is the `skim()` function from the `skimr` package.
- Install the `skimr` package.
- Open a code chunk here and load the `skimr` package in your notes.
- Apply the `skim()` function to the `eventdat` data.
What is the difference between the output of `summary` and `skim`? Which do you like better and why?
# The fine art of wrangling
At this point, we often wish to manipulate the data in some way. This is variously known as wrangling, cleaning, and scrubbing.
To do this, we'll learn a very useful operator, the pipe `%>%`.
## The pipe
Piping makes code easier to read (arguably).
The `head()` function takes a data object, written here as `<some data>`, as its first argument:
```{}
head(<some data>)
```
But we could also *pipe* `%>%` the data into the function.
```{}
# Use the pipe operator to show the top of the `eventdat` dataset.
<some data> %>% head()
```
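
Here is a runnable version of the two forms above, using `mtcars`, a data set that ships with R, in place of `<some data>`. It assumes the `magrittr` package is installed (it comes with the `tidyverse`); the name `first_three` is made up for illustration.

```r
library(magrittr)  # provides %>%; loading the tidyverse also makes it available

# Without the pipe: the data goes inside the parentheses.
head(mtcars)

# With the pipe: the data on the left becomes the first argument of head().
mtcars %>% head()

# Extra arguments still go inside the parentheses.
first_three <- mtcars %>% head(n = 3)
first_three
```
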
# Visualisation
## The structure of a ggplot
One way to plot data in R is with the `ggplot2` package, which comes with the `tidyverse`.
## Aesthetics
We define x and y axes of the plot with aesthetics in `ggplot`.
```{}
<some data> %>%
  ggplot(aes(x = <column1>, y = <column2>))
```
## Your first ggplot
In your **dataframe**, choose two `numeric` variables to plot: **column1** and **column2**.
We'll add a plot layer `+` to our ggplot using `geom_point` for a scatterplot.
Set the x axis to **column1** and the y axis to **column2**.
What happens when you `%>%` the `eventdat` table into `ggplot()`?
```{}
dataframe %>%
  ggplot(aes(x = column1, y = column2)) +
  geom_point() # Adds a scatterplot.
```
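
The template above can be tried on a built-in data set before using your own. This is a minimal sketch, assuming `ggplot2` and `magrittr` are installed (both come with the `tidyverse`); `mtcars` and its columns `wt` and `mpg` stand in for **dataframe**, **column1** and **column2**, and the name `p` is made up for illustration.

```r
library(ggplot2)   # loaded automatically by library(tidyverse)
library(magrittr)  # for %>%

# mtcars ships with R; wt (weight) and mpg (miles per gallon) are numeric columns.
p <- mtcars %>%
  ggplot(aes(x = wt, y = mpg)) +
  geom_point()  # adds a scatterplot layer

# Printing the plot object draws it in the Plots pane.
p
```
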
# Tutorials
In groups or alone, choose a tutorial to follow:

- Further introduction to R: [DataCamp basics](https://www.datacamp.com/courses/free-introduction-to-r)
- From an Excel background to R: [Using the Summer of Tech data](https://github.com/kimnewzealand/R-tutorials/blob/master/from-excel-tutorial-sotdata.Rmd)
- Further visualisation: [ggplot2](https://ggplot2.tidyverse.org/)

# Acknowledgements
- The installation instructions are [adapted from a previous workshop](https://github.com/rladies/meetup-presentations_london/blob/master/2016-04_Beginners_DropIn/April_DropIn.Rmd).
- The [R-Ladies presentation ninja template](https://alison.rbind.io/post/r-ladies-slides/) for the presentation.
- This workshop was adapted and shortened from an R-Ladies [useR! 2018 R-Curious workshop](https://github.com/softloud/rcurious). The [video](https://youtu.be/AmqxVDlfKQY) is highly recommended.