-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy path01-intro.qmd
522 lines (332 loc) · 14.3 KB
/
01-intro.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
---
title: "Introduction to R and RStudio"
author: "Bella Ratmelia"
format: revealjs
---
# Welcome!
```{r}
#| echo: false
#| warning: false
#| message: false
library(tidyverse)
```
## Preamble
- About me:
- Librarian, Research & Data Services team, SMU Libraries.
- Bachelor in Info Tech (IT), MSc in Info Studies.
- Have been with SMU since the pandemic era (2021).
- About this workshop:
- Live-coding format; code along with me!
- Goal of workshop: to give you enough fundamentals (at least to the point that ChatGPT can't bluff you so easily) and confidence to explore R on your own.
- Don't be afraid to ask for help! We are all here to learn.
## What is R? What is R Studio?
**R**: The programming language and the software that interprets the R script
**RStudio:** An IDE (Integrated Development Environment) that we use to interact more easily with R language and scripts.
. . .
You will need to install **both** for this workshop. Go to <https://posit.co/download/rstudio-desktop> to download and install both if you have not done so.
## 7 Reasons to learn R
1. R is free, open-source, and cross-platform.
2. R does not involve lots of pointing and clicking - you don't have to remember a complicated sequence of clicks to re-run your analysis.
3. R code is great for reproducibility - when someone else (including your future self) can obtain the same results from the same dataset and same analysis.
4. R is interdisciplinary and extensible
5. R is scalable and works on data of all shapes and sizes (though admittedly, it is not best at some scenarios and other languages such as python would be preferred.)
6. R produces high-quality and publication-ready graphics
7. R has a large and welcoming community - which means there are lots of help available!
## A Tour of RStudio
![R Studio layout](images/rstudio-tour.jpg){fig-align="center"}
## Working Directory
- Working directory -\> where R will look for files (scripts, data, etc).
- By default, it will be on your Desktop
- Best practice is to use **R Project** to organize your files and data into projects.
- When using R Project, the working directory = project folder.
## Creating the project for this workshop
1. Go to `File` \> `New project`. Choose `New directory`, then `New project`
2. Enter `intro-r-socsci` as the name for this new folder (or "directory") and choose where you want to put this folder, e.g. `Desktop` or `Documents` if you are on Windows. This will be your working directory for the rest of the workshop!
<!-- -->
4. Next, let's create 3 folders inside our working directory:
- `data` - we will save our raw data here. **It's best practice to keep the data here untouched.**
- `data-output` - if we need to modify raw data, store the modified version here.
- `fig-output` - we will save all the graphics we created here!
::: callout-warning
Don't put your R projects inside your OneDrive folder as that may cause issues sometimes.
:::
# Let's Code!
Create a new R script - `File` \> `New File` \> `R script`.
**Note: RStudio does not autosave your progress, so remember to save from time to time!**
## R Objects and Values
In this line of code:
```{r}
#| echo: true
name <- "Anya Forger"
```
- `"Anya Forger"` is a value. This can be either a character, numeric, or boolean data type. (more on this soon)
- `name` is the object where we store this value. This is so that we can keep this value to be used later.
- `<-` is the assignment operator to assign the value to the object.
- You can also use `=`, but generally in R, `<-` is the convention.
- Keyboard shortcut: `Alt` + `-` in Windows (`Option` + `-` in Mac)
## Refresher: Quantitative Data Types
- [**Non-Continuous Data**]{.underline}
- **Nominal/Categorical**: Non-ordered, non-numerical data, used to represent qualitative attribute.
- Example: nationality, neighborhood, employment status
- **Ordinal**: Ordered non-numerical data.
- Example: Nutri-grade ratings, frequency of exercise (daily, weekly, bi-weekly)
- **Discrete**: Numerical data that can only take specific value (usually integers)
- Example: Shoe size, clothing size
- **Binary**: Nominal data with only two possible outcome
- Example: pass/fail, yes/no, survive/not survive
------------------------------------------------------------------------
- [**Continuous Data**]{.underline}
- **Interval**: Numerical data that can take any value within a range. [It does not have a "true zero".]{.underline}
- Example: Celsius scale. Temperature of 0 C does not represent absence of heat.
- **Ratio**: Numerical data that can take any value within a range. [it has a "true zero".]{.underline}
- Example: Annual income. annual income of 0 represents no income.
## Data Types in R
The four basic data types are characters, numeric, boolean, and integer.
```{r}
#| echo: true
#| code-line-numbers: "|1|2|3|4"
chara_type <- "Hello World" # Character
num_type <- 123.45 # Numeric (also sometimes called Double)
bool_type <- TRUE # Boolean/Logical (true/false)
int_type <- 25L # Integer (whole numbers)
```
## Checking data type of a variable
You can use `str` or `typeof` to check the data type of an R object.
```{r}
#| echo: true
typeof(chara_type)
```
```{r}
#| echo: true
str(bool_type)
```
## Arithmetic operations in R
You can do arithmetic operations in R, like so:
```{r}
#| echo: true
100 / 3
```
```{r}
#| echo: true
11 ** 2
```
## Boolean operations in R
Boolean operations in R (will be handy for later):
**AND** operations (all sides needs to be TRUE for the result to be TRUE)
```{r}
#| echo: true
TRUE & FALSE
```
**OR** operations (only one side needs to be TRUE for the result to be TRUE)
```{r}
#| echo: true
TRUE | FALSE
```
**NOT** operations, which is basically flipping TRUE to FALSE and vice versa
```{r}
#| echo: true
!TRUE
```
## Functions in R
Functions is a block of reusable code designed to do specific task. Function take inputs (a.k.a arguments or parameters), do their thing, and then return a result. (this result can either be printed out, or saved into an object!)
```{r}
#| echo: true
round(123.456, digits = 2)
```
Saving the result to an object:
```{r}
#| echo: true
rounded_num = round(123.456, digits = 2)
print(rounded_num)
```
in the example above, `round` is the function. `123.456` and `digits = 2` are the arguments/parameters.
## How do I find out more about a particular function?
You can call the help page / vignette in R by prepending `?` to the function name.
E.g. if you want to find out more about the `round` function, you can run `?round` in your R console (bottom left panel)
## Packages in R
- Packages are a collections of R functions, datasets, etc. Packages extend the functionality of R.
- (Closest analogy I can think of is that they're equivalent of browser add-ons, in a way)
- Popular packages: `tidyverse`, `caret`, `shiny`, etc.
- Installation (you only need to do this once): `install.packages("package name")`
- Loading packages (you need to run this everytime you restart RStudio): `library(package name)` - let's try to load `tidyverse`!
## Data Structures in R: Vectors
- Basic objects in R can only contain one value. But quite often you may want to group a bunch of values together and save it in a single object.
- A vector is a data structure that can do this. It is the most common and basic data structure in R. (pretty much the workhorse of R!)
```{r}
#| echo: true
t1_courses <- c("IDIS110", "IDIS100", "PLE100", "PSYC111", "PSYC103")
str(t1_courses)
print(t1_courses)
```
Example of numeric vector:
```{r}
#| echo: true
t1_grades <- c(50, 70, 80, 95, 77)
str(t1_grades)
```
## Vector Manipulations: Retrieve and update items
```{r}
#| echo: true
# retrieve the 1st item in the vector
t1_grades[1]
# retrieves the 1st item up to the 3rd item
t1_grades[1:3]
# update the value of the 1st item
t1_grades[1] <- 65
print(t1_grades[1])
```
## Vector Manipulations: Retrieve items based on criteria
- Let's say we want to retrieve items that are larger than 75.
- The code below will create a boolean vector called `criteria` that basically keep tracks on whether each items inside `t1_grades` fulfil our condition.
- The condition is "value must be \> 75". e.g. if item 1 fulfils our condition, then item 1 is 'marked' as `TRUE`. Otherwise, it will be `FALSE`
```{r}
#| echo: true
criteria <- t1_grades > 75
print(criteria)
```
- This line of code applies the boolean vector `criteria` to `t1_grades`, and only retrieve items that fulfils the condition. i.e. items whose position is marked as `TRUE` by `criteria` vector
```{r}
#| echo: true
t1_grades[criteria]
```
## Shortened version of the previous code
- You can shorten the code like this too (notice that we are not creating a temporary dataframe called criteria here):
```{r}
#| echo: true
t1_grades[t1_grades > 75]
```
## Vector Manipulations: Handling NA values
- NA values indicate null values, or the absence of a value (0 is still a value!)
- Summary functions like `mean` needs you to specify in the arguments how you want it to be handled.
```{r}
#| echo: true
missing_vector <- c(21, 22, 23, 24, 25, NA, 27, 28, NA, 30)
# by default it will be confused and return NA
mean(missing_vector)
# indicate that we want the NA values to be removed entirely
mean(missing_vector, na.rm = TRUE)
```
## Vector Manipulations: Adding items
Several ways to add items to a vector
```{r}
#| echo: true
#| output-location: column
#| eval: false
t1_grades <- c(t1_grades, 92) # <1>
t1_grades <- c(t1_grades, 88, 95, 79) # <2>
t1_grades <- c(82, t1_grades) # <3>
t1_grades <- append(t1_grades, 92, after = 2) # <2> # <4>
```
1. Add a single grade to the end of the vector using c()
2. Add multiple grades to the end
3. Add a grade to the beginning
4. Insert a grade at a specific position using append()
## Vector Manipulations: Removing items
```{r}
#| echo: true
#| output-location: column
#| eval: false
t1_grades <- t1_grades[-c(2, 4)] # <1>
t1_grades <- t1_grades[t1_grades <= 90] # <2>
t1_grades <- na.omit(t1_grades) # <3>
```
1. Remove elements by index using "negative indexing"
2. Remove elements based on a condition using logical indexing
3. Remove NA values from the vector
## Data Structures in R: Factors
- Special data structure in R to deal with categorical data.
- Can be ordered (ordinal) or unordered (nominal).
- May look like a normal vector at first glance, so use `str()` to check.
Unordered (Nominal):
```{r}
#| echo: true
unordered_factor <- factor(c("SOA", "SOSS", "SCIS", "CIS", "YPHSOL"))
str(unordered_factor)
```
Ordered (Ordinal):
```{r}
#| echo: true
ordered_factor <- factor(c("Agree", "Disagree", "Neutral", "Disagree"),
ordered = TRUE,
levels = c("Disagree", "Neutral", "Agree"))
str(ordered_factor)
```
## Data Structures in R: Dataframe
- De facto data structure for tabular data in R, and what we use for data processing, plotting, and statistics.
- Similar to spreadsheets!
- You can create it by hand like so:
```{r}
#| echo: true
t1_data <- data.frame(
course_code = c("IDIS110", "IDIS100", "PLE100", "PSYC111", "PSYC103"),
grade = c(50, 70, 80, 95, 77)
)
print(t1_data)
```
## Creating dataframe from two (or more) vectors
Alternatively, here is how to create one using the two vectors that we created earlier:
```{r}
#| echo: true
t1_data <- data.frame(course_code = t1_courses, grade = t1_grades)
print(t1_data)
```
Most of the time, our dataframe will be generated by loading from external data file such as CSV, SAV, or XLSX file.
Let's try loading one from a CSV!
## Preparation: Get the CSV
::: {.callout-note appearance="simple" title="What is a CSV?"}
A CSV (Comma-Separated Values) file is a type of file that stores data in a plain text format. Each line in the file represents a row of data, and within each row, individual pieces of data (like numbers or words) are separated by commas. This format is commonly used for storing and transferring data, especially in spreadsheets and databases.
You can open CSV files in Excel, Google Sheets, or event Notepad!
:::
- Download and save `chile_voting.csv` from [**this URL**](https://raw.githubusercontent.com/bellaratmelia/introductory-r-socsci/main/data/chile_voting.csv)
- Save the CSV file into your `data` folder.
- Check out the data dictionary/explanatory notes to learn more about the data, including the column names, data type inside each columns [here](https://rdrr.io/cran/carData/man/Chile.html).
- We need to use `readr` package, which is part of `tidyverse` package. So please install `tidyverse` first if you have not done so.
## Loading data from CSV into a Dataframe
Load the CSV and save the content into a tibble/dataframe called `chile_data`
```{r}
#| echo: true
#| output: false
library(tidyverse) # <1>
chile_data <- read_csv("data/chile_voting.csv") # <2>
head(chile_data) # <3>
```
1. Load the `tidyverse` library (make sure to have it installed first!)
2. Read the CSV file that we placed inside the the data folder with the help of `read_csv` function, and save it into `chile_data` dataframe.
3. Print the first few rows of the data to check if it's loaded correctly.
## More ways to "peek" at the data
```{r}
#| echo: true
#| output: false
#| eval: false
dim(chile_data) # <1>
names(chile_data) # <2>
str(chile_data) # <3>
summary(chile_data) # <4>
head(chile_data, n=5) # <5>
tail(chile_data, n=5) # <6>
```
1. return a vector of number of rows and columns
2. inspect columns
3. inspect structure
4. print the summary stats of the entire dataframe
5. view the first 5 rows
6. view the last 5 rows
## Basic dataframe manipulations: Retrieving values
Some basic dataframe functions before we move on to data wrangling next week:
```{r}
#| echo: true
#| output: false
#| eval: false
chile_data["income"] # <1>
chile_data$income # <2>
chile_data[3] # <3>
chile_data[1, 4] # <4>
chile_data[3, ] # <5>
```
1. retrieve column by name (returns as tibble/dataframe)
2. another way to retrieve column by name (returns as vector)
3. get an entire column by index
4. get a cell at this row, column coord
5. get an entire row
# End of Session 1!
Next Session: Data wrangling with `dplyr` and `tidyr` packages