---
title: "Introduction to R and RStudio"
author: "Bella Ratmelia"
format: revealjs
---
# Welcome!
```{r}
#| echo: false
#| warning: false
#| message: false
library(tidyverse)
```
## Preamble
- About me:
    - Senior Librarian, Research & Data Services team, SMU Libraries.
    - Bachelor in Info Tech (IT), MSc in Info Studies from NTU.
    - Have been with SMU since the pandemic era (2021).
    - Have been doing this workshop since Aug 2023.
- About this workshop:
    - Live-coding format; code along with me!
    - Goal of workshop: to give you enough fundamentals (at least to the point that ChatGPT can't bluff you so easily) and confidence to explore R on your own.
    - Don't be afraid to ask for help! We are all here to learn.
## The outline for these workshops
The workshops are structured to follow this workflow when dealing with data

::: aside
Image is taken from [R for Data Science (2e)](https://r4ds.hadley.nz/intro) by Hadley Wickham.
:::
## The outline for these workshops (explained)
::: incremental
1. **Import** data into R, which means taking data (stored in a file, available via an API, etc.) and loading it into a dataframe in R.
2. **Tidy** the imported data.
    - Tidy = storing it in a consistent form that matches the semantics of the dataset.
    - Tidy data = each column is a variable, each row is an observation.
3. Once the data is tidy, we can **transform** it. Transformation includes:
    - narrowing in on observations of interest (like all people in one city or all data from the last year)
    - creating new variables that are functions of existing variables (like computing speed from distance and time)
    - calculating a set of summary statistics (like counts or means).
4. Once we have tidy data with the info we need, we can **visualize** it and **model** it.
5. **Communicate** the result. It doesn’t matter how well your models and visualizations have led you to understand the data unless you can also communicate your results to others.
:::
## What is R? What is RStudio?

**R**: The programming language, and also the software that interprets R scripts.

**RStudio:** An IDE (Integrated Development Environment) that we use to interact more easily with the R language and scripts.

. . .

You will need to install **both** for this workshop. Go to <https://posit.co/download/rstudio-desktop> to download and install both if you have not done so.

Check out the course website for a step-by-step guide.
## A Tour of RStudio
## Working Directory
- Working directory: the folder where R looks for files (scripts, data, etc.).
    - By default, it is usually your home folder (for example `Documents` on Windows), though this depends on your setup.
- Best practice is to use an **R Project** to organize your files and data into projects.
    - When using an R Project, the working directory = project folder.
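
A quick way to confirm which folder R is currently using is `getwd()`. The sketch below uses only base R; the `data` subfolder name is borrowed from the project layout set up later in this workshop and may not exist on your machine yet.

```{r}
#| echo: true
current_dir <- getwd()                        # the folder R is currently working in
print(current_dir)
data_path <- file.path(current_dir, "data")   # build a path to a subfolder in a portable way
dir.exists(data_path)                         # TRUE only if that folder already exists
```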
## Creating the project for this workshop
1. Go to `File` \> `New project`. Choose `New directory`, then `New project`
2. Enter `intro-r-socsci` as the name for this new folder (or "directory") and choose where you want to put this folder, e.g. `Desktop` or `Documents` if you are on Windows. This will be your working directory for the rest of the workshop!
3. Next, let's create 3 folders inside our working directory:
    - `data` - we will save our raw data here. **It's best practice to keep the data here untouched.**
    - `data-output` - if we need to modify raw data, store the modified version here.
    - `fig-output` - we will save all the graphics we created here!
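
If you prefer to create these folders from the R console rather than by clicking, a minimal sketch is shown here. It assumes `project_dir` points at your project folder; `tempdir()` is used as a stand-in so the example does not touch your real files.

```{r}
#| echo: true
project_dir <- tempdir()   # assumption: replace with your project folder, e.g. getwd()
folders <- c("data", "data-output", "fig-output")
for (folder in folders) {
  # showWarnings = FALSE keeps R quiet if the folder already exists
  dir.create(file.path(project_dir, folder), showWarnings = FALSE)
}
dir.exists(file.path(project_dir, folders))   # one TRUE/FALSE per folder
```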
::: callout-warning
Don't put your R projects inside your OneDrive folder as that may cause issues sometimes.
:::
# Let's Code!
Create a new R script - `File` \> `New File` \> `R script`.
**Note: RStudio does not autosave your progress, so remember to save from time to time!**
## R Objects and Values
In this line of code:
```{r}
#| echo: true
country_name <- "Singapore"
```
- `"Singapore"` is a **value**. This can be either a character, numeric, or boolean data type. (more on this soon)
- `country_name` is the **object** where we store this value. This is so that we can keep this value to be used later.
- `<-` is the assignment operator to assign the value to the object.
    - You can also use `=`, but generally in R, `<-` is the convention.
    - Keyboard shortcut: `Alt` + `-` in Windows (`Option` + `-` in Mac)
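
As a small illustration of why storing a value is useful, the sketch below reuses an object after creating it. `demo_country` is a made-up name, chosen so that `country_name` above is left untouched.

```{r}
#| echo: true
demo_country <- "Singapore"   # store the value in an object
print(demo_country)           # use the stored value later
nchar(demo_country)           # count the characters in the value: 9
demo_country <- "Canada"      # assigning again replaces the old value
print(demo_country)
```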
## Refresher: Quantitative Data Types
- [**Non-Continuous Data**]{.underline}
    - **Nominal/Categorical**: Non-ordered, non-numerical data, used to represent a qualitative attribute.
        - Example: nationality, neighborhood, employment status
    - **Ordinal**: Ordered non-numerical data.
        - Example: Nutri-grade ratings, frequency of exercise (daily, weekly, bi-weekly)
    - **Discrete**: Numerical data that can only take specific values (usually integers).
        - Example: Shoe size, clothing size
    - **Binary**: Nominal data with only two possible outcomes.
        - Example: pass/fail, yes/no, survive/not survive
------------------------------------------------------------------------
- [**Continuous Data**]{.underline}
    - **Interval**: Numerical data that can take any value within a range. [It does not have a "true zero".]{.underline}
        - Example: Celsius scale. A temperature of 0 °C does not represent the absence of heat.
    - **Ratio**: Numerical data that can take any value within a range. [It has a "true zero".]{.underline}
        - Example: Annual income. An annual income of 0 represents no income.
## Data Types in R
The four basic data types are character, numeric, logical (boolean), and integer. Let's look at examples using variables like those in our World Values Survey (WVS) data:
```{r}
#| echo: true
#| code-line-numbers: "|1|2|3|4"
country_code <- "SGP" # Character
life_satisfaction <- 8.5 # Numeric (also sometimes called Double)
is_religious <- TRUE # Boolean/Logical (true/false)
birth_year <- 1990L # Integer (whole numbers)
```
## Checking data type of a variable
You can use `str` or `typeof` to check the data type of an R object.
```{r}
#| echo: true
typeof(country_code)
```
```{r}
#| echo: true
str(is_religious)
```
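To see how these checks behave across types, the sketch below runs them on freshly created example objects (the `demo_` names are made up for this illustration). `typeof()` returns the type as a string, while `is.numeric()` and `is.character()` return TRUE or FALSE.

```{r}
#| echo: true
demo_score <- 8.5
demo_year <- 1990L
typeof(demo_score)      # "double": R's storage type for numeric values
typeof(demo_year)       # "integer": the L suffix makes it a whole number
typeof(1990)            # "double": without L, numbers are stored as double
is.numeric(demo_year)   # TRUE: integers also count as numeric
is.character("SGP")     # TRUE
```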
## Arithmetic operations in R
You can do arithmetic operations in R. For example, let's calculate average satisfaction scores:
```{r}
#| echo: true
(8 + 7 + 9) / 3 # Average of three satisfaction scores
```
```{r}
#| echo: true
2025 - 1990 # Calculate age from birth year
```
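Besides `+`, `-`, and `/`, base R has a few more arithmetic operators that come up often. The numbers below are invented for illustration, and the last two lines show why the brackets in the average calculation matter.

```{r}
#| echo: true
3 * 4              # multiplication: 12
2 ^ 3              # exponent: 8
17 %% 5            # remainder after division (modulo): 2
17 %/% 5           # integer division: 3
8 + 7 + 9 / 3      # without brackets, the division happens first: 18
(8 + 7 + 9) / 3    # with brackets, the whole sum is divided: 8
```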
## Boolean operations in R - Simple TRUE/FALSE statements
Boolean operations in R are useful for filtering survey data. Before that, let's look at how R evaluates simple TRUE/FALSE statements.

Is `life_satisfaction` greater than 8?
```{r}
#| echo: true
life_satisfaction <- 8.5 # assign a value of 8.5 to life_satisfaction
life_satisfaction > 8 # check whether the value is greater than 8
```
Is the country Singapore?
```{r}
#| echo: true
country_code == "SGP"
```
Is the country NOT Singapore?
```{r}
#| echo: true
country_code != "SGP"
```
## Boolean operations in R - AND operator
Sometimes, we may have multiple statements to evaluate. This is where the Boolean operators come in handy.

**AND** operations (both conditions must be TRUE). In R, AND is represented by the ampersand `&`.

Is the country New Zealand AND is the life satisfaction more than 8?
```{r}
#| echo: true
(country_code == "NZL") & (life_satisfaction > 8)
```
`country_code == "NZL"` is FALSE while `life_satisfaction > 8` is TRUE.

The whole statement returns FALSE because not all conditions are TRUE.
## Boolean operations in R - OR operator
**OR** operations (at least one condition must be TRUE). In R, OR is represented by the vertical bar (pipe) symbol `|`.

Is the country New Zealand OR is the life satisfaction more than 8?
```{r}
#| echo: true
(country_code == "NZL") | (life_satisfaction > 8)
```
As long as one condition is met, this will be TRUE.
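
The same operators work element by element on whole vectors, which is what makes them useful for filtering later on. A small sketch with made-up values (the `demo_` names are not part of the WVS data):

```{r}
#| echo: true
demo_countries <- c("SGP", "NZL", "CAN")
demo_values <- c(9, 6, 8)
demo_countries == "SGP"                        # TRUE FALSE FALSE
(demo_countries == "SGP") & (demo_values > 7)  # TRUE FALSE FALSE
(demo_countries == "SGP") | (demo_values > 7)  # TRUE FALSE TRUE
!(demo_values > 7)                             # NOT flips each value: FALSE TRUE FALSE
```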
## Functions in R
- A function is like a recipe in cooking.
- It takes some ingredients (inputs) and uses a set of instructions to produce a result (output).
- In R, a function is a pre-written set of instructions (a recipe) that performs a specific task. When you call a function, its name is always followed by round brackets `()`.

Example: the `round()` function rounds a number to the nearest whole number by default.
```{r}
#| echo: true
round(3.1415926)
```
- `round()` is the "recipe", while `3.1415926` is the "ingredient".

Saving the result to an object:
```{r}
#| echo: true
rounded_pi <- round(3.1415926)
print(rounded_pi)
```
## Functions with Arguments in R
- Following the recipe analogy, arguments are the ingredients you provide to a function.
- Some arguments are required, while others are optional (they have default values).
- Each argument tells the function what to use or how to perform the task.
- Example: Think of a bubble tea order as a function. The possible arguments/ingredients here are:
    - Tea - required ingredient
    - Milk - optional, included by default
    - Toppings - optional, the default choice is "pearls"
In R:
```{r}
#| echo: true
round(3.1415926, digits = 2)
```
- `3.1415926` is the required argument (if this is not provided, the function will not run)
- `digits` is an optional argument specifying how many decimal places to round to (the default is 0)
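
To make the bubble tea analogy concrete, here is a sketch of a home-made function with one required and two optional arguments. `order_bubble_tea()` is hypothetical and written only for this illustration; it is not part of R or any package.

```{r}
#| echo: true
# tea is required; milk and toppings have default values, so they are optional
order_bubble_tea <- function(tea, milk = TRUE, toppings = "pearls") {
  milk_text <- if (milk) "with milk" else "without milk"
  paste(tea, "tea", milk_text, "and", toppings)
}
order_bubble_tea("oolong")                                   # uses both defaults
order_bubble_tea("green", milk = FALSE, toppings = "jelly")  # overrides the defaults
```

Calling `order_bubble_tea()` with no tea at all raises an error, in the same way that `round()` fails when it is given no number.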
## How do I find out more about a particular function?
You can open the help page for a function by typing `?` in front of the function name.

E.g. if you want to find out more about the `round` function, you can run `?round` in your R console (bottom left panel).
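
A couple of related base R tools are sketched below. `args()` prints a function's arguments and their defaults without opening the help page. The other lines show more ways to reach the documentation; they are left commented out so that nothing pops up when the slides are rendered.

```{r}
#| echo: true
args(round)        # show the arguments of round() and their default values
# ?round           # open the help page for round()
# help("round")    # same as ?round
# example(round)   # run the examples from the help page
```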
## Packages in R
- Packages are collections of R functions, datasets, etc. Packages extend the functionality of R.
    - (The closest analogy I can think of is that they are the equivalent of browser add-ons, in a way.)
- Popular packages: `tidyverse`, `caret`, `shiny`, etc.
- Installation (you only need to do this once): `install.packages("package name")`
- Loading packages (you need to run this every time you restart RStudio): `library(package name)` - let's try to load `tidyverse`!
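
One way to check whether a package is already installed before loading it is `requireNamespace()`, as in this sketch. It only reports the status and does not install anything by itself.

```{r}
#| echo: true
pkg <- "tidyverse"
is_installed <- requireNamespace(pkg, quietly = TRUE)   # TRUE if the package is available
if (is_installed) {
  message(pkg, " is installed. Load it with library(", pkg, ").")
} else {
  message(pkg, " is missing. Run install.packages(\"", pkg, "\") once, then library(", pkg, ").")
}
```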
## Data Structures in R
In today's session, we will explore 3 basic types of data **structures** in R:
:::: {.columns}
::: {.column width="40%"}
1. **Vector** - can hold multiple values in a single variable/object.
2. **Factor** - Special data structure in R to handle categorical variables.
3. **Data frame** - De facto data structure for tabular data in R, and what we use for data processing, plotting, and statistics.
:::
::: {.column width="60%"}

:::
::::
## Data Structures in R: Vectors
- So far, each object we created holds only one value. But quite often you may want to group a bunch of values together and save them in a single object.
- A vector is a data structure that can do this. It is the most common and basic data structure in R. (pretty much the workhorse of R!)
```{r}
#| echo: true
countries <- c("CAN", "NZL", "SGP", "CAN", "SGP")
satisfaction_scores <- c(8, 7, 9, 6, 8)
employment_status <- c("Full time", "Student", "Part time", "Retired", "Full time")
```
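A few base R functions give a quick feel for what a vector holds. This sketch uses its own small vector, `demo_vec`, so it does not change the vectors created above.

```{r}
#| echo: true
demo_vec <- c(8, 7, 9, 6, 8)
length(demo_vec)     # number of items: 5
mean(demo_vec)       # average: 7.6
demo_vec + 1         # arithmetic applies to every item: 9 8 10 7 9
typeof(c(1, "a"))    # "character": mixing types converts everything to one type
```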
## Vector Manipulations: Retrieve and update items
Retrieve the first country in the vector
```{r}
#| echo: true
countries[1]
```
Retrieve the first three satisfaction scores
```{r}
#| echo: true
satisfaction_scores[1:3]
```
Update the first satisfaction score
```{r}
#| echo: true
satisfaction_scores[1] <- 7
print(satisfaction_scores)
```
## Why square brackets and not round brackets?
Round brackets `()` are for running functions, like using a tool: `mean()` or `sum()`.
Square brackets `[]` are for accessing specific parts of your data, where we pass the index number(s) of the element(s) we want. For dataframes, we can use either index numbers or column names (more on this later!)
## Vector Manipulations: Retrieve items based on criteria
Let's find high satisfaction scores (above 7)!
- The code below creates a boolean vector called `criteria` that keeps track of whether each item in `satisfaction_scores` fulfils our condition.
- The condition is "value must be \> 7". For example, if item 1 fulfils our condition, then item 1 is 'marked' as `TRUE`. Otherwise, it is marked as `FALSE`.
```{r}
#| echo: true
# Create boolean vector for our condition
criteria <- satisfaction_scores > 7
print(criteria)
```
- The code below applies the boolean vector `criteria` to `satisfaction_scores` and retrieves only the items that fulfil the condition, i.e. items whose position is marked as `TRUE` by the `criteria` vector.
```{r}
#| echo: true
# Use the boolean vector to filter satisfaction scores
satisfaction_scores[criteria]
```
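In practice the two steps are often combined, and conditions can be joined with `&` or `|`. A sketch with illustrative vectors (the `demo_` names are assumptions, not workshop objects):

```{r}
#| echo: true
demo_sat <- c(8, 7, 9, 6, 8)
demo_ctry <- c("CAN", "NZL", "SGP", "CAN", "SGP")
demo_sat[demo_sat > 7]                        # filter in one step: 8 9 8
demo_ctry[demo_sat > 7]                       # countries of those respondents: "CAN" "SGP" "SGP"
demo_sat[demo_sat > 7 & demo_ctry == "SGP"]   # both conditions: 9 8
which(demo_sat > 7)                           # positions marked TRUE: 1 3 5
```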
## Vector Manipulations: Handling NA values
- `NA` indicates a missing value, i.e. the absence of a value (0 is still a value!).
- Summary functions like `mean()` have an optional argument called `na.rm` that controls whether NA values are removed before the calculation.

Survey data often contains missing values (NA):
```{r}
#| echo: true
financial_satisfaction <- c(8, 7, NA, 6, 9, NA, 7)
# By default, mean() will return NA if there are any NA values
mean(financial_satisfaction)
# Remove NA values before calculating mean by specifying that na.rm = TRUE
mean(financial_satisfaction, na.rm = TRUE)
```
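To find out how many values are missing, and where, `is.na()` is the usual tool. A sketch using a separate example vector (`demo_fin` is a made-up name):

```{r}
#| echo: true
demo_fin <- c(8, 7, NA, 6, 9, NA, 7)
is.na(demo_fin)                # TRUE where a value is missing
sum(is.na(demo_fin))           # count of missing values: 2
demo_fin[!is.na(demo_fin)]     # keep only non-missing values: 8 7 6 9 7
mean(demo_fin, na.rm = TRUE)   # 7.4
```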
## Vector Manipulations: Adding items
Several ways to add items to a vector
```{r}
#| echo: true
#| eval: false
satisfaction_scores <- c(satisfaction_scores, 7) # <1>
satisfaction_scores <- c(satisfaction_scores, 8, 9, 10) # <2>
satisfaction_scores <- c(8, satisfaction_scores) # <3>
satisfaction_scores <- append(satisfaction_scores, 9, after = 2) # <4>
```
1. Add a single score to the end of the vector using c()
2. Add multiple scores to the end
3. Add a score to the beginning
4. Insert a score at a specific position using append()
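
Here is what each approach returns on a short example vector, so the effect of `after` in `append()` is visible. `demo_add` is an illustrative name.

```{r}
#| echo: true
demo_add <- c(8, 7, 9)
c(demo_add, 10)                   # add to the end: 8 7 9 10
c(5, demo_add)                    # add to the beginning: 5 8 7 9
append(demo_add, 10, after = 1)   # insert after position 1: 8 10 7 9
demo_add                          # unchanged, because the results were not assigned back
```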
## Vector Manipulations: Removing items
```{r}
#| echo: true
#| eval: false
satisfaction_scores <- satisfaction_scores[-c(2, 4)] # <1>
satisfaction_scores <- satisfaction_scores[satisfaction_scores <= 7] # <2>
satisfaction_scores <- na.omit(satisfaction_scores) # <3>
```
1. Remove elements by index using "negative indexing"
2. Keep only the elements that meet a condition (here, scores of 7 or below) and remove the rest, using logical indexing
3. Remove NA values from the vector
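
One thing to watch for when removing items with a condition is how `NA` behaves. The sketch below uses a separate example vector (`demo_rm`) and shows that a comparison with `NA` gives `NA`, so the missing value is carried into the result unless it is excluded explicitly.

```{r}
#| echo: true
demo_rm <- c(8, 7, NA, 6, 9)
demo_rm[-c(2, 4)]                         # drop items 2 and 4: 8 NA 9
demo_rm[demo_rm <= 7]                     # 7 NA 6: the NA is carried along
demo_rm[!is.na(demo_rm) & demo_rm <= 7]   # exclude NA explicitly: 7 6
```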
## Data Structures in R: Factors
- Special data structure in R to deal with categorical data.
- Can be ordered (ordinal) or unordered (nominal).
- May look like a normal vector at first glance, so use `str()` to check.
Unordered (Nominal):
```{r}
#| echo: true
employment_factor <- factor(c("Full time", "Part time", "Student", "Retired", "Student"))
str(employment_factor)
```
Ordered (Ordinal):
```{r}
#| echo: true
importance_factor <- factor(
  c("Very important", "Important", "Not very important", "Not at all important"),
  ordered = TRUE,
  levels = c("Not at all important", "Not very important", "Important", "Very important")
)
str(importance_factor)
```
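Two things factors make possible are counting categories and, for ordered factors, comparing levels. A sketch with its own example data (`demo_importance` is a made-up name):

```{r}
#| echo: true
demo_importance <- factor(
  c("Important", "Very important", "Not very important", "Important"),
  ordered = TRUE,
  levels = c("Not at all important", "Not very important", "Important", "Very important")
)
levels(demo_importance)          # the categories, in their defined order
table(demo_importance)           # counts per level, including levels with zero responses
demo_importance >= "Important"   # comparisons follow the level order: TRUE TRUE FALSE TRUE
```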
## Data Structures in R: Dataframe
- De facto data structure for tabular data in R, and what we use for data processing, plotting, and statistics.
- Similar to spreadsheets!
- You can create it by hand like so:
```{r}
#| echo: true
survey_data <- data.frame(
  country = c("SGP", "CAN", "NZL", "SGP", "CAN"),
  life_satisfaction = c(8, 7, 9, 6, 8),
  employment = c("Full time", "Student", "Part time", "Retired", "Full time")
)
print(survey_data)
```
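Once a data frame exists, columns can be pulled out with `$` and new columns added by assignment. This sketch builds its own small table (`demo_df`, with invented values) so the `survey_data` object above is left as is.

```{r}
#| echo: true
demo_df <- data.frame(
  country = c("SGP", "CAN", "NZL"),
  life_satisfaction = c(8, 7, 9)
)
nrow(demo_df)                                      # number of rows: 3
ncol(demo_df)                                      # number of columns: 2
mean(demo_df$life_satisfaction)                    # $ extracts a column as a vector; the mean is 8
demo_df$is_high <- demo_df$life_satisfaction > 7   # add a new logical column
print(demo_df)
```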
## Downloading the World Values Survey (WVS) Dataset
For this workshop, we will try loading a dataset from a file.

Go to the ['Dataset'](https://bellaratmelia.github.io/introductory-r-socsci/dataset.html) tab on the course website to download the data file and the information about this WVS data.

Download the CSV and save it under the `data` folder in your R project!
## Loading the WVS Dataset
Let's load our actual World Values Survey dataset:
```{r}
#| echo: true
#| output: false
library(tidyverse)
wvs_data <- read_csv("data/wvs-wave7-sg-ca-nz.csv") # read the CSV from the data folder into a tibble
head(wvs_data)
```
Make sure to save the CSV file in your data folder!
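
If `read_csv()` complains that the file does not exist, the usual cause is that the file is not where R is looking. A small base R check, assuming the file name and `data` folder used above:

```{r}
#| echo: true
csv_path <- file.path("data", "wvs-wave7-sg-ca-nz.csv")
if (file.exists(csv_path)) {
  message("Found ", csv_path)
} else {
  # Show where R is looking so the mismatch is easy to spot
  message("Cannot find ", csv_path, " inside ", getwd())
}
```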
## Exploring the WVS Dataset
```{r}
#| echo: true
#| output: false
#| eval: false
dim(wvs_data) # <1>
names(wvs_data) # <2>
str(wvs_data) # <3>
summary(wvs_data) # <4>
head(wvs_data, n=5) # <5>
tail(wvs_data, n=5) # <6>
```
1. return the number of rows and columns as a vector
2. list the column names
3. inspect structure
4. print the summary stats of the entire dataframe
5. view the first 5 rows
6. view the last 5 rows
## Basic dataframe manipulations: Retrieving values
Some basic dataframe functions before we move on to data wrangling next week:
```{r}
#| echo: true
#| output: false
#| eval: false
wvs_data["country"] # <1>
wvs_data$country # <2>
wvs_data[3] # <3>
wvs_data[1, 4] # <4>
wvs_data[3, ] # <5>
```
1. retrieve column by name (returns as tibble/dataframe)
2. another way to retrieve column by name (returns as vector)
3. get the third column by index
4. get the cell at row 1, column 4
5. get the entire third row
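
The difference between the bracket styles is easiest to see on a tiny table. The sketch below uses a base R `data.frame` with invented values. `wvs_data` is a tibble (because it was read with `read_csv()`), where single-bracket results such as `wvs_data[1, 4]` stay a tibble rather than dropping to a single value.

```{r}
#| echo: true
demo_tbl <- data.frame(
  country = c("SGP", "CAN", "NZL"),
  score = c(8, 7, 9),
  stringsAsFactors = FALSE   # keep text columns as character on older R versions too
)
class(demo_tbl["country"])       # "data.frame": single brackets keep the table structure
class(demo_tbl$country)          # "character": $ returns the column as a vector
class(demo_tbl[["country"]])     # "character": double brackets behave like $
demo_tbl[2, "score"]             # row 2 of the score column: 7
demo_tbl[demo_tbl$score > 7, ]   # rows that meet a condition
```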
# End of Session 1!
Next Session: Data wrangling with `dplyr` and `tidyr` packages - we'll learn how to:
- Filter survey responses by country
- Calculate average satisfaction scores by demographic groups
- Create new variables from existing ones
- Handle missing values in survey data
- And much more!