01-intro.qmd

---
title: "Introduction to R and RStudio"
author: "Bella Ratmelia"
format: revealjs
---

# Welcome!

```{r}
#| echo: false
#| warning: false
#| message: false
library(tidyverse)
```

## Preamble

-   About me:

    -   Senior Librarian, Research & Data Services team, SMU Libraries.

    -   Bachelor in Info Tech (IT), MSc in Info Studies from NTU.

    -   Have been with SMU since the pandemic era (2021).
    
    -   Have been doing this workshop since Aug 2023.

-   About this workshop:

    -   Live-coding format; code along with me!

    -   Goal of workshop: to give you enough fundamentals (at least to the point that ChatGPT can't bluff you so easily) and confidence to explore R on your own.

    -   Don't be afraid to ask for help! We are all here to learn.

## The outline for these workshops

The workshops are structured to follow this workflow when dealing with data

![](images/data-workflow.png)

::: aside
Image is taken from [R for Data Science (2e)](https://r4ds.hadley.nz/intro) by Hadley Wickham.
:::

## The outline for these workshops (explained)

::: incremental
1.  **Import** data into R, which means take data (stored in a file, via API, etc) and load it into a dataframe in R
2.  **Tidy** the imported data.
    -   Tidy = storing it in a consistent form that matches the semantics of the dataset.
    -   Tidy data = each column is a variable, each row is an observation
3.  Once a data it tidy, we can **transform** it. Transformation includes:
    -   narrowing in on observations of interest (like all people in one city or all data from the last year)
    -   creating new variables that are functions of existing variables (like computing speed from distance and time)
    -   calculating a set of summary statistics (like counts or means).
4.  Once we have tidy data with the info we need, we can **visualize** it and **model** it.
5.  **Communicate** the result. It doesn’t matter how well your models and visualization have led you to understand the data unless you can also communicate your results to others.
:::

## What is R? What is R Studio?

**R**: The programming language and the software that interprets the R script

**RStudio:** An IDE (Integrated Development Environment) that we use to interact more easily with R language and scripts.

. . .

You will need to install **both** for this workshop. Go to <https://posit.co/download/rstudio-desktop> to download and install both if you have not done so.

Check out the course website for a step-by-step guide.

## A Tour of RStudio

![R Studio layout](images/rstudio-tour.jpg){fig-align="center"}

## Working Directory

-   Working directory -\> where R will look for files (scripts, data, etc).

    -   By default, it will be on your Desktop

    -   Best practice is to use **R Project** to organize your files and data into projects.

    -   When using R Project, the working directory = project folder.

## Creating the project for this workshop

1.  Go to `File` \> `New project`. Choose `New directory`, then `New project`

2.  Enter `intro-r-socsci` as the name for this new folder (or "directory") and choose where you want to put this folder, e.g. `Desktop` or `Documents` if you are on Windows. This will be your working directory for the rest of the workshop!

<!-- -->

4.  Next, let's create 3 folders inside our working directory:

    -   `data` - we will save our raw data here. **It's best practice to keep the data here untouched.**

    -   `data-output` - if we need to modify raw data, store the modified version here.

    -   `fig-output` - we will save all the graphics we created here!

::: callout-warning
Don't put your R projects inside your OneDrive folder as that may cause issues sometimes.
:::

# Let's Code!

Create a new R script - `File` \> `New File` \> `R script`.

**Note: RStudio does not autosave your progress, so remember to save from time to time!**

## R Objects and Values

In this line of code:

```{r}
#| echo: true
country_name <- "Singapore"
```

-   `"Singapore"` is a **value**. This can be either a character, numeric, or boolean data type. (more on this soon)
-   `country_name` is the **object** where we store this value. This is so that we can keep this value to be used later.
-   `<-` is the assignment operator to assign the value to the object.
    -   You can also use `=`, but generally in R, `<-` is the convention.
    -   Keyboard shortcut: `Alt` + `-` in Windows (`Option` + `-` in Mac)

## Refresher: Quantitative Data Types

-   [**Non-Continuous Data**]{.underline}

    -   **Nominal/Categorical**: Non-ordered, non-numerical data, used to represent qualitative attribute.

        -   Example: nationality, neighborhood, employment status

    -   **Ordinal**: Ordered non-numerical data.

        -   Example: Nutri-grade ratings, frequency of exercise (daily, weekly, bi-weekly)

    -   **Discrete**: Numerical data that can only take specific value (usually integers)

        -   Example: Shoe size, clothing size

    -   **Binary**: Nominal data with only two possible outcome

        -   Example: pass/fail, yes/no, survive/not survive

------------------------------------------------------------------------

-   [**Continuous Data**]{.underline}

    -   **Interval**: Numerical data that can take any value within a range. [It does not have a "true zero".]{.underline}

        -   Example: Celsius scale. Temperature of 0 C does not represent absence of heat.

    -   **Ratio**: Numerical data that can take any value within a range. [it has a "true zero".]{.underline}

        -   Example: Annual income. annual income of 0 represents no income.

## Data Types in R

The four basic data types are characters, numeric, boolean, and integer. Let's look at examples using our WVS survey variables:

```{r}
#| echo: true
#| code-line-numbers: "|1|2|3|4"

country_code <- "SGP" # Character
life_satisfaction <- 8.5 # Numeric (also sometimes called Double)
is_religious <- TRUE # Boolean/Logical (true/false)
birth_year <- 1990L # Integer (whole numbers)
```

## Checking data type of a variable

You can use `str` or `typeof` to check the data type of an R object.

```{r}
#| echo: true
typeof(country_code)
```

```{r}
#| echo: true
str(is_religious) 
```

## Arithmetic operations in R

You can do arithmetic operations in R. For example, let's calculate average satisfaction scores:

```{r}
#| echo: true
(8 + 7 + 9) / 3  # Average of three satisfaction scores
```

```{r}
#| echo: true
2025 - 1990  # Calculate age from birth year
```

## Boolean operations in R - Simple TRUE/FALSE statements

Boolean operations in R are useful for filtering survey data. Before that, let's look at how R evaluates simple TRUE/FALSE statements

Is life_satisfaction greater than 8?
```{r}
#| echo: true

life_satisfaction <- 8.5 # assign a value of 8.5 to life_satisfaction
life_satisfaction > 8  #
```

Is the country Singapore?
```{r}
#| echo: true

country_code == "SGP"  
```

Is the country NOT Singapore?
```{r}
#| echo: true

country_code != "SGP"  
```

## Boolean operations in R - AND operator

Sometimes, we may have multiple statements to evaluate. This is where the Boolean Operators will come handy.

**AND** operations (both conditions must be TRUE). In R, it is represented by ampersand `&`

Is the country New Zealand AND is the life satisfaction more than 8?
```{r}
#| echo: true

(country_code == "NZL") & (life_satisfaction > 8) 
```

`country_code == "NZL"` is FALSE while `life_satisfaction > 8` is TRUE

The whole statement will return FALSE because not all conditions TRUE. 

## Boolean operations in R - OR operator

**OR** operations (at least one condition must be TRUE). In R, it is represented by pipe symbol `|`

Is the country New Zealand OR is the life satisfaction more than 8?
```{r}
#| echo: true

(country_code == "NZL") | (life_satisfaction > 8) 
```
As long as one condition is met, this will be TRUE.

## Functions in R

-   A function is like a recipe in cooking.

-   It takes some ingredients (inputs) and uses a set of instructions to produce a result (output).

-   In R, a function is a pre-written set of recipes/instructions that performs a specific task. Function name will always be followed by round brackets `()`

Example: `round()` function in R will round up numbers.

```{r}
#| echo: true
round(3.1415926)
```

-   `round()` is the "recipe", while `3.1415926` is the "ingredients"

Saving the result to an object:

```{r}
#| echo: true
rounded_pi <- round(3.1415926)
print(rounded_pi)
```

## Functions with Arguments in R

-   Following the recipe analogy, arguments are the ingredients you provide to a function.

-   Some arguments are required, while others are optional (they have default values).

-   Each argument tells the function what to use or how to perform the task.

- Example: Think of a bubble tea order as a function. The possible arguments/ingredients here are:

    -   Tea - required ingredient

    -   Milk - optional, the default is to include

    -   Toppings - optional, the default choice is "pearls"

In R:
```{r}
#| echo: true
round(3.1415926, digits = 2)
```

-   `3.1415926` is the required argument (if this is not provided, the function will not run)

-   `digits` is an optional argument specifying how many decimal places to round to (the default is 0)

## How do I find out more about a particular function?

You can call the help page / vignette in R by prepending `?` to the function name.

E.g. if you want to find out more about the `round` function, you can run `?round` in your R console (bottom left panel)

## Packages in R

-   Packages are a collections of R functions, datasets, etc. Packages extend the functionality of R.

    -   (Closest analogy I can think of is that they're equivalent of browser add-ons, in a way)

-   Popular packages: `tidyverse`, `caret`, `shiny`, etc.

-   Installation (you only need to do this once): `install.packages("package name")`

-   Loading packages (you need to run this everytime you restart RStudio): `library(package name)` - let's try to load `tidyverse`!

## Data Structures in R

In today's session, we will explore 3 basic types of data **structures** in R:

:::: {.columns}

::: {.column width="40%"}
1.  **Vector** - can hold multiple values in a single variable/object.

2.  **Factor** - Special data structure in R to handle categorical variables.

3.  **Data frame** - De facto data structure for tabular data in R, and what we use for data processing, plotting, and statistics.
:::

::: {.column width="60%"}
![](images/data-structures.svg)
:::

::::


## Data Structures in R: Vectors

-   Basic objects in R can only contain one value. But quite often you may want to group a bunch of values together and save it in a single object.

-   A vector is a data structure that can do this. It is the most common and basic data structure in R. (pretty much the workhorse of R!)


```{r}
#| echo: true
countries <- c("CAN", "NZL", "SGP", "CAN", "SGP")
satisfaction_scores <- c(8, 7, 9, 6, 8)
employment_status <- c("Full time", "Student", "Part time", "Retired", "Full time")
```

## Vector Manipulations: Retrieve and update items

Retrieve the first country in the vector
```{r}
#| echo: true
countries[1]
```

Retrieves the first three satisfaction scores
```{r}
#| echo: true
# 
satisfaction_scores[1:3]
```

Update the first satisfaction score
```{r}
#| echo: true

satisfaction_scores[1] <- 7
print(satisfaction_scores)
```

## Why square brackets and not round brackets?

Round brackets `()` are for running functions, like using a tool: `mean()` or `sum()`.

Square brackets `[]` are for accessing specific parts of your data, where we pass the index number(s) of the element(s) we want. For dataframes, we can use either index numbers or column names (more on this later!)


## Vector Manipulations: Retrieve items based on criteria

Let's find high satisfaction scores (above 7)!

-   The code below will create a boolean vector called `criteria` that basically keep tracks on whether each items inside `satisfaction_scores` fulfil our condition.

-   The condition is "value must be \> 7". e.g. if item 1 fulfils our condition, then item 1 is 'marked' as `TRUE`. Otherwise, it will be `FALSE`

```{r}
#| echo: true
# Create boolean vector for our condition
criteria <- satisfaction_scores > 7
print(criteria)
```

-   This line of code applies the boolean vector `criteria` to `satisfaction_scores`, and only retrieve items that fulfils the condition. i.e. items whose position is marked as `TRUE` by `criteria` vector

```{r}
#| echo: true
# Use the boolean vector to filter satisfaction scores
satisfaction_scores[criteria]
```


## Vector Manipulations: Handling NA values

-   NA values indicate null values, or the absence of a value (0 is still a value!)

-   Summary functions like `mean` needs you to specify in the optional argument called `na.rm` on how you want it to be handled.

Survey data often contains missing values (NA):

```{r}
#| echo: true
financial_satisfaction <- c(8, 7, NA, 6, 9, NA, 7)

# By default, mean() will return NA if there are any NA values
mean(financial_satisfaction)

# Remove NA values before calculating mean by specifying that na.rm = TRUE
mean(financial_satisfaction, na.rm = TRUE)
```

## Vector Manipulations: Adding items

Several ways to add items to a vector

```{r}
#| echo: true
#| eval: false

satisfaction_scores <- c(satisfaction_scores, 7) # <1>
satisfaction_scores <- c(satisfaction_scores, 8, 9, 10) # <2>
satisfaction_scores <- c(8, satisfaction_scores) # <3>
satisfaction_scores <- append(satisfaction_scores, 9, after = 2) # <2> # <4>
```

1.  Add a single score to the end of the vector using c()
2.  Add multiple scores to the end
3.  Add a score to the beginning
4.  Insert a score at a specific position using append()

## Vector Manipulations: Removing items

```{r}
#| echo: true
#| eval: false

satisfaction_scores <- satisfaction_scores[-c(2, 4)] # <1>
satisfaction_scores <- satisfaction_scores[satisfaction_scores <= 7] # <2>
satisfaction_scores <- na.omit(satisfaction_scores) # <3>

```

1.  Remove elements by index using "negative indexing"
2.  Remove elements based on a condition using logical indexing
3.  Remove NA values from the vector

## Data Structures in R: Factors

-   Special data structure in R to deal with categorical data.

-   Can be ordered (ordinal) or unordered (nominal).

-   May look like a normal vector at first glance, so use `str()` to check.

Unordered (Nominal):

```{r}
#| echo: true 
employment_factor <- factor(c("Full time", "Part time", "Student", "Retired", "Student"))
str(employment_factor)
```

Ordered (Ordinal):

```{r}
#| echo: true 
importance_factor <- factor(
    c("Very important", "Important", "Not very important", "Not at all important"),
    ordered = TRUE,
    levels = c("Not at all important", "Not very important", "Important", "Very important")
)
str(importance_factor)
```

## Data Structures in R: Dataframe

-   De facto data structure for tabular data in R, and what we use for data processing, plotting, and statistics.

-   Similar to spreadsheets!

-   You can create it by hand like so:

```{r}
#| echo: true
survey_data <- data.frame(
    country = c("SGP", "CAN", "NZL", "SGP", "CAN"),
    life_satisfaction = c(8, 7, 9, 6, 8),
    employment = c("Full time", "Student", "Part time", "Retired", "Full time")
)
print(survey_data)
```

## Downloading the World Values Survey (WVS) Dataset

For this workshop, we will try loading a dataset from a file.

Go to the course website and go to the ['Dataset'](https://bellaratmelia.github.io/introductory-r-socsci/dataset.html) tab to download the data file and information about this WVS data

Download this CSV and save it under your `data` folder in your R project!

## Loading the WVS Dataset

Let's load our actual World Values Survey dataset:

```{r}
#| echo: true
#| output: false
library(tidyverse)

wvs_data <- read_csv("data/wvs-wave7-sg-ca-nz.csv") # 
head(wvs_data)
```

Make sure to save the CSV file in your data folder!

## Exploring the WVS Dataset

```{r}
#| echo: true
#| output: false
#| eval: false

dim(wvs_data) # <1>
names(wvs_data) # <2>
str(wvs_data) # <3>
summary(wvs_data) # <4> 
head(wvs_data, n=5) # <5>
tail(wvs_data, n=5) # <6>
```

1.  return a vector of number of rows and columns
2.  inspect columns
3.  inspect structure
4.  print the summary stats of the entire dataframe
5.  view the first 5 rows
6.  view the last 5 rows

## Basic dataframe manipulations: Retrieving values

Some basic dataframe functions before we move on to data wrangling next week:

```{r}
#| echo: true
#| output: false
#| eval: false

wvs_data["country"] # <1>
wvs_data$country # <2>
wvs_data[3] # <3>
wvs_data[1, 4] # <4> 
wvs_data[3, ] # <5>
```

1.  retrieve column by name (returns as tibble/dataframe)
2.  another way to retrieve column by name (returns as vector)
3.  get an entire column by index
4.  get a cell at this row, column coord
5.  get an entire row

# End of Session 1!

Next Session: Data wrangling with `dplyr` and `tidyr` packages - we'll learn how to:

-   Filter survey responses by country

-   Calculate average satisfaction scores by demographic groups

-   Create new variables from existing ones

-   Handle missing values in survey data

-   And much more!