CSC3007_Darkkhaki_Team_Project_Data_Preparation.qmd

---
title: "P3 Darkkhaki"
author: "Zhang XiangHui, Lin Weichen, Priscilla Thung, Tang Guan You, Arshad Bin Mazlan, Wong Jing Yong Shawn"
date: "2024-06-29"
format: html
---

# Introduction

This document contains the data engineering required to contruct the US map with the percentage increase in virtual restaurants by state between 2020 and 2023. The data engineering includes data cleaning, transformation, grouping, merging, and preparing the data for plotting. The data engineering is done using R programming language.

The code below requires the following packages:

```{r}
#| label: library
#| echo: false
#| message: false
#| warning: false
#| results: "hide"

library(dplyr)
library(ggplot2)
library(usmap)
library(scales)
library(tidyverse)
library(sf)
library(ggtext)
library(stringr)
```

# Data Cleaning and transformation

The data cleaning and transformation process involves loading the data, extracting the state abbreviation, cleaning it up, selecting the required columns, adding a new column, grouping the data by state and year, calculating the percentage difference, and preparing the data for plotting.

## Loading the data

The origin of `restaurant_2020.csv` and `restaurant_2023.csv` datasets are from Kaggle and Github respectively, https://www.kaggle.com/datasets/ahmedshahriarsakib/uber-eats-usa-restaurants-menus?resource=download&select=restaurants.csv , and  https://github.com/gsunit/Extreme-Uber-Eats-Scraping/blob/master/1.5M-Uber-Eats-Restaurants.zip

`restaurants_2020.csv` and `restaurants_2023.csv` are the two datasets used for this analysis. There are two columns in the `restaurants_2020.csv` dataset: `name` and `addressRegion`. 
The `restaurants_2023.csv` dataset contains several columns: `id`, `position`, `name`, `score`, `ratings`, `category`, `price range`, `full_address`, `zip_code`, `lat`, `lng`.

Therefore, data cleaning and transformation are required to prepare the data.

```{r}
# Load the restaurants 2020 data
restaurants_2020 <- read_csv("restaurants_2020.csv", show_col_types = FALSE) |>
  drop_na(addressRegion)

# Load the restaurants 2023 data
restaurants_2023 <- read_csv("restaurants_2023.csv", show_col_types = FALSE) |>
  drop_na(full_address)

# Display both datasets with head
head(restaurants_2020)
head(restaurants_2023)

```
## Cleaning the data

Since `restaurants_2023` dataset contains the state abbreviation in the `full_address` column, we need to extract the state abbreviation using a regular expression and clean it up. We will then select the id, state and name columns and add a new column `Year Partnered` to the `restaurants_2023` dataset.

This `year parterned` column will be used later to merge the two datasets and calculate the percentage difference.


```{r}
# Extract the state abbreviation using a regular expression and clean it up
extracted_restaurants_2023 <- restaurants_2023 %>%
  mutate(state = str_extract(full_address, ",\\s*([A-Z]{2})\\s*,") %>%
           str_replace_all("^,\\s*|\\s*,$", ""))

# Select the state and Year Partnered columns
extracted_restaurants_2023 <- extracted_restaurants_2023 %>%
  select(id, state, name)

# Add a year partnered column to the 2023 dataset
extracted_restaurants_2023 <- extracted_restaurants_2023 %>%
  mutate(`Year Partnered` = NaN)

# Group the states and display the number of restaurants
grouped_restaurants_2023 <- extracted_restaurants_2023 %>%
  group_by(state) %>%
  summarise(count = n())

grouped_restaurants_2023

```

For comparison, group and display the restaurants_2020 by state as well.

```{r}
# Group the restaurants by state
grouped_restaurants_2020 <- restaurants_2020 %>%
  group_by(addressRegion) %>%
  summarise(count = n())

grouped_restaurants_2020
```

## Group the restaurants by state

It is noticed that the `addressRegion` column in the `restaurants_2020` dataset is not named `state`. Therefore, we will rename the column to `state` and add a `Year Partnered` column to the `restaurants_2020` dataset.

We will also add a `Year Partnered` column to the `extracted_restaurants_2023` dataset to indicate the year the restaurant was partnered with Uber Eats.

```{r}
# Rename the addressRegion column to state for the 2020 dataset
restaurants_2020 %>%
  rename(
    state = addressRegion
  )

# Add a Year Partnered column to the 2020 dataset
restaurants_2020 <- restaurants_2020 %>%
  mutate(Year_Partnered = 2020)

# Add a Year Partnered column to the 2023 dataset
filtered_restaurants <- extracted_restaurants_2023 %>%
  mutate(`Year Partnered` = ifelse(name %in% restaurants_2020$name, 2020, 2023))

# display filtered_restaurants with only state and year partnered columns
filtered_restaurants <- filtered_restaurants %>%
  select(state, `Year Partnered`)

```

## Merge the two datasets and calculate the percentage difference

We merged the two datasets and then calculated the percentage difference between the number of restaurants in 2020 and 2023 for each state.

```{r}
# Group the data by state and year
grouped_restaurants <- filtered_restaurants |>
  group_by(state, `Year Partnered`) |>
  summarise(count = n())

# Calculate the percentage increase after each year
# Take NaN as 0 to show no increase in restaurants
filtered_data <- grouped_restaurants |>
  group_by(state) |>
  mutate(percentage_diff = if_else(is.na(lag(count)), 0,
                                   (count - lag(count)) / lag(count) * 100))

filtered_data
```
# Plotting

The data is now ready for plotting. We will prepare the data for plotting by calculating the centroids for each state, adjusting the coordinates for Florida (FL) and Hawaii (HI), finding the missing states, defining a list of states with smaller areas, and preparing the state labels and centroid labels for plotting.


```{r}
# Get state centroids for labeling
state_coords <- usmap::us_map(regions = "states")

# Convert to an sf object
state_coords_sf <- st_as_sf(state_coords, coords = c("long", "lat"), crs = 4326
                            , agr = "constant")

# Calculate centroids for each state
state_centroids <- state_coords_sf %>%
  st_centroid() %>%
  st_coordinates() %>%
  as_tibble() %>%
  bind_cols(state_coords_sf)

# Prepare data for centroid labels
state_centroids <- state_centroids %>%
  rename(lon = X, lat = Y) %>%
  select(abbr, lon, lat)

# Print state centroids before adjustments
print(state_centroids %>% filter(abbr %in% c("FL", "HI")))

# Adjust the coordinates for FL and HI
state_centroids <- state_centroids %>%
  mutate(
    lon = if_else(abbr == "FL", lon + 150000, lon),
    lat = if_else(abbr == "FL", lat - 100000, lat),
    lon = if_else(abbr == "HI", lon + 100000, lon),
    lat = if_else(abbr == "HI", lat - 75000, lat)
  )

# Print state centroids after adjustments
print(state_centroids %>% filter(abbr %in% c("FL", "HI")))

# Find missing states (states with no data)
all_states <- unique(state_centroids$abbr)
plotted_states <- unique(filtered_data$state)
missing_states <- setdiff(all_states, plotted_states)

# Define a list of states with smaller areas
# (For a smaller size of the label text)
small_area_states <- c("CT", "DE", "DC", "HI", "MD",
                       "MA", "NH", "NJ", "RI")

# Prepare state_labels for plotting
state_labels <- state_centroids %>%
  left_join(filtered_data, by = c("abbr" = "state")) %>%
  mutate(
    label_color = ifelse(is.na(percentage_diff) | percentage_diff <= 200
                         , "black", "white"),
    label_size = ifelse(abbr %in% small_area_states, 1.5, 3)
  )

# Prepare centroid_labels for states with no data
centroid_labels <- state_centroids %>%
  mutate(
    label_text = ifelse(abbr %in% missing_states, abbr, ""),
    label_size = ifelse(abbr %in% small_area_states, 1.5, 3)
  )

```

## Plot the US map with percentage increase

Finally, we plot the US map with the percentage increase in virtual restaurants by state between 2020 and 2023.

```{r}
# Define the breaks for the color scale
breaks <- c(-10, 0, 50, 100, 200, 300)

# Plotting the US map with the percentage increase
plot_usmap(data = filtered_data, values = "percentage_diff", color = "white") +
  theme(legend.position = "right") +
  scale_fill_fermenter(palette = "YlOrBr", breaks = breaks,
                       name = "Percentage Increase",
                       labels = percent_format(scale = 1),
                       direction = 1) +
  labs(title = "Increase in virtual restaurants by State",
       subtitle = "between 2020 and 2023",
       caption = "Grey areas represent states that do not have data<br><br>
       **Source**: Kaggle Dataset of Uber Eats USA Restaurants and 
       Github \"gsunit\" Uber Eats Scraping") +
  theme(
    plot.caption = element_markdown(),
    legend.title = element_text(size = 10, hjust = 0.5),
    legend.text = element_text(size = 8),
    plot.title = element_text(hjust = 0.5, size = 16),
    plot.subtitle = element_text(hjust = 0.5, size = 12),
    legend.position = "top",  # Position legend at the top
    legend.box = "vertical",  # Arrange legends vertically
    legend.key.width = unit(1, "cm"),  # Set legend key width
    legend.frame = element_rect(), # Add a frame around the legend
    legend.margin = margin(t = 1, unit = "pt"),
    legend.title.position = "top",
  )  +
  geom_text(data = state_labels, aes(
    x = lon, y = lat,
    label = abbr,
    color = label_color,
    size = label_size
  )) +
  geom_text(data = centroid_labels, aes(
    x = lon, y = lat,
    label = label_text,
    size = label_size,
  ), color = "black") +
  scale_color_identity() +
  scale_size_identity() +
  theme(
    # Change the panel background color
    panel.background = element_rect(fill = "aliceblue", color = NA),
    # Change the plot background color
    plot.background = element_rect(fill = "white", color = NA),
  )
```