-
Notifications
You must be signed in to change notification settings - Fork 0
/
CSC3007_Darkkhaki_Team_Project_Data_Preparation.qmd
265 lines (207 loc) · 9.57 KB
/
CSC3007_Darkkhaki_Team_Project_Data_Preparation.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
---
title: "P3 Darkkhaki"
author: "Zhang XiangHui, Lin Weichen, Priscilla Thung, Tang Guan You, Arshad Bin Mazlan, Wong Jing Yong Shawn"
date: "2024-06-29"
format: html
---
# Introduction
This document contains the data engineering required to contruct the US map with the percentage increase in virtual restaurants by state between 2020 and 2023. The data engineering includes data cleaning, transformation, grouping, merging, and preparing the data for plotting. The data engineering is done using R programming language.
The code below requires the following packages:
```{r}
#| label: library
#| echo: false
#| message: false
#| warning: false
#| results: "hide"
library(dplyr)
library(ggplot2)
library(usmap)
library(scales)
library(tidyverse)
library(sf)
library(ggtext)
library(stringr)
```
# Data Cleaning and transformation
The data cleaning and transformation process involves loading the data, extracting the state abbreviation, cleaning it up, selecting the required columns, adding a new column, grouping the data by state and year, calculating the percentage difference, and preparing the data for plotting.
## Loading the data
The origin of `restaurant_2020.csv` and `restaurant_2023.csv` datasets are from Kaggle and Github respectively, https://www.kaggle.com/datasets/ahmedshahriarsakib/uber-eats-usa-restaurants-menus?resource=download&select=restaurants.csv , and https://github.com/gsunit/Extreme-Uber-Eats-Scraping/blob/master/1.5M-Uber-Eats-Restaurants.zip
`restaurants_2020.csv` and `restaurants_2023.csv` are the two datasets used for this analysis. There are two columns in the `restaurants_2020.csv` dataset: `name` and `addressRegion`.
The `restaurants_2023.csv` dataset contains several columns: `id`, `position`, `name`, `score`, `ratings`, `category`, `price range`, `full_address`, `zip_code`, `lat`, `lng`.
Therefore, data cleaning and transformation are required to prepare the data.
```{r}
# Load the restaurants 2020 data
restaurants_2020 <- read_csv("restaurants_2020.csv", show_col_types = FALSE) |>
drop_na(addressRegion)
# Load the restaurants 2023 data
restaurants_2023 <- read_csv("restaurants_2023.csv", show_col_types = FALSE) |>
drop_na(full_address)
# Display both datasets with head
head(restaurants_2020)
head(restaurants_2023)
```
## Cleaning the data
Since `restaurants_2023` dataset contains the state abbreviation in the `full_address` column, we need to extract the state abbreviation using a regular expression and clean it up. We will then select the id, state and name columns and add a new column `Year Partnered` to the `restaurants_2023` dataset.
This `year parterned` column will be used later to merge the two datasets and calculate the percentage difference.
```{r}
# Extract the state abbreviation using a regular expression and clean it up
extracted_restaurants_2023 <- restaurants_2023 %>%
mutate(state = str_extract(full_address, ",\\s*([A-Z]{2})\\s*,") %>%
str_replace_all("^,\\s*|\\s*,$", ""))
# Select the state and Year Partnered columns
extracted_restaurants_2023 <- extracted_restaurants_2023 %>%
select(id, state, name)
# Add a year partnered column to the 2023 dataset
extracted_restaurants_2023 <- extracted_restaurants_2023 %>%
mutate(`Year Partnered` = NaN)
# Group the states and display the number of restaurants
grouped_restaurants_2023 <- extracted_restaurants_2023 %>%
group_by(state) %>%
summarise(count = n())
grouped_restaurants_2023
```
For comparison, group and display the restaurants_2020 by state as well.
```{r}
# Group the restaurants by state
grouped_restaurants_2020 <- restaurants_2020 %>%
group_by(addressRegion) %>%
summarise(count = n())
grouped_restaurants_2020
```
## Group the restaurants by state
It is noticed that the `addressRegion` column in the `restaurants_2020` dataset is not named `state`. Therefore, we will rename the column to `state` and add a `Year Partnered` column to the `restaurants_2020` dataset.
We will also add a `Year Partnered` column to the `extracted_restaurants_2023` dataset to indicate the year the restaurant was partnered with Uber Eats.
```{r}
# Rename the addressRegion column to state for the 2020 dataset
restaurants_2020 %>%
rename(
state = addressRegion
)
# Add a Year Partnered column to the 2020 dataset
restaurants_2020 <- restaurants_2020 %>%
mutate(Year_Partnered = 2020)
# Add a Year Partnered column to the 2023 dataset
filtered_restaurants <- extracted_restaurants_2023 %>%
mutate(`Year Partnered` = ifelse(name %in% restaurants_2020$name, 2020, 2023))
# display filtered_restaurants with only state and year partnered columns
filtered_restaurants <- filtered_restaurants %>%
select(state, `Year Partnered`)
```
## Merge the two datasets and calculate the percentage difference
We merged the two datasets and then calculated the percentage difference between the number of restaurants in 2020 and 2023 for each state.
```{r}
# Group the data by state and year
grouped_restaurants <- filtered_restaurants |>
group_by(state, `Year Partnered`) |>
summarise(count = n())
# Calculate the percentage increase after each year
# Take NaN as 0 to show no increase in restaurants
filtered_data <- grouped_restaurants |>
group_by(state) |>
mutate(percentage_diff = if_else(is.na(lag(count)), 0,
(count - lag(count)) / lag(count) * 100))
filtered_data
```
# Plotting
The data is now ready for plotting. We will prepare the data for plotting by calculating the centroids for each state, adjusting the coordinates for Florida (FL) and Hawaii (HI), finding the missing states, defining a list of states with smaller areas, and preparing the state labels and centroid labels for plotting.
```{r}
# Get state centroids for labeling
state_coords <- usmap::us_map(regions = "states")
# Convert to an sf object
state_coords_sf <- st_as_sf(state_coords, coords = c("long", "lat"), crs = 4326
, agr = "constant")
# Calculate centroids for each state
state_centroids <- state_coords_sf %>%
st_centroid() %>%
st_coordinates() %>%
as_tibble() %>%
bind_cols(state_coords_sf)
# Prepare data for centroid labels
state_centroids <- state_centroids %>%
rename(lon = X, lat = Y) %>%
select(abbr, lon, lat)
# Print state centroids before adjustments
print(state_centroids %>% filter(abbr %in% c("FL", "HI")))
# Adjust the coordinates for FL and HI
state_centroids <- state_centroids %>%
mutate(
lon = if_else(abbr == "FL", lon + 150000, lon),
lat = if_else(abbr == "FL", lat - 100000, lat),
lon = if_else(abbr == "HI", lon + 100000, lon),
lat = if_else(abbr == "HI", lat - 75000, lat)
)
# Print state centroids after adjustments
print(state_centroids %>% filter(abbr %in% c("FL", "HI")))
# Find missing states (states with no data)
all_states <- unique(state_centroids$abbr)
plotted_states <- unique(filtered_data$state)
missing_states <- setdiff(all_states, plotted_states)
# Define a list of states with smaller areas
# (For a smaller size of the label text)
small_area_states <- c("CT", "DE", "DC", "HI", "MD",
"MA", "NH", "NJ", "RI")
# Prepare state_labels for plotting
state_labels <- state_centroids %>%
left_join(filtered_data, by = c("abbr" = "state")) %>%
mutate(
label_color = ifelse(is.na(percentage_diff) | percentage_diff <= 200
, "black", "white"),
label_size = ifelse(abbr %in% small_area_states, 1.5, 3)
)
# Prepare centroid_labels for states with no data
centroid_labels <- state_centroids %>%
mutate(
label_text = ifelse(abbr %in% missing_states, abbr, ""),
label_size = ifelse(abbr %in% small_area_states, 1.5, 3)
)
```
## Plot the US map with percentage increase
Finally, we plot the US map with the percentage increase in virtual restaurants by state between 2020 and 2023.
```{r}
# Define the breaks for the color scale
breaks <- c(-10, 0, 50, 100, 200, 300)
# Plotting the US map with the percentage increase
plot_usmap(data = filtered_data, values = "percentage_diff", color = "white") +
theme(legend.position = "right") +
scale_fill_fermenter(palette = "YlOrBr", breaks = breaks,
name = "Percentage Increase",
labels = percent_format(scale = 1),
direction = 1) +
labs(title = "Increase in virtual restaurants by State",
subtitle = "between 2020 and 2023",
caption = "Grey areas represent states that do not have data<br><br>
**Source**: Kaggle Dataset of Uber Eats USA Restaurants and
Github \"gsunit\" Uber Eats Scraping") +
theme(
plot.caption = element_markdown(),
legend.title = element_text(size = 10, hjust = 0.5),
legend.text = element_text(size = 8),
plot.title = element_text(hjust = 0.5, size = 16),
plot.subtitle = element_text(hjust = 0.5, size = 12),
legend.position = "top", # Position legend at the top
legend.box = "vertical", # Arrange legends vertically
legend.key.width = unit(1, "cm"), # Set legend key width
legend.frame = element_rect(), # Add a frame around the legend
legend.margin = margin(t = 1, unit = "pt"),
legend.title.position = "top",
) +
geom_text(data = state_labels, aes(
x = lon, y = lat,
label = abbr,
color = label_color,
size = label_size
)) +
geom_text(data = centroid_labels, aes(
x = lon, y = lat,
label = label_text,
size = label_size,
), color = "black") +
scale_color_identity() +
scale_size_identity() +
theme(
# Change the panel background color
panel.background = element_rect(fill = "aliceblue", color = NA),
# Change the plot background color
plot.background = element_rect(fill = "white", color = NA),
)
```