-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathp8105_hw2_yc4195.RMD
185 lines (164 loc) · 8.1 KB
/
p8105_hw2_yc4195.RMD
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
---
title: "p8105_hw2_yc4195.RMD"
author: "Yimin Chen yc4195"
date: '`r format(Sys.time(), "%Y-%m-%d")`'
output: github_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(collapse = TRUE, message = FALSE)
library(tidyverse)
library(readxl)
library(dplyr)
```
## Problem 1
```{r}
nyc_transit_data =
read_csv(file = "data/NYC_Transit_Subway_Entrance_And_Exit_Data.csv") %>%
janitor::clean_names()%>%
select(line, station_name, station_latitude, station_longitude, route1:route11, entry, exit_only,vending, entrance_type, ada)%>%
mutate(
entry = recode(entry,`YES`= TRUE,`NO`=FALSE))
```
**A brief overview of NYC Transit data**:
The NYC Transit Subway Data contains the fundamental information in the NYC subway: the Subway line , station name, latitude and longitude of each station, entrance types, exits,the routes served, whether each station has admission or vending, and whether it is ADA compliant.For Dimension, it has 1868 rows and 20 columns.
My data cleaning steps:
1.import data and clean the variable
2.retain required variables
3.recode the 'entry' variable from a character variable with "YES" or "NO" values to a logical variable with "TRUE" or "FALSE" values.
At this point, the data are not tidy enough. For routes, the variables are spreading over 11 column instead of a united variable name such as "route name/number". Route number and route should both be variables, which means we need to convert 'route' variables from wide to long format.
```{r}
# Distinct on station name and line
distinctstation1 <- nyc_transit_data %>% distinct(line,station_name,.keep_all=TRUE)
distinctstation <- nyc_transit_data %>% distinct(line,station_name)%>%nrow()
distinctstation
```
There are `r distinctstation` distinct stations.
```{r}
ADA_compliant = filter(nyc_transit_data, ada == 'TRUE')
ADA_compliant<-ADA_compliant %>%distinct(line,station_name)%>%nrow()
ADA_compliant
```
`r ADA_compliant` stations ADA compliant.
```{r}
proportion <-nyc_transit_data %>%filter(entry == 'TRUE',vending == 'NO')%>%nrow()/nyc_transit_data %>%filter(vending == "NO") %>%nrow()
proportion
```
`r proportion`% of station entrances/exits without vending allow entrance.
```{r}
nyc_tidy_data <- nyc_transit_data %>%
mutate_all(as.character)%>%
pivot_longer(route1:route11,names_to = "route_name", names_prefix = "route_", values_to = "route_number")
nyc_tidy_data %>%
filter(route_number == "A") %>%
distinct(line, station_name) %>%
nrow()
a =nyc_tidy_data %>%
filter(route_number == "A") %>%
distinct(line, station_name) %>%
nrow()
```
There are `r a ` distinct stations which serve the A train.
```{r}
nyc_tidy_data %>%
filter(ada == TRUE) %>%
filter(route_number == "A") %>%
distinct(line, station_name) %>%
nrow()
b= nyc_tidy_data %>%
filter(ada == TRUE) %>%
filter(route_number == "A") %>%
distinct(line, station_name) %>%
nrow()
```
Of the stations that serve the A train, `r b ` are ADA compliant.
## Problem 2
Read and clean the Mr. Trash Wheel sheet.
```{r}
trashwheel <- read_excel("data/Trash Wheel Collection Data.xlsx", sheet = 1, range = cell_cols("A:N")) %>%
janitor::clean_names() %>%
rename(dumpster_number = dumpster) %>%
filter(!is.na(dumpster_number)) %>%
mutate(sports_balls = as.integer(round(sports_balls, digits = 0)))%>%
mutate(year = as.integer(year))%>%
mutate(dumpster_number = as.character(dumpster_number))%>%
add_column(dataset = "Mr. Trash Wheel")
trashwheel
```
Use a similar process to import, clean, and organize the data for Professor Trash Wheel, and combine this with the Mr. Trash Wheel dataset to produce a single tidy dataset.
```{r}
proftrashwheel <- read_excel("data/Trash Wheel Collection Data.xlsx", sheet = 2, range = cell_cols("A:M")) %>%
janitor::clean_names() %>%
rename(dumpster_number = dumpster) %>%
filter(!is.na(dumpster_number)) %>%
mutate(sports_balls = NA)%>%
mutate(year = as.integer(year))%>%
mutate(dumpster_number = as.character(dumpster_number))%>%
add_column(dataset = "Professor Trash Wheel")
proftrashwheel
combined_data =
full_join(trashwheel,proftrashwheel,by = colnames(trashwheel)[1:15])
combined_data
```
```{r}
summary(combined_data)
str(combined_data)
weightoftrash_professor<-
combined_data %>%
filter(dataset == "Professor Trash Wheel")
sum_weightoftrash_professor<-sum(weightoftrash_professor$weight_tons)
sum_weightoftrash_professor
Mr_2020<-
combined_data %>%
filter(dataset == "Mr. Trash Wheel")%>%
filter(year == 2020)
number_sportsballs_Mr_2020<- sum(Mr_2020$sports_balls)
number_sportsballs_Mr_2020
```
The combined dataset has `r nrow(combined_data) ` observations, each represents a distinct dumpster. There are `r length(combined_data) ` (or 14) variables in total include the dumpster number, the date collected,weight, volume, kinds of trash, the value of homes powered, and at last I have added one variable called "dataset" indicates this data is from Mr. Trash Wheel's or Professor Trash Wheel's. If you ignore this added variable, there are 14 variables in total.For available data, the total weight of trash collected by Professor Trash Wheel is `r sum_weightoftrash_professor ` tons and the total number of sports balls collected by Mr. Trash Wheel in 2020 is `r number_sportsballs_Mr_2020 `.
## Problem 3
First part within pols-month.csv.
```{r}
pols =
read_csv(file = "data/pols-month.csv") %>%
janitor::clean_names()%>%
separate(mon, into = c("year", "month","day"), sep = "-") %>%
mutate(month = as.integer(month),year = as.integer(year),day=as.integer(day))%>%
mutate(month = month.abb[month]) %>%
mutate(president = ifelse(prez_dem > 0, "dem","gop"))%>%
select(-prez_gop, - prez_dem)%>%
select(-day)
pols
```
Second part within snp.csv.
```{r}
snp =
read_csv(file = "data/snp.csv") %>%
janitor::clean_names()%>%
mutate(date = as.Date(date, format="%m/%d/%y"))%>%
separate(date, into = c("year", "month","day"), sep = "-") %>%
mutate(month = as.integer(month),year = as.integer(year),day=as.integer(day))%>%
mutate(year = ifelse(year>2046, year-100, year))%>%
arrange(year, month)%>%
mutate(month = month.abb[month]) %>%
select(year, month,day,close)
snp
```
Third part within unemployment.csv.
```{r}
unemployment = read_csv(file="data/unemployment.csv")%>%
pivot_longer(
Jan:Dec,
names_to = "month",
values_to = "unemployment_rate"
) %>% select(year = Year, everything())
unemployment
combineddataset= left_join(pols, snp, by = c("month","year"))
combineddataset
combineddataset_final=left_join(combineddataset,unemployment, by = c("month", "year"))
combineddataset_final
```
For dataset *pols*, there are `r nrow(pols) ` entries and `r length(pols) ` total columns. The year ranges from 1947 to 2015.Key variables include year-month,
gov_gop(the number of republican governors on the associated date), sen_gop(the number of republican senators on the associated date), rep_gop( number of republican representatives on the associated date), gov_dem(the number of democratic governors on the associated date), sen_dem(the number of democratic senators on the associated date), rep_dem(the number of democratic representatives on the associated date),president(the president was democratic or republican on the associated date .
For dataset *snp*, there are `r nrow(snp) ` entries and `r length(snp) ` total columns. The year ranges from 1950 to 2015.Key variables include date(year-month-day) and close (the closing values of the S&P stock index on the associated date).
For dataset *unemployment*, there are `r nrow(unemployment) ` entries and `r length(unemployment) ` total columns. The year ranges from 1948 to 2015.Key variables include year, month, and unemployment rate (the percentage of unemployment of the marked month and year).
For dataset *combineddataset_final*, which merges snp into pols and then merge unemployment into the result, there are `r nrow(combineddataset_final) ` entries and `r length(combineddataset_final) ` total columns. The year ranges from 1947 to 2015.Key variables include year-month-day, gov_gop,sen_gop,rep_gop, gov_dem, sen_dem,rep_dem,president,close,unemployment_rate. Meanings of these variables are all explained above.