---
title: "DATA 605 Week 11"
author: "Mike Silva"
date: "10 April 2019"
output:
  rmdformats::readthedown:
    highlight: kate
---
## Introduction
As part of the coursework for Fundamentals of Computational Mathematics at the CUNY School of Professional Studies, we were asked to fit a simple linear regression to data of our choice. I chose to look at the relationship between total arcade revenue and the number of computer science PhD graduates.
## Data Acquisition
The data were scraped from [Tyler Vigen's site](http://www.tylervigen.com/view_correlation?id=97). The total arcade revenue figures are originally from the U.S. Census Bureau. The computer science doctorate counts are originally from the National Science Foundation.
```{r message=FALSE}
library(rvest)
library(stringr)
library(dplyr)
library(tidyr)
library(tibble)
df <- read_html("http://www.tylervigen.com/view_correlation?id=97") %>%
  html_nodes(xpath = "//table[@class='alldata']") %>%
  html_table(fill = TRUE) %>%
  as.data.frame() %>%
  select(-X12) %>% # Drop empty column
  slice(1:3) %>% # Keep the first 3 rows
  mutate(X1 = ifelse(X1 == "", "Year", str_extract(X1, "(.*?)\\s?\\(.*?\\)"))) %>% # Cleanup first column
  mutate(X1 = str_remove(X1, " \\(.*\\)")) %>% # More first column cleanup
  mutate(X1 = str_replace_all(X1, " ", ".")) %>% # Last bit of first column cleanup
  rownames_to_column() %>% # Transpose dataframe
  gather(var, value, -rowname) %>%
  spread(rowname, value) %>%
  select(-var) # Drop unneeded column
# Pull the column names from the first row
names(df) <- as.character(unlist(df[1, ]))
df <- df %>%
  slice(2:11) %>% # Drop first row
  mutate(Total.revenue.generated.by.arcades = gsub(",", "", Total.revenue.generated.by.arcades)) %>% # Remove the commas
  mutate(Computer.science.doctorates.awarded = gsub(",", "", Computer.science.doctorates.awarded)) %>% # Remove the commas
  mutate_if(is.character, as.numeric) # Convert to numeric
```
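The transpose step in the middle of the pipeline can be hard to follow on the scraped table, so here is a minimal sketch of it on a small made-up data frame. The `wide` object and its values are hypothetical stand-ins for the scraped table; only the `rownames_to_column()`, `gather()` and `spread()` calls mirror the chunk above.

```{r message=FALSE}
library(dplyr)
library(tidyr)
library(tibble)

# Hypothetical stand-in for the scraped table: one row per series, one column per year
wide <- data.frame(
  X1 = c("Year", "Series.A", "Series.B"),
  X2 = c("2000", "10", "100"),
  X3 = c("2001", "20", "200"),
  stringsAsFactors = FALSE
)

long <- wide %>%
  rownames_to_column() %>%         # Row numbers become a "rowname" column
  gather(var, value, -rowname) %>% # One row per (rowname, original column) pair
  spread(rowname, value) %>%       # Former rows become columns "1", "2", "3"
  select(-var)                     # Drop the leftover column of old column names

long
```

After this step each former row is a column, and the first row holds the series names, which is why the names are pulled from the first row afterwards.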
The end result is the following data:
```{r echo=FALSE, message=FALSE}
library(kableExtra)
df %>%
  rename("Computer Science Doctorates Awarded" = Computer.science.doctorates.awarded) %>%
  rename("Total Revenue Generated by Arcades" = Total.revenue.generated.by.arcades) %>%
  kable() %>%
  kable_styling()
```
## Model Building
Now that the data is cleaned up, we will fit a line to it:
```{r}
model <- lm(Total.revenue.generated.by.arcades ~ Computer.science.doctorates.awarded, data = df)
```
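`lm()` fits the line by ordinary least squares. With a single predictor, the slope is the covariance of the two variables divided by the variance of the predictor, and the intercept follows from the two means. The sketch below checks this against `lm()` using the built-in `cars` dataset as a stand-in, because it needs no scraping. The names `x`, `y`, `slope`, `intercept` and `fit` are illustrative and not part of the analysis above.

```{r}
# Built-in dataset used as a stand-in for the scraped data (assumption)
x <- cars$speed
y <- cars$dist

# Closed-form least squares estimates for one predictor
slope <- cov(x, y) / var(x)
intercept <- mean(y) - slope * mean(x)

# The same estimates from lm()
fit <- lm(dist ~ speed, data = cars)

c(intercept = intercept, slope = slope)
coef(fit)
```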
## Model Evaluation
```{r echo = FALSE}
library(ggplot2)
ggplot(df, aes(Computer.science.doctorates.awarded, Total.revenue.generated.by.arcades)) +
  geom_point() +
  geom_abline(slope = model$coefficients[2], intercept = model$coefficients[1], color = "red") +
  ylab("Total Revenue Generated by Arcades (millions $)") +
  xlab("Computer Science Doctorates Awarded")
```
```{r comment=NA}
summary(model)
```
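The key figures in the printed summary can also be pulled out of the summary object rather than read off the printout. This is a minimal sketch using a stand-in model fitted to the built-in `cars` data so that it runs on its own; swapping `fit` for `model` should apply it to the regression above.

```{r}
# Stand-in model on a built-in dataset (assumption: not the arcade data)
fit <- lm(dist ~ speed, data = cars)
fit_summary <- summary(fit)

fit_summary$r.squared                   # Share of variance explained
fit_summary$coefficients[, "Pr(>|t|)"]  # p-value for each coefficient
confint(fit, level = 0.95)              # 95% confidence intervals
```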
### Residual Analysis
```{r}
plot(model)
```
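`plot(model)` draws four diagnostic plots one after another. As a rough numeric companion to the Q-Q plot, a Shapiro-Wilk test can be run on the residuals. This is added for illustration and is not part of the original analysis; with only ten observations such a test has little power. The sketch again uses a stand-in model on the built-in `cars` data.

```{r}
# Stand-in model on a built-in dataset (assumption: not the arcade data)
fit <- lm(dist ~ speed, data = cars)
res <- residuals(fit)

# Least squares residuals from a model with an intercept sum to (numerically) zero
sum(res)

# Shapiro-Wilk test of normality on the residuals
shapiro.test(res)

# Arrange the four diagnostic plots in a 2 x 2 grid, then restore the old layout
old_par <- par(mfrow = c(2, 2))
plot(fit)
par(old_par)
```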
## Conclusion
This model explains almost all of the variability: the R-squared is about 0.97. The residual plots do not raise any concerns, and the Q-Q plot is fairly linear. The coefficients are statistically significant. The only problem with this model is that the relationship is a [spurious correlation](http://www.tylervigen.com/view_correlation?id=97).
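
To see how a good fit can arise without any real relationship, consider two series that share nothing except an upward trend over time. The simulation below is a sketch with made-up numbers, not the arcade or doctorate data; the seed, slopes and noise level are arbitrary assumptions.

```{r}
set.seed(605)  # Arbitrary seed so the example is reproducible

year <- 1:10

# Two independent, hypothetical series that both happen to grow over time
series_a <- 100 + 5 * year + rnorm(10, sd = 2)
series_b <- 20 + 3 * year + rnorm(10, sd = 2)

# Regressing one series on the other gives a high R-squared
trended <- summary(lm(series_a ~ series_b))$r.squared

# Removing the shared time trend first leaves only the independent noise
a_resid <- resid(lm(series_a ~ year))
b_resid <- resid(lm(series_b ~ year))
detrended <- summary(lm(a_resid ~ b_resid))$r.squared

c(trended = trended, detrended = detrended)
```

The high R-squared in the first regression comes from the shared trend alone, which is one common way a spurious correlation arises.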